collate

class Stack(axis=0, dtype=None)[source]

Bases: object

Stacks the input data samples to construct a batch. The N input samples must all have the same shape (or length).

Parameters:
  • axis (int, optional) -- The axis in the result data along which the input data are stacked. Default: 0.

  • dtype (str|numpy.dtype, optional) -- The value type of the output. If it is set to None, the type of input data is used. Default: None.

__call__(data)[source]

Batchifies the input data by stacking.

Parameters:

data (list[numpy.ndarray|list]) -- The input data samples, given as a list. Each element is a numpy.ndarray or a list.

Returns:

Stacked batch data.

Return type:

numpy.ndarray

Example

from paddlenlp.data import Stack
a = [1, 2, 3, 4]
b = [3, 4, 5, 6]
c = [5, 6, 7, 8]
result = Stack()([a, b, c])
'''
[[1, 2, 3, 4],
 [3, 4, 5, 6],
 [5, 6, 7, 8]]
'''
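
The effect of the axis and dtype parameters can be sketched with plain NumPy. This is only an illustration of the documented behaviour, not PaddleNLP's implementation; the helper name stack_samples is hypothetical, and the use of numpy.stack is an assumption.

```python
import numpy as np


def stack_samples(samples, axis=0, dtype=None):
    """Hypothetical helper mirroring the documented behaviour of Stack."""
    batch = np.stack([np.asarray(s) for s in samples], axis=axis)
    # When dtype is None the input type is kept; otherwise the result is cast.
    return batch if dtype is None else batch.astype(dtype)


a = [1, 2, 3, 4]
b = [3, 4, 5, 6]
c = [5, 6, 7, 8]

rows = stack_samples([a, b, c])                       # shape (3, 4)
cols = stack_samples([a, b, c], axis=1)               # shape (4, 3)
as_float = stack_samples([a, b, c], dtype="float32")  # same values, float32
```

With axis=0 each sample becomes one row of the batch; with axis=1 each sample becomes one column.
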
class Pad(pad_val=0, axis=0, ret_length=None, dtype=None, pad_right=True)[source]

Bases: object

Pads the input data samples to the largest length at axis.

Parameters:
  • pad_val (float|int, optional) -- The padding value. Default: 0.

  • axis (int, optional) -- The axis to pad the arrays. The arrays will be padded to the largest length at axis. For example, assume the input arrays have shape (10, 8, 5), (6, 8, 5), (3, 8, 5) and the axis is 0. Each input will be padded into (10, 8, 5) and then stacked to form the final output, which has shape (3, 10, 8, 5). Default: 0.

  • ret_length (bool|numpy.dtype, optional) -- If it is a bool, it indicates whether to return the valid length of each sample in the output; when True, the returned lengths have data type int32. If it is a numpy.dtype, the lengths are returned with that data type. Default: None.

  • dtype (numpy.dtype, optional) -- The value type of the output. If it is set to None, the input data type is used. Default: None.

  • pad_right (bool, optional) -- Whether the padding direction is right-side. If True, it indicates we pad to the right side, while False indicates we pad to the left side. Default: True.
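
The interaction of pad_val, axis and pad_right described above can be sketched with plain NumPy. This is a minimal illustration under assumptions, not the library's code: the helper name pad_samples is hypothetical and numpy.pad is used as a stand-in for whatever Pad does internally.

```python
import numpy as np


def pad_samples(samples, pad_val=0, axis=0, pad_right=True):
    """Hypothetical helper: pad each array to the longest length at `axis`, then stack."""
    arrays = [np.asarray(s) for s in samples]
    max_len = max(arr.shape[axis] for arr in arrays)
    padded = []
    for arr in arrays:
        missing = max_len - arr.shape[axis]
        # One (before, after) pair per dimension; only `axis` receives padding.
        widths = [(0, 0)] * arr.ndim
        widths[axis] = (0, missing) if pad_right else (missing, 0)
        padded.append(np.pad(arr, widths, mode="constant", constant_values=pad_val))
    # Stacking adds the batch dimension in front.
    return np.stack(padded, axis=0)


# The shapes from the axis description: (10, 8, 5), (6, 8, 5), (3, 8, 5).
samples = [np.ones((10, 8, 5)), np.ones((6, 8, 5)), np.ones((3, 8, 5))]
batch = pad_samples(samples, axis=0)  # shape (3, 10, 8, 5)

# Left-side padding on ragged lists.
left = pad_samples([[1, 2, 3, 4], [5, 6, 7], [8, 9]], pad_right=False)
```

In the sketch, `left` holds the padding values at the start of the shorter rows rather than at the end.
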

__call__(data)[source]

Batchifies the input data by padding. Each input is padded to the largest length at axis, and the padded inputs are then stacked to form the final output. In addition, the original lengths at axis are returned unless ret_length is None or False.

Parameters:

data (list[numpy.ndarray|list]) -- The input data samples. It is a list. Each element is a numpy.ndarray or list.

Returns:

If ret_length is None or False, a numpy.ndarray holding the padded batch data, with shape (N, …). Otherwise, a tuple that contains the padded batch data and a numpy.ndarray of shape (N,) holding the original length at axis of each input sample.

Return type:

numpy.ndarray|tuple[numpy.ndarray]
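
The second element of the returned tuple can be pictured as follows. This is a sketch of the documented ret_length semantics only; the helper name original_lengths is hypothetical and is not part of PaddleNLP.

```python
import numpy as np


def original_lengths(samples, axis=0, ret_length=True):
    """Hypothetical helper: lengths at `axis` before padding, typed as the docs describe."""
    # True means int32; any other value is treated as the dtype to use.
    length_dtype = np.int32 if ret_length is True else ret_length
    lengths = [np.asarray(s).shape[axis] for s in samples]
    return np.asarray(lengths, dtype=length_dtype)


samples = [[1, 2, 3, 4], [5, 6, 7], [8, 9]]
lengths = original_lengths(samples)                        # int32, shape (3,)
lengths64 = original_lengths(samples, ret_length="int64")  # int64, shape (3,)
```
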

Example

from paddlenlp.data import Pad
a = [1, 2, 3, 4]
b = [5, 6, 7]
c = [8, 9]
result = Pad(pad_val=0)([a, b, c])
'''
[[1, 2, 3, 4],
 [5, 6, 7, 0],
 [8, 9, 0, 0]]
'''
class Tuple(fn, *args)[source]

Bases: object

Wraps multiple batchify functions together. The input functions will be applied to the corresponding input fields.

Each sample should be a list or tuple containing multiple fields. The i'th batchify function stored in Tuple will be applied on the i'th field.

For example, when each data sample is (nd_data, label), you can wrap two batchify functions using Tuple(DataBatchify, LabelBatchify) to batchify nd_data and label respectively.

Parameters:
  • fn (callable|list[callable]|tuple[callable]) -- The batchify functions to wrap. It is a callable function or a list/tuple of callable functions.

  • args (tuple[callable]) -- The additional batchify functions to wrap.

__call__(data)[source]

Batchifies data samples by applying each function to the corresponding data field. Each data field is built by collecting that field from all samples.

Parameters:

data (list|tuple) -- The samples to batchify. Each sample in the list/tuple should contain N fields.

Returns:

A tuple composed of the results of all the wrapped batchify functions.

Return type:

tuple

Example

from paddlenlp.data import Stack, Pad, Tuple
data = [
        [[1, 2, 3, 4], [1]],
        [[5, 6, 7], [0]],
        [[8, 9], [1]],
       ]
batchify_fn = Tuple(Pad(pad_val=0), Stack())
ids, label = batchify_fn(data)
'''
ids:
[[1, 2, 3, 4],
[5, 6, 7, 0],
[8, 9, 0, 0]]
label: [[1], [0], [1]]
'''
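
The field-wise dispatch performed by Tuple can be sketched in pure Python. TupleSketch and pad_lists are hypothetical names used only for this illustration; the real class may differ in validation and other details.

```python
class TupleSketch:
    """Hypothetical stand-in for Tuple: the i-th function handles the i-th field."""

    def __init__(self, fn, *args):
        # Accept either TupleSketch(f, g) or TupleSketch([f, g]).
        self._fns = tuple(fn) if isinstance(fn, (list, tuple)) else (fn,) + args

    def __call__(self, data):
        results = []
        for i, batchify in enumerate(self._fns):
            # Gather field i from every sample, then batchify that column.
            results.append(batchify([sample[i] for sample in data]))
        return tuple(results)


def pad_lists(column, pad_val=0):
    """Right-pad every list in the column to the longest length."""
    width = max(len(row) for row in column)
    return [list(row) + [pad_val] * (width - len(row)) for row in column]


data = [
    [[1, 2, 3, 4], [1]],
    [[5, 6, 7], [0]],
    [[8, 9], [1]],
]
batchify_fn = TupleSketch(pad_lists, list)
ids, label = batchify_fn(data)
```

Here pad_lists plays the role of Pad and the built-in list plays the role of Stack, so the sketch runs without PaddleNLP installed.
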
class Dict(fn)[source]

Bases: object

Wraps multiple batchify functions together. The input functions will be applied to the corresponding input fields.

Each sample should be a dict containing multiple fields. Each batchify function stored in Dict under a key is applied to the sample field that has the same key.

For example, when each data sample is {'tokens': tokens, 'labels': labels}, you can wrap two batchify functions using Dict({'tokens': DataBatchify, 'labels': LabelBatchify}) to batchify tokens and labels respectively.

Parameters:

fn (dict) -- The batchify functions to wrap. It is a dict whose values are callable functions.

__call__(data)[source]

Batchifies data samples by applying each function to the corresponding data field. Each data field is built by collecting, from all samples, the values stored under the same key as the batchify function.

Parameters:

data (list[dict]|tuple[dict]) -- The samples to batchify. Each sample in the list/tuple is a dict with N key-value pairs.

Returns:

A tuple composed of the results of all the wrapped batchify functions.

Return type:

tuple

Example

from paddlenlp.data import Stack, Pad, Dict
data = [
        {'labels':[1], 'token_ids':[1, 2, 3, 4]},
        {'labels':[0], 'token_ids':[5, 6, 7]},
        {'labels':[1], 'token_ids':[8, 9]},
       ]
batchify_fn = Dict({'token_ids':Pad(pad_val=0), 'labels':Stack()})
ids, label = batchify_fn(data)
'''
ids:
[[1, 2, 3, 4],
[5, 6, 7, 0],
[8, 9, 0, 0]]
label: [[1], [0], [1]]
'''
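
The key-based dispatch can be sketched the same way in pure Python. DictSketch and pad_lists are hypothetical names; the sketch assumes the results are returned in the order of the keys in fn, which is what the example above suggests (ids come first even though each sample lists labels first).

```python
class DictSketch:
    """Hypothetical stand-in for Dict: each function handles the field with its key."""

    def __init__(self, fn):
        # Keep the key order of `fn`; results are returned in this order (assumption).
        self._fns = dict(fn)

    def __call__(self, data):
        results = []
        for key, batchify in self._fns.items():
            # Gather the values stored under `key` in every sample.
            results.append(batchify([sample[key] for sample in data]))
        return tuple(results)


def pad_lists(column, pad_val=0):
    """Right-pad every list in the column to the longest length."""
    width = max(len(row) for row in column)
    return [list(row) + [pad_val] * (width - len(row)) for row in column]


data = [
    {'labels': [1], 'token_ids': [1, 2, 3, 4]},
    {'labels': [0], 'token_ids': [5, 6, 7]},
    {'labels': [1], 'token_ids': [8, 9]},
]
batchify_fn = DictSketch({'token_ids': pad_lists, 'labels': list})
ids, label = batchify_fn(data)
```
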