dataset
- class MapDataset(data, **kwargs)
Wraps a map-style dataset-like object as an instance of MapDataset, and equips it with map and other utility methods. All non-magic methods of the raw object are also accessible.
- Parameters:
data (list|Dataset) – An object with __getitem__ and __len__ methods. It can be a list or a subclass of paddle.io.Dataset.
kwargs (dict, optional) – Other information to be passed to the dataset.
For examples of this class, please see dataset_self_defined.
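The data argument only needs to behave like a map-style container. A minimal sketch of such an object follows; SquaresData and its fields are hypothetical names for illustration and are not part of PaddleNLP.

```python
class SquaresData:
    """Hypothetical map-style container: supports len() and integer indexing."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        if not 0 <= idx < self.n:
            raise IndexError(idx)
        return {"x": idx, "y": idx * idx}


raw = SquaresData(4)
samples = [raw[i] for i in range(len(raw))]

# With PaddleNLP installed, an object like this (or a plain list) is the kind
# of value the data parameter expects, for example:
#   from paddlenlp.datasets import MapDataset
#   ds = MapDataset(raw)
```

The wrapping call is left as a comment so that the sketch runs without PaddleNLP installed.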
- filter(fn, num_workers=0)
Filters samples by the filter function and uses the filtered data to update this dataset.
- Parameters:
fn (callable) – A filter function that takes a sample as input and returns a boolean. Samples for which it returns False are discarded.
num_workers (int, optional) – Number of processes for multiprocessing. If set to 0, multiprocessing is not used. Defaults to 0.
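In plain Python terms, the contract of fn can be sketched on an ordinary list. The function keep_nonempty and the sample fields below are made-up names, and the list comprehension only mirrors the described effect; it is not the library's implementation.

```python
samples = [
    {"text": "good movie", "label": 1},
    {"text": "", "label": 0},
    {"text": "too slow", "label": 0},
]


def keep_nonempty(sample):
    """Filter function: takes one sample and returns a boolean."""
    return len(sample["text"]) > 0


# Samples for which the function returns False are dropped.
kept = [s for s in samples if keep_nonempty(s)]
```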
- map(fn, lazy=True, batched=False, num_workers=0)
Applies a function to the dataset to transform and update every sample.
- Parameters:
fn (callable) – The transformation to apply. It receives a single sample as its argument if batched is False; otherwise it receives all examples.
lazy (bool, optional) – If True, transformations are delayed and performed on demand. Otherwise, all samples are transformed at once. Note that if fn is stochastic, lazy should be True or you will get the same result on all epochs. Defaults to True.
batched (bool, optional) – If True, fn takes all examples as input and returns a collection of transformed examples. Note that if set to True, the lazy option is ignored. Defaults to False.
num_workers (int, optional) – Number of processes for multiprocessing. If set to 0, multiprocessing is not used. Note that if set to a positive value, the lazy option is ignored. Defaults to 0.
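The difference between lazy and eager mapping can be illustrated with a small stand-in class. TinyMapDataset is hypothetical and only mimics the behaviour described above; it makes no claim about how MapDataset is implemented internally.

```python
class TinyMapDataset:
    """Hypothetical stand-in used only to illustrate lazy vs. eager map."""

    def __init__(self, data):
        self.data = list(data)
        self.pending = []  # transformations deferred by lazy=True

    def map(self, fn, lazy=True):
        if lazy:
            self.pending.append(fn)  # applied later, on each access
        else:
            self.data = [fn(s) for s in self.data]  # applied now, once
        return self

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        for fn in self.pending:
            sample = fn(sample)
        return sample


calls = {"n": 0}


def add_length(sample):
    """Per-sample transformation that also counts how often it runs."""
    calls["n"] += 1
    return {**sample, "length": len(sample["text"])}


rows = [{"text": "ab"}, {"text": "abcd"}]

eager = TinyMapDataset(rows).map(add_length, lazy=False)
calls_after_eager = calls["n"]  # every sample was transformed up front

lazy = TinyMapDataset(rows).map(add_length, lazy=True)
calls_after_lazy_map = calls["n"]  # unchanged: nothing has run yet

first = lazy[0]
calls_after_access = calls["n"]  # one more call, for the accessed sample
```

In this sketch the eager path computes each result once and then reuses it, which is why a stochastic fn (random augmentation, for instance) would produce identical samples in every epoch unless lazy is True.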
- class DatasetBuilder(lazy=None, name=None, **config)
A base class for all DatasetBuilders. It provides a read() function to turn a data file into a MapDataset or IterDataset.
The _get_data() and _read() functions should be implemented to download the data file and to read the data file into an Iterable of examples, respectively.
For how to define a custom DatasetBuilder, please see contribute_dataset.
- read(filename, split='train')
Returns a dataset containing all the examples that can be read from the file path.
If self.lazy is False, this eagerly reads all instances from self._read() and returns a MapDataset.
If self.lazy is True, this returns an IterDataset, which internally relies on the generator created from self._read() to lazily produce examples. In this case your implementation of _read() must also be lazy (that is, not load all examples into memory at once).
- Parameters:
filename (str) – Path of the data file to read, usually provided by the _get_data function.
split (str, optional) – The split name of the selected dataset. This only makes a difference when data files of different splits have different structures.
- Returns:
A MapDataset|IterDataset.
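A lazy _read() is most naturally written as a generator. The sketch below uses a standalone function and a made-up tab-separated file layout (text, then an integer label); both are assumptions for illustration rather than the signature or format of any particular DatasetBuilder.

```python
import os
import tempfile


def _read(filename, split="train"):
    """Hypothetical reader: yields one example per line of a TSV file."""
    with open(filename, encoding="utf-8") as f:
        for line in f:
            text, label = line.rstrip("\n").split("\t")
            yield {"text": text, "label": int(label)}


# Write a small throwaway file so the sketch is runnable.
with tempfile.NamedTemporaryFile(
    "w", suffix=".tsv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("good movie\t1\ntoo slow\t0\n")
    path = tmp.name

# Eager style (lazy=False): materialize every example immediately.
eager_examples = list(_read(path))

# Lazy style (lazy=True): creating the generator reads nothing yet.
lazy_examples = _read(path)
first_lazy = next(lazy_examples)  # reads only as far as the first example
lazy_examples.close()  # closes the generator and its file handle

os.remove(path)
```

Because the function yields instead of building a list, memory use stays bounded by one example at a time until a caller chooses to collect the results.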
- class IterDataset(data, **kwargs)
Wraps a dataset-like object as an instance of IterDataset, and equips it with map and other utility methods. All non-magic methods of the raw object are also accessible.
- Parameters:
data (Iterable) – An object with an __iter__ function. It can be an Iterable or a subclass of paddle.io.IterableDataset.
kwargs (dict, optional) – Other information to be passed to the dataset.
For examples of this class, please see dataset_self_defined.
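When choosing what to pass as data, it can help to recall a general Python distinction: an object whose __iter__ starts a fresh pass each time can be traversed repeatedly, while a bare generator can be consumed only once. The class LineStream below is a hypothetical example, not a PaddleNLP type.

```python
class LineStream:
    """Hypothetical iterable: each call to __iter__ starts a fresh pass."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield {"text": line}


stream = LineStream(["a", "b"])
epoch_1 = list(stream)
epoch_2 = list(stream)  # a second pass yields the same samples again

gen = iter(stream)  # a bare generator, by contrast, is single-pass
pass_1 = list(gen)
pass_2 = list(gen)  # the generator is already exhausted here
```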
- filter(fn)
Filters samples by the filter function and uses the filtered data to update this dataset.
- Parameters:
fn (callable) – A filter function that takes a sample as input and returns a boolean. Samples that return False are discarded.
- shard(num_shards=None, index=None)
Splits the dataset into num_shards pieces.
- Parameters:
num_shards (int, optional) – An integer representing the number of data shards. If None, num_shards is the number of trainers. Defaults to None.
index (int, optional) – An integer representing the index of the current shard. If None, index is the current trainer rank id. Defaults to None.
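The description above does not state how samples are assigned to shards. One common scheme, shown here purely as an assumption, is round-robin assignment by position; shard_round_robin is a hypothetical helper and not a PaddleNLP function.

```python
def shard_round_robin(samples, num_shards, index):
    """Yield the samples whose position modulo num_shards equals index."""
    if not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    for position, sample in enumerate(samples):
        if position % num_shards == index:
            yield sample


data = list(range(10))

# Each trainer would keep only the shard matching its own rank.
shards = [list(shard_round_robin(data, 3, i)) for i in range(3)]
```

Under this scheme every sample lands in exactly one shard, and shard sizes differ by at most one.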
- load_dataset(path_or_read_func, name=None, data_files=None, splits=None, lazy=None, **kwargs)
Loads a dataset, either from the PaddleNLP library or from a self-defined data loading script, by calling functions in DatasetBuilder.
For all the names of datasets in the PaddleNLP library, see dataset_list.
Either splits or data_files must be specified.
- Parameters:
path_or_read_func (str|callable) – Name of the dataset processing script in the PaddleNLP library, or a custom data reading function.
name (str, optional) – Additional name to select a more specific dataset. Defaults to None.
data_files (str|list|tuple|dict, optional) – The path or paths of the dataset files. If None, splits must be specified. Defaults to None.
splits (str|list|tuple, optional) – Which split or splits of the data to load. If None, data_files must be specified. Defaults to None.
lazy (bool, optional) – Whether to return a MapDataset or an IterDataset: True for IterDataset, False for MapDataset. If None, the default type of this dataset is returned. Defaults to None.
kwargs (dict) – Other keyword arguments to be passed to the DatasetBuilder.
- Returns:
A MapDataset or IterDataset, or a tuple of those.
For how to use this function, please see dataset_load and dataset_self_defined.
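As a rough illustration of the callable form of path_or_read_func, the sketch below defines a custom reading function over a JSON Lines file. The function name read_jsonl, its data_path parameter, and the file layout are all hypothetical. The load_dataset call is left as a comment so the sketch runs without PaddleNLP, and it assumes that extra keyword arguments are forwarded to the reading function when a callable is given.

```python
import json
import os
import tempfile


def read_jsonl(data_path):
    """Hypothetical custom reader: yields one dict per non-empty JSON line."""
    with open(data_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Write a small throwaway file so the sketch is runnable.
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"text": "good movie", "label": 1}\n')
    tmp.write('{"text": "too slow", "label": 0}\n')
    path = tmp.name

examples = list(read_jsonl(path))
os.remove(path)

# With PaddleNLP installed, the same function could be handed to load_dataset
# (assumption: data_path is passed through to read_jsonl):
#   from paddlenlp.datasets import load_dataset
#   train_ds = load_dataset(read_jsonl, data_path="train.jsonl", lazy=False)
```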