dataset#

class MapDataset(data, **kwargs)[source]#

Wraps a map-style dataset-like object as an instance of MapDataset, and equips it with map and other utility methods. All non-magic methods of the raw object are also accessible.

Parameters:
  • data (list|Dataset) – An object with __getitem__ and __len__ methods. It can be a list or a subclass of paddle.io.Dataset.

  • kwargs (dict, optional) – Other information to be passed to the dataset.

For examples of this class, please see dataset_self_defined.
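As a pure-Python illustration (not using paddlenlp), the minimal interface that data must expose is just __getitem__ and __len__; ToyDataset below is a hypothetical stand-in, not part of the library:

```python
# A minimal map-style object: anything with __getitem__ and __len__
# can be wrapped this way. ToyDataset is a hypothetical example,
# not part of paddlenlp.
class ToyDataset:
    def __init__(self, samples):
        self._samples = samples

    def __getitem__(self, idx):
        return self._samples[idx]

    def __len__(self):
        return len(self._samples)

raw = ToyDataset([{"text": "hello"}, {"text": "world"}])
print(len(raw), raw[0]["text"])
```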

filter(fn, num_workers=0)[source]#

Filters samples by the filter function and uses the filtered data to update this dataset.

Parameters:
  • fn (callable) – A filter function that takes a sample as input and returns a boolean. Samples for which fn returns False are discarded.

  • num_workers (int, optional) – Number of processes for multiprocessing. If set to 0, it doesn’t use multiprocessing. Defaults to 0.
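The filtering contract can be sketched in plain Python (keep_positive is a hypothetical predicate; the real method updates the dataset in place rather than returning a new list):

```python
# Sketch of the filter(fn) contract: fn maps one sample to a boolean,
# and samples for which it returns False are dropped.
data = [{"text": "bad", "label": 0},
        {"text": "good", "label": 1},
        {"text": "great", "label": 1}]

def keep_positive(example):
    return example["label"] == 1

filtered = [ex for ex in data if keep_positive(ex)]
```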

map(fn, lazy=True, batched=False, num_workers=0)[source]#

Performs a specific function on the dataset to transform and update every sample.

Parameters:
  • fn (callable) – Transformations to be performed. It receives a single sample as its argument if batched is False; otherwise, it receives all samples at once.

  • lazy (bool, optional) – If True, transformations are delayed and performed on demand. Otherwise, all samples are transformed at once. Note that if fn is stochastic, lazy should be True, or you will get the same result on every epoch. Defaults to True.

  • batched (bool, optional) – If True, transformations would take all examples as input and return a collection of transformed examples. Note that if set True, lazy option would be ignored. Defaults to False.

  • num_workers (int, optional) – Number of processes for multiprocessing. If set to 0, it doesn’t use multiprocessing. Note that if set to positive value, lazy option would be ignored. Defaults to 0.
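The note about stochastic transforms can be illustrated without paddlenlp. An eager map runs fn once and stores the results, so every epoch replays the same values; a lazy map re-runs fn on each access (add_noise is a hypothetical transform):

```python
import random

# Why a stochastic fn needs lazy=True: eager mapping applies fn once
# and caches the output, so repeated passes see identical values.
def add_noise(example):
    return {"value": example["value"] + random.random()}

data = [{"value": 1.0}, {"value": 2.0}]

# Eager semantics (lazy=False): fn applied once up front.
eager = [add_noise(ex) for ex in data]
epoch_a = [ex["value"] for ex in eager]
epoch_b = [ex["value"] for ex in eager]  # same stored values

# Lazy semantics (lazy=True): fn re-applied on every access,
# so each pass draws fresh noise.
lazy_a = [add_noise(ex)["value"] for ex in data]
lazy_b = [add_noise(ex)["value"] for ex in data]
```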

class DatasetBuilder(lazy=None, name=None, **config)[source]#

A base class for all DatasetBuilder subclasses. It provides a read() function that turns a data file into a MapDataset or IterDataset.

Subclasses should implement the _get_data() function to download the data file and the _read() function to read the data file into an Iterable of examples.

For how to define a custom DatasetBuilder, please see contribute_dataset.
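The division of labor between _read() and read() can be sketched without paddlenlp. FakeBuilderBase and TsvBuilder below are hypothetical stand-ins; a real builder would wrap the result in a MapDataset (or an IterDataset when self.lazy is True):

```python
import os
import tempfile

# Hypothetical stand-in for DatasetBuilder: read() consumes whatever
# the subclass's _read() generator yields.
class FakeBuilderBase:
    lazy = False

    def read(self, filename, split="train"):
        return list(self._read(filename, split))

class TsvBuilder(FakeBuilderBase):
    def _read(self, filename, split):
        # Yield one example dict per tab-separated line.
        with open(filename) as f:
            for line in f:
                text, label = line.rstrip("\n").split("\t")
                yield {"text": text, "label": label}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.tsv")
    with open(path, "w") as f:
        f.write("nice movie\t1\nboring plot\t0\n")
    examples = TsvBuilder().read(path)
```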

read(filename, split='train')[source]#

Returns a dataset containing all the examples that can be read from the file path.

If self.lazy is False, this eagerly reads all instances from self._read() and returns a MapDataset.

If self.lazy is True, this returns an IterDataset, which internally relies on the generator created from self._read() to lazily produce examples. In this case your implementation of _read() must also be lazy (that is, not load all examples into memory at once).

Parameters:
  • filename (str) – Path of the data file to read, usually provided by the _get_data() function.

  • split (str, optional) – The split name of the selected dataset. This only makes a difference when data files of different splits have different structures.

Returns:

A MapDataset|IterDataset.

get_labels()[source]#

Returns the list of class labels of the dataset, if specified.

get_vocab()[source]#

Returns the vocabulary file path of the dataset, if specified.

class IterDataset(data, **kwargs)[source]#

Wraps a dataset-like object as an instance of IterDataset, and equips it with map and other utility methods. All non-magic methods of the raw object are also accessible.

Parameters:
  • data (Iterable) – An object with an __iter__ method. It can be an Iterable or a subclass of paddle.io.IterableDataset.

  • kwargs (dict, optional) – Other information to be passed to the dataset.

For examples of this class, please see dataset_self_defined.
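In pure Python (no paddlenlp), the minimal interface here is anything with __iter__, such as the result of a generator function; example_stream is a hypothetical example:

```python
# A minimal iterable source: a generator function's result has
# __iter__ and so qualifies as "dataset-like" in this sense.
def example_stream():
    for i in range(3):
        yield {"id": i, "text": f"sample-{i}"}

ids = [ex["id"] for ex in example_stream()]
```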

filter(fn)[source]#

Filters samples by the filter function and uses the filtered data to update this dataset.

Parameters:

fn (callable) – A filter function that takes a sample as input and returns a boolean. Samples that return False are discarded.

shard(num_shards=None, index=None)[source]#

Splits the dataset into num_shards pieces.

Parameters:
  • num_shards (int, optional) – An integer representing the number of data shards. If None, num_shards would be number of trainers. Defaults to None.

  • index (int, optional) – An integer representing the index of the current shard. If None, index would be the current trainer rank id. Defaults to None.
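One common sharding scheme, sketched below, assigns sample i to shard i % num_shards; whether PaddleNLP uses exactly this assignment is an assumption here, so shard_stream is a hypothetical illustration only:

```python
# Hypothetical round-robin sharding: sample i goes to shard
# i % num_shards. (Assumption: PaddleNLP's exact assignment may differ.)
def shard_stream(samples, num_shards, index):
    for i, sample in enumerate(samples):
        if i % num_shards == index:
            yield sample

samples = list(range(10))
shard0 = list(shard_stream(samples, num_shards=3, index=0))
shard1 = list(shard_stream(samples, num_shards=3, index=1))
```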

map(fn)[source]#

Performs a specific function on the dataset to transform and update every sample.

Parameters:

fn (callable) – Transformations to be performed. It receives a single sample as its argument.

load_dataset(path_or_read_func, name=None, data_files=None, splits=None, lazy=None, **kwargs)[source]#

This method loads a dataset, either from the PaddleNLP library or from a self-defined data loading script, by calling functions in DatasetBuilder.

For all the names of datasets in PaddleNLP library, see here: dataset_list.

Either splits or data_files must be specified.

Parameters:
  • path_or_read_func (str|callable) – Name of the dataset processing script in PaddleNLP library or a custom data reading function.

  • name (str, optional) – Additional name to select a more specific dataset. Defaults to None.

  • data_files (str|list|tuple|dict, optional) – Path(s) of the dataset files. If None, splits must be specified. Defaults to None.

  • splits (str|list|tuple, optional) – Which split(s) of the data to load. If None, data_files must be specified. Defaults to None.

  • lazy (bool, optional) – Whether to return an IterDataset (True) or a MapDataset (False). If None, returns the default type of this dataset. Defaults to None.

  • kwargs (dict) – Other keyword arguments to be passed to the DatasetBuilder.

Returns:

A MapDataset or IterDataset or a tuple of those.

For how to use this function, please see dataset_load and dataset_self_defined.
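The "either splits or data_files" contract, and the dispatch on a callable versus a dataset name, can be sketched in plain Python; load_dataset_sketch is hypothetical and not paddlenlp's actual implementation:

```python
# Hypothetical sketch of load_dataset's argument contract: it accepts
# a dataset name or a custom read function, and requires that splits
# or data_files be given. Not paddlenlp's real implementation.
def load_dataset_sketch(path_or_read_func, data_files=None, splits=None):
    if data_files is None and splits is None:
        raise ValueError("Either splits or data_files must be specified.")
    if callable(path_or_read_func):
        # Custom read function: materialize its examples directly.
        return list(path_or_read_func())
    # A string name would dispatch to the matching DatasetBuilder here.
    raise NotImplementedError(path_or_read_func)

def my_reader():
    yield {"text": "hello", "label": 0}

examples = load_dataset_sketch(my_reader, splits="train")
```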