dataset#
- class MapDataset(data, **kwargs)[source]#
Wraps a map-style dataset-like object as an instance of MapDataset, and equips it with map and other utility methods. All non-magic methods of the raw object are also accessible.
- Parameters:
data (list|Dataset) – An object with __getitem__ and __len__ methods. It could be a list or a subclass of paddle.io.Dataset.
kwargs (dict, optional) – Other information to be passed to the dataset.
For examples of this class, please see dataset_self_defined.
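As a rough illustration of the map-style contract described above, here is a minimal pure-Python sketch (a hypothetical class, not PaddleNLP's actual implementation): the wrapped object only needs __getitem__ and __len__.

```python
# Hypothetical stand-in for MapDataset's wrapping behavior: any object
# with __getitem__ and __len__ (e.g. a list) can be wrapped.
class MiniMapDataset:
    def __init__(self, data):
        self.data = data  # e.g. a list, or a paddle.io.Dataset-like object

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)

ds = MiniMapDataset([{"text": "hello"}, {"text": "world"}])
print(len(ds))        # 2
print(ds[0]["text"])  # hello
```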
- filter(fn, num_workers=0)[source]#
Filters samples by the filter function and uses the filtered data to update this dataset.
- Parameters:
fn (callable) – A filter function that takes a sample as input and returns a boolean. Samples that return False would be discarded.
num_workers (int, optional) – Number of processes for multiprocessing. If set to 0, it doesn’t use multiprocessing. Defaults to 0.
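The filter semantics above (keep samples for which fn returns True, updating the dataset in place) can be sketched in pure Python; the class below is hypothetical, not PaddleNLP's implementation.

```python
# Hypothetical sketch of filter(): drop samples for which fn returns False
# and update the dataset in place.
class MiniFilterable:
    def __init__(self, data):
        self.data = list(data)

    def filter(self, fn):
        self.data = [sample for sample in self.data if fn(sample)]
        return self

ds = MiniFilterable([1, 2, 3, 4, 5])
ds.filter(lambda x: x % 2 == 0)  # keep even samples only
print(ds.data)  # [2, 4]
```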
- map(fn, lazy=True, batched=False, num_workers=0)[source]#
Performs a specific function on the dataset to transform and update every sample.
- Parameters:
fn (callable) – Transformations to be performed. It receives a single sample as an argument if batched is False; otherwise it receives all examples.
lazy (bool, optional) – If True, transformations would be delayed and performed on demand. Otherwise, transforms all samples at once. Note that if fn is stochastic, lazy should be True or you will get the same result on all epochs. Defaults to True.
batched (bool, optional) – If True, transformations would take all examples as input and return a collection of transformed examples. Note that if set to True, the lazy option would be ignored. Defaults to False.
num_workers (int, optional) – Number of processes for multiprocessing. If set to 0, it doesn’t use multiprocessing. Note that if set to a positive value, the lazy option would be ignored. Defaults to 0.
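The interaction of lazy and batched described above can be sketched with a small hypothetical class (not PaddleNLP's implementation): batched transforms run eagerly on the whole collection, while lazy per-sample transforms are deferred until a sample is accessed.

```python
# Hypothetical sketch of map() semantics: batched=True transforms the whole
# collection at once; lazy=True defers the per-sample transform to access time.
class MiniMappable:
    def __init__(self, data):
        self.data = list(data)
        self._pending = []  # lazily applied per-sample transforms

    def map(self, fn, lazy=True, batched=False):
        if batched:
            self.data = list(fn(self.data))        # fn receives all examples
        elif lazy:
            self._pending.append(fn)               # applied on demand
        else:
            self.data = [fn(s) for s in self.data]  # transform all at once
        return self

    def __getitem__(self, idx):
        sample = self.data[idx]
        for fn in self._pending:
            sample = fn(sample)
        return sample

eager = MiniMappable([1, 2, 3]).map(lambda x: x * 10, lazy=False)
print(eager[1])  # 20
lazy = MiniMappable([1, 2, 3]).map(lambda x: x + 1)  # lazy=True by default
print(lazy[0])   # 2 (transform applied on access; raw data is unchanged)
```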
- class DatasetBuilder(lazy=None, name=None, **config)[source]#
A base class for all DatasetBuilder classes. It provides a read() function to turn a data file into a MapDataset or IterDataset. The _get_data() and _read() functions should be implemented to download the data file and read it into an Iterable of examples.
For how to define a custom DatasetBuilder, please see contribute_dataset.
- read(filename, split='train')[source]#
Returns a dataset containing all the examples that can be read from the file path.
If self.lazy is False, this eagerly reads all instances from self._read() and returns a MapDataset.
If self.lazy is True, this returns an IterDataset, which internally relies on the generator created from self._read() to lazily produce examples. In this case your implementation of _read() must also be lazy (that is, not load all examples into memory at once).
- Parameters:
filename (str) – Path of the data file to read, usually provided by the _get_data function.
split (str, optional) – The split name of the selected dataset. This only makes a difference when data files of different splits have different structures.
- Returns:
A MapDataset or IterDataset.
- class IterDataset(data, **kwargs)[source]#
Wraps a dataset-like object as an instance of IterDataset, and equips it with map and other utility methods. All non-magic methods of the raw object are also accessible.
- Parameters:
data (Iterable) – An object with an __iter__ function. It can be an Iterable or a subclass of paddle.io.IterableDataset.
kwargs (dict, optional) – Other information to be passed to the dataset.
For examples of this class, please see dataset_self_defined.
- filter(fn)[source]#
Filters samples by the filter function and uses the filtered data to update this dataset.
- Parameters:
fn (callable) – A filter function that takes a sample as input and returns a boolean. Samples that return False are discarded.
- shard(num_shards=None, index=None)[source]#
Splits the dataset into num_shards pieces.
- Parameters:
num_shards (int, optional) – An integer representing the number of data shards. If None, num_shards would be the number of trainers. Defaults to None.
index (int, optional) – An integer representing the index of the current shard. If None, index would be the current trainer rank id. Defaults to None.
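As an illustration, one common sharding scheme is a strided split, where shard index keeps every num_shards-th sample; this is a sketch of the idea, and the actual partitioning used by IterDataset.shard may differ.

```python
# Illustrative strided sharding (hypothetical helper, not PaddleNLP's code):
# shard `index` keeps the samples at positions index, index + num_shards, ...
def shard(samples, num_shards, index):
    return [s for i, s in enumerate(samples) if i % num_shards == index]

data = list(range(10))
print(shard(data, num_shards=4, index=1))  # [1, 5, 9]
```

Note that with this scheme every sample lands in exactly one shard, so the shards together cover the whole dataset without overlap.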
- load_dataset(path_or_read_func, name=None, data_files=None, splits=None, lazy=None, **kwargs)[source]#
This method will load a dataset, either from the PaddleNLP library or from a self-defined data loading script, by calling functions in DatasetBuilder.
For all the names of datasets in the PaddleNLP library, see here: dataset_list.
Either splits or data_files must be specified.
- Parameters:
path_or_read_func (str|callable) – Name of the dataset processing script in PaddleNLP library or a custom data reading function.
name (str, optional) – Additional name to select a more specific dataset. Defaults to None.
data_files (str|list|tuple|dict, optional) – Defining the path of dataset files. If None, splits must be specified. Defaults to None.
splits (str|list|tuple, optional) – Which split of the data to load. If None, data_files must be specified. Defaults to None.
lazy (bool, optional) – Whether to return a MapDataset or an IterDataset. True for IterDataset. False for MapDataset. If None, return the default type of this dataset. Defaults to None.
kwargs (dict) – Other keyword arguments to be passed to the DatasetBuilder.
- Returns:
A MapDataset or IterDataset, or a tuple of those.
For how to use this function, please see dataset_load and dataset_self_defined.
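The argument contract and return shape described above can be sketched with a hypothetical stand-in (not PaddleNLP's actual load_dataset): at least one of splits or data_files must be given, and a list of splits yields a tuple of datasets.

```python
# Hypothetical sketch of load_dataset's argument contract and return shape.
# The "<dataset:...>" strings stand in for the MapDataset/IterDataset objects
# the real function would return.
def load(splits=None, data_files=None):
    if splits is None and data_files is None:
        raise ValueError("Either `splits` or `data_files` must be specified.")
    names = splits if splits is not None else "default"
    if not isinstance(names, (list, tuple)):
        names = [names]
    datasets = tuple(f"<dataset:{n}>" for n in names)
    # a single split yields one dataset; several yield a tuple of them
    return datasets if len(datasets) > 1 else datasets[0]

print(load(splits="train"))           # <dataset:train>
print(load(splits=["train", "dev"]))  # ('<dataset:train>', '<dataset:dev>')
```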