Creating DatasetBuilder#
To contribute a dataset, define a subclass of DatasetBuilder. A valid DatasetBuilder must follow the conventions described in this section.
Let’s take LCQMC as an example to understand the typical methods and parameters required in a DatasetBuilder.
Member Variables#
import collections
import os

from paddle.dataset.common import md5file
from paddle.utils.download import get_path_from_url
from paddlenlp.datasets import DatasetBuilder
from paddlenlp.utils.env import DATA_HOME


class LCQMC(DatasetBuilder):
    """
    LCQMC: A Large-scale Chinese Question Matching Corpus
    For more information, please refer to `https://www.aclweb.org/anthology/C18-1166/`
    """
    # False builds a MapDataset by default, True builds an IterDataset.
    lazy = False
    # Download URL and MD5 checksum of the dataset archive.
    URL = "https://bj.bcebos.com/paddlehub-dataset/lcqmc.tar.gz"
    MD5 = "62a7ba36f786a82ae59bbde0b0a9af0c"
    # Per-split file path (relative to the extracted archive) and MD5 checksum.
    META_INFO = collections.namedtuple('META_INFO', ('file', 'md5'))
    SPLITS = {
        'train': META_INFO(
            os.path.join('lcqmc', 'train.tsv'),
            '2193c022439b038ac12c0ae918b211a1'),
        'dev': META_INFO(
            os.path.join('lcqmc', 'dev.tsv'),
            'c5dcba253cb4105d914964fd8b3c0e94'),
        'test': META_INFO(
            os.path.join('lcqmc', 'test.tsv'),
            '8f4b71e15e67696cc9e112a459ec42bd'),
    }
First, the contributed dataset needs to inherit from the paddlenlp.datasets.DatasetBuilder class, with the class name in camel case format. Then you should add a docstring briefly describing the dataset’s origin and other information. The following member variables need to be defined:
lazy: The default dataset type. False corresponds to MapDataset, True corresponds to IterDataset.
URL: The download URL for the dataset archive. It must be a valid and stable link. If the dataset is not archived, this may be omitted.
MD5: The MD5 checksum of the dataset archive, used for file validation. If the dataset is not archived, this may be omitted.
META_INFO: The format of the dataset split information.
SPLITS: The split information of the dataset, containing file locations, filenames, MD5 values, etc. after decompression. For non-archived datasets, download URLs are typically provided here. It may also include parameters such as file reading configurations.
Additionally, some datasets may require other member variables like VOCAB_INFO (refer to iwslt15.py). Member variables may have different formats. Contributors can adjust accordingly based on actual requirements.
Note
If the contributed dataset has no subsets, the DatasetBuilder must include the SPLITS member variable, which must be a dictionary whose keys correspond to the dataset’s splits.
If the contributed dataset contains subsets, the DatasetBuilder must include the BUILDER_CONFIGS member variable. This must be a dictionary whose keys correspond to the subsets’ names. Each value should be a dictionary containing the split information for that subset, stored under the key splits. For specific formats, refer to glue.py.
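As a rough illustration of the two layouts described in the note, the sketch below contrasts a flat SPLITS dictionary with a nested BUILDER_CONFIGS dictionary. The subset names (subset_a, subset_b), the file paths, and the empty MD5 strings are hypothetical placeholders, and the helper lookup_split is not part of PaddleNLP. glue.py remains the reference for the real format.

```python
import collections
import os

META_INFO = collections.namedtuple("META_INFO", ("file", "md5"))

# Dataset without subsets: the keys of SPLITS are the split names.
SPLITS = {
    "train": META_INFO(os.path.join("mydata", "train.tsv"), ""),
    "dev": META_INFO(os.path.join("mydata", "dev.tsv"), ""),
}

# Dataset with subsets: the keys are subset names, and each value holds
# that subset's split information under the "splits" key.
BUILDER_CONFIGS = {
    "subset_a": {
        "splits": {
            "train": META_INFO(os.path.join("subset_a", "train.tsv"), ""),
            "dev": META_INFO(os.path.join("subset_a", "dev.tsv"), ""),
        },
    },
    "subset_b": {
        "splits": {
            "train": META_INFO(os.path.join("subset_b", "train.tsv"), ""),
        },
    },
}


def lookup_split(mode, name=None):
    """Hypothetical helper: return the META_INFO entry for one split."""
    if name is None:
        return SPLITS[mode]
    return BUILDER_CONFIGS[name]["splits"][mode]
```

Calling lookup_split("train") reads from SPLITS, while lookup_split("dev", "subset_a") reads from the nested configuration of that subset.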
_get_data() Method#
def _get_data(self, mode, **kwargs):
    ''' Check and download Dataset '''
    default_root = os.path.join(DATA_HOME, self.__class__.__name__)
    filename, data_hash = self.SPLITS[mode]
    fullname = os.path.join(default_root, filename)
    if not os.path.exists(fullname) or (data_hash and
                                        not md5file(fullname) == data_hash):
        get_path_from_url(self.URL, default_root, self.MD5)
    return fullname
The _get_data() method locates the specific dataset file based on the input mode and the split information. It first checks whether the local file exists and validates its MD5 checksum. If the file is missing or validation fails, it calls paddle.utils.download.get_path_from_url to download and verify the dataset archive. Finally, it returns the local path of the dataset file.
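The check that decides whether a download is needed can be sketched without Paddle, using only the standard library. This is a minimal illustration of the same condition, not the library's implementation: md5_of_file stands in for paddle.dataset.common.md5file, and needs_download is a hypothetical helper name.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=8192):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(fullname, data_hash):
    """Return True if the file is missing or its MD5 does not match.

    An empty or None data_hash skips the checksum comparison.
    """
    if not os.path.exists(fullname):
        return True
    return bool(data_hash) and md5_of_file(fullname) != data_hash
```

Reading the file in chunks keeps memory use flat for large dataset files. Passing an empty hash mirrors the case where a contributor chooses not to validate a split.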
_read() method#
def _read(self, filename):
    """Reads data."""
    with open(filename, 'r', encoding='utf-8') as f:
        head = None
        for line in f:
            data = line.strip().split("\t")
            if not head:
                head = data
            else:
                query, title, label = data
                yield {"query": query, "title": title, "label": label}
The _read() method reads data from the given file path. This method must be a generator so that DatasetBuilder can construct both MapDataset and IterDataset. When different splits need to be read differently, this method should also accept a split parameter and handle each split accordingly.
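The sketch below illustrates a reader that takes a split argument, under the purely hypothetical assumption that the test file has no label column. It also shows why a generator suits both dataset types: the same function can be materialized into a list or consumed one example at a time. The names read_tsv and demo are illustrative and not part of PaddleNLP.

```python
import os
import tempfile


def read_tsv(filename, split):
    """Yield one dict per data row; the first row is treated as a header.

    Assumption for illustration: the "test" split has no label column.
    """
    with open(filename, "r", encoding="utf-8") as f:
        head = None
        for line in f:
            data = line.rstrip("\n").split("\t")
            if head is None:
                head = data
            elif split == "test":
                query, title = data
                yield {"query": query, "title": title}
            else:
                query, title, label = data
                yield {"query": query, "title": title, "label": label}


def demo():
    """Write a tiny TSV file and read it eagerly and lazily."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train.tsv")
        with open(path, "w", encoding="utf-8") as f:
            f.write("query\ttitle\tlabel\nq1\tt1\t1\nq2\tt2\t0\n")
        # Eager: materialize every example, as a map-style dataset would.
        eager = list(read_tsv(path, "train"))
        # Lazy: pull examples one at a time, as an iterable dataset would.
        lazy = read_tsv(path, "train")
        first = next(lazy)
        lazy.close()  # release the file handle before the directory is removed
    return eager, first
```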
Note
Each example provided by this method should be a Dictionary object.
DatasetBuilder provides label-to-id conversion during Dataset generation. To use this feature, users must set the label key in examples to “label” or “labels” and properly implement the get_labels() method in the class.
get_labels() method#
def get_labels(self):
    """
    Return labels of the LCQMC object.
    """
    return ["0", "1"]
The get_labels() method returns a list containing all labels in the dataset. This is used to convert class labels to ids, and this list will be passed as an instance variable to the generated dataset.
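To make the conversion concrete, here is a minimal sketch of how a label list like the one above could map string labels to integer ids. It is an illustration only, not PaddleNLP's internal code, and convert_labels is a hypothetical helper.

```python
def convert_labels(examples, label_list):
    """Replace each example's "label" string with its index in label_list."""
    label_to_id = {label: idx for idx, label in enumerate(label_list)}
    converted = []
    for example in examples:
        example = dict(example)  # copy so the input is left untouched
        example["label"] = label_to_id[example["label"]]
        converted.append(example)
    return converted
```

With the label list ["0", "1"], an example whose "label" is the string "1" ends up with the integer id 1. The order of the list returned by get_labels() therefore determines the ids.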
get_vocab() method#
If the dataset provides vocabulary files, the get_vocab() method and VOCAB_INFO variable need to be added.
This method returns a Dictionary object containing the dataset's vocabulary information based on VOCAB_INFO, which is passed as an instance variable to the generated dataset. It is used to initialize a paddlenlp.data.Vocab object during training. For details of the method, refer to iwslt15.py in the PaddlePaddle/PaddleNLP repository.
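The following sketch suggests what such a method might look like. Everything in it is an assumption made for illustration: the VOCAB_INFO layout, the vocabulary names (src, tgt), the dictionary keys filepath and unk_token, and the DATA_ROOT location are hypothetical, so consult iwslt15.py for the format actually used.

```python
import os

# Hypothetical directory where the dataset files were extracted.
DATA_ROOT = os.path.join("data_home", "MyDataset")

# Hypothetical VOCAB_INFO: one (filename, md5) pair per vocabulary file.
VOCAB_INFO = {
    "src": ("vocab.src", ""),
    "tgt": ("vocab.tgt", ""),
}


def get_vocab():
    """Return a dict describing each vocabulary file (assumed layout)."""
    return {
        name: {
            "filepath": os.path.join(DATA_ROOT, filename),
            "unk_token": "<unk>",
        }
        for name, (filename, _md5) in VOCAB_INFO.items()
    }
```

In a real DatasetBuilder this would be a method on the class that reads self.VOCAB_INFO, and the returned entries would carry whatever arguments are needed to build the vocabulary objects.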
Note
When contributing a dataset, the get_labels() and get_vocab() methods are optional, depending on the specific dataset content. The _read() and _get_data() methods are required.
If you do not wish to perform an md5 check during data retrieval, you may omit the relevant member variables and validation code.