Creating DatasetBuilder
Dataset contributions are achieved by defining a subclass of DatasetBuilder. A qualified DatasetBuilder needs to follow certain protocols and specifications. Let's take LCQMC as an example to understand the typical methods and member variables required in a DatasetBuilder.
Member Variables
import collections
import os

from paddle.dataset.common import md5file
from paddle.utils.download import get_path_from_url
from paddlenlp.datasets import DatasetBuilder
from paddlenlp.utils.env import DATA_HOME


class LCQMC(DatasetBuilder):
    """
    LCQMC: A Large-scale Chinese Question Matching Corpus
    More information please refer to `https://www.aclweb.org/anthology/C18-1166/`
    """
    lazy = False
    URL = "https://bj.bcebos.com/paddlehub-dataset/lcqmc.tar.gz"
    MD5 = "62a7ba36f786a82ae59bbde0b0a9af0c"
    META_INFO = collections.namedtuple('META_INFO', ('file', 'md5'))
    SPLITS = {
        'train': META_INFO(
            os.path.join('lcqmc', 'train.tsv'),
            '2193c022439b038ac12c0ae918b211a1'),
        'dev': META_INFO(
            os.path.join('lcqmc', 'dev.tsv'),
            'c5dcba253cb4105d914964fd8b3c0e94'),
        'test': META_INFO(
            os.path.join('lcqmc', 'test.tsv'),
            '8f4b71e15e67696cc9e112a459ec42bd'),
    }
First, the contributed dataset needs to inherit from the paddlenlp.datasets.DatasetBuilder class, with the class name in camel case. Then you should add a docstring briefly describing the dataset's origin and other information. The following member variables need to be defined:

- lazy: The default dataset type. False corresponds to MapDataset, True corresponds to IterDataset.
- URL: The download URL of the dataset archive. It must be a valid and stable link. If the dataset is not archived, this may be omitted.
- MD5: The MD5 checksum of the dataset archive, used for file validation. If the dataset is not archived, this may be omitted.
- META_INFO: The format of the dataset's split information.
- SPLITS: The split information of the dataset, containing the file locations, file names, MD5 values, etc. after decompression. For non-archived datasets, download URLs are typically provided here. It may also include parameters such as file-reading configurations.
Additionally, some datasets may require other member variables, such as VOCAB_INFO (refer to iwslt15.py). Member variables may have different formats, and contributors can adjust them based on actual requirements.
Note

If the contributed dataset has no subsets, the DatasetBuilder must include the SPLITS member variable, which must be a dictionary whose keys correspond to the dataset's splits.

If the contributed dataset contains subsets, the DatasetBuilder must include the BUILDER_CONFIGS member variable. This must be a dictionary whose keys correspond to the subsets' names. The values should be dictionaries containing the split information of each subset, with 'splits' among the keys. For the specific format, refer to glue.py.
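As a rough sketch of the subset case (the subset name "cola", the URL, and the file paths below are illustrative placeholders modeled loosely on glue.py, not the real values), a BUILDER_CONFIGS entry might look like:

```python
import collections
import os

META_INFO = collections.namedtuple("META_INFO", ("file", "md5"))

# Hypothetical subset configuration; names, URL, and paths are placeholders.
BUILDER_CONFIGS = {
    "cola": {
        "url": "https://example.com/cola.tar.gz",  # placeholder URL
        "md5": None,                               # checksum omitted in this sketch
        "splits": {
            "train": META_INFO(os.path.join("cola", "train.tsv"), None),
            "dev": META_INFO(os.path.join("cola", "dev.tsv"), None),
            "test": META_INFO(os.path.join("cola", "test.tsv"), None),
        },
    },
}
```

Each top-level key names one subset, and its "splits" dictionary plays the same role that SPLITS plays for a dataset without subsets.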
_get_data() Method
def _get_data(self, mode, **kwargs):
    """Check and download Dataset"""
    default_root = os.path.join(DATA_HOME, self.__class__.__name__)
    filename, data_hash = self.SPLITS[mode]
    fullname = os.path.join(default_root, filename)
    if not os.path.exists(fullname) or (data_hash and
                                        not md5file(fullname) == data_hash):
        get_path_from_url(self.URL, default_root, self.MD5)
    return fullname
The _get_data() method locates the specific dataset file based on the input mode and the split information. It first performs MD5 checksum validation on the local file. If validation fails, it calls the paddle.utils.download.get_path_from_url method to download and verify the dataset files, and finally returns the local path of the dataset file.
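The validation condition in _get_data() can be illustrated with the standard library alone. The following is a minimal sketch of that check; md5_of() and needs_download() are stand-in helpers written here for illustration, where md5_of() computes the same digest that md5file from paddle.dataset.common produces:

```python
import hashlib
import os
import tempfile

def md5_of(path, chunk_size=8192):
    """Compute the MD5 hex digest of a file, mirroring what md5file() returns."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_download(fullname, data_hash):
    """Replicates the condition in _get_data(): file missing or checksum mismatch."""
    return not os.path.exists(fullname) or (
        data_hash and md5_of(fullname) != data_hash
    )

# Demo with a temporary file standing in for a downloaded split file.
with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "train.tsv")
    with open(path, "wb") as f:
        f.write(b"query\ttitle\tlabel\n")
    good_hash = md5_of(path)
    print(needs_download(path, good_hash))  # False: file present and valid
    print(needs_download(path, "0" * 32))   # True: checksum mismatch
```

When data_hash is falsy (e.g. None for a dataset without checksums), the checksum comparison is skipped and only the existence check applies, matching the behavior of the condition above.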
_read() Method
def _read(self, filename):
    """Reads data."""
    with open(filename, 'r', encoding='utf-8') as f:
        head = None
        for line in f:
            data = line.strip().split("\t")
            if not head:
                head = data
            else:
                query, title, label = data
                yield {"query": query, "title": title, "label": label}
The _read() method reads data from the given file path. This method must be a generator so that DatasetBuilder can construct both MapDataset and IterDataset. When different splits require distinct data-reading approaches, this method should additionally accept a split parameter and handle the different split configurations.
Note

Each example provided by this method should be a Dictionary object.

DatasetBuilder provides label-to-id conversion during dataset generation. To use this feature, set the label key of each example to "label" or "labels", and properly implement the get_labels() method in the class.
get_labels() Method
def get_labels(self):
    """
    Return labels of the LCQMC object.
    """
    return ["0", "1"]
The get_labels() method returns a list containing all labels in the dataset. It is used to convert class labels to ids, and the list is passed as an instance variable to the generated dataset.
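The conversion can be pictured as a simple lookup built from the returned list. This is only an illustration of the effect, not the internal DatasetBuilder code; get_labels() below is a plain function standing in for the method:

```python
def get_labels():
    """Stand-in for LCQMC.get_labels()."""
    return ["0", "1"]

# A label-to-id mapping is built from the list's order and applied to the
# "label"/"labels" key of each example during dataset generation.
label2id = {label: i for i, label in enumerate(get_labels())}

example = {"query": "...", "title": "...", "label": "1"}
example["label"] = label2id[example["label"]]
print(example["label"])  # 1
```

The position of each label in the list determines its id, so the order of get_labels() should be kept stable.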
get_vocab() Method
If the dataset provides vocabulary files, the get_vocab() method and the VOCAB_INFO member variable need to be added. This method returns a Dictionary object containing the dataset's vocabulary information based on VOCAB_INFO, which is passed as an instance variable to the generated dataset. It is used to initialize a paddlenlp.data.Vocab object during training. For the details of this method, refer to the official implementation in iwslt15.py (PaddlePaddle/PaddleNLP).
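As a sketch of the shape such a return value might take (the key names, tokens, and path below are assumptions modeled on translation datasets like iwslt15.py, not a fixed schema; DATA_DIR is a hypothetical placeholder where real code would build paths from DATA_HOME):

```python
import os

DATA_DIR = "/path/to/data"  # hypothetical root; real code derives this from DATA_HOME

def get_vocab():
    """Illustrative vocabulary metadata; key names are assumptions."""
    return {
        "filepath": os.path.join(DATA_DIR, "vocab.en"),
        "unk_token": "<unk>",
        "bos_token": "<s>",
        "eos_token": "</s>",
    }

vocab_info = get_vocab()
print(vocab_info["unk_token"])  # <unk>
```

During training, a dictionary like this can supply both the vocabulary file path and the special tokens needed to construct a paddlenlp.data.Vocab.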
Note

When contributing a dataset, the get_labels() and get_vocab() methods are optional, depending on the specific dataset content. The _read() and _get_data() methods are required.

If you do not wish to perform an MD5 check during data retrieval, you may omit the relevant member variables and validation code.