Sharing Your Dataset
In addition to using built-in datasets in PaddleNLP, we also encourage users to contribute their own datasets to PaddleNLP.
Below we detail the workflow for contributing a dataset:
Environment Setup
Writing and testing PaddleNLP code requires Python 3.6+ and the latest version of PaddlePaddle. Please ensure these dependencies are properly installed.
Click the Fork button on PaddleNLP’s GitHub page to create a copy of the PaddleNLP repo under your GitHub account.
Clone your forked repo locally and add the official repo as a remote.
git clone https://github.com/USERNAME/PaddleNLP
cd PaddleNLP
git remote add upstream https://github.com/PaddlePaddle/PaddleNLP.git
Install the pre-commit hook, which helps format source code and automatically check for issues before submission. PRs that fail hook checks cannot be merged into PaddleNLP.
pip install pre-commit
pre-commit install
Adding a DatasetBuilder
Create a new local branch, typically branched from develop.
git checkout -b my-new-dataset
Navigate to the PaddleNLP/paddlenlp/datasets/ directory in your local repo. All dataset code is stored here.
cd paddlenlp/datasets
Determine a name for your dataset (e.g., squad, chnsenticorp). This name will be used when loading your dataset.
Note
To facilitate usage, ensure the name is concise and semantically meaningful.
The dataset name should follow the snake_case convention.
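The naming rule above can be checked mechanically before you create the file. The following is a minimal sketch, not part of PaddleNLP: the helpers `is_snake_case` and `module_filename` are hypothetical, and the regular expression encodes one reasonable reading of snake_case (lowercase words and digits joined by single underscores).

```python
import re

# Hypothetical helpers for illustration only; PaddleNLP does not provide them.
# Assumed reading of snake_case: starts with a lowercase letter, then lowercase
# letters or digits, with words separated by single underscores.
_SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_snake_case(name: str) -> bool:
    """Return True if `name` matches the assumed snake_case pattern."""
    return _SNAKE_CASE.fullmatch(name) is not None


def module_filename(name: str) -> str:
    """Return the Python file name for a dataset name, rejecting bad names."""
    if not is_snake_case(name):
        raise ValueError(f"dataset name {name!r} is not snake_case")
    return f"{name}.py"
```

For example, `module_filename("msra_ner")` gives the file name to create in the datasets directory, while a name such as `My-Dataset` raises a `ValueError`.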
Create a Python file named after the dataset name (e.g., squad.py) in this directory. Implement your dataset’s DatasetBuilder class in this file.
Refer to the tutorial How to Write a DatasetBuilder for detailed guidelines on implementing DatasetBuilder.
We also recommend referencing existing DatasetBuilder implementations. The following examples may be helpful:
- iwslt15.py: Translation dataset containing vocabulary files.
- glue.py: GLUE dataset containing multiple sub-datasets; the file format is TSV.
- squad.py: Reading comprehension dataset; the file format is JSON.
- imdb.py: IMDB dataset; each split contains multiple files.
- ptb.py: Corpus dataset.
- msra_ner.py: Sequence labeling dataset.
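Before wiring anything into a builder, it can help to prototype the file-reading logic on its own. The sketch below is standalone and does not use the PaddleNLP API: the function name `read_tsv_examples`, the two-column `label`/`text` layout, and the header row are all assumptions made for illustration. The tutorial referenced above remains the authority on the actual DatasetBuilder interface.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def read_tsv_examples(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one {"text": ..., "label": ...} dict per data row of a TSV file.

    Assumes a header row followed by rows of the form: label<TAB>text.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the assumed header row
    for row in reader:
        if len(row) != 2:
            continue  # ignore blank or malformed rows
        label, text = row
        yield {"text": text, "label": label}


# In-memory sample so the sketch runs without any dataset files.
sample = io.StringIO("label\ttext\n1\tgreat movie\n\n0\tdull plot\n")
examples = list(read_tsv_examples(sample))
```

Writing the reader as a generator keeps memory use flat for large files, and testing it against a small in-memory sample makes format mistakes visible early.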
After development, you can use load_dataset to test whether the splits in your dataset are correctly identified. You can also print a sample to check that the dataset is read in the format you expect:
from paddlenlp.datasets import load_dataset
ds = load_dataset('your_dataset_name', splits='your_split')
print(ds[0])
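Printing one sample only shows one record. A slightly broader check, sketched below, scans the first few examples for missing fields. The helper `find_bad_examples` is hypothetical and assumes that each example is a dictionary. The field names `text` and `label` are placeholders for whatever your dataset defines.

```python
from typing import Any, Iterable, List, Mapping, Set


def find_bad_examples(
    examples: Iterable[Mapping[str, Any]],
    required_keys: Set[str],
    limit: int = 100,
) -> List[int]:
    """Return indices, among the first `limit` examples, that lack a required key."""
    bad = []
    for index, example in enumerate(examples):
        if index >= limit:
            break
        if not required_keys.issubset(example.keys()):
            bad.append(index)
    return bad


# Stand-in data; the second record is missing its "label" field.
records = [{"text": "a", "label": 0}, {"text": "b"}]
missing = find_bad_examples(records, {"text", "label"})
```

With PaddleNLP installed, the `ds` object from the snippet above could be passed in place of `records`, assuming it iterates over dictionary examples.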
Submit Your Work
When you confirm the dataset code is ready, commit your changes locally:
git add PaddleNLP/paddlenlp/datasets/your_dataset_name.py
git commit
Before submitting, we recommend fetching the latest upstream code and updating the current branch:
git fetch upstream
git pull upstream develop
Push local changes to GitHub and submit a Pull Request to PaddleNLP:
git push origin my-new-dataset
The above is the complete process for contributing datasets to PaddleNLP. We will review your PR promptly and provide feedback if needed. If everything looks good, we will merge it into the PaddleNLP repo, making your dataset available to others.
If you have any questions about contributing datasets, feel free to join our official QQ technical group: 973379845. We’ll address your inquiries promptly.