PaddleNLP Datasets API#

PaddleNLP provides easy-to-use APIs for loading the datasets listed below. Pass `splits` to `load_dataset` as needed to select which splits are returned.
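
The calling pattern used throughout this page can be sketched as follows: the first positional argument is the dataset name, an optional second positional argument selects a sub-task (as in `'glue', 'cola'`), and `splits` selects which splits to return. The helpers `build_load_args` and `load` are hypothetical names introduced for illustration and are not part of PaddleNLP. The split names `'train'` and `'dev'` are assumptions, since the available splits vary by dataset.

```python
def build_load_args(name, subset=None, splits=None):
    """Assemble the positional and keyword arguments for load_dataset."""
    args = (name,) if subset is None else (name, subset)
    kwargs = {} if splits is None else {"splits": list(splits)}
    return args, kwargs


def load(name, subset=None, splits=None):
    """Call paddlenlp.datasets.load_dataset (requires paddlenlp to be installed)."""
    # Imported lazily so the argument helper above works without paddlenlp.
    from paddlenlp.datasets import load_dataset

    args, kwargs = build_load_args(name, subset, splits)
    return load_dataset(*args, **kwargs)


# Arguments equivalent to load_dataset('glue', 'cola', splits=['train', 'dev'])
glue_args, glue_kwargs = build_load_args("glue", "cola", splits=["train", "dev"])
print(glue_args, glue_kwargs)
```

With PaddleNLP installed, `load("glue", "cola", splits=["train", "dev"])` would forward exactly those arguments to `load_dataset`.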

Reading Comprehension#

| Dataset Name | Description | Method |
| --- | --- | --- |
| SQuAD | Stanford Question Answering Dataset (SQuAD1.1 and SQuAD2.0) | `paddlenlp.datasets.load_dataset('squad')` |
| DuReader-yesno | DuReader Yes-No: polarity judgment for reading comprehension | `paddlenlp.datasets.load_dataset('dureader_yesno')` |
| DuReader-robust | DuReader Robust: answer extraction for reading comprehension | `paddlenlp.datasets.load_dataset('dureader_robust')` |
| CMRC2018 | Chinese Machine Reading Comprehension 2018 | `paddlenlp.datasets.load_dataset('cmrc2018')` |
| DRCD | Delta Reading Comprehension Dataset | `paddlenlp.datasets.load_dataset('drcd')` |
| TriviaQA | Trivia Question Answering Dataset | `paddlenlp.datasets.load_dataset('triviaqa')` |
| C3 | Multiple-choice reading comprehension | `paddlenlp.datasets.load_dataset('c3')` |

Text Classification#

| Dataset Name | Description | Method |
| --- | --- | --- |
| CoLA | Single-sentence classification, binary (grammatical correctness) | `paddlenlp.datasets.load_dataset('glue','cola')` |
| SST-2 | Single-sentence classification, binary (sentiment analysis) | `paddlenlp.datasets.load_dataset('glue','sst-2')` |
| MRPC | Sentence pair classification, binary (paraphrase detection) | `paddlenlp.datasets.load_dataset('glue','mrpc')` |
| STSB | Sentence pair similarity scoring, scores range from 0 to 5 | `paddlenlp.datasets.load_dataset('glue','sts-b')` |
| QQP | Determine whether sentence pairs are equivalent: equivalent or non-equivalent (binary classification) | `paddlenlp.datasets.load_dataset('glue','qqp')` |
| MNLI | Sentence pairs with a premise and a hypothesis; the relationship is entailment, contradiction, or neutral (3-class classification) | `paddlenlp.datasets.load_dataset('glue','mnli')` |
| QNLI | Determine whether the sentence entails the answer to the question: entailment or not_entailment (binary classification) | `paddlenlp.datasets.load_dataset('glue','qnli')` |
| RTE | Determine whether one sentence entails the other: entailment or not_entailment (binary classification) | `paddlenlp.datasets.load_dataset('glue','rte')` |
| WNLI | Determine whether sentence pairs are related: related or unrelated (binary classification) | `paddlenlp.datasets.load_dataset('glue','wnli')` |
| LCQMC | A Large-scale Chinese Question Matching Corpus (semantic matching dataset, binary classification) | `paddlenlp.datasets.load_dataset('lcqmc')`, or with explicit splits: `paddlenlp.datasets.load_dataset('lcqmc', splits=['test', 'dev'])` |
| ChnSentiCorp | Chinese review sentiment analysis corpus | `paddlenlp.datasets.load_dataset('chnsenticorp')` |
| COTE-DP | Chinese opinion extraction corpus | `paddlenlp.datasets.load_dataset('cote', 'dp')` |
| SE-ABSA16_PHNS | Chinese aspect-based sentiment analysis corpus | `paddlenlp.datasets.load_dataset('seabsa16', 'phns')` |
| AFQMC | Ant Financial Semantic Similarity Dataset (1: similar, 0: dissimilar) | `paddlenlp.datasets.load_dataset('clue', 'afqmc')` |
| TNEWS | Toutiao Chinese news headline classification (15 categories) | `paddlenlp.datasets.load_dataset('clue', 'tnews')` |
| IFLYTEK | Long text classification (119 categories) | `paddlenlp.datasets.load_dataset('clue', 'iflytek')` |
| OCNLI | Original Chinese Natural Language Inference (3-class classification) | `paddlenlp.datasets.load_dataset('clue', 'ocnli')` |
| CMNLI | Chinese Language Understanding and Inference (entailment/contradiction/neutral) | `paddlenlp.datasets.load_dataset('clue', 'cmnli')` |
| CLUEWSC2020 | Chinese Winograd Schema Challenge (coreference resolution) | `paddlenlp.datasets.load_dataset('clue', 'cluewsc2020')` |
| CSL | Chinese scientific literature keyword recognition (binary classification) | `paddlenlp.datasets.load_dataset('clue', 'csl')` |
| EPRSTMT | E-commerce product review sentiment analysis (positive/negative) | `paddlenlp.datasets.load_dataset('fewclue', 'eprstmt')` |
| CSLDCP | Chinese scientific literature discipline classification from FewCLUE (67 categories) | `paddlenlp.datasets.load_dataset('fewclue', 'csldcp')` |
| TNEWSF | Toutiao Chinese news (short text) classification from FewCLUE (15 categories) | `paddlenlp.datasets.load_dataset('fewclue', 'tnews')` |
| IFLYTEK | Long text classification from FewCLUE (119 categories) | `paddlenlp.datasets.load_dataset('fewclue', 'iflytek')` |
| OCNLIF | Chinese natural language inference from FewCLUE, sentence pair 3-class classification | `paddlenlp.datasets.load_dataset('fewclue', 'ocnli')` |
| BUSTM | Dialogue short text semantic matching from FewCLUE (binary classification) | `paddlenlp.datasets.load_dataset('fewclue', 'bustm')` |
| CHIDF | Chinese idiom reading comprehension cloze from FewCLUE, predict the correct idiom from 7 candidates | `paddlenlp.datasets.load_dataset('fewclue', 'chid')` |
| CSLF | Paper keyword recognition from FewCLUE, binary classification of whether the keywords are authentic | `paddlenlp.datasets.load_dataset('fewclue', 'csl')` |
| CLUEWSCF | Chinese version of the Winograd Schema Challenge (WSC) from FewCLUE, pronoun disambiguation | `paddlenlp.datasets.load_dataset('fewclue', 'cluewsc')` |
| THUCNews | THUCNews Chinese news category classification | `paddlenlp.datasets.load_dataset('thucnews')` |
| HYP | English political news sentiment classification corpus | |
| NLPCC-DBQA | Chinese database question answering dataset (binary classification) | `paddlenlp.datasets.load_dataset('nlpcc_dbqa')` |
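
For the GLUE rows in the table above, the display name and the sub-task string passed to `load_dataset` do not always match (for example, STSB is loaded as `'sts-b'`). A minimal sketch of a lookup that resolves a display name to the argument pair follows. `GLUE_TASKS` and `glue_args` are hypothetical names, not part of PaddleNLP, and the mapping simply restates the table.

```python
# Display name -> sub-task string, copied from the GLUE rows of the table.
GLUE_TASKS = {
    "CoLA": "cola",
    "SST-2": "sst-2",
    "MRPC": "mrpc",
    "STSB": "sts-b",
    "QQP": "qqp",
    "MNLI": "mnli",
    "QNLI": "qnli",
    "RTE": "rte",
    "WNLI": "wnli",
}


def glue_args(display_name):
    """Return the positional arguments for load_dataset, ignoring case."""
    wanted = display_name.strip().upper()
    for name, task in GLUE_TASKS.items():
        if name.upper() == wanted:
            return ("glue", task)
    raise KeyError(f"not a GLUE task listed in the table: {display_name!r}")


print(glue_args("stsb"))
```

The resulting tuple can be unpacked into the call, as in `load_dataset(*glue_args("STSB"))`, once PaddleNLP is installed.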

Natural Language Inference#

| Dataset Name | Description | How to Use |
| --- | --- | --- |
| CMNLI | Chinese Multi-Genre NLI dataset | `paddlenlp.datasets.load_dataset('cmnli')` |
| OCNLI | Chinese Original NLI dataset | `paddlenlp.datasets.load_dataset('ocnli')` |
| GLUE-MNLI | English Multi-Genre NLI dataset | `paddlenlp.datasets.load_dataset('glue', 'mnli')` |
| GLUE-QNLI | English Question NLI dataset | `paddlenlp.datasets.load_dataset('glue', 'qnli')` |
| GLUE-RTE | English Recognizing Textual Entailment dataset | `paddlenlp.datasets.load_dataset('glue', 'rte')` |
| XNLI | 15-language NLI dataset, 3-class task | `paddlenlp.datasets.load_dataset('xnli', 'ar')` |
| XNLI_CN | Chinese subset of XNLI, 3-class task | `paddlenlp.datasets.load_dataset('xnli_cn')` |
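
The XNLI row above loads one language by passing its code as the second argument. The sketch below builds that argument pair for any of the 15 XNLI benchmark languages. `XNLI_LANGS`, `xnli_args`, and `load_xnli` are hypothetical names. Whether your installed PaddleNLP version accepts every code as a sub-dataset name is an assumption to verify; only `'ar'` appears in the table.

```python
# The 15 language codes of the XNLI benchmark.
# Assumption: PaddleNLP uses the same codes as sub-dataset names.
XNLI_LANGS = (
    "ar", "bg", "de", "el", "en", "es", "fr", "hi",
    "ru", "sw", "th", "tr", "ur", "vi", "zh",
)


def xnli_args(lang):
    """Return the positional arguments for load_dataset for one language."""
    if lang not in XNLI_LANGS:
        raise ValueError(f"unknown XNLI language code: {lang!r}")
    return ("xnli", lang)


def load_xnli(lang, splits=None):
    """Load one XNLI language (requires paddlenlp to be installed)."""
    from paddlenlp.datasets import load_dataset  # imported lazily

    kwargs = {} if splits is None else {"splits": list(splits)}
    return load_dataset(*xnli_args(lang), **kwargs)


print(xnli_args("ar"))
```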

Text Matching#

| Dataset Name | Description | How to Use |
| --- | --- | --- |
| CAIL2019-SCM | Similar legal case matching | `paddlenlp.datasets.load_dataset('cail2019_scm')` |

Sequence Labeling#

| Dataset Name | Description | How to Use |
| --- | --- | --- |
| MSRA_NER | MSRA named entity recognition dataset | `paddlenlp.datasets.load_dataset('msra_ner')` |
| People's Daily | People's Daily named entity recognition dataset | `paddlenlp.datasets.load_dataset('peoples_daily_ner')` |
| CoNLL-2002 | Spanish and Dutch NER datasets | `paddlenlp.datasets.load_dataset('conll2002', 'es')` |

Machine Translation#

| Dataset Name | Description | How to Use |
| --- | --- | --- |
| IWSLT15 | IWSLT'15 English-Vietnamese translation dataset | `paddlenlp.datasets.load_dataset('iwslt15')` |
| WMT14ENDE | WMT14 EN-DE translation dataset with BPE tokenization | `paddlenlp.datasets.load_dataset('wmt14ende')` |

Machine Simultaneous Translation#

| Dataset Name | Description | Loading Method |
| --- | --- | --- |
| BSTC | Baidu Speech Translation Corpus, including transcription_translation and ASR | `paddlenlp.datasets.load_dataset('bstc', 'asr')` |

Dialogue System#

| Dataset Name | Description | Loading Method |
| --- | --- | --- |
| DuConv | Knowledge-aware Chinese Conversation Dataset | `paddlenlp.datasets.load_dataset('duconv')` |

Text Generation#

| Dataset Name | Description | Loading Method |
| --- | --- | --- |
| Poetry | Classical Chinese Poetry Collection | `paddlenlp.datasets.load_dataset('poetry')` |
| Couplet | Chinese Couplet Dataset | `paddlenlp.datasets.load_dataset('couplet')` |
| DuReaderQG | Question Generation Dataset Based on DuReader | `paddlenlp.datasets.load_dataset('dureader_qg')` |
| AdvertiseGen | Chinese Advertising Copy Generation Dataset | `paddlenlp.datasets.load_dataset('advertisegen')` |
| LCSTS_new | Chinese Abstractive Summarization Dataset | `paddlenlp.datasets.load_dataset('lcsts_new')` |
| CNN/Dailymail | English Abstractive Summarization Dataset | `paddlenlp.datasets.load_dataset('cnn_dailymail')` |

Corpus#

| Dataset Name | Description | Loading Method |
| --- | --- | --- |
| PTB | Penn Treebank Dataset | `paddlenlp.datasets.load_dataset('ptb')` |
| Yahoo Answer 100k | 100k samples from Yahoo Answer | `paddlenlp.datasets.load_dataset('yahoo_answer_100k')` |