PaddleNLP Datasets API#

PaddleNLP provides easy-to-use loading APIs for the following datasets. Pass the `splits` argument to `load_dataset` to select the splits you need.
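As a rough illustration of the `splits` argument, the sketch below wraps `load_dataset` in a small helper that returns the requested splits keyed by name. The helper names `normalize_splits` and `load_named_splits` are hypothetical and not part of PaddleNLP. The available split names differ per dataset, and the handling of a single requested split is an assumption about the return value, so treat this as a starting point rather than a reference implementation.

```python
from typing import Dict, List, Optional, Sequence, Union


def normalize_splits(splits: Union[str, Sequence[str]]) -> List[str]:
    """Turn 'train' or ('train', 'dev') into a list of split names."""
    if isinstance(splits, str):
        return [splits]
    return list(splits)


def load_named_splits(
    dataset: str,
    subset: Optional[str] = None,
    splits: Union[str, Sequence[str]] = "train",
) -> Dict[str, object]:
    """Load the requested splits and return them keyed by split name.

    Hypothetical convenience wrapper; split names vary by dataset.
    """
    # Imported here so normalize_splits works without PaddleNLP installed.
    from paddlenlp.datasets import load_dataset

    requested = normalize_splits(splits)
    loaded = load_dataset(dataset, subset, splits=requested)
    # Assumption: a single requested split comes back as one dataset object
    # rather than a sequence, so wrap it before pairing it with its name.
    if len(requested) == 1:
        loaded = [loaded]
    return dict(zip(requested, loaded))
```

With PaddleNLP installed, a call such as `load_named_splits('chnsenticorp', splits=['train', 'dev'])` would be expected to return a dictionary with `train` and `dev` entries, provided the dataset defines those splits.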

Reading Comprehension#

Dataset Name	Description	Method
SQuAD	Stanford Question Answering Dataset (SQuAD1.1 and SQuAD2.0)	`paddlenlp.datasets.load_dataset('squad')`
DuReader-yesno	DuReader Yes-No: Polarity judgment for reading comprehension	`paddlenlp.datasets.load_dataset('dureader_yesno')`
DuReader-robust	DuReader Robust: Answer extraction for reading comprehension	`paddlenlp.datasets.load_dataset('dureader_robust')`
CMRC2018	Chinese Machine Reading Comprehension 2018	`paddlenlp.datasets.load_dataset('cmrc2018')`
DRCD	Delta Reading Comprehension Dataset	`paddlenlp.datasets.load_dataset('drcd')`
TriviaQA	Trivia Question Answering Dataset	`paddlenlp.datasets.load_dataset('triviaqa')`
C3	Multiple-choice Reading Comprehension	`paddlenlp.datasets.load_dataset('c3')`

Text Classification#

Dataset Name	Description	Method
CoLA	Single-sentence classification, binary (grammatical correctness)	`paddlenlp.datasets.load_dataset('glue','cola')`
SST-2	Single-sentence classification, binary (sentiment analysis)	`paddlenlp.datasets.load_dataset('glue','sst-2')`
MRPC	Sentence pair classification, binary (paraphrase detection)	`paddlenlp.datasets.load_dataset('glue','mrpc')`
STSB	Calculate sentence pair similarity, score ranges from 1 to 5	`paddlenlp.datasets.load_dataset('glue','sts-b')`
QQP	Determine if sentence pairs are equivalent, with two categories: equivalent and non-equivalent (binary classification)	`paddlenlp.datasets.load_dataset('glue','qqp')`
MNLI	Sentence pairs with premise and hypothesis. Relationships between premise and hypothesis fall into three categories: entailment, contradiction, neutral (3-class classification)	`paddlenlp.datasets.load_dataset('glue','mnli')`
QNLI	Determine if question and sentence are entailed, with two categories: entailment and not_entailed (binary classification)	`paddlenlp.datasets.load_dataset('glue','qnli')`
RTE	Judge if sentence pairs entail each other, with two categories: entailment and not_entailed (binary classification)	`paddlenlp.datasets.load_dataset('glue','rte')`
WNLI	Determine if sentence pairs are related, with two categories: related and unrelated (binary classification)	`paddlenlp.datasets.load_dataset('glue','wnli')`
LCQMC	A Large-scale Chinese Question Matching Corpus (semantic matching dataset)	`paddlenlp.datasets.load_dataset('lcqmc')`
ChnSentiCorp	Chinese review sentiment analysis corpus	`paddlenlp.datasets.load_dataset('chnsenticorp')`
COTE-DP	Chinese opinion extraction corpus	`paddlenlp.datasets.load_dataset('cote', 'dp')`
SE-ABSA16_PHNS	Chinese Aspect-based Sentiment Analysis Corpus	`paddlenlp.datasets.load_dataset('seabsa16', 'phns')`
AFQMC	Ant Financial Semantic Similarity Dataset (1: similar, 0: dissimilar)	`paddlenlp.datasets.load_dataset('clue', 'afqmc')`
TNEWS	Toutiao Chinese News Headlines Classification (15 categories)	`paddlenlp.datasets.load_dataset('clue', 'tnews')`
IFLYTEK	Long Text Classification (119 categories)	`paddlenlp.datasets.load_dataset('clue', 'iflytek')`
OCNLI	Original Chinese Natural Language Inference (three-way classification)	`paddlenlp.datasets.load_dataset('clue', 'ocnli')`
CMNLI	Chinese Language Understanding and Inference (entailment/contradiction/neutral)	`paddlenlp.datasets.load_dataset('clue', 'cmnli')`
CLUEWSC2020	Chinese Winograd Schema Challenge (coreference resolution)	`paddlenlp.datasets.load_dataset('clue', 'cluewsc2020')`
CSL	Chinese Scientific Literature Keyword Recognition (binary classification)	`paddlenlp.datasets.load_dataset('clue', 'csl')`
EPRSTMT	E-commerce Product Review Sentiment Analysis (Positive/Negative)	`paddlenlp.datasets.load_dataset('fewclue', 'eprstmt')`
CSLDCP	Chinese scientific literature discipline classification from the FewCLUE benchmark (67 categories)	`paddlenlp.datasets.load_dataset('fewclue', 'csldcp')`
TNEWSF	Toutiao Chinese news headline (short text) classification from FewCLUE, 15 categories	`paddlenlp.datasets.load_dataset('fewclue', 'tnews')`
IFLYTEK	Long text classification task from FewCLUE, 119 categories	`paddlenlp.datasets.load_dataset('fewclue', 'iflytek')`
OCNLIF	Chinese natural language inference dataset from FewCLUE, sentence pair ternary classification	`paddlenlp.datasets.load_dataset('fewclue', 'ocnli')`
BUSTM	Dialogue short text semantic matching dataset from FewCLUE, binary classification	`paddlenlp.datasets.load_dataset('fewclue', 'bustm')`
CHIDF	Chinese idiom reading comprehension cloze from FewCLUE, predict correct idiom from 7 candidates	`paddlenlp.datasets.load_dataset('fewclue', 'chid')`
CSLF	Paper keyword recognition from FewCLUE, binary classification for authentic keywords	`paddlenlp.datasets.load_dataset('fewclue', 'csl')`
CLUEWSCF	WSC Winograd schema challenge Chinese version from FewCLUE, pronoun disambiguation task	`paddlenlp.datasets.load_dataset('fewclue', 'cluewsc')`
THUCNews	THUCNews Chinese news category classification	`paddlenlp.datasets.load_dataset('thucnews')`
HYP	English political news sentiment classification corpus
LCQMC	Chinese question matching corpus, binary classification task	`paddlenlp.datasets.load_dataset('lcqmc', splits=['test', 'dev'])`
NLPCC-DBQA	Chinese database question answering dataset, binary classification task	`paddlenlp.datasets.load_dataset('nlpcc_dbqa')`
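Several rows above use a two-argument form, where the first argument names a benchmark (`glue`, `clue`, `fewclue`) and the second selects a task within it. The sketch below collects a few of those pairs from the table into a lookup. `BENCHMARK_TASKS` and `load_args` are hypothetical names used for illustration only, and the mapping is a partial copy of the table rather than anything exported by PaddleNLP.

```python
# Partial mapping from task name to (benchmark, subset), copied from the table.
BENCHMARK_TASKS = {
    "CoLA": ("glue", "cola"),
    "SST-2": ("glue", "sst-2"),
    "AFQMC": ("clue", "afqmc"),
    "TNEWS": ("clue", "tnews"),
    "EPRSTMT": ("fewclue", "eprstmt"),
    "BUSTM": ("fewclue", "bustm"),
}


def load_args(task):
    """Return (benchmark, subset) for a task name, matched case-insensitively."""
    lookup = {name.lower(): pair for name, pair in BENCHMARK_TASKS.items()}
    try:
        return lookup[task.lower()]
    except KeyError:
        known = ", ".join(sorted(BENCHMARK_TASKS))
        raise KeyError(f"unknown task {task!r}; known tasks: {known}") from None
```

The returned pair can then be unpacked into the loader, for example `paddlenlp.datasets.load_dataset(*load_args('SST-2'))`, which is equivalent to the `('glue', 'sst-2')` call listed in the table.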

Natural Language Inference#

Dataset Name	Description	How to Use
CMNLI	Chinese Multi-Genre NLI dataset	`paddlenlp.datasets.load_dataset('cmnli')`
OCNLI	Chinese Original NLI dataset	`paddlenlp.datasets.load_dataset('ocnli')`
GLUE-MNLI	English Multi-Genre NLI dataset	`paddlenlp.datasets.load_dataset('glue', 'mnli')`
GLUE-QNLI	English Question NLI dataset	`paddlenlp.datasets.load_dataset('glue', 'qnli')`
GLUE-RTE	English Recognizing Textual Entailment dataset	`paddlenlp.datasets.load_dataset('glue', 'rte')`
XNLI	15-language NLI dataset, 3-class task	`paddlenlp.datasets.load_dataset('xnli', 'ar')`
XNLI_CN	Chinese subset of XNLI, 3-class task	`paddlenlp.datasets.load_dataset('xnli_cn')`

Text Matching#

Dataset Name	Description	How to Use
CAIL2019-SCM	Similar legal case matching	`paddlenlp.datasets.load_dataset('cail2019_scm')`

Sequence Labeling#

Dataset Name	Description	How to Use
MSRA_NER	MSRA named entity recognition dataset	`paddlenlp.datasets.load_dataset('msra_ner')`
People's Daily	People's Daily named entity recognition dataset	`paddlenlp.datasets.load_dataset('peoples_daily_ner')`
CoNLL-2002	Spanish and Dutch NER datasets	`paddlenlp.datasets.load_dataset('conll2002', 'es')`

Machine Translation#

Dataset Name	Description	How to Use
IWSLT15	IWSLT'15 English-Vietnamese translation dataset	`paddlenlp.datasets.load_dataset('iwslt15')`
WMT14ENDE	WMT14 EN-DE translation dataset with BPE tokenization	`paddlenlp.datasets.load_dataset('wmt14ende')`

Simultaneous Machine Translation#

Dataset Name	Description	Loading Method
BSTC	Baidu Speech Translation Corpus, including transcription_translation and ASR	`paddlenlp.datasets.load_dataset('bstc', 'asr')`

Dialogue System#

Dataset Name	Description	Loading Method
DuConv	Knowledge-aware Chinese Conversation Dataset	`paddlenlp.datasets.load_dataset('duconv')`

Text Generation#

Dataset Name	Description	Loading Method
Poetry	Classical Chinese Poetry Collection	`paddlenlp.datasets.load_dataset('poetry')`
Couplet	Chinese Couplet Dataset	`paddlenlp.datasets.load_dataset('couplet')`
DuReaderQG	Question Generation Dataset Based on DuReader	`paddlenlp.datasets.load_dataset('dureader_qg')`
AdvertiseGen	Chinese Advertising Copy Generation Dataset	`paddlenlp.datasets.load_dataset('advertisegen')`
LCSTS_new	Chinese Abstractive Summarization Dataset	`paddlenlp.datasets.load_dataset('lcsts_new')`
CNN/Dailymail	English Abstractive Summarization Dataset	`paddlenlp.datasets.load_dataset('cnn_dailymail')`

Corpus#

Dataset Name	Description	Loading Method
PTB	Penn Treebank Dataset	`paddlenlp.datasets.load_dataset('ptb')`
Yahoo Answer 100k	100k samples from Yahoo Answer	`paddlenlp.datasets.load_dataset('yahoo_answer_100k')`
