| CoLA | Single-sentence binary classification (grammatical acceptability) | paddlenlp.datasets.load_dataset('glue','cola') |
| SST-2 | Single-sentence binary classification (sentiment analysis) | paddlenlp.datasets.load_dataset('glue','sst-2') |
| MRPC | Sentence-pair binary classification (paraphrase detection) | paddlenlp.datasets.load_dataset('glue','mrpc') |
| STSB | Sentence-pair similarity scoring, with scores ranging from 1 to 5 | paddlenlp.datasets.load_dataset('glue','sts-b') |
| QQP | Sentence-pair binary classification: whether the two sentences are equivalent or not equivalent | paddlenlp.datasets.load_dataset('glue','qqp') |
| MNLI | Premise-hypothesis sentence pairs; the relationship is one of three classes: entailment, contradiction, neutral (3-class classification) | paddlenlp.datasets.load_dataset('glue','mnli') |
| QNLI | Question-sentence pairs; whether the sentence entails the answer to the question: entailment or not_entailment (binary classification) | paddlenlp.datasets.load_dataset('glue','qnli') |
| RTE | Sentence pairs; whether the first sentence entails the second: entailment or not_entailment (binary classification) | paddlenlp.datasets.load_dataset('glue','rte') |
| WNLI | Sentence pairs; whether the two sentences are related: related or unrelated (binary classification) | paddlenlp.datasets.load_dataset('glue','wnli') |
| LCQMC | A Large-scale Chinese Question Matching Corpus (semantic matching dataset) | paddlenlp.datasets.load_dataset('lcqmc') |
| ChnSentiCorp | Chinese review sentiment analysis corpus | paddlenlp.datasets.load_dataset('chnsenticorp') |
| COTE-DP | Chinese opinion extraction corpus | paddlenlp.datasets.load_dataset('cote', 'dp') |
| SE-ABSA16_PHNS | Chinese aspect-based sentiment analysis corpus | paddlenlp.datasets.load_dataset('seabsa16', 'phns') |
| AFQMC | Ant Financial semantic similarity dataset (1: similar, 0: dissimilar) | paddlenlp.datasets.load_dataset('clue', 'afqmc') |
| TNEWS | Toutiao Chinese news headline classification (15 categories) | paddlenlp.datasets.load_dataset('clue', 'tnews') |
| IFLYTEK | Long-text classification (119 categories) | paddlenlp.datasets.load_dataset('clue', 'iflytek') |
| OCNLI | Original Chinese Natural Language Inference (3-class classification) | paddlenlp.datasets.load_dataset('clue', 'ocnli') |
| CMNLI | Chinese Multi-Genre Natural Language Inference (entailment/contradiction/neutral) | paddlenlp.datasets.load_dataset('clue', 'cmnli') |
| CLUEWSC2020 | Chinese Winograd Schema Challenge (coreference resolution) | paddlenlp.datasets.load_dataset('clue', 'cluewsc2020') |
| CSL | Chinese scientific literature keyword recognition (binary classification) | paddlenlp.datasets.load_dataset('clue', 'csl') |
| EPRSTMT | E-commerce product review sentiment analysis (positive/negative) | paddlenlp.datasets.load_dataset('fewclue', 'eprstmt') |

| Dataset | Description | Command |
| --- | --- | --- |
| CSLDCP | Chinese scientific literature discipline classification from the FewCLUE benchmark, 67 categories | paddlenlp.datasets.load_dataset('fewclue', 'csldcp') |
| TNEWSF | Toutiao (Today's Headlines) Chinese news short-text classification from FewCLUE, 15 categories | paddlenlp.datasets.load_dataset('fewclue', 'tnews') |
| IFLYTEK | Long-text classification task from FewCLUE, 119 categories | paddlenlp.datasets.load_dataset('fewclue', 'iflytek') |
| OCNLIF | Chinese natural language inference dataset from FewCLUE, sentence-pair 3-class classification | paddlenlp.datasets.load_dataset('fewclue', 'ocnli') |
| BUSTM | Short-text dialogue semantic matching dataset from FewCLUE, binary classification | paddlenlp.datasets.load_dataset('fewclue', 'bustm') |
| CHIDF | Chinese idiom cloze reading comprehension from FewCLUE; predict the correct idiom from 7 candidates | paddlenlp.datasets.load_dataset('fewclue', 'chid') |
| CSLF | Paper keyword recognition from FewCLUE; binary classification of whether the keywords are authentic | paddlenlp.datasets.load_dataset('fewclue', 'csl') |
| CLUEWSCF | Chinese version of the Winograd Schema Challenge (WSC) from FewCLUE, pronoun disambiguation task | paddlenlp.datasets.load_dataset('fewclue', 'cluewsc') |
| THUCNews | THUCNews Chinese news category classification | paddlenlp.datasets.load_dataset('thucnews') |
| HYP | English political news sentiment classification corpus | |
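
The commands in the table all follow the same pattern: a dataset path, an optional sub-dataset name, and optional keyword arguments such as `splits`. The following is a minimal sketch of that pattern, not part of PaddleNLP itself. `TABLE_COMMANDS` and `load_from_table` are hypothetical names introduced here for illustration, and only a few rows of the table are copied into the mapping. The `loader` parameter is an assumption added so the lookup can be tried without PaddleNLP installed; when it is omitted, the sketch falls back to `paddlenlp.datasets.load_dataset`.

```python
# Hypothetical lookup table: arguments copied from a few rows of the
# "Command" column above, keyed by the dataset name in the first column.
TABLE_COMMANDS = {
    "CoLA": ("glue", "cola"),
    "SST-2": ("glue", "sst-2"),
    "AFQMC": ("clue", "afqmc"),
    "BUSTM": ("fewclue", "bustm"),
    "THUCNews": ("thucnews",),
}


def load_from_table(dataset_name, loader=None, **kwargs):
    """Look up a table row and call the loader with that row's arguments."""
    if loader is None:
        # Imported lazily so the lookup logic can run without PaddleNLP
        # installed; the real call may download the dataset on first use.
        from paddlenlp.datasets import load_dataset as loader
    args = TABLE_COMMANDS[dataset_name]
    # Extra keyword arguments (for example splits=[...]) are passed through.
    return loader(*args, **kwargs)
```

With PaddleNLP installed, `load_from_table("CoLA", splits=["train", "dev"])` would be equivalent to calling `paddlenlp.datasets.load_dataset('glue', 'cola', splits=['train', 'dev'])` directly.
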

| Dataset Name | Description | How to Use |
| ---- | --------- | ------ |
| ChnSentiCorp | Chinese sentiment analysis dataset | paddlenlp.datasets.load_dataset('chnsenticorp') |
| LCQMC | Chinese question matching corpus, binary classification task | paddlenlp.datasets.load_dataset('lcqmc', splits=['test', 'dev']) |
| NLPCC-DBQA | Chinese database question answering dataset, binary classification task | paddlenlp.datasets.load_dataset('nlpcc_dbqa') |
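
The LCQMC row passes `splits=['test', 'dev']`, which requests several splits in one call. As a rough sketch of how the result might be handled, the helper below pairs each requested split name with the dataset returned for it. `load_splits` is a hypothetical helper, not a PaddleNLP function, and it assumes the loader returns the datasets in the same order as the requested splits (or a single dataset object when only one comes back). The `loader` parameter is likewise an assumption added so the pairing logic can be checked without downloading data.

```python
def load_splits(path, splits, name=None, loader=None):
    """Return a dict mapping each requested split name to its dataset."""
    if loader is None:
        # Imported lazily; requires PaddleNLP and may download the data.
        from paddlenlp.datasets import load_dataset as loader
    result = loader(path, name=name, splits=list(splits))
    # Assumption: several splits come back as a list or tuple in request
    # order; a lone dataset object is wrapped so zip() still works.
    if not isinstance(result, (list, tuple)):
        result = [result]
    return dict(zip(splits, result))
```

With PaddleNLP installed, `load_splits('lcqmc', ['test', 'dev'])` would give a dictionary with `'test'` and `'dev'` keys, which avoids relying on positional unpacking when the split order changes.
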