PaddleNLP Metrics API

Currently PaddleNLP provides the following model evaluation metrics:

| Metric | Description | API |
| --- | --- | --- |
| Perplexity | Commonly used to evaluate language models; also applicable to machine translation and text generation tasks. | `paddlenlp.metrics.Perplexity` |
| BLEU (BiLingual Evaluation Understudy) | Common evaluation metric for machine translation. | `paddlenlp.metrics.BLEU` |
| ROUGE (Recall-Oriented Understudy for Gisting Evaluation) | Evaluation metrics for automatic summarization and machine translation. | `paddlenlp.metrics.RougeL`, `paddlenlp.metrics.RougeN` |
| AccuracyAndF1 | Accuracy and F1-score. Applicable to the MRPC and QQP tasks in GLUE. | `paddlenlp.metrics.AccuracyAndF1` |
| PearsonAndSpearman | Pearson correlation coefficient and Spearman's rank correlation coefficient. Applicable to the STS-B task in GLUE. | `paddlenlp.metrics.PearsonAndSpearman` |
| Mcc (Matthews correlation coefficient) | Measures binary classification performance. Applicable to the CoLA task in GLUE. | `paddlenlp.metrics.Mcc` |
| ChunkEvaluator | Computes precision, recall and F1-score for chunk detection. Commonly used in sequence labeling tasks such as Named Entity Recognition (NER). | `paddlenlp.metrics.ChunkEvaluator` |
| SQuAD Evaluation | Evaluation metrics for SQuAD and DuReader-robust. | `paddlenlp.metrics.compute_predictions`, `paddlenlp.metrics.squad_evaluate` |
| Distinct | Diversity metric commonly used to measure the formal diversity of sentences generated by text generation models. | `paddlenlp.metrics.Distinct` |
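
To make a few of the definitions above concrete, here is a minimal plain-Python sketch of what Distinct, Perplexity and Mcc measure. It follows the textbook formulas only and does not call PaddleNLP; the function names `distinct_n`, `perplexity` and `matthews_corrcoef` are hypothetical helpers chosen for illustration, and the library classes may differ in tokenization, batching and accumulation details.

```python
import math


def distinct_n(tokens, n=2):
    """Distinct-n: number of unique n-grams divided by total n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


def perplexity(token_probs):
    """Perplexity: exp of the mean negative log-likelihood.

    `token_probs` holds the probability the model assigned to each
    reference token (an assumed input format for this sketch).
    """
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


def matthews_corrcoef(labels, preds):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 by convention when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    print(distinct_n("a b a b".split(), n=2))
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
    print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]))
```

A higher Distinct-n means less repetition in the generated text, a lower perplexity means the model assigns higher probability to the reference tokens, and Mcc ranges from -1 (total disagreement) to 1 (perfect prediction).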