PaddleNLP Metrics API
PaddleNLP currently provides the following model evaluation metrics:
| Metric | Description | API |
|---|---|---|
| Perplexity | Perplexity, commonly used to evaluate language models, also applicable to machine translation and text generation tasks. | paddlenlp.metrics.Perplexity |
| BLEU (BiLingual Evaluation Understudy) | Common evaluation metric for machine translation | paddlenlp.metrics.BLEU |
| ROUGE (Recall-Oriented Understudy for Gisting Evaluation) | Evaluation metrics for automatic summarization and machine translation | paddlenlp.metrics.RougeL, paddlenlp.metrics.RougeN |
| AccuracyAndF1 | Accuracy and F1-score, applicable to MRPC and QQP tasks in GLUE | paddlenlp.metrics.AccuracyAndF1 |
| PearsonAndSpearman | Pearson correlation coefficient and Spearman's rank correlation coefficient. Applicable to STS-B task in GLUE | paddlenlp.metrics.PearsonAndSpearman |
| Mcc (Matthews correlation coefficient) | Matthews correlation coefficient, which measures binary classification performance. Applicable to the CoLA task in GLUE | paddlenlp.metrics.Mcc |
| ChunkEvaluator | Computes precision, recall and F1-score for chunk detection. Commonly used in sequence labeling tasks like Named Entity Recognition (NER) | paddlenlp.metrics.ChunkEvaluator |
| SQuAD Evaluation | Evaluation metrics for SQuAD and DuReader-robust | paddlenlp.metrics.compute_predictions, paddlenlp.metrics.squad_evaluate |
| Distinct | Diversity metric commonly used to measure the surface-level diversity of sentences generated by text generation models. | paddlenlp.metrics.Distinct |
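
To make the Distinct row more concrete, the sketch below computes a distinct-n score in plain Python, under the common definition of the ratio of unique n-grams to total n-grams across the generated sentences. It is an illustration only and does not call or reproduce `paddlenlp.metrics.Distinct`; the function name `distinct_n` and the sample sentences are hypothetical, and the library's exact tokenization and aggregation may differ.

```python
from collections import Counter
from typing import Sequence


def distinct_n(sentences: Sequence[Sequence[str]], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams over tokenized sentences.

    Hypothetical helper for illustration; not part of PaddleNLP.
    """
    counts = Counter()
    for tokens in sentences:
        # Slide a window of size n over each sentence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = sum(counts.values())
    # With no n-grams at all (empty input or sentences shorter than n), return 0.0.
    return len(counts) / total if total else 0.0


# Made-up, pre-tokenized model outputs.
generated = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "cat", "sat", "on", "the", "sofa"],
]

print(distinct_n(generated, n=1))  # 6 unique unigrams out of 12
print(distinct_n(generated, n=2))  # 6 unique bigrams out of 10
```

A score closer to 1 means the generated text repeats fewer n-grams; a score near 0 means the outputs are highly repetitive.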