distinct

class Distinct(n_size=2, trans_func=None, name='distinct')[source]

Bases: paddle.metric.metrics.Metric

Distinct is a metric for evaluating the diversity of generated text by counting distinct n-grams. The score is the ratio of distinct n-grams to the total number of n-grams, so a higher score indicates more diverse text. See details at https://arxiv.org/abs/1510.03055.

Distinct can be used either as a paddle.metric.Metric class or as an ordinary class. When Distinct is used as a paddle.metric.Metric class, a function is needed to transform the network output into a list of strings.
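The counting described above can be sketched in plain Python. This is a minimal illustration, not the PaddleNLP implementation: the helper name `distinct_n` is hypothetical, and returning 0.0 for inputs shorter than n is an assumption made here for convenience.

```python
def distinct_n(tokens, n=2):
    """Ratio of distinct n-grams to total n-grams in one token list.

    Hypothetical helper for illustration only; it is not part of PaddleNLP.
    """
    # Slide a window of length n over the tokens to collect every n-gram.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Assumption: inputs too short to form any n-gram score 0.0.
        return 0.0
    return len(set(ngrams)) / len(ngrams)


cand = ["The", "cat", "The", "cat", "on", "the", "mat"]
# 6 bigrams in total, 5 of them distinct ("The cat" occurs twice).
score = distinct_n(cand, n=2)
print(score)  # 0.8333333333333334
```

Matching is case-sensitive in this sketch, so "The" and "the" count as different tokens.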

Parameters
  • trans_func (callable, optional) --

    A function that transforms the network output into a list of strings. Default: None.

    Note

    When Distinct is used as a paddle.metric.Metric class, trans_func must be provided. Note that the input to trans_func is a numpy array.

  • n_size (int, optional) -- The n-gram size (the n in n-gram) used by the Distinct metric. Default: 2.

  • name (str, optional) -- Name of the paddle.metric.Metric instance. Default: "distinct".

Examples

  1. Using Distinct as a general evaluation object.

from paddlenlp.metrics import Distinct
distinct = Distinct()
cand = ["The","cat","The","cat","on","the","mat"]
distinct.add_inst(cand)
print(distinct.score()) # 0.8333333333333334

  2. Using Distinct as an instance of paddle.metric.Metric.

import numpy as np
from functools import partial
import paddle
from paddlenlp.transformers import BertTokenizer
from paddlenlp.metrics import Distinct

def trans_func(logits, tokenizer):
    '''Transform the network output `logits` to string list.'''
    # [batch_size, seq_len]
    token_ids = np.argmax(logits, axis=-1).tolist()
    cand_list = []
    for ids in token_ids:
        tokens = tokenizer.convert_ids_to_tokens(ids)
        strings = tokenizer.convert_tokens_to_string(tokens)
        cand_list.append(strings.split())
    return cand_list

paddle.seed(2021)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
distinct = Distinct(trans_func=partial(trans_func, tokenizer=tokenizer))
batch_size, seq_len, vocab_size = 4, 16, tokenizer.vocab_size
logits = paddle.rand([batch_size, seq_len, vocab_size])
distinct.update(logits.numpy())
print(distinct.accumulate()) # 1.0

update(output, *args)[source]

Updates the metric state. This method first calls trans_func() to convert the output into a list of tokenized candidate sentences, then calls add_inst() on each candidate in turn.

Parameters
  • output (numpy.ndarray|Tensor) -- The output of the model.

  • args (tuple) -- The additional inputs.
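The update flow described above (trans_func first, then add_inst on each candidate) can be sketched with a small stand-in class. Everything here is hypothetical and for illustration only: `SketchDistinct`, `VOCAB`, and `ids_to_tokens` are not PaddleNLP names, the batch is a plain nested list rather than a numpy array, and pooling n-grams across all candidates before taking the ratio is an assumption about how the state is accumulated.

```python
class SketchDistinct:
    """Hypothetical stand-in that mirrors the documented method names."""

    def __init__(self, n_size=2, trans_func=None):
        self.n_size = n_size
        self.trans_func = trans_func
        self.reset()

    def reset(self):
        # Clear everything accumulated so far.
        self.seen = set()   # distinct n-grams observed
        self.total = 0      # total n-grams observed

    def add_inst(self, cand):
        # Record every n-gram of one tokenized candidate.
        for i in range(len(cand) - self.n_size + 1):
            self.seen.add(tuple(cand[i:i + self.n_size]))
            self.total += 1

    def update(self, output, *args):
        # Step 1: turn the raw output into tokenized candidates.
        if self.trans_func is None:
            raise ValueError("trans_func is required when calling update()")
        # Step 2: feed the candidates to add_inst one by one.
        for cand in self.trans_func(output):
            self.add_inst(cand)

    def accumulate(self):
        # Assumption: an empty state scores 0.0 instead of raising.
        return len(self.seen) / self.total if self.total else 0.0


VOCAB = ["the", "cat", "sat", "mat"]  # toy vocabulary for the sketch


def ids_to_tokens(batch_ids):
    """Toy trans_func: map each row of token ids to a list of tokens."""
    return [[VOCAB[i] for i in ids] for ids in batch_ids]


metric = SketchDistinct(trans_func=ids_to_tokens)
metric.update([[0, 1, 2], [0, 1, 3]])
result = metric.accumulate()
print(result)  # 0.75: 3 distinct bigrams out of 4, since "the cat" repeats
```

Calling `metric.reset()` before the next evaluation run clears the pooled counts, so scores from separate runs do not mix.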

add_inst(cand)[source]

Updates the metric state based on a single candidate.

Parameters

cand (list) -- Tokenized candidate sentence generated by the model.

reset()[source]

Resets the accumulated state and the result.

accumulate()[source]

Calculates the final distinct score.

Returns

The final distinct score.

Return type

float

score()[source]

Equivalent to the accumulate() method.

Returns

The final distinct score.

Return type

float

name()[source]

Returns the metric name.

Returns

The metric name.

Return type

str