distinct

class Distinct(n_size=2, trans_func=None, name='distinct')

Bases: Metric

Distinct evaluates the textual diversity of generated text by measuring its distinct n-grams: the score is the number of distinct n-grams divided by the total number of n-grams, so a higher score indicates more diverse text. For details, see https://arxiv.org/abs/1510.03055.
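
The ratio described above can be sketched in plain Python. This is a minimal illustration, not the PaddleNLP implementation: the helper name distinct_n and the choice to return 0.0 for sequences shorter than n are assumptions made for this sketch.

```python
def distinct_n(tokens, n=2):
    """Return (# unique n-grams) / (# n-grams) for one token list.

    Hypothetical helper for illustration; not part of paddlenlp.
    """
    # Slide a window of width n over the tokens; tuples are hashable.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Sequence shorter than n: returning 0.0 is a choice of this sketch.
        return 0.0
    return len(set(ngrams)) / len(ngrams)


cand = ["The", "cat", "The", "cat", "on", "the", "mat"]
# 6 bigrams in total; ("The", "cat") occurs twice, so 5 are distinct.
print(distinct_n(cand))  # 5 / 6
```

Matching is case-sensitive in this sketch, so "The" and "the" count as different tokens.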

Distinct can be used either as a paddle.metric.Metric class or as an ordinary class. When it is used as a paddle.metric.Metric class, a function is needed to transform the network output into a list of strings.

Parameters:
  • n_size (int, optional) – The n-gram size (the n in n-gram) used by the Distinct metric. Defaults to 2.

  • trans_func (callable, optional) –

    trans_func transforms the network output to a string list. Defaults to None.

    Note

    When Distinct is used as a paddle.metric.Metric class, trans_func must be provided. Note that the input to trans_func is a NumPy array.

  • name (str, optional) – Name of paddle.metric.Metric instance. Defaults to “distinct”.

Examples

  1. Using as a general evaluation object.

from paddlenlp.metrics import Distinct
distinct = Distinct()
cand = ["The","cat","The","cat","on","the","mat"]
# Update the states.
distinct.add_inst(cand)
print(distinct.score())
# 0.8333333333333334

  2. Using as an instance of paddle.metric.Metric.

import numpy as np
from functools import partial
import paddle
from paddlenlp.transformers import BertTokenizer
from paddlenlp.metrics import Distinct

def trans_func(logits, tokenizer):
    '''Transform the network output `logits` to string list.'''
    # [batch_size, seq_len]
    token_ids = np.argmax(logits, axis=-1).tolist()
    cand_list = []
    for ids in token_ids:
        tokens = tokenizer.convert_ids_to_tokens(ids)
        strings = tokenizer.convert_tokens_to_string(tokens)
        cand_list.append(strings.split())
    return cand_list

paddle.seed(2021)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
distinct = Distinct(trans_func=partial(trans_func, tokenizer=tokenizer))
batch_size, seq_len, vocab_size = 4, 16, tokenizer.vocab_size
logits = paddle.rand([batch_size, seq_len, vocab_size])

distinct.update(logits.numpy())
print(distinct.accumulate()) # 1.0

update(output, *args)

Updates the metric states. This method first applies trans_func to the output to obtain a list of tokenized candidate sentences, then calls add_inst() on each candidate in turn.

Parameters:
  • output (numpy.ndarray|Tensor) – The model output.

  • args (tuple) – The additional inputs.
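
The update flow described above (transform the output, then feed each candidate to add_inst()) can be sketched with a small stand-in class. SketchDistinct, its seen/total state, the toy vocab, and ids_to_tokens are all hypothetical names introduced for illustration; the sketch assumes the state is a set of seen n-grams plus a running count, and it uses nested lists in place of a NumPy array.

```python
class SketchDistinct:
    """Hypothetical stand-in; not the paddlenlp implementation."""

    def __init__(self, n_size=2, trans_func=None):
        self.n_size = n_size
        self.trans_func = trans_func
        self.reset()

    def reset(self):
        # Assumed state: the n-grams seen so far and a running total.
        self.seen = set()
        self.total = 0

    def add_inst(self, cand):
        # Record every n-gram of one tokenized candidate.
        for i in range(len(cand) - self.n_size + 1):
            self.seen.add(tuple(cand[i:i + self.n_size]))
            self.total += 1

    def update(self, output):
        if self.trans_func is None:
            raise ValueError("update() needs a trans_func")
        # trans_func returns one token list per candidate in the batch.
        for cand in self.trans_func(output):
            self.add_inst(cand)

    def accumulate(self):
        # Returning 0.0 on empty state is a choice of this sketch.
        return len(self.seen) / self.total if self.total else 0.0


# Toy trans_func: map rows of token ids to tokens via a made-up vocab.
vocab = {0: "the", 1: "cat", 2: "sat"}


def ids_to_tokens(batch):
    return [[vocab[i] for i in row] for row in batch]


metric = SketchDistinct(trans_func=ids_to_tokens)
metric.update([[0, 1, 2], [0, 1, 0]])
score = metric.accumulate()
print(score)  # 0.75: 3 distinct bigrams out of 4
```

In this sketch the score is pooled over every candidate seen since the last reset(), rather than averaged per candidate.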

add_inst(cand)

Updates the states based on the candidate.

Parameters:

cand (list) – A tokenized candidate sentence generated by the model.

reset()

Resets states and result.

accumulate()

Calculates the final distinct score.

Returns:

The final distinct score.

Return type:

float

score()

Equivalent to accumulate().

Returns:

The final distinct score.

Return type:

float

name()

Returns the metric name.

Returns:

The metric name.

Return type:

str