text_correction_model

class ErnieForCSC(ernie, pinyin_vocab_size, pad_pinyin_id=0)

ErnieForCSC is a model specified for the Chinese Spelling Correction (CSC) task.

It integrates phonetic features into the language model by leveraging the powerful pre-training and fine-tuning paradigm.

See more details on https://aclanthology.org/2021.findings-acl.198.pdf.

Parameters:
  • ernie (ErnieModel) – An instance of paddlenlp.transformers.ErnieModel.

  • pinyin_vocab_size (int) – The vocab size of the pinyin vocab.

  • pad_pinyin_id (int, optional) – The pad token id of the pinyin vocab. Defaults to 0.
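Example (a minimal construction sketch; the ErnieForCSC import path and the pinyin_vocab_size value below are assumptions, so adjust both to your PaddleNLP version and pinyin vocabulary file):

  from paddlenlp.transformers import ErnieModel
  # Assumed import path: the class ships with the text-correction Taskflow.
  from paddlenlp.taskflow.models import ErnieForCSC

  # Load a pretrained ERNIE backbone.
  ernie = ErnieModel.from_pretrained("ernie-1.0")

  # pinyin_vocab_size=430 is a placeholder; use the size of the pinyin
  # vocabulary that accompanies your CSC checkpoint.
  model = ErnieForCSC(ernie, pinyin_vocab_size=430, pad_pinyin_id=0)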

forward(input_ids, pinyin_ids, token_type_ids=None, position_ids=None, attention_mask=None)
Parameters:
  • input_ids (Tensor) – Indices of input sequence tokens in the vocabulary. They are numerical representations of tokens that build the input sequence. Its data type should be int64 and it has a shape of [batch_size, sequence_length].

  • pinyin_ids (Tensor) – Indices of pinyin tokens of the input sequence in the pinyin vocabulary. They are numerical representations of tokens that build the pinyin input sequence. Its data type should be int64 and it has a shape of [batch_size, sequence_length].

  • token_type_ids (Tensor, optional) –

    Segment token indices to indicate first and second portions of the inputs. Indices can be either 0 or 1:

    • 0 corresponds to a sentence A token,

    • 1 corresponds to a sentence B token.

    Its data type should be int64 and it has a shape of [batch_size, sequence_length]. Defaults to None, which means no segment embeddings are added to the token embeddings.

  • position_ids (Tensor, optional) – Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]. Its data type should be int32 or int64 and it has a shape of [batch_size, sequence_length]. Defaults to None.

  • attention_mask (Tensor, optional) –

    Mask to indicate whether to perform attention on each input token. The values should be either 0 or 1. The attention scores will be set to -infinity for any positions in the mask that are 0, and will be unchanged for positions that are 1:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

    Its data type should be float32 and it has a shape of [batch_size, sequence_length]. Defaults to None.

Returns:

  det_preds (Tensor):
    A Tensor of the detection prediction of each token. Shape as [batch_size, sequence_length] and dtype as int.

  char_preds (Tensor):
    A Tensor of the correction prediction of each token. Shape as [batch_size, sequence_length] and dtype as int.

Return type:

  tuple
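Example (a hedged usage sketch of forward; the random ids below are placeholders, since real input_ids come from the ERNIE tokenizer and real pinyin_ids from a pinyin vocabulary, and model is the ErnieForCSC instance built in the construction sketch above):

  import paddle

  batch_size, sequence_length = 2, 8

  # Placeholder ids; high=100 keeps the random ids inside any realistic vocab.
  input_ids = paddle.randint(low=0, high=100, shape=[batch_size, sequence_length], dtype="int64")
  pinyin_ids = paddle.randint(low=0, high=100, shape=[batch_size, sequence_length], dtype="int64")

  det_preds, char_preds = model(input_ids=input_ids, pinyin_ids=pinyin_ids)
  print(det_preds.shape)   # [2, 8] - per-token detection predictions
  print(char_preds.shape)  # [2, 8] - per-token correction predictions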