tokenizer

Tokenization class for the UnifiedTransformer model.

class UnifiedTransformerTokenizer(vocab_file, sentencepiece_model_file, do_lower_case=False, unk_token='[UNK]', pad_token='[PAD]', cls_token='[CLS]', sep_token='[SEP]', mask_token='[MASK]', special_tokens_file='')[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

Constructs a UnifiedTransformer tokenizer based on SentencePiece.

This tokenizer inherits from PretrainedTokenizer which contains most of the main methods. For more information regarding those methods, please refer to this superclass.

Parameters
  • vocab_file (str) -- The path of file to construct vocabulary.

  • sentencepiece_model_file (str) -- The sentencepiece model file (ends with '.spm') required to instantiate a SentencePiece.

  • do_lower_case (bool, optional) -- Whether or not to lowercase the input when tokenizing. Defaults to False and does not lowercase the input.

  • unk_token (str, optional) -- A special token representing an unknown (out-of-vocabulary) token. Tokens not in the vocabulary are set to unk_token in order to be converted to an ID. Defaults to "[UNK]".

  • pad_token (str, optional) -- A special token used to make arrays of tokens the same size for batching purposes. Defaults to "[PAD]".

  • cls_token (str, optional) -- A special token representing the beginning of a sequence. Defaults to "[CLS]".

  • sep_token (str, optional) -- A special token representing the end of a sequence or separating two different sentences in the same input. Defaults to "[SEP]".

  • mask_token (str, optional) -- A special token representing a masked token. Defaults to "[MASK]".

  • special_tokens_file (str, optional) -- The path of file that contains additional special tokens to be used by the tokenizer. Defaults to "".

property vocab_size

Returns the size of vocabulary.

Example

from paddlenlp.transformers import UnifiedTransformerTokenizer

tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
print(tokenizer.vocab_size)
# 30001
tokenize(text, is_split_into_words=True)[source]

Converts a string to a list of tokens.

Parameters
  • text (str) -- The text to be tokenized.

  • is_split_into_words (bool, optional) -- Whether or not the input text has been pretokenized. If False, the input text will be pretokenized by jieba firstly. Defaults to True.

Returns

A list of strings representing the converted tokens.

Return type

list[str]

Example

from paddlenlp.transformers import UnifiedTransformerTokenizer

tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
print(tokenizer.tokenize('欢迎使用百度飞桨!', is_split_into_words=False))
# ['▁欢迎', '▁使用', '▁百度', '▁飞', '桨', '▁!']
convert_tokens_to_string(tokens, keep_space=True)[source]

Converts a sequence of tokens (a list of strings) into a single string. Since SentencePiece introduces '▁' to mark subword boundaries, '▁' is also removed when converting.

Parameters
  • tokens (list[str]) -- A list of strings representing the tokens to be converted.

  • keep_space (bool, optional) -- Whether or not to keep the spaces between segments. Defaults to True.

Returns

The string converted from the tokens.

Return type

str

Example

from paddlenlp.transformers import UnifiedTransformerTokenizer

tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
print(tokenizer.convert_tokens_to_string(['▁欢迎', '▁使用', '▁百度', '▁飞', '桨', '▁!']))
# 欢迎 使用 百度 飞桨 !
print(tokenizer.convert_tokens_to_string(['▁欢迎', '▁使用', '▁百度', '▁飞', '桨', '▁!'], keep_space=False))
# 欢迎使用百度飞桨!
convert_ids_to_string(ids, keep_space=True)[source]

Converts a single index or a sequence of indices to a token or a sequence of tokens.

Parameters
  • ids (int|list[int]) -- The token id (or token ids) to be converted to token(s).

  • keep_space (bool, optional) -- Whether or not to keep the spaces between segments. Defaults to True.

Returns

The decoded token(s).

Return type

str|list[str]

Example

from paddlenlp.transformers import UnifiedTransformerTokenizer

tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
tokens = tokenizer.tokenize('我爱祖国', is_split_into_words=False)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
# [6, 121, 26907, 25475]

print(tokenizer.convert_ids_to_string(ids))
# 我 爱祖国
print(tokenizer.convert_ids_to_string(ids, keep_space=False))
# 我爱祖国
num_special_tokens_to_add(pair=False)[source]

Returns the number of added tokens when encoding a sequence with special tokens.

Parameters

pair (bool, optional) -- Whether the number of added tokens should be computed for a sequence pair or a single sequence. Defaults to False.

Returns

The number of special tokens added to sequences.

Return type

int
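One way to picture the count (a sketch, not the library's implementation): build inputs from empty sequences under an assumed BERT-style [CLS] ... [SEP] layout and measure the length. The ids 1 and 2 for "[CLS]" and "[SEP]" are hypothetical.

```python
CLS, SEP = 1, 2  # hypothetical ids for "[CLS]" and "[SEP]"

def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    # [CLS] A [SEP]           for a single sequence
    # [CLS] A [SEP] B [SEP]   for a pair
    out = [CLS] + token_ids_0 + [SEP]
    if token_ids_1 is not None:
        out += token_ids_1 + [SEP]
    return out

def num_special_tokens_to_add(pair=False):
    # Encoding empty sequences leaves only the special tokens.
    if pair:
        return len(build_inputs_with_special_tokens([], []))
    return len(build_inputs_with_special_tokens([]))

print(num_special_tokens_to_add())           # 2
print(num_special_tokens_to_add(pair=True))  # 3
```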

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens.

Should be overridden in a subclass if the model has a special way of building those.

Parameters
  • token_ids_0 (List[int]) -- List of IDs to which the special tokens will be added.

  • token_ids_1 (List[int], optional) -- Optional second list of IDs for sequence pairs.

Returns

List of input ids with the appropriate special tokens.

Return type

List[int]
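As a rough sketch (not the library's implementation), the single-sequence and pair layouts can be written as follows, assuming a BERT-style scheme; the ids 1 and 2 for "[CLS]" and "[SEP]" match the input_ids shown in the dialogue_encode example, which begins with 1 and ends with 2.

```python
def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    cls, sep = [1], [2]  # hypothetical ids for "[CLS]" and "[SEP]"
    if token_ids_1 is None:
        # [CLS] A [SEP]
        return cls + token_ids_0 + sep
    # [CLS] A [SEP] B [SEP]
    return cls + token_ids_0 + sep + token_ids_1 + sep

print(build_inputs_with_special_tokens([6, 121]))    # [1, 6, 121, 2]
print(build_inputs_with_special_tokens([6], [121]))  # [1, 6, 2, 121, 2]
```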

build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)[source]

Builds an offset map from a pair of offset maps by concatenating them and adding the offsets of special tokens.

Should be overridden in a subclass if the model has a special way of building those.

Parameters
  • offset_mapping_0 (List[tuple]) -- List of char offsets to which the special tokens will be added.

  • offset_mapping_1 (List[tuple], optional) -- Optional second list of char offsets for offset mapping pairs.

Returns

A list of char offsets with the appropriate offsets of special tokens.

Return type

List[tuple]
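A minimal sketch of the usual convention (an assumption, not the library code): special tokens correspond to no span in the original text, so they receive the placeholder offset (0, 0) at the positions where build_inputs_with_special_tokens inserts them.

```python
def build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None):
    # Mirror the [CLS] A [SEP] (B [SEP]) layout with (0, 0) placeholders.
    if offset_mapping_1 is None:
        return [(0, 0)] + offset_mapping_0 + [(0, 0)]
    return [(0, 0)] + offset_mapping_0 + [(0, 0)] + offset_mapping_1 + [(0, 0)]

print(build_offset_mapping_with_special_tokens([(0, 2), (2, 4)]))
# [(0, 0), (0, 2), (2, 4), (0, 0)]
```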

create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)[source]

Create a mask from the two sequences passed to be used in a sequence-pair classification task.

Should be overridden in a subclass if the model has a special way of building those.

If token_ids_1 is None, this method only returns the first portion of the mask (0s).

Parameters
  • token_ids_0 (List[int]) -- List of IDs.

  • token_ids_1 (List[int], optional) -- Optional second list of IDs for sequence pairs.

Returns

List of token_type_ids according to the given sequence(s).

Return type

List[int]
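A sketch of the BERT-style segment convention this description suggests (an assumption, not the library code): segment 0 covers [CLS] A [SEP], and segment 1 covers B [SEP] when a second sequence is given.

```python
def create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None):
    # Segment 0: [CLS] A [SEP]; segment 1: B [SEP].
    if token_ids_1 is None:
        return [0] * (len(token_ids_0) + 2)
    return [0] * (len(token_ids_0) + 2) + [1] * (len(token_ids_1) + 1)

print(create_token_type_ids_from_sequences([6, 121]))    # [0, 0, 0, 0]
print(create_token_type_ids_from_sequences([6], [121]))  # [0, 0, 0, 1, 1]
```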

get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]

Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer encode methods.

Parameters
  • token_ids_0 (List[int]) -- List of ids of the first sequence.

  • token_ids_1 (List[int], optional) -- List of ids of the second sequence.

  • already_has_special_tokens (bool, optional) -- Whether or not the token list is already formatted with special tokens for the model. Defaults to False.

Returns

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Return type

List[int]
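A sketch of the mask under the assumed [CLS] A [SEP] (B [SEP]) layout; the ids 1 and 2 for "[CLS]" and "[SEP]" are hypothetical, and this is not the library's implementation.

```python
def get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False):
    special_ids = {1, 2}  # hypothetical ids for "[CLS]" and "[SEP]"
    if already_has_special_tokens:
        # token_ids_0 is a fully formatted sequence; flag its special tokens.
        return [1 if t in special_ids else 0 for t in token_ids_0]
    if token_ids_1 is None:
        return [1] + [0] * len(token_ids_0) + [1]
    return [1] + [0] * len(token_ids_0) + [1] + [0] * len(token_ids_1) + [1]

print(get_special_tokens_mask([6, 121]))                                    # [1, 0, 0, 1]
print(get_special_tokens_mask([1, 6, 2], already_has_special_tokens=True))  # [1, 0, 1]
```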

save_resources(save_directory)[source]

Saves tokenizer-related resources to files under save_directory (using the file names in resource_files_names) by copying them directly. Override this method if necessary.

Parameters

save_directory (str) -- Directory to save files into.
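A minimal sketch of the copy-based behaviour described above. The resource_files mapping of registered file names to local paths is a hypothetical stand-in; the real method takes these paths from the tokenizer instance.

```python
import os
import shutil

def save_resources(resource_files, save_directory):
    # Copy each resource file (e.g. the vocab file and the
    # SentencePiece .spm model) into save_directory, keeping
    # its registered file name.
    os.makedirs(save_directory, exist_ok=True)
    for name, path in resource_files.items():
        shutil.copyfile(path, os.path.join(save_directory, name))
```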

static load_vocabulary(filepath, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)[source]

Instantiates a Vocab from a file, preserving all tokens, by using Vocab.from_dict. The file contains one token per line, and the line number is the index of the corresponding token.

Parameters
  • filepath (str) -- Path of the file used to construct the vocabulary.

  • unk_token (str) -- The special token for unknown tokens. May be None if not needed. Defaults to None.

  • pad_token (str) -- The special token for padding. May be None if not needed. Defaults to None.

  • bos_token (str) -- The special token for the beginning of a sequence. May be None if not needed. Defaults to None.

  • eos_token (str) -- The special token for the end of a sequence. May be None if not needed. Defaults to None.

  • **kwargs (dict) -- Keyword arguments for Vocab.from_dict.

Returns

An instance of Vocab.

Return type

Vocab
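The one-token-per-line format can be sketched in plain Python as follows; this builds a simple dict rather than a paddlenlp Vocab, and is only an illustration of the file layout.

```python
import os
import tempfile

def load_vocabulary(filepath):
    # One token per line; the line number is the token's index.
    token_to_idx = {}
    with open(filepath, encoding="utf-8") as f:
        for index, line in enumerate(f):
            token_to_idx[line.rstrip("\n")] = index
    return token_to_idx

# Build a tiny vocab file to demonstrate the format.
path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("[UNK]\n[PAD]\n[CLS]\n[SEP]\n")

vocab = load_vocabulary(path)
print(vocab["[CLS]"])  # 2
```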

dialogue_encode(history, response=None, knowledge=None, task_type=None, max_seq_len=512, max_response_len=128, max_knowledge_len=128, return_position_ids=True, return_token_type_ids=True, return_attention_mask=True, return_length=False, add_start_token_as_response=False, pad_to_max_seq_len=False, return_tensors=False, is_split_into_words=True)[source]

Main method to encode a single-turn or multi-turn dialogue conversation. It returns a dictionary containing the encoded sequence and other relevant information, which meets the input format requirements of the UnifiedTransformer model. See details at https://github.com/PaddlePaddle/Knover/tree/luge-dialogue/luge-dialogue

Parameters
  • history (str|list|tuple) -- The history of dialogue conversation. It is an utterance or list of utterances to be encoded. Each utterance is a string.

  • response (str, optional) -- The response of dialogue conversation. It should be set when training the model. It should not be set when running inference. Defaults to None.

  • knowledge (str, optional) -- The knowledge information of dialogue conversation. It should be set if the task_type is "knowledge" or "recommend". Defaults to None.

  • task_type (str, optional) -- The type of dialogue conversation. It is one of "chitchat", "knowledge" and "recommend", representing chitchat dialogue, knowledge grounded dialogue and conversational recommendation respectively. Defaults to None, which means no special token is added to the output sequence to identify the conversation type.

  • max_seq_len (int, optional) -- The maximum encoded sequence length. Defaults to 512.

  • max_response_len (int, optional) -- The maximum encoded sequence length of the input response. Defaults to 128.

  • max_knowledge_len (int, optional) -- The maximum encoded sequence length of the input knowledge. Defaults to 128.

  • return_position_ids (bool, optional) -- Whether to return the position_ids. Defaults to True.

  • return_token_type_ids (bool, optional) -- Whether to return the token_type_ids. Defaults to True.

  • return_attention_mask (bool, optional) -- Whether to return the attention_mask. Defaults to True.

  • return_length (bool, optional) -- Whether to return the length of the encoded sequence. Defaults to False.

  • add_start_token_as_response (bool, optional) -- Whether to add the special token "[CLS]" at the end of the sequence, as the beginning of the response, when running inference; this forces the model to start generating the response sequence. Defaults to False.

  • pad_to_max_seq_len (bool, optional) -- Whether to pad the returned sequences to the max_seq_len. Note that, in this method, returned sequences will be padded on the left. Defaults to False.

  • return_tensors (bool, optional) -- Whether to convert the returned sequences to Tensor. Defaults to False.

  • is_split_into_words (bool, optional) -- Whether or not the input text (history, response and knowledge) has been pretokenized. Defaults to True.

Returns

A dictionary containing the encoded sequence and other relevant information.

With the corresponding fields:

  • input_ids (list[int]|Tensor):

    A list of indices of input tokens to be fed to the UnifiedTransformer model. If return_tensors is True, it is a Tensor with shape [1, sequence_length] and data type 'int64'.

  • token_type_ids (list[int]|Tensor, optional):

    A list of segment token indices indicating whether the token belongs to the dialogue response. If return_tensors is True, it is a Tensor with shape [1, sequence_length] and data type 'int64'. Returned when return_token_type_ids is set to True.

  • position_ids (list[int]|Tensor, optional):

    A list of position indices. If return_tensors is True, it is a Tensor with shape [1, sequence_length] and data type 'int64'. Returned when return_position_ids is set to True.

  • attention_mask (numpy.ndarray|Tensor, optional):

    A numpy.ndarray used to prevent attention to some unwanted positions, with shape [sequence_length, sequence_length] and data type 'float32'. If return_tensors is True, it is a Tensor with shape [1, 1, sequence_length, sequence_length] and data type 'float32'. Returned when return_attention_mask is set to True.

  • seq_len (int, optional):

    The actual length of input_ids, excluding the pad token. Returned when return_length is set to True.

Return type

dict

Example

from paddlenlp.transformers import UnifiedTransformerTokenizer

tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')

inputs = tokenizer.dialogue_encode('我爱祖国')
for key in inputs:
    print(key + ':')
    print(inputs[key])
# input_ids: [1, 6, 25445, 26907, 25475, 2]
# token_type_ids: [0, 0, 0, 0, 0, 0]
# position_ids: [0, 1, 2, 3, 4, 5]
# attention_mask: [[0. 0. 0. 0. 0. 0.]
# [0. 0. 0. 0. 0. 0.]
# [0. 0. 0. 0. 0. 0.]
# [0. 0. 0. 0. 0. 0.]
# [0. 0. 0. 0. 0. 0.]
# [0. 0. 0. 0. 0. 0.]]
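The input_ids printed above can be related to the sequence layout with a schematic sketch (not the library's implementation). The ids 1 and 2 for "[CLS]" and "[SEP]" match the printed output; the placement of knowledge before history, and the trailing "[CLS]" added by add_start_token_as_response, are assumptions based on the parameter descriptions.

```python
CLS, SEP = 1, 2  # hypothetical ids for "[CLS]" and "[SEP]"

def layout(history_ids, response_ids=None, knowledge_ids=None,
           add_start_token_as_response=False):
    ids = [CLS]
    if knowledge_ids is not None:
        ids += knowledge_ids + [SEP]
    ids += history_ids + [SEP]
    if response_ids is not None:
        # Training: the response follows the history.
        ids += response_ids + [SEP]
    elif add_start_token_as_response:
        # Inference: a trailing [CLS] prompts response generation.
        ids += [CLS]
    return ids

print(layout([6, 25445, 26907, 25475]))
# [1, 6, 25445, 26907, 25475, 2]
print(layout([6, 25445, 26907, 25475], add_start_token_as_response=True))
# [1, 6, 25445, 26907, 25475, 2, 1]
```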