tokenizer#

Tokenization class for UnifiedTransformer model.

class UnifiedTransformerTokenizer(vocab_file, sentencepiece_model_file, do_lower_case=False, unk_token='[UNK]', pad_token='[PAD]', cls_token='[CLS]', sep_token='[SEP]', mask_token='[MASK]', special_tokens_file='', **kwargs)[source]#

Bases: PretrainedTokenizer

Constructs a UnifiedTransformer tokenizer based on SentencePiece.

This tokenizer inherits from PretrainedTokenizer which contains most of the main methods. For more information regarding those methods, please refer to this superclass.

Parameters:
  • vocab_file (str) – The path of file to construct vocabulary.

  • sentencepiece_model_file (str) – The sentencepiece model file (ends with ‘.spm’) required to instantiate a SentencePiece.

  • do_lower_case (bool, optional) – Whether or not to lowercase the input when tokenizing. Defaults to False and does not lowercase the input.

  • unk_token (str, optional) – A special token representing an unknown (out-of-vocabulary) token. Tokens not found in the vocabulary are set to unk_token in order to be converted to an ID. Defaults to “[UNK]”.

  • pad_token (str, optional) – A special token used to make arrays of tokens the same size for batching purposes. Defaults to “[PAD]”.

  • cls_token (str, optional) – A special token representing the beginning of a sequence. Defaults to “[CLS]”.

  • sep_token (str, optional) – A special token representing the end of a sequence or separating two different sentences in the same input. Defaults to “[SEP]”.

  • mask_token (str, optional) – A special token representing a masked token. Defaults to “[MASK]”.

  • special_tokens_file (str, optional) – The path of file that contains additional special tokens to be used by the tokenizer. Defaults to “”.
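
Example

A minimal construction sketch; 'vocab.txt' and 'spm.model' below are hypothetical local file paths used only for illustration.

from paddlenlp.transformers import UnifiedTransformerTokenizer

# Build a tokenizer directly from local vocabulary and SentencePiece files.
tokenizer = UnifiedTransformerTokenizer(
    vocab_file='vocab.txt',
    sentencepiece_model_file='spm.model')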

property vocab_size#

Returns the size of the vocabulary.

Example

from paddlenlp.transformers import UnifiedTransformerTokenizer

tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
print(tokenizer.vocab_size)
# 30001
get_vocab()[source]#

Returns the vocabulary as a dictionary of token to index.

tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

Returns:

The vocabulary.

Return type:

Dict[str, int]
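
Example

A short sketch of the equivalence noted above, using the same 'plato-mini' checkpoint as the other examples on this page.

from paddlenlp.transformers import UnifiedTransformerTokenizer

tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
vocab = tokenizer.get_vocab()
# For an in-vocab token, get_vocab()[token] equals convert_tokens_to_ids(token).
token = '[CLS]'
assert vocab[token] == tokenizer.convert_tokens_to_ids(token)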

convert_tokens_to_string(tokens, keep_space=True)[source]#

Converts a sequence of tokens (list of strings) into a single string. Since SentencePiece introduces ▁ to mark subword concatenation, the ▁ is also removed when converting.

Parameters:
  • tokens (list[str]) – A list of strings representing the tokens to be converted.

  • keep_space (bool, optional) – Whether or not to keep spaces between word segments in the output string. Defaults to True.

Returns:

Converted string from tokens.

Return type:

str

Example

from paddlenlp.transformers import UnifiedTransformerTokenizer

tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
print(tokenizer.convert_tokens_to_string(['▁欢迎', '▁使用', '▁百度', '▁飞', '桨', '▁!']))
# 欢迎 使用 百度 飞桨 !
print(tokenizer.convert_tokens_to_string(['▁欢迎', '▁使用', '▁百度', '▁飞', '桨', '▁!'], keep_space=False))
# 欢迎使用百度飞桨!
convert_ids_to_string(ids, keep_space=True)[source]#

Converts a single index to its corresponding token, or a sequence of indices to the decoded string.

Parameters:
  • ids (int|list[int]) – The token id (or token ids) to be converted to token(s).

  • keep_space (bool, optional) – Whether or not to keep the segmentation with space. Defaults to True.

Returns:

The decoded token(s).

Return type:

str|list[str]

Example

from paddlenlp.transformers import UnifiedTransformerTokenizer

tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
tokens = tokenizer.tokenize('我爱祖国', is_split_into_words=False)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
# [6, 121, 26907, 25475]

print(tokenizer.convert_ids_to_string(ids))
# 我 爱祖国
print(tokenizer.convert_ids_to_string(ids, keep_space=False))
# 我爱祖国
num_special_tokens_to_add(pair=False)[source]#

Returns the number of added tokens when encoding a sequence with special tokens.

Parameters:

pair (bool, optional) – Whether the number of added tokens should be computed in the case of a sequence pair or a single sequence. Defaults to False.

Returns:

Number of special tokens added to sequences.

Return type:

int
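
Example

A minimal sketch; the exact counts depend on the checkpoint's special-token scheme, so no outputs are shown.

from paddlenlp.transformers import UnifiedTransformerTokenizer

tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
# Special tokens added when encoding a single sequence.
print(tokenizer.num_special_tokens_to_add())
# Special tokens added when encoding a sequence pair.
print(tokenizer.num_special_tokens_to_add(pair=True))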

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]#

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens.

This implementation does not add special tokens and this method should be overridden in a subclass.

Parameters:
  • token_ids_0 (List[int]) – The first tokenized sequence.

  • token_ids_1 (List[int], optional) – The second tokenized sequence.

Returns:

The model input with special tokens.

Return type:

List[int]
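
Example

A sketch of the behavior documented above: since this implementation adds no special tokens, the ids (reused from the convert_ids_to_string example) should pass through concatenated.

from paddlenlp.transformers import UnifiedTransformerTokenizer

tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
ids_0 = [6, 121]
ids_1 = [26907, 25475]
print(tokenizer.build_inputs_with_special_tokens(ids_0, ids_1))
# Expected, given the note above: [6, 121, 26907, 25475]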

build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)[source]#

Build an offset map from a pair of offset maps by concatenating them and adding the offsets of special tokens.

Should be overridden in a subclass if the model has a special way of building those.

Parameters:
  • offset_mapping_0 (List[tuple]) – List of char offsets to which the special tokens will be added.

  • offset_mapping_1 (List[tuple], optional) – Optional second list of char offsets for offset mapping pairs.

Returns:

List of char offsets with the appropriate offsets of special tokens.

Return type:

List[tuple]
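
Example

A minimal call sketch; the character offsets are made up for illustration, and the result depends on how the subclass inserts special tokens, so no output is asserted.

from paddlenlp.transformers import UnifiedTransformerTokenizer

tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
offsets = [(0, 2), (2, 4)]  # hypothetical char spans of two tokens
print(tokenizer.build_offset_mapping_with_special_tokens(offsets))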

create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)[source]#

Create the token type IDs corresponding to the sequences passed.

Should be overridden in a subclass if the model has a special way of building those.

Parameters:
  • token_ids_0 (List[int]) – The first tokenized sequence.

  • token_ids_1 (List[int], optional) – The second tokenized sequence.

Returns:

The token type ids.

Return type:

List[int]
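
Example

A minimal call sketch using the ids from the convert_ids_to_string example; how segments are assigned depends on the subclass, so no output is asserted.

from paddlenlp.transformers import UnifiedTransformerTokenizer

tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
ids_0 = [6, 121]
ids_1 = [26907, 25475]
print(tokenizer.create_token_type_ids_from_sequences(ids_0, ids_1))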

get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]#

Retrieves a mask of special tokens from a token list. This method is called when adding special tokens using the tokenizer encode methods.

Parameters:
  • token_ids_0 (List[int]) – List of ids of the first sequence.

  • token_ids_1 (List[int], optional) – List of ids of the second sequence.

  • already_has_special_tokens (bool, optional) – Whether or not the token list is already formatted with special tokens for the model. Defaults to False.

Returns:

The list of integers in the range [0, 1]:

1 for a special token, 0 for a sequence token.

Return type:

List[int]
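
Example

A sketch of masking an already-encoded sequence; the ids are taken from the dialogue_encode example further below, where 1 and 2 are the ids of special tokens.

from paddlenlp.transformers import UnifiedTransformerTokenizer

tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
encoded = [1, 6, 25445, 26907, 25475, 2]
print(tokenizer.get_special_tokens_mask(encoded, already_has_special_tokens=True))
# A mask like [1, 0, 0, 0, 0, 1]: 1 marks special tokens, 0 marks sequence tokens.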

save_resources(save_directory)[source]#

Saves tokenizer-related resources to the files named in resource_files_names under save_directory by copying them directly. Override this method if necessary.

Parameters:

save_directory (str) – Directory to save files into.
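
Example

A minimal sketch that copies the vocabulary and SentencePiece model files into a temporary directory.

import os
import tempfile

from paddlenlp.transformers import UnifiedTransformerTokenizer

tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
save_dir = tempfile.mkdtemp()
tokenizer.save_resources(save_dir)
# The copied resource files should now appear under save_dir.
print(os.listdir(save_dir))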

static load_vocabulary(filepath, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)[source]#

Instantiates an instance of Vocab from a file, reserving all tokens by using Vocab.from_dict. The file contains one token per line, and the line number is the index of the corresponding token.

Parameters:
  • filepath (str) – The path of the file used to construct the vocabulary.

  • unk_token (str) – The special token for unknown tokens. Can be None if not needed. Defaults to None.

  • pad_token (str) – The special token for padding. Can be None if not needed. Defaults to None.

  • bos_token (str) – The special token for the beginning of a sequence. Can be None if not needed. Defaults to None.

  • eos_token (str) – The special token for the end of a sequence. Can be None if not needed. Defaults to None.

  • **kwargs (dict) – Keyword arguments for Vocab.from_dict.

Returns:

An instance of Vocab.

Return type:

Vocab
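
Example

A minimal sketch; 'vocab.txt' is a hypothetical file with one token per line, so line numbers become indices.

from paddlenlp.transformers import UnifiedTransformerTokenizer

vocab = UnifiedTransformerTokenizer.load_vocabulary(
    'vocab.txt', unk_token='[UNK]', pad_token='[PAD]')
# The token on the first line maps to index 0.
print(len(vocab))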

dialogue_encode(history, response=None, knowledge=None, task_type=None, max_seq_len=512, max_response_len=128, max_knowledge_len=128, return_position_ids=True, return_token_type_ids=True, return_role_ids=False, return_attention_mask=True, return_length=False, add_start_token_as_response=False, pad_to_max_seq_len=False, return_tensors=False, is_split_into_words=True, position_style='continuous')[source]#

Main method to encode a single-turn or multi-turn dialogue conversation. It returns a dictionary containing the encoded sequence and other related information that meets the input format requirements of the UnifiedTransformer model. See details at PaddlePaddle/Knover.

Parameters:
  • history (str|list|tuple) – The history of dialogue conversation. It is an utterance or list of utterances to be encoded. Each utterance is a string.

  • response (str, optional) – The response of dialogue conversation. It should be set when training the model. It should not be set when running inference. Defaults to None.

  • knowledge (str, optional) – The knowledge information of dialogue conversation. It should be set if the task_type is “knowledge” or “recommend”. Defaults to None.

  • task_type (str, optional) – The type of dialogue conversation. It is one of “chitchat”, “knowledge” and “recommend”, representing chitchat dialogue, knowledge grounded dialogue and conversational recommendation respectively. Defaults to None, which means no special token is added to the output sequence to identify the conversation type.

  • max_seq_len (int, optional) – The maximum encoded sequence length. Defaults to 512.

  • max_response_len (int, optional) – The maximum encoded sequence length of the input response. Defaults to 128.

  • max_knowledge_len (int, optional) – The maximum encoded sequence length of the input knowledge. Defaults to 128.

  • return_position_ids (bool, optional) – Whether to return the position_ids. Defaults to True.

  • return_token_type_ids (bool, optional) – Whether to return the token_type_ids. Defaults to True.

  • return_role_ids (bool, optional) – Whether to return the role_ids. Defaults to False.

  • return_attention_mask (bool, optional) – Whether to return the attention_mask. Defaults to True.

  • return_length (bool, optional) – Whether to return the length of the encoded sequence. Defaults to False.

  • add_start_token_as_response (bool, optional) – Whether to add the special token “[CLS]” at the end of the sequence as the beginning of the response when running inference, forcing the model to start generating the response sequence. Defaults to False.

  • pad_to_max_seq_len (bool, optional) – Whether to pad the returned sequences to the max_seq_len. Note that, in this method, returned sequences will be padded on the left. Defaults to False.

  • return_tensors (bool, optional) – Whether to convert the returned sequences to Tensor. Defaults to False.

  • is_split_into_words (bool, optional) – Whether or not the input text (history, response and knowledge) has been pretokenized. Defaults to True.

  • position_style (str, optional) – The positional style to use, which must be one of “relative” and “continuous”. Defaults to “continuous”, which means position ids start from 0 and increase up to the maximum length of the history.

Returns:

A dictionary containing the encoded sequence and other related information.

With the corresponding fields:

  • input_ids (list[int]|Tensor):

    A list of indices of input tokens to be fed to the UnifiedTransformer model. If return_tensors is True, it is a Tensor with shape [1, sequence_length] and data type ‘int64’.

  • role_ids (list[int]|Tensor, optional):

    A list of role indices. If return_tensors is True, it is a Tensor with shape [1, sequence_length] and data type ‘int64’. Being returned when return_role_ids is set to True.

  • token_type_ids (list[int]|Tensor, optional):

    A list of segment token indices indicating whether each token belongs to the dialogue response. If return_tensors is True, it is a Tensor with shape [1, sequence_length] and data type ‘int64’. Being returned when return_token_type_ids is set to True.

  • position_ids (list[int]|Tensor, optional):

    A list of position indices. If return_tensors is True, it is a Tensor with shape [1, sequence_length] and data type ‘int64’. Being returned when return_position_ids is set to True.

  • attention_mask (numpy.ndarray|Tensor, optional):

    A numpy.ndarray that prevents attention to some unwanted positions, with shape [sequence_length, sequence_length] and data type ‘float32’. If return_tensors is True, it is a Tensor with shape [1, 1, sequence_length, sequence_length] and data type ‘float32’. Being returned when return_attention_mask is set to True.

  • seq_len (int, optional):

    The actual length of the input_ids, excluding the pad token. Being returned when return_length is set to True.

Return type:

dict

Example

from paddlenlp.transformers import UnifiedTransformerTokenizer

tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')

inputs = tokenizer.dialogue_encode('我爱祖国')
for key in inputs:
    print(key + ':')
    print(inputs[key])
# input_ids: [1, 6, 25445, 26907, 25475, 2]
# token_type_ids: [0, 0, 0, 0, 0, 0]
# position_ids: [0, 1, 2, 3, 4, 5]
# attention_mask: [[0. 0. 0. 0. 0. 0.]
# [0. 0. 0. 0. 0. 0.]
# [0. 0. 0. 0. 0. 0.]
# [0. 0. 0. 0. 0. 0.]
# [0. 0. 0. 0. 0. 0.]
# [0. 0. 0. 0. 0. 0.]]
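
Continuing the example above, a sketch for inference: passing a multi-turn history and setting add_start_token_as_response=True appends “[CLS]” so the model is forced to start generating the response. The utterances are placeholders.

inputs = tokenizer.dialogue_encode(
    ['你好', '你好,很高兴认识你'],
    add_start_token_as_response=True,
    return_tensors=True)
print(inputs.keys())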