tokenizer
Tokenization class for UnifiedTransformer model.
class UnifiedTransformerTokenizer(vocab_file, sentencepiece_model_file, do_lower_case=False, unk_token='[UNK]', pad_token='[PAD]', cls_token='[CLS]', sep_token='[SEP]', mask_token='[MASK]', special_tokens_file='', **kwargs)

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

Constructs a UnifiedTransformer tokenizer based on SentencePiece.

This tokenizer inherits from PretrainedTokenizer, which contains most of the main methods. For more information regarding those methods, please refer to the superclass.

- Parameters
    vocab_file (str) – The path of the file used to construct the vocabulary.
    sentencepiece_model_file (str) – The SentencePiece model file (ends with '.spm') required to instantiate a SentencePiece tokenizer.
    do_lower_case (bool, optional) – Whether or not to lowercase the input when tokenizing. Defaults to False, which means the input is not lowercased.
    unk_token (str, optional) – A special token representing the unknown (out-of-vocabulary) token. An unknown token is set to unk_token in order to be converted to an ID. Defaults to "[UNK]".
    pad_token (str, optional) – A special token used to make arrays of tokens the same size for batching purposes. Defaults to "[PAD]".
    cls_token (str, optional) – A special token representing the beginning of a sequence. Defaults to "[CLS]".
    sep_token (str, optional) – A special token representing the end of a sequence or separating two different sentences in the same input. Defaults to "[SEP]".
    mask_token (str, optional) – A special token representing a masked token. Defaults to "[MASK]".
    special_tokens_file (str, optional) – The path of a file that contains additional special tokens to be used by the tokenizer. Defaults to "".
property vocab_size

Returns the size of the vocabulary.

Example

    from paddlenlp.transformers import UnifiedTransformerTokenizer

    tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
    print(tokenizer.vocab_size)
    # 30001
get_vocab()

Returns the vocabulary as a dictionary mapping token to index.

tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

- Returns
    The vocabulary.
- Return type
    Dict[str, int]
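The returned mapping agrees with convert_tokens_to_ids for in-vocab tokens; a minimal sketch using the same 'plato-mini' checkpoint as the other examples:

    from paddlenlp.transformers import UnifiedTransformerTokenizer

    tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
    vocab = tokenizer.get_vocab()
    # For an in-vocab token, the dictionary and convert_tokens_to_ids agree.
    token = tokenizer.cls_token  # '[CLS]'
    assert vocab[token] == tokenizer.convert_tokens_to_ids(token)
    print(len(vocab))  # should match tokenizer.vocab_size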
convert_tokens_to_string(tokens, keep_space=True)

Converts a sequence of tokens (a list of strings) into a single string. Since SentencePiece introduces '▁' to mark subword boundaries when concatenating subwords, it is also removed when converting.

- Parameters
    tokens (list[str]) – A list of strings representing the tokens to be converted.
    keep_space (bool, optional) – Whether or not to keep the word segmentation as spaces. Defaults to True.
- Returns
    The string converted from tokens.
- Return type
    str
Example

    from paddlenlp.transformers import UnifiedTransformerTokenizer

    tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
    print(tokenizer.convert_tokens_to_string(['▁欢迎', '▁使用', '▁百度', '▁飞', '桨', '▁!']))
    # 欢迎 使用 百度 飞桨 !
    print(tokenizer.convert_tokens_to_string(['▁欢迎', '▁使用', '▁百度', '▁飞', '桨', '▁!'], keep_space=False))
    # 欢迎使用百度飞桨!
convert_ids_to_string(ids, keep_space=True)

Converts a single index or a sequence of indices to a token or a sequence of tokens.

- Parameters
    ids (int|list[int]) – The token id (or token ids) to be converted to token(s).
    keep_space (bool, optional) – Whether or not to keep the word segmentation as spaces. Defaults to True.
- Returns
    The decoded token(s).
- Return type
    str|list[str]

Example

    from paddlenlp.transformers import UnifiedTransformerTokenizer

    tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
    tokens = tokenizer.tokenize('我爱祖国', is_split_into_words=False)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(ids)
    # [6, 121, 26907, 25475]
    print(tokenizer.convert_ids_to_string(ids))
    # 我 爱祖国
    print(tokenizer.convert_ids_to_string(ids, keep_space=False))
    # 我爱祖国
num_special_tokens_to_add(pair=False)

Returns the number of tokens added when encoding a sequence with special tokens.

- Parameters
    pair (bool, optional) – Whether the number of added tokens should be computed for a sequence pair or a single sequence. Defaults to False.
- Returns
    The number of special tokens added to sequences.
- Return type
    int
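A minimal sketch comparing the single-sequence and pair counts; the exact numbers depend on the checkpoint's special-token scheme, so they are printed rather than asserted:

    from paddlenlp.transformers import UnifiedTransformerTokenizer

    tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
    # Special tokens wrapped around one sequence vs. a pair of sequences.
    print(tokenizer.num_special_tokens_to_add(pair=False))
    print(tokenizer.num_special_tokens_to_add(pair=True))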
build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens.

This implementation does not add special tokens, and this method should be overridden in a subclass.

- Parameters
    token_ids_0 (List[int]) – The first tokenized sequence.
    token_ids_1 (List[int], optional) – The second tokenized sequence.
- Returns
    The model input with special tokens.
- Return type
    List[int]
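A minimal sketch of the call shape; per the note above, this implementation inserts no special tokens, so the printed ids are just the input ids:

    from paddlenlp.transformers import UnifiedTransformerTokenizer

    tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
    ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize('我爱祖国'))
    # Per the docstring above, no special tokens are inserted here.
    print(tokenizer.build_inputs_with_special_tokens(ids))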
build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)

Build an offset map from a pair of offset maps by concatenating and adding the offsets of special tokens.

Should be overridden in a subclass if the model has a special way of building those.

- Parameters
    offset_mapping_0 (List[tuple]) – List of char offsets to which the special tokens will be added.
    offset_mapping_1 (List[tuple], optional) – Optional second list of char offsets for offset mapping pairs.
- Returns
    A list of char offsets with the appropriate offsets of special tokens.
- Return type
    List[tuple]
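A minimal sketch with hand-written, purely hypothetical character offsets, just to show the call shape:

    from paddlenlp.transformers import UnifiedTransformerTokenizer

    tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
    # Hypothetical (start, end) char offsets for two short tokenized texts.
    offsets_a = [(0, 2), (2, 4)]
    offsets_b = [(0, 3)]
    print(tokenizer.build_offset_mapping_with_special_tokens(offsets_a, offsets_b))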
create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)

Create the token type IDs corresponding to the sequences passed.

Should be overridden in a subclass if the model has a special way of building those.

- Parameters
    token_ids_0 (List[int]) – The first tokenized sequence.
    token_ids_1 (List[int], optional) – The second tokenized sequence.
- Returns
    The token type ids.
- Return type
    List[int]
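A minimal sketch for a single sequence and a pair; the returned ids indicate which segment each position belongs to:

    from paddlenlp.transformers import UnifiedTransformerTokenizer

    tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
    ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize('我爱祖国'))
    ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize('欢迎使用'))
    print(tokenizer.create_token_type_ids_from_sequences(ids_a))
    print(tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b))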
get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)

Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer's encode methods.

- Parameters
    token_ids_0 (List[int]) – List of ids of the first sequence.
    token_ids_1 (List[int], optional) – List of ids of the second sequence.
    already_has_special_tokens (bool, optional) – Whether or not the token list is already formatted with special tokens for the model. Defaults to False.
- Returns
    A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
- Return type
    List[int]
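A minimal sketch; with already_has_special_tokens=True the mask flags special tokens already present in the ids (here a hand-built [CLS] ... [SEP] wrapping):

    from paddlenlp.transformers import UnifiedTransformerTokenizer

    tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
    ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize('我爱祖国'))
    ids_with_special = [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]
    # 1 marks a special token, 0 marks a sequence token.
    print(tokenizer.get_special_tokens_mask(ids_with_special,
                                            already_has_special_tokens=True))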
save_resources(save_directory)

Save tokenizer-related resources to the files indicated by resource_files_names under save_directory by copying them directly. Override it if necessary.

- Parameters
    save_directory (str) – Directory to save files into.
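A minimal sketch of copying the tokenizer's resource files (the directory name is arbitrary):

    import os

    from paddlenlp.transformers import UnifiedTransformerTokenizer

    tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
    save_dir = './plato_mini_tokenizer'  # arbitrary output directory
    os.makedirs(save_dir, exist_ok=True)
    # Copies the vocab and SentencePiece model files into save_dir.
    tokenizer.save_resources(save_dir)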
static load_vocabulary(filepath, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)

Instantiate an instance of Vocab from a file, reserving all tokens by using Vocab.from_dict. The file contains one token per line, and the line number is the index of the corresponding token.

- Parameters
    filepath (str) – Path of the file used to construct the vocabulary.
    unk_token (str) – Special token for the unknown token. Can be None if not needed. Defaults to None.
    pad_token (str) – Special token for the padding token. Can be None if not needed. Defaults to None.
    bos_token (str) – Special token for the bos token. Can be None if not needed. Defaults to None.
    eos_token (str) – Special token for the eos token. Can be None if not needed. Defaults to None.
    **kwargs (dict) – Keyword arguments for Vocab.from_dict.
- Returns
    An instance of Vocab.
- Return type
    Vocab
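A minimal sketch; 'vocab.txt' is a hypothetical one-token-per-line file whose line numbers become the token indices:

    from paddlenlp.transformers import UnifiedTransformerTokenizer

    # 'vocab.txt' is a hypothetical file with one token per line.
    vocab = UnifiedTransformerTokenizer.load_vocabulary('vocab.txt',
                                                        unk_token='[UNK]')
    print(vocab.to_indices('[UNK]'))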
dialogue_encode(history, response=None, knowledge=None, task_type=None, max_seq_len=512, max_response_len=128, max_knowledge_len=128, return_position_ids=True, return_token_type_ids=True, return_role_ids=False, return_attention_mask=True, return_length=False, add_start_token_as_response=False, pad_to_max_seq_len=False, return_tensors=False, is_split_into_words=True, position_style='continuous')

Main method to encode a single-turn or multi-turn dialogue conversation. It returns a dictionary containing the encoded sequence and other relative information, which meets the input format requirements of the UnifiedTransformer model. See details at https://github.com/PaddlePaddle/Knover/tree/luge-dialogue/luge-dialogue
- Parameters
    history (str|list|tuple) – The history of the dialogue conversation. It is an utterance or a list of utterances to be encoded. Each utterance is a string.
    response (str, optional) – The response of the dialogue conversation. It should be set when training the model and should not be set when running inference. Defaults to None.
    knowledge (str, optional) – The knowledge information of the dialogue conversation. It should be set if the task_type is "knowledge" or "recommend". Defaults to None.
    task_type (str, optional) – The type of the dialogue conversation. It is one of "chitchat", "knowledge" and "recommend", which represent chitchat dialogue, knowledge grounded dialogue and conversational recommendation respectively. Defaults to None, which means no special token is added to the output sequence for identifying different conversation types.
    max_seq_len (int, optional) – The maximum encoded sequence length. Defaults to 512.
    max_response_len (int, optional) – The maximum encoded sequence length of the input response. Defaults to 128.
    max_knowledge_len (int, optional) – The maximum encoded sequence length of the input knowledge. Defaults to 128.
    return_position_ids (bool, optional) – Whether to return the position_ids. Defaults to True.
    return_token_type_ids (bool, optional) – Whether to return the token_type_ids. Defaults to True.
    return_role_ids (bool, optional) – Whether to return the role_ids. Defaults to False.
    return_attention_mask (bool, optional) – Whether to return the attention_mask. Defaults to True.
    return_length (bool, optional) – Whether to return the length of the encoded sequence. Defaults to False.
    add_start_token_as_response (bool, optional) – Whether to add the special token "[CLS]" at the end of the sequence as the beginning of the response when running inference, to force the model to start generating the response sequence. Defaults to False.
    pad_to_max_seq_len (bool, optional) – Whether to pad the returned sequences to max_seq_len. Note that, in this method, returned sequences will be padded on the left. Defaults to False.
    return_tensors (bool, optional) – Whether to convert the returned sequences to Tensor. Defaults to False.
    is_split_into_words (bool, optional) – Whether or not the input text (history, response and knowledge) has been pretokenized. Defaults to True.
    position_style (str, optional) – Specifies the positional style involved, which must be one of ["relative", "continuous"]. Defaults to "continuous", which means positions start from 0 up to the maximum length of the history.
- Returns
    A dictionary containing the encoded sequence and other relative information, with the corresponding fields:

    - input_ids (list[int]|Tensor): A list of indices of input tokens to be fed to the UnifiedTransformer model. If return_tensors is True, it is a Tensor with shape [1, sequence_length] and data type 'int64'.
    - role_ids (list[int]|Tensor, optional): A list of role indices. If return_tensors is True, it is a Tensor with shape [1, sequence_length] and data type 'int64'. Returned when return_role_ids is set to True.
    - token_type_ids (list[int]|Tensor, optional): A list of segment token indices indicating whether each token belongs to the dialogue response. If return_tensors is True, it is a Tensor with shape [1, sequence_length] and data type 'int64'. Returned when return_token_type_ids is set to True.
    - position_ids (list[int]|Tensor, optional): A list of position indices. If return_tensors is True, it is a Tensor with shape [1, sequence_length] and data type 'int64'. Returned when return_position_ids is set to True.
    - attention_mask (numpy.ndarray|Tensor, optional): A numpy.ndarray that prevents attention to some unwanted positions, with shape [sequence_length, sequence_length] and data type 'float32'. If return_tensors is True, it is a Tensor with shape [1, 1, sequence_length, sequence_length] and data type 'float32'. Returned when return_attention_mask is set to True.
    - seq_len (int, optional): The actual length of input_ids, excluding pad tokens. Returned when return_length is set to True.
- Return type
    dict
Example

    from paddlenlp.transformers import UnifiedTransformerTokenizer

    tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
    inputs = tokenizer.dialogue_encode('我爱祖国')
    for key in inputs:
        print(key + ':')
        print(inputs[key])
    # input_ids: [1, 6, 25445, 26907, 25475, 2]
    # token_type_ids: [0, 0, 0, 0, 0, 0]
    # position_ids: [0, 1, 2, 3, 4, 5]
    # attention_mask: [[0. 0. 0. 0. 0. 0.]
    #  [0. 0. 0. 0. 0. 0.]
    #  [0. 0. 0. 0. 0. 0.]
    #  [0. 0. 0. 0. 0. 0.]
    #  [0. 0. 0. 0. 0. 0.]
    #  [0. 0. 0. 0. 0. 0.]]
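For inference, a sketch (same checkpoint assumed) that encodes a multi-turn history, appends "[CLS]" to start response generation, and returns Tensors, per the add_start_token_as_response and return_tensors parameters above:

    from paddlenlp.transformers import UnifiedTransformerTokenizer

    tokenizer = UnifiedTransformerTokenizer.from_pretrained('plato-mini')
    inputs = tokenizer.dialogue_encode(
        ['你好', '吃饭了吗'],  # multi-turn history; no response at inference
        add_start_token_as_response=True,
        return_tensors=True)
    print(inputs['input_ids'].shape)  # [1, sequence_length]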