tokenizer¶
-
class MBartTokenizer(vocab_file, src_lang=None, tgt_lang=None, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', sp_model_kwargs=None, additional_special_tokens=None, **kwargs)[source]¶
Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
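A minimal loading and encoding sketch (the checkpoint name mbart-large-cc25, the language code en_XX, and the example sentence are assumptions; from_pretrained is inherited from PretrainedTokenizer):

from paddlenlp.transformers import MBartTokenizer

# Load a pretrained tokenizer; the checkpoint name is an assumed example.
tokenizer = MBartTokenizer.from_pretrained("mbart-large-cc25", src_lang="en_XX")

# Encode a sentence; special tokens are appended as described under
# build_inputs_with_special_tokens below.
encoded = tokenizer("PaddleNLP provides an MBART tokenizer.")
print(encoded["input_ids"])

The method sketches below reuse this tokenizer instance.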
-
save_resources(save_directory)[source]¶
Save tokenizer-related resources to the files named by resource_files_names under save_directory by copying them directly. Override it if necessary.
- Parameters
save_directory (str) – Directory to save files into.
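For example, continuing the loading sketch above (the directory name is arbitrary):

import os

save_dir = "./mbart_tokenizer"
os.makedirs(save_dir, exist_ok=True)
# Copies the tokenizer's resource files (e.g. the sentencepiece vocab) into save_dir.
tokenizer.save_resources(save_dir)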
-
property vocab_size¶
Returns the size of the vocabulary.
- Returns
The sum of the vocabulary size and the number of special tokens.
- Return type
int
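A quick look, continuing the sketch above:

# Counts the base vocabulary plus the special tokens.
print(tokenizer.vocab_size)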
-
get_vocab()[source]¶
Returns the vocabulary as a dictionary of token to index.
tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.
- Returns
The vocabulary.
- Return type
Dict[str, int]
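A small sanity check of the documented equivalence, continuing the sketch above:

vocab = tokenizer.get_vocab()
token = next(iter(vocab))  # any token that is in the vocabulary
assert vocab[token] == tokenizer.convert_tokens_to_ids(token)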
-
convert_tokens_to_string(tokens)[source]¶
Converts a sequence of tokens (strings for sub-words) into a single string.
-
convert_ids_to_string(ids)[source]¶
Converts a sequence of ids into a single string.
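A round-trip sketch covering both conversion helpers, continuing the example above (the sentence is a placeholder):

tokens = tokenizer.tokenize("Hello world")
text_from_tokens = tokenizer.convert_tokens_to_string(tokens)

ids = tokenizer.convert_tokens_to_ids(tokens)
text_from_ids = tokenizer.convert_ids_to_string(ids)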
-
get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]¶
Retrieves a special tokens mask from a token list that has no special tokens added.
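A minimal sketch, continuing the example above; treating the result as a 0/1 mask where 1 marks a special-token position is an assumption based on the usual convention for this method:

ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello world"))
# Mask over the sequence that build_inputs_with_special_tokens would produce.
mask = tokenizer.get_special_tokens_mask(ids)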
-
build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]¶
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An MBART sequence has the following format, where X represents the sequence:
- input_ids (for encoder): X [eos, src_lang_code]
- decoder_input_ids (for decoder): X [eos, tgt_lang_code]
BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.
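The encoder layout can be inspected directly, continuing the sketch above (the sentence is a placeholder):

ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello world"))
input_ids = tokenizer.build_inputs_with_special_tokens(ids)
# Per the format above, the last two positions hold the eos id and the
# source language code id.
print(input_ids[-2:])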
-
build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)[source]¶
Build an offset map from a pair of offset maps by concatenating them and adding the offsets of special tokens.
Should be overridden in a subclass if the model has a special way of building those.
- Parameters
offset_mapping_0 (List[tuple]) – List of char offsets to which the special tokens will be added.
offset_mapping_1 (List[tuple], optional) – Optional second list of char offsets for offset mapping pairs.
- Returns
List of char offsets with the appropriate offsets of special tokens.
- Return type
List[tuple]
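A small sketch, continuing the example above; the char offsets correspond to the two words of "Hello world":

# Offsets for the appended special tokens are added by the method.
offsets = [(0, 5), (6, 11)]
full_offsets = tokenizer.build_offset_mapping_with_special_tokens(offsets)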
-
-
class MBart50Tokenizer(vocab_file, src_lang=None, tgt_lang=None, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', sp_model_kwargs=None, additional_special_tokens=None, **kwargs)[source]¶
Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
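A minimal loading sketch (the checkpoint name mbart-large-50-many-to-many-mmt and the language codes are assumptions):

from paddlenlp.transformers import MBart50Tokenizer

# Load a pretrained MBART-50 tokenizer; the checkpoint name is an assumed example.
tokenizer50 = MBart50Tokenizer.from_pretrained(
    "mbart-large-50-many-to-many-mmt", src_lang="en_XX", tgt_lang="ro_RO")

The sketch under build_inputs_with_special_tokens below reuses this tokenizer50 instance.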
-
save_resources(save_directory)[source]¶
Save tokenizer-related resources to the files named by resource_files_names under save_directory by copying them directly. Override it if necessary.
- Parameters
save_directory (str) – Directory to save files into.
-
get_vocab()[source]¶
Returns the vocabulary as a dictionary of token to index.
tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.
- Returns
The vocabulary.
- Return type
Dict[str, int]
-
property vocab_size¶
Returns the size of the vocabulary.
- Returns
The sum of the vocabulary size and the number of special tokens.
- Return type
int
-
convert_tokens_to_string(tokens)[source]¶
Converts a sequence of tokens (strings for sub-words) into a single string.
-
convert_ids_to_string(ids)[source]¶
Converts a sequence of ids into a single string.
-
get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]¶
Retrieves a special tokens mask from a token list that has no special tokens added.
-
build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]¶
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An MBART50 sequence has the following format, where X represents the sequence:
- input_ids (for encoder): [src_lang_code] X [eos]
- labels (for decoder): [tgt_lang_code] X [eos]
BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.
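In contrast to MBartTokenizer, the language code is prefixed rather than appended. Continuing the MBART-50 loading sketch above (the sentence is a placeholder):

ids = tokenizer50.convert_tokens_to_ids(tokenizer50.tokenize("Hello world"))
input_ids = tokenizer50.build_inputs_with_special_tokens(ids)
# Per the format above, input_ids[0] is the source language code id and
# input_ids[-1] is the eos id.
print(input_ids[0], input_ids[-1])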
-
build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)[source]¶
Build an offset map from a pair of offset maps by concatenating them and adding the offsets of special tokens.
Should be overridden in a subclass if the model has a special way of building those.
- Parameters
offset_mapping_0 (List[tuple]) – List of char offsets to which the special tokens will be added.
offset_mapping_1 (List[tuple], optional) – Optional second list of char offsets for offset mapping pairs.
- Returns
List of char offsets with the appropriate offsets of special tokens.
- Return type
List[tuple]
-