tokenizer#
- class MBartTokenizer(vocab_file, src_lang=None, tgt_lang=None, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', sp_model_kwargs=None, additional_special_tokens=None, **kwargs)[source]#
- save_resources(save_directory)[source]#
Save tokenizer-related resources, as named in resource_files_names, to files under save_directory by copying them directly. Override it if necessary.
- Parameters:
save_directory (str) -- Directory to save files into.
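The copying behavior can be sketched in plain Python. This is an illustration of the contract, not the library's implementation; the function name and the stand-in model file are hypothetical:

```python
import os
import shutil
import tempfile

def save_resources_sketch(resource_files, save_directory):
    """Copy each tokenizer resource file into save_directory."""
    os.makedirs(save_directory, exist_ok=True)
    for src in resource_files:
        dst = os.path.join(save_directory, os.path.basename(src))
        # Skip the copy when source and destination are the same file.
        if os.path.abspath(src) != os.path.abspath(dst):
            shutil.copyfile(src, dst)
    return save_directory

# Demo with a throwaway file standing in for sentencepiece.bpe.model.
with tempfile.TemporaryDirectory() as tmp:
    vocab = os.path.join(tmp, "sentencepiece.bpe.model")
    with open(vocab, "wb") as f:
        f.write(b"fake-model-bytes")
    out = save_resources_sketch([vocab], os.path.join(tmp, "saved"))
    print(sorted(os.listdir(out)))  # ['sentencepiece.bpe.model']
```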
- property vocab_size#
Returns the size of the vocabulary.
- Returns:
The sum of the size of the vocabulary and the number of special tokens.
- Return type:
int
- get_vocab()[source]#
Returns the vocabulary as a dictionary mapping tokens to indices.
tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.
- Returns:
The vocabulary.
- Return type:
Dict[str, int]
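The equivalence stated above can be illustrated with a toy vocabulary standing in for a real MBartTokenizer; the dict contents and the fallback-to-`<unk>` behavior here are illustrative assumptions, not the actual mBART vocabulary:

```python
# Toy vocabulary in place of a real tokenizer's get_vocab() result.
vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "▁Hello": 4}

def convert_tokens_to_ids(token):
    # Unknown tokens fall back to the <unk> id, as tokenizers typically do.
    return vocab.get(token, vocab["<unk>"])

# For every token present in the vocab, the two lookups agree.
for token in vocab:
    assert vocab[token] == convert_tokens_to_ids(token)
print(len(vocab))  # 5
```

Note that the equivalence only holds for in-vocabulary tokens: `convert_tokens_to_ids` maps out-of-vocabulary tokens to the `<unk>` id, while a plain dict lookup would raise `KeyError`.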
- convert_tokens_to_string(tokens)[source]#
Converts a sequence of tokens (strings for sub-words) into a single string.
- convert_ids_to_string(ids)[source]#
Converts a sequence of ids into a single string.
- get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]#
Retrieves a special-tokens mask for a token list to which no special tokens have been added.
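A minimal sketch of what such a mask looks like, assuming MBART's convention of appending two special tokens ([eos, lang_code]) after the sequence(s); the function name and the special-id set are hypothetical:

```python
def get_special_tokens_mask_sketch(token_ids_0, token_ids_1=None,
                                   already_has_special_tokens=False,
                                   special_ids=frozenset({0, 1, 2})):
    """Return 1 for positions holding special tokens, 0 otherwise."""
    if already_has_special_tokens:
        # Inspect the ids directly when specials are already present.
        return [1 if t in special_ids else 0 for t in token_ids_0]
    # MBART appends two special tokens ([eos, lang_code]) at the end.
    if token_ids_1 is None:
        return [0] * len(token_ids_0) + [1, 1]
    return [0] * (len(token_ids_0) + len(token_ids_1)) + [1, 1]

print(get_special_tokens_mask_sketch([10, 11, 12]))  # [0, 0, 0, 1, 1]
```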
- build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]#
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An MBART sequence has the following format, where X represents the sequence:
input_ids (for encoder): X [eos, src_lang_code]
decoder_input_ids (for decoder): X [eos, tgt_lang_code]
BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.
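The format above can be sketched as a few lines of Python. The ids below are hypothetical placeholders (250004 is shown only as an example of a language-code id), and the function is an illustration of the documented behavior, not the library's implementation:

```python
# Hypothetical ids standing in for real vocabulary entries.
EOS_ID = 2
SRC_LANG_CODE_ID = 250004  # e.g. a source-language code id

def build_inputs_with_special_tokens_sketch(token_ids_0, token_ids_1=None):
    """X -> X [eos, src_lang_code]; pairs are concatenated with no separator."""
    suffix = [EOS_ID, SRC_LANG_CODE_ID]
    if token_ids_1 is None:
        return token_ids_0 + suffix
    # Pair case: the second sequence follows the first directly.
    return token_ids_0 + token_ids_1 + suffix

print(build_inputs_with_special_tokens_sketch([10, 11]))  # [10, 11, 2, 250004]
```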
- build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)[source]#
Build an offset map from a pair of offset maps by concatenating and adding the offsets of special tokens.
Should be overridden in a subclass if the model has a special way of building those.
- Parameters:
offset_mapping_0 (List[tuple]) -- List of char offsets to which the special tokens will be added.
offset_mapping_1 (List[tuple], optional) -- Optional second list of char offsets for offset mapping pairs.
- Returns:
List of char offsets with the appropriate offsets of special tokens.
- Return type:
List[tuple]
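Assuming the two trailing special tokens ([eos, lang_code]) carry no character span, a sketch of the expected behavior looks like this (the function name and the (0, 0) placeholder convention are assumptions, though (0, 0) is a common choice for special tokens):

```python
def build_offset_mapping_sketch(offset_mapping_0, offset_mapping_1=None):
    """Append (0, 0) placeholders for the two trailing special tokens."""
    special = [(0, 0), (0, 0)]  # [eos, lang_code] map to no character span
    if offset_mapping_1 is None:
        return offset_mapping_0 + special
    return offset_mapping_0 + offset_mapping_1 + special

print(build_offset_mapping_sketch([(0, 5), (6, 11)]))
# [(0, 5), (6, 11), (0, 0), (0, 0)]
```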
- class MBart50Tokenizer(vocab_file, src_lang=None, tgt_lang=None, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', sp_model_kwargs=None, additional_special_tokens=None, **kwargs)[source]#
- save_resources(save_directory)[source]#
Save tokenizer-related resources, as named in resource_files_names, to files under save_directory by copying them directly. Override it if necessary.
- Parameters:
save_directory (str) -- Directory to save files into.
- get_vocab()[source]#
Returns the vocabulary as a dictionary mapping tokens to indices.
tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.
- Returns:
The vocabulary.
- Return type:
Dict[str, int]
- property vocab_size#
Returns the size of the vocabulary.
- Returns:
The sum of the size of the vocabulary and the number of special tokens.
- Return type:
int
- convert_tokens_to_string(tokens)[source]#
Converts a sequence of tokens (strings for sub-words) into a single string.
- convert_ids_to_string(ids)[source]#
Converts a sequence of ids into a single string.
- get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]#
Retrieves a special-tokens mask for a token list to which no special tokens have been added.
- build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]#
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An MBART50 sequence has the following format, where X represents the sequence:
input_ids (for encoder): [src_lang_code] X [eos]
labels (for decoder): [tgt_lang_code] X [eos]
BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.
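Note the difference from MBartTokenizer: here the language code leads the sequence instead of trailing it. A sketch under the same assumptions as before (the ids are hypothetical placeholders, and the function is illustrative, not the library's implementation):

```python
# Hypothetical ids standing in for real vocabulary entries.
EOS_ID = 2
SRC_LANG_CODE_ID = 250004  # e.g. a source-language code id

def build_mbart50_inputs_sketch(token_ids_0, token_ids_1=None,
                                lang_code_id=SRC_LANG_CODE_ID):
    """X -> [lang_code] X [eos]; the language code leads instead of trailing."""
    if token_ids_1 is None:
        return [lang_code_id] + token_ids_0 + [EOS_ID]
    # Pair case: both sequences are concatenated with no separator.
    return [lang_code_id] + token_ids_0 + token_ids_1 + [EOS_ID]

print(build_mbart50_inputs_sketch([10, 11]))  # [250004, 10, 11, 2]
```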
- build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)[source]#
Build an offset map from a pair of offset maps by concatenating and adding the offsets of special tokens.
Should be overridden in a subclass if the model has a special way of building those.
- Parameters:
offset_mapping_0 (List[tuple]) -- List of char offsets to which the special tokens will be added.
offset_mapping_1 (List[tuple], optional) -- Optional second list of char offsets for offset mapping pairs.
- Returns:
List of char offsets with the appropriate offsets of special tokens.
- Return type:
List[tuple]