tokenizer#

class MBartTokenizer(vocab_file, src_lang=None, tgt_lang=None, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', sp_model_kwargs=None, additional_special_tokens=None, **kwargs)[source]#

Bases: PretrainedTokenizer
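
A minimal usage sketch follows. The checkpoint name mbart-large-cc25 and the en_XX/ro_RO language codes are assumptions for illustration; substitute whichever checkpoint and codes your task uses.

  from paddlenlp.transformers import MBartTokenizer

  # Load a pretrained tokenizer; checkpoint name and language codes are illustrative.
  tokenizer = MBartTokenizer.from_pretrained(
      "mbart-large-cc25", src_lang="en_XX", tgt_lang="ro_RO")

  encoded = tokenizer("Hello world")
  print(encoded["input_ids"])  # token ids ending with [eos, src_lang_code]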

save_resources(save_directory)[source]#

Saves tokenizer-related resources to the files named in resource_files_names under save_directory by copying them directly. Override this method if necessary.

Parameters:

save_directory (str) -- Directory to save files into.
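
A short sketch of saving the tokenizer resources to a local directory; the directory name and the tokenizer instance from the snippet above are hypothetical examples.

  import os

  save_dir = "./mbart_tokenizer"      # hypothetical output directory
  os.makedirs(save_dir, exist_ok=True)
  tokenizer.save_resources(save_dir)  # copies the vocabulary resource file(s) into save_dir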

property vocab_size#

Returns the size of the vocabulary.

Returns:

The size of the vocabulary plus the number of special tokens.

Return type:

int

get_vocab()[source]#

Returns the vocabulary as a dictionary mapping tokens to indices.

tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

Returns:

The vocabulary.

Return type:

Dict[str, int]
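
The equivalence above can be checked directly; the sub-word token below is a hypothetical example and any in-vocabulary token works.

  vocab = tokenizer.get_vocab()
  token = "▁the"  # hypothetical sub-word token
  if token in vocab:
      assert vocab[token] == tokenizer.convert_tokens_to_ids(token)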

convert_tokens_to_string(tokens)[source]#

Converts a sequence of tokens (strings for sub-words) into a single string.

convert_ids_to_string(ids)[source]#

Converts a sequence of ids into a single string by first converting the ids to tokens (strings for sub-words).

get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]#

Retrieves a mask identifying special tokens in a sequence built from token_ids_0 (and optionally token_ids_1): entries are 1 for special tokens and 0 for sequence tokens.

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]#

Builds model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An MBART sequence has the following format, where X represents the sequence:

  • input_ids (for encoder): X [eos, src_lang_code]

  • decoder_input_ids (for decoder): X [eos, tgt_lang_code]

BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.
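
A sketch of the encoder-side layout, assuming the tokenizer instance from the snippet above and the standard eos_token_id attribute:

  ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello world"))
  inputs = tokenizer.build_inputs_with_special_tokens(ids)

  # Per the format above: X followed by [eos, src_lang_code]
  assert inputs[:len(ids)] == ids
  assert inputs[len(ids)] == tokenizer.eos_token_id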

build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)[source]#

Builds an offset map from a pair of offset maps by concatenating them and adding the offsets of special tokens.

Should be overridden in a subclass if the model has a special way of building those.

Parameters:
  • offset_mapping_0 (List[tuple]) -- List of char offsets to which the special tokens will be added.

  • offset_mapping_1 (List[tuple], optional) -- Optional second list of char offsets for offset mapping pairs.

Returns:

List of char offsets with the appropriate offsets of special tokens.

Return type:

List[tuple]

set_src_lang_special_tokens(src_lang)[source]#

Resets the special tokens to the source language setting: no prefix and suffix=[eos, src_lang_code].

set_tgt_lang_special_tokens(tgt_lang)[source]#

Resets the special tokens to the target language setting: no prefix and suffix=[eos, tgt_lang_code].
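
A sketch of switching between source- and target-language special tokens. The en_XX/ro_RO codes are illustrative assumptions, and it is assumed that build_inputs_with_special_tokens picks up the suffix set by these calls.

  tokenizer.set_src_lang_special_tokens("en_XX")
  src_ids = tokenizer.build_inputs_with_special_tokens(
      tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello")))
  # src_ids now ends with [eos, en_XX code]

  tokenizer.set_tgt_lang_special_tokens("ro_RO")
  tgt_ids = tokenizer.build_inputs_with_special_tokens(
      tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Salut")))
  # tgt_ids now ends with [eos, ro_RO code]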

class MBart50Tokenizer(vocab_file, src_lang=None, tgt_lang=None, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', sp_model_kwargs=None, additional_special_tokens=None, **kwargs)[source]#

Bases: PretrainedTokenizer

save_resources(save_directory)[source]#

Saves tokenizer-related resources to the files named in resource_files_names under save_directory by copying them directly. Override this method if necessary.

Parameters:

save_directory (str) -- Directory to save files into.

get_vocab()[source]#

Returns the vocabulary as a dictionary mapping tokens to indices.

tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

Returns:

The vocabulary.

Return type:

Dict[str, int]

property vocab_size#

Returns the size of the vocabulary.

Returns:

The size of the vocabulary plus the number of special tokens.

Return type:

int

convert_tokens_to_string(tokens)[source]#

Converts a sequence of tokens (strings for sub-words) into a single string.

convert_ids_to_string(ids)[source]#

Converts a sequence of ids into a single string by first converting the ids to tokens (strings for sub-words).

get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]#

Retrieves a mask identifying special tokens in a sequence built from token_ids_0 (and optionally token_ids_1): entries are 1 for special tokens and 0 for sequence tokens.

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]#

Builds model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An MBART50 sequence has the following format, where X represents the sequence:

  • input_ids (for encoder): [src_lang_code] X [eos]

  • labels (for decoder): [tgt_lang_code] X [eos]

BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.
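
A sketch highlighting the MBART50 layout, where the language code is a prefix rather than a suffix. The checkpoint name and language codes are assumptions for illustration.

  from paddlenlp.transformers import MBart50Tokenizer

  tok50 = MBart50Tokenizer.from_pretrained(
      "mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")

  ids = tok50.convert_tokens_to_ids(tok50.tokenize("Hello world"))
  inputs = tok50.build_inputs_with_special_tokens(ids)

  # Per the format above: [src_lang_code] followed by X and [eos]
  assert inputs[1:1 + len(ids)] == ids
  assert inputs[-1] == tok50.eos_token_id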

build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)[source]#

Builds an offset map from a pair of offset maps by concatenating them and adding the offsets of special tokens.

Should be overridden in a subclass if the model has a special way of building those.

Parameters:
  • offset_mapping_0 (List[tuple]) -- List of char offsets to which the special tokens will be added.

  • offset_mapping_1 (List[tuple], optional) -- Optional second list of char offsets for offset mapping pairs.

Returns:

List of char offsets with the appropriate offsets of special tokens.

Return type:

List[tuple]

set_src_lang_special_tokens(src_lang)[source]#

Resets the special tokens to the source language setting: prefix=[src_lang_code] and suffix=[eos].

set_tgt_lang_special_tokens(tgt_lang)[source]#

Resets the special tokens to the target language setting: prefix=[tgt_lang_code] and suffix=[eos].