tokenizer¶
-
class MBartTokenizer(vocab_file, src_lang=None, tgt_lang=None, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', sp_model_kwargs=None, additional_special_tokens=None, **kwargs)[source]¶
Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
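A minimal loading and encoding sketch (the checkpoint name mbart-large-cc25, the language code en_XX, and the example sentence are assumptions; from_pretrained is inherited from PretrainedTokenizer):

from paddlenlp.transformers import MBartTokenizer

# Load a pretrained tokenizer; the checkpoint name is an assumed example.
tokenizer = MBartTokenizer.from_pretrained("mbart-large-cc25", src_lang="en_XX")

# Encode a sentence; special tokens are appended as described under
# build_inputs_with_special_tokens below.
encoded = tokenizer("PaddleNLP provides an MBART tokenizer.")
print(encoded["input_ids"])

The method sketches below reuse this tokenizer instance.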
-
save_resources(save_directory)[source]¶
Save tokenizer-related resources to the files named by resource_files_names under save_directory by copying them directly. Override it if necessary.
- Parameters
save_directory (str) – Directory to save files into.
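For example, continuing the loading sketch above (the directory name is arbitrary):

import os

save_dir = "./mbart_tokenizer"
os.makedirs(save_dir, exist_ok=True)
# Copies the tokenizer's resource files (e.g. the sentencepiece vocab) into save_dir.
tokenizer.save_resources(save_dir)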
-
property vocab_size¶
Returns the size of the vocabulary.
- Returns
The sum of the vocabulary size and the number of special tokens.
- Return type
int
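A quick look, continuing the sketch above:

# Counts the base vocabulary plus the special tokens.
print(tokenizer.vocab_size)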
-
get_vocab()[source]¶
Returns the vocabulary as a dictionary of token to index.
tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.
- Returns
The vocabulary.
- Return type
Dict[str, int]
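A small sanity check of the documented equivalence, continuing the sketch above:

vocab = tokenizer.get_vocab()
token = next(iter(vocab))  # any token that is in the vocabulary
assert vocab[token] == tokenizer.convert_tokens_to_ids(token)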
-
convert_tokens_to_string(tokens)[source]¶
Converts a sequence of tokens (strings for sub-words) into a single string.
-
convert_ids_to_string(ids)[source]¶
Converts a sequence of ids into a single string.
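A round-trip sketch covering both conversion helpers, continuing the example above (the sentence is a placeholder):

tokens = tokenizer.tokenize("Hello world")
text_from_tokens = tokenizer.convert_tokens_to_string(tokens)

ids = tokenizer.convert_tokens_to_ids(tokens)
text_from_ids = tokenizer.convert_ids_to_string(ids)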
-
get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]¶
Retrieves a special tokens mask from a token list that has no special tokens added.
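A minimal sketch, continuing the example above; treating the result as a 0/1 mask where 1 marks a special-token position is an assumption based on the usual convention for this method:

ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello world"))
# Mask over the sequence that build_inputs_with_special_tokens would produce.
mask = tokenizer.get_special_tokens_mask(ids)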
-
build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]¶
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An MBART sequence has the following format, where X represents the sequence:
- input_ids (for encoder): X [eos, src_lang_code]
- decoder_input_ids (for decoder): X [eos, tgt_lang_code]
BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.
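The encoder layout can be inspected directly, continuing the sketch above (the sentence is a placeholder):

ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello world"))
input_ids = tokenizer.build_inputs_with_special_tokens(ids)
# Per the format above, the last two positions hold the eos id and the
# source language code id.
print(input_ids[-2:])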
-
build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)[source]¶
Build an offset map from a pair of offset maps by concatenating them and adding the offsets of special tokens.
Should be overridden in a subclass if the model has a special way of building those.
- Parameters
offset_mapping_0 (List[tuple]) – List of char offsets to which the special tokens will be added.
offset_mapping_1 (List[tuple], optional) – Optional second list of char offsets for offset mapping pairs.
- Returns
List of char offsets with the appropriate offsets of special tokens.
- Return type
List[tuple]
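A small sketch, continuing the example above; the char offsets correspond to the two words of "Hello world":

# Offsets for the appended special tokens are added by the method.
offsets = [(0, 5), (6, 11)]
full_offsets = tokenizer.build_offset_mapping_with_special_tokens(offsets)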
-
-
class MBart50Tokenizer(vocab_file, src_lang=None, tgt_lang=None, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', sp_model_kwargs=None, additional_special_tokens=None, **kwargs)[source]¶
Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
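A minimal loading sketch (the checkpoint name mbart-large-50-many-to-many-mmt and the language codes are assumptions):

from paddlenlp.transformers import MBart50Tokenizer

# Load a pretrained MBART-50 tokenizer; the checkpoint name is an assumed example.
tokenizer50 = MBart50Tokenizer.from_pretrained(
    "mbart-large-50-many-to-many-mmt", src_lang="en_XX", tgt_lang="ro_RO")

The sketch under build_inputs_with_special_tokens below reuses this tokenizer50 instance.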
-
save_resources(save_directory)[source]¶
Save tokenizer-related resources to the files named by resource_files_names under save_directory by copying them directly. Override it if necessary.
- Parameters
save_directory (str) – Directory to save files into.
-
get_vocab()[source]¶
Returns the vocabulary as a dictionary of token to index.
tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.
- Returns
The vocabulary.
- Return type
Dict[str, int]
-
property vocab_size¶
Returns the size of the vocabulary.
- Returns
The sum of the vocabulary size and the number of special tokens.
- Return type
int
-
convert_tokens_to_string(tokens)[source]¶
Converts a sequence of tokens (strings for sub-words) into a single string.
-
convert_ids_to_string(ids)[source]¶
Converts a sequence of ids into a single string.
-
get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]¶
Retrieves a special tokens mask from a token list that has no special tokens added.
-
build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]¶
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An MBART50 sequence has the following format, where X represents the sequence:
- input_ids (for encoder): [src_lang_code] X [eos]
- labels (for decoder): [tgt_lang_code] X [eos]
BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.
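In contrast to MBartTokenizer, the language code is prefixed rather than appended. Continuing the MBART-50 loading sketch above (the sentence is a placeholder):

ids = tokenizer50.convert_tokens_to_ids(tokenizer50.tokenize("Hello world"))
input_ids = tokenizer50.build_inputs_with_special_tokens(ids)
# Per the format above, input_ids[0] is the source language code id and
# input_ids[-1] is the eos id.
print(input_ids[0], input_ids[-1])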
-
build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)[source]¶
Build an offset map from a pair of offset maps by concatenating them and adding the offsets of special tokens.
Should be overridden in a subclass if the model has a special way of building those.
- Parameters
offset_mapping_0 (List[tuple]) – List of char offsets to which the special tokens will be added.
offset_mapping_1 (List[tuple], optional) – Optional second list of char offsets for offset mapping pairs.
- Returns
List of char offsets with the appropriate offsets of special tokens.
- Return type
List[tuple]
-