class MBartTokenizer(vocab_file, src_lang=None, tgt_lang=None, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', mbart_type='mbart', **kwargs)

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

property vocab_size

Return the size of the vocabulary.

Returns

The size of the vocabulary.

Return type

int

Tokenize a string.


Converts a sequence of tokens (strings for sub-words) into a single string.


Converts a sequence of tokens (strings for sub-words) into a single string.
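
To make the conversion concrete, here is a minimal sketch of joining sub-word tokens back into text. It assumes SentencePiece-style pieces, where the "▁" character (U+2581) marks a word boundary; `tokens_to_string_sketch` is a hypothetical helper written for illustration and is not the PaddleNLP method itself.

```python
from typing import List

# Assumption: SentencePiece-style pieces use "\u2581" to mark a word boundary.
WORD_BOUNDARY = "\u2581"


def tokens_to_string_sketch(tokens: List[str]) -> str:
    """Join sub-word pieces and turn boundary markers back into spaces."""
    joined = "".join(tokens)
    return joined.replace(WORD_BOUNDARY, " ").strip()


text = tokens_to_string_sketch(["\u2581Hello", "\u2581wor", "ld"])
print(text)
```

Pieces without a leading marker are glued to the previous piece, which is how "wor" and "ld" become one word here.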

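get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)

Retrieve the special tokens mask for a token list. By default the list is assumed to have no special tokens added yet; pass already_has_special_tokens=True if it already contains them.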
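
As a rough illustration of what such a mask could look like, the sketch below assumes the common convention of 1 for a special token and 0 for a sequence token, together with the two-token suffix (eos, language code) that this tokenizer appends. `special_tokens_mask_sketch` and its `suffix_len` parameter are hypothetical and not part of PaddleNLP.

```python
from typing import List, Optional


def special_tokens_mask_sketch(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    suffix_len: int = 2,  # assumption: eos plus one language code
) -> List[int]:
    """Return 0 for each sequence token and 1 for each appended special token."""
    body_len = len(token_ids_0)
    if token_ids_1 is not None:
        body_len += len(token_ids_1)
    return [0] * body_len + [1] * suffix_len


mask = special_tokens_mask_sketch([10, 11, 12])
print(mask)
```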

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An MBART sequence has the following format, where X represents the sequence:

  • input_ids (for encoder): X [eos, src_lang_code]

  • decoder_input_ids (for decoder): X [eos, tgt_lang_code]

BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.
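
The format above can be sketched in a few lines of plain Python. The token ids below are made up for illustration (real ids come from the vocabulary file), and `build_inputs_sketch` is a hypothetical helper that only mirrors the documented layout; it is not the library implementation.

```python
from typing import List, Optional

# Hypothetical ids chosen only for illustration.
EOS_ID = 2
SRC_LANG_ID = 250004


def build_inputs_sketch(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    eos_id: int = EOS_ID,
    lang_code_id: int = SRC_LANG_ID,
) -> List[int]:
    """Append [eos, lang_code] to the sequence; no BOS and no pair separator."""
    suffix = [eos_id, lang_code_id]
    if token_ids_1 is None:
        return list(token_ids_0) + suffix
    # Pairs are concatenated directly, without a separator token.
    return list(token_ids_0) + list(token_ids_1) + suffix


single = build_inputs_sketch([10, 11])
pair = build_inputs_sketch([10, 11], [20, 21])
print(single)
print(pair)
```

Passing the target language code as `lang_code_id` gives the decoder-side layout in the same way.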


Temporarily sets the tokenizer for encoding the targets. Useful for tokenizers associated with sequence-to-sequence models that need slightly different processing for the labels.
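
One way to picture this temporary switch is a context manager that swaps the language-code suffix and restores it on exit. The sketch below uses a toy class with made-up ids; `ToyTokenizer`, `encode`, and `target_mode` are hypothetical names used for illustration and do not reflect the PaddleNLP API.

```python
from contextlib import contextmanager
from typing import Iterator, List


class ToyTokenizer:
    """Hypothetical stand-in that only tracks the special-token suffix."""

    def __init__(self, src_lang_id: int, tgt_lang_id: int, eos_id: int = 2) -> None:
        self.eos_id = eos_id
        self.src_lang_id = src_lang_id
        self.tgt_lang_id = tgt_lang_id
        self.suffix = [eos_id, src_lang_id]  # source mode by default

    def encode(self, ids: List[int]) -> List[int]:
        return list(ids) + self.suffix

    @contextmanager
    def target_mode(self) -> Iterator["ToyTokenizer"]:
        """Temporarily use the target language code, then restore the old suffix."""
        previous = self.suffix
        self.suffix = [self.eos_id, self.tgt_lang_id]
        try:
            yield self
        finally:
            self.suffix = previous


tok = ToyTokenizer(src_lang_id=250004, tgt_lang_id=250020)
source = tok.encode([10, 11])
with tok.target_mode():
    labels = tok.encode([20, 21])
after = tok.encode([10, 11])
print(source, labels, after)
```

Restoring the suffix in a `finally` clause means the tokenizer returns to source mode even if encoding the labels raises an exception.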