tokenizer

class MBartTokenizer(vocab_file, src_lang=None, tgt_lang=None, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', mbart_type='mbart', **kwargs)[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
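
A minimal construction sketch. The 'mbart-large-cc25' checkpoint name and the 'en_XX'/'ro_RO' language codes are assumptions following the usual mBART convention; from_pretrained is inherited from PretrainedTokenizer:

    from paddlenlp.transformers import MBartTokenizer

    # Load a pretrained tokenizer; src_lang/tgt_lang select the language
    # codes appended to encoder/decoder inputs as special tokens.
    tokenizer = MBartTokenizer.from_pretrained(
        'mbart-large-cc25', src_lang='en_XX', tgt_lang='ro_RO')

The tokenizer constructed here is reused in the method examples below.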

property vocab_size

Returns the size of the vocabulary.

Returns

The size of the vocabulary.

Return type

int
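
For example (the value in the comment is illustrative; the actual size depends on the loaded vocabulary):

    print(tokenizer.vocab_size)  # an int, e.g. roughly 250k for mBART-cc25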

tokenize(text)[source]

Tokenizes a string into a list of sub-word tokens.
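
A short sketch; the sub-word pieces in the comment are illustrative, since the split depends on the underlying SentencePiece model:

    tokens = tokenizer.tokenize("Hello world")
    # e.g. ['▁Hello', '▁world'] (illustrative output)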

convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (strings for sub-words) into a single string.
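
For example, merging sub-word pieces back into text ('▁' is SentencePiece's word-boundary marker):

    text = tokenizer.convert_tokens_to_string(['▁Hello', '▁world'])
    # 'Hello world'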

convert_ids_to_string(ids)[source]

Converts a sequence of ids (integers) into a single string by first converting the ids back to tokens.
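
A round-trip sketch, using convert_tokens_to_ids from the base tokenizer:

    ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello world"))
    text = tokenizer.convert_ids_to_string(ids)  # back to 'Hello world'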

get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]

Builds a mask marking the special tokens in a sequence: 1 for a special token, 0 for a regular sequence token. The input is expected to contain no special tokens yet unless already_has_special_tokens is True.
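
A sketch of how the mask lines up with a sequence built by build_inputs_with_special_tokens (documented below):

    plain_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello"))
    full_ids = tokenizer.build_inputs_with_special_tokens(plain_ids)
    mask = tokenizer.get_special_tokens_mask(
        full_ids, already_has_special_tokens=True)
    # mask has 1 at the special-token positions (eos, language code), 0 elsewhere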

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An MBART sequence has the following format, where X represents the sequence:

  • input_ids (for encoder): X [eos, src_lang_code]

  • decoder_input_ids (for decoder): X [eos, tgt_lang_code]

BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.
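
A sketch of the encoder-side format (the ids in the comment are placeholders for the actual values):

    ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello world"))
    input_ids = tokenizer.build_inputs_with_special_tokens(ids)
    # input_ids == ids + [eos_token_id, src_lang_code_id]; no BOS is prepended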

as_target_tokenizer()[source]

Temporarily sets the tokenizer for encoding the targets. Useful for tokenizers associated with sequence-to-sequence models that need slightly different processing for the labels.
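
A hedged sketch of preparing seq2seq labels; calling the tokenizer directly on text follows the usual PaddleNLP convention, and the sentence pair is just sample text:

    batch = tokenizer("UN Chief says there is no military solution in Syria")
    with tokenizer.as_target_tokenizer():
        labels = tokenizer("Şeful ONU declară că nu există o soluţie militară în Siria")
    # inside the context manager the target language code (tgt_lang) is
    # appended instead of the source language code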