tokenizer¶
- class MBartTokenizer(vocab_file, src_lang=None, tgt_lang=None, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', mbart_type='mbart', **kwargs)[source]¶
  Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
- property vocab_size¶
  Returns the size of the vocabulary.
  - Returns: The size of the vocabulary.
  - Return type: int
- convert_tokens_to_string(tokens)[source]¶
  Converts a sequence of tokens (strings for sub-words) into a single string.
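MBART tokenization is based on a SentencePiece vocabulary, in which sub-word tokens carry a "▁" (U+2581) marker at word boundaries. The sketch below illustrates the detokenization contract of convert_tokens_to_string under that assumption; it is a standalone illustration, not the library implementation.

```python
def convert_tokens_to_string(tokens):
    """Join sub-word tokens and turn "\u2581" word-boundary markers back into spaces."""
    return "".join(tokens).replace("\u2581", " ").strip()

# The sub-word pieces of "Hello, world" are rejoined into plain text.
print(convert_tokens_to_string(["\u2581Hello", ",", "\u2581wor", "ld"]))
# Hello, world
```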
- convert_ids_to_string(ids)[source]¶
  Converts a sequence of ids into a single string by first converting the ids into sub-word tokens.
- get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]¶
  Retrieves sequence ids from a token list that has no special tokens added.
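For the MBART layout used by this tokenizer (a sequence X followed by an eos token and a language code), the special-tokens mask marks the two trailing positions with 1 and everything else with 0. The following sketch assumes that layout and is illustrative only; the real method also supports already_has_special_tokens=True by consulting the tokenizer's set of special-token ids.

```python
def get_special_tokens_mask(token_ids_0, token_ids_1=None,
                            already_has_special_tokens=False):
    """Return 1 for positions that will hold special tokens, 0 for sequence tokens."""
    if already_has_special_tokens:
        # A real tokenizer would look each id up in its special-token id set here.
        raise NotImplementedError("requires the tokenizer's special-id set")
    if token_ids_1 is None:
        # Single sequence: X + [eos, lang_code]
        return [0] * len(token_ids_0) + [1, 1]
    # Pair: both sequences concatenated without a separator, then [eos, lang_code].
    return [0] * (len(token_ids_0) + len(token_ids_1)) + [1, 1]

print(get_special_tokens_mask([5, 6, 7]))
# [0, 0, 0, 1, 1]
```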
- build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]¶
  Builds model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An MBART sequence has the following format, where X represents the sequence:
  - input_ids (for the encoder): X [eos, src_lang_code]
  - decoder_input_ids (for the decoder): X [eos, tgt_lang_code]
  BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.
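The encoder-side assembly described above can be sketched as follows. The eos and language-code ids used here are placeholders chosen for illustration, not real MBART vocabulary ids, and the function is a minimal stand-in for the tokenizer method.

```python
EOS_ID = 2                  # hypothetical </s> id, for illustration only
SRC_LANG_CODE_ID = 250004   # hypothetical source language-code id

def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    """X [eos, src_lang_code]; a pair is concatenated without a separator."""
    suffix = [EOS_ID, SRC_LANG_CODE_ID]
    if token_ids_1 is None:
        return token_ids_0 + suffix
    return token_ids_0 + token_ids_1 + suffix

# A two-token sequence gains the eos and language-code suffix; no BOS is added.
print(build_inputs_with_special_tokens([10, 11]))
# [10, 11, 2, 250004]
```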
-
property