tokenizer

class MBartTokenizer(vocab_file, src_lang=None, tgt_lang=None, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', sp_model_kwargs=None, additional_special_tokens=None, **kwargs)[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

save_resources(save_directory)[source]

Save the tokenizer's resource files, under the names given by resource_files_names, into save_directory by copying them directly. Override it if necessary.

Parameters

save_directory (str) -- Directory to save files into.

property vocab_size

Returns the size of the vocabulary.

Returns

The sum of the size of the vocabulary and the number of special tokens.

Return type

int

get_vocab()[source]

Returns the vocabulary as a dictionary mapping tokens to indices.

tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

Returns

The vocabulary.

Return type

Dict[str, int]
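The documented equivalence between get_vocab() lookups and convert_tokens_to_ids can be illustrated with a toy vocabulary; the dictionary and helper below are hypothetical stand-ins (a real MBartTokenizer loads a SentencePiece vocabulary from vocab_file):

```python
# Toy stand-in for the vocabulary returned by get_vocab().
toy_vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "▁hello": 4}

def convert_tokens_to_ids(token, vocab, unk_token="<unk>"):
    # Out-of-vocabulary tokens fall back to the unknown token's id.
    return vocab.get(token, vocab[unk_token])

# For every in-vocab token the two lookups agree:
assert all(toy_vocab[t] == convert_tokens_to_ids(t, toy_vocab) for t in toy_vocab)
```

Note the equivalence only holds for in-vocab tokens; for unknown tokens, convert_tokens_to_ids returns the unk_token id, while a plain dictionary lookup would raise KeyError.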

convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (strings for sub-words) into a single string.

convert_ids_to_string(ids)[source]

Converts a sequence of ids into a single string by first converting the ids to tokens.

get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]

Retrieve a mask marking which positions hold special tokens (1) and which hold sequence tokens (0), for a token list that has no special tokens added.
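For MBART's layout (no prefix, a two-token suffix of eos plus the language code), the mask logic can be sketched as plain Python; this is a hedged illustration of the no-special-tokens case only, not the library's implementation:

```python
# Sketch of the special-tokens mask for MBART's layout:
# no prefix, suffix = [eos, src_lang_code] (two special positions).
def get_special_tokens_mask(token_ids_0, token_ids_1=None,
                            already_has_special_tokens=False):
    if already_has_special_tokens:
        raise NotImplementedError("sketch covers only unprocessed token lists")
    if token_ids_1 is None:
        return [0] * len(token_ids_0) + [1, 1]
    # Pairs are concatenated without a separator before the suffix.
    return [0] * (len(token_ids_0) + len(token_ids_1)) + [1, 1]

get_special_tokens_mask([100, 101, 102])  # → [0, 0, 0, 1, 1]
```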

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An MBART sequence has the following format, where X represents the sequence:

  • input_ids (for encoder): X [eos, src_lang_code]

  • decoder_input_ids (for decoder): X [eos, tgt_lang_code]

BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.
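The layout above can be sketched as follows; the ids are hypothetical placeholders, not the real vocabulary ids:

```python
# Minimal sketch of MBART's input layout: X [eos, src_lang_code].
EOS_ID = 2            # id of </s> (placeholder)
SRC_LANG_ID = 250004  # id of a language code such as en_XX (placeholder)

def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    suffix = [EOS_ID, SRC_LANG_ID]  # no prefix: BOS is never used
    if token_ids_1 is None:
        return token_ids_0 + suffix
    # Pairs are joined without a separator, then the suffix is appended.
    return token_ids_0 + token_ids_1 + suffix

build_inputs_with_special_tokens([100, 101])  # → [100, 101, 2, 250004]
```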

set_src_lang_special_tokens(src_lang)[source]

Reset the special tokens to the source language setting. No prefix; suffix = [eos, src_lang_code].

set_tgt_lang_special_tokens(tgt_lang)[source]

Reset the special tokens to the target language setting. No prefix; suffix = [eos, tgt_lang_code].

class MBart50Tokenizer(vocab_file, src_lang=None, tgt_lang=None, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', sp_model_kwargs=None, additional_special_tokens=None, **kwargs)[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

save_resources(save_directory)[source]

Save the tokenizer's resource files, under the names given by resource_files_names, into save_directory by copying them directly. Override it if necessary.

Parameters

save_directory (str) -- Directory to save files into.

get_vocab()[source]

Returns the vocabulary as a dictionary mapping tokens to indices.

tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

Returns

The vocabulary.

Return type

Dict[str, int]

property vocab_size

Returns the size of the vocabulary.

Returns

The sum of the size of the vocabulary and the number of special tokens.

Return type

int

convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (strings for sub-words) into a single string.

convert_ids_to_string(ids)[source]

Converts a sequence of ids into a single string by first converting the ids to tokens.

get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]

Retrieve a mask marking which positions hold special tokens (1) and which hold sequence tokens (0), for a token list that has no special tokens added.

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An MBART50 sequence has the following format, where X represents the sequence:

  • input_ids (for encoder): [src_lang_code] X [eos]

  • labels (for decoder): [tgt_lang_code] X [eos]

BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.
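The contrast with MBART can be sketched in a few lines: in MBART-50 the language code moves to a prefix and the suffix shrinks to just eos. The ids below are hypothetical placeholders:

```python
# Minimal sketch of MBART-50's input layout: [src_lang_code] X [eos].
EOS_ID = 2            # id of </s> (placeholder)
SRC_LANG_ID = 250004  # id of a language code such as en_XX (placeholder)

def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    prefix = [SRC_LANG_ID]  # language code leads the sequence
    suffix = [EOS_ID]       # BOS is never used
    if token_ids_1 is None:
        return prefix + token_ids_0 + suffix
    # Pairs are joined without a separator between the two sequences.
    return prefix + token_ids_0 + token_ids_1 + suffix

build_inputs_with_special_tokens([100, 101])  # → [250004, 100, 101, 2]
```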

set_src_lang_special_tokens(src_lang)[source]

Reset the special tokens to the source language setting. prefix = [src_lang_code]; suffix = [eos].

set_tgt_lang_special_tokens(tgt_lang)[source]

Reset the special tokens to the target language setting. prefix = [tgt_lang_code]; suffix = [eos].