tokenizer

class BartTokenizer(vocab_file, merges_file, errors='replace', bos_token='<s>', eos_token='</s>', cls_token='<s>', sep_token='</s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', **kwargs)

Bases: PretrainedTokenizer

Construct a BART tokenizer based on byte-level Byte-Pair-Encoding.

This tokenizer inherits from GPTTokenizer. For more information regarding those methods, please refer to this superclass.

Parameters:

vocab_file (str) -- Path to the vocabulary file. The vocab file contains a mapping from vocabulary strings to indices.
merges_file (str) -- Path to the merge file. The merge file is used to split the input sentence into "subword" units. The vocab file is then used to encode those units as indices.
errors (str) -- Paradigm to follow when decoding bytes to UTF-8. Defaults to 'replace'.
max_len (int, optional) -- The maximum value of the input sequence length. Defaults to None.
bos_token (str, optional) -- The beginning-of-sequence token that was used during pretraining. Can be used as a sequence classifier token. Defaults to "<s>".
eos_token (str, optional) -- A special token representing the end of a sequence that was used during pretraining. Defaults to "</s>".
cls_token (str, optional) -- A special token used for sequence classification. It is the first token of the sequence when built with special tokens. Defaults to "<s>".
sep_token (str, optional) -- A special token separating two different sentences in the same input. Defaults to "</s>".
unk_token (str, optional) -- A special token representing an unknown (out-of-vocabulary) token. A token that is not in the vocabulary is mapped to unk_token so that it can be converted to an ID. Defaults to "<unk>".
pad_token (str, optional) -- A special token used to make arrays of tokens the same size for batching purposes. Defaults to "<pad>".
mask_token (str, optional) -- A special token representing a masked position. In the masked language modeling task, the model tries to predict the original token at each masked position. Defaults to "<mask>".

Example

from paddlenlp.transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained('bart-base')
print(tokenizer('He was a puppeteer'))

'''
{'input_ids': [0, 894, 21, 10, 32986, 9306, 254, 2],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
'''

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None): Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens.
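
The concatenation can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the function name build_inputs_sketch is hypothetical, the ids 0 and 2 for "<s>" and "</s>" are read off the example output above, and the pair layout `<s> A </s></s> B </s>` is an assumption based on the RoBERTa-style convention that BART tokenizers commonly follow.

```python
from typing import List, Optional

# Assumed ids, read off the example output above: 0 for "<s>", 2 for "</s>".
CLS_ID = 0
SEP_ID = 2


def build_inputs_sketch(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Hypothetical stand-in for build_inputs_with_special_tokens."""
    if token_ids_1 is None:
        # Single sequence: <s> X </s>
        return [CLS_ID] + token_ids_0 + [SEP_ID]
    # Assumed pair layout: <s> A </s></s> B </s>
    return [CLS_ID] + token_ids_0 + [SEP_ID, SEP_ID] + token_ids_1 + [SEP_ID]


single = build_inputs_sketch([894, 21, 10, 32986, 9306, 254])
pair = build_inputs_sketch([894, 21], [10, 32986])
print(single)
print(pair)
```

With the ids from 'He was a puppeteer', the single-sequence result matches the input_ids printed in the class example.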

get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False): Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer encode methods.
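
As a rough illustration of what such a mask looks like, the sketch below marks special-token positions with 1 and ordinary tokens with 0. The function name special_tokens_mask_sketch is hypothetical, and both the 1/0 convention and the pair layout `<s> A </s></s> B </s>` are assumptions not stated in this page.

```python
from typing import List, Optional


def special_tokens_mask_sketch(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Hypothetical stand-in: 1 marks a special token, 0 a sequence token."""
    if token_ids_1 is None:
        # <s> X </s>
        return [1] + [0] * len(token_ids_0) + [1]
    # Assumed pair layout: <s> A </s></s> B </s>
    return [1] + [0] * len(token_ids_0) + [1, 1] + [0] * len(token_ids_1) + [1]


mask_single = special_tokens_mask_sketch([894, 21, 10])
mask_pair = special_tokens_mask_sketch([894, 21], [10])
print(mask_single)
print(mask_pair)
```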

create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None): Create a mask from the two sequences passed to be used in a sequence-pair classification task.
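
The page does not say what values this mask takes. BART-style models typically do not use token type ids, so the sketch below assumes an all-zero list whose length equals the full input with special tokens. That behaviour, the function name token_type_ids_sketch, and the pair layout `<s> A </s></s> B </s>` are all assumptions to verify against the library.

```python
from typing import List, Optional


def token_type_ids_sketch(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Hypothetical stand-in: assumes every position gets token type 0."""
    if token_ids_1 is None:
        # <s> X </s> adds two special tokens.
        return [0] * (len(token_ids_0) + 2)
    # Assumed pair layout <s> A </s></s> B </s> adds four special tokens.
    return [0] * (len(token_ids_0) + len(token_ids_1) + 4)


types_single = token_type_ids_sketch([894, 21, 10])
types_pair = token_type_ids_sketch([894, 21], [10])
print(types_single)
print(types_pair)
```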

get_vocab()

Returns the vocabulary as a dictionary of token to index.

tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

Returns: The vocabulary.
Return type: Dict[str, int]
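
The equivalence, and where it stops holding, can be shown with a toy vocabulary. Everything here is illustrative: toy_vocab and both helper functions are hypothetical stand-ins, and the fallback to the "<unk>" id for missing tokens is an assumption drawn from the unk_token description.

```python
# Toy vocabulary for illustration only; the real one is loaded from vocab_file.
toy_vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "hello": 4, "world": 5}


def get_vocab_sketch():
    """Hypothetical stand-in for tokenizer.get_vocab()."""
    return dict(toy_vocab)


def convert_tokens_to_ids_sketch(token, unk_token="<unk>"):
    """Hypothetical stand-in for tokenizer.convert_tokens_to_ids(token)."""
    # Assumption: tokens missing from the vocabulary fall back to the unk id.
    return toy_vocab.get(token, toy_vocab[unk_token])


# In-vocabulary token: both lookups agree.
in_vocab_match = get_vocab_sketch()["hello"] == convert_tokens_to_ids_sketch("hello")

# Out-of-vocabulary token: the dict has no entry, the conversion yields the unk id.
oov_in_dict = "zzz" in get_vocab_sketch()
oov_id = convert_tokens_to_ids_sketch("zzz")
print(in_vocab_match, oov_in_dict, oov_id)
```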

property vocab_size

Returns the size of the vocabulary.

Returns: The sum of the vocabulary size and the number of special tokens.
Return type: int

convert_ids_to_string(ids)

Converts a single index or a sequence of indices to text.

Parameters: ids (int|List[int]) -- The token id (or token ids) to be converted to text.
Returns: The decoded text.
Return type: str

Example

from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
print(tokenizer.convert_ids_to_string([14618, 284, 779, 350, 37382, 47, 37382, 290, 350, 37382, 45, 19930]))
# 'Welcome to use PaddlePaddle and PaddleNLP'

save_resources(save_directory)

Saves SentencePiece file (ends with '.spm') under save_directory.

Parameters: save_directory (str) -- Directory to save files into.

convert_tokens_to_string(tokens): Converts a sequence of tokens (strings) into a single string.
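
For a byte-level BPE tokenizer, this conversion usually means joining the tokens, mapping each character back to the byte it stands for, and decoding the bytes as UTF-8 using the errors policy from the constructor. The sketch below assumes the GPT-2 style byte-to-character table, which this page does not spell out. The names bytes_to_unicode, BYTE_DECODER and tokens_to_string_sketch are illustrative, not the library's API.

```python
from typing import Dict, List


def bytes_to_unicode() -> Dict[int, str]:
    """Map each of the 256 byte values to a printable character (GPT-2 style)."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are shifted to code points 256 and above.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


# Inverse table: character -> original byte value.
BYTE_DECODER = {ch: b for b, ch in bytes_to_unicode().items()}


def tokens_to_string_sketch(tokens: List[str], errors: str = "replace") -> str:
    """Hypothetical stand-in for convert_tokens_to_string."""
    text = "".join(tokens)
    raw = bytes(BYTE_DECODER[ch] for ch in text)
    return raw.decode("utf-8", errors=errors)


# "Ġ" (U+0120) is the table's stand-in for the space byte.
decoded = tokens_to_string_sketch(["Hello", "Ġworld"])
# "Ã" maps to byte 0xC3, an incomplete UTF-8 sequence on its own.
broken = tokens_to_string_sketch(["Ã"])
print(decoded)
print(repr(broken))
```

With errors='replace', the incomplete byte sequence decodes to the Unicode replacement character instead of raising an exception.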

build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)

Build an offset mapping from one offset mapping or a pair of offset mappings by concatenating them and adding the offsets of special tokens.

A BERT offset_mapping has the following format:

single sequence: (0,0) X (0,0)
pair of sequences: (0,0) A (0,0) B (0,0)

Parameters:

offset_mapping_0 (List[tuple]) -- List of wordpiece offsets to which the special tokens will be added.
offset_mapping_1 (List[tuple], optional) -- Optional second list of wordpiece offsets for offset mapping pairs. Defaults to None.

Returns:

A list of wordpiece offsets with the appropriate offsets of special tokens.

Return type:

List[tuple]
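
The format described above can be sketched in a few lines. The sketch follows the documented layout literally, with one (0, 0) entry per special-token slot. The function name build_offset_mapping_sketch and the sample offsets are hypothetical, and the actual method may differ in detail.

```python
from typing import List, Optional, Tuple

Offset = Tuple[int, int]


def build_offset_mapping_sketch(
    offset_mapping_0: List[Offset],
    offset_mapping_1: Optional[List[Offset]] = None,
) -> List[Offset]:
    """Hypothetical stand-in following the documented (0,0) X (0,0) layout."""
    if offset_mapping_1 is None:
        # single sequence: (0,0) X (0,0)
        return [(0, 0)] + offset_mapping_0 + [(0, 0)]
    # pair of sequences: (0,0) A (0,0) B (0,0)
    return [(0, 0)] + offset_mapping_0 + [(0, 0)] + offset_mapping_1 + [(0, 0)]


# Made-up character offsets for two short inputs.
offsets_single = build_offset_mapping_sketch([(0, 2), (3, 6)])
offsets_pair = build_offset_mapping_sketch([(0, 2), (3, 6)], [(0, 4)])
print(offsets_single)
print(offsets_pair)
```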
