tokenizer
class BartTokenizer(vocab_file, merges_file, errors='replace', max_len=None, bos_token='<s>', eos_token='</s>', cls_token='<s>', sep_token='</s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', **kwargs)[source]
Bases: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer
Construct a BART tokenizer based on byte-level Byte-Pair-Encoding.
This tokenizer inherits from GPTTokenizer. For more information about the inherited methods, refer to that superclass.
Parameters
vocab_file (str) -- Path to the vocabulary file. The vocab file contains a mapping from vocabulary strings to indices.
merges_file (str) -- Path to the merge file. The merge file is used to split the input sentence into "subword" units. The vocab file is then used to encode those units as indices.
errors (str, optional) -- Paradigm to follow when decoding bytes to UTF-8. Defaults to 'replace'.
max_len (int, optional) -- The maximum input sequence length. Defaults to None.
bos_token (str, optional) -- The beginning-of-sequence token that was used during pretraining. Can be used as a sequence classifier token. Defaults to "<s>".
eos_token (str, optional) -- A special token representing the end of a sequence that was used during pretraining. Defaults to "</s>".
cls_token (str, optional) -- A special token used for sequence classification. It is the first token of the sequence when built with special tokens. Defaults to "<s>".
sep_token (str, optional) -- A special token separating two different sentences in the same input. Defaults to "</s>".
unk_token (str, optional) -- A special token representing the unknown (out-of-vocabulary) token. A token that is not in the vocabulary is replaced with unk_token so that it can be converted to an ID. Defaults to "<unk>".
pad_token (str, optional) -- A special token used to make arrays of tokens the same size for batching purposes. Defaults to "<pad>".
mask_token (str, optional) -- A special token representing a masked token. It replaces masked positions in the masked language modeling task, where the model tries to predict the original tokens. Defaults to "<mask>".
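The roles of merges_file and vocab_file described above can be illustrated with a toy sketch. This is not the PaddleNLP implementation: the real tokenizer works on bytes and pre-splits the text first, and the names apply_merges, toy_merges and toy_vocab are hypothetical, made up for this example. It only shows the general idea of applying ranked merge rules and then looking the resulting units up in a vocabulary.

```python
def apply_merges(word, merges):
    """Greedily apply BPE merge rules to one word, best-ranked rule first."""
    # Earlier rules in the merge list have higher priority (lower rank).
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge rule is left
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical toy data standing in for the contents of merges_file and vocab_file.
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
toy_vocab = {"low": 0, "er": 1, "<unk>": 2}

units = apply_merges("lower", toy_merges)
# Units missing from the vocabulary fall back to the unknown token's ID.
ids = [toy_vocab.get(u, toy_vocab["<unk>"]) for u in units]
print(units, ids)
```

With these toy rules, "lower" is split into the units "low" and "er", which are then mapped to their vocabulary indices.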
Examples
from paddlenlp.transformers import BartTokenizer
tokenizer = BartTokenizer.from_pretrained('bart-base')
print(tokenizer('He was a puppeteer'))
'''
{'input_ids': [0, 894, 21, 10, 32986, 9306, 254, 2],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
'''
build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens.
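A minimal sketch of what this concatenation might look like, written in plain Python so it runs without PaddleNLP. The function build_inputs_sketch is hypothetical. The IDs 0 for "<s>" and 2 for "</s>" are inferred from the example output above, and the pair layout (two separators between the sequences) is an assumption based on the RoBERTa-style convention, not something this page confirms.

```python
def build_inputs_sketch(token_ids_0, token_ids_1=None, cls_id=0, sep_id=2):
    """Illustrative only: wrap one or two ID sequences with special tokens."""
    if token_ids_1 is None:
        # Single sequence: <s> A </s>
        return [cls_id] + token_ids_0 + [sep_id]
    # Pair of sequences (assumed layout): <s> A </s></s> B </s>
    return [cls_id] + token_ids_0 + [sep_id, sep_id] + token_ids_1 + [sep_id]


single = build_inputs_sketch([894, 21, 10, 32986, 9306, 254])
pair = build_inputs_sketch([894, 21], [10, 32986])
print(single)
print(pair)
```

The single-sequence result matches the input_ids shown in the class example; the pair result should be checked against the actual method before relying on it.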