tokenizer
- class ErnieDocTokenizer(vocab_file, do_lower_case=True, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', **kwargs)
Constructs an ERNIE-Doc tokenizer. It first applies a basic tokenizer for punctuation splitting, lower casing, and similar preprocessing, then applies a WordPiece tokenizer to split words into subwords.
This tokenizer inherits from ErnieTokenizer. For more information regarding its methods, please refer to that superclass.
- Parameters:
vocab_file (str) -- The vocabulary file path (ending with '.txt') required to instantiate a WordpieceTokenizer.
do_lower_case (bool, optional) -- Whether to lowercase the input when tokenizing. Defaults to True.
unk_token (str, optional) -- A special token representing the unknown (out-of-vocabulary) token. A token that is not in the vocabulary is replaced with unk_token so that it can be converted to an ID. Defaults to "[UNK]".
sep_token (str, optional) -- A special token separating two different sentences in the same input. Defaults to "[SEP]".
pad_token (str, optional) -- A special token used to make arrays of tokens the same size for batching purposes. Defaults to "[PAD]".
cls_token (str, optional) -- A special token used for sequence classification. It is the last token of the sequence when built with special tokens. Defaults to "[CLS]".
mask_token (str, optional) -- A special token representing a masked token. It is used in the masked language modeling task, where the model tries to predict the original token at each masked position. Defaults to "[MASK]".
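The roles of unk_token and pad_token described above can be illustrated with a small, self-contained sketch. It does not use PaddleNLP: the toy vocabulary, the IDs, the helper names tokens_to_ids and pad_batch, and the choice of right-padding are all assumptions made for illustration.

```python
# Toy vocabulary; the tokens and IDs are illustrative assumptions,
# not the contents of any real ERNIE-Doc vocabulary file.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
         "he": 5, "was": 6, "a": 7}


def tokens_to_ids(tokens, vocab=VOCAB, unk_token="[UNK]"):
    """Map tokens to IDs; out-of-vocabulary tokens get the ID of unk_token."""
    unk_id = vocab[unk_token]
    return [vocab.get(tok, unk_id) for tok in tokens]


def pad_batch(batch_ids, vocab=VOCAB, pad_token="[PAD]"):
    """Right-pad every ID list to the length of the longest one (assumed side)."""
    pad_id = vocab[pad_token]
    max_len = max(len(ids) for ids in batch_ids)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]


ids_a = tokens_to_ids(["he", "was", "a", "puppeteer"])  # "puppeteer" is unknown
ids_b = tokens_to_ids(["he", "was"])
batch = pad_batch([ids_a, ids_b])
print(batch)  # [[5, 6, 7, 1], [5, 6, 0, 0]]
```

In this sketch the unknown word maps to the ID of "[UNK]", and the shorter sequence is filled with the ID of "[PAD]" so that both rows have the same length.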
Example
from paddlenlp.transformers import ErnieDocTokenizer
tokenizer = ErnieDocTokenizer.from_pretrained('ernie-doc-base-zh')
encoded_inputs = tokenizer('He was a puppeteer')
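To make the two-stage description above more concrete (a basic tokenizer followed by WordPiece), here is a minimal standalone sketch. It is not the PaddleNLP implementation: the toy vocabulary and the helper names basic_tokenize, wordpiece_tokenize, and tokenize are hypothetical, and the basic step is simplified to lower casing plus whitespace and ASCII punctuation splitting.

```python
import string

# Toy vocabulary (hypothetical); "##" marks a piece that continues a word.
VOCAB = {"he", "was", "a", "pup", "##pet", "##eer", ".", "[UNK]"}


def basic_tokenize(text, do_lower_case=True):
    """Optionally lowercase, then split on whitespace and ASCII punctuation."""
    if do_lower_case:
        text = text.lower()
    tokens, current = [], ""
    for ch in text:
        if ch.isspace():
            if current:
                tokens.append(current)
            current = ""
        elif ch in string.punctuation:
            if current:
                tokens.append(current)
            tokens.append(ch)  # each punctuation mark is its own token
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def wordpiece_tokenize(word, vocab=VOCAB, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text):
    return [p for w in basic_tokenize(text) for p in wordpiece_tokenize(w)]


print(tokenize("He was a puppeteer."))
# ['he', 'was', 'a', 'pup', '##pet', '##eer', '.']
```

With a real vocabulary file the resulting pieces will differ; the sketch only shows the order of the two steps and the longest-match rule.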
- class ErnieDocBPETokenizer(vocab_file, encoder_json_path='./configs/encoder.json', vocab_bpe_path='./configs/vocab.bpe', unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', **kwargs)
Bases: BPETokenizer
Constructs an ERNIE-Doc BPE tokenizer. It uses a BPE tokenizer to do punctuation splitting, lower casing and so on, then tokenizes words into subwords.
This tokenizer inherits from BPETokenizer. For more information regarding its methods, please refer to that superclass.
- Parameters:
vocab_file (str) -- File path of the vocabulary.
encoder_json_path (str, optional) -- File path of the JSON file that maps between IDs and vocabulary tokens. Defaults to './configs/encoder.json'.
vocab_bpe_path (str, optional) -- File path of the text file listing the word merges. Defaults to './configs/vocab.bpe'.
unk_token (str, optional) -- A special token representing the unknown (out-of-vocabulary) token. A token that is not in the vocabulary is replaced with unk_token so that it can be converted to an ID. Defaults to "[UNK]".
sep_token (str, optional) -- A special token separating two different sentences in the same input. Defaults to "[SEP]".
pad_token (str, optional) -- A special token used to make arrays of tokens the same size for batching purposes. Defaults to "[PAD]".
cls_token (str, optional) -- A special token used for sequence classification. It is the last token of the sequence when built with special tokens. Defaults to "[CLS]".
mask_token (str, optional) -- A special token representing a masked token. It is used in the masked language modeling task, where the model tries to predict the original token at each masked position. Defaults to "[MASK]".
Example
from paddlenlp.transformers import ErnieDocBPETokenizer
tokenizer = ErnieDocBPETokenizer.from_pretrained('ernie-doc-base-en')
encoded_inputs = tokenizer('He was a puppeteer')
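The way a merges file is typically applied in BPE tokenization can be sketched without PaddleNLP. The merge rules below are hypothetical stand-ins for the lines of a file such as the one at vocab_bpe_path, the helper name bpe is an assumption, and the sketch works on plain characters rather than the byte-level symbols that real BPE vocabularies often use.

```python
# Hypothetical merge rules in priority order (earlier means higher priority).
MERGES = [("p", "u"), ("pu", "p"), ("e", "e"), ("ee", "r"), ("p", "e")]
RANKS = {pair: rank for rank, pair in enumerate(MERGES)}


def bpe(word, ranks=RANKS):
    """Repeatedly merge the adjacent symbol pair with the lowest rank."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


print(bpe("puppeteer"))  # ['pup', 'pe', 't', 'eer']
```

In a full tokenizer, each resulting symbol would then be looked up in the token-to-ID mapping to produce the input IDs.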
- property vocab_size
Returns the size of the vocabulary.
- Returns:
The size of the vocabulary.
- Return type:
int