tokenizer

class ErnieDocTokenizer(vocab_file, do_lower_case=True, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', **kwargs)[source]

Bases: ErnieTokenizer

Constructs an ERNIE-Doc tokenizer. It uses a basic tokenizer to do punctuation splitting, lower casing and so on, then uses a WordPiece tokenizer to split the resulting tokens into subwords.

This tokenizer inherits from ErnieTokenizer. For more information regarding those methods, please refer to this superclass.

Parameters:
  • vocab_file (str) -- The vocabulary file path (ends with '.txt') required to instantiate a WordpieceTokenizer.

  • do_lower_case (bool, optional) -- Whether or not to lowercase the input when tokenizing. Defaults to `True`.

  • unk_token (str, optional) -- A special token representing an unknown (out-of-vocabulary) token. Tokens that are not in the vocabulary are replaced with unk_token so that they can be converted to an ID. Defaults to "[UNK]".

  • sep_token (str, optional) -- A special token separating two different sentences in the same input. Defaults to "[SEP]".

  • pad_token (str, optional) -- A special token used to make arrays of tokens the same size for batching purposes. Defaults to "[PAD]".

  • cls_token (str, optional) -- A special token used for sequence classification. It is the first token of the sequence when built with special tokens. Defaults to "[CLS]".

  • mask_token (str, optional) -- A special token representing a masked token. This is the token used in the masked language modeling task, in which the model tries to predict the original token that was masked. Defaults to "[MASK]".

Example

from paddlenlp.transformers import ErnieDocTokenizer
tokenizer = ErnieDocTokenizer.from_pretrained('ernie-doc-base-zh')
encoded_inputs = tokenizer('He was a puppeteer')
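The call returns a dict of lists. As a minimal sketch of inspecting that output (the exact IDs depend on the vocabulary that from_pretrained downloads; convert_ids_to_tokens is inherited from the base tokenizer class):

from paddlenlp.transformers import ErnieDocTokenizer
tokenizer = ErnieDocTokenizer.from_pretrained('ernie-doc-base-zh')
# The tokenizer returns a dict; 'input_ids' holds the token IDs with the
# special tokens [CLS] and [SEP] added around the sequence.
encoded_inputs = tokenizer('He was a puppeteer')
print(tokenizer.convert_ids_to_tokens(encoded_inputs['input_ids']))
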
class ErnieDocBPETokenizer(vocab_file, encoder_json_path='./configs/encoder.json', vocab_bpe_path='./configs/vocab.bpe', unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', **kwargs)[source]

Bases: BPETokenizer

Constructs an ERNIE-Doc BPE tokenizer. It uses a BPE tokenizer to do punctuation splitting, lower casing and so on, then tokenizes words into subwords.

This tokenizer inherits from BPETokenizer. For more information regarding those methods, please refer to this superclass.

Parameters:
  • vocab_file (str) -- File path of the vocabulary.

  • encoder_json_path (str, optional) -- File path of the JSON file holding the ID-to-vocab mapping used by the BPE encoder.

  • vocab_bpe_path (str, optional) -- File path of the BPE merges text file.

  • unk_token (str, optional) -- A special token representing an unknown (out-of-vocabulary) token. Tokens that are not in the vocabulary are replaced with unk_token so that they can be converted to an ID. Defaults to "[UNK]".

  • sep_token (str, optional) -- A special token separating two different sentences in the same input. Defaults to "[SEP]".

  • pad_token (str, optional) -- A special token used to make arrays of tokens the same size for batching purposes. Defaults to "[PAD]".

  • cls_token (str, optional) -- A special token used for sequence classification. It is the first token of the sequence when built with special tokens. Defaults to "[CLS]".

  • mask_token (str, optional) -- A special token representing a masked token. This is the token used in the masked language modeling task, in which the model tries to predict the original token that was masked. Defaults to "[MASK]".

Example

from paddlenlp.transformers import ErnieDocBPETokenizer
tokenizer = ErnieDocBPETokenizer.from_pretrained('ernie-doc-base-en')
encoded_inputs = tokenizer('He was a puppeteer')
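To see the BPE subword pieces themselves, tokenize() can be called directly. A minimal sketch (the exact pieces depend on the merges file that from_pretrained downloads):

from paddlenlp.transformers import ErnieDocBPETokenizer
tokenizer = ErnieDocBPETokenizer.from_pretrained('ernie-doc-base-en')
# tokenize() returns the subword strings before they are mapped to IDs,
# so a rare word like 'puppeteer' may be split into several pieces.
print(tokenizer.tokenize('He was a puppeteer'))
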
property vocab_size

Return the size of the vocabulary.

Returns:

The size of the vocabulary.

Return type:

int
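
A short usage sketch (the reported size depends on the vocabulary file loaded by from_pretrained):

from paddlenlp.transformers import ErnieDocBPETokenizer
tokenizer = ErnieDocBPETokenizer.from_pretrained('ernie-doc-base-en')
# vocab_size is a read-only property, not a method, so no parentheses.
print(tokenizer.vocab_size)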