tokenizer#

class ArtistTokenizer(vocab_file, do_lower_case=True, image_vocab_size=16384, do_basic_tokenize=True, never_split=None, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, strip_accents=None, **kwargs)[source]#

Bases: BertTokenizer

Constructs an Artist tokenizer. ArtistTokenizer is almost identical to BertTokenizer.

Parameters:
  • vocab_file (str) -- The vocabulary file path (ends with '.txt') required to instantiate a WordpieceTokenizer.

  • do_lower_case (bool, optional) -- Whether to lowercase the input when tokenizing. Defaults to True.

  • image_vocab_size (int, optional) -- The vocabulary size of the image tokens. Defaults to 16384.

  • do_basic_tokenize (bool, optional) -- Whether to use a basic tokenizer before a WordPiece tokenizer. Defaults to True.

  • never_split (Iterable, optional) -- Collection of tokens which will never be split during tokenization. Only has an effect when do_basic_tokenize=True. Defaults to None.

  • unk_token (str, optional) -- A special token representing an unknown (out-of-vocabulary) token. A token that is not in the vocabulary is set to unk_token so that it can be converted to an ID. Defaults to "[UNK]".

  • sep_token (str, optional) -- A special token separating two different sentences in the same input. Defaults to "[SEP]".

  • pad_token (str, optional) -- A special token used to make arrays of tokens the same size for batching purposes. Defaults to "[PAD]".

  • cls_token (str, optional) -- A special token used for sequence classification. It is the first token of the sequence when built with special tokens. Defaults to "[CLS]".

  • mask_token (str, optional) -- A special token representing a masked token. This is the token used in the masked language modeling task, in which the model tries to predict the original unmasked token. Defaults to "[MASK]".

  • tokenize_chinese_chars (bool, optional) -- Whether to tokenize Chinese characters. Defaults to True.

  • strip_accents (bool, optional) -- Whether to strip all accents. If this option is not specified, it will be determined by the value of do_lower_case (as in the original BERT). Defaults to None.

Example

from paddlenlp.transformers import ArtistTokenizer
tokenizer = ArtistTokenizer.from_pretrained('pai-painter-painting-base-zh')

inputs = tokenizer('风阁水帘今在眼,且来先看早梅红', return_token_type_ids=False)
print(inputs)

'''
{'input_ids': [23983, 23707, 20101, 18750, 17175, 18146, 21090, 24408, 17068,
               19725, 17428, 21076, 19577, 19833, 21657]}
'''
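
Every ID in the example output above is at least 16384, which suggests that text token IDs are shifted past the image-code range so that image codes and text tokens share a single ID space. A minimal sketch of checking this, assuming that inferred offset behavior (it is not a documented guarantee):

from paddlenlp.transformers import ArtistTokenizer

IMAGE_VOCAB_SIZE = 16384  # documented default of image_vocab_size

tokenizer = ArtistTokenizer.from_pretrained('pai-painter-painting-base-zh')
inputs = tokenizer('风阁水帘今在眼,且来先看早梅红', return_token_type_ids=False)

# All text token IDs in the example output sit above 16384; if the offset
# assumption holds, the first 16384 IDs stay reserved for image codes.
print(all(i >= IMAGE_VOCAB_SIZE for i in inputs['input_ids']))  # True
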
build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]#

Build model inputs from a sequence (Artist does not add special tokens).

An Artist sequence has the following format:

  • single sequence: X

Parameters:
  • token_ids_0 (List[int]) -- List of IDs to which the special tokens will be added.

  • token_ids_1 (List[int], optional) -- Optional second list of IDs for sequence pairs. Artist does not use sequence pairs. Defaults to None.

Returns:

List of input IDs.

Return type:

List[int]
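
Because no special tokens are added, the method should return the input list unchanged. A short usage sketch, assuming that passthrough behavior follows directly from the format described above:

from paddlenlp.transformers import ArtistTokenizer

tokenizer = ArtistTokenizer.from_pretrained('pai-painter-painting-base-zh')
token_ids = tokenizer('早梅红', return_token_type_ids=False)['input_ids']

# Single sequence format is just X, so the built inputs should equal the
# original ID list (no [CLS]/[SEP] are inserted).
built = tokenizer.build_inputs_with_special_tokens(token_ids)
print(built == token_ids)  # expected: True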