tokenizer
- class ArtistTokenizer(vocab_file, do_lower_case=True, image_vocab_size=16384, do_basic_tokenize=True, never_split=None, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, strip_accents=None, **kwargs)[source]

Constructs an Artist tokenizer. ArtistTokenizer is almost identical to BertTokenizer.

- Parameters:
vocab_file (str) -- The vocabulary file path (ending in '.txt') required to instantiate a WordpieceTokenizer.
do_lower_case (bool, optional) -- Whether to lowercase the input when tokenizing. Defaults to True.
image_vocab_size (int, optional) -- The vocabulary size of the image tokens. Defaults to 16384.
do_basic_tokenize (bool, optional) -- Whether to use a basic tokenizer before a WordPiece tokenizer. Defaults to True.
never_split (Iterable, optional) -- Collection of tokens which will never be split during tokenization. Only has an effect when do_basic_tokenize=True. Defaults to None.
unk_token (str, optional) -- A special token representing an unknown (out-of-vocabulary) token. An unknown token is mapped to unk_token so that it can be converted to an ID. Defaults to "[UNK]".
sep_token (str, optional) -- A special token separating two different sentences in the same input. Defaults to "[SEP]".
pad_token (str, optional) -- A special token used to make arrays of tokens the same size for batching purposes. Defaults to "[PAD]".
cls_token (str, optional) -- A special token used for sequence classification. It is the last token of the sequence when built with special tokens. Defaults to "[CLS]".
mask_token (str, optional) -- A special token representing a masked token. This is the token used in the masked language modeling task, where the model tries to predict the original unmasked tokens. Defaults to "[MASK]".
tokenize_chinese_chars (bool, optional) -- Whether to tokenize Chinese characters. Defaults to True.
strip_accents (bool, optional) -- Whether to strip all accents. If this option is not specified, it will be determined by the value of lowercase (as in the original BERT). Defaults to None.
Example:

from paddlenlp.transformers import ArtistTokenizer

tokenizer = ArtistTokenizer.from_pretrained('pai-painter-painting-base-zh')
inputs = tokenizer('风阁水帘今在眼,且来先看早梅红', return_token_type_ids=False)
print(inputs)
'''
{'input_ids': [23983, 23707, 20101, 18750, 17175, 18146, 21090, 24408, 17068, 19725, 17428, 21076, 19577, 19833, 21657]}
'''
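All of the ids in the example output are at or above image_vocab_size (16384), which suggests that the text vocabulary is offset so that ids in [0, image_vocab_size) remain free for image codes. The following sketch checks that property; the offset layout is an assumption inferred from the example output, not something the signature guarantees:

from paddlenlp.transformers import ArtistTokenizer

tokenizer = ArtistTokenizer.from_pretrained('pai-painter-painting-base-zh')
inputs = tokenizer('风阁水帘今在眼,且来先看早梅红', return_token_type_ids=False)

# Assumption: ids below image_vocab_size are reserved for image codes,
# so every text token id should sit at or above the offset.
image_vocab_size = 16384  # the default image vocabulary size
assert all(i >= image_vocab_size for i in inputs['input_ids'])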
- build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]

Build model inputs from a sequence (no special tokens are added).

An Artist sequence has the following format:

single sequence: X

- Parameters:
token_ids_0 (List[int]) -- List of IDs to which the special tokens will be added.
token_ids_1 (List[int], optional) -- Optional second list of IDs for sequence pairs. Sequence pairs are not used here. Defaults to None.
- Returns:
List of input_ids.
- Return type:
List[int]
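Since no special tokens are added, the method should return the first id list unchanged. A minimal usage sketch, reusing ids from the example above:

from paddlenlp.transformers import ArtistTokenizer

tokenizer = ArtistTokenizer.from_pretrained('pai-painter-painting-base-zh')
token_ids = [23983, 23707, 20101]  # ids taken from the tokenization example
# No special tokens are added, so the output equals the input list.
assert tokenizer.build_inputs_with_special_tokens(token_ids) == token_ids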