class ArtistTokenizer(vocab_file, do_lower_case=True, image_vocab_size=16384, do_basic_tokenize=True, never_split=None, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, strip_accents=None, **kwargs)

Bases: BertTokenizer

Constructs an Artist tokenizer. ArtistTokenizer is almost identical to BertTokenizer and accepts the following parameters:

  • vocab_file (str) – The vocabulary file path (ends with ‘.txt’) required to instantiate a WordpieceTokenizer.

  • do_lower_case (bool, optional) – Whether to lowercase the input when tokenizing. Defaults to True.

  • image_vocab_size (int, optional) – The vocabulary size of the image tokens. Defaults to 16384.

  • do_basic_tokenize (bool, optional) – Whether to use a basic tokenizer before a WordPiece tokenizer. Defaults to True.

  • never_split (Iterable, optional) – Collection of tokens which will never be split during tokenization. Only has an effect when do_basic_tokenize=True. Defaults to None.

  • unk_token (str, optional) – A special token representing an unknown (out-of-vocabulary) token. An unknown token is set to unk_token in order to be converted to an ID. Defaults to “[UNK]”.

  • sep_token (str, optional) – A special token separating two different sentences in the same input. Defaults to “[SEP]”.

  • pad_token (str, optional) – A special token used to make arrays of tokens the same size for batching purposes. Defaults to “[PAD]”.

  • cls_token (str, optional) – A special token used for sequence classification. It is the first token of the sequence when built with special tokens. Defaults to “[CLS]”.

  • mask_token (str, optional) – A special token representing a masked token. This is the token used in the masked language modeling task, in which the model tries to predict the original unmasked token. Defaults to “[MASK]”.

  • tokenize_chinese_chars (bool, optional) – Whether to tokenize Chinese characters. Defaults to True.

  • strip_accents (bool, optional) – Whether to strip all accents. If this option is not specified, it is determined by the value of do_lower_case (as in the original BERT). Defaults to None.


from paddlenlp.transformers import ArtistTokenizer

tokenizer = ArtistTokenizer.from_pretrained('pai-painter-painting-base-zh')
inputs = tokenizer('风阁水帘今在眼,且来先看早梅红', return_token_type_ids=False)
print(inputs)
# {'input_ids': [23983, 23707, 20101, 18750, 17175, 18146, 21090, 24408, 17068,
#                19725, 17428, 21076, 19577, 19833, 21657]}
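Note that every ID in the example output is larger than the default image_vocab_size (16384). One plausible reading, offered here as an assumption rather than a statement of the PaddleNLP source, is that the first 16384 IDs are reserved for image tokens and text token IDs are shifted past them. A minimal sketch of such an offset scheme (shift_text_ids is a hypothetical helper, not part of the library):

```python
IMAGE_VOCAB_SIZE = 16384  # default image_vocab_size


def shift_text_ids(text_token_ids, image_vocab_size=IMAGE_VOCAB_SIZE):
    # Hypothetical illustration: reserve IDs [0, image_vocab_size) for
    # image tokens and place text token IDs after that range.
    return [tid + image_vocab_size for tid in text_token_ids]


print(shift_text_ids([599, 1323]))  # → [16983, 17707]
```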
build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)

Build model inputs from a sequence (we don’t add special tokens).

An Artist sequence has the following format:

  • single sequence: X

  • token_ids_0 (List[int]) – List of IDs to build the model inputs from (no special tokens are added).

  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs. Sequence pairs are not used here. Defaults to None.


Returns:
List of input_ids.

Return type:
List[int]
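Since an Artist sequence is just “X”, the method’s behavior can be sketched as returning token_ids_0 unchanged. This is a minimal illustration of the documented contract, not the PaddleNLP implementation:

```python
def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    # An Artist sequence has the format "X": no [CLS]/[SEP] wrappers are
    # added, and token_ids_1 is ignored because sequence pairs are not used.
    return list(token_ids_0)


print(build_inputs_with_special_tokens([23983, 23707, 20101]))
# → [23983, 23707, 20101]
```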