tokenizer
- class ArtistTokenizer(vocab_file, do_lower_case=True, image_vocab_size=16384, do_basic_tokenize=True, never_split=None, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, strip_accents=None, **kwargs)
Bases: BertTokenizer
Constructs an Artist tokenizer. ArtistTokenizer is almost identical to BertTokenizer.
- Parameters:
vocab_file (str) – The vocabulary file path (ends with ‘.txt’) required to instantiate a WordpieceTokenizer.
do_lower_case (bool, optional) – Whether to lowercase the input when tokenizing. Defaults to True.
image_vocab_size (int, optional) – The size of the image vocabulary. Defaults to 16384.
do_basic_tokenize (bool, optional) – Whether to use a basic tokenizer before a WordPiece tokenizer. Defaults to True.
never_split (Iterable, optional) – Collection of tokens which will never be split during tokenization. Only has an effect when do_basic_tokenize=True. Defaults to None.
unk_token (str, optional) – A special token representing the unknown (out-of-vocabulary) token. A token that is not in the vocabulary is replaced by unk_token so that it can be converted to an ID. Defaults to “[UNK]”.
sep_token (str, optional) – A special token separating two different sentences in the same input. Defaults to “[SEP]”.
pad_token (str, optional) – A special token used to make arrays of tokens the same size for batching purposes. Defaults to “[PAD]”.
cls_token (str, optional) – A special token used for sequence classification. In BERT-style inputs it is the first token of the sequence when built with special tokens. Defaults to “[CLS]”.
mask_token (str, optional) – A special token representing a masked token. In the masked language modeling task, the model tries to predict the original token at each position replaced by this token. Defaults to “[MASK]”.
tokenize_chinese_chars (bool, optional) – Whether to tokenize Chinese characters. Defaults to True.
strip_accents (bool, optional) – Whether to strip all accents. If this option is not specified, it is determined by the value of do_lower_case (as in the original BERT). Defaults to None.
Examples
from paddlenlp.transformers import ArtistTokenizer

tokenizer = ArtistTokenizer.from_pretrained('pai-painter-painting-base-zh')
inputs = tokenizer('风阁水帘今在眼,且来先看早梅红', return_token_type_ids=False)
print(inputs)
'''
{'input_ids': [23983, 23707, 20101, 18750, 17175, 18146, 21090, 24408, 17068, 19725, 17428, 21076, 19577, 19833, 21657]}
'''
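One observation on the example output: every returned ID is at least 16384, the default image_vocab_size. This would be consistent with text IDs being placed after the image-code range, but that is an inference from the sample output, not behavior stated in this reference. The plain-Python sketch below only checks the numbers shown above; the names IMAGE_VOCAB_SIZE and shifted are illustrative and not part of the library.

```python
# IDs copied verbatim from the example output above.
input_ids = [23983, 23707, 20101, 18750, 17175, 18146, 21090, 24408,
             17068, 19725, 17428, 21076, 19577, 19833, 21657]

# Documented default of the image_vocab_size parameter.
IMAGE_VOCAB_SIZE = 16384

# One ID per input character (14 Chinese characters plus one comma),
# so no [CLS]/[SEP] tokens appear to have been added.
text_len = len(input_ids)

# Assumption (not stated in the docs): text IDs are shifted past the image range.
all_above = all(i >= IMAGE_VOCAB_SIZE for i in input_ids)
shifted = [i - IMAGE_VOCAB_SIZE for i in input_ids]

print(text_len, all_above)
print(min(shifted), max(shifted))
```

If the assumption holds, the values in shifted would be the indices into the text vocabulary file; confirm against the tokenizer source before relying on this.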
- build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)
Build model inputs from a sequence (we don’t add special tokens).
An Artist sequence has the following format:
single sequence:
X
- Parameters:
token_ids_0 (List[int]) – List of IDs to which the special tokens will be added.
token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs. Sequence pairs are not used. Defaults to None.
- Returns:
List of input IDs.
- Return type:
List[int]
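The behavior described for build_inputs_with_special_tokens can be illustrated with a minimal stand-alone sketch. The function build_inputs_sketch below is a hypothetical stand-in, not the library's source: it assumes the method returns the IDs of the single sequence unchanged and ignores token_ids_1, as the description above suggests.

```python
from typing import List, Optional


def build_inputs_sketch(token_ids_0: List[int],
                        token_ids_1: Optional[List[int]] = None) -> List[int]:
    """Hypothetical stand-in: format is just X, with no special tokens added."""
    # token_ids_1 is accepted only to mirror the documented signature;
    # ignoring it is an assumption based on "sequence pairs are not used".
    return list(token_ids_0)


ids = [23983, 23707, 20101]
result = build_inputs_sketch(ids)
print(result == ids)
```

This contrasts with BertTokenizer, which wraps a single sequence as [CLS] X [SEP].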