tokenizer
class GPTTokenizer(vocab_file, merges_file, errors='replace', max_len=None, pad_token='<|endoftext|>', eos_token='<|endoftext|>', unk_token='<|endoftext|>', eol_token='Ċ', add_prefix_space=False, add_bos_token=False, **kwargs)
Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
Constructs a GPT tokenizer based on byte-level Byte-Pair-Encoding.
This tokenizer inherits from PretrainedTokenizer, which contains most of the main methods. For more information regarding those methods, please refer to this superclass.
- Parameters
vocab_file (str) – Path to the vocab file. The vocab file contains a mapping from vocabulary strings to indices.
merges_file (str) – Path to the merges file. The merges file is used to split the input sentence into “subword” units. The vocab file is then used to encode those units as indices.
errors (str) – Paradigm to follow when decoding bytes to UTF-8. Defaults to 'replace'.
max_len (int, optional) – The maximum value of the input sequence length. Defaults to None.
Examples
from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
print(tokenizer('Welcome to use PaddlePaddle and PaddleNLP'))
'''
{'input_ids': [14618, 284, 779, 350, 37382, 47, 37382, 290, 350, 37382, 45, 19930],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
'''
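To make the roles of the merges file and the vocab file concrete, here is a minimal, self-contained sketch of how a ranked merge list and a vocabulary cooperate. The merge rules, the vocabulary, and the helper toy_bpe are invented for illustration and are not the contents of any real GPT vocab or merges file; the actual tokenizer additionally operates on bytes rather than characters.

```python
from typing import Dict, List, Tuple

# Hypothetical merge rules, highest priority first.
MERGES: List[Tuple[str, str]] = [("l", "o"), ("lo", "w"), ("e", "r")]
RANKS: Dict[Tuple[str, str], int] = {pair: i for i, pair in enumerate(MERGES)}

# Hypothetical vocabulary mapping subword strings to indices.
VOCAB: Dict[str, int] = {"low": 0, "er": 1, "l": 2, "o": 3, "w": 4, "e": 5, "r": 6}


def toy_bpe(word: str) -> List[str]:
    """Repeatedly merge the adjacent pair with the best (lowest) rank."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: RANKS.get(p, float("inf")))
        if best not in RANKS:
            break  # no applicable merge rule remains
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


subwords = toy_bpe("lower")          # split using the merge rules
ids = [VOCAB[s] for s in subwords]   # encode the units as indices
print(subwords, ids)
```

The merge list decides where a word is split, and the vocabulary only assigns an index to each resulting unit, which is why both files are needed to construct the tokenizer.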
property vocab_size
Returns the size of vocabulary.
- Returns
The size of the vocabulary plus the number of special tokens.
- Return type
int
convert_ids_to_string(ids)
Converts a single index or a sequence of indices to text.
- Parameters
ids (int|List[int]) – The token id (or token ids) to be converted to text.
- Returns
The decoded text.
- Return type
str
Example
from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
print(tokenizer.convert_ids_to_string([14618, 284, 779, 350, 37382, 47, 37382, 290, 350, 37382, 45, 19930]))
# 'Welcome to use PaddlePaddle and PaddleNLP'
save_resources(save_directory)
Saves the tokenizer resource files (the vocab file and the merges file) under save_directory.
- Parameters
save_directory (str) – Directory to save files into.
convert_tokens_to_string(tokens)
Converts a sequence of tokens (strings) into a single string.
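As an illustration of what joining byte-level BPE tokens involves, the sketch below uses the byte-to-character table from the original GPT-2 encoder, in which a space is rendered as 'Ġ' and a newline as 'Ċ' (the default eol_token above). That this class follows the same scheme internally is an assumption, and tokens_to_string is a hypothetical helper, not a PaddleNLP function.

```python
from typing import Dict, List


def bytes_to_unicode() -> Dict[int, str]:
    """Map each of the 256 byte values to a printable unicode character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted above the Latin-1 range
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Inverse table: printable character -> original byte value
BYTE_DECODER: Dict[str, int] = {ch: b for b, ch in bytes_to_unicode().items()}


def tokens_to_string(tokens: List[str], errors: str = "replace") -> str:
    """Join tokens, map characters back to bytes, then decode as UTF-8."""
    text = "".join(tokens)
    raw = bytes(BYTE_DECODER[ch] for ch in text)
    return raw.decode("utf-8", errors=errors)


print(tokens_to_string(["Welcome", "Ġto", "Ġuse"]))  # Welcome to use
```

The errors argument plays the same role as the constructor parameter of that name: with 'replace', a byte sequence that is not valid UTF-8 becomes the replacement character instead of raising an exception.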
get_vocab()
Returns the vocabulary as a dictionary of token to index.
tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.
- Returns
The vocabulary.
- Return type
Dict[str, int]
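The equivalence stated for get_vocab can be sketched with a small stand-in class. ToyTokenizer and its three-entry vocabulary are hypothetical, and the fallback to the unknown-token id for out-of-vocabulary input is an assumed behaviour used only to show where the two calls differ.

```python
from typing import Dict


class ToyTokenizer:
    """Hypothetical stand-in, not the PaddleNLP class."""

    def __init__(self, vocab: Dict[str, int], unk_token: str) -> None:
        self._vocab = dict(vocab)
        self.unk_token = unk_token

    def get_vocab(self) -> Dict[str, int]:
        # Return a copy so callers cannot mutate the internal table
        return dict(self._vocab)

    def convert_tokens_to_ids(self, token: str) -> int:
        # Assumed behaviour: unknown tokens fall back to the unk_token id
        return self._vocab.get(token, self._vocab[self.unk_token])


toy = ToyTokenizer({"<unk>": 0, "hello": 1, "world": 2}, unk_token="<unk>")
print(toy.get_vocab()["hello"], toy.convert_tokens_to_ids("hello"))
```

For a token outside the vocabulary, indexing the dictionary raises KeyError, whereas the conversion method in this sketch returns the unknown-token id.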
prepare_for_tokenization(text, is_split_into_words=False, **kwargs)
Performs any necessary transformations before tokenization.
This method should pop the arguments it uses from kwargs and return the remaining kwargs as well. We test the kwargs at the end of the encoding process to be sure all the arguments have been used.
- Parameters
text (str) – The text to prepare.
is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace), which it will tokenize. This is useful for NER or token classification.
kwargs – Keyword arguments to use for the tokenization.
- Returns
The prepared text and the unused kwargs.
- Return type
Tuple[str, Dict[str, Any]]
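The pop-and-return contract described above can be sketched as follows. The function name is hypothetical, and treating add_prefix_space as a keyword consumed at this stage is an assumption made for illustration (it appears in the source only as a constructor argument).

```python
from typing import Any, Dict, Tuple


def prepare_for_tokenization_sketch(
    text: str, is_split_into_words: bool = False, **kwargs: Any
) -> Tuple[str, Dict[str, Any]]:
    # Consume the keyword this step understands...
    add_prefix_space = kwargs.pop("add_prefix_space", False)
    if is_split_into_words or add_prefix_space:
        text = " " + text
    # ...and hand back everything that was not used.
    return text, kwargs


prepared, unused = prepare_for_tokenization_sketch(
    "hello", add_prefix_space=True, some_other_option=1
)
print(repr(prepared), unused)
```

Returning the leftover dictionary lets the caller detect misspelled or unsupported options once every stage has removed the keywords it recognizes.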
build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)
Builds model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens.
This implementation does not add special tokens; this method should be overridden in a subclass.
- Parameters
token_ids_0 (List[int]) – The first tokenized sequence.
token_ids_1 (List[int], optional) – The second tokenized sequence.
- Returns
The model input with special tokens.
- Return type
List[int]
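A minimal sketch of the override pattern described above, using stand-in classes rather than the PaddleNLP ones. The class names, the end-of-sequence id 99, and the choice to append that id after each sequence are all hypothetical.

```python
from typing import List, Optional


class BaseSketch:
    """Stand-in for a tokenizer whose default adds no special tokens."""

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        if token_ids_1 is None:
            return list(token_ids_0)
        return list(token_ids_0) + list(token_ids_1)


class WithEosSketch(BaseSketch):
    """Hypothetical subclass that appends an end-of-sequence id."""

    eos_token_id = 99  # invented id, for illustration only

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        output = list(token_ids_0) + [self.eos_token_id]
        if token_ids_1 is not None:
            output += list(token_ids_1) + [self.eos_token_id]
        return output


print(BaseSketch().build_inputs_with_special_tokens([1, 2], [3]))
print(WithEosSketch().build_inputs_with_special_tokens([1, 2], [3]))
```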
class GPTChineseTokenizer(model_file, max_len=512, unk_token='<unk>', bos_token='<bod>', eos_token='<eod>', eol_token='▃', **kwargs)
Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
Constructs a GPT Chinese tokenizer based on SentencePiece.
This tokenizer inherits from PretrainedTokenizer, which contains most of the main methods. For more information regarding those methods, please refer to this superclass.
- Parameters
model_file (str) – The vocabulary (SentencePiece model) file required to instantiate a SentencePiece tokenizer.
max_len (int) – The maximum value of the input sequence length. Defaults to 512.
unk_token (str) – A special token representing the unknown (out-of-vocabulary) token. A token that is not in the vocabulary is mapped to unk_token so that it can be converted to an ID. Defaults to '<unk>'.
Examples
from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer('欢迎使用百度飞桨!'))
'''
{'input_ids': [2092, 260, 1014, 1596, 17620, 45],
'token_type_ids': [0, 0, 0, 0, 0, 0]}
'''
convert_ids_to_tokens(ids)
Converts a single index or a sequence of indices to a token or a sequence of tokens.
- Parameters
ids (int|List[int]|tuple(int)) – The token id (or token ids) to be converted to token(s).
- Returns
The converted token or sequence of tokens.
- Return type
str|List[str]
Example
from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer.convert_ids_to_tokens([2092, 260, 1014, 1596, 17620, 45]))
# ['▁欢迎', '▁使用', '▁百度', '▁飞', '桨', '▁!']
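The int-or-sequence behaviour of convert_ids_to_tokens can be sketched without loading a model. The lookup table below is built only from the six id/token pairs shown in the example above, and convert_ids_to_tokens_sketch is a hypothetical helper, not the library method.

```python
from typing import Dict, List, Sequence, Union

# Only the pairs documented in the example above; a real model has far more.
ID_TO_TOKEN: Dict[int, str] = {
    2092: "▁欢迎",
    260: "▁使用",
    1014: "▁百度",
    1596: "▁飞",
    17620: "桨",
    45: "▁!",
}


def convert_ids_to_tokens_sketch(
    ids: Union[int, Sequence[int]]
) -> Union[str, List[str]]:
    # A single id yields a single token...
    if isinstance(ids, int):
        return ID_TO_TOKEN[ids]
    # ...while a list or tuple yields a list of tokens in the same order.
    return [ID_TO_TOKEN[i] for i in ids]


print(convert_ids_to_tokens_sketch(2092))
print(convert_ids_to_tokens_sketch((1596, 17620)))
```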
property vocab_size
Returns the size of vocabulary.
- Returns
The size of vocabulary.
- Return type
int
Example
from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer.vocab_size)
# 50257
convert_ids_to_string(ids)
Converts a single index or a sequence of indices to text.
- Parameters
ids (int|List[int]) – The token id (or token ids) to be converted to text.
- Returns
The decoded text.
- Return type
str
Example
from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer.convert_ids_to_string([2092, 260, 1014, 1596, 17620, 45]))
# '欢迎使用百度飞桨!'
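The eol_token default '▃' suggests that line breaks are carried through SentencePiece as a placeholder character. The sketch below shows one way such a round trip could work, assuming a CPM-style convention in which '▂' stands for a space and '▃' for a newline. Both helper functions are hypothetical, and whether this class performs exactly these replacements is an assumption.

```python
# Assumed placeholders: '▂' for a space, '▃' for a newline (the documented eol_token).
SPACE_PLACEHOLDER = "\u2582"
EOL_PLACEHOLDER = "\u2583"

_TO_PLACEHOLDERS = str.maketrans({" ": SPACE_PLACEHOLDER, "\n": EOL_PLACEHOLDER})


def protect_whitespace(text: str) -> str:
    """Swap real whitespace for placeholders before tokenization."""
    return text.translate(_TO_PLACEHOLDERS)


def restore_whitespace(decoded: str) -> str:
    """Drop separator spaces between pieces, then restore real whitespace."""
    return (
        decoded.replace(" ", "")
        .replace(SPACE_PLACEHOLDER, " ")
        .replace(EOL_PLACEHOLDER, "\n")
    )


protected = protect_whitespace("a b\nc")
print(protected)
print(repr(restore_whitespace(protected)))
```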