tokenizer
-
class
GPTTokenizer
(vocab_file, merges_file, errors='replace', max_len=None, pad_token='<|endoftext|>', eos_token='<|endoftext|>', unk_token='<|endoftext|>', eol_token='Ċ', **kwargs)
Bases:
paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
Constructs a GPT tokenizer based on byte-level Byte-Pair-Encoding.
This tokenizer inherits from PretrainedTokenizer, which contains most of the main methods. For more information regarding those methods, please refer to this superclass.
- Parameters
vocab_file (str) -- Path to the vocab file. The vocab file contains a mapping from vocabulary strings to indices.
merges_file (str) -- Path to the merges file. The merges file is used to split the input sentence into "subword" units. The vocab file is then used to encode those units as indices.
errors (str) -- Error-handling scheme to follow when decoding bytes to UTF-8. Defaults to 'replace'.
max_len (int, optional) -- The maximum length of the input sequence. Defaults to None.
Examples
from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
print(tokenizer('Welcome to use PaddlePaddle and PaddleNLP'))
'''
{'input_ids': [14618, 284, 779, 350, 37382, 47, 37382, 290, 350, 37382, 45, 19930],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
'''
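How the merges file and the vocab file cooperate can be sketched with a toy byte-pair-encoding loop. This is a minimal illustration, not PaddleNLP's implementation: the bpe function, the three merge rules, and the small vocab dict are all hypothetical, and real GPT-2 files hold tens of thousands of entries over byte-level symbols.

```python
def bpe(word, merge_ranks):
    """Repeatedly merge the adjacent pair with the lowest rank (toy sketch)."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no known merge rule applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical stand-ins for the contents of merges_file and vocab_file.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
merge_ranks = {pair: rank for rank, pair in enumerate(merges)}
vocab = {"low": 0, "er": 1, "l": 2, "o": 3, "w": 4, "e": 5, "r": 6, "lo": 7}

tokens = bpe("lower", merge_ranks)   # split into subword units
ids = [vocab[t] for t in tokens]     # encode the units as indices
print(tokens, ids)
```

Merge rules are applied in file order (lower rank first), so 'l' + 'o' fires before 'lo' + 'w'. Only after no rule applies are the resulting units looked up in the vocab.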
-
property
vocab_size
Returns the size of the vocabulary.
- Returns
The sum of the vocabulary size and the number of special tokens.
- Return type
int
-
convert_ids_to_string
(ids)
Converts a single index or a sequence of indices to text.
- Parameters
ids (int|List[int]) -- The token id (or token ids) to be converted to text.
- Returns
The decoded text.
- Return type
str
Example
from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
print(tokenizer.convert_ids_to_string([14618, 284, 779, 350, 37382, 47, 37382, 290, 350, 37382, 45, 19930]))
# 'Welcome to use PaddlePaddle and PaddleNLP'
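The role of the errors argument during decoding can be illustrated with a standalone sketch of byte-level detokenization. It assumes the byte-to-character table used by the original GPT-2 byte-level BPE (where a space becomes 'Ġ' and a newline becomes 'Ċ', matching the eol_token default above). The helper names bytes_to_unicode and tokens_to_string and the hand-written token list are illustrative assumptions, not PaddleNLP's API or real vocabulary output.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:          # non-printable bytes get shifted code points
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}


def tokens_to_string(tokens, errors="replace"):
    """Join tokens, map characters back to bytes, then decode as UTF-8."""
    raw = bytes(byte_decoder[ch] for ch in "".join(tokens))
    return raw.decode("utf-8", errors=errors)


# Hypothetical tokens; 'Ġ' stands for a leading space.
print(tokens_to_string(["Welcome", "Ġto", "Ġuse"]))

# A lone lead byte (0xC3) is not valid UTF-8 on its own: with
# errors='replace' it decodes to U+FFFD instead of raising.
print(repr(tokens_to_string([byte_encoder[0xC3]])))
```

With errors='strict' the second call would raise UnicodeDecodeError, which is why a lenient default is convenient when decoding a truncated id sequence.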
-
save_resources
(save_directory)
Saves the SentencePiece file (ends with '.spm') under save_directory.
- Parameters
save_directory (str) -- Directory to save files into.
-
class
GPTChineseTokenizer
(model_file, max_len=512, unk_token='<unk>', bos_token='<bod>', eos_token='<eod>', eol_token='▃', **kwargs)
Bases:
paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
Constructs a GPT Chinese tokenizer based on SentencePiece.
This tokenizer inherits from PretrainedTokenizer, which contains most of the main methods. For more information regarding those methods, please refer to this superclass.
- Parameters
model_file (str) -- The vocabulary (model) file required to instantiate a SentencePiece tokenizer.
max_len (int) -- The maximum length of the input sequence. Defaults to 512.
unk_token (str) -- A special token representing an unknown (out-of-vocabulary) token. A token that is not in the vocabulary is replaced with unk_token so that it can be converted to an ID. Defaults to '<unk>'.
Examples
from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer('欢迎使用百度飞桨!'))
'''
{'input_ids': [2092, 260, 1014, 1596, 17620, 45],
'token_type_ids': [0, 0, 0, 0, 0, 0]}
'''
-
convert_ids_to_tokens
(ids)
Converts a single index or a sequence of indices to a token or a sequence of tokens.
- Parameters
ids (int|List[int]|tuple(int)) -- The token id (or token ids) to be converted to token(s).
- Returns
The converted token or sequence of tokens.
- Return type
str|List[str]
Example
from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer.convert_ids_to_tokens([2092, 260, 1014, 1596, 17620, 45]))
# ['▁欢迎', '▁使用', '▁百度', '▁飞', '桨', '▁!']
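The str|List[str] return type follows from the input type: a single id yields one token, a sequence yields a list. A minimal standalone sketch of that dispatch is shown here. The id_to_token dict is a hypothetical three-entry excerpt built from the example output above; in the real tokenizer the lookup is backed by the SentencePiece model.

```python
# Hypothetical excerpt of the id-to-token table (taken from the example above).
id_to_token = {2092: "\u2581欢迎", 260: "\u2581使用", 45: "\u2581!"}


def convert_ids_to_tokens(ids):
    """Return one token for an int, or a list of tokens for a sequence."""
    if isinstance(ids, int):
        return id_to_token[ids]
    return [id_to_token[i] for i in ids]


print(convert_ids_to_tokens(2092))
print(convert_ids_to_tokens((2092, 260, 45)))
```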
-
property
vocab_size
Returns the size of the vocabulary.
- Returns
The size of the vocabulary.
- Return type
int
Example
from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer.vocab_size)
# 50257
-
convert_ids_to_string
(ids)
Converts a single index or a sequence of indices to text.
- Parameters
ids (int|List[int]) -- The token id (or token ids) to be converted to text.
- Returns
The decoded text.
- Return type
str
Example
from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer.convert_ids_to_string([2092, 260, 1014, 1596, 17620, 45]))
# '欢迎使用百度飞桨!'
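The step from SentencePiece tokens to the final string can be sketched without loading a model. The sketch assumes a CPM-style convention in which the '▁' word-boundary markers are dropped for Chinese text, '▂' (U+2582) stands for a real space, and '▃' (U+2583, the eol_token in the signature above) stands for a newline. The space placeholder and the detokenize helper are assumptions for illustration, not confirmed details of PaddleNLP's implementation.

```python
def detokenize(tokens):
    """Join SentencePiece-style tokens into text (assumed CPM conventions)."""
    # SentencePiece marks word starts with U+2581, which decodes to a space.
    text = "".join(tokens).replace("\u2581", " ")
    # Assumption: segmentation spaces are dropped, then real spaces and
    # newlines are restored from their placeholder characters.
    return text.replace(" ", "").replace("\u2582", " ").replace("\u2583", "\n")


tokens = ["\u2581欢迎", "\u2581使用", "\u2581百度", "\u2581飞", "桨", "\u2581!"]
print(detokenize(tokens))
```

Under these assumptions the tokens from the convert_ids_to_tokens example collapse back to the original sentence, with no spaces between the Chinese words.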