tokenizer
class GPTTokenizer(vocab_file, merges_file, errors='replace', max_len=None, pad_token='<|endoftext|>', eos_token='<|endoftext|>', unk_token='<|endoftext|>', eol_token='Ċ', add_prefix_space=False, add_bos_token=False, **kwargs)

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
Constructs a GPT tokenizer based on byte-level Byte-Pair-Encoding.
This tokenizer inherits from PretrainedTokenizer, which contains most of the main methods. For more information regarding those methods, please refer to this superclass.

Parameters
vocab_file (str) -- Path to the vocab file. The vocab file contains a mapping from vocabulary strings to indices.
merges_file (str) -- Path to the merges file. The merges file is used to split the input text into "subword" units. The vocab file is then used to encode those units as indices.
errors (str) -- The error handling scheme to follow when decoding bytes to UTF-8. Defaults to 'replace'.
max_len (int, optional) -- The maximum length of the input sequence. Defaults to None.
Examples
from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
print(tokenizer('Welcome to use PaddlePaddle and PaddleNLP'))
'''
{'input_ids': [14618, 284, 779, 350, 37382, 47, 37382, 290, 350, 37382, 45, 19930],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
'''
property vocab_size

Returns the size of the vocabulary.

Returns
    The sum of the size of the vocabulary and the number of special tokens.

Return type
    int
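Examples

A short usage sketch; the printed value depends on the loaded vocabulary (for instance, the standard GPT-2 vocabulary has 50257 entries):

from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
# Prints the number of entries in the loaded vocabulary
print(tokenizer.vocab_size)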
convert_ids_to_string(ids)

Converts a single index or a sequence of indices to text.

Parameters
    ids (int|List[int]) -- The token id (or token ids) to be converted to text.

Returns
    The decoded text.

Return type
    str
Examples
from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
print(tokenizer.convert_ids_to_string([14618, 284, 779, 350, 37382, 47, 37382, 290, 350, 37382, 45, 19930]))
# 'Welcome to use PaddlePaddle and PaddleNLP'
save_resources(save_directory)

Saves the tokenizer's resource files (the vocabulary and merges files) under save_directory.

Parameters
    save_directory (str) -- Directory to save files into.
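Examples

A minimal usage sketch; the directory name is illustrative, and it is created here so the save succeeds:

import os

from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')

save_dir = './gpt2_tokenizer'  # hypothetical path for illustration
os.makedirs(save_dir, exist_ok=True)
# Writes the tokenizer's resource files into save_dir
tokenizer.save_resources(save_dir)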
get_vocab()

Returns the vocabulary as a dictionary of token to index.

tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

Returns
    The vocabulary.

Return type
    Dict[str, int]
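Examples

A short sketch of the documented equivalence; the token 'Welcome' is taken from the encoding example above, where it maps to id 14618:

from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
vocab = tokenizer.get_vocab()

# For a token present in the vocab, both lookups agree
token = 'Welcome'
assert vocab[token] == tokenizer.convert_tokens_to_ids(token)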
prepare_for_tokenization(text, is_split_into_words=False, **kwargs)

Performs any necessary transformations before tokenization.

This method should pop its arguments from kwargs and return the remaining kwargs as well. We test the kwargs at the end of the encoding process to be sure all the arguments have been used.

Parameters
    text (str) -- The text to prepare.
    is_split_into_words (bool, optional, defaults to False) -- Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace), which it will tokenize. This is useful for NER or token classification.
    kwargs -- Keyword arguments to use for the tokenization.

Returns
    The prepared text and the unused kwargs.

Return type
    Tuple[str, Dict[str, Any]]
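Examples

A minimal sketch of the call; the exact transformation applied to the text (for example, whether a prefix space is added) depends on the tokenizer's configuration:

from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
text, unused_kwargs = tokenizer.prepare_for_tokenization('Welcome to use PaddleNLP')
print(text)           # the (possibly transformed) text that will be tokenized
print(unused_kwargs)  # kwargs this method did not consume, an empty dict here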
build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)

Builds model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens.

This implementation does not add special tokens and should be overridden in a subclass.

Parameters
    token_ids_0 (List[int]) -- The first tokenized sequence.
    token_ids_1 (List[int], optional) -- The second tokenized sequence.

Returns
    The model input with special tokens.

Return type
    List[int]
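Examples

Since this implementation adds no special tokens, a pair of sequences should come back as their plain concatenation; a sketch:

from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
ids_0 = tokenizer('Welcome to use PaddleNLP')['input_ids']
ids_1 = tokenizer('and PaddlePaddle')['input_ids']

# No special tokens are inserted, so the result is ids_0 followed by ids_1
print(tokenizer.build_inputs_with_special_tokens(ids_0, ids_1))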
class GPTChineseTokenizer(model_file, max_len=512, unk_token='<unk>', bos_token='<bod>', eos_token='<eod>', eol_token='▃', **kwargs)

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
Constructs a GPT Chinese tokenizer based on SentencePiece.
This tokenizer inherits from PretrainedTokenizer, which contains most of the main methods. For more information regarding those methods, please refer to this superclass.

Parameters
model_file (str) -- The SentencePiece model file required to instantiate the tokenizer.
max_len (int) -- The maximum length of the input sequence. Defaults to 512.
unk_token (str) -- A special token representing the unknown (out-of-vocabulary) token. An unknown token is set to unk_token in order to be converted to an ID. Defaults to "<unk>".
Examples
from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer('欢迎使用百度飞桨!'))
'''
{'input_ids': [2092, 260, 1014, 1596, 17620, 45],
 'token_type_ids': [0, 0, 0, 0, 0, 0]}
'''
convert_ids_to_tokens(ids)

Converts a single index or a sequence of indices to a token or a sequence of tokens.

Parameters
    ids (int|List[int]|tuple(int)) -- The token id (or token ids) to be converted to token(s).

Returns
    The converted token or sequence of tokens.

Return type
    str|List[str]
Examples
from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer.convert_ids_to_tokens([2092, 260, 1014, 1596, 17620, 45]))
# ['▁欢迎', '▁使用', '▁百度', '▁飞', '桨', '▁!']
property vocab_size

Returns the size of the vocabulary.

Returns
    The size of vocabulary.

Return type
    int
Examples
from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
# Prints the size of this tokenizer's SentencePiece vocabulary
print(tokenizer.vocab_size)
convert_ids_to_string(ids)

Converts a single index or a sequence of indices to text.

Parameters
    ids (int|List[int]) -- The token id (or token ids) to be converted to text.

Returns
    The decoded text.

Return type
    str
Examples
from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer.convert_ids_to_string([2092, 260, 1014, 1596, 17620, 45]))
# '欢迎使用百度飞桨!'