tokenizer

class GPTTokenizer(vocab_file, merges_file, errors='replace', max_len=None, special_tokens=None, pad_token='<|endoftext|>', eos_token='<|endoftext|>', eol_token='Ċ')[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

Constructs a GPT tokenizer based on byte-level Byte-Pair-Encoding.

This tokenizer inherits from PretrainedTokenizer which contains most of the main methods. For more information regarding those methods, please refer to this superclass.

Parameters
  • vocab_file (str) -- Path to the vocab file. The vocab file contains a mapping from vocabulary strings to indices.

  • merges_file (str) -- Path to the merge file. The merge file is used to split the input sentence into "subword" units. The vocab file is then used to encode those units as indices.

  • errors (str) -- Paradigm to follow when decoding bytes to UTF-8. Defaults to 'replace'.

  • max_len (int, optional) -- The maximum value of the input sequence length. Defaults to None.

  • special_tokens (list, optional) -- A list of special tokens not in the vocabulary. Defaults to None.

Examples

from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
print(tokenizer('Welcome to use PaddlePaddle and PaddleNLP'))

'''
{'input_ids': [14618, 284, 779, 350, 37382, 47, 37382, 290, 350, 37382, 45, 19930],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
'''
property vocab_size

Returns the size of vocabulary.

Returns

The sum of the vocabulary size and the number of special tokens.

Return type

int
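
A minimal usage sketch; the printed value depends on the loaded vocabulary plus any special tokens that have been added (50257, the standard GPT-2 vocabulary size, is only an assumption here):

from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
# Size of the loaded vocabulary plus any special tokens added afterwards.
print(tokenizer.vocab_size)
# 50257 (assumed; the actual value depends on the loaded vocab)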

set_special_tokens(special_tokens)[source]

Adds a list of additional tokens to the encoder. The additional tokens are indexed starting from the last index of the current vocabulary, in the order of the special_tokens list.
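
A minimal sketch; '<sep>' and '<cls>' are purely illustrative tokens assumed not to be in the pretrained vocabulary:

from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
old_size = tokenizer.vocab_size
# The new tokens are appended after the existing vocabulary, in list order.
tokenizer.set_special_tokens(['<sep>', '<cls>'])
print(tokenizer.vocab_size - old_size)
# 2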

tokenize(text)[source]

Converts a string to a list of tokens.

Parameters

text (str) -- The text to be tokenized.

Returns

A list of strings representing the converted tokens.

Return type

List[str]

Example

from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
print(tokenizer.tokenize('Welcome to use PaddlePaddle and PaddleNLP'))
# ['Welcome', 'Ġto', 'Ġuse', 'ĠP', 'addle', 'P', 'addle', 'Ġand', 'ĠP', 'addle', 'N', 'LP']
convert_tokens_to_ids(tokens)[source]

Converts a single token or a sequence of tokens to an index or a sequence of indices using the vocab.

Parameters

tokens (str|List[str]|tuple(str)) -- A single token or a sequence of tokens.

Returns

The converted token id or token ids.

Return type

int|List[int]

Example

from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
print(tokenizer.convert_tokens_to_ids(['Welcome', 'Ġto', 'Ġuse', 'ĠP', 'addle', 'P', 'addle', 'Ġand', 'ĠP', 'addle', 'N', 'LP']))
# [14618, 284, 779, 350, 37382, 47, 37382, 290, 350, 37382, 45, 19930]
convert_ids_to_tokens(ids, skip_special_tokens=False)[source]

Converts an index or a sequence of indices to a token or a sequence of tokens.

Parameters
  • ids (int|List[int]) -- The token id (or token ids) to be converted to token(s).

  • skip_special_tokens (bool, optional) -- Whether or not to skip the special tokens. Defaults to False, which means we don't skip the special tokens.

Returns

The converted token or the sequence of tokens.

Return type

str|List[str]

Example

from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
print(tokenizer.convert_ids_to_tokens([14618, 284, 779, 350, 37382, 47, 37382, 290, 350, 37382, 45, 19930]))
# ['Welcome', 'Ġto', 'Ġuse', 'ĠP', 'addle', 'P', 'addle', 'Ġand', 'ĠP', 'addle', 'N', 'LP']
convert_ids_to_string(ids)[source]

Converts a single index or a sequence of indices to text.

Parameters

ids (int|List[int]) -- The token id (or token ids) to be converted to text.

Returns

The decoded text.

Return type

str

Example

from paddlenlp.transformers import GPTTokenizer
tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
print(tokenizer.convert_ids_to_string([14618, 284, 779, 350, 37382, 47, 37382, 290, 350, 37382, 45, 19930]))
# 'Welcome to use PaddlePaddle and PaddleNLP'
save_resources(save_directory)[source]

Saves the tokenizer resource files (the vocab file and the merges file) under save_directory.

Parameters

save_directory (str) -- Directory to save files into.
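
A minimal sketch; 'my_gpt_tokenizer' is an arbitrary directory name used only for illustration:

import os
from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
save_dir = 'my_gpt_tokenizer'
os.makedirs(save_dir, exist_ok=True)
# Writes the tokenizer resource files into save_dir.
tokenizer.save_resources(save_dir)
print(os.listdir(save_dir))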

class GPTChineseTokenizer(model_file, max_len=512, unk_token='<unk>', bos_token='<bod>', eos_token='<eod>', eol_token='▃')[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

Constructs a GPT Chinese tokenizer based on SentencePiece.

This tokenizer inherits from PretrainedTokenizer which contains most of the main methods. For more information regarding those methods, please refer to this superclass.

Parameters
  • model_file (str) -- Path to the SentencePiece model file required to instantiate the tokenizer.

  • max_len (int) -- The maximum value of the input sequence length. Defaults to 512.

  • unk_token (str) -- A special token representing the unknown (out-of-vocabulary) token. An unknown token is set to be unk_token in order to be converted to an ID. Defaults to "<unk>".

Examples

from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer('欢迎使用百度飞桨!'))
'''
{'input_ids': [2092, 260, 1014, 1596, 17620, 45], 'token_type_ids': [0, 0, 0, 0, 0, 0]}
'''
tokenize(text)[source]

Converts a string to a list of tokens.

Parameters

text (str) -- The text to be tokenized.

Returns

A list of strings representing the converted tokens.

Return type

List[str]

Example

from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer.tokenize('欢迎使用百度飞桨!'))
# ['▁欢迎', '▁使用', '▁百度', '▁飞', '桨', '▁!']
convert_tokens_to_ids(tokens)[source]

Converts a single token or a sequence of tokens to an index or a sequence of indices.

Parameters

tokens (str|List[str]|tuple(str)) -- A single token or a sequence of tokens.

Returns

The converted token id or token ids.

Return type

int|List[int]

Example

from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer.convert_tokens_to_ids(['▁欢迎', '▁使用', '▁百度', '▁飞', '桨', '▁!']))
# [2092, 260, 1014, 1596, 17620, 45]
convert_ids_to_tokens(ids)[source]

Converts a single index or a sequence of indices to a token or a sequence of tokens.

Parameters

ids (int|List[int]|tuple(int)) -- The token id (or token ids) to be converted to token(s).

Returns

The converted token or sequence of tokens.

Return type

str|List[str]

Example

from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer.convert_ids_to_tokens([2092, 260, 1014, 1596, 17620, 45]))
# ['▁欢迎', '▁使用', '▁百度', '▁飞', '桨', '▁!']
property vocab_size

Returns the size of vocabulary.

Returns

The size of the vocabulary.

Return type

int

Example

from paddlenlp.transformers import GPTChineseTokenizer
tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer.vocab_size)
# 50257
convert_ids_to_string(ids)[source]

Converts a single index or a sequence of indices to text.

Parameters

ids (int|List[int]) -- The token id (or token ids) to be converted to text.

Returns

The decoded text.

Return type

str

Example

from paddlenlp.transformers import GPTChineseTokenizer
tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer.convert_ids_to_string([2092, 260, 1014, 1596, 17620, 45]))
# '欢迎使用百度飞桨!'
save_resources(save_directory)[source]

Saves tokenizer-related resource files (the SentencePiece model file) under save_directory.

Parameters

save_directory (str) -- Directory to save files into.
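
A minimal sketch; 'my_cpm_tokenizer' is an arbitrary directory name used only for illustration:

import os
from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
save_dir = 'my_cpm_tokenizer'
os.makedirs(save_dir, exist_ok=True)
# Copies the tokenizer resource file(s) (the SentencePiece model) into save_dir.
tokenizer.save_resources(save_dir)
print(os.listdir(save_dir))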