tokenizer

class GPTTokenizer(vocab_file, merges_file, errors='replace', max_len=None, pad_token='<|endoftext|>', eos_token='<|endoftext|>', unk_token='<|endoftext|>', eol_token='Ċ', **kwargs)[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

Constructs a GPT tokenizer based on byte-level Byte-Pair-Encoding.

This tokenizer inherits from PretrainedTokenizer which contains most of the main methods. For more information regarding those methods, please refer to this superclass.

Parameters
  • vocab_file (str) – Path to the vocab file. The vocab file contains a mapping from vocabulary strings to indices.

  • merges_file (str) – Path to the merges file. The merges file is used to split the input sentence into “subword” units. The vocab file is then used to encode those units as indices.

  • errors (str) – How decoding errors are handled when converting bytes to UTF-8. Defaults to 'replace'.

  • max_len (int, optional) – The maximum value of the input sequence length. Defaults to None.

  • pad_token (str) – A special token used for padding. Defaults to '<|endoftext|>'.

  • eos_token (str) – A special token representing the end of a sequence. Defaults to '<|endoftext|>'.

  • unk_token (str) – A special token representing an unknown (out-of-vocabulary) token. Defaults to '<|endoftext|>'.

  • eol_token (str) – The byte-level BPE representation of the end-of-line character. Defaults to 'Ċ'.

Examples

from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
print(tokenizer('Welcome to use PaddlePaddle and PaddleNLP'))

'''
{'input_ids': [14618, 284, 779, 350, 37382, 47, 37382, 290, 350, 37382, 45, 19930],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
'''
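Because the tokenizer works at the byte level, a word-initial space is folded into the token that follows it and shows up as the 'Ġ' marker. A minimal sketch of this behavior using the inherited tokenize method; the exact subword split shown below is inferred from the input_ids above and should be treated as an assumption:

from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
# Word-initial spaces appear as the byte-level marker 'Ġ' in the subword pieces.
print(tokenizer.tokenize('Welcome to use PaddlePaddle and PaddleNLP'))
# Expected to look like:
# ['Welcome', 'Ġto', 'Ġuse', 'ĠP', 'addle', 'P', 'addle', 'Ġand', 'ĠP', 'addle', 'N', 'LP']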
property vocab_size

Returns the size of the vocabulary.

Returns

The sum of the vocabulary size and the number of special tokens.

Return type

int
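
Example

A short usage sketch; the printed value assumes the standard GPT-2 vocabulary of 50257 entries for 'gpt2-medium-en'.

from paddlenlp.transformers import GPTTokenizer
tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
print(tokenizer.vocab_size)
# 50257 (assumed: the standard GPT-2 vocabulary size)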

convert_ids_to_string(ids)[source]

Converts a single index or a sequence of indices to text.

Parameters

ids (int|List[int]) – The token id (or token ids) to be converted to text.

Returns

The decoded text.

Return type

str

Example

from paddlenlp.transformers import GPTTokenizer
tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
print(tokenizer.convert_ids_to_string([14618, 284, 779, 350, 37382, 47, 37382, 290, 350, 37382, 45, 19930]))
# 'Welcome to use PaddlePaddle and PaddleNLP'
save_resources(save_directory)[source]

Saves the vocabulary file and the merges file under save_directory.

Parameters

save_directory (str) – Directory to save files into.
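
Example

A minimal usage sketch; the target directory name is illustrative and is created explicitly here, since the docstring does not state whether save_resources creates it.

import os
from paddlenlp.transformers import GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
os.makedirs('./gpt2_tokenizer', exist_ok=True)  # hypothetical target directory
tokenizer.save_resources('./gpt2_tokenizer')
# The saved files can later be reloaded with
# GPTTokenizer.from_pretrained('./gpt2_tokenizer')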

class GPTChineseTokenizer(model_file, max_len=512, unk_token='<unk>', bos_token='<bod>', eos_token='<eod>', eol_token='▃', **kwargs)[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

Constructs a GPT Chinese tokenizer based on SentencePiece.

This tokenizer inherits from PretrainedTokenizer which contains most of the main methods. For more information regarding those methods, please refer to this superclass.

Parameters
  • model_file (str) – Path to the SentencePiece model file required to instantiate the tokenizer.

  • max_len (int) – The maximum value of the input sequence length. Defaults to 512.

  • unk_token (str) – A special token representing the unknown (out-of-vocabulary) token. An out-of-vocabulary token is set to unk_token so that it can be converted to an ID. Defaults to '<unk>'.

  • bos_token (str) – A special token representing the beginning of a sequence. Defaults to '<bod>'.

  • eos_token (str) – A special token representing the end of a sequence. Defaults to '<eod>'.

  • eol_token (str) – A special token representing the end-of-line character. Defaults to '▃'.

Examples

from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer('欢迎使用百度飞桨!'))
'''
{'input_ids': [2092, 260, 1014, 1596, 17620, 45], 'token_type_ids': [0, 0, 0, 0, 0, 0]}
'''
convert_ids_to_tokens(ids)[source]

Converts a single index or a sequence of indices to a token or a sequence of tokens.

Parameters

ids (int|List[int]|tuple(int)) – The token id (or token ids) to be converted to token(s).

Returns

The converted token or sequence of tokens.

Return type

str|List[str]

Example

from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer.convert_ids_to_tokens([2092, 260, 1014, 1596, 17620, 45]))
#['▁欢迎', '▁使用', '▁百度', '▁飞', '桨', '▁!']
property vocab_size

Returns the size of the vocabulary.

Returns

The size of the vocabulary.

Return type

int

Example

from paddlenlp.transformers import GPTChineseTokenizer
tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer.vocab_size)
# 50257
convert_ids_to_string(ids)[source]

Converts a single index or a sequence of indices to text.

Parameters

ids (int|List[int]) – The token id (or token ids) to be converted to text.

Returns

The decoded text.

Return type

str

Example

from paddlenlp.transformers import GPTChineseTokenizer
tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
print(tokenizer.convert_ids_to_string([2092, 260, 1014, 1596, 17620, 45]))
# '欢迎使用百度飞桨!'
save_resources(save_directory)[source]

Saves tokenizer-related resources to files under save_directory.

Parameters

save_directory (str) – Directory to save files into.
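
Example

A minimal usage sketch, mirroring save_resources above; the directory name is illustrative.

import os
from paddlenlp.transformers import GPTChineseTokenizer

tokenizer = GPTChineseTokenizer.from_pretrained('gpt-cpm-large-cn')
os.makedirs('./cpm_tokenizer', exist_ok=True)  # hypothetical target directory
tokenizer.save_resources('./cpm_tokenizer')  # saves the SentencePiece model file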