tokenizer

class CTRLTokenizer(vocab_file, merges_file, max_len=None, unk_token='<unk>', **kwargs)[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

Constructs a CTRL tokenizer based on Byte-Pair-Encoding.

This tokenizer inherits from PretrainedTokenizer which contains most of the main methods. For more information regarding those methods, please refer to this superclass.

Parameters
  • vocab_file (str) – Path to the vocab file. The vocab file contains a mapping from vocabulary strings to indices.

  • merges_file (str) – Path to the merge file. The merge file is used to split the input sentence into “subword” units. The vocab file is then used to encode those units as indices.

  • max_len (int, optional) – The maximum value of the input sequence length. Defaults to None.

  • unk_token (str) – A special token representing the unknown (out-of-vocabulary) token. Tokens that are not found in the vocabulary are mapped to unk_token so that they can still be converted to an ID. Defaults to “<unk>”.
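
Example

A minimal construction sketch; it assumes the 'ctrl' pretrained resources are downloadable via from_pretrained, as in the method examples below.

from paddlenlp.transformers import CTRLTokenizer

# Downloads and caches the vocab and merges files for the 'ctrl' model,
# then builds the tokenizer from them.
tokenizer = CTRLTokenizer.from_pretrained('ctrl')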

property vocab_size

Size of the base vocabulary (without the added tokens).

Type

int
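
Example

A short sketch; the printed value depends on the vocab file that was loaded, so no specific number is assumed here.

from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
# Number of entries in the base vocab, excluding any added tokens.
print(tokenizer.vocab_size)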

get_vocab()[source]

Returns the vocabulary as a dictionary of token to index.

tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

Returns

The vocabulary.

Return type

Dict[str, int]
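
Example

A sketch of the documented equivalence, assuming the 'ctrl' vocabulary is loaded and the token is in the vocab.

from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
vocab = tokenizer.get_vocab()
# For an in-vocab token, the dict lookup and convert_tokens_to_ids agree.
assert vocab['Welcome'] == tokenizer.convert_tokens_to_ids('Welcome')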

tokenize(text)[source]

Converts a string to a list of tokens.

Parameters

text (str) – The text to be tokenized.

Returns

A list of strings representing the converted tokens.

Return type

List[str]

Example

from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
print(tokenizer.tokenize('Welcome to use PaddlePaddle and PaddleNLP'))
# ['Welcome', 'to', 'use', 'Padd@@', 'le@@', 'Padd@@', 'le', 'and', 'Padd@@', 'le@@', 'N@@', 'LP']
convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (list of string) to a single string.

Parameters

tokens (List[str]) – A sequence of tokens.

Returns

Converted string.

Return type

str

Example

from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
print(tokenizer.convert_tokens_to_string(['Welcome', 'to', 'use', 'Padd@@', 'le@@', 'Padd@@', 'le', 'and', 'Padd@@', 'le@@', 'N@@', 'LP']))
# 'Welcome to use PaddlePaddle and PaddleNLP'
convert_tokens_to_ids(tokens)[source]

Converts a single token or a sequence of tokens to an index or a sequence of indices using the vocab.

Parameters

tokens (str|List[str]|tuple(str)) – A single token or a sequence of tokens.

Returns

The converted token id or token ids.

Return type

int|List[int]

Example

from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
print(tokenizer.convert_tokens_to_ids(['Welcome', 'to', 'use', 'Padd@@', 'le@@', 'Padd@@', 'le', 'and', 'Padd@@', 'le@@', 'N@@', 'LP']))
# [41116, 3, 191, 40324, 1162, 40324, 992, 2, 40324, 1162, 633, 11135]
convert_ids_to_tokens(ids, skip_special_tokens=False)[source]

Converts an index or a sequence of indices to a single token or a sequence of tokens.

Parameters
  • ids (int|List[int]) – The token id (or token ids) to be converted to text.

  • skip_special_tokens (bool, optional) – Whether or not to skip the special tokens. Defaults to False, which means special tokens are not skipped.

Returns

The converted token or the sequence of tokens.

Return type

str|List[str]

Example

from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
print(tokenizer.convert_ids_to_tokens([41116, 3, 191, 40324, 1162, 40324, 992, 2, 40324, 1162, 633, 11135]))
# ['Welcome', 'to', 'use', 'Padd@@', 'le@@', 'Padd@@', 'le', 'and', 'Padd@@', 'le@@', 'N@@', 'LP']
save_resources(save_directory)[source]

Saves tokenizer-related resources to files under save_directory.

Parameters

save_directory (str) – Directory to save files into.
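
Example

A minimal sketch; 'ctrl_tokenizer' is a hypothetical directory name, and reloading from a local directory follows the usual from_pretrained convention.

import os

from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
os.makedirs('ctrl_tokenizer', exist_ok=True)
# Writes the vocab and merges files into the directory.
tokenizer.save_resources('ctrl_tokenizer')
# The saved files can then be loaded back from the same directory.
reloaded = CTRLTokenizer.from_pretrained('ctrl_tokenizer')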