tokenizer#

class CTRLTokenizer(vocab_file, merges_file, max_len=None, unk_token='<unk>', **kwargs)[source]#

Bases: PretrainedTokenizer

Constructs a CTRL tokenizer based on byte-level Byte-Pair-Encoding.

This tokenizer inherits from PretrainedTokenizer which contains most of the main methods. For more information regarding those methods, please refer to this superclass.

Parameters:
  • vocab_file (str) – Path to the vocab file. The vocab file contains a mapping from vocabulary strings to indices.

  • merges_file (str) – Path to the merge file. The merge file is used to split the input sentence into “subword” units. The vocab file is then used to encode those units as indices.

  • max_len (int, optional) – The maximum length of the input sequence. Defaults to None.

  • unk_token (str) – A special token representing an unknown (out-of-vocabulary) token. Tokens not found in the vocabulary are mapped to unk_token so that they can still be converted to an ID. Defaults to “<unk>”.
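
Example

A minimal construction sketch; the local file paths below are hypothetical, while the 'ctrl' identifier passed to from_pretrained matches the examples later in this section:

from paddlenlp.transformers import CTRLTokenizer

# Load a pretrained tokenizer together with its vocab and merges files
tokenizer = CTRLTokenizer.from_pretrained('ctrl')
# Or build one directly from local resources (hypothetical paths)
tokenizer = CTRLTokenizer(vocab_file='./vocab.json', merges_file='./merges.txt')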

property vocab_size#

Size of the base vocabulary (without the added tokens).

Type:

int
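
Example

A quick sketch of reading the property; the printed value depends on the loaded vocab, so no concrete number is shown here:

from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
# Number of tokens in the base vocabulary, excluding added tokens
print(tokenizer.vocab_size)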

get_vocab()[source]#

Returns the vocabulary as a dictionary of token to index.

tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

Returns:

The vocabulary.

Return type:

Dict[str, int]
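
Example

A short sketch of the equivalence noted above, reusing the 'Welcome' token (id 41116) from the examples below:

from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
vocab = tokenizer.get_vocab()
# For in-vocab tokens, get_vocab() and convert_tokens_to_ids agree
assert vocab['Welcome'] == tokenizer.convert_tokens_to_ids('Welcome')
print(vocab['Welcome'])
# 41116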

tokenize(text)[source]#

Converts a string to a list of tokens.

Parameters:

text (str) – The text to be tokenized.

Returns:

A list of strings representing the converted tokens.

Return type:

List[str]

Example

from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
print(tokenizer.tokenize('Welcome to use PaddlePaddle and PaddleNLP'))
# ['Welcome', 'to', 'use', 'Padd@@', 'le@@', 'Padd@@', 'le', 'and', 'Padd@@', 'le@@', 'N@@', 'LP']

convert_tokens_to_string(tokens)[source]#

Converts a sequence of tokens (list of string) to a single string.

Parameters:

tokens (List[str]) – A sequence of tokens.

Returns:

Converted string.

Return type:

str

Example

from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
print(tokenizer.convert_tokens_to_string(['Welcome', 'to', 'use', 'Padd@@', 'le@@', 'Padd@@', 'le', 'and', 'Padd@@', 'le@@', 'N@@', 'LP']))
# 'Welcome to use PaddlePaddle and PaddleNLP'

convert_tokens_to_ids(tokens)[source]#

Converts a single token or a sequence of tokens to an index or a sequence of indices using the vocab.

Parameters:

tokens (str|List[str]|tuple(str)) – A single token or a sequence of tokens.

Returns:

The converted token id or token ids.

Return type:

int|List[int]

Example

from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
print(tokenizer.convert_tokens_to_ids(['Welcome', 'to', 'use', 'Padd@@', 'le@@', 'Padd@@', 'le', 'and', 'Padd@@', 'le@@', 'N@@', 'LP']))
# [41116, 3, 191, 40324, 1162, 40324, 992, 2, 40324, 1162, 633, 11135]

convert_ids_to_tokens(ids, skip_special_tokens=False)[source]#

Converts an index or a sequence of indices to a single token or a sequence of tokens.

Parameters:
  • ids (int|List[int]) – The token id (or token ids) to be converted to text.

  • skip_special_tokens (bool, optional) – Whether or not to skip the special tokens. Defaults to False, which means we don’t skip the special tokens.

Returns:

The converted token or the sequence of tokens.

Return type:

str|List[str]

Example

from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
print(tokenizer.convert_ids_to_tokens([41116, 3, 191, 40324, 1162, 40324, 992, 2, 40324, 1162, 633, 11135]))
# ['Welcome', 'to', 'use', 'Padd@@', 'le@@', 'Padd@@', 'le', 'and', 'Padd@@', 'le@@', 'N@@', 'LP']

save_resources(save_directory)[source]#

Saves tokenizer-related resources to files under save_directory.

Parameters:

save_directory (str) – Directory to save files into.
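
Example

A minimal sketch; the save directory name is hypothetical, and os.makedirs is used here only to make sure the directory exists before saving:

from paddlenlp.transformers import CTRLTokenizer
import os

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
save_dir = './ctrl_tokenizer'  # hypothetical directory
os.makedirs(save_dir, exist_ok=True)  # ensure the directory exists
# Writes the tokenizer's vocab and merges files under save_dir
tokenizer.save_resources(save_dir)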