tokenizer

class CTRLTokenizer(vocab_file, merges_file, max_len=None, unk_token='<unk>', **kwargs)[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

Constructs a CTRL tokenizer based on Byte-Pair-Encoding.

This tokenizer inherits from PretrainedTokenizer which contains most of the main methods. For more information regarding those methods, please refer to this superclass.

Parameters
  • vocab_file (str) -- Path to the vocab file. The vocab file contains a mapping from vocabulary strings to indices.

  • merges_file (str) -- Path to the merge file. The merge file is used to split the input sentence into "subword" units. The vocab file is then used to encode those units as indices.

  • max_len (int, optional) -- The maximum value of the input sequence length. Defaults to None.

  • unk_token (str) -- A special token representing the unknown (out-of-vocabulary) token. A token that is not in the vocabulary is replaced by unk_token so that it can be converted to an ID. Defaults to "<unk>".
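How the merge file and the vocab file work together can be illustrated with a small, self-contained sketch. It is not the library's implementation: the merge rules, the vocabulary, and the helper names `get_pairs` and `toy_bpe` are hypothetical, and the "@@" suffix is assumed to mark a piece that continues into the next one, as in the tokenize example on this page.

```python
def get_pairs(symbols):
    """Return the set of adjacent symbol pairs in a word."""
    return set(zip(symbols, symbols[1:]))


def toy_bpe(word, merge_ranks):
    """Split one word into subword units using ranked merge rules."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = get_pairs(symbols)
        # Pick the adjacent pair with the best (lowest) merge rank.
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no applicable merge rule remains
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Mark every piece except the last with the "@@" continuation suffix.
    return [s + "@@" for s in symbols[:-1]] + symbols[-1:]


# Hypothetical merge rules, listed from highest to lowest priority.
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
merge_ranks = {pair: rank for rank, pair in enumerate(toy_merges)}

# Hypothetical vocabulary mapping subword units to indices.
toy_vocab = {"<unk>": 0, "low@@": 1, "er": 2, "low": 3}

pieces = toy_bpe("lower", merge_ranks)
ids = [toy_vocab.get(p, toy_vocab["<unk>"]) for p in pieces]
print(pieces, ids)  # ['low@@', 'er'] [1, 2]
```

The merge rules decide where a word is cut, and the vocabulary only comes into play afterwards, when each resulting piece is looked up.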

property vocab_size

Size of the base vocabulary (without the added tokens).

Type

int

get_vocab()[source]

Returns the vocabulary as a dictionary of token to index.

tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

Returns

The vocabulary.

Return type

Dict[str, int]
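The equivalence holds only for tokens that are in the vocabulary. The sketch below uses a hypothetical three-entry vocabulary and a hypothetical helper, `convert_token_to_id`, that mimics the unk_token fallback described for the constructor. It is an illustration under those assumptions, not the library's code.

```python
# Hypothetical vocabulary; a real one is loaded from vocab_file.
toy_vocab = {"<unk>": 0, "Welcome": 1, "to": 2}


def convert_token_to_id(token, vocab, unk_token="<unk>"):
    """Look up a token, falling back to the unknown token's id."""
    return vocab.get(token, vocab[unk_token])


# In-vocabulary token: both lookups agree.
assert toy_vocab["to"] == convert_token_to_id("to", toy_vocab)

# Out-of-vocabulary token: the conversion falls back to <unk>,
# whereas indexing the dictionary directly raises KeyError.
print(convert_token_to_id("PaddleNLP", toy_vocab))  # 0
try:
    toy_vocab["PaddleNLP"]
except KeyError:
    print("not in vocab")
```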

tokenize(text)[source]

Converts a string to a list of tokens.

Parameters

text (str) -- The text to be tokenized.

Returns

A list of strings representing the converted tokens.

Return type

List[str]

Example

from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
print(tokenizer.tokenize('Welcome to use PaddlePaddle and PaddleNLP'))
# ['Welcome', 'to', 'use', 'Padd@@', 'le@@', 'Padd@@', 'le', 'and', 'Padd@@', 'le@@', 'N@@', 'LP']

convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (list of string) to a single string.

Parameters

tokens (List[str]) -- A sequence of tokens.

Returns

Converted string.

Return type

str

Example

from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
print(tokenizer.convert_tokens_to_string(['Welcome', 'to', 'use', 'Padd@@', 'le@@', 'Padd@@', 'le', 'and', 'Padd@@', 'le@@', 'N@@', 'LP']))
# 'Welcome to use PaddlePaddle and PaddleNLP'
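The result above is consistent with a simple rule: join the tokens with spaces, then delete every "@@ " marker so that continuation pieces attach to the piece that follows. A minimal sketch of that rule, using a hypothetical helper name and no dependency on PaddleNLP:

```python
def join_subword_tokens(tokens):
    """Join tokens with spaces, then remove each '@@ ' continuation marker."""
    return " ".join(tokens).replace("@@ ", "").strip()


tokens = ['Welcome', 'to', 'use', 'Padd@@', 'le@@', 'Padd@@', 'le',
          'and', 'Padd@@', 'le@@', 'N@@', 'LP']
print(join_subword_tokens(tokens))
# Welcome to use PaddlePaddle and PaddleNLP
```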

convert_tokens_to_ids(tokens)[source]

Converts a single token or a sequence of tokens to an index or a sequence of indices using the vocab.

Parameters

tokens (str|List[str]|tuple(str)) -- A single token or a sequence of tokens.

Returns

The converted token id or token ids.

Return type

int|List[int]

Example

from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
print(tokenizer.convert_tokens_to_ids(['Welcome', 'to', 'use', 'Padd@@', 'le@@', 'Padd@@', 'le', 'and', 'Padd@@', 'le@@', 'N@@', 'LP']))
# [41116, 3, 191, 40324, 1162, 40324, 992, 2, 40324, 1162, 633, 11135]
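The return type depends on the input: a single token yields a single index, while a list or tuple yields a list of indices. The following stand-in function sketches that dispatch. The vocabulary reuses a few ids from the example above, but the id assigned to "<unk>" is a hypothetical placeholder, and the function is an illustration rather than the library's implementation.

```python
def toy_convert_tokens_to_ids(tokens, vocab, unk_token="<unk>"):
    """Map one token or a sequence of tokens to ids, with an <unk> fallback."""
    unk_id = vocab[unk_token]
    if isinstance(tokens, str):
        # A single token returns a single int.
        return vocab.get(tokens, unk_id)
    # A list or tuple returns a list of ints.
    return [vocab.get(token, unk_id) for token in tokens]


# Ids for the subword pieces come from the example above;
# the id used for "<unk>" is hypothetical.
toy_vocab = {"<unk>": 0, "Padd@@": 40324, "le@@": 1162, "le": 992}

print(toy_convert_tokens_to_ids("le", toy_vocab))                  # 992
print(toy_convert_tokens_to_ids(["Padd@@", "le@@"], toy_vocab))    # [40324, 1162]
print(toy_convert_tokens_to_ids(("Padd@@", "missing"), toy_vocab)) # [40324, 0]
```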

convert_ids_to_tokens(ids, skip_special_tokens=False)[source]

Converts an index or a sequence of indices to a single token or a sequence of tokens.

Parameters
  • ids (int|List[int]) -- The token id (or token ids) to be converted to text.

  • skip_special_tokens (bool, optional) -- Whether or not to skip the special tokens. Defaults to False, which means we don't skip the special tokens.

Returns

The converted token or the sequence of tokens.

Return type

str|List[str]

Example

from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
print(tokenizer.convert_ids_to_tokens([41116, 3, 191, 40324, 1162, 40324, 992, 2, 40324, 1162, 633, 11135]))
# ['Welcome', 'to', 'use', 'Padd@@', 'le@@', 'Padd@@', 'le', 'and', 'Padd@@', 'le@@', 'N@@', 'LP']
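The effect of skip_special_tokens can be sketched with a stand-in function. The assumptions here are loud ones: "<unk>" is treated as the only special token, its id is a hypothetical placeholder, and the function name and the reverse mapping are illustrative rather than taken from the library.

```python
def toy_convert_ids_to_tokens(ids, id_to_token, special_tokens=("<unk>",),
                              skip_special_tokens=False):
    """Map one id or a sequence of ids back to tokens."""
    if isinstance(ids, int):
        # A single id returns a single token.
        return id_to_token[ids]
    tokens = [id_to_token[i] for i in ids]
    if skip_special_tokens:
        # Drop every token that is registered as special.
        tokens = [t for t in tokens if t not in special_tokens]
    return tokens


# Hypothetical reverse mapping; the id 0 for "<unk>" is an assumption.
id_to_token = {0: "<unk>", 40324: "Padd@@", 992: "le"}

print(toy_convert_ids_to_tokens(992, id_to_token))
# le
print(toy_convert_ids_to_tokens([40324, 0, 992], id_to_token))
# ['Padd@@', '<unk>', 'le']
print(toy_convert_ids_to_tokens([40324, 0, 992], id_to_token,
                                skip_special_tokens=True))
# ['Padd@@', 'le']
```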

save_resources(save_directory)[source]

Save tokenizer related resources to files under save_directory.

Parameters

save_directory (str) -- Directory to save files into.