tokenizer#
- class CTRLTokenizer(vocab_file, merges_file, max_len=None, unk_token='<unk>', **kwargs)[source]#
Constructs a CTRL tokenizer based on byte-level Byte-Pair-Encoding.
This tokenizer inherits from PretrainedTokenizer, which contains most of the main methods. For more information regarding those methods, please refer to this superclass.
- Parameters:
vocab_file (str) -- Path to the vocab file. The vocab file contains a mapping from vocabulary strings to indices.
merges_file (str) -- Path to the merge file. The merge file is used to split the input sentence into "subword" units. The vocab file is then used to encode those units as indices.
max_len (int, optional) -- The maximum value of the input sequence length. Defaults to None.
unk_token (str) -- A special token representing the unknown (out-of-vocabulary) token. An out-of-vocabulary token is set to unk_token in order to be converted to an ID. Defaults to "<unk>".
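A minimal instantiation sketch, for illustration: loading via from_pretrained('ctrl') matches the method examples below, while the local file paths in the commented line are placeholders, not files shipped with the library.
from paddlenlp.transformers import CTRLTokenizer

# Load the tokenizer from the pretrained 'ctrl' checkpoint.
tokenizer = CTRLTokenizer.from_pretrained('ctrl')

# Alternatively, construct it from local files (placeholder paths):
# tokenizer = CTRLTokenizer(vocab_file='vocab.json', merges_file='merges.txt')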
- property vocab_size#
Size of the base vocabulary (without the added tokens).
- Type:
int
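For example (a minimal sketch; the printed value depends on the loaded vocabulary and is not asserted here):
from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
# Number of tokens in the base vocabulary, excluding added tokens.
print(tokenizer.vocab_size)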
- get_vocab()[source]#
Returns the vocabulary as a dictionary of token to index.
tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.
- Returns:
The vocabulary.
- Return type:
Dict[str, int]
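A short sketch of the equivalence noted above; the token 'Welcome' and its id 41116 are taken from the convert_tokens_to_ids example later in this section.
from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
vocab = tokenizer.get_vocab()
# For a token in the vocab, get_vocab()[token] equals convert_tokens_to_ids(token).
print(vocab['Welcome'])  # 41116
print(vocab['Welcome'] == tokenizer.convert_tokens_to_ids('Welcome'))  # True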
- tokenize(text)[source]#
Converts a string to a list of tokens.
- Parameters:
text (str) -- The text to be tokenized.
- Returns:
A list of strings representing the converted tokens.
- Return type:
List[str]
Example
from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
print(tokenizer.tokenize('Welcome to use PaddlePaddle and PaddleNLP'))
# ['Welcome', 'to', 'use', 'Padd@@', 'le@@', 'Padd@@', 'le', 'and', 'Padd@@', 'le@@', 'N@@', 'LP']
- convert_tokens_to_string(tokens)[source]#
Converts a sequence of tokens (list of string) to a single string.
- Parameters:
tokens (List[str]) -- A sequence of tokens.
- Returns:
The converted string.
- Return type:
str
Example
from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
print(tokenizer.convert_tokens_to_string(['Welcome', 'to', 'use', 'Padd@@', 'le@@', 'Padd@@', 'le', 'and', 'Padd@@', 'le@@', 'N@@', 'LP']))
# 'Welcome to use PaddlePaddle and PaddleNLP'
- convert_tokens_to_ids(tokens)[source]#
Converts a single token or a sequence of tokens to an index or a sequence of indices using the vocab.
- Parameters:
tokens (str|List[str]|tuple(str)) -- A single token or a sequence of tokens.
- Returns:
The converted token id or token ids.
- Return type:
int|List[int]
Example
from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
print(tokenizer.convert_tokens_to_ids(['Welcome', 'to', 'use', 'Padd@@', 'le@@', 'Padd@@', 'le', 'and', 'Padd@@', 'le@@', 'N@@', 'LP']))
# [41116, 3, 191, 40324, 1162, 40324, 992, 2, 40324, 1162, 633, 11135]
- convert_ids_to_tokens(ids, skip_special_tokens=False)[source]#
Converts an index or a sequence of indices to a single token or a sequence of tokens.
- Parameters:
ids (int|List[int]) -- The token id (or token ids) to be converted to text.
skip_special_tokens (bool, optional) -- Whether or not to skip the special tokens. Defaults to False, which means the special tokens are not skipped.
- Returns:
The converted token or the sequence of tokens.
- Return type:
str|List[str]
Example
from paddlenlp.transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained('ctrl')
print(tokenizer.convert_ids_to_tokens([41116, 3, 191, 40324, 1162, 40324, 992, 2, 40324, 1162, 633, 11135]))
# ['Welcome', 'to', 'use', 'Padd@@', 'le@@', 'Padd@@', 'le', 'and', 'Padd@@', 'le@@', 'N@@', 'LP']