tokenizer#

class BlenderbotSmallTokenizer(vocab_file, merges_file, errors='replace', max_len=None, special_tokens=None, bos_token='__start__', eos_token='__end__', unk_token='__unk__', pad_token='__null__', eol_token='__newln__', **kwargs)[source]#

Bases: GPTTokenizer

Constructs a BlenderbotSmall tokenizer based on Byte-Pair-Encoding.

This tokenizer inherits from GPTTokenizer, which contains most of the main methods. Please refer to the superclass for more information regarding those methods.

Parameters:

        vocab_file (str) – File path of the vocabulary.

        merges_file (str) – File path of the merges file.

        errors (str) – The method to handle errors in decoding. Default: "replace".

        max_len (int) – The specified maximum sequence length. Default: None.

        special_tokens (dict) – The additional special tokens. Default: None.

        bos_token (str) – The special token for the beginning of a sequence. Default: "__start__".

        eos_token (str) – The special token for the end of a sequence. Default: "__end__".

        unk_token (str) – The special token for unknown tokens. Default: "__unk__".

        pad_token (str) – The special token for padding. Default: "__null__".

        eol_token (str) – The special token for newline. Default: "__newln__".

Examples

bpe(token)[source]#

Applies Byte-Pair-Encoding to a token. The BPE process in BlenderbotSmall differs from that of Blenderbot.

Parameters:

        token (str) – The token to be converted.

Returns:

Converted token.

Return type:

str
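To illustrate the idea behind this method, here is a minimal, self-contained sketch of greedy byte-pair merging. It is not the actual BlenderbotSmall implementation (which loads its merge ranks from merges_file and applies extra normalization); the function name and the tiny MERGE_RANKS table below are hypothetical, for demonstration only.

```python
# Hypothetical merge table: lower rank = merged earlier.
# The real ranks come from the tokenizer's merges_file.
MERGE_RANKS = {("h", "e"): 0, ("l", "l"): 1, ("he", "ll"): 2, ("hell", "o"): 3}

def bpe_sketch(token):
    """Greedily merge the best-ranked adjacent symbol pair until none apply."""
    word = list(token)
    while len(word) > 1:
        # Rank every adjacent pair; unknown pairs get an infinite rank.
        pairs = [(MERGE_RANKS.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(word, word[1:]))]
        rank, i = min(pairs)
        if rank == float("inf"):
            break  # no known merge applies
        word[i:i + 2] = [word[i] + word[i + 1]]
    # "@@ " marks sub-word continuation, as in BlenderbotSmall's BPE output.
    return "@@ ".join(word)
```

With this table, a fully mergeable token collapses to a single piece (`bpe_sketch("hello")` gives `"hello"`), while a token with no applicable merges past the first stays split into sub-words joined by the `"@@ "` continuation marker (`bpe_sketch("help")` gives `"he@@ l@@ p"`).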

convert_tokens_to_string(tokens)[source]#

Converts a sequence of tokens (list of strings) into a single string.

Parameters:

        tokens (list[str]) – A sequence of tokens.

Returns:

Converted string.

Return type:

str
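The conversion essentially reverses the BPE split: sub-word tokens carry a "@@" continuation marker, so joining on spaces and dropping the marker restores whole words. A minimal sketch, assuming that convention (the function name is illustrative, not part of the API):

```python
def convert_tokens_to_string_sketch(tokens):
    # Join tokens with spaces, then remove the "@@ " BPE continuation
    # marker so sub-word pieces are glued back into whole words.
    return " ".join(tokens).replace("@@ ", "").strip()
```

For example, `convert_tokens_to_string_sketch(["he@@", "llo", "world"])` yields `"hello world"`.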

convert_ids_to_string(ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)[source]#

Converts a sequence of ids (list of integers) into a single string.

Parameters:

        ids (list[int]) – A sequence of ids corresponding to tokens.

        skip_special_tokens (bool, optional) – Whether to skip special tokens and not decode them when converting. Default: True.

        clean_up_tokenization_spaces (bool, optional) – Whether to clean up simple English tokenization artifacts, such as spaces before punctuation and abbreviated forms. Default: True.

Returns:

Converted string.

Return type:

str
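The steps this method performs can be sketched in plain Python: map ids back to tokens, optionally drop special tokens, undo the BPE split, and optionally clean up tokenization spaces. The mini vocabulary, the function name, and the list of cleaned punctuation patterns below are hypothetical stand-ins for what the real tokenizer loads from its vocab file and applies internally.

```python
# Hypothetical mini vocabulary; real ids come from the tokenizer's vocab_file.
ID_TO_TOKEN = {0: "__start__", 1: "__end__", 2: "hel@@", 3: "lo", 4: "world", 5: "."}
SPECIAL_TOKENS = {"__start__", "__end__", "__unk__", "__null__"}

def convert_ids_to_string_sketch(ids, skip_special_tokens=True,
                                 clean_up_tokenization_spaces=True):
    tokens = [ID_TO_TOKEN[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    # Undo the BPE split ("@@ " continuation marker), then the space join.
    text = " ".join(tokens).replace("@@ ", "").strip()
    if clean_up_tokenization_spaces:
        # Remove spaces that tokenization inserted before punctuation.
        for p in [" .", " ,", " !", " ?"]:
            text = text.replace(p, p.strip())
    return text
```

For example, `convert_ids_to_string_sketch([0, 2, 3, 4, 5, 1])` returns `"hello world."`: the `__start__` and `__end__` tokens are skipped, `hel@@`/`lo` are rejoined, and the space before the period is cleaned up.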