tokenizer
- class BlenderbotSmallTokenizer(vocab_file, merges_file, errors='replace', max_len=None, special_tokens=None, bos_token='__start__', eos_token='__end__', unk_token='__unk__', pad_token='__null__', eol_token='__newln__', **kwargs)
- Bases:
GPTTokenizer
Constructs a BlenderbotSmall tokenizer based on Byte-Pair-Encoding.
This tokenizer inherits from
GPTTokenizer
, which contains most of the main methods. Please refer to the superclass for more information regarding those methods.
- Parameters:
vocab_file (str): File path of the vocabulary.
merges_file (str): File path of the merges file.
errors (str): How to handle errors in decoding. Defaults to "replace".
max_len (int): The specified maximum sequence length. Defaults to None.
special_tokens (dict): Additional special tokens. Defaults to None.
bos_token (str): The beginning-of-sequence token. Defaults to "__start__".
eos_token (str): The end-of-sequence token. Defaults to "__end__".
unk_token (str): The token for unknown tokens. Defaults to "__unk__".
pad_token (str): The token used for padding. Defaults to "__null__".
eol_token (str): The token for newlines. Defaults to "__newln__".
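The roles of the special tokens above can be sketched with a toy vocabulary. The vocabulary, the `encode` helper, and the `max_len` value below are illustrative assumptions, not the library's implementation; the real tokenizer builds its vocabulary from `vocab_file` and `merges_file`:

```python
# Toy sketch: how bos/eos/unk/pad tokens frame and pad a sequence.
# The vocabulary and helper below are assumptions for illustration only.
VOCAB = {"__start__": 0, "__end__": 1, "__unk__": 2, "__null__": 3,
         "__newln__": 4, "hello": 5, "world": 6}

def encode(words, max_len=8):
    ids = [VOCAB["__start__"]]                                # bos_token
    ids += [VOCAB.get(w, VOCAB["__unk__"]) for w in words]    # unk_token fallback
    ids.append(VOCAB["__end__"])                              # eos_token
    ids += [VOCAB["__null__"]] * (max_len - len(ids))         # pad_token padding
    return ids
```

An out-of-vocabulary word falls back to the `__unk__` id, and the sequence is padded with `__null__` up to `max_len`.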
- bpe(token)
Apply Byte-Pair-Encoding to a token. The BPE process in BlenderbotSmall differs from that in Blenderbot.
- Parameters:
token (str): The token to be converted.
- Returns:
The converted token.
- Return type:
str
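A minimal sketch of the greedy merge loop that BPE performs. The merges table, its priority scheme, and the "@@" continuation marker are assumptions for illustration, not the library's exact algorithm:

```python
def bpe(token, merges):
    """Greedily merge adjacent pairs; ``merges`` maps a pair of
    subwords to its priority (a lower number is merged earlier)."""
    word = list(token)
    while len(word) > 1:
        # Rank every adjacent pair; unmergeable pairs get infinite rank.
        ranked = [(merges.get((a, b), float("inf")), i)
                  for i, (a, b) in enumerate(zip(word, word[1:]))]
        rank, i = min(ranked)
        if rank == float("inf"):
            break  # no mergeable pair left
        word[i:i + 2] = [word[i] + word[i + 1]]
    # Mark every subword except the last with the "@@" continuation marker.
    return " ".join([w + "@@" for w in word[:-1]] + [word[-1]])
```

For example, with `merges = {("l", "l"): 0, ("e", "ll"): 1}`, the token `"hello"` is split into the subwords `h@@ ell@@ o`.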
- convert_tokens_to_string(tokens)
Converts a sequence of tokens (a list of strings) into a single string.
- Parameters:
tokens (list[str]): A sequence of tokens.
- Returns:
The converted string.
- Return type:
str
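Assuming "@@"-style continuation markers from the BPE step (an assumption about the subword format, not the library's exact code), the joining logic can be sketched as:

```python
def convert_tokens_to_string(tokens):
    # Join subword tokens with spaces, then remove the "@@ " continuation
    # markers so neighbouring subwords fuse back into whole words.
    return " ".join(tokens).replace("@@ ", "").strip()
```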
- convert_ids_to_string(ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
Converts a sequence of ids (a list of integers) into a single string.
- Parameters:
ids (list[int]): A sequence of ids corresponding to tokens.
skip_special_tokens (bool, optional): Whether to skip (not decode) special tokens when converting. Defaults to True.
clean_up_tokenization_spaces (bool, optional): Whether to clean up simple English tokenization artifacts, such as spaces before punctuation and abbreviated forms. Defaults to True.
- Returns:
The converted string.
- Return type:
str
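The id-to-string path, including the effect of `skip_special_tokens`, can be sketched with a toy id-to-token table. The table, the special-token set, and the "@@" marker handling are assumptions for illustration:

```python
# Toy sketch of id -> string decoding with optional special-token skipping.
ID_TO_TOKEN = {0: "__start__", 1: "__end__", 2: "__unk__", 3: "__null__",
               4: "hel@@", 5: "lo", 6: "world"}
SPECIAL = {"__start__", "__end__", "__unk__", "__null__"}

def convert_ids_to_string(ids, skip_special_tokens=True):
    tokens = [ID_TO_TOKEN[i] for i in ids]
    if skip_special_tokens:
        # Drop bos/eos/unk/pad tokens before joining.
        tokens = [t for t in tokens if t not in SPECIAL]
    return " ".join(tokens).replace("@@ ", "").strip()
```

With the default `skip_special_tokens=True`, the framing tokens disappear from the output; with `False`, they are decoded verbatim.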