tokenizer#

class BlenderbotTokenizer(vocab_file, merges_file, errors='replace', max_len=None, special_tokens=None, bos_token='<s>', eos_token='</s>', cls_token='<s>', sep_token='</s>', pad_token='<pad>', unk_token='<unk>', mask_token='<mask>', eol_token='Ċ', add_prefix=True, **kwargs)[source]#

Bases: GPTTokenizer

Construct a Blenderbot tokenizer, derived from the GPT tokenizer, using byte-level Byte-Pair-Encoding.

This tokenizer inherits from GPTTokenizer, which contains most of the main methods. Please refer to the superclass for more information regarding those methods.

Parameters:
  • vocab_file (str) -- File path of the vocabulary.

  • merges_file (str) -- File path of the merges file.

  • errors (str) -- The method used to handle errors in decoding. Default: "replace".

  • max_len (int) -- The specified maximum sequence length. Default: None.

  • special_tokens (dict) -- The additional special tokens. Default: None.

  • bos_token (str) -- The special token for the beginning of a sequence. Default: "<s>".

  • eos_token (str) -- The special token for the end of a sequence. Default: "</s>".

  • cls_token (str) -- The special token for cls. Default: "<s>".

  • sep_token (str) -- The special token for the separator. Default: "</s>".

  • pad_token (str) -- The special token for padding. Default: "<pad>".

  • unk_token (str) -- The special token for unknown tokens. Default: "<unk>".

  • mask_token (str) -- The special token for masking. Default: "<mask>".

  • eol_token (str) -- The special token for newline. Default: "Ċ" (U+010A).

  • add_prefix (bool) -- Whether or not to add an initial space to the input. This allows the leading word to be treated just like any other word. Blenderbot adds an initial space when tokenizing the input text, which is different from BlenderbotSmall. Default: True.

Examples
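
A minimal usage sketch is shown below; the pretrained name 'blenderbot-400M-distill' is an assumption for illustration and may differ in your installation:

from paddlenlp.transformers import BlenderbotTokenizer

# Load a pretrained tokenizer. The pretrained name below is an assumption for
# illustration; substitute any Blenderbot weights available in your environment.
tokenizer = BlenderbotTokenizer.from_pretrained('blenderbot-400M-distill')

# Encode a sentence. Because add_prefix defaults to True, an initial space is
# prepended before byte-level BPE tokenization.
encoded = tokenizer("My friends are cool but they eat too many carbs.")
print(encoded['input_ids'])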

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]#

A Blenderbot sequence has the following format:

- single sequence: ``X </s>``
Parameters:
  • token_ids_0 (List[int]) -- List of IDs to which the special tokens will be added.

  • token_ids_1 (List[int], optional) -- Will be ignored.

Returns:

List of input IDs with the appropriate special tokens.

Return type:

List[int]
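
As a hedged sketch (reusing the tokenizer instance from the example above), the method simply appends the end-of-sequence token to produce the ``X </s>`` format:

# IDs for a single sequence, without special tokens.
ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello there"))

# Add the special tokens; per the format above, the result is expected to equal
# ids + [tokenizer.eos_token_id], and any token_ids_1 argument is ignored.
ids_with_special = tokenizer.build_inputs_with_special_tokens(ids)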

prepare_for_tokenization(text, is_split_into_words=False, **kwargs)[source]#

Performs any necessary transformations before tokenization.

This method should pop the arguments from kwargs and return the remaining kwargs as well. We test the kwargs at the end of the encoding process to be sure all the arguments have been used.

Parameters:
  • text (str) -- The text to prepare.

  • is_split_into_words (bool, optional, defaults to False) -- Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) which it will tokenize. This is useful for NER or token classification.

  • kwargs -- Keyword arguments to use for the tokenization.

Returns:

The prepared text and the unused kwargs.

Return type:

Tuple[str, Dict[str, Any]]
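
A small sketch of calling this hook directly; the leading-space note assumes the tokenizer was constructed with the default add_prefix=True:

# Returns the transformed text together with any kwargs that were not consumed.
prepared_text, unused_kwargs = tokenizer.prepare_for_tokenization("Hello there", is_split_into_words=False)

# With add_prefix=True, prepared_text is expected to start with a space so the
# leading word is treated like any other word.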