tokenizer_utils#

class PretrainedTokenizer(**kwargs)[source]#

Bases: ChatTemplateMixin, PretrainedTokenizerBase

Base class for all tokenizers.

Inherits from [PretrainedTokenizerBase].

Handles all the shared methods for tokenization and special tokens, as well as methods for downloading/caching/loading pretrained tokenizers and for adding tokens to the vocabulary.

This class also contains the added tokens in a unified way on top of all tokenizers, so we don't have to handle the specific vocabulary augmentation methods of the various underlying dictionary structures (BPE, SentencePiece, ...).

  • resource_files_names (Dict[str, str]) -- A dictionary with, as keys, the __init__ keyword name of each vocabulary file required by the model, and as associated values, the filename for saving the associated file (string).

  • pretrained_resource_files_map (Dict[str, Dict[str, str]]) -- A dictionary of dictionaries, with the high-level keys being the __init__ keyword name of each vocabulary file required by the model, the low-level keys being the short-cut-names of the pretrained models with, as associated values, the url to the associated pretrained vocabulary file.

  • max_model_input_sizes (Dict[str, Optional[int]]) -- A dictionary with, as keys, the short-cut-names of the pretrained models, and as associated values, the maximum length of the sequence inputs of this model, or None if the model has no maximum input size.

  • pretrained_init_configuration (Dict[str, Dict[str, Any]]) -- A dictionary with, as keys, the short-cut-names of the pretrained models, and as associated values, a dictionary of specific arguments to pass to the __init__ method of the tokenizer class for this pretrained model when loading the tokenizer with the [from_pretrained] method.

  • model_input_names (List[str]) -- A list of inputs expected in the forward pass of the model.

  • padding_side (str) -- The default value for the side on which the model should have padding applied. Should be 'right' or 'left'.

  • truncation_side (str) -- The default value for the side on which the model should have truncation applied. Should be 'right' or 'left'.

Moreover, methods common to tokenizers for tokenization, token/id conversion and encoding as model inputs are also provided here.

In addition, the metaclass InitTrackerMeta is used to create PretrainedTokenizer, by which subclasses can automatically track the arguments used for initialization and expose the special tokens used at initialization as attributes.
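
For example, a concrete subclass can be instantiated through the shared [from_pretrained] machinery and then used via the common encoding entry points (a minimal sketch; "bert-base-chinese" is only one example of a built-in short-cut name):

    from paddlenlp.transformers import BertTokenizer

    # Downloads/caches the vocabulary files and init configuration for the
    # short-cut name, then builds the tokenizer with the tracked __init__ args.
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

    # The shared __call__ entry point returns model-ready inputs.
    encoded = tokenizer("欢迎使用飞桨")
    print(encoded["input_ids"])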

property vocab_size: int#

Size of the base vocabulary (without the added tokens).

Type:

int

get_added_vocab() → Dict[str, int][source]#

Returns the added tokens in the vocabulary as a dictionary of token to index.

Returns:

The added tokens.

Return type:

Dict[str, int]
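
A small sketch of how vocab_size relates to get_added_vocab (add_tokens is inherited from the shared base class; the token string below is purely illustrative):

    from paddlenlp.transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    print(tokenizer.vocab_size)         # size of the base vocabulary only

    # Newly added tokens live on top of the base vocabulary ...
    tokenizer.add_tokens(["[DEMO_TOKEN]"])
    # ... and are reported separately from vocab_size.
    print(tokenizer.get_added_vocab())  # e.g. {'[DEMO_TOKEN]': <new index>}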

prepare_for_tokenization(text, is_split_into_words=False, **kwargs)[source]#

Performs any necessary transformations before tokenization.

This method should pop the arguments from kwargs and return the remaining kwargs as well. We test the kwargs at the end of the encoding process to be sure all the arguments have been used.

Parameters:
  • text (str) -- The text to prepare.

  • is_split_into_words (bool, optional, defaults to False) -- Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) which it will tokenize. This is useful for NER or token classification.

  • kwargs -- Keyword arguments to use for the tokenization.

Returns:

The prepared text and the unused kwargs.

Return type:

Tuple[str, Dict[str, Any]]
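
A sketch of how a subclass might override this hook to consume a custom keyword argument (MyTokenizer and lowercase_first are hypothetical):

    from paddlenlp.transformers import PretrainedTokenizer

    class MyTokenizer(PretrainedTokenizer):
        def prepare_for_tokenization(self, text, is_split_into_words=False, **kwargs):
            # Pop the arguments this tokenizer understands and return the rest,
            # so the unused-kwargs check at the end of encoding still works.
            if kwargs.pop("lowercase_first", False):
                text = text.lower()
            return (text, kwargs)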

tokenize(text: str, **kwargs) → List[str][source]#

Converts a string into a sequence of tokens, using the tokenizer.

Splits into words for word-based vocabularies or into sub-words for sub-word-based vocabularies (BPE/SentencePiece/WordPiece). Takes care of added tokens.

Parameters:
  • text (str) -- The sequence to be encoded.

  • **kwargs (additional keyword arguments) -- Passed along to the model-specific prepare_for_tokenization preprocessing method.

Returns:

The list of tokens.

Return type:

List[str]
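
A minimal usage sketch (the exact tokens depend on the loaded vocabulary):

    from paddlenlp.transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    tokens = tokenizer.tokenize("自然语言处理")
    print(tokens)  # a list of (sub-)word strings; added tokens are kept intact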

convert_tokens_to_string(tokens)[source]#

Converts a sequence of tokens (a list of strings) into a single string by using ' '.join(tokens).

Parameters:

tokens (list[str]) -- A sequence of tokens.

Returns:

The converted string.

Return type:

str
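
For example, a round trip through tokenize (a sketch; concrete subclasses may override this method to merge sub-word pieces back together):

    from paddlenlp.transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    tokens = tokenizer.tokenize("a simple example")
    # Subclasses may override this; the PretrainedTokenizer base simply joins on spaces.
    print(tokenizer.convert_tokens_to_string(tokens))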

static load_vocabulary(filepath, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)[source]#

Instantiates an instance of Vocab from a file, reserving all tokens by using Vocab.from_dict. The file contains one token per line, and the line number is the index of the corresponding token.

Parameters:
  • filepath (str) -- Path of the file used to construct the vocabulary.

  • unk_token (str) -- Special token for unknown tokens. If not needed, it can be None. Defaults to None.

  • pad_token (str) -- Special token for padding. If not needed, it can be None. Defaults to None.

  • bos_token (str) -- Special token for the beginning of a sequence. If not needed, it can be None. Defaults to None.

  • eos_token (str) -- Special token for the end of a sequence. If not needed, it can be None. Defaults to None.

  • **kwargs (dict) -- Keyword arguments for Vocab.from_dict.

Returns:

An instance of Vocab.

Return type:

Vocab
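
A sketch of loading a plain one-token-per-line vocabulary file ("vocab.txt" is a hypothetical path):

    from paddlenlp.transformers import PretrainedTokenizer

    # Each line of the file holds one token; the line number is the token index.
    vocab = PretrainedTokenizer.load_vocabulary(
        "vocab.txt", unk_token="[UNK]", pad_token="[PAD]"
    )
    print(len(vocab))
    print(vocab.to_indices("[UNK]"))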

static save_vocabulary(filepath, vocab)[source]#

Saves all tokens to a vocabulary file. The file contains one token per line, and the line number is the index of the corresponding token.

Parameters:
  • filepath (str) -- File path to save the vocabulary to.

  • vocab (Vocab|dict) -- The Vocab or dict instance to be saved.
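
A round-trip sketch pairing save_vocabulary with load_vocabulary (both file paths are hypothetical):

    from paddlenlp.transformers import PretrainedTokenizer

    vocab = PretrainedTokenizer.load_vocabulary("vocab.txt", unk_token="[UNK]")
    # Writes one token per line, in index order, to the target file.
    PretrainedTokenizer.save_vocabulary("vocab_copy.txt", vocab)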

get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]#

Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer's encode methods.

Parameters:
  • token_ids_0 (List[int]) -- List of ids of the first sequence.

  • token_ids_1 (List[int], optional) -- List of ids of the second sequence.

  • already_has_special_tokens (bool, optional) -- Whether or not the token list is already formatted with special tokens for the model. Defaults to False.

Returns:

The list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Return type:

List[int]
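
A sketch with a BERT-style tokenizer (the exact mask depends on the model's special-token layout):

    from paddlenlp.transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    ids = tokenizer("飞桨")["input_ids"]  # already wrapped with [CLS]/[SEP]
    mask = tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
    print(mask)  # 1 marks special tokens, 0 marks ordinary sequence tokens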

num_special_tokens_to_add(pair)[source]#

Returns the number of tokens added when encoding a sequence with special tokens.

Parameters:

pair (bool, optional) -- Whether the number of added tokens should be computed for a sequence pair or a single sequence. Defaults to False.

Returns:

Number of special tokens added to sequences.

Return type:

int
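
For example, a BERT-style tokenizer typically adds [CLS] and [SEP] around a single sequence and one extra [SEP] for a pair (a sketch):

    from paddlenlp.transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    print(tokenizer.num_special_tokens_to_add(pair=False))  # e.g. 2
    print(tokenizer.num_special_tokens_to_add(pair=True))   # e.g. 3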

get_offset_mapping(text: str, split_tokens: List[str] | None = None)[source]#

Returns the offset mapping of tokens, i.e. the start and end character index of each token in the input text. Modified from bojone/bert4keras.

Parameters:
  • text (str) -- Input text.

  • split_tokens (Optional[List[str]]) -- Tokens that have already been split, which can accelerate the operation.

Returns:

The offset map of the input text.

Return type:

list
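
A sketch of mapping tokens back to character spans in the original text:

    from paddlenlp.transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    text = "飞桨自然语言处理"
    # Each entry is a (start, end) character span for the corresponding token.
    for start, end in tokenizer.get_offset_mapping(text):
        print(text[start:end])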

decode_token(all_input_ids: List[int], prefix_offset: int = 0, read_offset: int = 0) → Tuple[str, int, int][source]#

Tokenizer decoding for the streaming generation use case. This method can be overridden for tokenizers that do not follow this API.
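
A streaming-decoding sketch: feed the growing id list and carry the two offsets between calls (here the ids come from encoding a fixed string, purely for illustration, and the returned tuple is assumed to be (new_text, prefix_offset, read_offset), matching the annotated Tuple[str, int, int]):

    from paddlenlp.transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    all_ids = tokenizer("streaming decode demo")["input_ids"]

    pieces, prefix_offset, read_offset = [], 0, 0
    for i in range(1, len(all_ids) + 1):
        # Decode only the newly readable text since the previous call.
        new_text, prefix_offset, read_offset = tokenizer.decode_token(
            all_ids[:i], prefix_offset=prefix_offset, read_offset=read_offset
        )
        pieces.append(new_text)
    print("".join(pieces))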

class BPETokenizer(vocab_file, encoder_json_path='./configs/encoder.json', vocab_bpe_path='./configs/vocab.bpe', unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]')[source]#

Bases: PretrainedTokenizer

The base class for all BPE tokenizers. It mainly provides common tokenization methods for BPE-type tokenizers.

Parameters:
  • vocab_file (str) -- File path of the vocabulary.

  • encoder_json_path (str, optional) -- File path of the id-to-vocab mapping. Defaults to "./configs/encoder.json".

  • vocab_bpe_path (str, optional) -- File path of the word merge text. Defaults to "./configs/vocab.bpe".

  • unk_token (str, optional) -- The special token for unknown words. Defaults to "[UNK]".

  • sep_token (str, optional) -- The special token for separation. Defaults to "[SEP]".

  • pad_token (str, optional) -- The special token for padding. Defaults to "[PAD]".

  • cls_token (str, optional) -- The special token for classification. Defaults to "[CLS]".

  • mask_token (str, optional) -- The special token for masking. Defaults to "[MASK]".
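
A construction sketch with hypothetical file paths (concrete subclasses would normally supply these files via [from_pretrained]):

    from paddlenlp.transformers.tokenizer_utils import BPETokenizer

    tokenizer = BPETokenizer(
        vocab_file="vocab.txt",                       # hypothetical paths
        encoder_json_path="./configs/encoder.json",
        vocab_bpe_path="./configs/vocab.bpe",
    )
    print(tokenizer.tokenize("hello world"))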

tokenize_chinese_chars(text)[source]#

Adds whitespace around any CJK character.

is_chinese_char(cp)[source]#

Checks whether cp is the codepoint of a CJK character.

normalize_chars(text)[source]#

Normalizes the text for multilingual and Chinese models. Unicode ranges: https://www.ling.upenn.edu/courses/Spring_2003/ling538/UnicodeRanges.html

tokenize_special_chars(text)[source]#

Adds whitespace around any special character.
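
A combined sketch of the character-level helpers above (this assumes they are exposed at module level in paddlenlp.transformers.tokenizer_utils, as the flat listing here suggests):

    from paddlenlp.transformers.tokenizer_utils import (
        is_chinese_char,
        tokenize_chinese_chars,
        tokenize_special_chars,
    )

    text = "PaddleNLP支持中文"
    # CJK characters get surrounded by whitespace so they split one by one.
    print(tokenize_chinese_chars(text))
    # Codepoint-level check used by the helper above.
    print(is_chinese_char(ord("中")))   # True
    # Adds whitespace around characters treated as special by this helper.
    print(tokenize_special_chars("①2019年"))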

convert_to_unicode(text)[source]#

Converts text to Unicode (if it's not already), assuming UTF-8 input.

Parameters:

text (str|bytes) -- Text to be converted to Unicode.

Returns:

The converted text.

Return type:

str
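
A minimal sketch (same module-level assumption as above):

    from paddlenlp.transformers.tokenizer_utils import convert_to_unicode

    # Bytes are decoded as UTF-8; str inputs pass through unchanged.
    print(convert_to_unicode(b"caf\xc3\xa9"))  # 'café'
    print(convert_to_unicode("已经是字符串"))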