tokenizer_utils¶
- class PretrainedTokenizer(**kwargs)[source]¶
Bases: paddlenlp.transformers.tokenizer_utils_base.PretrainedTokenizerBase
Base class for all tokenizers. Inherits from [PretrainedTokenizerBase]. Handles all the shared methods for tokenization and special tokens, as well as methods for downloading/caching/loading pretrained tokenizers and for adding tokens to the vocabulary.
This class also contains the added tokens in a unified way on top of all tokenizers, so we don't have to handle the specific vocabulary augmentation methods of the various underlying dictionary structures (BPE, SentencePiece, …).
- resource_files_names (Dict[str, str]) – A dictionary with, as keys, the __init__ keyword name of each vocabulary file required by the model, and as associated values, the filename for saving the associated file (string).
- pretrained_resource_files_map (Dict[str, Dict[str, str]]) – A dictionary of dictionaries, with the high-level keys being the __init__ keyword name of each vocabulary file required by the model, the low-level keys being the short-cut-names of the pretrained models, and as associated values, the url to the associated pretrained vocabulary file.
- max_model_input_sizes (Dict[str, Optional[int]]) – A dictionary with, as keys, the short-cut-names of the pretrained models, and as associated values, the maximum length of the sequence inputs of this model, or None if the model has no maximum input size.
- pretrained_init_configuration (Dict[str, Dict[str, Any]]) – A dictionary with, as keys, the short-cut-names of the pretrained models, and as associated values, a dictionary of specific arguments to pass to the __init__ method of the tokenizer class for this pretrained model when loading the tokenizer with the [from_pretrained] method.
- model_input_names (List[str]) – A list of inputs expected in the forward pass of the model.
- padding_side (str) – The default value for the side on which the model should have padding applied. Should be 'right' or 'left'.
- truncation_side (str) – The default value for the side on which the model should have truncation applied. Should be 'right' or 'left'.
Moreover, methods common to tokenizers for tokenization, token/id conversion, and encoding as model inputs are also provided here.
Besides, the metaclass InitTrackerMeta is used to create PretrainedTokenizer, by which subclasses can automatically track their initialization arguments and expose the special tokens passed at initialization as attributes.
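As a minimal usage sketch (assuming BertTokenizer as a concrete subclass and "bert-base-uncased" as one of its registered shortcut names), a pretrained tokenizer is typically created with from_pretrained and then called on raw text:

    from paddlenlp.transformers import BertTokenizer

    # Downloads/caches the vocabulary registered for the shortcut name and builds
    # the tokenizer with the arguments stored in pretrained_init_configuration.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    print(tokenizer.vocab_size)    # size of the base vocabulary
    print(tokenizer.padding_side)  # 'right' or 'left'

    # Encoding a sentence produces the inputs listed in model_input_names.
    encoded = tokenizer("He was a puppeteer")
    print(encoded["input_ids"])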
- property vocab_size¶
Size of the base vocabulary (without the added tokens).
- Type
int
- get_added_vocab() → Dict[str, int][source]¶
Returns the added tokens in the vocabulary as a dictionary of token to index.
- Returns
The added tokens.
- Return type
Dict[str, int]
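For illustration (a sketch, again assuming a BertTokenizer instance), tokens registered via add_tokens appear in get_added_vocab but do not change vocab_size:

    from paddlenlp.transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    tokenizer.add_tokens(["[NEW1]", "[NEW2]"])  # extend the vocabulary at runtime
    print(tokenizer.get_added_vocab())
    # e.g. {'[NEW1]': 30522, '[NEW2]': 30523} -- new ids start after the base vocabulary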
- prepare_for_tokenization(text, is_split_into_words=False, **kwargs)[source]¶
Performs any necessary transformations before tokenization.
This method should pop the arguments it consumes from kwargs and return the remaining kwargs as well. We test kwargs at the end of the encoding process to be sure all the arguments have been used.
- Parameters
text (str) – The text to prepare.
is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace), which it will then tokenize. This is useful for NER or token classification.
kwargs – Keyword arguments to use for the tokenization.
- Returns
The prepared text and the unused kwargs.
- Return type
Tuple[str, Dict[str, Any]]
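A sketch of how a subclass might hook into this step; the do_lower_case keyword below is hypothetical, not part of the base API:

    from typing import Any, Dict, Tuple

    from paddlenlp.transformers.tokenizer_utils import PretrainedTokenizer

    class MyTokenizer(PretrainedTokenizer):
        def prepare_for_tokenization(self, text, is_split_into_words=False, **kwargs) -> Tuple[str, Dict[str, Any]]:
            # Pop the arguments this tokenizer consumes so the kwargs check at the
            # end of encoding does not flag them as unused.
            if kwargs.pop("do_lower_case", False):  # hypothetical keyword
                text = text.lower()
            return (text, kwargs)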
- tokenize(text: str, **kwargs) → List[str][source]¶
Converts a string into a sequence of tokens, using the tokenizer.
Splits into words for word-based vocabularies or sub-words for sub-word-based vocabularies (BPE/SentencePiece/WordPiece). Takes care of added tokens.
- Parameters
text (str) – The sequence to be encoded.
**kwargs (additional keyword arguments) – Passed along to the model-specific prepare_for_tokenization preprocessing method.
- Returns
The list of tokens.
- Return type
List[str]
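A quick sketch (assuming BertTokenizer and its WordPiece vocabulary):

    from paddlenlp.transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    tokens = tokenizer.tokenize("He was a puppeteer")
    print(tokens)
    # e.g. ['he', 'was', 'a', 'puppet', '##eer'] -- 'puppeteer' is split into sub-words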
- convert_tokens_to_string(tokens)[source]¶
Converts a sequence of tokens (list of strings) to a single string by using ' '.join(tokens).
- Parameters
tokens (list[str]) – A sequence of tokens.
- Returns
Converted string.
- Return type
str
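Because the base implementation is effectively a whitespace join, sub-word markers are not merged back; concrete subclasses (e.g. WordPiece-based ones) usually override this method. Roughly:

    tokens = ["he", "was", "a", "puppet", "##eer"]
    # Base behavior of PretrainedTokenizer.convert_tokens_to_string:
    text = " ".join(tokens)
    print(text)  # 'he was a puppet ##eer'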
- static load_vocabulary(filepath, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)[source]¶
Instantiates an instance of Vocab from a file, reserving all tokens by using Vocab.from_dict. The file contains one token per line, and the line number is used as the index of the corresponding token.
- Parameters
filepath (str) – Path of the file used to construct the vocabulary.
unk_token (str) – Special token for unknown token. If not needed, it can also be None. Defaults to None.
pad_token (str) – Special token for padding token. If not needed, it can also be None. Defaults to None.
bos_token (str) – Special token for bos token. If not needed, it can also be None. Defaults to None.
eos_token (str) – Special token for eos token. If not needed, it can also be None. Defaults to None.
**kwargs (dict) – Keyword arguments for Vocab.from_dict.
- Returns
An instance of Vocab.
- Return type
Vocab
- static save_vocabulary(filepath, vocab)[source]¶
Saves all tokens to a vocabulary file. The file contains one token per line, and the line number is used as the index of the corresponding token.
- Parameters
filepath (str) – File path to be saved to.
vocab (Vocab|dict) – The Vocab or dict instance to be saved.
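A small round-trip sketch for the two static helpers above; the file path is a placeholder, and the to_tokens call assumes the API of paddlenlp.data.Vocab:

    from paddlenlp.transformers.tokenizer_utils import PretrainedTokenizer

    # One token per line; the line number becomes the token's index.
    vocab_dict = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}
    PretrainedTokenizer.save_vocabulary("vocab.txt", vocab_dict)

    vocab = PretrainedTokenizer.load_vocabulary(
        "vocab.txt", unk_token="[UNK]", pad_token="[PAD]"
    )
    print(vocab.to_tokens(2))  # 'hello'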
- get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]¶
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer encode methods.
- Parameters
token_ids_0 (List[int]) – List of ids of the first sequence.
token_ids_1 (List[int], optional) – List of ids of the second sequence.
already_has_special_tokens (bool, optional) – Whether or not the token list is already formatted with special tokens for the model. Defaults to False.
- Returns
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
- Return type
List[int]
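A sketch, assuming a BertTokenizer instance whose encode methods add [CLS] and [SEP]:

    from paddlenlp.transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("He was a puppeteer"))
    mask = tokenizer.get_special_tokens_mask(ids)
    print(mask)
    # e.g. [1, 0, 0, 0, 0, 0, 1] -- 1 marks the positions where [CLS]/[SEP] would be added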
- num_special_tokens_to_add(pair)[source]¶
Returns the number of added tokens when encoding a sequence with special tokens.
- Parameters
pair (bool, optional) – Whether the number of added tokens should be computed in the case of a sequence pair or a single sequence. Defaults to False.
- Returns
Number of special tokens added to sequences.
- Return type
int
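For a BERT-style tokenizer (assumed), the counts correspond to the [CLS]/[SEP] insertions:

    from paddlenlp.transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    print(tokenizer.num_special_tokens_to_add(pair=False))  # e.g. 2 -> [CLS] ... [SEP]
    print(tokenizer.num_special_tokens_to_add(pair=True))   # e.g. 3 -> [CLS] ... [SEP] ... [SEP]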
- get_offset_mapping(text: str, split_tokens: Optional[List[str]] = None)[source]¶
Returns, for each token, the start and end character index of its span in the input text. Modified from https://github.com/bojone/bert4keras/blob/master/bert4keras/tokenizers.py#L372
- Parameters
text (str) – Input text.
split_tokens (Optional[List[str]]) – Tokens that have already been split, which can accelerate the operation.
- Returns
The offset map of the input text.
- Return type
list
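A sketch (BertTokenizer assumed); each tuple is the character span of one token in the original text:

    from paddlenlp.transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    print(tokenizer.get_offset_mapping("He was a puppeteer"))
    # e.g. [(0, 2), (3, 6), (7, 8), (9, 15), (15, 18)]
    #       'He'    'was'   'a'     'puppet' '##eer'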
- class BPETokenizer(vocab_file, encoder_json_path='./configs/encoder.json', vocab_bpe_path='./configs/vocab.bpe', unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]')[source]¶
Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
The base class for all BPE tokenizers. It mainly provides common tokenization methods for BPE-type tokenizers.
- Parameters
vocab_file (str) – File path of the vocabulary.
encoder_json_path (str, optional) – File path of the id-to-vocab mapping. Defaults to "./configs/encoder.json".
vocab_bpe_path (str, optional) – File path of the word merge text. Defaults to "./configs/vocab.bpe".
unk_token (str, optional) – The special token for unknown words. Defaults to "[UNK]".
sep_token (str, optional) – The special token for separator token. Defaults to "[SEP]".
pad_token (str, optional) – The special token for padding. Defaults to "[PAD]".
cls_token (str, optional) – The special token for cls. Defaults to "[CLS]".
mask_token (str, optional) – The special token for mask. Defaults to "[MASK]".
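A hypothetical construction sketch; the vocabulary and BPE files are placeholders you must provide yourself (the encoder.json/vocab.bpe paths simply mirror the defaults in the signature):

    from paddlenlp.transformers.tokenizer_utils import BPETokenizer

    tokenizer = BPETokenizer(
        vocab_file="./configs/vocab.txt",            # placeholder vocabulary file
        encoder_json_path="./configs/encoder.json",  # id-to-token mapping (default path)
        vocab_bpe_path="./configs/vocab.bpe",        # BPE merge rules (default path)
    )
    tokens = tokenizer.tokenize("He was a puppeteer")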
- normalize_chars(text)[source]¶
Normalizes the text for multilingual and Chinese models. Unicode range reference: https://www.ling.upenn.edu/courses/Spring_2003/ling538/UnicodeRanges.html