tokenizer_utils#
- class PretrainedTokenizer(**kwargs)[source]#
Bases: ChatTemplateMixin, PretrainedTokenizerBase
Base class for all tokenizers.
Inherits from PretrainedTokenizerBase. Handles all the shared methods for tokenization and special tokens, as well as methods for downloading/caching/loading pretrained tokenizers and for adding tokens to the vocabulary.
This class also contains the added tokens in a unified way on top of all tokenizers, so we don’t have to handle the specific vocabulary-augmentation methods of the various underlying dictionary structures (BPE, SentencePiece, …).
- resource_files_names (Dict[str, str]) – A dictionary with, as keys, the __init__ keyword name of each vocabulary file required by the model, and, as associated values, the filename for saving the associated file (string).
- pretrained_resource_files_map (Dict[str, Dict[str, str]]) – A dictionary of dictionaries, with the high-level keys being the __init__ keyword name of each vocabulary file required by the model, the low-level keys being the short-cut-names of the pretrained models, and, as associated values, the url to the associated pretrained vocabulary file.
- max_model_input_sizes (Dict[str, Optional[int]]) – A dictionary with, as keys, the short-cut-names of the pretrained models, and, as associated values, the maximum length of the sequence inputs of this model, or None if the model has no maximum input size.
- pretrained_init_configuration (Dict[str, Dict[str, Any]]) – A dictionary with, as keys, the short-cut-names of the pretrained models, and, as associated values, a dictionary of specific arguments to pass to the __init__ method of the tokenizer class for this pretrained model when loading the tokenizer with the from_pretrained method.
- model_input_names (List[str]) – A list of inputs expected in the forward pass of the model.
- padding_side (str) – The default value for the side on which the model should have padding applied. Should be 'right' or 'left'.
- truncation_side (str) – The default value for the side on which the model should have truncation applied. Should be 'right' or 'left'.
Moreover, methods common to tokenizers for tokenization, token/id conversion and encoding as model inputs are also provided here.
Besides, the metaclass InitTrackerMeta is used to create PretrainedTokenizer, by which subclasses can automatically track the arguments used for initialization and expose the initialized special tokens as attributes.
- property vocab_size: int#
Size of the base vocabulary (without the added tokens).
- Type:
int
- get_added_vocab() Dict[str, int] [source]#
Returns the added tokens in the vocabulary as a dictionary of token to index.
- Returns:
The added tokens.
- Return type:
Dict[str, int]
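A minimal usage sketch (assuming PaddleNLP is installed, the bert-base-uncased vocabulary can be downloaded, and add_tokens is the generic token-registration helper inherited from the base classes):

```python
from paddlenlp.transformers import BertTokenizer

# Load a pretrained tokenizer by its short-cut-name (downloads/caches the vocab files).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Size of the base vocabulary, excluding any added tokens.
print(tokenizer.vocab_size)

# Register an extra token, then read back the token -> index map of added tokens only.
tokenizer.add_tokens(["<new_tok>"])
print(tokenizer.get_added_vocab())  # e.g. {"<new_tok>": <new index>}
```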
- prepare_for_tokenization(text, is_split_into_words=False, **kwargs)[source]#
Performs any necessary transformations before tokenization.
This method should pop the arguments from kwargs and return the remaining kwargs as well. We test the kwargs at the end of the encoding process to be sure all the arguments have been used.
- Parameters:
text (str) – The text to prepare.
is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace), which it will tokenize. This is useful for NER or token classification.
kwargs – Keyword arguments to use for the tokenization.
- Returns:
The prepared text and the unused kwargs.
- Return type:
Tuple[str, Dict[str, Any]]
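A sketch of how a subclass might hook into this method; the do_lower_case argument shown here is hypothetical and only illustrates the pop-and-return-remaining-kwargs contract:

```python
class MyTokenizer(PretrainedTokenizer):
    def prepare_for_tokenization(self, text, is_split_into_words=False, **kwargs):
        # Consume a hypothetical keyword argument; whatever is left in kwargs
        # is returned unchanged so the caller can verify it was all used.
        if kwargs.pop("do_lower_case", False):
            text = text.lower()
        return (text, kwargs)
```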
- tokenize(text: str, **kwargs) List[str] [source]#
Converts a string into a sequence of tokens, using the tokenizer.
Splits into words for word-based vocabularies or into sub-words for sub-word-based vocabularies (BPE/SentencePiece/WordPiece). Takes care of added tokens.
- Parameters:
text (str) – The sequence to be encoded.
**kwargs (additional keyword arguments) – Passed along to the model-specific prepare_for_tokenization preprocessing method.
- Returns:
The list of tokens.
- Return type:
List[str]
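A minimal sketch, reusing the tokenizer loaded earlier (the exact sub-word split depends on the model's vocabulary):

```python
tokens = tokenizer.tokenize("He was a puppeteer")
print(tokens)  # e.g. ['he', 'was', 'a', 'puppet', '##eer'] for a WordPiece vocabulary
```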
- convert_tokens_to_string(tokens)[source]#
Converts a sequence of tokens (list of string) to a single string by using ' '.join(tokens).
- Parameters:
tokens (list[str]) – A sequence of tokens.
- Returns:
Converted string.
- Return type:
str
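A minimal sketch; note that the base implementation is a plain whitespace join, and concrete tokenizers may override it with model-specific detokenization:

```python
tokens = ["he", "was", "a", "puppeteer"]
print(tokenizer.convert_tokens_to_string(tokens))  # "he was a puppeteer"
```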
- static load_vocabulary(filepath, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)[source]#
Instantiates an instance of Vocab from a file, reserving all tokens by using Vocab.from_dict. The file contains one token per line, and the line number is the index of the corresponding token.
- Parameters:
filepath (str) – Path of the file used to construct the vocabulary.
unk_token (str) – Special token for unknown tokens. If not needed, it can also be None. Defaults to None.
pad_token (str) – Special token for padding. If not needed, it can also be None. Defaults to None.
bos_token (str) – Special token for the beginning of a sequence. If not needed, it can also be None. Defaults to None.
eos_token (str) – Special token for the end of a sequence. If not needed, it can also be None. Defaults to None.
**kwargs (dict) – Keyword arguments for Vocab.from_dict.
- Returns:
An instance of Vocab.
- Return type:
Vocab
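A minimal sketch, assuming a hypothetical vocab.txt with one token per line (Vocab.to_tokens from paddlenlp.data is assumed for the lookup):

```python
# "vocab.txt" is a placeholder: one token per line, line number = token index.
vocab = PretrainedTokenizer.load_vocabulary(
    "vocab.txt",
    unk_token="[UNK]",
    pad_token="[PAD]",
)
print(vocab.to_tokens(0))  # the token stored on the first line
```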
- static save_vocabulary(filepath, vocab)[source]#
Save all tokens to a vocabulary file. The file contains one token per line, and the line number is the index of the corresponding token.
- Parameters:
filepath (str) – File path to be saved to.
vocab (Vocab|dict) – The Vocab or dict instance to be saved.
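Continuing the sketch above, writing the vocabulary back out to a hypothetical path:

```python
# Writes one token per line, in index order, to the given file.
PretrainedTokenizer.save_vocabulary("saved_vocab.txt", vocab)
```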
- get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]#
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer's encode methods.
- Parameters:
token_ids_0 (List[int]) – List of ids of the first sequence.
token_ids_1 (List[int], optional) – List of ids of the second sequence.
already_has_special_tokens (bool, optional) – Whether or not the token list is already formatted with special tokens for the model. Defaults to False.
- Returns:
The list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
- Return type:
results (List[int])
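A hedged sketch with the BERT-style tokenizer loaded earlier, assuming its encode output is a dict containing input_ids (as in PaddleNLP):

```python
# ids that already contain the model's special tokens ([CLS] ... [SEP]).
ids = tokenizer.encode("He was a puppeteer")["input_ids"]

# With already_has_special_tokens=True the mask flags those positions with 1
# and ordinary sequence tokens with 0.
mask = tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
print(mask)  # 1 at the [CLS]/[SEP] positions, 0 elsewhere
```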
- num_special_tokens_to_add(pair)[source]#
Returns the number of added tokens when encoding a sequence with special tokens.
- Parameters:
pair (bool, optional) – Whether the number of added tokens should be computed in the case of a sequence pair or a single sequence. Defaults to False.
- Returns:
Number of special tokens added to sequences.
- Return type:
int
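For example, a BERT-style tokenizer adds [CLS] and [SEP] around a single sequence and an extra [SEP] for a pair, so (assuming the tokenizer loaded earlier):

```python
print(tokenizer.num_special_tokens_to_add(pair=False))  # e.g. 2 ([CLS], [SEP])
print(tokenizer.num_special_tokens_to_add(pair=True))   # e.g. 3 ([CLS], [SEP], [SEP])
```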
- get_offset_mapping(text: str, split_tokens: List[str] | None = None)[source]#
Returns the map of tokens to the start and end character indices of each token in the input text. Modified from bojone/bert4keras.
- Parameters:
text (str) – Input text.
split_tokens (Optional[List[str]]) – Tokens that have already been split, which can accelerate the operation.
- Returns:
The offset map of input text.
- Return type:
list
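A minimal sketch, reusing the tokenizer loaded earlier; each entry of the returned list is a (start, end) character span for one token:

```python
text = "He was a puppeteer"
offsets = tokenizer.get_offset_mapping(text)
for start, end in offsets:
    # text[start:end] recovers the surface form of the corresponding token.
    print(text[start:end])
```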
- class BPETokenizer(vocab_file, encoder_json_path='./configs/encoder.json', vocab_bpe_path='./configs/vocab.bpe', unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]')[source]#
Bases: PretrainedTokenizer
The base class for all BPE tokenizers. It mainly provides common tokenization methods for BPE-type tokenizers.
- Parameters:
vocab_file (str) – file path of the vocabulary.
encoder_json_path (str, optional) – File path of the id-to-vocab mapping (JSON).
vocab_bpe_path (str, optional) – File path of the word-merges text file.
unk_token (str, optional) – The special token for unknown words. Defaults to “[UNK]”.
sep_token (str, optional) – The special token for separator token. Defaults to “[SEP]”.
pad_token (str, optional) – The special token for padding. Defaults to “[PAD]”.
cls_token (str, optional) – The special token for cls. Defaults to “[CLS]”.
mask_token (str, optional) – The special token for mask. Defaults to “[MASK]”.
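A construction sketch; all file paths below are placeholders and must point to existing vocabulary, encoder-JSON and merges files for your model:

```python
bpe_tokenizer = BPETokenizer(
    vocab_file="vocab.txt",                      # placeholder path
    encoder_json_path="./configs/encoder.json",  # placeholder path
    vocab_bpe_path="./configs/vocab.bpe",        # placeholder path
    unk_token="[UNK]",
)
print(bpe_tokenizer.tokenize("He was a puppeteer"))
```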
- normalize_chars(text)[source]#
Normalize the text for multilingual and Chinese models. Unicode range: https://www.ling.upenn.edu/courses/Spring_2003/ling538/UnicodeRanges.html