tokenizer_utils¶
- class PretrainedTokenizer(**kwargs)[source]¶
Bases: paddlenlp.transformers.tokenizer_utils_base.PretrainedTokenizerBase
Base class for all tokenizers. Inherits from [PretrainedTokenizerBase]. Handles all the shared methods for tokenization and special tokens, as well as methods for downloading/caching/loading pretrained tokenizers and for adding tokens to the vocabulary.
This class also contains the added tokens in a unified way on top of all tokenizers, so we don't have to handle the specific vocabulary augmentation methods of the various underlying dictionary structures (BPE, SentencePiece, …).
- resource_files_names (Dict[str, str]) – A dictionary with, as keys, the __init__ keyword name of each vocabulary file required by the model, and as associated values, the filename for saving the associated file (string).
- pretrained_resource_files_map (Dict[str, Dict[str, str]]) – A dictionary of dictionaries, with the high-level keys being the __init__ keyword name of each vocabulary file required by the model, the low-level keys being the short-cut-names of the pretrained models, and as associated values, the url to the associated pretrained vocabulary file.
- max_model_input_sizes (Dict[str, Optional[int]]) – A dictionary with, as keys, the short-cut-names of the pretrained models, and as associated values, the maximum length of the sequence inputs of this model, or None if the model has no maximum input size.
- pretrained_init_configuration (Dict[str, Dict[str, Any]]) – A dictionary with, as keys, the short-cut-names of the pretrained models, and as associated values, a dictionary of specific arguments to pass to the __init__ method of the tokenizer class for this pretrained model when loading the tokenizer with the [from_pretrained] method.
- model_input_names (List[str]) – A list of inputs expected in the forward pass of the model.
- padding_side (str) – The default value for the side on which the model should have padding applied. Should be 'right' or 'left'.
- truncation_side (str) – The default value for the side on which the model should have truncation applied. Should be 'right' or 'left'.
Moreover, methods common to tokenizers for tokenization, token/id conversion, and encoding as model inputs are also provided here.
Besides, the metaclass InitTrackerMeta is used to create PretrainedTokenizer, by which subclasses can automatically track their initialization arguments and expose the special tokens passed at initialization as attributes.
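As a minimal usage sketch (assuming BertTokenizer as a concrete subclass and "bert-base-uncased" as one of its registered shortcut names), a pretrained tokenizer is typically created with from_pretrained and then called on raw text:

    from paddlenlp.transformers import BertTokenizer

    # Downloads/caches the vocabulary registered for the shortcut name and builds
    # the tokenizer with the arguments stored in pretrained_init_configuration.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    print(tokenizer.vocab_size)    # size of the base vocabulary
    print(tokenizer.padding_side)  # 'right' or 'left'

    # Encoding a sentence produces the inputs listed in model_input_names.
    encoded = tokenizer("He was a puppeteer")
    print(encoded["input_ids"])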
- property vocab_size¶
Size of the base vocabulary (without the added tokens).
- Type
int
- get_added_vocab() → Dict[str, int][source]¶
Returns the added tokens in the vocabulary as a dictionary of token to index.
- Returns
The added tokens.
- Return type
Dict[str, int]
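For illustration (a sketch, again assuming a BertTokenizer instance), tokens registered via add_tokens appear in get_added_vocab but do not change vocab_size:

    from paddlenlp.transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    tokenizer.add_tokens(["[NEW1]", "[NEW2]"])  # extend the vocabulary at runtime
    print(tokenizer.get_added_vocab())
    # e.g. {'[NEW1]': 30522, '[NEW2]': 30523} -- new ids start after the base vocabulary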
- prepare_for_tokenization(text, is_split_into_words=False, **kwargs)[source]¶
Performs any necessary transformations before tokenization.
This method should pop the arguments it consumes from kwargs and return the remaining kwargs as well. We test kwargs at the end of the encoding process to be sure all the arguments have been used.
- Parameters
text (str) – The text to prepare.
is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace), which it will then tokenize. This is useful for NER or token classification.
kwargs – Keyword arguments to use for the tokenization.
- Returns
The prepared text and the unused kwargs.
- Return type
Tuple[str, Dict[str, Any]]
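A sketch of how a subclass might hook into this step; the do_lower_case keyword below is hypothetical, not part of the base API:

    from typing import Any, Dict, Tuple

    from paddlenlp.transformers.tokenizer_utils import PretrainedTokenizer

    class MyTokenizer(PretrainedTokenizer):
        def prepare_for_tokenization(self, text, is_split_into_words=False, **kwargs) -> Tuple[str, Dict[str, Any]]:
            # Pop the arguments this tokenizer consumes so the kwargs check at the
            # end of encoding does not flag them as unused.
            if kwargs.pop("do_lower_case", False):  # hypothetical keyword
                text = text.lower()
            return (text, kwargs)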
- tokenize(text: str, **kwargs) → List[str][source]¶
Converts a string into a sequence of tokens, using the tokenizer.
Splits into words for word-based vocabularies or sub-words for sub-word-based vocabularies (BPE/SentencePiece/WordPiece). Takes care of added tokens.
- Parameters
text (str) – The sequence to be encoded.
**kwargs (additional keyword arguments) – Passed along to the model-specific prepare_for_tokenization preprocessing method.
- Returns
The list of tokens.
- Return type
List[str]
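A quick sketch (assuming BertTokenizer and its WordPiece vocabulary):

    from paddlenlp.transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    tokens = tokenizer.tokenize("He was a puppeteer")
    print(tokens)
    # e.g. ['he', 'was', 'a', 'puppet', '##eer'] -- 'puppeteer' is split into sub-words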
- convert_tokens_to_string(tokens)[source]¶
Converts a sequence of tokens (list of strings) to a single string by using ' '.join(tokens).
- Parameters
tokens (list[str]) – A sequence of tokens.
- Returns
Converted string.
- Return type
str
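Because the base implementation is effectively a whitespace join, sub-word markers are not merged back; concrete subclasses (e.g. WordPiece-based ones) usually override this method. Roughly:

    tokens = ["he", "was", "a", "puppet", "##eer"]
    # Base behavior of PretrainedTokenizer.convert_tokens_to_string:
    text = " ".join(tokens)
    print(text)  # 'he was a puppet ##eer'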
- static load_vocabulary(filepath, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)[source]¶
Instantiates an instance of Vocab from a file, reserving all tokens by using Vocab.from_dict. The file contains one token per line, and the line number is used as the index of the corresponding token.
- Parameters
filepath (str) – Path of the file used to construct the vocabulary.
unk_token (str) – Special token for unknown token. If not needed, it can also be None. Defaults to None.
pad_token (str) – Special token for padding token. If not needed, it can also be None. Defaults to None.
bos_token (str) – Special token for bos token. If not needed, it can also be None. Defaults to None.
eos_token (str) – Special token for eos token. If not needed, it can also be None. Defaults to None.
**kwargs (dict) – Keyword arguments for Vocab.from_dict.
- Returns
An instance of Vocab.
- Return type
Vocab
- static save_vocabulary(filepath, vocab)[source]¶
Saves all tokens to a vocabulary file. The file contains one token per line, and the line number is used as the index of the corresponding token.
- Parameters
filepath (str) – File path to be saved to.
vocab (Vocab|dict) – The Vocab or dict instance to be saved.
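A small round-trip sketch for the two static helpers above; the file path is a placeholder, and the to_tokens call assumes the API of paddlenlp.data.Vocab:

    from paddlenlp.transformers.tokenizer_utils import PretrainedTokenizer

    # One token per line; the line number becomes the token's index.
    vocab_dict = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}
    PretrainedTokenizer.save_vocabulary("vocab.txt", vocab_dict)

    vocab = PretrainedTokenizer.load_vocabulary(
        "vocab.txt", unk_token="[UNK]", pad_token="[PAD]"
    )
    print(vocab.to_tokens(2))  # 'hello'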
- get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]¶
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer encode methods.
- Parameters
token_ids_0 (List[int]) – List of ids of the first sequence.
token_ids_1 (List[int], optional) – List of ids of the second sequence.
already_has_special_tokens (bool, optional) – Whether or not the token list is already formatted with special tokens for the model. Defaults to False.
- Returns
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
- Return type
List[int]
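A sketch, assuming a BertTokenizer instance whose encode methods add [CLS] and [SEP]:

    from paddlenlp.transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("He was a puppeteer"))
    mask = tokenizer.get_special_tokens_mask(ids)
    print(mask)
    # e.g. [1, 0, 0, 0, 0, 0, 1] -- 1 marks the positions where [CLS]/[SEP] would be added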
- num_special_tokens_to_add(pair)[source]¶
Returns the number of added tokens when encoding a sequence with special tokens.
- Parameters
pair (bool, optional) – Whether the number of added tokens should be computed in the case of a sequence pair or a single sequence. Defaults to False.
- Returns
Number of special tokens added to sequences.
- Return type
int
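For a BERT-style tokenizer (assumed), the counts correspond to the [CLS]/[SEP] insertions:

    from paddlenlp.transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    print(tokenizer.num_special_tokens_to_add(pair=False))  # e.g. 2 -> [CLS] ... [SEP]
    print(tokenizer.num_special_tokens_to_add(pair=True))   # e.g. 3 -> [CLS] ... [SEP] ... [SEP]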
- get_offset_mapping(text: str, split_tokens: Optional[List[str]] = None)[source]¶
Returns, for each token, the start and end character index of its span in the input text. Modified from https://github.com/bojone/bert4keras/blob/master/bert4keras/tokenizers.py#L372
- Parameters
text (str) – Input text.
split_tokens (Optional[List[str]]) – Tokens that have already been split, which can accelerate the operation.
- Returns
The offset map of the input text.
- Return type
list
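A sketch (BertTokenizer assumed); each tuple is the character span of one token in the original text:

    from paddlenlp.transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    print(tokenizer.get_offset_mapping("He was a puppeteer"))
    # e.g. [(0, 2), (3, 6), (7, 8), (9, 15), (15, 18)]
    #       'He'    'was'   'a'     'puppet' '##eer'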
- class BPETokenizer(vocab_file, encoder_json_path='./configs/encoder.json', vocab_bpe_path='./configs/vocab.bpe', unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]')[source]¶
Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
The base class for all BPE tokenizers. It mainly provides common tokenization methods for BPE-type tokenizers.
- Parameters
vocab_file (str) – File path of the vocabulary.
encoder_json_path (str, optional) – File path of the id-to-vocab mapping. Defaults to "./configs/encoder.json".
vocab_bpe_path (str, optional) – File path of the word merge text. Defaults to "./configs/vocab.bpe".
unk_token (str, optional) – The special token for unknown words. Defaults to "[UNK]".
sep_token (str, optional) – The special token for separator token. Defaults to "[SEP]".
pad_token (str, optional) – The special token for padding. Defaults to "[PAD]".
cls_token (str, optional) – The special token for cls. Defaults to "[CLS]".
mask_token (str, optional) – The special token for mask. Defaults to "[MASK]".
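A hypothetical construction sketch; the vocabulary and BPE files are placeholders you must provide yourself (the encoder.json/vocab.bpe paths simply mirror the defaults in the signature):

    from paddlenlp.transformers.tokenizer_utils import BPETokenizer

    tokenizer = BPETokenizer(
        vocab_file="./configs/vocab.txt",            # placeholder vocabulary file
        encoder_json_path="./configs/encoder.json",  # id-to-token mapping (default path)
        vocab_bpe_path="./configs/vocab.bpe",        # BPE merge rules (default path)
    )
    tokens = tokenizer.tokenize("He was a puppeteer")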
- normalize_chars(text)[source]¶
Normalizes the text for multilingual and Chinese models. Unicode range reference: https://www.ling.upenn.edu/courses/Spring_2003/ling538/UnicodeRanges.html