tokenizer#
Tokenization class for FNet model.
- class FNetTokenizer(sentencepiece_model_file, do_lower_case=False, remove_space=True, keep_accents=True, unk_token='<unk>', sep_token='[SEP]', pad_token='<pad>', cls_token='[CLS]', mask_token='[MASK]', sp_model_kwargs: Dict[str, Any] | None = None, **kwargs)[source]#
Bases:
AlbertEnglishTokenizer
Constructs an FNet tokenizer. Inherits from
AlbertEnglishTokenizer
and is based on SentencePiece.
- Parameters:
  - sentencepiece_model_file (str) – The SentencePiece file (generally with a .spm extension) that contains the vocabulary necessary to instantiate the tokenizer.
  - do_lower_case (bool, optional, defaults to False) – Whether or not to lowercase the input when tokenizing.
  - remove_space (bool, optional, defaults to True) – Whether or not to strip the text when tokenizing (removing excess spaces before and after the string).
  - keep_accents (bool, optional, defaults to True) – Whether or not to keep accents when tokenizing.
  - unk_token (str, optional, defaults to "<unk>") – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to this token instead.
  - sep_token (str, optional, defaults to "[SEP]") – The separator token, used when building a sequence from multiple sequences, e.g. two sequences for sequence classification, or a text and a question for question answering. It is also the last token of a sequence built with special tokens.
  - pad_token (str, optional, defaults to "<pad>") – The token used for padding, for example when batching sequences of different lengths.
  - cls_token (str, optional, defaults to "[CLS]") – The classifier token, used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of a sequence built with special tokens.
  - mask_token (str, optional, defaults to "[MASK]") – The token used for masking values when training the model with masked language modeling; it is the token the model will try to predict.
  - sp_model_kwargs (dict, optional) – Will be passed to the SentencePieceProcessor.__init__() method. The Python wrapper for SentencePiece can be used, among other things, to set:
    - enable_sampling: Enable subword regularization.
    - nbest_size: Sampling parameters for unigram. Invalid for BPE-Dropout.
      - nbest_size = {0, 1}: No sampling is performed.
      - nbest_size > 1: Samples from the nbest_size best results.
      - nbest_size < 0: Assumes nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.
    - alpha: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout.
- sp_model#
The SentencePiece processor used for every conversion (string, tokens, and IDs).
- Type:
SentencePieceProcessor
- convert_tokens_to_string(tokens)[source]#
Converts a sequence of tokens (strings for sub-words) into a single string.
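The underlying behavior can be sketched as follows. This is a simplified illustration assuming the standard SentencePiece convention of marking word boundaries with the U+2581 "▁" character; the actual method delegates to the SentencePiece processor:

```python
SPIECE_UNDERLINE = "\u2581"  # SentencePiece word-boundary marker

def convert_tokens_to_string(tokens):
    """Join sub-word pieces into a single string by replacing the
    boundary marker with a space and stripping the leading space."""
    return "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()

print(convert_tokens_to_string(["\u2581this", "\u2581is", "\u2581a", "\u2581test"]))
# -> this is a test
```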
- build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: List[int] | None = None) List[int] [source]#
Builds model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An FNet sequence has the following format:
single sequence:
[CLS] X [SEP]
pair of sequences:
[CLS] A [SEP] B [SEP]
- Parameters:
  - token_ids_0 (List[int]) – List of IDs to which the special tokens will be added.
  - token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.
- Returns:
List of input IDs with the appropriate special tokens.
- Return type:
List[int]
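The two formats above amount to simple list concatenation, which can be sketched as follows. This is an illustrative sketch, not the library implementation; cls_token_id and sep_token_id are placeholder values standing in for the tokenizer's real special-token IDs:

```python
def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None,
                                     cls_token_id=101, sep_token_id=102):
    cls, sep = [cls_token_id], [sep_token_id]
    if token_ids_1 is None:
        # Single sequence: [CLS] X [SEP]
        return cls + token_ids_0 + sep
    # Pair of sequences: [CLS] A [SEP] B [SEP]
    return cls + token_ids_0 + sep + token_ids_1 + sep
```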
- get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]#
Retrieves a special-tokens mask from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer
encode
method.
- Parameters:
  - token_ids_0 (List[int]) – List of IDs of the first sequence.
  - token_ids_1 (List[int], optional) – List of IDs of the second sequence.
  - already_has_special_tokens (bool, optional) – Whether or not the token list is already formatted with special tokens for the model. Defaults to False.
- Returns:
The list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
- Return type:
List[int]
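For the FNet formats shown earlier, the mask can be sketched as below. This is an illustrative sketch, not the library implementation; special_ids is a hypothetical placeholder for the set of real special-token IDs:

```python
def get_special_tokens_mask(token_ids_0, token_ids_1=None,
                            already_has_special_tokens=False,
                            special_ids=(101, 102)):
    if already_has_special_tokens:
        # The sequence is already formatted: mark every special-token ID.
        return [1 if tid in special_ids else 0 for tid in token_ids_0]
    if token_ids_1 is None:
        # [CLS] X [SEP]
        return [1] + [0] * len(token_ids_0) + [1]
    # [CLS] A [SEP] B [SEP]
    return [1] + [0] * len(token_ids_0) + [1] + [0] * len(token_ids_1) + [1]
```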
- create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: List[int] | None = None) List[int] [source]#
Creates a mask from the two sequences passed, to be used in a sequence-pair classification task. An FNet sequence pair mask has the following format:
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |
If
token_ids_1
is None, this method only returns the first portion of the mask (0s).
- Parameters:
  - token_ids_0 (List[int]) – List of IDs.
  - token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.
- Returns:
List of token type IDs according to the given sequence(s).
- Return type:
List[int]
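Given the [CLS] A [SEP] and [CLS] A [SEP] B [SEP] formats above, the token type IDs reduce to counting the positions of each segment including its special tokens. A minimal sketch (not the library implementation):

```python
def create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None):
    if token_ids_1 is None:
        # [CLS] A [SEP] -> all zeros
        return [0] * (len(token_ids_0) + 2)
    # [CLS] A [SEP] B [SEP] -> zeros for the first segment (plus [CLS]
    # and the first [SEP]), ones for the second (plus the final [SEP])
    return [0] * (len(token_ids_0) + 2) + [1] * (len(token_ids_1) + 1)
```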