Tokenization class for FNet model.

class FNetTokenizer(sentencepiece_model_file, do_lower_case=False, remove_space=True, keep_accents=True, unk_token='<unk>', sep_token='[SEP]', pad_token='<pad>', cls_token='[CLS]', mask_token='[MASK]', sp_model_kwargs: Optional[Dict[str, Any]] = None, **kwargs) [source]


Constructs an FNet tokenizer. Inherits from AlbertEnglishTokenizer. Based on SentencePiece.

  • sentencepiece_model_file (str) -- SentencePiece file (generally has a .spm extension) that contains the vocabulary necessary to instantiate a tokenizer.

  • do_lower_case (bool, optional, defaults to False) -- Whether or not to lowercase the input when tokenizing.

  • remove_space (bool, optional, defaults to True) -- Whether or not to strip the text when tokenizing (removing excess spaces before and after the string).

  • keep_accents (bool, optional, defaults to True) -- Whether or not to keep accents when tokenizing.

  • unk_token (str, optional, defaults to "<unk>") -- The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

  • sep_token (str, optional, defaults to "[SEP]") -- The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.

  • pad_token (str, optional, defaults to "<pad>") -- The token used for padding, for example when batching sequences of different lengths.

  • cls_token (str, optional, defaults to "[CLS]") -- The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.

  • mask_token (str, optional, defaults to "[MASK]") -- The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.

  • sp_model_kwargs (dict, optional) --

    Will be passed to the SentencePieceProcessor.__init__() method. The Python wrapper for SentencePiece can be used, among other things, to set:

    • enable_sampling: Enable subword regularization.

    • nbest_size: Sampling parameters for unigram. Invalid for BPE-Dropout.

      • nbest_size = {0,1}: No sampling is performed.

      • nbest_size > 1: samples from the nbest_size results.

      • nbest_size < 0: treats nbest_size as infinite and samples from all hypotheses (the lattice) using the forward-filtering-and-backward-sampling algorithm.

    • alpha: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout.
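
As a rough illustration of the options listed above, the dictionary below enables subword regularization over the full lattice. The specific values and the placeholder path "spiece.model" are assumptions chosen for the example, not recommended settings.

```python
# Hypothetical values, chosen only to illustrate the three documented keys.
sp_model_kwargs = {
    "enable_sampling": True,  # turn on subword regularization
    "nbest_size": -1,         # negative: sample from all hypotheses (lattice)
    "alpha": 0.1,             # smoothing parameter for unigram sampling
}

# The dictionary would be forwarded when constructing the tokenizer, e.g.:
#   FNetTokenizer("spiece.model", sp_model_kwargs=sp_model_kwargs)
# "spiece.model" is a placeholder path, not a file referenced by this page.
print(sorted(sp_model_kwargs))
```

With sampling enabled, the same input text may be split differently on each call, which is usually wanted only during training.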


The SentencePiece processor that is used for every conversion (string, tokens and IDs).



property vocab_size

Size of the base vocabulary (without the added tokens).




Converts a string into a sequence of tokens, using the tokenizer.

Splits into words for word-based vocabularies or into sub-words for sub-word-based vocabularies (BPE/SentencePiece/WordPiece). Takes care of added tokens.

  • text (str) -- The sequence to be encoded.

  • **kwargs (additional keyword arguments) -- Passed along to the model-specific prepare_for_tokenization preprocessing method.


The list of tokens.




Converts a sequence of tokens (strings for sub-words) into a single string.

build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) -> List[int] [source]

Builds model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating them and adding special tokens. An FNet sequence has the following format:

  • single sequence: [CLS] X [SEP]

  • pair of sequences: [CLS] A [SEP] B [SEP]

  • token_ids_0 (List[int]) -- List of IDs to which the special tokens will be added.

  • token_ids_1 (List[int], optional) -- Optional second list of IDs for sequence pairs.


List of input IDs with the appropriate special tokens.
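
The two formats above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name and the values of CLS_ID and SEP_ID are hypothetical, and the real IDs come from the loaded vocabulary.

```python
from typing import List, Optional

# Hypothetical special-token IDs, for illustration only.
CLS_ID = 4
SEP_ID = 5


def build_inputs_sketch(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Return [CLS] X [SEP] or, for a pair, [CLS] A [SEP] B [SEP]."""
    ids = [CLS_ID] + token_ids_0 + [SEP_ID]
    if token_ids_1 is not None:
        ids = ids + token_ids_1 + [SEP_ID]
    return ids


single = build_inputs_sketch([10, 11, 12])
pair = build_inputs_sketch([10, 11], [20, 21, 22])
print(single)  # [4, 10, 11, 12, 5]
print(pair)    # [4, 10, 11, 5, 20, 21, 22, 5]
```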



get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False) [source]

Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer encode methods.

  • token_ids_0 (List[int]) -- List of ids of the first sequence.

  • token_ids_1 (List[int], optional) -- List of ids of the second sequence.

  • already_has_special_tokens (bool, optional) -- Whether or not the token list is already formatted with special tokens for the model. Defaults to False.


A list of integers, each either 0 or 1: 1 for a special token, 0 for a sequence token.


results (List[int])
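
A minimal sketch of the mask for the case where already_has_special_tokens is False, assuming the [CLS] X [SEP] and [CLS] A [SEP] B [SEP] layouts described for build_inputs_with_special_tokens. The function name is hypothetical and this is not the library's code.

```python
from typing import List, Optional


def special_tokens_mask_sketch(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Mark the positions where special tokens would be inserted with 1."""
    # [CLS] + first sequence + [SEP]
    mask = [1] + [0] * len(token_ids_0) + [1]
    if token_ids_1 is not None:
        # second sequence + trailing [SEP]
        mask = mask + [0] * len(token_ids_1) + [1]
    return mask


single_mask = special_tokens_mask_sketch([10, 11, 12])
pair_mask = special_tokens_mask_sketch([10, 11], [20])
print(single_mask)  # [1, 0, 0, 0, 1]
print(pair_mask)    # [1, 0, 0, 1, 0, 1]
```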

create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) -> List[int] [source]

Create a mask from the two sequences passed to be used in a sequence-pair classification task. An FNet sequence pair mask has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |

If token_ids_1 is None, this method only returns the first portion of the mask (0s).

  • token_ids_0 (List[int]) -- List of IDs.

  • token_ids_1 (List[int], optional) -- Optional second list of IDs for sequence pairs.


List of token type IDs according to the given sequence(s).
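
The mask can be sketched as follows. The page does not state which segment the special tokens belong to, so this example assumes the common BERT-style convention: [CLS] and the first [SEP] count toward the first sequence, and the final [SEP] counts toward the second. The function name is hypothetical.

```python
from typing import List, Optional


def token_type_ids_sketch(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Return 0s for the first segment and 1s for the second, if present."""
    # Assumption: [CLS] and the first [SEP] are part of segment 0.
    first = [0] * (len(token_ids_0) + 2)
    if token_ids_1 is None:
        return first
    # Assumption: the trailing [SEP] is part of segment 1.
    second = [1] * (len(token_ids_1) + 1)
    return first + second


single_types = token_type_ids_sketch([10, 11])
pair_types = token_type_ids_sketch([10, 11], [20, 21, 22])
print(single_types)  # [0, 0, 0, 0]
print(pair_types)    # [0, 0, 0, 0, 1, 1, 1, 1]
```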




Saves tokenizer-related resources into save_directory by copying them directly, using the file names given in resource_files_names. Override this method if necessary.


save_directory (str) -- Directory to save files into.