tokenizer#

Tokenization class for the FNet model.

class FNetTokenizer(sentencepiece_model_file, do_lower_case=False, remove_space=True, keep_accents=True, unk_token='<unk>', sep_token='[SEP]', pad_token='<pad>', cls_token='[CLS]', mask_token='[MASK]', sp_model_kwargs: Dict[str, Any] | None = None, **kwargs)[source]#

Bases: AlbertEnglishTokenizer

Construct an FNet tokenizer, which inherits from AlbertEnglishTokenizer and is based on SentencePiece.

Parameters:
  • sentencepiece_model_file (str) – SentencePiece file (generally has a .spm extension) that contains the vocabulary necessary to instantiate the tokenizer.

  • do_lower_case (bool, optional, defaults to False) – Whether or not to lowercase the input when tokenizing.

  • remove_space (bool, optional, defaults to True) – Whether or not to strip the text when tokenizing (removing excess spaces before and after the string).

  • keep_accents (bool, optional, defaults to True) – Whether or not to keep accents when tokenizing.

  • unk_token (str, optional, defaults to "<unk>") – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

  • sep_token (str, optional, defaults to "[SEP]") – The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.

  • pad_token (str, optional, defaults to "<pad>") – The token used for padding, for example when batching sequences of different lengths.

  • cls_token (str, optional, defaults to "[CLS]") – The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.

  • mask_token (str, optional, defaults to "[MASK]") – The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.

  • sp_model_kwargs (dict, optional) –

    Will be passed to the SentencePieceProcessor.__init__() method. The Python wrapper for SentencePiece can be used, among other things, to set:

    • enable_sampling: Enable subword regularization.

    • nbest_size: Sampling parameters for unigram. Invalid for BPE-Dropout.

      • nbest_size = {0,1}: No sampling is performed.

      • nbest_size > 1: samples from the nbest_size results.

      • nbest_size < 0: assumes that nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.

    • alpha: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout (illustrated in the sketch below).
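
Example (a minimal usage sketch; the paddlenlp import path and the "fnet-base" checkpoint name are assumptions and may differ in your environment):

    from paddlenlp.transformers import FNetTokenizer

    # Load the tokenizer together with its pretrained SentencePiece vocabulary
    # ("fnet-base" is an assumed checkpoint name).
    tokenizer = FNetTokenizer.from_pretrained("fnet-base")

    # Split a string into SentencePiece sub-word pieces.
    tokens = tokenizer.tokenize("Welcome to use PaddleNLP!")

    # Subword regularization could be enabled by constructing the tokenizer
    # from a raw SentencePiece file ("spiece.model" is a placeholder path):
    # FNetTokenizer("spiece.model",
    #               sp_model_kwargs={"enable_sampling": True,
    #                                "nbest_size": -1,
    #                                "alpha": 0.1})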

sp_model#

The SentencePiece processor that is used for every conversion (string, tokens and IDs).

Type:

SentencePieceProcessor

convert_tokens_to_string(tokens)[source]#

Converts a sequence of tokens (strings for sub-words) into a single string.
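
For illustration, continuing the sketch above (SentencePiece marks word starts with the "▁" prefix):

    # Pieces such as ['▁Hello', '▁world'] are joined back into "Hello world".
    tokens = tokenizer.tokenize("Hello world")
    text = tokenizer.convert_tokens_to_string(tokens)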

build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: List[int] | None = None) → List[int][source]#

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An FNet sequence has the following format (illustrated in the sketch after the return type):

  • single sequence: [CLS] X [SEP]

  • pair of sequences: [CLS] A [SEP] B [SEP]

Parameters:
  • token_ids_0 (List[int]) – List of IDs to which the special tokens will be added.

  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

Returns:

List of input IDs with the appropriate special tokens.

Return type:

List[int]
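
A sketch of both formats, continuing the class-level example above (the actual IDs depend on the vocabulary):

    ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("first sentence"))
    ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("second one"))

    # Single sequence: [CLS] X [SEP]
    single_input = tokenizer.build_inputs_with_special_tokens(ids_a)

    # Pair of sequences: [CLS] A [SEP] B [SEP]
    pair_input = tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)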

get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]#

Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer encode methods.

Parameters:
  • token_ids_0 (List[int]) – List of ids of the first sequence.

  • token_ids_1 (List[int], optional) – List of ids of the second sequence.

  • already_has_special_tokens (bool, optional) – Whether or not the token list is already formatted with special tokens for the model. Defaults to False.

Returns:

The list of integers in the range [0, 1]:

1 for a special token, 0 for a sequence token.

Return type:

List[int]
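
Continuing the sketch above with ids_a and ids_b, the mask marks the positions where special tokens would sit once the pair is built:

    # Result has the shape [1, 0, ..., 0, 1, 0, ..., 0, 1]:
    # 1 for [CLS] and each [SEP], 0 for ordinary sequence tokens.
    mask = tokenizer.get_special_tokens_mask(ids_a, ids_b)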

create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: List[int] | None = None) → List[int][source]#

Create a mask from the two sequences passed to be used in a sequence-pair classification task. An FNet sequence pair mask has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |

If token_ids_1 is None, this method only returns the first portion of the mask (0s).

Parameters:
  • token_ids_0 (List[int]) – List of IDs.

  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

Returns:

List of token type IDs according to the given sequence(s).

Return type:

List[int]
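
Continuing the same sketch, the type IDs change from 0 to 1 at the boundary between the two segments:

    # 0 covers "[CLS] A [SEP]", 1 covers "B [SEP]",
    # i.e. [0, 0, ..., 0, 1, 1, ..., 1].
    token_type_ids = tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b)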