tokenizer#
Tokenization class for the FNet model.
- class FNetTokenizer(sentencepiece_model_file, do_lower_case=False, remove_space=True, keep_accents=True, unk_token='<unk>', sep_token='[SEP]', pad_token='<pad>', cls_token='[CLS]', mask_token='[MASK]', sp_model_kwargs: Dict[str, Any] | None = None, **kwargs)[source]#
  Bases: AlbertEnglishTokenizer
  Construct an FNet tokenizer, which inherits from AlbertEnglishTokenizer and is based on SentencePiece.
  - Parameters:
    - sentencepiece_model_file (str) -- SentencePiece file (generally has a .spm extension) that contains the vocabulary necessary to instantiate a tokenizer.
    - do_lower_case (bool, optional, defaults to False) -- Whether or not to lowercase the input when tokenizing.
    - remove_space (bool, optional, defaults to True) -- Whether or not to strip the text when tokenizing (removing excess spaces before and after the string).
    - keep_accents (bool, optional, defaults to True) -- Whether or not to keep accents when tokenizing.
    - unk_token (str, optional, defaults to "<unk>") -- The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to this token instead.
    - sep_token (str, optional, defaults to "[SEP]") -- The separator token, used when building a sequence from multiple sequences, e.g. two sequences for sequence classification, or a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
    - pad_token (str, optional, defaults to "<pad>") -- The token used for padding, for example when batching sequences of different lengths.
    - cls_token (str, optional, defaults to "[CLS]") -- The classifier token, used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of a sequence built with special tokens.
    - mask_token (str, optional, defaults to "[MASK]") -- The token used for masking values. This is the token used when training this model with masked language modeling; it is the token the model will try to predict.
    - sp_model_kwargs (dict, optional) -- Will be passed to the SentencePieceProcessor.__init__() method. The Python wrapper for SentencePiece can be used, among other things, to set:
      - enable_sampling: Enable subword regularization.
      - nbest_size: Sampling parameters for unigram. Invalid for BPE-Dropout.
        - nbest_size = {0,1}: No sampling is performed.
        - nbest_size > 1: Samples from the nbest_size results.
        - nbest_size < 0: Assumes that nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.
      - alpha: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout.
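The sampling options above can be sketched as a plain kwargs dict. The file name `spiece.model` and the specific values below are placeholder assumptions, not taken from this page:

```python
# Settings forwarded verbatim to SentencePieceProcessor.__init__():
# enable_sampling turns on subword regularization, nbest_size > 1 samples
# among the n best unigram segmentations, and alpha smooths that sampling
# distribution (or sets the BPE-dropout merge probability).
sp_model_kwargs = {
    "enable_sampling": True,
    "nbest_size": 64,
    "alpha": 0.1,
}

# Hypothetical call -- requires paddlenlp and a real SentencePiece model file:
# tokenizer = FNetTokenizer("spiece.model", sp_model_kwargs=sp_model_kwargs)
print(sorted(sp_model_kwargs))
```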
- sp_model#
  The SentencePiece processor that is used for every conversion (string, tokens and IDs).
  - Type: SentencePieceProcessor
- convert_tokens_to_string(tokens)[source]#
  Converts a sequence of tokens (strings for sub-words) into a single string.
- build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: List[int] | None = None) → List[int][source]#
  Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An FNet sequence has the following format:
  - single sequence: [CLS] X [SEP]
  - pair of sequences: [CLS] A [SEP] B [SEP]
  - Parameters:
    - token_ids_0 (List[int]) -- List of IDs to which the special tokens will be added.
    - token_ids_1 (List[int], optional) -- Optional second list of IDs for sequence pairs.
  - Returns: List of input IDs with the appropriate special tokens.
  - Return type: List[int]
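The layout above can be sketched in plain Python. This is an illustrative re-implementation, not the library source; the numeric IDs for [CLS] and [SEP] (2 and 3) are made-up placeholders, not FNet's real vocabulary IDs:

```python
# Placeholder special-token IDs for illustration only.
CLS_ID, SEP_ID = 2, 3

def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    cls, sep = [CLS_ID], [SEP_ID]
    if token_ids_1 is None:
        return cls + token_ids_0 + sep                       # [CLS] X [SEP]
    return cls + token_ids_0 + sep + token_ids_1 + sep       # [CLS] A [SEP] B [SEP]

print(build_inputs_with_special_tokens([10, 11]))        # [2, 10, 11, 3]
print(build_inputs_with_special_tokens([10, 11], [20]))  # [2, 10, 11, 3, 20, 3]
```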
- get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]#
  Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer encode methods.
  - Parameters:
    - token_ids_0 (List[int]) -- List of IDs of the first sequence.
    - token_ids_1 (List[int], optional) -- List of IDs of the second sequence.
    - already_has_special_tokens (bool, optional) -- Whether or not the token list is already formatted with special tokens for the model. Defaults to False.
  - Returns: The list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
  - Return type: List[int]
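A minimal sketch of the default path (already_has_special_tokens=False), assuming the [CLS] X [SEP] and [CLS] A [SEP] B [SEP] layouts described above; this is not the library source:

```python
# 1 marks positions that will hold special tokens ([CLS]/[SEP]),
# 0 marks ordinary sequence tokens.
def get_special_tokens_mask(token_ids_0, token_ids_1=None):
    if token_ids_1 is None:
        return [1] + [0] * len(token_ids_0) + [1]        # [CLS] X [SEP]
    return ([1] + [0] * len(token_ids_0) + [1]
            + [0] * len(token_ids_1) + [1])              # [CLS] A [SEP] B [SEP]

print(get_special_tokens_mask([10, 11, 12]))        # [1, 0, 0, 0, 1]
print(get_special_tokens_mask([10, 11], [20, 21]))  # [1, 0, 0, 1, 0, 0, 1]
```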
- create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: List[int] | None = None) → List[int][source]#
  Create a mask from the two sequences passed to be used in a sequence-pair classification task. An FNet sequence pair mask has the following format:
  0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
  | first sequence    | second sequence |
  If token_ids_1 is None, this method only returns the first portion of the mask (0s).
  - Parameters:
    - token_ids_0 (List[int]) -- List of IDs.
    - token_ids_1 (List[int], optional) -- Optional second list of IDs for sequence pairs.
  - Returns: List of token type IDs according to the given sequence(s).
  - Return type: List[int]
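The pair mask above can be sketched as follows (an illustrative re-implementation, not the library source): the first sequence plus its [CLS] and [SEP] get type 0, and the second sequence plus its trailing [SEP] get type 1.

```python
def create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None):
    # [CLS] + first sequence + [SEP] are all type 0.
    first = [0] * (len(token_ids_0) + 2)
    if token_ids_1 is None:
        return first
    # Second sequence + its [SEP] are type 1.
    return first + [1] * (len(token_ids_1) + 1)

print(create_token_type_ids_from_sequences([10, 11]))        # [0, 0, 0, 0]
print(create_token_type_ids_from_sequences([10, 11], [20]))  # [0, 0, 0, 0, 1, 1]
```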