tokenizer
Tokenization class for the FNet model.
class FNetTokenizer(sentencepiece_model_file, do_lower_case=False, remove_space=True, keep_accents=True, unk_token='<unk>', sep_token='[SEP]', pad_token='<pad>', cls_token='[CLS]', mask_token='[MASK]', sp_model_kwargs: Optional[Dict[str, Any]] = None, **kwargs)[source]

Bases: paddlenlp.transformers.albert.tokenizer.AlbertEnglishTokenizer
Constructs an FNet tokenizer, inheriting from AlbertEnglishTokenizer and based on SentencePiece.

Parameters
- sentencepiece_model_file (str) -- The SentencePiece file (generally with a .spm extension) that contains the vocabulary necessary to instantiate the tokenizer.
- do_lower_case (bool, optional, defaults to False) -- Whether or not to lowercase the input when tokenizing.
- remove_space (bool, optional, defaults to True) -- Whether or not to strip the text when tokenizing (removing excess spaces before and after the string).
- keep_accents (bool, optional, defaults to True) -- Whether or not to keep accents when tokenizing.
- unk_token (str, optional, defaults to "<unk>") -- The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to this token instead.
- sep_token (str, optional, defaults to "[SEP]") -- The separator token, used when building a sequence from multiple sequences, e.g. two sequences for sequence classification, or a text and a question for question answering. It is also the last token of a sequence built with special tokens.
- pad_token (str, optional, defaults to "<pad>") -- The token used for padding, for example when batching sequences of different lengths.
- cls_token (str, optional, defaults to "[CLS]") -- The classifier token, used for sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of a sequence built with special tokens.
- mask_token (str, optional, defaults to "[MASK]") -- The token used for masking values. This is the token used when training the model with masked language modeling, and the token the model will try to predict.
- sp_model_kwargs (dict, optional) -- Will be passed to the SentencePieceProcessor.__init__() method. The Python wrapper for SentencePiece can be used, among other things, to set:
  - enable_sampling: Enable subword regularization.
  - nbest_size: Sampling parameters for unigram. Invalid for BPE-dropout.
    - nbest_size = {0, 1}: No sampling is performed.
    - nbest_size > 1: Samples from the nbest_size best results.
    - nbest_size < 0: Assumes nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.
  - alpha: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout.
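As a usage sketch, the snippet below constructs a tokenizer from a local SentencePiece file and turns on subword regularization through sp_model_kwargs; the file path "spiece.model" is a placeholder for any valid SentencePiece vocabulary on disk:

    from paddlenlp.transformers import FNetTokenizer

    # "spiece.model" is a placeholder path to an existing SentencePiece vocabulary file.
    tokenizer = FNetTokenizer(
        sentencepiece_model_file="spiece.model",
        do_lower_case=False,
        # Forwarded to SentencePieceProcessor.__init__(); this combination enables
        # unigram sampling (subword regularization) with smoothing alpha=0.1.
        sp_model_kwargs={"enable_sampling": True, "nbest_size": -1, "alpha": 0.1},
    )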
sp_model

The SentencePiece processor used for every conversion (string, tokens, and IDs).

Type: SentencePieceProcessor
property vocab_size

Size of the base vocabulary (without the added tokens).

Type: int
tokenize(text)[source]

Converts a string into a sequence of tokens, using the tokenizer. Splits into words for word-based vocabularies, or into sub-words for sub-word-based vocabularies (BPE/SentencePiece/WordPiece). Takes care of added tokens.

Parameters
- text (str) -- The sequence to be encoded.
- **kwargs (additional keyword arguments) -- Passed along to the model-specific prepare_for_tokenization preprocessing method.

Returns
The list of tokens.

Return type
List[str]
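For instance, using the tokenizer built above (the exact pieces depend on the vocabulary in the SentencePiece file, so the output shown is illustrative only):

    tokens = tokenizer.tokenize("He was a puppeteer")
    # SentencePiece marks word boundaries with "▁", so for a typical vocabulary
    # the result looks like ['▁He', '▁was', '▁a', '▁puppet', 'eer'].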
convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (strings for sub-words) into a single string.
build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]

Builds model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An FNet sequence has the following format:

- single sequence: [CLS] X [SEP]
- pair of sequences: [CLS] A [SEP] B [SEP]

Parameters
- token_ids_0 (List[int]) -- List of IDs to which the special tokens will be added.
- token_ids_1 (List[int], optional) -- Optional second list of IDs for sequence pairs.

Returns
List of input IDs with the appropriate special tokens.

Return type
List[int]
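A sketch of both formats, assuming convert_tokens_to_ids() from the inherited tokenizer API and the tokenizer built above:

    ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("How are you?"))
    ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Fine, thanks."))

    single = tokenizer.build_inputs_with_special_tokens(ids_a)
    # [cls_id] + ids_a + [sep_id]
    pair = tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)
    # [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]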
get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]

Retrieves a special-tokens mask from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer's encode methods.

Parameters
- token_ids_0 (List[int]) -- List of IDs of the first sequence.
- token_ids_1 (List[int], optional) -- List of IDs of the second sequence.
- already_has_special_tokens (bool, optional) -- Whether or not the token list is already formatted with special tokens for the model. Defaults to False.

Returns
The list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Return type
List[int]
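Continuing the sketch above, the mask mirrors the layout that build_inputs_with_special_tokens() produces:

    mask = tokenizer.get_special_tokens_mask(ids_a, ids_b)
    # [1] + [0] * len(ids_a) + [1] + [0] * len(ids_b) + [1]
    # i.e. 1 at the [CLS]/[SEP] positions, 0 everywhere else.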
create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]

Creates a mask from the two sequences passed, to be used in a sequence-pair classification task. An FNet sequence pair mask has the following format:

    0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
    | first sequence    | second sequence |

If token_ids_1 is None, this method only returns the first portion of the mask (0s).

Parameters
- token_ids_0 (List[int]) -- List of IDs.
- token_ids_1 (List[int], optional) -- Optional second list of IDs for sequence pairs.

Returns
List of token type IDs according to the given sequence(s).

Return type
List[int]
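Continuing the same sketch, the token type IDs line up with the pair layout [CLS] A [SEP] B [SEP]:

    token_type_ids = tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b)
    # [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # 0 covers the "[CLS] A [SEP]" positions, 1 covers "B [SEP]".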