tokenizer
Tokenization class for XLNet model.
- class XLNetTokenizer(vocab_file, do_lower_case=False, remove_space=True, keep_accents=False, bos_token='<s>', eos_token='</s>', unk_token='<unk>', sep_token='<sep>', pad_token='<pad>', cls_token='<cls>', mask_token='<mask>', additional_special_tokens=['<eop>', '<eod>'], sp_model_kwargs=None, **kwargs)
Constructs an XLNet tokenizer based on SentencePiece.
This tokenizer inherits from PretrainedTokenizer, which contains most of the main methods. For more information regarding those methods, please refer to this superclass.
- Parameters:
vocab_file (str) -- The vocabulary file (ends with '.spm') required to instantiate a SentencePiece tokenizer.
do_lower_case (bool, optional) -- Whether to lowercase the input when tokenizing. Defaults to False, which does not lowercase the input.
remove_space (bool, optional) -- Whether to strip the text when tokenizing. Defaults to True, which removes excess spaces before and after the string.
keep_accents (bool, optional) -- Whether to keep accents when tokenizing. Defaults to False, which does not keep accents.
bos_token (str, optional) -- A special token representing the beginning of a sequence that was used during pretraining. Defaults to "<s>".
eos_token (str, optional) -- A special token representing the end of a sequence that was used during pretraining. Defaults to "</s>".
unk_token (str, optional) -- A special token representing the unknown (out-of-vocabulary) token. A token that is not in the vocabulary cannot be converted to an ID and is mapped to unk_token instead. Defaults to "<unk>".
sep_token (str, optional) -- A special token separating two different sentences in the same input. Defaults to "<sep>".
pad_token (str, optional) -- A special token used to make arrays of tokens the same size for batching purposes. Defaults to "<pad>".
cls_token (str, optional) -- A special token used for sequence classification. It is the last token of the sequence when built with special tokens. Defaults to "<cls>".
mask_token (str, optional) -- A special token representing a masked token. This is the token used in the masked language modeling task, where the model tries to predict the original unmasked tokens. Defaults to "<mask>".
additional_special_tokens (List[str], optional) -- A list of additional special tokens to be used by the tokenizer. Defaults to ["<eop>", "<eod>"].
- sp_model
The SentencePiece processor that is used for every conversion (string, tokens and IDs).
- Type:
SentencePieceProcessor
- property vocab_size
Size of the base vocabulary (without the added tokens).
- Type:
int
- get_vocab()
Returns the vocabulary as a dictionary of token to index.
tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.
- Returns:
The vocabulary.
- Return type:
Dict[str, int]
- convert_tokens_to_string(tokens)
Converts a sequence of tokens (list of strings) to a single string by using ' '.join(tokens).
- Parameters:
tokens (list[str]) -- A sequence of tokens.
- Returns:
Converted string.
- Return type:
str
- num_special_tokens_to_add(pair=False)
Returns the number of tokens added when encoding a sequence with special tokens.
- Parameters:
pair (bool, optional) -- Whether the input is a sequence pair or a single sequence. Defaults to False, meaning the input is a single sequence.
- Returns:
Number of tokens added to sequences.
- Return type:
int
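The count can be derived from the XLNet input formats documented for build_inputs_with_special_tokens (X <sep> <cls> and A <sep> B <sep> <cls>). The following is a minimal standalone sketch, not the library's code: SEP_ID, CLS_ID and the helper names are hypothetical, and the real IDs come from the loaded vocabulary.

```python
from typing import List, Optional

# Hypothetical special-token IDs, chosen only for illustration.
SEP_ID = 4
CLS_ID = 3


def build_inputs(ids_0: List[int], ids_1: Optional[List[int]] = None) -> List[int]:
    """Lay out IDs as X <sep> <cls> or A <sep> B <sep> <cls>."""
    if ids_1 is None:
        return ids_0 + [SEP_ID, CLS_ID]
    return ids_0 + [SEP_ID] + ids_1 + [SEP_ID, CLS_ID]


def num_special_tokens_to_add(pair: bool = False) -> int:
    """Count special tokens by building inputs from empty sequences."""
    return len(build_inputs([], [] if pair else None))


single_count = num_special_tokens_to_add()
pair_count = num_special_tokens_to_add(pair=True)
print(single_count, pair_count)
```

Under these assumptions a single sequence gains two tokens (<sep> and <cls>) and a pair gains three.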
- build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)
Builds model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An XLNet sequence has the following format:
single sequence: X <sep> <cls>
pair of sequences: A <sep> B <sep> <cls>
- Parameters:
token_ids_0 (List[int]) -- List of IDs for the first sequence.
token_ids_1 (List[int], optional) -- Optional list of IDs for the second sequence. Defaults to None.
- Returns:
List of input IDs with the appropriate special tokens.
- Return type:
List[int]
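The format above can be illustrated with a small pure-Python sketch. It does not call the tokenizer; SEP_ID and CLS_ID are hypothetical placeholder values standing in for the IDs of sep_token and cls_token in a real vocabulary.

```python
from typing import List, Optional

# Hypothetical IDs for <sep> and <cls>; real values depend on the vocabulary file.
SEP_ID = 4
CLS_ID = 3


def build_inputs_with_special_tokens(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Append special tokens in the XLNet order: <sep> closes each sequence, <cls> comes last."""
    if token_ids_1 is None:
        # single sequence: X <sep> <cls>
        return token_ids_0 + [SEP_ID, CLS_ID]
    # pair of sequences: A <sep> B <sep> <cls>
    return token_ids_0 + [SEP_ID] + token_ids_1 + [SEP_ID, CLS_ID]


single = build_inputs_with_special_tokens([10, 11, 12])
pair = build_inputs_with_special_tokens([10, 11], [20, 21, 22])
print(single)
print(pair)
```

Note that, unlike BERT-style layouts, the classification token sits at the end of the sequence rather than the beginning.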
- build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)
Builds an offset mapping from one offset mapping or a pair of offset mappings by concatenating them and adding the offsets of special tokens.
An XLNet offset_mapping has the following format:
single sequence: X (0,0) (0,0)
pair of sequences: A (0,0) B (0,0) (0,0)
- Parameters:
offset_mapping_0 (List[tuple]) -- List of char offsets to which the special tokens will be added.
offset_mapping_1 (List[tuple], optional) -- Optional second list of char offsets for offset mapping pairs. Defaults to None.
- Returns:
A list of char offsets with the appropriate offsets of special tokens.
- Return type:
List[tuple]
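As a rough illustration of the layout above, the sketch below places a (0, 0) offset wherever a special token would be inserted. It is a standalone approximation written for this page, not the library's implementation, and the sample offsets are made up.

```python
from typing import List, Optional, Tuple

Offset = Tuple[int, int]
SPECIAL_OFFSET: Offset = (0, 0)  # special tokens map to no span of the original text


def build_offset_mapping_with_special_tokens(
    offset_mapping_0: List[Offset], offset_mapping_1: Optional[List[Offset]] = None
) -> List[Offset]:
    """Mirror the X <sep> <cls> / A <sep> B <sep> <cls> layout with (0, 0) placeholders."""
    if offset_mapping_1 is None:
        # single sequence: X (0,0) (0,0)
        return offset_mapping_0 + [SPECIAL_OFFSET, SPECIAL_OFFSET]
    # pair of sequences: A (0,0) B (0,0) (0,0)
    return (
        offset_mapping_0
        + [SPECIAL_OFFSET]
        + offset_mapping_1
        + [SPECIAL_OFFSET, SPECIAL_OFFSET]
    )


single_offsets = build_offset_mapping_with_special_tokens([(0, 5), (6, 11)])
pair_offsets = build_offset_mapping_with_special_tokens([(0, 5)], [(0, 3), (4, 8)])
print(single_offsets)
print(pair_offsets)
```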
- get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)
Creates a special tokens mask from the input sequences. This method is called when adding special tokens using the tokenizer encode method.
- Parameters:
token_ids_0 (List[int]) -- A list of inputs_ids for the first sequence.
token_ids_1 (List[int], optional) -- Optional second list of inputs_ids for the second sequence. Defaults to None.
already_has_special_tokens (bool, optional) -- Whether the token list already contains special tokens for the model. Defaults to False.
- Returns:
A list of integers, each either 0 or 1: 1 for a special token, 0 for a sequence token.
- Return type:
List[int]
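The mask follows the same positions as the special tokens in the XLNet input format. The sketch below is an assumption-laden approximation rather than the library's code: SEP_ID and CLS_ID are hypothetical IDs, and the handling of already_has_special_tokens=True (marking any ID that equals one of those two values) is a guess at reasonable behavior.

```python
from typing import List, Optional

# Hypothetical IDs for <sep> and <cls>, used only by this sketch.
SEP_ID = 4
CLS_ID = 3


def get_special_tokens_mask(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    already_has_special_tokens: bool = False,
) -> List[int]:
    """Return 1 at special-token positions and 0 at sequence-token positions."""
    if already_has_special_tokens:
        # Assumption: the input is already a full sequence, so no second list is accepted.
        if token_ids_1 is not None:
            raise ValueError("Pass a single, already formatted list of IDs.")
        return [1 if tid in (SEP_ID, CLS_ID) else 0 for tid in token_ids_0]
    if token_ids_1 is None:
        # X <sep> <cls>
        return [0] * len(token_ids_0) + [1, 1]
    # A <sep> B <sep> <cls>
    return [0] * len(token_ids_0) + [1] + [0] * len(token_ids_1) + [1, 1]


mask_single = get_special_tokens_mask([10, 11, 12])
mask_pair = get_special_tokens_mask([10, 11], [20, 21, 22])
mask_existing = get_special_tokens_mask([10, 11, SEP_ID, CLS_ID], already_has_special_tokens=True)
print(mask_single, mask_pair, mask_existing)
```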
- create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)
Creates a token_type mask from the input sequences. If token_ids_1 is not None, then a sequence pair token_type mask has the following format:
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 2
| first sequence    | second sequence |
Else if token_ids_1 is None, then a single sequence token_type mask has the following format:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2
| first sequence                      |
0 stands for the segment id of first segment tokens,
1 stands for the segment id of second segment tokens,
2 stands for the segment id of cls_token.
- Parameters:
token_ids_0 (List[int]) -- A list of inputs_ids for the first sequence.
token_ids_1 (List[int], optional) -- Optional second list of inputs_ids for the second sequence. Defaults to None.
- Returns:
List of token type IDs according to the given sequence(s).
- Return type:
List[int]
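The segment layout can be sketched in plain Python as follows. This is an illustrative approximation, not the library's implementation. It assumes that each <sep> token receives the segment id of the sequence it closes, which matches the diagrams above but is not stated explicitly in this page.

```python
from typing import List, Optional

CLS_SEGMENT_ID = 2  # segment id reserved for the trailing <cls> token


def create_token_type_ids_from_sequences(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Assign 0 to the first segment, 1 to the second, and 2 to <cls>."""
    # Assumption: the <sep> after each sequence shares that sequence's segment id.
    first = [0] * (len(token_ids_0) + 1)
    if token_ids_1 is None:
        return first + [CLS_SEGMENT_ID]
    second = [1] * (len(token_ids_1) + 1)
    return first + second + [CLS_SEGMENT_ID]


type_ids_single = create_token_type_ids_from_sequences([10, 11, 12])
type_ids_pair = create_token_type_ids_from_sequences([10, 11], [20, 21, 22])
print(type_ids_single)
print(type_ids_pair)
```

The returned list always has the same length as the output of build_inputs_with_special_tokens for the same arguments, so the two can be zipped position by position.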