tokenizer#
- class ProphetNetTokenizer(vocab_file, do_lower_case=True, do_basic_tokenize=True, unk_token='[UNK]', sep_token='[SEP]', bos_token='[SEP]', eos_token='[SEP]', cls_token='[CLS]', x_sep_token='[X_SEP]', pad_token='[PAD]', mask_token='[MASK]', **kwargs)[source]#
Construct a ProphetNetTokenizer. Based on WordPiece.
This tokenizer inherits from `PreTrainedTokenizer`, which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
- Parameters:
  - vocab_file (`str`) -- File containing the vocabulary.
  - do_lower_case (`bool`, optional, defaults to `True`) -- Whether or not to lowercase the input when tokenizing.
  - do_basic_tokenize (`bool`, optional, defaults to `True`) -- Whether or not to do basic tokenization before WordPiece.
  - unk_token (`str`, optional, defaults to `"[UNK]"`) -- The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
  - sep_token (`str`, optional, defaults to `"[SEP]"`) -- The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification, or a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
  - x_sep_token (`str`, optional, defaults to `"[X_SEP]"`) -- Special second separator token, which can be generated by `ProphetNetForConditionalGeneration`. It is used to separate bullet-point-like sentences in summarization.
  - pad_token (`str`, optional, defaults to `"[PAD]"`) -- The token used for padding, for example when batching sequences of different lengths.
  - cls_token (`str`, optional, defaults to `"[CLS]"`) -- The classifier token, which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
  - mask_token (`str`, optional, defaults to `"[MASK]"`) -- The token used for masking values. This is the token used when training this model with masked language modeling, and the token the model will try to predict.
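A minimal construction sketch follows. The checkpoint name `prophetnet-large-uncased` is an assumption; substitute any ProphetNet checkpoint or vocabulary file available in your PaddleNLP installation.

```python
from paddlenlp.transformers import ProphetNetTokenizer

# Assumed checkpoint name; replace with one available in your environment.
tokenizer = ProphetNetTokenizer.from_pretrained("prophetnet-large-uncased")

# do_lower_case=True by default, so input is lowercased before WordPiece.
print(tokenizer.tokenize("Hello World"))
```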
- property vocab_size#
Size of the base vocabulary (without the added tokens).
- Type:
  `int`
- get_vocab()[source]#
  Returns the vocabulary as a dictionary of token to index.
  `tokenizer.get_vocab()[token]` is equivalent to `tokenizer.convert_tokens_to_ids(token)` when `token` is in the vocab.
  - Returns:
    The vocabulary.
  - Return type:
    `Dict[str, int]`
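A short sketch of the documented equivalence, reusing the `tokenizer` from the construction example above (the token `"hello"` is assumed to be in the vocabulary):

```python
vocab = tokenizer.get_vocab()  # Dict[str, int]

token = "hello"  # assumed to be in the vocabulary
assert vocab[token] == tokenizer.convert_tokens_to_ids(token)

# vocab_size counts only the base vocabulary, so it can be smaller
# than len(vocab) once extra tokens have been added.
print(tokenizer.vocab_size, len(vocab))
```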
- tokenize(text)[source]#
  Converts a string into a sequence of tokens, using the tokenizer.
  Splits into words for word-based vocabularies, or into sub-words for sub-word-based vocabularies (BPE/SentencePiece/WordPiece). Takes care of added tokens.
  - Parameters:
    - text (`str`) -- The sequence to be encoded.
    - **kwargs (additional keyword arguments) -- Passed along to the model-specific `prepare_for_tokenization` preprocessing method.
  - Returns:
    The list of tokens.
  - Return type:
    `List[str]`
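For illustration (the exact sub-word split depends on the vocabulary, so the comment below is only indicative):

```python
tokens = tokenizer.tokenize("ProphetNet predicts future n-grams.")
print(tokens)
# Out-of-vocabulary words are split into WordPiece sub-words prefixed
# with '##', e.g. something like ['prophet', '##net', 'predicts', ...].
```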
- convert_tokens_to_ids(tokens)[source]#
  Converts a sequence of tokens into ids using the `vocab` attribute (an instance of `Vocab`). Override it if needed.
  - Parameters:
    - tokens (`str` or `List[str]`) -- One or several token(s) to convert to id(s).
  - Returns:
    The converted id or list of ids.
  - Return type:
    `int` or `List[int]`
- convert_ids_to_tokens(ids, skip_special_tokens=False)[source]#
  Converts a single index or a sequence of indices to a token or a sequence of tokens, using the vocabulary and added tokens.
  - Parameters:
    - ids (`int` or `List[int]`) -- The token id (or token ids) to be converted to token(s).
    - skip_special_tokens (`bool`, optional, defaults to `False`) -- Whether or not to remove special tokens in the decoding.
  - Returns:
    The decoded token(s).
  - Return type:
    `str` or `List[str]`
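The two conversion methods invert each other for tokens in the vocabulary, as this sketch (with the same assumed `tokenizer`) shows:

```python
tokens = tokenizer.tokenize("hello world")
ids = tokenizer.convert_tokens_to_ids(tokens)   # List[int]
back = tokenizer.convert_ids_to_tokens(ids)     # List[str]
assert back == tokens

# A single token or id also works and returns a scalar rather than a list.
one_id = tokenizer.convert_tokens_to_ids("hello")
assert tokenizer.convert_ids_to_tokens(one_id) == "hello"
```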
- get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]#
  Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer's `prepare_for_model` method.
  - Parameters:
    - token_ids_0 (`List[int]`) -- List of IDs.
    - token_ids_1 (`List[int]`, optional) -- Optional second list of IDs for sequence pairs.
    - already_has_special_tokens (`bool`, optional, defaults to `False`) -- Whether or not the token list is already formatted with special tokens for the model.
  - Returns:
    A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
  - Return type:
    `List[int]`
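A sketch of both calling conventions, reusing the assumed `tokenizer`; the exact outputs depend on the special-token layout, so the comments are indicative only:

```python
seq = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("hello world"))

# Without special tokens present: the mask describes where special tokens
# would be placed, e.g. 1s at the boundaries and 0s for sequence tokens.
print(tokenizer.get_special_tokens_mask(seq))

# If the ids already contain special tokens, say so explicitly.
with_special = tokenizer.build_inputs_with_special_tokens(seq)
print(tokenizer.get_special_tokens_mask(with_special, already_has_special_tokens=True))
```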
- create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)[source]#
  Creates a mask from the two sequences passed, to be used in a sequence-pair classification task. A ProphetNet sequence pair mask has the following format:

  ```
  0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
  | first sequence      | second sequence |
  ```

  If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
  - Parameters:
    - token_ids_0 (`List[int]`) -- List of IDs.
    - token_ids_1 (`List[int]`, optional) -- Optional second list of IDs for sequence pairs.
  - Returns:
    List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
  - Return type:
    `List[int]`
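A sketch using the same assumed `tokenizer`:

```python
a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("how are you"))
b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("fine thanks"))

# Single sequence: all token type IDs are 0.
print(tokenizer.create_token_type_ids_from_sequences(a))
# Sequence pair: 0s for the first segment, 1s for the second.
print(tokenizer.create_token_type_ids_from_sequences(a, b))
```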
- build_inputs_with_special_tokens(token_ids_0, token_ids_1=None) → List[int][source]#
  Builds model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A ProphetNet sequence has the following format:
  - single sequence: `[CLS] X [SEP]`
  - pair of sequences: `[CLS] A [SEP] B [SEP]`
  - Parameters:
    - token_ids_0 (`List[int]`) -- List of IDs to which the special tokens will be added.
    - token_ids_1 (`List[int]`, optional) -- Optional second list of IDs for sequence pairs.
  - Returns:
    List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
  - Return type:
    `List[int]`
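A sketch of both input layouts, again reusing the assumed `tokenizer`:

```python
a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("how are you"))
b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("fine thanks"))

single = tokenizer.build_inputs_with_special_tokens(a)
pair = tokenizer.build_inputs_with_special_tokens(a, b)

# Decode back to tokens to see where the special tokens were inserted.
print(tokenizer.convert_ids_to_tokens(single))
print(tokenizer.convert_ids_to_tokens(pair))
```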