tokenizer#
Tokenization classes for the LayoutXLM model.
- class LayoutXLMTokenizer(vocab_file, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', **kwargs)[source]#
- build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: List[int] | None = None) → List[int] [source]#
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens.
This base implementation does not add special tokens; the method should be overridden in a subclass.
- Parameters:
  token_ids_0 (List[int]) -- The first tokenized sequence.
  token_ids_1 (List[int], optional) -- The second tokenized sequence.
- Returns:
  The model input with special tokens.
- Return type:
  List[int]
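The concatenation can be sketched in plain Python. This is a minimal illustration assuming an XLM-RoBERTa-style layout (`<s> A </s>` for a single sequence, `<s> A </s></s> B </s>` for a pair) and placeholder ids for `<s>` and `</s>`; the real tokenizer takes these ids from its vocabulary.

```python
from typing import List, Optional

# Hypothetical ids for <s> (cls) and </s> (sep); the real values
# come from the tokenizer's vocabulary.
CLS_ID, SEP_ID = 0, 2

def build_inputs_with_special_tokens(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
) -> List[int]:
    # Single sequence: <s> A </s>
    if token_ids_1 is None:
        return [CLS_ID] + token_ids_0 + [SEP_ID]
    # Pair: <s> A </s></s> B </s>
    return [CLS_ID] + token_ids_0 + [SEP_ID, SEP_ID] + token_ids_1 + [SEP_ID]
```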
- get_special_tokens_mask(token_ids_0: List[int], token_ids_1: List[int] | None = None, already_has_special_tokens: bool = False) → List[int] [source]#
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer `encode` methods.
- Parameters:
  token_ids_0 (List[int]) -- List of ids of the first sequence.
  token_ids_1 (List[int], optional) -- List of ids of the second sequence.
  already_has_special_tokens (bool, optional) -- Whether or not the token list is already formatted with special tokens for the model. Defaults to False.
- Returns:
  The list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
- Return type:
  List[int]
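The mask semantics can be sketched as follows. This is an illustration, not the library's implementation; it assumes a `<s> A </s></s> B </s>` pair layout and hypothetical special-token ids.

```python
from typing import List, Optional

SPECIAL_IDS = {0, 2}  # hypothetical ids for <s> and </s>

def get_special_tokens_mask(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    already_has_special_tokens: bool = False,
) -> List[int]:
    if already_has_special_tokens:
        # Mark every token in the given list that is a special token.
        return [1 if t in SPECIAL_IDS else 0 for t in token_ids_0]
    # Otherwise emit the mask for the positions the specials would take:
    # <s> A </s>  or  <s> A </s></s> B </s>
    if token_ids_1 is None:
        return [1] + [0] * len(token_ids_0) + [1]
    return [1] + [0] * len(token_ids_0) + [1, 1] + [0] * len(token_ids_1) + [1]
```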
- create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: List[int] | None = None) → List[int] [source]#
Create the token type IDs corresponding to the sequences passed. [What are token type IDs?](../glossary#token-type-ids)
Should be overridden in a subclass if the model has a special way of building those.
- Parameters:
  token_ids_0 (List[int]) -- The first tokenized sequence.
  token_ids_1 (List[int], optional) -- The second tokenized sequence.
- Returns:
  The token type ids.
- Return type:
  List[int]
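For illustration, the sketch below shows the generic BERT-style 0/1 segmentation over a `<s> A </s></s> B </s>` pair layout; this is an assumed scheme, and many XLM-RoBERTa-derived tokenizers instead return all zeros, so check the model's own convention.

```python
from typing import List, Optional

def create_token_type_ids_from_sequences(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
) -> List[int]:
    # Segment 0 covers <s> A </s>; segment 1 covers </s> B </s>.
    # (Lengths assume the <s> A </s></s> B </s> pair layout.)
    if token_ids_1 is None:
        return [0] * (len(token_ids_0) + 2)
    return [0] * (len(token_ids_0) + 2) + [1] * (len(token_ids_1) + 2)
```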
- property vocab_size#
Size of the base vocabulary (without the added tokens).
- Type:
int
- get_vocab()[source]#
Returns the vocabulary as a dictionary of token to index. `tokenizer.get_vocab()[token]` is equivalent to `tokenizer.convert_tokens_to_ids(token)` when `token` is in the vocab.
- Returns:
  The vocabulary.
- Return type:
  Dict[str, int]
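The stated equivalence can be seen with a toy vocabulary (entries and ids below are hypothetical, not the real LayoutXLM vocab):

```python
# A toy vocabulary with hypothetical entries and ids.
vocab = {"<s>": 0, "</s>": 2, "<unk>": 3, "\u2581hello": 101}

def get_vocab():
    # Return a copy so callers cannot mutate the internal mapping.
    return dict(vocab)

def convert_tokens_to_ids(token):
    # Tokens outside the vocab fall back to the <unk> id.
    return vocab.get(token, vocab["<unk>"])
```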
- convert_tokens_to_string(tokens)[source]#
Converts a sequence of tokens (strings for sub-words) into a single string.
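Assuming SentencePiece-style pieces, where the "▁" (U+2581) marker denotes a word boundary, detokenization can be sketched as:

```python
SPIECE_UNDERLINE = "\u2581"  # the SentencePiece word-boundary marker

def convert_tokens_to_string(tokens):
    # Join sub-word pieces and turn the word-boundary marker
    # back into a plain space.
    return "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()
```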
- num_special_tokens_to_add(pair=False)[source]#
Returns the number of added tokens when encoding a sequence with special tokens.
- Parameters:
  pair (bool, optional) -- Whether the number of added tokens should be computed for a sequence pair or a single sequence. Defaults to `False`.
- Returns:
  Number of special tokens added to sequences.
- Return type:
  int
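Under the assumed `<s> A </s>` / `<s> A </s></s> B </s>` layouts, the counts reduce to constants; a sketch:

```python
def num_special_tokens_to_add(pair: bool = False) -> int:
    # Counts follow the assumed layouts:
    #   single: <s> A </s>            -> 2 special tokens
    #   pair:   <s> A </s></s> B </s> -> 4 special tokens
    return 4 if pair else 2
```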