tokenizer#
Tokenization classes for the LayoutXLM model.
- class LayoutXLMTokenizer(vocab_file, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', **kwargs)[source]#
Bases:
PretrainedTokenizer
- build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: List[int] | None = None) List[int] [source]#
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A LayoutXLM sequence has the following format:
- single sequence: `<s> A </s>`
- pair of sequences: `<s> A </s></s> B </s>`
- Parameters:
token_ids_0 (List[int]) – The first tokenized sequence.
token_ids_1 (List[int], optional) – The second tokenized sequence. Defaults to None.
- Returns:
The model input with special tokens.
- Return type:
List[int]
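The formatting rule above can be sketched as a standalone function. This is an illustrative reimplementation, not the library code; the `cls_id` and `sep_id` defaults are assumed placeholder values, whereas the real tokenizer resolves the ids of `<s>` and `</s>` from its vocabulary.

```python
from typing import List, Optional

def build_inputs_with_special_tokens(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    cls_id: int = 0,   # assumed id of <s>
    sep_id: int = 2,   # assumed id of </s>
) -> List[int]:
    """Wrap one or two token-id sequences in XLM-RoBERTa-style special tokens."""
    if token_ids_1 is None:
        # single sequence: <s> A </s>
        return [cls_id] + token_ids_0 + [sep_id]
    # pair of sequences: <s> A </s></s> B </s>
    return [cls_id] + token_ids_0 + [sep_id, sep_id] + token_ids_1 + [sep_id]
```

Note the doubled separator between the two sequences of a pair, which XLM-RoBERTa-style tokenizers use instead of a single `</s>`.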
- get_special_tokens_mask(token_ids_0: List[int], token_ids_1: List[int] | None = None, already_has_special_tokens: bool = False) List[int] [source]#
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer `encode` methods.
- Parameters:
token_ids_0 (List[int]) – List of ids of the first sequence.
token_ids_1 (List[int], optional) – List of ids of the second sequence.
already_has_special_tokens (bool, optional) – Whether or not the token list is already formatted with special tokens for the model. Defaults to False.
- Returns:
- The list of integers in the range [0, 1]:
1 for a special token, 0 for a sequence token.
- Return type:
results (List[int])
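A minimal sketch of the masking logic, mirroring the sequence format described above. The `special_ids` default is an assumption for illustration; the real tokenizer checks against its own special-token ids.

```python
from typing import List, Optional

def get_special_tokens_mask(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    already_has_special_tokens: bool = False,
    special_ids=(0, 2),  # assumed ids of <s> and </s>
) -> List[int]:
    """Return 1 for special-token positions, 0 for sequence tokens."""
    if already_has_special_tokens:
        # The list is already formatted: mark positions holding special ids.
        return [1 if t in special_ids else 0 for t in token_ids_0]
    if token_ids_1 is None:
        # <s> A </s>
        return [1] + [0] * len(token_ids_0) + [1]
    # <s> A </s></s> B </s>
    return [1] + [0] * len(token_ids_0) + [1, 1] + [0] * len(token_ids_1) + [1]
```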
- create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: List[int] | None = None) List[int] [source]#
Create the token type IDs corresponding to the sequences passed. [What are token type IDs?](../glossary#token-type-ids)
LayoutXLM, like XLM-RoBERTa, does not make use of token type ids, so a list of zeros matching the length of the formatted input is returned.
- Parameters:
token_ids_0 (List[int]) – The first tokenized sequence.
token_ids_1 (List[int], optional) – The second tokenized sequence. Defaults to None.
- Returns:
The token type ids.
- Return type:
List[int]
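Since every position gets segment id 0, the method reduces to computing the length of the formatted input. A standalone sketch, assuming the `<s> A </s>` / `<s> A </s></s> B </s>` formats shown above:

```python
from typing import List, Optional

def create_token_type_ids_from_sequences(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
) -> List[int]:
    """Return all-zero token type ids sized to the formatted input."""
    if token_ids_1 is None:
        # <s> A </s> adds 2 special tokens.
        return [0] * (len(token_ids_0) + 2)
    # <s> A </s></s> B </s> adds 4 special tokens.
    return [0] * (len(token_ids_0) + len(token_ids_1) + 4)
```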
- property vocab_size#
Size of the base vocabulary (without the added tokens).
- Type:
int
- get_vocab()[source]#
Returns the vocabulary as a dictionary of token to index.
`tokenizer.get_vocab()[token]` is equivalent to `tokenizer.convert_tokens_to_ids(token)` when `token` is in the vocab.
- Returns:
The vocabulary.
- Return type:
Dict[str, int]
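The equivalence noted above can be demonstrated with a toy stand-in class. The vocabulary contents and the `ToyTokenizer` class are hypothetical, used only to illustrate the contract between `vocab_size`, `get_vocab`, and `convert_tokens_to_ids`:

```python
class ToyTokenizer:
    """Hypothetical stand-in for the vocab-related API (not the real class)."""

    def __init__(self):
        # Illustrative vocabulary; real vocab files are far larger.
        self._vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "\u2581hello": 4}

    @property
    def vocab_size(self) -> int:
        # Base vocabulary only, excluding any added tokens.
        return len(self._vocab)

    def get_vocab(self) -> dict:
        return dict(self._vocab)

    def convert_tokens_to_ids(self, token: str) -> int:
        # Unknown tokens fall back to the <unk> id.
        return self._vocab.get(token, self._vocab["<unk>"])

tok = ToyTokenizer()
```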
- convert_tokens_to_string(tokens)[source]#
Converts a sequence of tokens (strings for sub-words) into a single string.
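A minimal sketch of this detokenization, assuming SentencePiece-style pieces in which U+2581 (`▁`) marks the start of a word:

```python
def convert_tokens_to_string(tokens):
    # Join the sub-word pieces, turn each word-start marker (U+2581) into a
    # space, and trim the leading space produced by the first marker.
    return "".join(tokens).replace("\u2581", " ").strip()
```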
- num_special_tokens_to_add(pair=False)[source]#
Returns the number of added tokens when encoding a sequence with special tokens.
- Parameters:
pair (bool, optional) – Whether the number of added tokens should be computed for a sequence pair rather than a single sequence. Defaults to `False`.
- Returns:
Number of special tokens added to sequences.
- Return type:
int
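Under the sequence formats sketched above, the count is a constant for each case. An illustrative reduction, assuming the `<s> A </s>` / `<s> A </s></s> B </s>` formats:

```python
def num_special_tokens_to_add(pair: bool = False) -> int:
    # <s> A </s> adds 2 special tokens;
    # <s> A </s></s> B </s> adds 4.
    return 4 if pair else 2
```

This count is what callers subtract from the model's maximum length to find how many content tokens fit before truncation.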