tokenizer#

Tokenization classes for LayoutXLM model.

class LayoutXLMTokenizer(vocab_file, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', **kwargs)[source]#

Bases: PretrainedTokenizer

build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: List[int] | None = None) List[int][source]#

Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens.

This implementation does not add special tokens and this method should be overridden in a subclass.

Parameters:
  • token_ids_0 (List[int]) – The first tokenized sequence.

  • token_ids_1 (List[int], optional) – The second tokenized sequence.

Returns:

The model input with special tokens.

Return type:

List[int]

get_special_tokens_mask(token_ids_0: List[int], token_ids_1: List[int] | None = None, already_has_special_tokens: bool = False) List[int][source]#

Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer encode methods.

Parameters:
  • token_ids_0 (List[int]) – List of ids of the first sequence.

  • token_ids_1 (List[int], optional) – List of ids of the second sequence.

  • already_has_special_tokens (bool, optional) – Whether or not the token list is already formatted with special tokens for the model. Defaults to None.

Returns:

The list of integers in the range [0, 1]:

1 for a special token, 0 for a sequence token.

Return type:

results (List[int])

create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: List[int] | None = None) List[int][source]#

Create the token type IDs corresponding to the sequences passed. [What are token type IDs?](../glossary#token-type-ids)

Should be overridden in a subclass if the model has a special way of building those.

Parameters:
  • token_ids_0 (List[int]) – The first tokenized sequence.

  • token_ids_1 (List[int], optional) – The second tokenized sequence.

Returns:

The token type ids.

Return type:

List[int]

property vocab_size#

Size of the base vocabulary (without the added tokens).

Type:

int

get_vocab()[source]#

Returns the vocabulary as a dictionary of token to index.

tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

Returns:

The vocabulary.

Return type:

Dict[str, int]

convert_tokens_to_string(tokens)[source]#

Converts a sequence of tokens (strings for sub-words) in a single string.

num_special_tokens_to_add(pair=False)[source]#

Returns the number of added tokens when encoding a sequence with special tokens.

Parameters:

pair (bool, optional) – Whether the number of added tokens should be computed in the case of a sequence pair or a single sequence. Defaults to False.

Returns:

Number of special tokens added to sequences.

Return type:

int