tokenizer¶
Tokenization classes for LayoutXLM model.
-
class
LayoutXLMTokenizer
(vocab_file, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', **kwargs)[source]¶ Bases:
paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
-
build_inputs_with_special_tokens
(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]¶ Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens.
This implementation does not add special tokens and this method should be overridden in a subclass.
- Parameters
token_ids_0 (
List[int]
) – The first tokenized sequence.token_ids_1 (
List[int]
, optional) – The second tokenized sequence.
- Returns
The model input with special tokens.
- Return type
List[int]
-
get_special_tokens_mask
(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False) → List[int][source]¶ Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer
encode
methods.- Parameters
token_ids_0 (List[int]) – List of ids of the first sequence.
token_ids_1 (List[int], optional) – List of ids of the second sequence.
already_has_special_tokens (bool, optional) – Whether or not the token list is already formatted with special tokens for the model. Defaults to None.
- Returns
- The list of integers in the range [0, 1]:
1 for a special token, 0 for a sequence token.
- Return type
results (List[int])
-
create_token_type_ids_from_sequences
(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]¶ Create the token type IDs corresponding to the sequences passed. [What are token type IDs?](../glossary#token-type-ids)
Should be overridden in a subclass if the model has a special way of building those.
- Parameters
token_ids_0 (
List[int]
) – The first tokenized sequence.token_ids_1 (
List[int]
, optional) – The second tokenized sequence.
- Returns
The token type ids.
- Return type
List[int]
-
property
vocab_size
¶ Size of the base vocabulary (without the added tokens).
- Type
int
-
get_vocab
()[source]¶ Returns the vocabulary as a dictionary of token to index.
tokenizer.get_vocab()[token]
is equivalent totokenizer.convert_tokens_to_ids(token)
whentoken
is in the vocab.- Returns
The vocabulary.
- Return type
Dict[str, int]
-
convert_tokens_to_string
(tokens)[source]¶ Converts a sequence of tokens (strings for sub-words) in a single string.
-
num_special_tokens_to_add
(pair=False)[source]¶ Returns the number of added tokens when encoding a sequence with special tokens.
- Parameters
pair (bool, optional) – Whether the number of added tokens should be computed in the case of a sequence pair or a single sequence. Defaults to
False
.- Returns
Number of special tokens added to sequences.
- Return type
int
-