tokenizer

Tokenization classes for the LayoutXLM model.

class LayoutXLMTokenizer(vocab_file, bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', **kwargs)[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An XLM-RoBERTa-style sequence has the following format:

  • single sequence: <s> X </s>

  • pair of sequences: <s> A </s></s> B </s>

Parameters
  • token_ids_0 (List[int]) -- The first tokenized sequence.

  • token_ids_1 (List[int], optional) -- The second tokenized sequence.

Returns

The model input with special tokens.

Return type

List[int]
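As a rough illustration of the single- and pair-sequence layouts above, here is a minimal sketch assuming the standard XLM-RoBERTa special-token ids (<s>=0, </s>=2); the other token ids are made up for the example:

```python
# Assumed XLM-RoBERTa special-token ids: <s>=0, </s>=2.
CLS_ID, SEP_ID = 0, 2

def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    if token_ids_1 is None:
        # single sequence: <s> A </s>
        return [CLS_ID] + token_ids_0 + [SEP_ID]
    # pair of sequences: <s> A </s></s> B </s>
    return [CLS_ID] + token_ids_0 + [SEP_ID, SEP_ID] + token_ids_1 + [SEP_ID]

print(build_inputs_with_special_tokens([10, 11]))        # [0, 10, 11, 2]
print(build_inputs_with_special_tokens([10, 11], [20]))  # [0, 10, 11, 2, 2, 20, 2]
```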

get_special_tokens_mask(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False) → List[int][source]

Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer encode methods.

Parameters
  • token_ids_0 (List[int]) -- List of ids of the first sequence.

  • token_ids_1 (List[int], optional) -- List of ids of the second sequence.

  • already_has_special_tokens (bool, optional) -- Whether or not the token list is already formatted with special tokens for the model. Defaults to False.

Returns

The list of integers in the range [0, 1]:

1 for a special token, 0 for a sequence token.

Return type

List[int]
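The mask mirrors where special tokens land in the built inputs. A minimal sketch, assuming the XLM-RoBERTa-style layout (<s> A </s> and <s> A </s></s> B </s>) and no pre-existing special tokens:

```python
def get_special_tokens_mask(token_ids_0, token_ids_1=None):
    # 1 marks positions filled by special tokens (<s>, </s>);
    # 0 marks ordinary sequence tokens.
    if token_ids_1 is None:
        return [1] + [0] * len(token_ids_0) + [1]
    return [1] + [0] * len(token_ids_0) + [1, 1] + [0] * len(token_ids_1) + [1]

print(get_special_tokens_mask([10, 11]))        # [1, 0, 0, 1]
print(get_special_tokens_mask([10, 11], [20]))  # [1, 0, 0, 1, 1, 0, 1]
```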

create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]

Create the token type IDs corresponding to the sequences passed. [What are token type IDs?](../glossary#token-type-ids)

XLM-RoBERTa-style models do not make use of token type ids, so a list of zeros is returned.

Parameters
  • token_ids_0 (List[int]) -- The first tokenized sequence.

  • token_ids_1 (List[int], optional) -- The second tokenized sequence.

Returns

The token type ids.

Return type

List[int]
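Since XLM-RoBERTa-style tokenizers do not use token type ids, the result is a zero list whose length matches the input once special tokens are counted in. A minimal sketch (assuming the <s> A </s> / <s> A </s></s> B </s> layout):

```python
def create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None):
    if token_ids_1 is None:
        # zeros for <s> A </s>
        return [0] * (len(token_ids_0) + 2)
    # zeros for <s> A </s></s> B </s>
    return [0] * (len(token_ids_0) + len(token_ids_1) + 4)

print(create_token_type_ids_from_sequences([10, 11]))        # [0, 0, 0, 0]
print(create_token_type_ids_from_sequences([10, 11], [20]))  # [0, 0, 0, 0, 0, 0, 0]
```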

property vocab_size

Size of the base vocabulary (without the added tokens).

Type

int

get_vocab()[source]

Returns the vocabulary as a dictionary of token to index.

tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

Returns

The vocabulary.

Return type

Dict[str, int]
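The documented equivalence can be illustrated with a toy stand-in for the returned mapping; the entries below are hypothetical, not real LayoutXLM ids:

```python
# Hypothetical vocab, playing the role of tokenizer.get_vocab().
vocab = {"<s>": 0, "</s>": 2, "<unk>": 3, "▁hello": 1500, "▁world": 801}

def convert_tokens_to_ids(token):
    # Unknown tokens fall back to <unk>; for in-vocab tokens this agrees
    # with a plain dictionary lookup, as the docstring states.
    return vocab.get(token, vocab["<unk>"])

assert all(vocab[tok] == convert_tokens_to_ids(tok) for tok in vocab)
print(convert_tokens_to_ids("▁hello"))  # 1500
```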

convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (strings for sub-words) into a single string.
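SentencePiece sub-words mark the start of each word with the "▁" character, so detokenization is essentially a join-and-replace; a minimal sketch:

```python
SPIECE_UNDERLINE = "▁"  # SentencePiece word-boundary marker

def convert_tokens_to_string(tokens):
    # Concatenate the pieces, then restore spaces at word boundaries.
    return "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()

print(convert_tokens_to_string(["▁hello", "▁wor", "ld"]))  # hello world
```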

num_special_tokens_to_add(pair=False)[source]

Returns the number of added tokens when encoding a sequence with special tokens.

Parameters

pair (bool, optional) -- Whether the number of added tokens should be computed in the case of a sequence pair or a single sequence. Defaults to False.

Returns

Number of special tokens added to sequences.

Return type

int
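A common way to compute this count is to build the special-token layout around empty sequences and measure its length; a sketch assuming the XLM-RoBERTa-style layout with ids <s>=0 and </s>=2:

```python
def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    # Assumed XLM-RoBERTa layout: <s> A </s> / <s> A </s></s> B </s>.
    if token_ids_1 is None:
        return [0] + token_ids_0 + [2]
    return [0] + token_ids_0 + [2, 2] + token_ids_1 + [2]

def num_special_tokens_to_add(pair=False):
    # Encode empty sequences and count what gets added.
    if pair:
        return len(build_inputs_with_special_tokens([], []))
    return len(build_inputs_with_special_tokens([]))

print(num_special_tokens_to_add())           # 2
print(num_special_tokens_to_add(pair=True))  # 4
```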