tokenizer

class RemBertTokenizer(vocab_file, do_lower_case=False, remove_space=True, keep_accents=True, cls_token='[CLS]', unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', mask_token='[MASK]', **kwargs)

Bases: PretrainedTokenizer

Constructs a RemBertTokenizer. Most of the main methods are inherited from PretrainedTokenizer; refer to that superclass for more information about them.

Parameters:
  • vocab_file (str) – The vocabulary file path (ends with ‘.txt’) required to instantiate a WordpieceTokenizer.

  • do_lower_case (bool, optional) – Whether or not to lowercase the input when tokenizing. Defaults to False.

  • unk_token (str, optional) – A special token representing the unknown (out-of-vocabulary) token. A token that is not in the vocabulary is replaced with unk_token so that it can be converted to an ID. Defaults to “[UNK]”.

  • sep_token (str, optional) – A special token separating two different sentences in the same input. Defaults to “[SEP]”.

  • pad_token (str, optional) – A special token used to make arrays of tokens the same size for batching purposes. Defaults to “[PAD]”.

  • cls_token (str, optional) – A special token used for sequence classification. It is the first token of the sequence when built with special tokens. Defaults to “[CLS]”.

  • mask_token (str, optional) – A special token representing a masked token. In the masked language modeling task, this token replaces some input tokens, and the model tries to predict the original tokens. Defaults to “[MASK]”.

Examples

from paddlenlp.transformers import RemBertTokenizer
tokenizer = RemBertTokenizer.from_pretrained('rembert')

inputs = tokenizer('欢迎使用飞桨!')
print(inputs)

'''
{'input_ids': [312, 573, 36203, 3916, 9744, 242391, 646, 313],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0]}
'''

property vocab_size

Size of the base vocabulary (without the added tokens).

Type:

int

get_vocab()

Returns the vocabulary as a dictionary of token to index.

tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

Returns:

The vocabulary.

Return type:

Dict[str, int]
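
The equivalence above can be illustrated without loading a pretrained model. This is a minimal sketch with a toy vocabulary, not the library's implementation; `toy_vocab` and `toy_convert_tokens_to_ids` are hypothetical names introduced here, and the fallback to the unknown token follows the unk_token description in the class parameters.

```python
from typing import Dict, List

# Toy vocabulary for illustration only; a real vocabulary is far larger.
toy_vocab: Dict[str, int] = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}
UNK_TOKEN = "[UNK]"


def toy_convert_tokens_to_ids(tokens: List[str]) -> List[int]:
    """Hypothetical stand-in: look up each token, falling back to [UNK]."""
    unk_id = toy_vocab[UNK_TOKEN]
    return [toy_vocab.get(token, unk_id) for token in tokens]


# For a token that is in the vocabulary, both lookups agree.
print(toy_vocab["hello"] == toy_convert_tokens_to_ids(["hello"])[0])  # True
# An out-of-vocabulary token maps to the ID of the unknown token.
print(toy_convert_tokens_to_ids(["hello", "unseen"]))  # [2, 1]
```

Direct indexing into the dictionary raises KeyError for an unseen token, which is why the equivalence only holds when the token is in the vocabulary.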

convert_tokens_to_string(tokens)

Converts a sequence of tokens (a list of strings) to a single string by using ' '.join(tokens).

Parameters:

tokens (list[str]) – A sequence of tokens.

Returns:

Converted string.

Return type:

str

build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: List[int] | None = None) → List[int]

Builds model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating the sequences and adding special tokens. A RemBERT sequence has the following format:

  • single sequence: [CLS] X [SEP]

  • pair of sequences: [CLS] A [SEP] B [SEP]

Parameters:
  • token_ids_0 (List[int]) – List of IDs to which the special tokens will be added.

  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

Returns:

List of input IDs with the appropriate special tokens.

Return type:

List[int]
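
The format described for this method can be sketched in plain Python. This is a minimal illustration of the documented layout, not the library's implementation; the helper name `build_inputs` is hypothetical, and the IDs 312 and 313 for [CLS] and [SEP] are assumptions taken from the first and last IDs in the example output at the top of this page.

```python
from typing import List, Optional

# Assumed special-token IDs, for illustration only.
CLS_ID = 312
SEP_ID = 313


def build_inputs(token_ids_0: List[int],
                 token_ids_1: Optional[List[int]] = None) -> List[int]:
    """Hypothetical helper mirroring the documented layout."""
    # Single sequence: [CLS] X [SEP]
    output = [CLS_ID] + token_ids_0 + [SEP_ID]
    if token_ids_1 is not None:
        # Pair of sequences: [CLS] A [SEP] B [SEP]
        output += token_ids_1 + [SEP_ID]
    return output


print(build_inputs([10, 11]))        # [312, 10, 11, 313]
print(build_inputs([10, 11], [20]))  # [312, 10, 11, 313, 20, 313]
```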

get_special_tokens_mask(token_ids_0: List[int], token_ids_1: List[int] | None = None, already_has_special_tokens: bool = False) → List[int]

Retrieves the special-tokens mask for a token list that has no special tokens added yet (unless already_has_special_tokens is True). This method is called when adding special tokens using the tokenizer's prepare_for_model method.

Parameters:
  • token_ids_0 (List[int]) – List of IDs.

  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

  • already_has_special_tokens (bool, optional) – Whether or not the token list is already formatted with special tokens for the model. Defaults to False.

Returns:

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Return type:

List[int]
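
As a rough sketch of the mask described here, assuming the [CLS] A [SEP] B [SEP] layout documented for build_inputs_with_special_tokens: the helper name `special_tokens_mask` and the IDs 312 and 313 are assumptions for illustration, and the handling of already_has_special_tokens=True (marking any ID equal to an assumed special-token ID) is one plausible reading, not a confirmed description of the library's code.

```python
from typing import List, Optional

# Assumed special-token IDs, for illustration only.
CLS_ID, SEP_ID = 312, 313


def special_tokens_mask(token_ids_0: List[int],
                        token_ids_1: Optional[List[int]] = None,
                        already_has_special_tokens: bool = False) -> List[int]:
    """Hypothetical helper: 1 marks a special token, 0 a sequence token."""
    if already_has_special_tokens:
        # In this sketch, token_ids_0 is treated as the full formatted sequence.
        return [1 if token_id in (CLS_ID, SEP_ID) else 0 for token_id in token_ids_0]
    # [CLS] A [SEP]
    mask = [1] + [0] * len(token_ids_0) + [1]
    if token_ids_1 is not None:
        # ... B [SEP]
        mask += [0] * len(token_ids_1) + [1]
    return mask


print(special_tokens_mask([10, 11]))        # [1, 0, 0, 1]
print(special_tokens_mask([10, 11], [20]))  # [1, 0, 0, 1, 0, 1]
print(special_tokens_mask([312, 10, 313], already_has_special_tokens=True))  # [1, 0, 1]
```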

create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: List[int] | None = None) → List[int]

Create a mask from the two sequences passed to be used in a sequence-pair classification task. A RemBERT sequence pair mask has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |

If token_ids_1 is None, this method only returns the first portion of the mask (0s).

Parameters:
  • token_ids_0 (List[int]) – List of IDs.

  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

Returns:

List of token type IDs according to the given sequence(s).

Return type:

List[int]
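
The mask format for this method can be sketched as follows. This is a minimal illustration rather than the library's implementation; the helper name `token_type_ids` is hypothetical, and the sketch assumes that [CLS] and the first [SEP] are counted in the first segment and the final [SEP] in the second, matching the [CLS] A [SEP] B [SEP] layout.

```python
from typing import List, Optional


def token_type_ids(token_ids_0: List[int],
                   token_ids_1: Optional[List[int]] = None) -> List[int]:
    """Hypothetical helper: 0 for the first segment, 1 for the second."""
    # First segment covers [CLS] A [SEP]: len(A) + 2 positions.
    first = [0] * (len(token_ids_0) + 2)
    if token_ids_1 is None:
        return first
    # Second segment covers B [SEP]: len(B) + 1 positions.
    return first + [1] * (len(token_ids_1) + 1)


print(token_type_ids([10, 11]))            # [0, 0, 0, 0]
print(token_type_ids([10, 11], [20, 21]))  # [0, 0, 0, 0, 1, 1, 1]
```

Under these assumptions the result has the same length as the output of build_inputs_with_special_tokens for the same arguments, which is what a model expects when both lists are passed together.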

save_vocabulary(save_directory: str, filename_prefix: str | None = None)

Saves all tokens to a vocabulary file. The file contains one token per line, and the line number is the index of the corresponding token.

Parameters:
  • save_directory (str) – Directory in which to save the vocabulary file.

  • filename_prefix (str, optional) – Optional prefix for the name of the saved file. Defaults to None.
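
The one-token-per-line file format described for save_vocabulary can be sketched with a toy vocabulary. This is only an illustration of that format; `save_toy_vocab` and `load_toy_vocab` are hypothetical helpers, and the file name `vocab.txt` and the way the prefix is joined to it are assumptions, not the library's confirmed behavior.

```python
import os
import tempfile
from typing import Dict, Optional


def save_toy_vocab(vocab: Dict[str, int], save_directory: str,
                   filename_prefix: Optional[str] = None) -> str:
    """Hypothetical helper: write one token per line, ordered by index."""
    # File name and prefix handling are assumptions made for this sketch.
    name = (filename_prefix + "-" if filename_prefix else "") + "vocab.txt"
    path = os.path.join(save_directory, name)
    with open(path, "w", encoding="utf-8") as f:
        for token in sorted(vocab, key=vocab.get):
            f.write(token + "\n")
    return path


def load_toy_vocab(path: str) -> Dict[str, int]:
    """Rebuild the token-to-index mapping from 0-based line numbers."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): index for index, line in enumerate(f)}


toy_vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2}
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_toy_vocab(toy_vocab, tmp_dir, filename_prefix="demo")
    restored = load_toy_vocab(saved_path)

print(restored == toy_vocab)  # True
```

The round trip only works because the toy indices are contiguous and start at 0, so each line number equals the index of its token.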