tokenizer

class AutoTokenizer(*args, **kwargs)

Bases: object

AutoTokenizer is a generic tokenizer class that automatically retrieves the relevant tokenizer for the provided pretrained weights/vocabulary. It is instantiated as one of the base tokenizer classes when created with the AutoTokenizer.from_pretrained() classmethod.
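The dispatch idea described above can be sketched with a toy registry. This is only an illustration of the auto-class pattern, not PaddleNLP's implementation: ToyAutoTokenizer, ToyBertTokenizer, ToyGPTTokenizer, and the name-matching rule are all hypothetical.

```python
# Toy illustration of the auto-class pattern; NOT PaddleNLP's actual code.
# Every class and the registry below are hypothetical stand-ins.


class ToyBertTokenizer:
    def __init__(self, name):
        self.name = name


class ToyGPTTokenizer:
    def __init__(self, name):
        self.name = name


# Hypothetical registry: a name fragment mapped to a concrete tokenizer class.
_TOY_REGISTRY = {
    "bert": ToyBertTokenizer,
    "gpt": ToyGPTTokenizer,
}


class ToyAutoTokenizer:
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path):
        """Return an instance of the first registered class whose key
        appears in the given name (assumed matching rule)."""
        lowered = pretrained_model_name_or_path.lower()
        for key, tokenizer_cls in _TOY_REGISTRY.items():
            if key in lowered:
                return tokenizer_cls(pretrained_model_name_or_path)
        raise ValueError(
            f"No tokenizer registered for {pretrained_model_name_or_path!r}"
        )


tokenizer = ToyAutoTokenizer.from_pretrained("bert-base-uncased")
print(type(tokenizer).__name__)
# ToyBertTokenizer
```

The caller never names the concrete class; the factory classmethod picks it, which is why the returned object's type differs from the class the method was called on.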

classmethod from_pretrained(pretrained_model_name_or_path, from_hf_hub=False, subfolder=None, *model_args, **kwargs)

Creates a tokenizer instance of the class that matches the given pretrained model. Related resources are loaded by specifying the name of a built-in pretrained model, the name of a community-contributed pretrained model, or a local directory path.

Parameters
  • pretrained_model_name_or_path (str) –

    Name of a pretrained model or directory path to load from. The string can be:

    • Name of a built-in pretrained model.

    • Name of a community-contributed pretrained model.

    • Local directory path that contains the tokenizer-related resources and the tokenizer config file (“tokenizer_config.json”).

  • from_hf_hub (bool, optional) – Whether to load from the Hugging Face Hub. Defaults to False.

  • subfolder (str, optional) – Subfolder to load from. Only takes effect when loading from the Hugging Face Hub. Defaults to None.

  • *model_args (tuple) – Positional arguments for the tokenizer's __init__. If provided, they are used as positional argument values for tokenizer initialization.

  • **kwargs (dict) – Keyword arguments for the tokenizer's __init__. If provided, they update the pre-defined keyword argument values for tokenizer initialization.

Returns

An instance of PretrainedTokenizer.

Return type

PretrainedTokenizer

Example

from paddlenlp.transformers import AutoTokenizer

# Name of built-in pretrained model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(type(tokenizer))
# <class 'paddlenlp.transformers.bert.tokenizer.BertTokenizer'>

# Name of community-contributed pretrained model
tokenizer = AutoTokenizer.from_pretrained('yingyibiao/bert-base-uncased-sst-2-finetuned')
print(type(tokenizer))
# <class 'paddlenlp.transformers.bert.tokenizer.BertTokenizer'>

# Load from local directory path
tokenizer = AutoTokenizer.from_pretrained('./my_bert/')
print(type(tokenizer))
# <class 'paddlenlp.transformers.bert.tokenizer.BertTokenizer'>
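The from_hf_hub and subfolder parameters from the signature above can be combined as in the following minimal sketch. The helper names build_hub_kwargs and load_tokenizer_from_hf_hub are hypothetical, and the repository id and subfolder in the commented call are placeholders, not real resources. Actually loading requires paddlenlp and network access.

```python
def build_hub_kwargs(subfolder=None):
    """Build keyword arguments for loading from the Hugging Face Hub.

    Hypothetical helper: it only assembles the two parameters documented
    in the from_pretrained signature.
    """
    kwargs = {"from_hf_hub": True}
    if subfolder is not None:
        # Per the parameter description, subfolder only applies to Hub loading.
        kwargs["subfolder"] = subfolder
    return kwargs


def load_tokenizer_from_hf_hub(repo_id, subfolder=None):
    """Load a tokenizer from the Hugging Face Hub through AutoTokenizer."""
    # Imported lazily so the sketch can be defined without paddlenlp installed.
    from paddlenlp.transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id, **build_hub_kwargs(subfolder))


# Placeholder call; replace the repo id and subfolder with real values:
# tokenizer = load_tokenizer_from_hf_hub("some-org/some-model", subfolder="tokenizer")
```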