class OPTModel(vocab_size: int, hidden_size: int = 768, word_embed_proj_dim: int = 768, num_hidden_layers: int = 12, num_attention_heads: int = 12, intermediate_size: int = 3072, hidden_act: str = 'relu', hidden_dropout_prob: float = 0.1, attention_probs_dropout_prob: float = 0.1, max_position_embeddings: int = 512, type_vocab_size: int = 16, initializer_range: float = 0.02, pad_token_id: int = 0, eos_token_id: int = 7, bos_token_id: int = 0, eol_token_id: int = 3, normalize_before: bool = True, **kwargs)[source]

Bases: paddlenlp.transformers.opt.modeling.OPTPretrainedModel

The bare OPT Model transformer outputting raw hidden-states.

This model inherits from PretrainedModel. Refer to the superclass documentation for the generic methods.

This model is also a Paddle paddle.nn.Layer subclass. Use it as a regular Paddle Layer and refer to the Paddle documentation for all matters related to general usage and behavior.

  • vocab_size (int) – Vocabulary size of input_ids in OPTModel, which is also the vocabulary size of the token embedding matrix. Defines the number of different tokens that can be represented by the input_ids passed when calling OPTModel.

  • hidden_size (int, optional) – Dimensionality of the embedding layer and decoder layer. Defaults to 768.

  • num_hidden_layers (int, optional) – Number of hidden layers in the Transformer decoder. Defaults to 12.

  • num_attention_heads (int, optional) – Number of attention heads for each attention layer in the Transformer decoder. Defaults to 12.

  • intermediate_size (int, optional) – Dimensionality of the feed-forward (ff) layer in the decoder. Input tensors to ff layers are firstly projected from hidden_size to intermediate_size, and then projected back to hidden_size. Typically intermediate_size is larger than hidden_size. Defaults to 3072.

  • hidden_act (str, optional) – The non-linear activation function in the feed-forward layer. "gelu", "relu" and any other activation function supported by Paddle can be used. Defaults to "relu".

  • hidden_dropout_prob (float, optional) – The dropout probability for all fully connected layers in the embeddings and decoder. Defaults to 0.1.

  • attention_probs_dropout_prob (float, optional) – The dropout probability used in MultiHeadAttention in all decoder layers to drop some attention targets. Defaults to 0.1.

  • max_position_embeddings (int, optional) – The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input sequence. Defaults to 512.

  • type_vocab_size (int, optional) –

    The vocabulary size of the token_type_ids. Defaults to 16.


    Please do NOT use type_vocab_size, as it will become obsolete in the future.

  • initializer_range (float, optional) –

    The standard deviation of the normal initializer. Defaults to 0.02.


    A normal_initializer initializes weight matrices as normal distributions. See OPTPretrainedModel._init_weights() for how weights are initialized in OPTModel.

  • pad_token_id (int, optional) –

    The index of the padding token in the token vocabulary. Defaults to 0.
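The relationship between hidden_size, intermediate_size, hidden_act and initializer_range can be illustrated with a framework-free sketch. This is not the PaddleNLP implementation: the helper names (linear, relu, init_matrix, feed_forward) are hypothetical, the sizes are toy values, and dropout, layer normalization and residual connections are omitted.

```python
import random

def linear(x, weight, bias):
    """Compute x @ weight + bias for one vector; weight is [in_dim][out_dim]."""
    return [
        sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]

def relu(x):
    """The default hidden_act: clamp negative values to zero."""
    return [max(0.0, v) for v in x]

def init_matrix(rows, cols, std, rng):
    """Draw weights from a normal distribution whose std plays the role of initializer_range."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

def feed_forward(x, w_in, b_in, w_out, b_out):
    """Project hidden_size -> intermediate_size, apply ReLU, project back."""
    return linear(relu(linear(x, w_in, b_in)), w_out, b_out)

rng = random.Random(0)
hidden_size, intermediate_size = 8, 32  # toy sizes; the documented defaults are 768 and 3072
w_in = init_matrix(hidden_size, intermediate_size, 0.02, rng)
w_out = init_matrix(intermediate_size, hidden_size, 0.02, rng)
b_in = [0.0] * intermediate_size
b_out = [0.0] * hidden_size

x = [rng.gauss(0.0, 1.0) for _ in range(hidden_size)]
expanded = relu(linear(x, w_in, b_in))         # length intermediate_size
y = feed_forward(x, w_in, b_in, w_out, b_out)  # length hidden_size again
print(len(x), len(expanded), len(y))
```

The intermediate vector is four times wider than the input here, matching the 768 to 3072 ratio of the defaults.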

forward(input_ids, position_ids=None, attention_mask=None, use_cache=False, cache=None)[source]

The OPTModel forward method, overrides the __call__() special method.

  • input_ids (Tensor) – Indices of input sequence tokens in the vocabulary. They are numerical representations of tokens that build the input sequence. Its data type should be int64 and it has a shape of [batch_size, sequence_length].

  • position_ids (Tensor, optional) – Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, max_position_embeddings - 1]. Its data type should be int64 and it has a shape of [batch_size, sequence_length]. Defaults to None.

  • attention_mask (Tensor, optional) – Mask used in self-attention to avoid attending to unwanted positions, usually the subsequent positions. It is a tensor whose shape is broadcast to [batch_size, num_attention_heads, sequence_length, sequence_length]. For example, its shape can be [batch_size, sequence_length], [batch_size, sequence_length, sequence_length], or [batch_size, num_attention_heads, sequence_length, sequence_length]. Its data type should be float32. Masked positions hold a large negative value (such as -1e9) and unmasked positions hold 0. Defaults to None, which means no position is masked.

  • use_cache (bool, optional) – Whether or not to use cache. Defaults to False. If set to True, key value states will be returned and can be used to speed up decoding.

  • cache (list, optional) – A list in which each element is a tuple (incremental_cache, static_cache). See TransformerDecoder.gen_cache for more details. It is only used for inference and should be None for training. Defaults to None.
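How an additive attention_mask suppresses positions can be shown with a small pure-Python sketch. It assumes that masked positions are meant to carry a large negative bias (NEG = -1e9 is chosen for illustration), and the helpers causal_mask, padding_mask and combine are hypothetical names, not PaddleNLP functions. The result has the [batch_size, sequence_length, sequence_length] shape listed above.

```python
import math

NEG = -1e9  # assumed masking value; added to attention scores before softmax

def causal_mask(seq_len):
    """[seq_len, seq_len] mask: 0 for keys at or before the query, NEG for later keys."""
    return [[0.0 if k <= q else NEG for k in range(seq_len)] for q in range(seq_len)]

def padding_mask(lengths, seq_len):
    """[batch_size, seq_len] mask: 0 for real tokens, NEG for padding tokens."""
    return [[0.0 if k < n else NEG for k in range(seq_len)] for n in lengths]

def combine(causal, padding):
    """Broadcast both masks to [batch_size, seq_len, seq_len] and add them."""
    return [
        [[c + pad_row[k] for k, c in enumerate(c_row)] for c_row in causal]
        for pad_row in padding
    ]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Two sequences of padded length 3; the second has only 2 real tokens.
mask = combine(causal_mask(3), padding_mask([3, 2], 3))

# With uniform raw scores, the last query of the padded sequence
# spreads its attention over the two real tokens only.
scores = [0.0, 0.0, 0.0]
weights = softmax([s + m for s, m in zip(scores, mask[1][2])])
print(weights)
```

Because the mask is added to the scores rather than multiplied, a value of 0 leaves a position untouched while a large negative value drives its softmax weight to zero.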


Returns tensor encoder_output, which is the output at the last layer of the model. Its data type should be float32 and it has a shape of [batch_size, sequence_length, hidden_size].

Return type

Tensor

import paddle
from paddlenlp.transformers import OPTModel, GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('facebook/opt-125m')

model = OPTModel.from_pretrained('facebook/opt-125m')

inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!", return_token_type_ids=False)
inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)

class OPTPretrainedModel(*args, **kwargs)[source]

Bases: paddlenlp.transformers.model_utils.PretrainedModel

An abstract class for pretrained OPT models. It provides OPT related model_config_file, resource_files_names, pretrained_resource_files_map, pretrained_init_configuration, base_model_prefix for downloading and loading pretrained models. See PretrainedModel for more details.


Initialization hook


alias of paddlenlp.transformers.opt.modeling.OPTModel

class OPTForCausalLM(opt: paddlenlp.transformers.opt.modeling.OPTModel)[source]

Bases: paddlenlp.transformers.opt.modeling.OPTPretrainedModel

The OPT Model with a language modeling head on top.


opt (OPTModel) – An instance of OPTModel.

forward(input_ids, position_ids=None, attention_mask=None, use_cache=False, cache=None)[source]

  • input_ids (Tensor) – See OPTModel.

  • position_ids (Tensor, optional) – See OPTModel.

  • attention_mask (Tensor, optional) – See OPTModel.

  • use_cache (bool, optional) – See OPTModel.

  • cache (list, optional) – See OPTModel.


Returns tensor logits or tuple (logits, cached_kvs). If use_cache is True, the tuple (logits, cached_kvs) is returned; otherwise, the tensor logits is returned. logits is the output of the OPT model, and cached_kvs is the cache output of the OPT model when use_cache is True.

Return type

Tensor or tuple
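Why returning cached key/value states speeds up decoding can be sketched without any framework. The toy below is an assumption-laden illustration, not the OPT code: it uses a single attention head, identity projections (query, key and value are all the token vector itself), and a plain (keys, values) tuple rather than the (incremental_cache, static_cache) structure. The names attend, full_pass and cached_step are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    return [
        sum(w * v[d] for w, v in zip(weights, values))
        for d in range(len(values[0]))
    ]

def full_pass(xs):
    """Without a cache: every position re-reads the whole prefix."""
    return [attend(xs[t], xs[: t + 1], xs[: t + 1]) for t in range(len(xs))]

def cached_step(x_new, cache):
    """With a cache: append the new key/value and attend once for the new token."""
    keys, values = cache
    keys = keys + [x_new]
    values = values + [x_new]
    return attend(x_new, keys, values), (keys, values)

tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
full = full_pass(tokens)

cache = ([], [])
incremental = []
for x in tokens:
    out, cache = cached_step(x, cache)  # the cache grows by one entry per step
    incremental.append(out)

print(full == incremental)
```

Both paths produce the same outputs, but the cached path performs one attention call per new token instead of recomputing every earlier position.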


import paddle
from paddlenlp.transformers import OPTForCausalLM, GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('facebook/opt-125m')
model = OPTForCausalLM.from_pretrained('facebook/opt-125m')

inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!")
inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()}
output_ids, score = model.generate(input_ids=inputs['input_ids'])