modeling#

class GPTModel(config: GPTConfig)[source]#

Bases: GPTPretrainedModel

The bare GPT Model transformer outputting raw hidden-states.

This model inherits from PretrainedModel. Refer to the superclass documentation for the generic methods.

This model is also a Paddle paddle.nn.Layer subclass. Use it as a regular Paddle Layer and refer to the Paddle documentation for all matters related to general usage and behavior.

Parameters:
  • vocab_size (int) – Vocabulary size of inputs_ids in GPTModel. It is also the vocab size of the token embedding matrix. Defines the number of different tokens that can be represented by the inputs_ids passed when calling GPTModel.

  • hidden_size (int, optional) – Dimensionality of the embedding layer and decoder layer. Defaults to 768.

  • num_hidden_layers (int, optional) – Number of hidden layers in the Transformer decoder. Defaults to 12.

  • num_attention_heads (int, optional) – Number of attention heads for each attention layer in the Transformer decoder. Defaults to 12.

  • intermediate_size (int, optional) – Dimensionality of the feed-forward (ff) layer in the decoder. Input tensors to ff layers are firstly projected from hidden_size to intermediate_size, and then projected back to hidden_size. Typically intermediate_size is larger than hidden_size. Defaults to 3072.

  • hidden_act (str, optional) – The non-linear activation function in the feed-forward layer. "gelu", "relu" and any other activation function supported by Paddle can be used. Defaults to "gelu".

  • hidden_dropout_prob (float, optional) – The dropout probability for all fully connected layers in the embeddings and decoder. Defaults to 0.1.

  • attention_probs_dropout_prob (float, optional) – The dropout probability used in MultiHeadAttention in all decoder layers to drop some attention targets. Defaults to 0.1.

  • max_position_embeddings (int, optional) – The maximum number of positions supported by the position encoding, which dictates the maximum supported length of an input sequence. Defaults to 512.

  • type_vocab_size (int, optional) –

    The vocabulary size of the token_type_ids. Defaults to 16.

    Note

    Please do not use type_vocab_size; it will be deprecated in the future.

  • initializer_range (float, optional) –

    The standard deviation of the normal initializer. Defaults to 0.02.

    Note

    A normal_initializer initializes weight matrices as normal distributions. See GPTPretrainedModel._init_weights() for how weights are initialized in GPTModel.

  • pad_token_id (int, optional) – The index of padding token in the token vocabulary. Defaults to 0.
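
A minimal configuration sketch, assuming GPTConfig accepts the parameters documented above as keyword arguments; the concrete values are illustrative, and from_pretrained() should be used instead when pretrained weights are wanted:

import paddle
from paddlenlp.transformers import GPTConfig, GPTModel

# Assumption: GPTConfig exposes the constructor arguments listed above.
config = GPTConfig(
    vocab_size=50304,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
)
model = GPTModel(config)  # randomly initialized weights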

get_input_embeddings()[source]#

Gets the input embedding of the model.

Returns:

The input embedding of the model.

Return type:

nn.Embedding

set_input_embeddings(value)[source]#

Sets a new input embedding for the model.

Parameters:

value (Embedding) – The new input embedding for the model.

Raises:

NotImplementedError – If the model does not implement the set_input_embeddings method.
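
A minimal sketch of how these two accessors can be used together, for example after adding special tokens to the tokenizer; copying the old weights into the new embedding is omitted here, and the checkpoint name follows the examples below:

import paddle
from paddlenlp.transformers import GPTModel

model = GPTModel.from_pretrained('gpt2-medium-en')
old_emb = model.get_input_embeddings()            # nn.Embedding with weight of shape [vocab_size, hidden_size]
vocab_size, hidden_size = old_emb.weight.shape
new_emb = paddle.nn.Embedding(vocab_size + 8, hidden_size)
model.set_input_embeddings(new_emb)               # replace the input embedding in place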

forward(input_ids=None, position_ids=None, attention_mask=None, inputs_embeds=None, use_cache=False, past_key_values=None, output_attentions=False, output_hidden_states=False, return_dict=False)[source]#

The GPTModel forward method, overrides the __call__() special method.

Parameters:
  • input_ids (Tensor, optional) – Indices of input sequence tokens in the vocabulary. They are numerical representations of tokens that build the input sequence. Its data type should be int64 and it has a shape of [batch_size, sequence_length]. Defaults to None.

  • position_ids (Tensor, optional) – Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, max_position_embeddings - 1]. Shape as (batch_size, num_tokens) and dtype as int64. Defaults to None.

  • attention_mask (Tensor, optional) – Mask used in self attention to avoid performing attention to some unwanted positions, usually the subsequent positions. It is a tensor with shape broadcastable to [batch_size, num_attention_heads, sequence_length, sequence_length]. For example, its shape can be [batch_size, sequence_length], [batch_size, sequence_length, sequence_length] or [batch_size, num_attention_heads, sequence_length, sequence_length]. Its data type should be int64. The masked tokens have 0 values and the unmasked tokens have 1 values. Defaults to None, which means no positions are masked.

  • inputs_embeds (Tensor, optional) – Optionally, instead of passing input_ids you can choose to directly pass an embedded representation of shape (batch_size, sequence_length, hidden_size). This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix. Defaults to None.

  • use_cache (bool, optional) – Whether or not to use cache. Defaults to False. If set to True, key value states will be returned and can be used to speed up decoding.

  • past_key_values (list, optional) – It is only used for inference and should be None for training. Defaults to None.

  • output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail. Defaults to False.

  • output_hidden_states (bool, optional) – Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail. Defaults to False.

  • return_dict (bool, optional) – Whether to return a BaseModelOutputWithPastAndCrossAttentions object. If False, the output will be a tuple of tensors. Defaults to False.

Returns:

An instance of BaseModelOutputWithPastAndCrossAttentions if return_dict=True. Otherwise it returns a tuple of tensors corresponding to ordered and not None (depending on the input arguments) fields of BaseModelOutputWithPastAndCrossAttentions.

Especially, when return_dict=output_hidden_states=output_attentions=False, returns tensor outputs which is the output at the last layer of the model. Its data type should be float32 and it has a shape of [batch_size, sequence_length, hidden_size].

Example

import paddle
from paddlenlp.transformers import GPTModel, GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
model = GPTModel.from_pretrained('gpt2-medium-en')

inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!", return_token_type_ids=False)
inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
class GPTPretrainedModel(*args, **kwargs)[source]#

Bases: PretrainedModel

An abstract class for pretrained GPT models. It provides GPT related model_config_file, resource_files_names, pretrained_resource_files_map, pretrained_init_configuration, base_model_prefix for downloading and loading pretrained models. See PretrainedModel for more details.

config_class#

alias of GPTConfig

base_model_class#

alias of GPTModel

class GPTPretrainingCriterion(config)[source]#

Bases: Layer

Criterion for GPT. It calculates the final loss.

forward(prediction_scores, masked_lm_labels, loss_mask=None)[source]#
Parameters:
  • prediction_scores (Tensor) – The logits of masked token prediction. Its data type should be float32 and its shape is [batch_size, sequence_length, vocab_size].

  • masked_lm_labels (Tensor) – The labels of the masked language modeling; its dimensionality is equal to that of prediction_scores. Its data type should be int64 and its shape is [batch_size, sequence_length, 1].

  • loss_mask (Tensor) – Mask used for calculating the loss of the masked language modeling to avoid calculating some unwanted tokens. Its data type should be float32 and its shape is [batch_size, sequence_length, 1].

Returns:

The pretraining loss. Its data type should be float32 and its shape is [1].

Return type:

Tensor
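
A hedged sketch of computing the pretraining loss; the tensor shapes follow the parameter descriptions above, the random data is purely illustrative, and it assumes GPTPretrainingCriterion is importable from paddlenlp.transformers and that GPTConfig() provides usable defaults:

import paddle
from paddlenlp.transformers import GPTConfig, GPTPretrainingCriterion

config = GPTConfig()
criterion = GPTPretrainingCriterion(config)

batch_size, seq_len = 2, 8
prediction_scores = paddle.rand([batch_size, seq_len, config.vocab_size], dtype='float32')
masked_lm_labels = paddle.randint(0, config.vocab_size, [batch_size, seq_len, 1], dtype='int64')
loss_mask = paddle.ones([batch_size, seq_len, 1], dtype='float32')

loss = criterion(prediction_scores, masked_lm_labels, loss_mask)  # float32 tensor of shape [1]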

class GPTForGreedyGeneration(config: GPTConfig, max_predict_len: int = 32)[source]#

Bases: GPTPretrainedModel

The generation model for GPT-2. It uses the greedy strategy and generates the output sequence with the highest probability.

Parameters:
  • gpt (GPTModel) – An instance of paddlenlp.transformers.GPTModel.

  • max_predict_len (int) – The max length of the prediction.

model(input_ids, position_ids=None, attention_mask=None, masked_positions=None, use_cache=False, past_key_values=None)[source]#
Parameters:
  • input_ids (Tensor, optional) – See GPTModel.

  • position_ids (Tensor, optional) – See GPTModel.

  • attention_mask (Tensor, optional) – See GPTModel.

  • use_cache (bool, optional) – See GPTModel.

  • past_key_values (Tensor, optional) – See GPTModel.

Returns:

Returns tensor logits or tuple (logits, cached_kvs). If use_cache is True, the tuple (logits, cached_kvs) is returned; otherwise only the tensor logits is returned. logits is the output of the GPT model, and cached_kvs is the cache output of the GPT model when use_cache is True.

Return type:

Tensor or tuple

forward(input_ids)[source]#
Parameters:

input_ids (Tensor) – See GPTModel.

Returns:

Returns tensor src_ids, the indices of the output sequence tokens in the vocabulary. They are numerical representations of the tokens that build the output sequence.

Return type:

Tensor
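
A hedged usage sketch in the same style as the other examples on this page; passing max_predict_len through from_pretrained() and decoding with convert_ids_to_tokens() are assumptions made for illustration:

import paddle
from paddlenlp.transformers import GPTForGreedyGeneration, GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
model = GPTForGreedyGeneration.from_pretrained('gpt2-medium-en', max_predict_len=16)
model.eval()

inputs = tokenizer("Question: Where is the capital of France? Answer:", return_token_type_ids=False)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output_ids = model(**inputs)                  # prompt token ids followed by the greedily generated ids
print(tokenizer.convert_ids_to_tokens(output_ids[0].tolist()))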

GPTLMHeadModel#

alias of GPTForCausalLM

class GPTForTokenClassification(config: GPTConfig)[source]#

Bases: GPTPretrainedModel

GPT Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

Parameters:
  • gpt (GPTModel) – An instance of GPTModel.

  • num_labels (int, optional) – The number of classes. Defaults to 2.

  • dropout (float, optional) – The dropout probability for output of GPT. If None, use the same value as hidden_dropout_prob of GPTModel instance gpt. Defaults to None.

forward(input_ids=None, position_ids=None, attention_mask=None, inputs_embeds=None, labels=None, output_attentions=False, output_hidden_states=False, return_dict=False)[source]#

The GPTForTokenClassification forward method, overrides the __call__() special method.

Parameters:
  • input_ids (Tensor, optional) – See GPTModel.

  • position_ids (Tensor, optional) – See GPTModel.

  • attention_mask (Tensor, optional) – See GPTModel.

  • inputs_embeds (Tensor, optional) – See GPTModel.

  • labels (Tensor, optional) – Labels of shape (batch_size, sequence_length) for computing the token classification loss. Indices should be in [0, ..., num_labels - 1]. If num_labels == 1, a regression loss (mean-square loss) is computed; if num_labels > 1, a classification loss (cross-entropy) is computed. Defaults to None.

  • output_attentions (bool, optional) – See GPTModel.

  • output_hidden_states (bool, optional) – See GPTModel.

  • return_dict (bool, optional) – See GPTModel.

Returns:

An instance of TokenClassifierOutput if return_dict=True. Otherwise it returns a tuple of tensors corresponding to ordered and not None (depending on the input arguments) fields of TokenClassifierOutput.

Especially, when return_dict=output_attentions=output_hidden_states=False, returns tensor logits, a tensor of the input token classification logits. Shape as [batch_size, sequence_length, num_labels] and dtype as float32.

Example

import paddle
from paddlenlp.transformers import GPTForTokenClassification, GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
model = GPTForTokenClassification.from_pretrained('gpt2-medium-en')

inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!", return_token_type_ids=False)
inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()}
logits = model(**inputs)
class GPTForSequenceClassification(config: GPTConfig)[source]#

Bases: GPTPretrainedModel

GPT Model with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks.

Parameters:
  • gpt (GPTModel) – An instance of GPTModel.

  • num_labels (int, optional) – The number of classes. Defaults to 2.

forward(input_ids=None, position_ids=None, attention_mask=None, inputs_embeds=None, labels=None, use_cache=False, output_attentions=False, output_hidden_states=False, return_dict=False)[source]#

The GPTForSequenceClassification forward method, overrides the __call__() special method.

Parameters:
  • input_ids (Tensor, optional) – See GPTModel.

  • position_ids (Tensor, optional) – See GPTModel.

  • attention_mask (Tensor, optional) – See GPTModel.

  • inputs_embeds (Tensor, optional) – See GPTModel.

  • labels (Tensor, optional) – Labels of shape (batch_size, sequence_length) for computing the sequence classification/regression loss. Indices should be in [0, ..., num_labels - 1]. If num_labels == 1, a regression loss (mean-square loss) is computed; if num_labels > 1, a classification loss (cross-entropy) is computed. Defaults to None.

  • use_cache (bool, optional) – See GPTModel.

  • output_attentions (bool, optional) – See GPTModel.

  • output_hidden_states (bool, optional) – See GPTModel.

  • return_dict (bool, optional) – See GPTModel.

Returns:

An instance of SequenceClassifierOutputWithPast if return_dict=True. Otherwise it returns a tuple of tensors corresponding to ordered and not None (depending on the input arguments) fields of SequenceClassifierOutputWithPast.

Especially, when return_dict=output_attentions=output_hidden_states=False, returns tensor logits, a tensor of the input text classification logits. Shape as [batch_size, num_labels] and dtype as float32.

Example

import paddle
from paddlenlp.transformers import GPTForSequenceClassification, GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
model = GPTForSequenceClassification.from_pretrained('gpt2-medium-en')

inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!", return_token_type_ids=False)
inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()}
logits = model(**inputs)
class GPTForCausalLM(config: GPTConfig)[source]#

Bases: GPTPretrainedModel

The GPT Model with a language modeling head on top.

Parameters:

gpt (GPTModel) – An instance of GPTModel.

get_output_embeddings()[source]#

To be overwritten by models with output embeddings.

Returns:

The output embedding of the model.

Return type:

Optional[Embedding]

forward(input_ids=None, position_ids=None, attention_mask=None, inputs_embeds=None, use_cache=False, past_key_values=None, labels=None, output_attentions=False, output_hidden_states=False, return_dict=False)[source]#
Parameters:
  • input_ids (Tensor, optional) – See GPTModel.

  • position_ids (Tensor, optional) – See GPTModel.

  • attention_mask (Tensor, optional) – See GPTModel.

  • inputs_embeds (Tensor, optional) – See GPTModel.

  • use_cache (bool, optional) – See GPTModel.

  • past_key_values (Tensor, optional) – See GPTModel.

  • labels (paddle.Tensor, optional) – A Tensor of shape (batch_size, sequence_length). Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set labels = input_ids. Indices are selected in [-100, 0, ..., vocab_size]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, ..., vocab_size]. Defaults to None.

  • output_attentions (bool, optional) – See GPTModel.

  • output_hidden_states (bool, optional) – See GPTModel.

  • return_dict (bool, optional) – See GPTModel.

Returns:

An instance of BaseModelOutputWithPastAndCrossAttentions if return_dict=True. Otherwise it returns a tuple of tensors corresponding to ordered and not None (depending on the input arguments) fields of BaseModelOutputWithPastAndCrossAttentions.

Especially, when return_dict=use_cache=output_attentions=output_hidden_states=False, returns a tensor logits which is the output of the GPT model.
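
A usage sketch in the same style as the other examples on this page; with all output flags left at their False defaults the call returns the logits tensor, and the greedy next-token pick at the end is purely illustrative:

import paddle
from paddlenlp.transformers import GPTForCausalLM, GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
model = GPTForCausalLM.from_pretrained('gpt2-medium-en')

inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!", return_token_type_ids=False)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}

logits = model(**inputs)                      # [batch_size, sequence_length, vocab_size]
next_token_id = paddle.argmax(logits[0, -1, :]).item()
print(tokenizer.convert_ids_to_tokens(next_token_id))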

class GPTEmbeddings(config)[source]#

Bases: Layer

Includes word embeddings and position embeddings.

forward(input_ids, position_ids=None, inputs_embeddings=None)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

class GPTDecoderLayer(config: GPTConfig)[source]#

Bases: Layer

The transformer decoder layer.

It contains multi-head attention and some linear layers.

forward(hidden_states, attention_mask=None, use_cache=False, past_key_value=None, output_attentions=False)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

class GPTLayerNorm(config, normalized_shape, epsilon=1e-05, weight_attr=None, bias_attr=None, name=None)[source]#

Bases: LayerNorm

forward(input)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments