modeling#
- class GPTModel(config: GPTConfig)[source]#
Bases:
GPTPretrainedModel
The bare GPT Model transformer outputting raw hidden-states.
This model inherits from PretrainedModel. Refer to the superclass documentation for the generic methods. This model is also a Paddle paddle.nn.Layer subclass. Use it as a regular Paddle Layer and refer to the Paddle documentation for all matters related to general usage and behavior.
- Parameters:
vocab_size (int) – Vocabulary size of inputs_ids in GPTModel. Also the vocabulary size of the token embedding matrix. Defines the number of different tokens that can be represented by the inputs_ids passed when calling GPTModel.
hidden_size (int, optional) – Dimensionality of the embedding layer and decoder layer. Defaults to 768.
num_hidden_layers (int, optional) – Number of hidden layers in the Transformer decoder. Defaults to 12.
num_attention_heads (int, optional) – Number of attention heads for each attention layer in the Transformer decoder. Defaults to 12.
intermediate_size (int, optional) – Dimensionality of the feed-forward (ff) layer in the decoder. Input tensors to ff layers are first projected from hidden_size to intermediate_size, and then projected back to hidden_size. Typically intermediate_size is larger than hidden_size. Defaults to 3072.
hidden_act (str, optional) – The non-linear activation function in the feed-forward layer. "gelu", "relu" and any other Paddle-supported activation functions are supported. Defaults to "gelu".
hidden_dropout_prob (float, optional) – The dropout probability for all fully connected layers in the embeddings and decoder. Defaults to 0.1.
attention_probs_dropout_prob (float, optional) – The dropout probability used in MultiHeadAttention in all decoder layers to drop some attention targets. Defaults to 0.1.
max_position_embeddings (int, optional) – The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input sequence. Defaults to 512.
type_vocab_size (int, optional) – The vocabulary size of the token_type_ids. Defaults to 16.
Note: Please do not use type_vocab_size, as it will be obsolete in the future.
initializer_range (float, optional) – The standard deviation of the normal initializer. Defaults to 0.02.
Note: A normal_initializer initializes weight matrices as normal distributions. See GPTPretrainedModel._init_weights() for how weights are initialized in GPTModel.
pad_token_id (int, optional) – The index of the padding token in the token vocabulary. Defaults to 0.
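For reference, a minimal construction sketch (not part of the original docs), assuming GPTConfig accepts the parameters above as keyword arguments, as is typical for PaddleNLP configuration classes:

import paddle
from paddlenlp.transformers import GPTConfig, GPTModel

# Hypothetical small configuration; field names assumed to mirror the
# parameter list documented above.
config = GPTConfig(
    vocab_size=50304,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    initializer_range=0.02,
    pad_token_id=0,
)
model = GPTModel(config)  # randomly initialized weights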
- get_input_embeddings()[source]#
Gets the input embedding of the model.
- Returns:
The input embedding of the model.
- Return type:
nn.Embedding
- set_input_embeddings(value)[source]#
Sets a new input embedding for the model.
- Parameters:
value (Embedding) – The new input embedding of the model.
- Raises:
NotImplementedError – If the model has not implemented the set_input_embeddings method.
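A short usage sketch of these two accessors, assuming the gpt2-medium-en checkpoint used in the examples below:

import paddle.nn as nn
from paddlenlp.transformers import GPTModel

model = GPTModel.from_pretrained('gpt2-medium-en')
old_emb = model.get_input_embeddings()      # an nn.Embedding
print(old_emb.weight.shape)                 # [vocab_size, hidden_size]

# Swap in a new (e.g. resized) embedding table with the same hidden size.
new_emb = nn.Embedding(old_emb.weight.shape[0] + 8, old_emb.weight.shape[1])
model.set_input_embeddings(new_emb)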
- forward(input_ids=None, position_ids=None, attention_mask=None, inputs_embeds=None, use_cache=False, past_key_values=None, output_attentions=False, output_hidden_states=False, return_dict=False)[source]#
The GPTModel forward method, overrides the __call__() special method.
- Parameters:
input_ids (Tensor, optional) – Indices of input sequence tokens in the vocabulary. They are numerical representations of tokens that build the input sequence. Its data type should be int64 and it has a shape of [batch_size, sequence_length]. Defaults to None.
position_ids (Tensor, optional) – Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, max_position_embeddings - 1]. Shape as (batch_size, num_tokens) and dtype as int64. Defaults to None.
attention_mask (Tensor, optional) – Mask used in self-attention to avoid performing attention to some unwanted positions, usually the subsequent positions. It is a tensor with shape broadcast to [batch_size, num_attention_heads, sequence_length, sequence_length]. For example, its shape can be [batch_size, sequence_length], [batch_size, sequence_length, sequence_length] or [batch_size, num_attention_heads, sequence_length, sequence_length]. Its data type should be int64. The masked tokens have 0 values and the unmasked tokens have 1 values. Defaults to None, which means nothing needs to be prevented from being attended to.
inputs_embeds (Tensor, optional) – Optionally, instead of passing input_ids you can choose to directly pass an embedded representation of shape (batch_size, sequence_length, hidden_size). This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix. Defaults to None.
use_cache (bool, optional) – Whether or not to use cache. Defaults to False. If set to True, key value states will be returned and can be used to speed up decoding.
past_key_values (list, optional) – It is only used for inference and should be None for training. Defaults to None.
output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail. Defaults to False.
output_hidden_states (bool, optional) – Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail. Defaults to False.
return_dict (bool, optional) – Whether to return a BaseModelOutputWithPastAndCrossAttentions object. If False, the output will be a tuple of tensors. Defaults to False.
- Returns:
An instance of BaseModelOutputWithPastAndCrossAttentions if return_dict=True. Otherwise it returns a tuple of tensors corresponding to the ordered and not-None (depending on the input arguments) fields of BaseModelOutputWithPastAndCrossAttentions.
Especially, when return_dict=output_hidden_states=output_attentions=False, returns a tensor outputs which is the output at the last layer of the model. Its data type should be float32 and it has a shape of [batch_size, sequence_length, hidden_size].
Example
import paddle
from paddlenlp.transformers import GPTModel, GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
model = GPTModel.from_pretrained('gpt2-medium-en')

inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!", return_token_type_ids=False)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
- class GPTPretrainedModel(*args, **kwargs)[source]#
Bases:
PretrainedModel
An abstract class for pretrained GPT models. It provides GPT-related model_config_file, resource_files_names, pretrained_resource_files_map, pretrained_init_configuration and base_model_prefix for downloading and loading pretrained models. See PretrainedModel for more details.
- config_class#
alias of
GPTConfig
- class GPTPretrainingCriterion(config)[source]#
Bases:
Layer
Criterion for GPT. It calculates the final loss.
- forward(prediction_scores, masked_lm_labels, loss_mask=None)[source]#
- Parameters:
prediction_scores (Tensor) – The logits of masked token prediction. Its data type should be float32 and its shape is [batch_size, sequence_length, vocab_size].
masked_lm_labels (Tensor) – The labels of the masked language modeling; the dimensionality of masked_lm_labels is equal to prediction_scores. Its data type should be int64 and its shape is [batch_size, sequence_length, 1].
loss_mask (Tensor) – Mask used for calculating the loss of the masked language modeling to avoid calculating some unwanted tokens. Its data type should be float32 and its shape is [batch_size, sequence_length, 1].
- Returns:
The pretraining loss. Its data type should be float32 and its shape is [1].
- Return type:
Tensor
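A minimal sketch of computing the pretraining loss on random tensors; the shapes follow the parameter descriptions above, while the import path and vocabulary size are assumptions:

import paddle
from paddlenlp.transformers import GPTConfig
from paddlenlp.transformers.gpt.modeling import GPTPretrainingCriterion  # import path assumed

batch_size, seq_len, vocab_size = 2, 16, 50304   # arbitrary sizes for illustration
config = GPTConfig(vocab_size=vocab_size)

criterion = GPTPretrainingCriterion(config)
prediction_scores = paddle.randn([batch_size, seq_len, vocab_size])
masked_lm_labels = paddle.randint(0, vocab_size, [batch_size, seq_len, 1])
loss_mask = paddle.ones([batch_size, seq_len, 1], dtype='float32')

loss = criterion(prediction_scores, masked_lm_labels, loss_mask)  # scalar float32 loss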
- class GPTForGreedyGeneration(config: GPTConfig, max_predict_len: int = 32)[source]#
Bases:
GPTPretrainedModel
The generation model for GPT-2. It uses the greedy strategy and generates the output sequence with the highest probability.
- Parameters:
gpt (GPTModel) – An instance of paddlenlp.transformers.GPTModel.
max_predict_len (int) – The max length of the prediction.
- model(input_ids, position_ids=None, attention_mask=None, masked_positions=None, use_cache=False, past_key_values=None)[source]#
- Parameters:
- Returns:
Returns tensor logits or tuple (logits, cached_kvs). If use_cache is True, the tuple (logits, cached_kvs) will be returned. Otherwise, tensor logits will be returned. logits is the output of the gpt model. cached_kvs is the cache output of the gpt model if use_cache is True.
- Return type:
Tensor or tuple
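A hedged usage sketch of the model(...) method documented above, reusing the checkpoint name from the other examples on this page:

import paddle
from paddlenlp.transformers import GPTForGreedyGeneration, GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
model = GPTForGreedyGeneration.from_pretrained('gpt2-medium-en')
model.eval()

inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!", return_token_type_ids=False)
input_ids = paddle.to_tensor([inputs["input_ids"]])

# The documented model(...) method returns logits, plus cached_kvs when use_cache=True.
logits, cached_kvs = model.model(input_ids, use_cache=True)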
- GPTLMHeadModel#
alias of
GPTForCausalLM
- class GPTForTokenClassification(config: GPTConfig)[source]#
Bases:
GPTPretrainedModel
GPT Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.
- Parameters:
- forward(input_ids=None, position_ids=None, attention_mask=None, inputs_embeds=None, labels=None, output_attentions=False, output_hidden_states=False, return_dict=False)[source]#
The GPTForTokenClassification forward method, overrides the __call__() special method.
- Parameters:
input_ids (Tensor, optional) – See GPTModel.
position_ids (Tensor, optional) – See GPTModel.
attention_mask (list, optional) – See GPTModel.
inputs_embeds (Tensor, optional) – See GPTModel.
labels (Tensor, optional) – Labels of shape (batch_size, sequence_length) for computing the sequence classification/regression loss. Indices should be in [0, ..., num_labels - 1]. If num_labels == 1 a regression loss is computed (Mean-Square loss); if num_labels > 1 a classification loss is computed (Cross-Entropy). Defaults to None.
output_attentions (bool, optional) – See GPTModel.
output_hidden_states (bool, optional) – See GPTModel.
return_dict (bool, optional) – See GPTModel.
- Returns:
An instance of TokenClassifierOutput if return_dict=True. Otherwise it returns a tuple of tensors corresponding to the ordered and not-None (depending on the input arguments) fields of TokenClassifierOutput.
Especially, when return_dict=output_attentions=output_hidden_states=False, returns tensor logits, a tensor of the input token classification logits. Shape as [batch_size, sequence_length, num_labels] and dtype as float32.
Example
import paddle
from paddlenlp.transformers import GPTForTokenClassification, GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
model = GPTForTokenClassification.from_pretrained('gpt2-medium-en')

inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!", return_token_type_ids=False)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
logits = model(**inputs)
- class GPTForSequenceClassification(config: GPTConfig)[source]#
Bases:
GPTPretrainedModel
GPT Model with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks.
- Parameters:
gpt (GPTModel) – An instance of GPTModel.
num_labels (int, optional) – The number of classes. Defaults to 2.
- forward(input_ids=None, position_ids=None, attention_mask=None, inputs_embeds=None, labels=None, use_cache=False, output_attentions=False, output_hidden_states=False, return_dict=False)[source]#
The GPTForSequenceClassification forward method, overrides the __call__() special method.
- Parameters:
input_ids (Tensor, optional) – See GPTModel.
position_ids (Tensor, optional) – See GPTModel.
attention_mask (list, optional) – See GPTModel.
inputs_embeds (Tensor, optional) – See GPTModel.
labels (Tensor, optional) – Labels of shape (batch_size, sequence_length) for computing the sequence classification/regression loss. Indices should be in [0, ..., num_labels - 1]. If num_labels == 1 a regression loss is computed (Mean-Square loss); if num_labels > 1 a classification loss is computed (Cross-Entropy). Defaults to None.
use_cache (bool, optional) – See GPTModel.
output_attentions (bool, optional) – See GPTModel.
output_hidden_states (bool, optional) – See GPTModel.
return_dict (bool, optional) – See GPTModel.
- Returns:
An instance of SequenceClassifierOutputWithPast if return_dict=True. Otherwise it returns a tuple of tensors corresponding to the ordered and not-None (depending on the input arguments) fields of SequenceClassifierOutputWithPast.
Especially, when return_dict=output_attentions=output_hidden_states=False, returns tensor logits, a tensor of the input text classification logits. Shape as [batch_size, num_labels] and dtype as float32.
Example
import paddle
from paddlenlp.transformers import GPTForSequenceClassification, GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
model = GPTForSequenceClassification.from_pretrained('gpt2-medium-en')

inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!", return_token_type_ids=False)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
logits = model(**inputs)
- class GPTForCausalLM(config: GPTConfig)[source]#
Bases:
GPTPretrainedModel
The GPT Model with a language modeling head on top.
- get_output_embeddings()[source]#
To be overridden by models with output embeddings.
- Returns:
The output embedding of the model.
- Return type:
Optional[Embedding]
- forward(input_ids=None, position_ids=None, attention_mask=None, inputs_embeds=None, use_cache=False, past_key_values=None, labels=None, output_attentions=False, output_hidden_states=False, return_dict=False)[source]#
- Parameters:
input_ids (Tensor, optional) – See GPTModel.
position_ids (Tensor, optional) – See GPTModel.
attention_mask (Tensor, optional) – See GPTModel.
inputs_embeds (Tensor, optional) – See GPTModel.
use_cache (bool, optional) – See GPTModel.
past_key_values (Tensor, optional) – See GPTModel.
labels (paddle.Tensor, optional) – A Tensor of shape (batch_size, sequence_length). Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set labels = input_ids. Indices are selected in [-100, 0, ..., vocab_size]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, ..., vocab_size]. Defaults to None.
output_attentions (bool, optional) – See GPTModel.
output_hidden_states (bool, optional) – See GPTModel.
return_dict (bool, optional) – See GPTModel.
- Returns:
An instance of BaseModelOutputWithPastAndCrossAttentions if return_dict=True. Otherwise it returns a tuple of tensors corresponding to the ordered and not-None (depending on the input arguments) fields of BaseModelOutputWithPastAndCrossAttentions.
Especially, when return_dict=use_cache=output_attentions=output_hidden_states=False, returns a tensor logits which is the output of the gpt model.
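This class has no example of its own; the following is a minimal sketch in the style of the examples above (the checkpoint name and expected logits shape are assumptions carried over from those examples):

import paddle
from paddlenlp.transformers import GPTForCausalLM, GPTTokenizer

tokenizer = GPTTokenizer.from_pretrained('gpt2-medium-en')
model = GPTForCausalLM.from_pretrained('gpt2-medium-en')

inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!", return_token_type_ids=False)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}

# With all optional flags left at their False defaults, the call returns the logits tensor,
# typically of shape [batch_size, sequence_length, vocab_size].
logits = model(**inputs)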
- class GPTEmbeddings(config)[source]#
Bases:
Layer
Includes word embeddings and position embeddings.
- class GPTDecoderLayer(config: GPTConfig)[source]#
Bases:
Layer
The transformer decoder layer.
It contains multi-head attention and some linear layers.
- forward(hidden_states, attention_mask=None, use_cache=False, past_key_value=None, output_attentions=False)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
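A minimal sketch of running a single decoder layer on random hidden states; it assumes the layer can be imported from the gpt modeling module and that GPTConfig exposes a hidden_size field:

import paddle
from paddlenlp.transformers import GPTConfig
from paddlenlp.transformers.gpt.modeling import GPTDecoderLayer  # import path assumed

config = GPTConfig()                                       # default GPT configuration
layer = GPTDecoderLayer(config)

# [batch_size, seq_len, hidden_size]; with attention_mask left None no mask is applied here.
hidden_states = paddle.randn([1, 8, config.hidden_size])
output = layer(hidden_states)                              # same shape as the input hidden states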