model_outputs

tuple_output(outputs: Tuple[paddle.Tensor], loss: Optional[paddle.Tensor] = None)
Reconstruct the outputs through a single helper method that contains only simple logic.
 Parameters
outputs (Tuple[Tensor]) – the source of the outputs
loss (Optional[Tensor], optional) – the loss of the model. Defaults to None.
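
The signature above suggests a small helper that folds an optional loss into the tuple of outputs. The sketch below is one plausible reading, not PaddleNLP's source: the name tuple_output_sketch, the choice to put the loss first, and the unwrapping of a single-element tuple are all assumptions made for illustration, and plain Python objects stand in for paddle.Tensor.

```python
from typing import Any, Optional, Tuple


def tuple_output_sketch(outputs: Tuple[Any, ...], loss: Optional[Any] = None):
    """Hypothetical stand-in for tuple_output; the behaviour is assumed."""
    if loss is not None:
        # Assumption: the loss, when present, is placed first.
        return (loss,) + tuple(outputs)
    if len(outputs) == 1:
        # Assumption: a lone output is returned unwrapped.
        return outputs[0]
    return tuple(outputs)


with_loss = tuple_output_sketch(("logits", "hidden_states"), loss=0.25)
single = tuple_output_sketch(("logits",))
```

Under these assumptions, callers that pass labels can always read the loss from position 0.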

class ModelOutput
Base class for all model outputs as dataclass. Has a __getitem__ that allows indexing by integer or slice (like a tuple) or by string (like a dictionary), ignoring the attributes that are None. Otherwise behaves like a regular Python dictionary.

Warning: You can’t unpack a ModelOutput directly. Use the to_tuple method to convert it to a tuple first.
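
The indexing rules described above can be illustrated with a small stand-in class. This is a sketch only: OutputSketch and its helper _set_items are hypothetical names, the real ModelOutput is not reproduced here, and plain lists and tuples stand in for tensors.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional, Tuple


@dataclass
class OutputSketch:
    """Hypothetical stand-in for a ModelOutput subclass (not the real class)."""

    loss: Optional[Any] = None
    logits: Optional[Any] = None
    hidden_states: Optional[Any] = None

    def _set_items(self) -> Dict[str, Any]:
        # Keep only the attributes that are not None, in field order.
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }

    def to_tuple(self) -> Tuple[Any, ...]:
        return tuple(self._set_items().values())

    def __getitem__(self, key):
        if isinstance(key, str):
            # Dictionary-style lookup by field name.
            return self._set_items()[key]
        # Tuple-style lookup (integer or slice) over the non-None attributes.
        return self.to_tuple()[key]


out = OutputSketch(logits=[0.1, 0.9], hidden_states=("h0", "h1"))
first = out[0]            # loss is None, so index 0 refers to logits
by_name = out["logits"]
logits, hidden = out.to_tuple()
```

Because None attributes are skipped, integer positions shift depending on which fields are set, which is why string keys are the safer choice when optional fields are involved.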

setdefault(*args, **kwargs)
Insert key with a value of default if key is not in the dictionary. Return the value for key if key is in the dictionary, else default.

pop(k[, d]) → v
Remove the specified key and return the corresponding value. If key is not found, d is returned if given, otherwise KeyError is raised.


class BaseModelOutput(last_hidden_state: Optional[paddle.Tensor] = None, hidden_states: Optional[Tuple[paddle.Tensor]] = None, attentions: Optional[Tuple[paddle.Tensor]] = None)
Base class for model’s outputs, with potential hidden states and attentions.
 Parameters
last_hidden_state (paddle.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
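
To make the tuple lengths and shapes above concrete, the following sketch computes them from a handful of sizes. The function expected_shapes and its arguments are hypothetical; it only manipulates shape tuples and does not call PaddleNLP.

```python
from typing import Dict


def expected_shapes(
    batch_size: int,
    sequence_length: int,
    hidden_size: int,
    num_layers: int,
    num_heads: int,
    has_embedding_layer: bool = True,
) -> Dict[str, object]:
    """Shapes described for BaseModelOutput, expressed as plain tuples."""
    hidden = (batch_size, sequence_length, hidden_size)
    attention = (batch_size, num_heads, sequence_length, sequence_length)
    # One entry per layer, plus one for the embeddings when they exist.
    n_hidden = num_layers + (1 if has_embedding_layer else 0)
    return {
        "last_hidden_state": hidden,
        "hidden_states": [hidden] * n_hidden,
        "attentions": [attention] * num_layers,
    }


shapes = expected_shapes(2, 8, 16, num_layers=4, num_heads=2)
```

With four layers and an embedding layer, hidden_states holds five entries while attentions holds four.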

class BaseModelOutputWithNoAttention(last_hidden_state: Optional[paddle.Tensor] = None, hidden_states: Optional[Tuple[paddle.Tensor]] = None)
Base class for model’s outputs, with potential hidden states.
 Parameters
last_hidden_state (paddle.Tensor of shape (batch_size, num_channels, height, width)) – Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, num_channels, height, width). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

class BaseModelOutputWithPooling(last_hidden_state: Optional[paddle.Tensor] = None, pooler_output: Optional[paddle.Tensor] = None, hidden_states: Optional[Tuple[paddle.Tensor]] = None, attentions: Optional[Tuple[paddle.Tensor]] = None)
Base class for model’s outputs that also contains a pooling of the last hidden states.
 Parameters
last_hidden_state (paddle.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
pooler_output (paddle.Tensor of shape (batch_size, hidden_size)) – Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. For example, for the BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
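
The BERT-style pooling mentioned for pooler_output (first token, then a linear layer, then tanh) can be written out in a few lines. This is a pure-Python sketch with nested lists in place of tensors; pool_first_token, weight and bias are hypothetical names, and the identity weight matrix is chosen only to keep the numbers readable.

```python
import math
from typing import List


def pool_first_token(
    last_hidden_state: List[List[List[float]]],
    weight: List[List[float]],
    bias: List[float],
) -> List[List[float]]:
    """last_hidden_state: [batch][seq][hidden]; weight: [hidden_out][hidden_in]."""
    pooled = []
    for sequence in last_hidden_state:
        first = sequence[0]  # hidden state of the first (classification) token
        row = []
        for w_row, b in zip(weight, bias):
            # Linear layer followed by tanh.
            row.append(math.tanh(sum(w * x for w, x in zip(w_row, first)) + b))
        pooled.append(row)
    return pooled


# batch_size=1, sequence_length=2, hidden_size=2
hidden = [[[1.0, 0.0], [5.0, 5.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
pooled = pool_first_token(hidden, identity, [0.0, 0.0])
```

The result has shape (batch_size, hidden_size): the second token never contributes, which matches the description of pooler_output as a function of the first token only.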

class BaseModelOutputWithPastAndCrossAttentions(last_hidden_state: Optional[paddle.Tensor] = None, past_key_values: Optional[Tuple[Tuple[paddle.Tensor]]] = None, hidden_states: Optional[Tuple[paddle.Tensor]] = None, attentions: Optional[Tuple[paddle.Tensor]] = None, cross_attentions: Optional[Tuple[paddle.Tensor]] = None)
Base class for model’s outputs that may also contain past key/values (to speed up sequential decoding).
 Parameters
last_hidden_state (paddle.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences, of shape (batch_size, 1, hidden_size), is output.
past_key_values (tuple(tuple(paddle.Tensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(paddle.Tensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and, optionally if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and, optionally if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
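
The interaction between past_key_values and last_hidden_state during sequential decoding is easier to see as shapes. The sketch below assumes that each step feeds exactly one new token and that the cache therefore grows by one position per step; decode_step_shapes and its arguments are hypothetical and no PaddleNLP call is made.

```python
from typing import Dict


def decode_step_shapes(
    batch_size: int,
    num_heads: int,
    embed_size_per_head: int,
    hidden_size: int,
    n_layers: int,
    cached_length: int,
) -> Dict[str, object]:
    """Shapes seen on one decoding step that feeds a single new token."""
    # Assumption: the cache grows by one position per step.
    new_length = cached_length + 1
    key_or_value = (batch_size, num_heads, new_length, embed_size_per_head)
    return {
        # Only the newest position is returned when a cache is used.
        "last_hidden_state": (batch_size, 1, hidden_size),
        # One (key, value) pair per layer for the self-attention blocks.
        "past_key_values": [(key_or_value, key_or_value)] * n_layers,
    }


step = decode_step_shapes(2, 4, 8, 32, n_layers=3, cached_length=5)
```

The per-step output stays at length 1 while the cached keys and values carry the full history, which is where the speed-up comes from.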

class BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state: Optional[paddle.Tensor] = None, pooler_output: Optional[paddle.Tensor] = None, past_key_values: Optional[Tuple[Tuple[paddle.Tensor]]] = None, hidden_states: Optional[Tuple[paddle.Tensor]] = None, attentions: Optional[Tuple[paddle.Tensor]] = None, cross_attentions: Optional[Tuple[paddle.Tensor]] = None)
Base class for model’s outputs that also contains a pooling of the last hidden states.
 Parameters
last_hidden_state (paddle.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
pooler_output (paddle.Tensor of shape (batch_size, hidden_size)) – Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. For example, for the BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
past_key_values (tuple(tuple(paddle.Tensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(paddle.Tensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and, optionally if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and, optionally if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

class SequenceClassifierOutput(loss: Optional[paddle.Tensor] = None, logits: Optional[paddle.Tensor] = None, hidden_states: Optional[Tuple[paddle.Tensor]] = None, attentions: Optional[Tuple[paddle.Tensor]] = None)
Base class for outputs of sentence classification models.
 Parameters
loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) – Classification (or regression if config.num_labels==1) loss.
logits (paddle.Tensor of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
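
Since logits are scores before SoftMax, turning them into class probabilities is left to the caller. A minimal pure-Python sketch, with nested lists standing in for a (batch_size, config.num_labels) tensor and softmax as a local helper rather than a library call:

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# batch_size=2, num_labels=2
logits = [[2.0, 0.0], [0.0, 0.0]]
probs = [softmax(row) for row in logits]
predicted = [row.index(max(row)) for row in probs]
```

Each row of probs sums to 1, and the predicted label is the index of the largest entry (ties resolve to the first index here).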

class TokenClassifierOutput(loss: Optional[paddle.Tensor] = None, logits: Optional[paddle.Tensor] = None, hidden_states: Optional[Tuple[paddle.Tensor]] = None, attentions: Optional[Tuple[paddle.Tensor]] = None)
Base class for outputs of token classification models.
 Parameters
loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) – Classification loss.
logits (paddle.Tensor of shape (batch_size, sequence_length, config.num_labels)) – Classification scores (before SoftMax).
hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
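
For token classification the logits carry one score vector per token, so a label id per token is obtained by taking the argmax along the last axis. A small sketch with nested lists in place of tensors; predict_token_labels is a hypothetical helper:

```python
from typing import List


def predict_token_labels(logits: List[List[List[float]]]) -> List[List[int]]:
    """logits: [batch][seq][num_labels] -> label ids: [batch][seq]."""
    return [
        [scores.index(max(scores)) for scores in sequence]
        for sequence in logits
    ]


# batch_size=1, sequence_length=2, num_labels=3
logits = [[[0.1, 2.0, -1.0], [3.0, 0.0, 0.5]]]
label_ids = predict_token_labels(logits)
```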

class QuestionAnsweringModelOutput(loss: Optional[paddle.Tensor] = None, start_logits: Optional[paddle.Tensor] = None, end_logits: Optional[paddle.Tensor] = None, hidden_states: Optional[Tuple[paddle.Tensor]] = None, attentions: Optional[Tuple[paddle.Tensor]] = None)
Base class for outputs of question answering models.
 Parameters
loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) – Total span extraction loss, the sum of the cross-entropy losses for the start and end positions.
start_logits (paddle.Tensor of shape (batch_size, sequence_length)) – Span-start scores (before SoftMax).
end_logits (paddle.Tensor of shape (batch_size, sequence_length)) – Span-end scores (before SoftMax).
hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
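
One common way to turn start_logits and end_logits into an answer span is to pick the pair with the highest combined score subject to start <= end. The sketch below works on one sequence of plain floats; best_span and the max_answer_length cap are hypothetical and not part of the documented API.

```python
from typing import List, Tuple


def best_span(
    start_logits: List[float],
    end_logits: List[float],
    max_answer_length: int = 30,
) -> Tuple[int, int]:
    """Return (start, end) maximising start score + end score with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_length, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


start_logits = [0.1, 3.0, 0.2, 0.0]
end_logits = [2.5, 0.0, 0.3, 2.0]
span = best_span(start_logits, end_logits)
```

Taking the two argmaxes independently would give start 1 and end 0 for these scores, which is not a valid span; the joint search avoids that.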

class MultipleChoiceModelOutput(loss: Optional[paddle.Tensor] = None, logits: Optional[paddle.Tensor] = None, hidden_states: Optional[Tuple[paddle.Tensor]] = None, attentions: Optional[Tuple[paddle.Tensor]] = None)
Base class for outputs of multiple choice models.
 Parameters
loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) – Classification loss.
logits (paddle.Tensor of shape (batch_size, num_choices)) – Classification scores (before SoftMax). num_choices is the second dimension of the input tensors (see input_ids above).
hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
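
The (batch_size, num_choices) layout of logits can be pictured as one score per candidate, regrouped by example. The sketch assumes the per-candidate scores arrive as a flat list ordered example by example, which is an illustrative assumption rather than something stated above; regroup_choice_scores is a hypothetical helper.

```python
from typing import List


def regroup_choice_scores(flat_scores: List[float], num_choices: int) -> List[List[float]]:
    """Reshape a flat list of per-candidate scores into [batch][num_choices]."""
    if len(flat_scores) % num_choices != 0:
        raise ValueError("number of scores must be a multiple of num_choices")
    return [
        flat_scores[i:i + num_choices]
        for i in range(0, len(flat_scores), num_choices)
    ]


# batch_size=2, num_choices=3
flat = [0.2, 1.5, -0.3, 0.9, 0.1, 0.4]
logits = regroup_choice_scores(flat, num_choices=3)
chosen = [row.index(max(row)) for row in logits]
```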

class MaskedLMOutput(loss: Optional[paddle.Tensor] = None, logits: Optional[paddle.Tensor] = None, hidden_states: Optional[Tuple[paddle.Tensor]] = None, attentions: Optional[Tuple[paddle.Tensor]] = None)
Base class for masked language model outputs.
 Parameters
loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) – Masked language modeling (MLM) loss.
logits (paddle.Tensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
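
With masked language modeling, only the positions holding the mask token are usually of interest. The sketch below reads the highest-scoring vocabulary id at each masked position of a single sequence; predict_masked_tokens and the value of MASK_ID are hypothetical, and lists stand in for tensors.

```python
from typing import Dict, List


def predict_masked_tokens(
    logits: List[List[float]],
    input_ids: List[int],
    mask_token_id: int,
) -> Dict[int, int]:
    """Return {position: predicted token id} for the masked positions of one sequence."""
    predictions = {}
    for position, token_id in enumerate(input_ids):
        if token_id == mask_token_id:
            scores = logits[position]  # one score per vocabulary entry
            predictions[position] = scores.index(max(scores))
    return predictions


MASK_ID = 3  # hypothetical mask token id for a 4-token vocabulary
input_ids = [0, MASK_ID, 2]
logits = [
    [9.0, 0.0, 0.0, 0.0],
    [0.1, 0.2, 4.0, 0.3],
    [0.0, 0.0, 9.0, 0.0],
]
predicted = predict_masked_tokens(logits, input_ids, MASK_ID)
```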

class CausalLMOutputWithCrossAttentions(loss: Optional[paddle.Tensor] = None, logits: Optional[paddle.Tensor] = None, past_key_values: Optional[Tuple[Tuple[paddle.Tensor]]] = None, hidden_states: Optional[Tuple[paddle.Tensor]] = None, attentions: Optional[Tuple[paddle.Tensor]] = None, cross_attentions: Optional[Tuple[paddle.Tensor]] = None)
Base class for causal language model (or autoregressive) outputs.
 Parameters
loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) – Language modeling loss (for next-token prediction).
logits (paddle.Tensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Cross-attention weights after the attention softmax, used to compute the weighted average in the cross-attention heads.
past_key_values (tuple(tuple(paddle.Tensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of paddle.Tensor tuples of length config.n_layers, with each tuple containing the cached key and value states of the self-attention layers and, if the model is used in an encoder-decoder setting, of the cross-attention layers. Only relevant if config.is_decoder = True. Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
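
A next-token prediction loss compares the scores at position t with the token at position t + 1. The sketch below shows that one-position shift for a single sequence using plain lists; the mean reduction and the name next_token_loss are assumptions, and whether a given model applies the shift internally is not stated above.

```python
import math
from typing import List


def next_token_loss(logits: List[List[float]], labels: List[int]) -> float:
    """logits: [seq][vocab]; labels: [seq]. Mean cross-entropy with a one-position shift."""
    shifted_logits = logits[:-1]  # predictions made at positions 0..n-2
    shifted_labels = labels[1:]   # the tokens those positions should predict
    total = 0.0
    for scores, target in zip(shifted_logits, shifted_labels):
        m = max(scores)
        # log of the softmax normaliser, computed stably
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[target]
    return total / len(shifted_labels)


# sequence_length=3, vocab_size=2, all scores equal
logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
labels = [0, 1, 0]
loss = next_token_loss(logits, labels)
```

With uniform scores over two tokens, every position contributes log 2, so the mean loss is log 2 as well.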

class Seq2SeqModelOutput(last_hidden_state: Optional[paddle.Tensor] = None, past_key_values: Optional[Tuple[Tuple[paddle.Tensor]]] = None, decoder_hidden_states: Optional[Tuple[paddle.Tensor]] = None, decoder_attentions: Optional[Tuple[paddle.Tensor]] = None, cross_attentions: Optional[Tuple[paddle.Tensor]] = None, encoder_last_hidden_state: Optional[paddle.Tensor] = None, encoder_hidden_states: Optional[Tuple[paddle.Tensor]] = None, encoder_attentions: Optional[Tuple[paddle.Tensor]] = None)
Base class for model encoder’s outputs that also contain pre-computed hidden states that can speed up sequential decoding.
 Parameters
last_hidden_state (paddle.Tensor) – Sequence of hidden-states at the output of the last layer of the decoder of the model, whose shape is (batch_size, sequence_length, hidden_size). If past_key_values is used, only the last hidden-state of the sequences, of shape (batch_size, 1, hidden_size), is output.
past_key_values (tuple(tuple(paddle.Tensor)), optional) – Tuple of tuple(paddle.Tensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Returned when use_cache=True is passed or when config.use_cache=True. Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(paddle.Tensor), optional) – Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Returned when output_hidden_states=True is passed or when config.output_hidden_states=True. Hidden-states of the decoder at the output of each layer plus the optional initial embedding outputs.
decoder_attentions (tuple(paddle.Tensor), optional) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed or when config.output_attentions=True. Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(paddle.Tensor), optional) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed or when config.output_attentions=True. Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (paddle.Tensor, optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model, whose shape is (batch_size, sequence_length, hidden_size).
encoder_hidden_states (tuple(paddle.Tensor), optional) – Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Returned when output_hidden_states=True is passed or when config.output_hidden_states=True. Hidden-states of the encoder at the output of each layer plus the optional initial embedding outputs.
encoder_attentions (tuple(paddle.Tensor), optional) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed or when config.output_attentions=True. Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
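
For an encoder-decoder model, each layer's entry in past_key_values holds four tensors: two sized by the decoder sequence and two sized by the encoder sequence. The sketch below lays those shapes out as tuples. The function seq2seq_cache_shapes is hypothetical, and the ordering inside each per-layer tuple (self-attention key and value first, then cross-attention key and value) is an assumption.

```python
from typing import Tuple

Shape = Tuple[int, int, int, int]


def seq2seq_cache_shapes(
    batch_size: int,
    num_heads: int,
    embed_size_per_head: int,
    n_layers: int,
    decoder_length: int,
    encoder_length: int,
) -> Tuple[Tuple[Shape, Shape, Shape, Shape], ...]:
    """Per-layer cache shapes for a sequence-to-sequence model."""
    self_attn = (batch_size, num_heads, decoder_length, embed_size_per_head)
    cross_attn = (batch_size, num_heads, encoder_length, embed_size_per_head)
    # Assumed order: self-attention key, value, then cross-attention key, value.
    per_layer = (self_attn, self_attn, cross_attn, cross_attn)
    return tuple(per_layer for _ in range(n_layers))


cache = seq2seq_cache_shapes(
    2, 4, 8, n_layers=6, decoder_length=3, encoder_length=10
)
```

Only the decoder-sized entries change as decoding proceeds; the encoder-sized entries depend on the input sequence alone and can be reused across steps.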

class Seq2SeqLMOutput(loss: Optional[paddle.Tensor] = None, logits: Optional[paddle.Tensor] = None, past_key_values: Optional[Tuple[Tuple[paddle.Tensor]]] = None, decoder_hidden_states: Optional[Tuple[paddle.Tensor]] = None, decoder_attentions: Optional[Tuple[paddle.Tensor]] = None, cross_attentions: Optional[Tuple[paddle.Tensor]] = None, encoder_last_hidden_state: Optional[paddle.Tensor] = None, encoder_hidden_states: Optional[Tuple[paddle.Tensor]] = None, encoder_attentions: Optional[Tuple[paddle.Tensor]] = None)
Base class for sequence-to-sequence language model outputs.
 Parameters
loss (paddle.Tensor, optional) – Language modeling loss, whose shape is (1,). Returned when labels is provided.
logits (paddle.Tensor) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax), whose shape is (batch_size, sequence_length, config.vocab_size).
past_key_values (tuple(tuple(paddle.Tensor)), optional) – Tuple of tuple(paddle.Tensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Returned when use_cache=True is passed or when config.use_cache=True. Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(paddle.Tensor), optional) – Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Returned when output_hidden_states=True is passed or when config.output_hidden_states=True. Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(paddle.Tensor), optional) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed or when config.output_attentions=True. Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(paddle.Tensor), optional) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed or when config.output_attentions=True. Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (paddle.Tensor, optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model, whose shape is (batch_size, sequence_length, hidden_size).
encoder_hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(paddle.Tensor), optional) – Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed or when config.output_attentions=True. Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.