model_outputs#

tuple_output(outputs: Tuple[Tensor], loss: Tensor | None = None)[source]#

Reconstruct the outputs with a single method that encapsulates the simple shared packing logic.

Parameters:
  • outputs (Tuple[Tensor]) -- the source of the outputs

  • loss (Optional[Tensor], optional) -- the loss of the model. Defaults to None.
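
A minimal usage sketch (assuming this module is importable as paddlenlp.transformers.model_outputs; the packing behavior noted in the comments is how such helpers conventionally work, not a guarantee):

    import paddle
    from paddlenlp.transformers.model_outputs import tuple_output

    logits = paddle.randn([2, 3])      # e.g. classification scores
    hiddens = paddle.randn([2, 5, 8])  # e.g. last hidden states
    loss = paddle.to_tensor(0.7)

    outputs = tuple_output((logits, hiddens))                       # without a loss
    outputs_with_loss = tuple_output((logits, hiddens), loss=loss)  # with a loss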

convert_encoder_output(encoder_output)[source]#

Convert encoder_output from a tuple to a BaseModelOutput.

Parameters:

encoder_output (tuple or ModelOutput) -- The output of the encoder, a tuple consisting of last_hidden_state, hidden_states (optional), and attentions (optional). The data type of last_hidden_state is float32 and its shape is [batch_size, sequence_length, hidden_size].
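
A hedged sketch of the conversion (import path assumed to be paddlenlp.transformers.model_outputs):

    import paddle
    from paddlenlp.transformers.model_outputs import convert_encoder_output

    last_hidden_state = paddle.randn([2, 5, 8])  # [batch_size, sequence_length, hidden_size]
    encoder_output = convert_encoder_output((last_hidden_state,))

    # The tuple entries are now addressable as named attributes.
    print(encoder_output.last_hidden_state.shape)  # [2, 5, 8]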

class ModelOutput[source]#

Base class for all model outputs as dataclass. Has a __getitem__ that allows indexing by integer or slice (like a tuple) or by string (like a dictionary), ignoring attributes that are None. Otherwise behaves like a regular Python dictionary.

Warning: You can't unpack a ModelOutput directly. Use the to_tuple() method to convert it to a tuple first.

setdefault(*args, **kwargs)[source]#

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

pop(key[, default]) → v, remove specified key and return the corresponding value.[source]#

If the key is not found, return the default if given; otherwise, raise a KeyError.

update([E, ]**F) → None. Update D from dict/iterable E and F.[source]#

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k].
If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v.
In either case, this is followed by: for k in F: D[k] = F[k].

to_tuple() → Tuple[Any][source]#

Convert self to a tuple containing all the attributes/keys that are not None.
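
A short sketch of the indexing and unpacking rules described above, using BaseModelOutput (documented below) as a concrete subclass; the assertion mirrors the stated behavior rather than guaranteeing the implementation:

    import paddle
    from paddlenlp.transformers.model_outputs import BaseModelOutput

    output = BaseModelOutput(last_hidden_state=paddle.randn([2, 5, 8]))

    # String and integer indexing both skip attributes that are None.
    assert output["last_hidden_state"] is output[0]

    # Direct tuple unpacking is not supported; convert explicitly first.
    (last_hidden_state,) = output.to_tuple()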

class BaseModelOutput(last_hidden_state: Tensor | None = None, hidden_states: Tuple[Tensor] | None = None, attentions: Tuple[Tensor] | None = None)[source]#

Base class for model's outputs, with potential hidden states and attentions.

Parameters:
  • last_hidden_state (paddle.Tensor of shape (batch_size, sequence_length, hidden_size)) -- Sequence of hidden-states at the output of the last layer of the model.

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

class BaseModelOutputWithNoAttention(last_hidden_state: Tensor | None = None, hidden_states: Tuple[Tensor] | None = None)[source]#

Base class for model's outputs, with potential hidden states.

Parameters:
  • last_hidden_state (paddle.Tensor of shape (batch_size, num_channels, height, width)) -- Sequence of hidden-states at the output of the last layer of the model.

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, num_channels, height, width).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

class BaseModelOutputWithPooling(last_hidden_state: Tensor | None = None, pooler_output: Tensor | None = None, hidden_states: Tuple[Tensor] | None = None, attentions: Tuple[Tensor] | None = None)[source]#

Base class for model's outputs that also contains a pooling of the last hidden states.

Parameters:
  • last_hidden_state (paddle.Tensor of shape (batch_size, sequence_length, hidden_size)) -- Sequence of hidden-states at the output of the last layer of the model.

  • pooler_output (paddle.Tensor of shape (batch_size, hidden_size)) -- Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

class BaseModelOutputWithPast(last_hidden_state: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor]] | None = None, hidden_states: Tuple[Tensor] | None = None, attentions: Tuple[Tensor] | None = None)[source]#

Base class for model's outputs that may also contain a past key/values (to speed up sequential decoding).

Parameters:
  • last_hidden_state (paddle.Tensor of shape (batch_size, sequence_length, hidden_size)) --

    Sequence of hidden-states at the output of the last layer of the model.

    If past_key_values is used, only the last hidden-state of the sequences, of shape (batch_size, 1, hidden_size), is output.

  • past_key_values (tuple(tuple(paddle.Tensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) --

    Tuple of tuple(paddle.Tensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and, optionally if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if config.is_encoder_decoder=True in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
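
A constructed example of the cache layout (shapes follow the parameter docs above; the layer count and sizes are illustrative):

    import paddle
    from paddlenlp.transformers.model_outputs import BaseModelOutputWithPast

    batch_size, num_heads, seq_len, head_dim = 2, 4, 6, 16
    layer_cache = (
        paddle.randn([batch_size, num_heads, seq_len, head_dim]),  # keys
        paddle.randn([batch_size, num_heads, seq_len, head_dim]),  # values
    )

    output = BaseModelOutputWithPast(
        # With a cache in use, only the last position's hidden state is returned.
        last_hidden_state=paddle.randn([batch_size, 1, 64]),
        past_key_values=(layer_cache,),  # one (key, value) tuple per layer
    )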

class BaseModelOutputWithPastAndCrossAttentions(last_hidden_state: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor]] | None = None, hidden_states: Tuple[Tensor] | None = None, attentions: Tuple[Tensor] | None = None, cross_attentions: Tuple[Tensor] | None = None)[source]#

Base class for model's outputs that may also contain a past key/values (to speed up sequential decoding).

Parameters:
  • last_hidden_state (paddle.Tensor of shape (batch_size, sequence_length, hidden_size)) --

    Sequence of hidden-states at the output of the last layer of the model.

    If past_key_values is used, only the last hidden-state of the sequences, of shape (batch_size, 1, hidden_size), is output.

  • past_key_values (tuple(tuple(paddle.Tensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) --

    Tuple of tuple(paddle.Tensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and, optionally if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if config.is_encoder_decoder=True in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

class BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state: Tensor | None = None, pooler_output: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor]] | None = None, hidden_states: Tuple[Tensor] | None = None, attentions: Tuple[Tensor] | None = None, cross_attentions: Tuple[Tensor] | None = None)[source]#

Base class for model's outputs that also contains a pooling of the last hidden states.

Parameters:
  • last_hidden_state (paddle.Tensor of shape (batch_size, sequence_length, hidden_size)) -- Sequence of hidden-states at the output of the last layer of the model.

  • pooler_output (paddle.Tensor of shape (batch_size, hidden_size)) -- Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • past_key_values (tuple(tuple(paddle.Tensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) --

    Tuple of tuple(paddle.Tensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and, optionally if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if config.is_encoder_decoder=True in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

class SequenceClassifierOutput(loss: Tensor | None = None, logits: Tensor | None = None, hidden_states: Tuple[Tensor] | None = None, attentions: Tuple[Tensor] | None = None)[source]#

Base class for outputs of sentence classification models.

Parameters:
  • loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) -- Classification (or regression if config.num_labels==1) loss.

  • logits (paddle.Tensor of shape (batch_size, config.num_labels)) -- Classification (or regression if config.num_labels==1) scores (before SoftMax).

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
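
A sketch of constructing and consuming this output by hand (a model's forward would normally build it for you; the shapes follow the docs above):

    import paddle
    import paddle.nn.functional as F
    from paddlenlp.transformers.model_outputs import SequenceClassifierOutput

    logits = paddle.randn([4, 3])            # (batch_size, num_labels)
    labels = paddle.to_tensor([0, 2, 1, 1])

    output = SequenceClassifierOutput(loss=F.cross_entropy(logits, labels), logits=logits)
    predictions = output.logits.argmax(axis=-1)  # class index per example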

class TokenClassifierOutput(loss: Tensor | None = None, logits: Tensor | None = None, hidden_states: Tuple[Tensor] | None = None, attentions: Tuple[Tensor] | None = None)[source]#

Base class for outputs of token classification models.

Parameters:
  • loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) -- Classification loss.

  • logits (paddle.Tensor of shape (batch_size, sequence_length, config.num_labels)) -- Classification scores (before SoftMax).

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

class QuestionAnsweringModelOutput(loss: Tensor | None = None, start_logits: Tensor | None = None, end_logits: Tensor | None = None, hidden_states: Tuple[Tensor] | None = None, attentions: Tuple[Tensor] | None = None)[source]#

Base class for outputs of question answering models.

Parameters:
  • loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) -- Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.

  • start_logits (paddle.Tensor of shape (batch_size, sequence_length)) -- Span-start scores (before SoftMax).

  • end_logits (paddle.Tensor of shape (batch_size, sequence_length)) -- Span-end scores (before SoftMax).

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
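
A minimal sketch of decoding a span from these fields (a naive independent argmax; real pipelines score joint start/end candidates):

    import paddle
    from paddlenlp.transformers.model_outputs import QuestionAnsweringModelOutput

    output = QuestionAnsweringModelOutput(
        start_logits=paddle.randn([1, 10]),  # (batch_size, sequence_length)
        end_logits=paddle.randn([1, 10]),
    )

    start = int(output.start_logits.argmax(axis=-1)[0])
    end = int(output.end_logits.argmax(axis=-1)[0])
    answer_span = (min(start, end), max(start, end))  # clamp to a valid order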

class MultipleChoiceModelOutput(loss: Tensor | None = None, logits: Tensor | None = None, hidden_states: Tuple[Tensor] | None = None, attentions: Tuple[Tensor] | None = None)[source]#

Base class for outputs of multiple choice models.

Parameters:
  • loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) -- Classification loss.

  • logits (paddle.Tensor of shape (batch_size, num_choices)) --

    num_choices is the second dimension of the input tensors (see input_ids above).

    Classification scores (before SoftMax).

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

class MaskedLMOutput(loss: Tensor | None = None, logits: Tensor | None = None, hidden_states: Tuple[Tensor] | None = None, attentions: Tuple[Tensor] | None = None)[source]#

Base class for masked language model outputs.

Parameters:
  • loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) -- Masked language modeling (MLM) loss.

  • logits (paddle.Tensor of shape (batch_size, sequence_length, config.vocab_size)) -- Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

class CausalLMOutputWithPast(loss: Tensor | None = None, logits: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor]] | None = None, hidden_states: Tuple[Tensor] | None = None, attentions: Tuple[Tensor] | None = None)[source]#

Base class for causal language model (or autoregressive) outputs.

Parameters:
  • loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) -- Language modeling loss (for next-token prediction).

  • logits (paddle.Tensor of shape (batch_size, sequence_length, config.vocab_size)) -- Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • past_key_values (tuple(tuple(paddle.Tensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) --

    Tuple of paddle.Tensor tuples of length config.n_layers, with each tuple containing the cached key and value states of the self-attention layers and, if the model is used in an encoder-decoder setting, of the cross-attention layers. Only relevant if config.is_decoder = True.

    Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
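
A sketch of greedy next-token selection from these fields (the output is built by hand here; a model's forward would normally produce it):

    import paddle
    from paddlenlp.transformers.model_outputs import CausalLMOutputWithPast

    vocab_size = 100
    output = CausalLMOutputWithPast(
        logits=paddle.randn([1, 7, vocab_size])  # (batch_size, sequence_length, vocab_size)
    )

    # Only the final position's logits matter for predicting the next token.
    next_token = output.logits[:, -1, :].argmax(axis=-1)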

class CausalLMOutputWithCrossAttentions(loss: Tensor | None = None, logits: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor]] | None = None, hidden_states: Tuple[Tensor] | None = None, attentions: Tuple[Tensor] | None = None, cross_attentions: Tuple[Tensor] | None = None)[source]#

Base class for causal language model (or autoregressive) outputs.

Parameters:
  • loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) -- Language modeling loss (for next-token prediction).

  • logits (paddle.Tensor of shape (batch_size, sequence_length, config.vocab_size)) -- Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Cross-attention weights after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • past_key_values (tuple(tuple(paddle.Tensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) --

    Tuple of paddle.Tensor tuples of length config.n_layers, with each tuple containing the cached key and value states of the self-attention layers and, if the model is used in an encoder-decoder setting, of the cross-attention layers. Only relevant if config.is_decoder = True.

    Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

class Seq2SeqModelOutput(last_hidden_state: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor]] | None = None, decoder_hidden_states: Tuple[Tensor] | None = None, decoder_attentions: Tuple[Tensor] | None = None, cross_attentions: Tuple[Tensor] | None = None, encoder_last_hidden_state: Tensor | None = None, encoder_hidden_states: Tuple[Tensor] | None = None, encoder_attentions: Tuple[Tensor] | None = None)[source]#

Base class for model encoder's outputs that also contains pre-computed hidden states that can speed up sequential decoding.

Parameters:
  • last_hidden_state (paddle.Tensor) --

    Sequence of hidden-states at the output of the last layer of the decoder of the model, whose shape is (batch_size, sequence_length, hidden_size).

    If past_key_values is used, only the last hidden-state of the sequences, of shape (batch_size, 1, hidden_size), is output.

  • past_key_values (tuple(tuple(paddle.Tensor)), optional) --

    Tuple of tuple(paddle.Tensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Returned when use_cache=True is passed or when config.use_cache=True.

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(paddle.Tensor), optional) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Returned when output_hidden_states=True is passed or when config.output_hidden_states=True.

    Hidden-states of the decoder at the output of each layer plus the optional initial embedding outputs.

  • decoder_attentions (tuple(paddle.Tensor), optional) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed or when config.output_attentions=True.

    Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(paddle.Tensor), optional) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed or when config.output_attentions=True.

    Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (paddle.Tensor, optional) -- Sequence of hidden-states at the output of the last layer of the encoder of the model whose shape is (batch_size, sequence_length, hidden_size).

  • encoder_hidden_states (tuple(paddle.Tensor), optional) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Returned when output_hidden_states=True is passed or when config.output_hidden_states=True.

    Hidden-states of the encoder at the output of each layer plus the optional initial embedding outputs.

  • encoder_attentions (tuple(paddle.Tensor), optional) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed or when config.output_attentions=True.

    Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
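
A small sketch of how to_tuple() interacts with the optional fields (only the non-None attributes survive, in declaration order):

    import paddle
    from paddlenlp.transformers.model_outputs import Seq2SeqModelOutput

    output = Seq2SeqModelOutput(
        last_hidden_state=paddle.randn([2, 4, 8]),          # decoder states
        encoder_last_hidden_state=paddle.randn([2, 9, 8]),  # encoder states
    )

    decoder_states, encoder_states = output.to_tuple()  # the None fields are skipped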

class Seq2SeqLMOutput(loss: Tensor | None = None, logits: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor]] | None = None, decoder_hidden_states: Tuple[Tensor] | None = None, decoder_attentions: Tuple[Tensor] | None = None, cross_attentions: Tuple[Tensor] | None = None, encoder_last_hidden_state: Tensor | None = None, encoder_hidden_states: Tuple[Tensor] | None = None, encoder_attentions: Tuple[Tensor] | None = None)[source]#

Base class for sequence-to-sequence language model outputs.

Parameters:
  • loss (paddle.Tensor, optional) -- Language modeling loss whose shape is (1,). Returned when labels is provided.

  • logits (paddle.Tensor) -- Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax) whose shape is (batch_size, sequence_length, config.vocab_size).

  • past_key_values (tuple(tuple(paddle.Tensor)), optional) --

    Tuple of tuple(paddle.Tensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Returned when use_cache=True is passed or when config.use_cache=True.

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(paddle.Tensor), optional) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Returned when output_hidden_states=True is passed or when config.output_hidden_states=True.

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(paddle.Tensor), optional) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed or when config.output_attentions=True.

    Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(paddle.Tensor), optional) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed or when config.output_attentions=True.

    Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (paddle.Tensor, optional) -- Sequence of hidden-states at the output of the last layer of the encoder of the model whose shape is (batch_size, sequence_length, hidden_size).

  • encoder_hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(paddle.Tensor), optional) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed or when config.output_attentions=True.

    Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

class Seq2SeqQuestionAnsweringModelOutput(loss: Tensor | None = None, start_logits: Tensor | None = None, end_logits: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor]] | None = None, decoder_hidden_states: Tuple[Tensor] | None = None, decoder_attentions: Tuple[Tensor] | None = None, cross_attentions: Tuple[Tensor] | None = None, encoder_last_hidden_state: Tensor | None = None, encoder_hidden_states: Tuple[Tensor] | None = None, encoder_attentions: Tuple[Tensor] | None = None)[source]#

Base class for outputs of sequence-to-sequence question answering models.

Parameters:
  • loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) -- Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.

  • start_logits (paddle.Tensor) -- Span-start scores (before SoftMax). Tensor of shape (batch_size, sequence_length).

  • end_logits (paddle.Tensor) -- Span-end scores (before SoftMax). Tensor of shape (batch_size, sequence_length).

  • past_key_values (tuple(tuple(paddle.Tensor)), optional) -- Tuple of tuple(paddle.Tensor) of length n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Returned when use_cache=True is passed. Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(paddle.Tensor), optional) -- Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Returned when output_hidden_states=True is passed. Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(paddle.Tensor), optional) -- Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed. Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(paddle.Tensor), optional) -- Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed. Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (paddle.Tensor, optional) -- Sequence of hidden-states at the output of the last layer of the encoder of the model. Tensor of shape (batch_size, sequence_length, hidden_size).

  • encoder_hidden_states (tuple(paddle.Tensor), optional) -- Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Returned when output_hidden_states=True is passed. Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(paddle.Tensor), optional) -- Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed. Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

class Seq2SeqSequenceClassifierOutput(loss: Tensor | None = None, logits: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor]] | None = None, decoder_hidden_states: Tuple[Tensor] | None = None, decoder_attentions: Tuple[Tensor] | None = None, cross_attentions: Tuple[Tensor] | None = None, encoder_last_hidden_state: Tensor | None = None, encoder_hidden_states: Tuple[Tensor] | None = None, encoder_attentions: Tuple[Tensor] | None = None)[source]#

Base class for outputs of sequence-to-sequence sentence classification models.

Parameters:
  • loss (paddle.Tensor, optional) -- Classification (or regression if config.num_labels==1) loss of shape (1,). Returned when labels is provided.

  • logits (paddle.Tensor) -- Classification (or regression if config.num_labels==1) scores (before SoftMax) of shape (batch_size, config.num_labels).

  • past_key_values (tuple(tuple(paddle.Tensor)), optional) -- Tuple of tuple(paddle.Tensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Returned when use_cache=True is passed. Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
  • decoder_hidden_states (tuple(paddle.Tensor), optional) -- Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Returned when output_hidden_states=True is passed. Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(paddle.Tensor), optional) -- Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed. Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(paddle.Tensor), optional) -- Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed. Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (paddle.Tensor, optional) -- Sequence of hidden-states at the output of the last layer of the encoder of the model. Tensor of shape (batch_size, sequence_length, hidden_size).

  • encoder_hidden_states (tuple(paddle.Tensor), optional) -- Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Returned when output_hidden_states=True is passed. Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(paddle.Tensor), optional) -- Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed. Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

class SequenceClassifierOutputWithPast(loss: Tensor | None = None, logits: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor]] | None = None, hidden_states: Tuple[Tensor] | None = None, attentions: Tuple[Tensor] | None = None)[source]#

Base class for outputs of sentence classification models.

Parameters:
  • loss (paddle.Tensor, optional) -- Classification (or regression if config.num_labels==1) loss whose shape is (1,). Returned when labels is provided.

  • logits (paddle.Tensor) -- Classification (or regression if config.num_labels==1) scores (before SoftMax) whose shape is (batch_size, num_labels).

  • past_key_values (tuple(tuple(paddle.Tensor)), optional) -- Tuple of tuple(paddle.Tensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Returned when use_cache=True is passed or when config.use_cache=True. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(paddle.Tensor), optional) -- Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Returned when output_hidden_states=True is passed or when config.output_hidden_states=True. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(paddle.Tensor), optional) -- Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Returned when output_attentions=True is passed or when config.output_attentions=True. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

class BackboneOutput(feature_maps: Tuple[Tensor] | None = None, hidden_states: Tuple[Tensor] | None = None, attentions: Tuple[Tensor] | None = None)[source]#

Base class for outputs of backbones.

Parameters:
  • feature_maps (tuple(paddle.Tensor) of shape (batch_size, num_channels, height, width)) -- Feature maps of the stages.

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size) or (batch_size, num_channels, height, width), depending on the backbone.

    Hidden-states of the model at the output of each stage plus the initial embedding outputs.

  • attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Only applicable if the backbone uses attention.

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

class BaseModelOutputWithPoolingAndNoAttention(last_hidden_state: Tensor | None = None, pooler_output: Tensor | None = None, hidden_states: Tuple[Tensor] | None = None)[source]#

Base class for model's outputs that also contains a pooling of the last hidden states.

Parameters:
  • last_hidden_state (paddle.Tensor of shape (batch_size, num_channels, height, width)) -- Sequence of hidden-states at the output of the last layer of the model.

  • pooler_output (paddle.Tensor of shape (batch_size, hidden_size)) -- Last layer hidden-state after a pooling operation on the spatial dimensions.

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, num_channels, height, width).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

class ImageClassifierOutputWithNoAttention(loss: Tensor | None = None, logits: Tensor | None = None, hidden_states: Tuple[Tensor] | None = None)[source]#

Base class for outputs of image classification models.

Parameters:
  • loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) -- Classification (or regression if config.num_labels==1) loss.

  • logits (paddle.Tensor of shape (batch_size, config.num_labels)) -- Classification (or regression if config.num_labels==1) scores (before SoftMax).

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) -- Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the model at the output of each stage.

class DepthEstimatorOutput(loss: Tensor | None = None, predicted_depth: Tensor | None = None, hidden_states: Tuple[Tensor] | None = None, attentions: Tuple[Tensor] | None = None)[source]#

Base class for outputs of depth estimation models.

Parameters:
  • loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) -- Classification (or regression if config.num_labels==1) loss.

  • predicted_depth (paddle.Tensor of shape (batch_size, height, width)) -- Predicted depth for each pixel.

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, num_channels, height, width).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

class SemanticSegmenterOutput(loss: Tensor | None = None, logits: Tensor | None = None, hidden_states: Tuple[Tensor] | None = None, attentions: Tuple[Tensor] | None = None)[source]#

Base class for outputs of semantic segmentation models.

Parameters:
  • loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) -- Classification (or regression if config.num_labels==1) loss.

  • logits (paddle.Tensor) -- Classification scores for each pixel.

    Warning: The logits returned do not necessarily have the same size as the pixel_values passed as inputs. This is to avoid doing two interpolations and losing quality when a user needs to resize the logits to the original image size as post-processing. You should always check your logits shape and resize as needed.
  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) -- Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, patch_size, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) -- Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

class Seq2SeqSpectrogramOutput(loss: Tensor | None = None, spectrogram: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor]] | None = None, decoder_hidden_states: Tuple[Tensor] | None = None, decoder_attentions: Tuple[Tensor] | None = None, cross_attentions: Tuple[Tensor] | None = None, encoder_last_hidden_state: Tensor | None = None, encoder_hidden_states: Tuple[Tensor] | None = None, encoder_attentions: Tuple[Tensor] | None = None)[source]#

Base class for sequence-to-sequence spectrogram outputs.

Parameters:
  • loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) -- Spectrogram generation loss.

  • spectrogram (paddle.Tensor of shape (batch_size, sequence_length, num_bins)) -- The predicted spectrogram.

  • past_key_values (tuple(tuple(paddle.Tensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) --

    Tuple of tuple(paddle.Tensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (paddle.Tensor of shape (batch_size, sequence_length, hidden_size), optional) -- Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

class MoEModelOutputWithPast(last_hidden_state: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor]] | None = None, hidden_states: Tuple[Tensor] | None = None, attentions: Tuple[Tensor] | None = None, router_logits: Tuple[Tensor] | None = None)[source]#

Base class for model's outputs, with potential hidden states and attentions.

Parameters:
  • last_hidden_state (paddle.Tensor of shape (batch_size, sequence_length, hidden_size)) -- Sequence of hidden-states at the output of the last layer of the model.

  • past_key_values (tuple(tuple(paddle.Tensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) --

    Tuple of tuple(paddle.Tensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and, optionally if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if config.is_encoder_decoder=True in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • router_logits (tuple(paddle.Tensor), optional, returned when output_router_probs=True and config.add_router_probs=True is passed or when config.output_router_probs=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, sequence_length, num_experts).

    Raw router logits (post-softmax) computed by MoE routers; these terms are used to compute the auxiliary loss for Mixture of Experts models.
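
For orientation, a hedged sketch of a Switch-Transformer-style load-balancing auxiliary loss computed from router_logits; the exact formula a given model uses may differ:

    import paddle
    import paddle.nn.functional as F

    def load_balancing_loss(router_logits, num_experts):
        # router_logits: per-layer tensors of shape (batch_size, sequence_length, num_experts)
        logits = paddle.concat([l.reshape([-1, num_experts]) for l in router_logits])
        probs = F.softmax(logits, axis=-1)
        # Fraction of tokens hard-routed to each expert ...
        tokens_per_expert = F.one_hot(probs.argmax(axis=-1), num_experts).mean(axis=0)
        # ... and the mean router probability assigned to each expert.
        prob_per_expert = probs.mean(axis=0)
        return num_experts * paddle.sum(tokens_per_expert * prob_per_expert)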

class MoECausalLMOutputWithPast(loss: Tensor | None = None, aux_loss: Tensor | None = None, logits: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor]] | None = None, hidden_states: Tuple[Tensor] | None = None, attentions: Tuple[Tensor] | None = None, router_logits: Tuple[Tensor] | None = None)[source]#

Base class for causal language model (or autoregressive) with mixture of experts outputs.

Parameters:
  • loss (paddle.Tensor of shape (1,), optional, returned when labels is provided) -- Language modeling loss (for next-token prediction).

  • logits (paddle.Tensor of shape (batch_size, sequence_length, config.vocab_size)) -- Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • aux_loss (paddle.Tensor, optional, returned when labels is provided) -- Auxiliary loss for the sparse modules.

  • router_logits (tuple(paddle.Tensor), optional, returned when output_router_probs=True and config.add_router_probs=True is passed or when config.output_router_probs=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, sequence_length, num_experts).

    Raw router logits (post-softmax) computed by MoE routers; these terms are used to compute the auxiliary loss for Mixture of Experts models.

  • past_key_values (tuple(tuple(paddle.Tensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) --

    Tuple of tuple(paddle.Tensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(paddle.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) --

    Tuple of paddle.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(paddle.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) --

    Tuple of paddle.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
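
A hedged sketch of how the two losses are typically combined during training (the coefficient name and value are illustrative, not part of this API):

    import paddle
    from paddlenlp.transformers.model_outputs import MoECausalLMOutputWithPast

    output = MoECausalLMOutputWithPast(
        loss=paddle.to_tensor(2.31),      # language modeling loss
        aux_loss=paddle.to_tensor(0.05),  # load-balancing auxiliary loss
        logits=paddle.randn([1, 7, 100]),
    )

    router_aux_loss_coef = 0.001  # hypothetical weighting
    total_loss = output.loss + router_aux_loss_coef * output.aux_loss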