modeling

class DalleBartModel(text_vocab_size=50264, image_vocab_size=16384, bos_token_id=16384, pad_token_id=16384, eos_token_id=16384, max_text_length=64, max_image_length=256, decoder_start_token_id=16384, d_model=1024, num_encoder_layers=12, num_decoder_layers=12, encoder_attention_heads=16, decoder_attention_heads=16, encoder_ffn_dim=2730, decoder_ffn_dim=2730, dropout=0.0, activation_function='gelu', attention_dropout=0.0, activation_dropout=0.0, use_bias=False, init_std=0.02)[source]

Bases: paddlenlp.transformers.dallebart.modeling.DalleBartPretrainedModel

The bare DalleBart Model outputting raw hidden-states. This model inherits from PretrainedModel. Refer to the superclass documentation for the generic methods. This model is also a Paddle paddle.nn.Layer subclass. Use it as a regular Paddle Layer and refer to the Paddle documentation for all matters related to general usage and behavior.

Parameters
  • text_vocab_size (int) – Vocabulary size of inputs_ids in DalleBartModel. Also the vocab size of the text token embedding matrix. Defines the number of different tokens that can be represented by the inputs_ids passed when calling DalleBartModel.

  • image_vocab_size (int) – Vocabulary size of decoder_inputs_ids in DalleBartModel. Also the vocab size of the image token embedding matrix. Defines the number of different tokens that can be represented by the decoder_inputs_ids passed when calling DalleBartModel.

  • bos_token_id (int, optional) – The beginning-of-image-sequence token id used during pretraining. Defaults to 16384.

  • pad_token_id (int, optional) – The index of padding token in the image token vocabulary. Defaults to 16384.

  • eos_token_id (int, optional) – A special token id representing the end of an image sequence. Defaults to 16384.

  • max_text_length (int, optional) – The maximum value of the dimensionality of text position encoding, which dictates the maximum supported length of the text input sequence. Defaults to 64.

  • max_image_length (int, optional) – The maximum value of the dimensionality of image position encoding, which dictates the maximum supported length of the image input sequence. Defaults to 256.

  • decoder_start_token_id (int, optional) – The token id indicating the start of the decoded image sequence. Defaults to 16384.

  • d_model (int, optional) – Dimensionality of the embedding layer, encoder layer and decoder layer. Defaults to 1024.

  • num_encoder_layers (int, optional) – Number of hidden layers in the DalleBartEncoder. Defaults to 12.

  • num_decoder_layers (int, optional) – Number of hidden layers in the DalleBartDecoder. Defaults to 12.

  • encoder_attention_heads (int, optional) – Number of attention heads for each attention layer in the DalleBartEncoder. Defaults to 16.

  • decoder_attention_heads (int, optional) – Number of attention heads for each attention layer in the DalleBartDecoder. Defaults to 16.

  • encoder_ffn_dim (int, optional) – Dimensionality of the Gated Linear Units (glu) layer in the encoder. Input tensors to glu layers are first projected from d_model to encoder_ffn_dim, and then projected back to d_model. Typically encoder_ffn_dim is larger than d_model. Defaults to 2730.

  • decoder_ffn_dim (int, optional) – Dimensionality of the Gated Linear Units (glu) layer in the decoder. Input tensors to glu layers are first projected from d_model to decoder_ffn_dim, and then projected back to d_model. Typically decoder_ffn_dim is larger than d_model. Defaults to 2730.

  • dropout (float, optional) – The dropout probability used in all fully connected layers (pre-process and post-process of MHA and FFN sub-layer) in the encoders and decoders. Defaults to 0.0.

  • activation_function (str, optional) – The non-linear activation function in the glu layer. "gelu", "relu" and any other paddle supported activation functions are supported. Defaults to "gelu".

  • attention_dropout (float, optional) – The dropout probability used in MultiHeadAttention in all encoder layers and decoder layers to drop some attention targets. Defaults to 0.0.

  • activation_dropout (float, optional) – The dropout probability used after glu activation in all encoder layers and decoder layers. Defaults to 0.0.

  • use_bias (bool, optional) – Whether or not to use bias in all linear layers. Defaults to False.

  • init_std (float, optional) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices. Defaults to 0.02.

forward(input_ids, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, encoder_output=None, use_cache=False, cache=None)[source]

The DalleBartModel forward method, overrides the __call__() special method.

Parameters
  • input_ids (Tensor) – Indices of input sequence tokens in the vocabulary. They are numerical representations of tokens that build the input sequence. Its data type should be int64 and it has a shape of [batch_size, sequence_length].

  • attention_mask (Tensor, optional) – Mask used in multi-head attention to avoid performing attention on some unwanted positions, usually the paddings or the subsequent positions. Its data type can be int, float or bool. When the data type is bool, masked tokens have False values and the others have True values. When the data type is int, masked tokens have 0 values and the others have 1 values. When the data type is float, masked tokens have -INF values and the others have 0 values. It is a tensor whose shape can be broadcast to [batch_size, num_attention_heads, sequence_length, sequence_length]. For example, its shape can be [batch_size, sequence_length], [batch_size, sequence_length, sequence_length], or [batch_size, num_attention_heads, sequence_length, sequence_length]. Defaults to None, which means nothing is prevented from being attended to.

  • decoder_input_ids (Tensor, optional) – Indices of decoder input sequence tokens in the vocabulary. Its data type should be int64 and it has a shape of [batch_size, sequence_length]. Defaults to None, which means no decoder_input_ids is provided; in that case the model creates the tensor by shifting the input_ids to the right.

  • decoder_attention_mask (Tensor, optional) – Mask used in multi-head attention to avoid performing attention on some unwanted positions in decoder_input_ids. Its data type and shape are the same as attention_mask. Defaults to None.

  • encoder_output (tuple, optional) – The output of the encoder, a tuple consisting of last_hidden_state, hidden_states (optional) and attentions (optional). The data type of last_hidden_state is float32 and its shape is [batch_size, sequence_length, hidden_size]. hidden_states contains the hidden states of all layers in the Transformer encoder; its length is num_hidden_layers + 1, and every element has data type float32 and shape [batch_size, sequence_length, hidden_size]. attentions contains the attention weights of all layers in the Transformer encoder; its length is num_hidden_layers, and every element has data type float32 and shape [batch_size, num_attention_heads, sequence_length, sequence_length]. Defaults to None.

  • use_cache (bool, optional) – Whether or not to use cache. Defaults to False. If set to True, key value states will be returned and can be used to speed up decoding.

  • cache (list, optional) – It is a list, and each element in the list is a tuple (incremental_cache, static_cache). See TransformerDecoder.gen_cache for more details. It is only used for inference and should be None for training. Defaults to None.

Returns

Returns tensor decoder_output, which is the output at the last layer of the model. Its data type should be float32 and has a shape of [batch_size, sequence_length, hidden_size].

Return type

Tensor

Example
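
A minimal usage sketch. The 'dalle-mini' checkpoint name and the tokenizer call follow the generation example further below; the explicit decoder start id 16384 is an assumption taken from the constructor signature above (decoder_input_ids may also be left as None).

import paddle
from paddlenlp.transformers import DalleBartModel, DalleBartTokenizer

model = DalleBartModel.from_pretrained('dalle-mini')
tokenizer = DalleBartTokenizer.from_pretrained('dalle-mini')

# Tokenize a text prompt (same preprocessing as the generation example below).
inputs = tokenizer(
    "graphite sketch of Elon Musk",
    return_tensors="pd",
    padding="max_length",
    truncation=True,
    return_attention_mask=True,
    max_length=64,
)

# One decoder step from the decoder start token (assumed id 16384,
# the default decoder_start_token_id in the signature above).
decoder_input_ids = paddle.full([1, 1], 16384, dtype="int64")

decoder_output = model(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    decoder_input_ids=decoder_input_ids,
)
print(decoder_output.shape)  # [1, 1, 1024] -> [batch_size, sequence_length, hidden_size]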

class DalleBartPretrainedModel(*args, **kwargs)[source]

Bases: paddlenlp.transformers.model_utils.PretrainedModel

An abstract class for pretrained DalleBart models. It provides DalleBart related model_config_file, pretrained_init_configuration, resource_files_names, pretrained_resource_files_map, base_model_prefix for downloading and loading pretrained models. See PretrainedModel for more details.

init_weights(layer)[source]

Initialization hook

base_model_class

alias of paddlenlp.transformers.dallebart.modeling.DalleBartModel

class DalleBartEncoder(d_model=1024, nhead=16, dim_feedforward=2730, max_text_length=64, text_vocab_size=50264, text_pad_token_id=1, encoder_layers=12, dropout=0.0, activation='gelu', attn_dropout=None, act_dropout=None, bias_attr=False, init_std=0.02)[source]

Bases: paddlenlp.transformers.dallebart.modeling.DalleBartPretrainedModel

The Encoder of DalleBartModel. For the arguments of DalleBartEncoder, see DalleBartModel.

forward(input_ids, attention_mask=None, **kwargs)[source]

The DalleBartEncoder forward method, overrides the __call__() special method.

Parameters
  • input_ids (Tensor) – See DalleBartModel.

  • attention_mask (Tensor, optional) – See DalleBartModel.

Returns

Returns tensor encoder_output, which is the output at the last layer of the model. Its data type should be float32 and has a shape of [batch_size, sequence_length, hidden_size].

Return type

Tensor
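
Example

A minimal sketch of running the encoder on its own, for instance to reuse its output as encoder_output across several decoding calls. It assumes the encoder of a pretrained DalleBartModel is exposed via get_encoder() (an assumption; accessing the encoder submodule attribute directly would serve the same purpose) and reuses the 'dalle-mini' checkpoint and tokenizer settings from the generation example below.

from paddlenlp.transformers import DalleBartModel, DalleBartTokenizer

model = DalleBartModel.from_pretrained('dalle-mini')
tokenizer = DalleBartTokenizer.from_pretrained('dalle-mini')

inputs = tokenizer(
    "graphite sketch of Elon Musk",
    return_tensors="pd",
    padding="max_length",
    truncation=True,
    return_attention_mask=True,
    max_length=64,
)

# get_encoder() is assumed to return the DalleBartEncoder submodule.
encoder = model.get_encoder()
encoder_output = encoder(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
)
print(encoder_output.shape)  # [1, 64, 1024] -> [batch_size, sequence_length, hidden_size]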

class DalleBartDecoder(d_model=1024, nhead=16, dim_feedforward=2730, image_vocab_size=16384, max_image_length=256, decoder_layers=12, dropout=0.0, activation='gelu', attn_dropout=None, act_dropout=None, bias_attr=False, init_std=0.02)[source]

Bases: paddlenlp.transformers.dallebart.modeling.DalleBartPretrainedModel

The Decoder of DalleBartModel. For the arguments of DalleBartDecoder, see DalleBartModel.

forward(decoder_input_ids=None, decoder_attention_mask=None, encoder_output=None, memory_mask=None, cache=None)[source]

The DalleBartDecoder forward method, overrides the __call__() special method.

Parameters
  • decoder_input_ids (Tensor, optional) – See DalleBartModel.

  • decoder_attention_mask (Tensor, optional) – See DalleBartModel.

  • encoder_output (Tensor, optional) – See DalleBartModel.

  • memory_mask (Tensor, optional) – See DalleBartModel.

  • cache (list, optional) – See DalleBartModel.

Returns

Returns tensor decoder_output, which is the output at the last layer of the model. Its data type should be float32 and has a shape of [batch_size, sequence_length, hidden_size].

Return type

Tensor

class DalleBartForConditionalGeneration(dallebart)[source]

Bases: paddlenlp.transformers.dallebart.modeling.DalleBartPretrainedModel

DalleBart Model with a language modeling head on top.

Parameters
  • dallebart (DalleBartModel) – An instance of DalleBartModel.

forward(input_ids, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, encoder_output=None, use_cache=False, cache=None)[source]

The DalleBartForConditionalGeneration forward method, overrides the __call__() special method.

Parameters
  • input_ids (Tensor) – See DalleBartModel.

  • attention_mask (Tensor, optional) – See DalleBartModel.

  • decoder_input_ids (Tensor, optional) – See DalleBartModel.

  • decoder_attention_mask (Tensor, optional) – See DalleBartModel.

  • encoder_output (Tensor, optional) – See DalleBartModel.

  • use_cache (bool, optional) – See DalleBartModel.

  • cache (list, optional) – See DalleBartModel.

Returns

Returns Tensor lm_logits if use_cache is False; otherwise, returns a tuple (lm_logits, cache).

With the fields:

  • lm_logits (Tensor):

    The prediction scores of the language modeling head. Its data type should be float32 and it has a shape of [batch_size, sequence_length, vocab_size].

Return type

Tensor or tuple

Example
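
A minimal sketch mirroring the base-model example above, showing the language-modeling logits for a single decoder step. The checkpoint name and the decoder start id 16384 are assumptions carried over from that example.

import paddle
from paddlenlp.transformers import DalleBartForConditionalGeneration, DalleBartTokenizer

model = DalleBartForConditionalGeneration.from_pretrained('dalle-mini')
tokenizer = DalleBartTokenizer.from_pretrained('dalle-mini')

inputs = tokenizer(
    "graphite sketch of Elon Musk",
    return_tensors="pd",
    padding="max_length",
    truncation=True,
    return_attention_mask=True,
    max_length=64,
)

# One decoder step from the assumed decoder start token id 16384.
decoder_input_ids = paddle.full([1, 1], 16384, dtype="int64")

lm_logits = model(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    decoder_input_ids=decoder_input_ids,
)
print(lm_logits.shape)  # [batch_size, sequence_length, vocab_size]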

generate(input_ids=None, max_length=256, min_length=256, decode_strategy='sampling', temperature=1.0, top_k=0, top_p=1.0, repetition_penalty=1.0, num_beams=1, num_beam_groups=1, length_penalty=0.0, early_stopping=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, text_pad_token_id=1, decoder_start_token_id=None, forced_bos_token_id=None, forced_eos_token_id=None, num_return_sequences=1, diversity_rate=0.0, use_cache=True, use_faster=False, use_fp16_decoding=False, condition_scale=1.0, **model_kwargs)[source]

The interface for the generation task. This method can generate sequences by using a decoding strategy. Currently, three decoding strategies are supported: “greedy_search”, “sampling” and “beam_search”.

Parameters
  • input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Default to None, in which case it is initialized as a Tensor of shape [1, 1] filled with the value bos_token_id.

  • max_length (int, optional) – The maximum length of the sequence to be generated. Default to 256.

  • min_length (int, optional) – The minimum length of the sequence to be generated. Default to 256.

  • decode_strategy (str, optional) – The decoding strategy in generation. Currently, there are three decoding strategies supported: “greedy_search”, “sampling” and “beam_search”. Default to “sampling”.

  • temperature (float, optional) – The value used to modulate the next token probabilities in the “sampling” strategy. Default to 1.0, which means no effect.

  • top_k (int, optional) – The number of highest probability tokens to keep for top-k-filtering in the “sampling” strategy. Default to 0, which means no effect.

  • top_p (float, optional) – The cumulative probability for top-p-filtering in the “sampling” strategy. The value should satisfy 0 <= top_p < 1. Default to 1.0, which means no effect.

  • repetition_penalty (float, optional) – The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details. Defaults to 1.0.

  • num_beams (int, optional) – The number of beams in the “beam_search” strategy. Default to 1.

  • num_beam_groups (int, optional) – Number of groups to divide num_beams into in order to use DIVERSE BEAM SEARCH. See this paper for more details. Default to 1.

  • length_penalty (float, optional) – The exponential penalty applied to the sequence length in the “beam_search” strategy. The larger this parameter is, the more the model tends to generate shorter sequences. Default to 0.0, which means no penalty.

  • early_stopping (bool, optional) – Whether to stop searching in the “beam_search” strategy when at least num_beams sentences are finished per batch or not. Default to False.

  • bos_token_id (int, optional) – The id of the bos_token. Default to None.

  • eos_token_id (int, optional) – The id of the eos_token. Default to None.

  • pad_token_id (int, optional) – The id of the pad_token. Default to None.

  • decoder_start_token_id (int, optional) – The start token id for encoder-decoder models. Default to None.

  • forced_bos_token_id (int, optional) – The id of the token to force as the first generated token. Usually use for multilingual models. Default to None.

  • forced_eos_token_id (int, optional) – The id of the token to force as the last generated token. Default to None.

  • num_return_sequences (int, optional) – The number of returned sequences for each sequence in the batch. Default to 1.

  • diversity_rate (float, optional) – If num_beam_groups is 1, this is the diversity_rate for Diverse Siblings Search. See this paper (https://arxiv.org/abs/1611.08562) for more details. Otherwise, this is the diversity_rate for DIVERSE BEAM SEARCH. Default to 0.0.

  • use_cache (bool, optional) – Whether to use the model cache to speed up decoding. Default to True.

  • use_faster (bool, optional) – Whether to use the faster entry of the model for FasterGeneration. Default to False.

  • use_fp16_decoding (bool, optional) – Whether to use fp16 for decoding. Only works when the faster entry is available. Default to False.

  • condition_scale (float, optional) – The scale of super conditioning. Default to 1.0.

  • model_kwargs (dict) – It can be used to specify additional kwargs passed to the model.

Returns

It is a tuple containing two elements: ids and scores. Each element is a Tensor.

With the fields:

  • ids (Tensor):

    The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. The data type is the same as the input input_ids.

  • scores (Tensor):

    The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, which is the same as the parameters in the model.

Return type

tuple[Tensor]

Example

import paddle
from paddlenlp.transformers import (
    DalleBartForConditionalGeneration,
    DalleBartTokenizer
)

# Initialize the model and tokenizer
model_name_or_path = 'dalle-mini'
model = DalleBartForConditionalGeneration.from_pretrained(model_name_or_path)
tokenizer = DalleBartTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
prompts = "graphite sketch of Elon Musk"
tokenized_inputs = tokenizer(
    prompts,
    return_tensors="pd",
    padding="max_length",
    truncation=True,
    return_attention_mask=True,
    max_length=64,
)

# Generate 4 sequences by using "sampling" strategy (top_k=64, condition_scale=10.0)
image_token_ids, scores = model.generate(
    input_ids=tokenized_inputs['input_ids'],
    attention_mask=tokenized_inputs['attention_mask'],
    decode_strategy="sampling",
    condition_scale=10.0,
    top_k=64,
    num_return_sequences=4)
print(image_token_ids.shape, scores.shape)
# [4, 256] [4, 1]

class DalleBartForImageGeneration(dallebart)[source]

Bases: paddlenlp.transformers.dallebart.modeling.DalleBartForConditionalGeneration

DalleBart Model with a language modeling head and VQGanDetokenizer on top.

Parameters
  • dallebart (DalleBartModel) – An instance of DalleBartModel.

generate(input_ids, attention_mask=None, top_k=0, top_p=1.0, temperature=1.0, condition_scale=1.0, num_return_sequences=1, **kwargs)[source]

The DalleBartForImageGeneration generate method.

Parameters
  • input_ids (Tensor) – See DalleBartForConditionalGeneration.

  • attention_mask (Tensor, optional) – See DalleBartForConditionalGeneration.

  • top_k (int, optional) – The number of highest probability tokens to keep for top-k-filtering in the “sampling” strategy. Default to 0, which means no effect.

  • top_p (float, optional) – The cumulative probability for top-p-filtering in the “sampling” strategy. The value should satisfy 0 <= top_p < 1. Default to 1.0, which means no effect.

  • temperature (float, optional) – The value used to modulate the next token probabilities in the “sampling” strategy. Default to 1.0, which means no effect.

  • condition_scale (float, optional) – The scale of super conditioning. Default to 1.0.

  • num_return_sequences (int, optional) – The number of returned sequences for each sequence in the batch. Default to 1.

Returns

Returns tensor images, which is the output of VQGanDetokenizer. Its data type should be uint8 and has a shape of [batch_size, num_return_sequences, 256, 256, 3].

Return type

Tensor

Example
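
A minimal sketch adapted from the DalleBartForConditionalGeneration example above. It assumes DalleBartForImageGeneration is importable from paddlenlp.transformers and that Pillow is installed for saving the returned uint8 image tensor; the 'dalle-mini' checkpoint name is likewise an assumption carried over from that example.

from PIL import Image
from paddlenlp.transformers import DalleBartForImageGeneration, DalleBartTokenizer

model = DalleBartForImageGeneration.from_pretrained('dalle-mini')
tokenizer = DalleBartTokenizer.from_pretrained('dalle-mini')

inputs = tokenizer(
    "graphite sketch of Elon Musk",
    return_tensors="pd",
    padding="max_length",
    truncation=True,
    return_attention_mask=True,
    max_length=64,
)

# Sample 4 images with super conditioning (condition_scale=10.0) and top_k=64.
images = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    condition_scale=10.0,
    top_k=64,
    num_return_sequences=4,
)
print(images.shape)  # [1, 4, 256, 256, 3] -> [batch_size, num_return_sequences, 256, 256, 3]

# The output is already decoded by VQGanDetokenizer into uint8 pixels.
Image.fromarray(images[0][0].numpy()).save("dalle_sample.png")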

class VQGanDetokenizer(image_vocab_size=16384, embed_count=256)[source]

Bases: paddle.fluid.dygraph.layers.Layer

forward(z)[source]

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments