modeling
- position_encoding_init(n_position, d_pos_vec, dtype='float32')

  Generates the initial values for the sinusoidal position encoding table. This function follows the implementation in tensor2tensor, which is slightly different from the description in "Attention Is All You Need".
  - Parameters
    - n_position (int) – The largest position for sequences, that is, the maximum length of source or target sequences.
    - d_pos_vec (int) – The size of each positional embedding vector.
    - dtype (str, optional) – The data type of the output numpy.ndarray. Defaults to "float32".
  - Returns
    The embedding table for sinusoidal position encoding, with shape [n_position, d_pos_vec].
  - Return type
    numpy.ndarray
  Example
    from paddlenlp.transformers import position_encoding_init

    max_length = 256
    emb_dim = 512
    pos_table = position_encoding_init(max_length, emb_dim)
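  For intuition, here is a minimal numpy sketch of the tensor2tensor-style construction this function follows: the sine and cosine components fill the two halves of the channel axis instead of alternating as described in the paper. This is for illustration only; the library function above is authoritative.

    import numpy as np

    def sinusoid_table(n_position, d_pos_vec):
        # assumes an even d_pos_vec, as in the example above (512)
        num_timescales = d_pos_vec // 2
        # geometric progression of wavelengths from 1 to 10000
        log_inc = np.log(1e4) / (num_timescales - 1)
        inv_timescales = np.exp(np.arange(num_timescales) * -log_inc)
        scaled_time = np.arange(n_position)[:, None] * inv_timescales[None, :]
        # sin block first, cos block second: [n_position, d_pos_vec]
        return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)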
- class WordEmbedding(vocab_size, emb_dim, bos_id=0)

  Bases: paddle.fluid.dygraph.layers.Layer

  Word embedding layer of the Transformer. This layer automatically constructs a 2D embedding matrix based on the size of the vocabulary (vocab_size) and the size of each embedding vector (emb_dim). It looks up the embedding vector for each id in the input word. After the lookup, the embeddings are multiplied by sqrt(d_model), which is sqrt(emb_dim) in this interface.

  \[Out = embedding(word) * sqrt(emb\_dim)\]

  - Parameters
    - vocab_size (int) – The size of the vocabulary.
    - emb_dim (int) – The dimension of each embedding vector.
    - bos_id (int, optional) – The start token id, which is also used as the padding id. Defaults to 0.
  - forward(word)

    Computes the word embeddings.

    - Parameters
      - word (Tensor) – The input ids indicating the sequences' words, with shape [batch_size, sequence_length] and data type int32 or int64.
    - Returns
      The (scaled) embedding tensor, with shape [batch_size, sequence_length, emb_dim] and data type float32 or float64.
    - Return type
      Tensor
  Example
    import paddle
    from paddlenlp.transformers import WordEmbedding

    word_embedding = WordEmbedding(
        vocab_size=30000,
        emb_dim=512,
        bos_id=0)
    batch_size = 5
    sequence_length = 10
    src_words = paddle.randint(low=3, high=30000, shape=[batch_size, sequence_length])
    src_emb = word_embedding(src_words)
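  The scaling in the formula above can be reproduced with a plain lookup table; a minimal sketch using paddle.nn.Embedding directly (not this class's internals):

    import paddle

    vocab_size, emb_dim = 30000, 512
    table = paddle.nn.Embedding(vocab_size, emb_dim, padding_idx=0)
    ids = paddle.to_tensor([[3, 4, 5]])
    # Out = embedding(word) * sqrt(emb_dim)
    out = table(ids) * emb_dim ** 0.5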
- class PositionalEmbedding(emb_dim, max_length)

  Bases: paddle.fluid.dygraph.layers.Layer

  This layer produces sinusoidal positional embeddings for positions up to max_length. In its forward() method, it looks up the embedding vector for each id in the input pos.

  - Parameters
    - emb_dim (int) – The size of each embedding vector.
    - max_length (int) – The maximum length of sequences.
  - forward(pos)

    Computes the positional embeddings.

    - Parameters
      - pos (Tensor) – The input position ids, with shape [batch_size, sequence_length] and data type int32 or int64.
    - Returns
      The positional embedding tensor, with shape [batch_size, sequence_length, emb_dim] and data type float32 or float64.
    - Return type
      Tensor
  Example
    import paddle
    from paddlenlp.transformers import PositionalEmbedding

    pos_embedding = PositionalEmbedding(
        emb_dim=512,
        max_length=256)
    batch_size = 5
    pos = paddle.tile(paddle.arange(start=0, end=50), repeat_times=[batch_size, 1])
    pos_emb = pos_embedding(pos)
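  Assuming the table is initialized with position_encoding_init above (which the sinusoidal description suggests, though this is an assumption about the default weights), the forward pass is just a table lookup; a small sketch:

    import paddle
    from paddlenlp.transformers import PositionalEmbedding, position_encoding_init

    pos_embedding = PositionalEmbedding(emb_dim=512, max_length=256)
    pos = paddle.arange(start=0, end=4).unsqueeze(0)   # positions 0..3
    pos_emb = pos_embedding(pos)                       # looks up rows 0..3
    table = position_encoding_init(256, 512)           # the sinusoidal table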
- class CrossEntropyCriterion(label_smooth_eps=None, pad_idx=0)

  Bases: paddle.fluid.dygraph.layers.Layer

  Computes the cross entropy loss for the given input, with or without label smoothing.

  - Parameters
    - label_smooth_eps (float, optional) – The weight used to mix the original ground-truth distribution with a fixed uniform distribution. If given, label smoothing is applied to label. Defaults to None.
    - pad_idx (int, optional) – The token id used to pad variable-length sequences. Defaults to 0.
  - forward(predict, label)

    Computes the cross entropy loss, with or without label smoothing.

    - Parameters
      - predict (Tensor) – The prediction results of TransformerModel, with shape [batch_size, sequence_length, vocab_size] and data type float32 or float64.
      - label (Tensor) – The labels for the corresponding predictions, with shape [batch_size, sequence_length, 1].
    - Returns
      A tuple with items: (sum_cost, avg_cost, token_num).

      With the corresponding fields:
      - sum_cost (Tensor): The sum of the losses of the current batch, with data type float32 or float64.
      - avg_cost (Tensor): The average loss of the current batch, with data type float32 or float64. The relation between sum_cost and avg_cost is:
        \[avg\_cost = sum\_cost / token\_num\]
      - token_num (Tensor): The number of tokens in the current batch, with data type float32 or float64.
    - Return type
      tuple
  Example
    import paddle
    from paddlenlp.transformers import CrossEntropyCriterion

    criterion = CrossEntropyCriterion(label_smooth_eps=0.1, pad_idx=0)
    batch_size = 1
    seq_len = 2
    vocab_size = 30000
    predict = paddle.rand(shape=[batch_size, seq_len, vocab_size])
    label = paddle.randint(
        low=3,
        high=vocab_size,
        shape=[batch_size, seq_len, 1])
    criterion(predict, label)
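  For reference, a minimal sketch of a smoothed loss and the sum/avg/token_num relation documented above; the criterion's exact internal computation may differ:

    import paddle
    import paddle.nn.functional as F

    vocab_size, eps, pad_idx = 30000, 0.1, 0
    predict = paddle.rand([2, 4, vocab_size])
    label = paddle.randint(low=3, high=vocab_size, shape=[2, 4, 1])

    # mix the one-hot ground truth with a uniform distribution
    label_2d = paddle.squeeze(label, axis=[-1])
    soft_label = F.label_smooth(F.one_hot(label_2d, vocab_size), epsilon=eps)
    # per-token cross entropy: [batch_size, sequence_length]
    cost = -(soft_label * F.log_softmax(predict, axis=-1)).sum(axis=-1)
    # mask out pad tokens before reducing
    mask = paddle.cast(label_2d != pad_idx, cost.dtype)
    sum_cost = (cost * mask).sum()
    token_num = mask.sum()
    avg_cost = sum_cost / token_num   # the documented relation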
- class TransformerDecodeCell(decoder, word_embedding=None, pos_embedding=None, linear=None, dropout=0.1)

  Bases: paddle.fluid.dygraph.layers.Layer

  This layer wraps a Transformer decoder, combined with an embedding layer and an output layer, to produce logits from ids and positions.

  - Parameters
    - decoder (callable) – Can be a paddle.nn.TransformerDecoder instance, or a wrapper that includes an embedding layer accepting ids and positions and an output layer transforming the decoder output to logits.
    - word_embedding (callable, optional) – Can be a WordEmbedding instance or a callable that accepts ids as arguments and returns embeddings. It can be None if decoder includes an embedding layer. Defaults to None.
    - pos_embedding (callable, optional) – Can be a PositionalEmbedding instance or a callable that accepts positions as arguments and returns embeddings. It can be None if decoder includes a positional embedding layer. Defaults to None.
    - linear (callable, optional) – Can be a paddle.nn.Linear instance or a callable that transforms the decoder output to logits.
    - dropout (float, optional) – The dropout rate applied to the results of word_embedding and pos_embedding. Defaults to 0.1.
  - forward(inputs, states, static_cache, trg_src_attn_bias, memory, **kwargs)

    Produces logits.

    - Parameters
      - inputs (Tensor|tuple|list) – A tuple/list containing target ids and positions. If word_embedding is None, it should be a Tensor serving directly as the decoder input.
      - states (list) – A list in which each element is an instance of paddle.nn.MultiHeadAttention.Cache for the corresponding decoder layer. It can be produced by paddle.nn.TransformerDecoder.gen_cache.
      - static_cache (list) – A list in which each element is an instance of paddle.nn.MultiHeadAttention.StaticCache for the corresponding decoder layer. It can be produced by paddle.nn.TransformerDecoder.gen_cache.
      - trg_src_attn_bias (Tensor) – A tensor used in self-attention to prevent attention to unwanted positions, usually the subsequent positions. Its shape is broadcast to [batch_size, n_head, target_length, target_length], where the unwanted positions hold -INF values and the others hold 0 values. The data type should be float32 or float64. It can be None when no positions need to be masked.
      - memory (Tensor) – The output of the Transformer encoder, with shape [batch_size, source_length, d_model] and data type float32 or float64.
    - Returns
      A tuple with items: (outputs, new_states).

      With the corresponding fields:
      - outputs (Tensor): A float32 or float64 3D tensor representing logits, with shape [batch_size, sequence_length, vocab_size].
      - new_states (Tensor): Has the same structure and data type as the input states, but each cache length is one larger after concatenating the intermediate results of the current step.
    - Return type
      tuple
  Example
    import paddle
    from paddlenlp.transformers import TransformerDecodeCell
    from paddlenlp.transformers import TransformerBeamSearchDecoder

    def decoder():
        # build and return a decoder instance here
        pass

    cell = TransformerDecodeCell(decoder())
    decode = TransformerBeamSearchDecoder(
        cell, start_token=0, end_token=1, beam_size=4,
        var_dim_in_state=2)
- class TransformerBeamSearchDecoder(cell, start_token, end_token, beam_size, var_dim_in_state)

  Bases: paddle.fluid.layers.rnn.BeamSearchDecoder

  This layer is a subclass of BeamSearchDecoder that adapts beam search to the Transformer decoder.

  - Parameters
    - cell (TransformerDecodeCell) – An instance of TransformerDecodeCell.
    - start_token (int) – The start token id.
    - end_token (int) – The end token id.
    - beam_size (int) – The beam width used in beam search.
    - var_dim_in_state (int) – Indicates which dimension of the states is variable-length.
  - static tile_beam_merge_with_batch(t, beam_size)

    Tiles the batch dimension of a tensor. Specifically, this function takes a tensor t shaped [batch_size, s0, s1, ...] composed of minibatch entries t[0], ..., t[batch_size - 1] and tiles it to a shape [batch_size * beam_size, s0, s1, ...] composed of minibatch entries t[0], t[0], ..., t[1], t[1], ..., where each minibatch entry is repeated beam_size times.

    - Parameters
      - t (Tensor|list|tuple) – A tensor with shape [batch_size, ...], or a list/tuple of such tensors.
      - beam_size (int) – The beam width used in beam search.
    - Returns
      A tensor with shape [batch_size * beam_size, ...] and the same data type as t.
    - Return type
      Tensor
  Example
    import paddle
    from paddlenlp.transformers import TransformerBeamSearchDecoder

    t = paddle.rand(shape=[10, 10])
    TransformerBeamSearchDecoder.tile_beam_merge_with_batch(t, beam_size=4)
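  A quick check of the repeat pattern described above, where each minibatch entry appears beam_size times in a row (passing a single tensor, as in the example):

    import numpy as np
    import paddle
    from paddlenlp.transformers import TransformerBeamSearchDecoder

    t = paddle.to_tensor([[0.], [1.], [2.]])   # batch_size = 3
    tiled = TransformerBeamSearchDecoder.tile_beam_merge_with_batch(t, beam_size=2)
    # expected order: t[0], t[0], t[1], t[1], t[2], t[2]
    np.testing.assert_allclose(
        tiled.numpy().ravel(), np.array([0., 0., 1., 1., 2., 2.]))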
  - step(time, inputs, states, **kwargs)

    Performs a beam search decoding step, which uses cell to compute probabilities and then follows a beam search step to calculate scores and select candidate token ids.

    - Parameters
      - time (Tensor) – An int64 tensor with shape [1] provided by the caller, representing the current time step of decoding.
      - inputs (Tensor) – A tensor variable. It is the same as initial_inputs returned by initialize() for the first decoding step, and next_inputs returned by step() for the others.
      - states (Tensor) – A structure of tensor variables. It is the same as initial_cell_states returned by initialize() for the first decoding step, and next_states returned by step() for the others.
      - kwargs (dict, optional) – Additional keyword arguments provided by the caller, dynamic_decode.
    - Returns
      A tuple (beam_search_output, beam_search_state, next_inputs, finished). beam_search_state and next_inputs have the same structure, shape and data type as the input arguments states and inputs, respectively. beam_search_output is a namedtuple (with scores, predicted_ids and parent_ids as fields) of tensor variables, where scores, predicted_ids and parent_ids are all shaped [batch_size, beam_size], with data types float32, int64 and int64, respectively. finished is a bool tensor with shape [batch_size, beam_size].
    - Return type
      tuple
- class TransformerModel(src_vocab_size, trg_vocab_size, max_length, num_encoder_layers, num_decoder_layers, n_head, d_model, d_inner_hid, dropout, weight_sharing, attn_dropout=None, act_dropout=None, bos_id=0, eos_id=1, pad_id=None, activation='relu', normalize_before=True)

  Bases: paddle.fluid.dygraph.layers.Layer

  The Transformer model.

  This model is a Paddle paddle.nn.Layer subclass. Use it as a regular Paddle Layer and refer to the Paddle documentation for all matters related to general usage and behavior.
  - Parameters
    - src_vocab_size (int) – The size of the source vocabulary.
    - trg_vocab_size (int) – The size of the target vocabulary.
    - max_length (int) – The maximum length of input sequences.
    - num_encoder_layers (int) – The number of sub-layers stacked in the encoder.
    - num_decoder_layers (int) – The number of sub-layers stacked in the decoder.
    - n_head (int) – The number of heads used in multi-head attention.
    - d_model (int) – The dimension of the word embeddings, which is also the last dimension of the inputs and outputs of multi-head attention, the position-wise feed-forward networks, and the encoder and decoder.
    - d_inner_hid (int) – The size of the hidden layer in the position-wise feed-forward networks.
    - dropout (float) – The dropout rate, used for pre-processing, activations, and inside attention.
    - weight_sharing (bool) – Whether to share weights between the embedding layers and the output projection.
    - attn_dropout (float) – The dropout probability used in multi-head attention to drop some attention targets. If None, the value of dropout is used. Defaults to None.
    - act_dropout (float) – The dropout probability used after the FFN activation. If None, the value of dropout is used. Defaults to None.
    - bos_id (int, optional) – The start token id, which is also used as the padding id. Defaults to 0.
    - eos_id (int, optional) – The end token id. Defaults to 1.
    - pad_id (int, optional) – The pad token id. Defaults to None; if None, bos_id is used as pad_id.
    - activation (str, optional) – The activation used in the FFN. Defaults to "relu".
    - normalize_before (bool, optional) – Whether to apply pre-normalization. Defaults to True.
  - forward(src_word, trg_word)

    The Transformer forward method. The inputs are source/target sequences, and it returns logits.

    - Parameters
      - src_word (Tensor) – The ids of the source sequence words, a tensor with shape [batch_size, source_sequence_length] and data type int32 or int64.
      - trg_word (Tensor) – The ids of the target sequence words, a tensor with shape [batch_size, target_sequence_length] and data type int32 or int64.
    - Returns
      The output tensor of the final layer of the model, with shape [batch_size, sequence_length, vocab_size] and data type float32 or float64.
    - Return type
      Tensor
  Example
    import paddle
    from paddlenlp.transformers import TransformerModel

    transformer = TransformerModel(
        src_vocab_size=30000,
        trg_vocab_size=30000,
        max_length=257,
        num_encoder_layers=6,
        num_decoder_layers=6,
        n_head=8,
        d_model=512,
        d_inner_hid=2048,
        dropout=0.1,
        weight_sharing=True,
        bos_id=0,
        eos_id=1)

    batch_size = 5
    seq_len = 10
    predict = transformer(
        src_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]),
        trg_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]))
- class InferTransformerModel(src_vocab_size, trg_vocab_size, max_length, num_encoder_layers, num_decoder_layers, n_head, d_model, d_inner_hid, dropout, weight_sharing, attn_dropout=None, act_dropout=None, bos_id=0, eos_id=1, pad_id=None, beam_size=4, max_out_len=256, output_time_major=False, beam_search_version='v1', activation='relu', normalize_before=True, **kwargs)

  Bases: paddlenlp.transformers.transformer.modeling.TransformerModel

  The Transformer model for auto-regressive generation.
  - Parameters
    - src_vocab_size (int) – The size of the source vocabulary.
    - trg_vocab_size (int) – The size of the target vocabulary.
    - max_length (int) – The maximum length of input sequences.
    - num_encoder_layers (int) – The number of sub-layers stacked in the encoder.
    - num_decoder_layers (int) – The number of sub-layers stacked in the decoder.
    - n_head (int) – The number of heads used in multi-head attention.
    - d_model (int) – The dimension of the word embeddings, which is also the last dimension of the inputs and outputs of multi-head attention, the position-wise feed-forward networks, and the encoder and decoder.
    - d_inner_hid (int) – The size of the hidden layer in the position-wise feed-forward networks.
    - dropout (float) – The dropout rate, used for pre-processing, activations, and inside attention.
    - weight_sharing (bool) – Whether to share weights between the embedding layers and the output projection.
    - attn_dropout (float) – The dropout probability used in multi-head attention to drop some attention targets. If None, the value of dropout is used. Defaults to None.
    - act_dropout (float) – The dropout probability used after the FFN activation. If None, the value of dropout is used. Defaults to None.
    - bos_id (int, optional) – The start token id, which is also used as the padding id. Defaults to 0.
    - eos_id (int, optional) – The end token id. Defaults to 1.
    - pad_id (int, optional) – The pad token id. Defaults to None; if None, bos_id is used as pad_id.
    - beam_size (int, optional) – The beam width for beam search. Defaults to 4.
    - max_out_len (int, optional) – The maximum output length. Defaults to 256.
    - output_time_major (bool, optional) – Indicates the data layout of the predicted Tensor. If False, the data layout is batch-major with shape [batch_size, seq_len, beam_size]. If True, the data layout is time-major with shape [seq_len, batch_size, beam_size]. Defaults to False.
    - beam_search_version (str) – Specifies the beam search version. It should be one of [v1, v2]. If v2, alpha (defaults to 0.6) needs to be set for the length penalty; see the sketch after this list. Defaults to v1.
    - activation (str, optional) – The activation used in the FFN. Defaults to "relu".
    - normalize_before (bool, optional) – Whether to apply pre-normalization. Defaults to True.
    - kwargs – The keyword arguments can be rel_len and alpha:
      - rel_len (bool, optional): Indicates whether max_out_len is relative to the length of the source text. Only works with v2 for now. It is suggested to set a small max_out_len and use rel_len=True. Defaults to False if not set.
      - alpha (float, optional): The power used in the length penalty calculation; refer to GNMT. Only works with v2 for now. Defaults to 0.6 if not set.
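  For reference, a minimal sketch of the GNMT length penalty that alpha controls in the v2 beam search (the helper name here is illustrative, not part of this API):

    def length_penalty(length, alpha=0.6):
        # lp(Y) = ((5 + |Y|) / 6) ** alpha, from the GNMT paper;
        # candidate scores are divided by this value, so larger alpha
        # favors longer outputs.
        return ((5.0 + length) / 6.0) ** alpha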
  - forward(src_word, trg_word=None)

    The Transformer forward method.

    - Parameters
      - src_word (Tensor) – The ids of the source sequence words, a tensor with shape [batch_size, source_sequence_length] and data type int32 or int64.
      - trg_word (Tensor) – The ids of the target sequence words. Normally, it should NOT be given; if it is given, forced decoding with the previous output tokens is triggered. Defaults to None.
    - Returns
      An int64 tensor containing the predicted ids, with shape [batch_size, seq_len, beam_size] or [seq_len, batch_size, beam_size] according to output_time_major.
    - Return type
      Tensor
  Example
    import paddle
    from paddlenlp.transformers import InferTransformerModel

    transformer = InferTransformerModel(
        src_vocab_size=30000,
        trg_vocab_size=30000,
        max_length=256,
        num_encoder_layers=6,
        num_decoder_layers=6,
        n_head=8,
        d_model=512,
        d_inner_hid=2048,
        dropout=0.1,
        weight_sharing=True,
        bos_id=0,
        eos_id=1,
        beam_size=4,
        max_out_len=256)

    batch_size = 5
    seq_len = 10
    transformer(
        src_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]))
  - beam_search_v2(src_word, beam_size=4, max_len=None, alpha=0.6, trg_word=None, trg_length=None)

    Beam search with two queues, alive and finished, each with its own beam_size capacity. It proceeds in three steps: grow_topk, grow_alive and grow_finish.

    1. grow_topk selects the top 2 * beam_size candidates, so that not all of them end with EOS.
    2. grow_alive selects the top beam_size non-EOS candidates as the inputs of the next decoding step.
    3. grow_finish compares the already finished candidates in the finished queue with the newly finished candidates from grow_topk, and keeps the top beam_size finished candidates.
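    The three steps can be illustrated with a simplified numpy sketch of a single decoding iteration; the names mirror the description above, while the real implementation operates on batched tensors:

      import numpy as np

      beam_size, vocab_size, eos_id = 2, 6, 1
      rng = np.random.default_rng(0)
      # log-probability of extending each alive beam with each token: [beam, vocab]
      scores = np.log(rng.dirichlet(np.ones(vocab_size), size=beam_size))

      # 1. grow_topk: keep the top 2 * beam_size candidates overall, so that
      #    enough non-EOS candidates survive even if several emit EOS.
      flat = scores.ravel()
      topk = np.argsort(-flat)[:2 * beam_size]
      cand = [(idx % vocab_size, flat[idx]) for idx in topk]

      # 2. grow_alive: the best beam_size non-EOS candidates feed the next step.
      alive = [(tok, s) for tok, s in cand if tok != eos_id][:beam_size]

      # 3. grow_finish: merge newly finished (EOS) candidates into the finished
      #    queue and keep its best beam_size entries.
      finished = []  # carried over from previous steps
      finished += [(tok, s) for tok, s in cand if tok == eos_id]
      finished = sorted(finished, key=lambda x: -x[1])[:beam_size]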