modeling
- position_encoding_init(n_position, d_pos_vec, dtype='float32')

  Generates the initial values for the sinusoidal position encoding table. This function follows the implementation in tensor2tensor, which is slightly different from the description in "Attention Is All You Need".
  - Parameters
    - n_position (int) – The largest position for sequences, that is, the maximum length of source or target sequences.
    - d_pos_vec (int) – The size of each positional embedding vector.
    - dtype (str, optional) – The data type of the output numpy.ndarray. Defaults to "float32".
  - Returns
    The embedding table for sinusoidal position encoding, with shape [n_position, d_pos_vec].
  - Return type
    numpy.ndarray
  Example
    from paddlenlp.transformers import position_encoding_init

    max_length = 256
    emb_dim = 512
    pos_table = position_encoding_init(max_length, emb_dim)
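  For intuition, here is a minimal numpy sketch of the tensor2tensor-style construction this function follows: the sine and cosine components fill the two halves of the channel axis instead of alternating as described in the paper. This is for illustration only; the library function above is authoritative.

    import numpy as np

    def sinusoid_table(n_position, d_pos_vec):
        # assumes an even d_pos_vec, as in the example above (512)
        num_timescales = d_pos_vec // 2
        # geometric progression of wavelengths from 1 to 10000
        log_inc = np.log(1e4) / (num_timescales - 1)
        inv_timescales = np.exp(np.arange(num_timescales) * -log_inc)
        scaled_time = np.arange(n_position)[:, None] * inv_timescales[None, :]
        # sin block first, cos block second: [n_position, d_pos_vec]
        return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)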
- class WordEmbedding(vocab_size, emb_dim, bos_id=0)

  Bases: paddle.fluid.dygraph.layers.Layer

  Word embedding layer of the Transformer. This layer automatically constructs a 2D embedding matrix based on the size of the vocabulary (vocab_size) and the size of each embedding vector (emb_dim). It looks up the embedding vector for each id in the input word. After the lookup, the embeddings are multiplied by sqrt(d_model), which is sqrt(emb_dim) in this interface.

  \[Out = embedding(word) * sqrt(emb\_dim)\]

  - Parameters
    - vocab_size (int) – The size of the vocabulary.
    - emb_dim (int) – The dimension of each embedding vector.
    - bos_id (int, optional) – The start token id, which is also used as the padding id. Defaults to 0.
  - forward(word)

    Computes the word embeddings.

    - Parameters
      - word (Tensor) – The input ids indicating the sequences' words, with shape [batch_size, sequence_length] and data type int32 or int64.
    - Returns
      The (scaled) embedding tensor, with shape [batch_size, sequence_length, emb_dim] and data type float32 or float64.
    - Return type
      Tensor
  Example
    import paddle
    from paddlenlp.transformers import WordEmbedding

    word_embedding = WordEmbedding(
        vocab_size=30000,
        emb_dim=512,
        bos_id=0)
    batch_size = 5
    sequence_length = 10
    src_words = paddle.randint(low=3, high=30000, shape=[batch_size, sequence_length])
    src_emb = word_embedding(src_words)
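  The scaling in the formula above can be reproduced with a plain lookup table; a minimal sketch using paddle.nn.Embedding directly (not this class's internals):

    import paddle

    vocab_size, emb_dim = 30000, 512
    table = paddle.nn.Embedding(vocab_size, emb_dim, padding_idx=0)
    ids = paddle.to_tensor([[3, 4, 5]])
    # Out = embedding(word) * sqrt(emb_dim)
    out = table(ids) * emb_dim ** 0.5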
- class PositionalEmbedding(emb_dim, max_length)

  Bases: paddle.fluid.dygraph.layers.Layer

  This layer produces sinusoidal positional embeddings for positions up to max_length. In its forward() method, it looks up the embedding vector for each id in the input pos.

  - Parameters
    - emb_dim (int) – The size of each embedding vector.
    - max_length (int) – The maximum length of sequences.
  - forward(pos)

    Computes the positional embeddings.

    - Parameters
      - pos (Tensor) – The input position ids, with shape [batch_size, sequence_length] and data type int32 or int64.
    - Returns
      The positional embedding tensor, with shape [batch_size, sequence_length, emb_dim] and data type float32 or float64.
    - Return type
      Tensor
  Example
    import paddle
    from paddlenlp.transformers import PositionalEmbedding

    pos_embedding = PositionalEmbedding(
        emb_dim=512,
        max_length=256)
    batch_size = 5
    pos = paddle.tile(paddle.arange(start=0, end=50), repeat_times=[batch_size, 1])
    pos_emb = pos_embedding(pos)
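  Assuming the table is initialized with position_encoding_init above (which the sinusoidal description suggests, though this is an assumption about the default weights), the forward pass is just a table lookup; a small sketch:

    import paddle
    from paddlenlp.transformers import PositionalEmbedding, position_encoding_init

    pos_embedding = PositionalEmbedding(emb_dim=512, max_length=256)
    pos = paddle.arange(start=0, end=4).unsqueeze(0)   # positions 0..3
    pos_emb = pos_embedding(pos)                       # looks up rows 0..3
    table = position_encoding_init(256, 512)           # the sinusoidal table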
- class CrossEntropyCriterion(label_smooth_eps=None, pad_idx=0)

  Bases: paddle.fluid.dygraph.layers.Layer

  Computes the cross entropy loss for the given input, with or without label smoothing.

  - Parameters
    - label_smooth_eps (float, optional) – The weight used to mix the original ground-truth distribution with a fixed uniform distribution. If given, label smoothing is applied to label. Defaults to None.
    - pad_idx (int, optional) – The token id used to pad variable-length sequences. Defaults to 0.
  - forward(predict, label)

    Computes the cross entropy loss, with or without label smoothing.

    - Parameters
      - predict (Tensor) – The prediction results of TransformerModel, with shape [batch_size, sequence_length, vocab_size] and data type float32 or float64.
      - label (Tensor) – The labels for the corresponding predictions, with shape [batch_size, sequence_length, 1].
    - Returns
      A tuple with items: (sum_cost, avg_cost, token_num).

      With the corresponding fields:
      - sum_cost (Tensor): The sum of the losses of the current batch, with data type float32 or float64.
      - avg_cost (Tensor): The average loss of the current batch, with data type float32 or float64. The relation between sum_cost and avg_cost is:
        \[avg\_cost = sum\_cost / token\_num\]
      - token_num (Tensor): The number of tokens in the current batch, with data type float32 or float64.
    - Return type
      tuple
  Example
    import paddle
    from paddlenlp.transformers import CrossEntropyCriterion

    criterion = CrossEntropyCriterion(label_smooth_eps=0.1, pad_idx=0)
    batch_size = 1
    seq_len = 2
    vocab_size = 30000
    predict = paddle.rand(shape=[batch_size, seq_len, vocab_size])
    label = paddle.randint(
        low=3,
        high=vocab_size,
        shape=[batch_size, seq_len, 1])
    criterion(predict, label)
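  For reference, a minimal sketch of a smoothed loss and the sum/avg/token_num relation documented above; the criterion's exact internal computation may differ:

    import paddle
    import paddle.nn.functional as F

    vocab_size, eps, pad_idx = 30000, 0.1, 0
    predict = paddle.rand([2, 4, vocab_size])
    label = paddle.randint(low=3, high=vocab_size, shape=[2, 4, 1])

    # mix the one-hot ground truth with a uniform distribution
    label_2d = paddle.squeeze(label, axis=[-1])
    soft_label = F.label_smooth(F.one_hot(label_2d, vocab_size), epsilon=eps)
    # per-token cross entropy: [batch_size, sequence_length]
    cost = -(soft_label * F.log_softmax(predict, axis=-1)).sum(axis=-1)
    # mask out pad tokens before reducing
    mask = paddle.cast(label_2d != pad_idx, cost.dtype)
    sum_cost = (cost * mask).sum()
    token_num = mask.sum()
    avg_cost = sum_cost / token_num   # the documented relation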
- class TransformerDecodeCell(decoder, word_embedding=None, pos_embedding=None, linear=None, dropout=0.1)

  Bases: paddle.fluid.dygraph.layers.Layer

  This layer wraps a Transformer decoder, combined with an embedding layer and an output layer, to produce logits from ids and positions.

  - Parameters
    - decoder (callable) – Can be a paddle.nn.TransformerDecoder instance, or a wrapper that includes an embedding layer accepting ids and positions and an output layer transforming the decoder output to logits.
    - word_embedding (callable, optional) – Can be a WordEmbedding instance or a callable that accepts ids as arguments and returns embeddings. It can be None if decoder includes an embedding layer. Defaults to None.
    - pos_embedding (callable, optional) – Can be a PositionalEmbedding instance or a callable that accepts positions as arguments and returns embeddings. It can be None if decoder includes a positional embedding layer. Defaults to None.
    - linear (callable, optional) – Can be a paddle.nn.Linear instance or a callable that transforms the decoder output to logits.
    - dropout (float, optional) – The dropout rate applied to the results of word_embedding and pos_embedding. Defaults to 0.1.
  - forward(inputs, states, static_cache, trg_src_attn_bias, memory, **kwargs)

    Produces logits.

    - Parameters
      - inputs (Tensor|tuple|list) – A tuple/list containing target ids and positions. If word_embedding is None, it should be a Tensor serving directly as the decoder input.
      - states (list) – A list in which each element is an instance of paddle.nn.MultiHeadAttention.Cache for the corresponding decoder layer. It can be produced by paddle.nn.TransformerDecoder.gen_cache.
      - static_cache (list) – A list in which each element is an instance of paddle.nn.MultiHeadAttention.StaticCache for the corresponding decoder layer. It can be produced by paddle.nn.TransformerDecoder.gen_cache.
      - trg_src_attn_bias (Tensor) – A tensor used in self-attention to prevent attention to unwanted positions, usually the subsequent positions. Its shape is broadcast to [batch_size, n_head, target_length, target_length], where the unwanted positions hold -INF values and the others hold 0 values. The data type should be float32 or float64. It can be None when no positions need to be masked.
      - memory (Tensor) – The output of the Transformer encoder, with shape [batch_size, source_length, d_model] and data type float32 or float64.
    - Returns
      A tuple with items: (outputs, new_states).

      With the corresponding fields:
      - outputs (Tensor): A float32 or float64 3D tensor representing logits, with shape [batch_size, sequence_length, vocab_size].
      - new_states (Tensor): Has the same structure and data type as the input states, but each cache length is one larger after concatenating the intermediate results of the current step.
    - Return type
      tuple
  Example
    import paddle
    from paddlenlp.transformers import TransformerDecodeCell
    from paddlenlp.transformers import TransformerBeamSearchDecoder

    def decoder():
        # build and return a decoder instance here
        pass

    cell = TransformerDecodeCell(decoder())
    decode = TransformerBeamSearchDecoder(
        cell, start_token=0, end_token=1, beam_size=4,
        var_dim_in_state=2)
- class TransformerBeamSearchDecoder(cell, start_token, end_token, beam_size, var_dim_in_state)

  Bases: paddle.fluid.layers.rnn.BeamSearchDecoder

  This layer is a subclass of BeamSearchDecoder that adapts beam search to the Transformer decoder.

  - Parameters
    - cell (TransformerDecodeCell) – An instance of TransformerDecodeCell.
    - start_token (int) – The start token id.
    - end_token (int) – The end token id.
    - beam_size (int) – The beam width used in beam search.
    - var_dim_in_state (int) – Indicates which dimension of the states is variable-length.
  - static tile_beam_merge_with_batch(t, beam_size)

    Tiles the batch dimension of a tensor. Specifically, this function takes a tensor t shaped [batch_size, s0, s1, ...] composed of minibatch entries t[0], ..., t[batch_size - 1] and tiles it to a shape [batch_size * beam_size, s0, s1, ...] composed of minibatch entries t[0], t[0], ..., t[1], t[1], ..., where each minibatch entry is repeated beam_size times.

    - Parameters
      - t (Tensor|list|tuple) – A tensor with shape [batch_size, ...], or a list/tuple of such tensors.
      - beam_size (int) – The beam width used in beam search.
    - Returns
      A tensor with shape [batch_size * beam_size, ...] and the same data type as t.
    - Return type
      Tensor
  Example
    import paddle
    from paddlenlp.transformers import TransformerBeamSearchDecoder

    t = paddle.rand(shape=[10, 10])
    TransformerBeamSearchDecoder.tile_beam_merge_with_batch(t, beam_size=4)
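  A quick check of the repeat pattern described above, where each minibatch entry appears beam_size times in a row (passing a single tensor, as in the example):

    import numpy as np
    import paddle
    from paddlenlp.transformers import TransformerBeamSearchDecoder

    t = paddle.to_tensor([[0.], [1.], [2.]])   # batch_size = 3
    tiled = TransformerBeamSearchDecoder.tile_beam_merge_with_batch(t, beam_size=2)
    # expected order: t[0], t[0], t[1], t[1], t[2], t[2]
    np.testing.assert_allclose(
        tiled.numpy().ravel(), np.array([0., 0., 1., 1., 2., 2.]))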
  - step(time, inputs, states, **kwargs)

    Performs a beam search decoding step, which uses cell to compute probabilities and then follows a beam search step to calculate scores and select candidate token ids.

    - Parameters
      - time (Tensor) – An int64 tensor with shape [1] provided by the caller, representing the current time step of decoding.
      - inputs (Tensor) – A tensor variable. It is the same as initial_inputs returned by initialize() for the first decoding step, and next_inputs returned by step() for the others.
      - states (Tensor) – A structure of tensor variables. It is the same as initial_cell_states returned by initialize() for the first decoding step, and next_states returned by step() for the others.
      - kwargs (dict, optional) – Additional keyword arguments provided by the caller, dynamic_decode.
    - Returns
      A tuple (beam_search_output, beam_search_state, next_inputs, finished). beam_search_state and next_inputs have the same structure, shape and data type as the input arguments states and inputs, respectively. beam_search_output is a namedtuple (with scores, predicted_ids and parent_ids as fields) of tensor variables, where scores, predicted_ids and parent_ids are all shaped [batch_size, beam_size], with data types float32, int64 and int64, respectively. finished is a bool tensor with shape [batch_size, beam_size].
    - Return type
      tuple
- class TransformerModel(src_vocab_size, trg_vocab_size, max_length, num_encoder_layers, num_decoder_layers, n_head, d_model, d_inner_hid, dropout, weight_sharing, attn_dropout=None, act_dropout=None, bos_id=0, eos_id=1, pad_id=None, activation='relu', normalize_before=True)

  Bases: paddle.fluid.dygraph.layers.Layer

  The Transformer model.

  This model is a Paddle paddle.nn.Layer subclass. Use it as a regular Paddle Layer and refer to the Paddle documentation for all matters related to general usage and behavior.
  - Parameters
    - src_vocab_size (int) – The size of the source vocabulary.
    - trg_vocab_size (int) – The size of the target vocabulary.
    - max_length (int) – The maximum length of input sequences.
    - num_encoder_layers (int) – The number of sub-layers stacked in the encoder.
    - num_decoder_layers (int) – The number of sub-layers stacked in the decoder.
    - n_head (int) – The number of heads used in multi-head attention.
    - d_model (int) – The dimension of the word embeddings, which is also the last dimension of the inputs and outputs of multi-head attention, the position-wise feed-forward networks, and the encoder and decoder.
    - d_inner_hid (int) – The size of the hidden layer in the position-wise feed-forward networks.
    - dropout (float) – The dropout rate, used for pre-processing, activations, and inside attention.
    - weight_sharing (bool) – Whether to share weights between the embedding layers and the output projection.
    - attn_dropout (float) – The dropout probability used in multi-head attention to drop some attention targets. If None, the value of dropout is used. Defaults to None.
    - act_dropout (float) – The dropout probability used after the FFN activation. If None, the value of dropout is used. Defaults to None.
    - bos_id (int, optional) – The start token id, which is also used as the padding id. Defaults to 0.
    - eos_id (int, optional) – The end token id. Defaults to 1.
    - pad_id (int, optional) – The pad token id. Defaults to None; if None, bos_id is used as pad_id.
    - activation (str, optional) – The activation used in the FFN. Defaults to "relu".
    - normalize_before (bool, optional) – Whether to apply pre-normalization. Defaults to True.
  - forward(src_word, trg_word)

    The Transformer forward method. The inputs are source/target sequences, and it returns logits.

    - Parameters
      - src_word (Tensor) – The ids of the source sequence words, a tensor with shape [batch_size, source_sequence_length] and data type int32 or int64.
      - trg_word (Tensor) – The ids of the target sequence words, a tensor with shape [batch_size, target_sequence_length] and data type int32 or int64.
    - Returns
      The output tensor of the final layer of the model, with shape [batch_size, sequence_length, vocab_size] and data type float32 or float64.
    - Return type
      Tensor
  Example
    import paddle
    from paddlenlp.transformers import TransformerModel

    transformer = TransformerModel(
        src_vocab_size=30000,
        trg_vocab_size=30000,
        max_length=257,
        num_encoder_layers=6,
        num_decoder_layers=6,
        n_head=8,
        d_model=512,
        d_inner_hid=2048,
        dropout=0.1,
        weight_sharing=True,
        bos_id=0,
        eos_id=1)

    batch_size = 5
    seq_len = 10
    predict = transformer(
        src_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]),
        trg_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]))
- class InferTransformerModel(src_vocab_size, trg_vocab_size, max_length, num_encoder_layers, num_decoder_layers, n_head, d_model, d_inner_hid, dropout, weight_sharing, attn_dropout=None, act_dropout=None, bos_id=0, eos_id=1, pad_id=None, beam_size=4, max_out_len=256, output_time_major=False, beam_search_version='v1', activation='relu', normalize_before=True, **kwargs)

  Bases: paddlenlp.transformers.transformer.modeling.TransformerModel

  The Transformer model for auto-regressive generation.
  - Parameters
    - src_vocab_size (int) – The size of the source vocabulary.
    - trg_vocab_size (int) – The size of the target vocabulary.
    - max_length (int) – The maximum length of input sequences.
    - num_encoder_layers (int) – The number of sub-layers stacked in the encoder.
    - num_decoder_layers (int) – The number of sub-layers stacked in the decoder.
    - n_head (int) – The number of heads used in multi-head attention.
    - d_model (int) – The dimension of the word embeddings, which is also the last dimension of the inputs and outputs of multi-head attention, the position-wise feed-forward networks, and the encoder and decoder.
    - d_inner_hid (int) – The size of the hidden layer in the position-wise feed-forward networks.
    - dropout (float) – The dropout rate, used for pre-processing, activations, and inside attention.
    - weight_sharing (bool) – Whether to share weights between the embedding layers and the output projection.
    - attn_dropout (float) – The dropout probability used in multi-head attention to drop some attention targets. If None, the value of dropout is used. Defaults to None.
    - act_dropout (float) – The dropout probability used after the FFN activation. If None, the value of dropout is used. Defaults to None.
    - bos_id (int, optional) – The start token id, which is also used as the padding id. Defaults to 0.
    - eos_id (int, optional) – The end token id. Defaults to 1.
    - pad_id (int, optional) – The pad token id. Defaults to None; if None, bos_id is used as pad_id.
    - beam_size (int, optional) – The beam width for beam search. Defaults to 4.
    - max_out_len (int, optional) – The maximum output length. Defaults to 256.
    - output_time_major (bool, optional) – Indicates the data layout of the predicted Tensor. If False, the data layout is batch-major with shape [batch_size, seq_len, beam_size]. If True, the data layout is time-major with shape [seq_len, batch_size, beam_size]. Defaults to False.
    - beam_search_version (str) – Specifies the beam search version. It should be one of [v1, v2]. If v2, alpha (defaults to 0.6) needs to be set for the length penalty; see the sketch after this list. Defaults to v1.
    - activation (str, optional) – The activation used in the FFN. Defaults to "relu".
    - normalize_before (bool, optional) – Whether to apply pre-normalization. Defaults to True.
    - kwargs – The keyword arguments can be rel_len and alpha:
      - rel_len (bool, optional): Indicates whether max_out_len is relative to the length of the source text. Only works with v2 for now. It is suggested to set a small max_out_len and use rel_len=True. Defaults to False if not set.
      - alpha (float, optional): The power used in the length penalty calculation; refer to GNMT. Only works with v2 for now. Defaults to 0.6 if not set.
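  For reference, a minimal sketch of the GNMT length penalty that alpha controls in the v2 beam search (the helper name here is illustrative, not part of this API):

    def length_penalty(length, alpha=0.6):
        # lp(Y) = ((5 + |Y|) / 6) ** alpha, from the GNMT paper;
        # candidate scores are divided by this value, so larger alpha
        # favors longer outputs.
        return ((5.0 + length) / 6.0) ** alpha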
  - forward(src_word, trg_word=None)

    The Transformer forward method.

    - Parameters
      - src_word (Tensor) – The ids of the source sequence words, a tensor with shape [batch_size, source_sequence_length] and data type int32 or int64.
      - trg_word (Tensor) – The ids of the target sequence words. Normally, it should NOT be given; if it is given, forced decoding with the previous output tokens is triggered. Defaults to None.
    - Returns
      An int64 tensor containing the predicted ids, with shape [batch_size, seq_len, beam_size] or [seq_len, batch_size, beam_size] according to output_time_major.
    - Return type
      Tensor
  Example
    import paddle
    from paddlenlp.transformers import InferTransformerModel

    transformer = InferTransformerModel(
        src_vocab_size=30000,
        trg_vocab_size=30000,
        max_length=256,
        num_encoder_layers=6,
        num_decoder_layers=6,
        n_head=8,
        d_model=512,
        d_inner_hid=2048,
        dropout=0.1,
        weight_sharing=True,
        bos_id=0,
        eos_id=1,
        beam_size=4,
        max_out_len=256)

    batch_size = 5
    seq_len = 10
    transformer(
        src_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]))
  - beam_search_v2(src_word, beam_size=4, max_len=None, alpha=0.6, trg_word=None, trg_length=None)

    Beam search with two queues, alive and finished, each with its own beam_size capacity. It proceeds in three steps: grow_topk, grow_alive and grow_finish.

    1. grow_topk selects the top 2 * beam_size candidates, so that not all of them end with EOS.
    2. grow_alive selects the top beam_size non-EOS candidates as the inputs of the next decoding step.
    3. grow_finish compares the already finished candidates in the finished queue with the newly finished candidates from grow_topk, and keeps the top beam_size finished candidates.
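    The three steps can be illustrated with a simplified numpy sketch of a single decoding iteration; the names mirror the description above, while the real implementation operates on batched tensors:

      import numpy as np

      beam_size, vocab_size, eos_id = 2, 6, 1
      rng = np.random.default_rng(0)
      # log-probability of extending each alive beam with each token: [beam, vocab]
      scores = np.log(rng.dirichlet(np.ones(vocab_size), size=beam_size))

      # 1. grow_topk: keep the top 2 * beam_size candidates overall, so that
      #    enough non-EOS candidates survive even if several emit EOS.
      flat = scores.ravel()
      topk = np.argsort(-flat)[:2 * beam_size]
      cand = [(idx % vocab_size, flat[idx]) for idx in topk]

      # 2. grow_alive: the best beam_size non-EOS candidates feed the next step.
      alive = [(tok, s) for tok, s in cand if tok != eos_id][:beam_size]

      # 3. grow_finish: merge newly finished (EOS) candidates into the finished
      #    queue and keep its best beam_size entries.
      finished = []  # carried over from previous steps
      finished += [(tok, s) for tok, s in cand if tok == eos_id]
      finished = sorted(finished, key=lambda x: -x[1])[:beam_size]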