modeling#
- position_encoding_init(n_position, d_pos_vec, dtype='float32')[source]#
Generates the initial values for the sinusoidal position encoding table. This method follows the implementation in tensor2tensor, but is slightly different from the description in "Attention Is All You Need".
- Parameters:
n_position (int) -- The largest position for sequences, that is, the maximum length of source or target sequences.
d_pos_vec (int) -- The size of positional embedding vector.
dtype (str, optional) -- The output numpy.array's data type. Defaults to "float32".
- Returns:
The embedding table of sinusoidal position encoding with shape [n_position, d_pos_vec].
- Return type:
numpy.array
Examples
    from paddlenlp.transformers import position_encoding_init

    max_length = 256
    emb_dim = 512
    pos_table = position_encoding_init(max_length, emb_dim)
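The tensor2tensor difference mentioned above concerns the channel layout: sine values fill the first half of the channels and cosine values the second half, instead of being interleaved at even/odd indices as described in the paper. The numpy sketch below illustrates that layout; it is an illustration of the idea, not necessarily identical to PaddleNLP's implementation.
    import numpy as np

    n_position, d_pos_vec = 256, 512
    num_timescales = d_pos_vec // 2
    # geometric progression of inverse timescales, as in tensor2tensor
    inv_timescales = np.exp(
        np.arange(num_timescales) * -(np.log(10000.0) / (num_timescales - 1)))
    scaled_time = np.arange(n_position)[:, None] * inv_timescales[None, :]
    # first half of the channels holds sin, second half holds cos
    # (illustrative layout; details may differ from PaddleNLP's implementation)
    table = np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)
    assert table.shape == (n_position, d_pos_vec)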
- class WordEmbedding(vocab_size, emb_dim, bos_id=0)[source]#
Bases: Layer
Word Embedding Layer of Transformer. This layer automatically constructs a 2D embedding matrix based on the size of the vocabulary (vocab_size) and the size of each embedding vector (emb_dim). It looks up the embedding vector of each id provided by the input word.
After the lookup, the embeddings are multiplied by sqrt(d_model), which is sqrt(emb_dim) in this interface.
\[Out = embedding(word) * sqrt(emb\_dim)\]
- Parameters:
vocab_size (int) -- The size of vocabulary.
emb_dim (int) -- Dimensionality of each embedding vector.
bos_id (int, optional) -- The start token id, which is also used as the padding id. Defaults to 0.
- forward(word)[source]#
Computes word embedding.
- Parameters:
word (Tensor) -- The input ids indicating the sequences' words, with shape [batch_size, sequence_length], whose data type can be int or int64.
- Returns:
The (scaled) embedding tensor of shape (batch_size, sequence_length, emb_dim), whose data type can be float32 or float64.
- Return type:
Tensor
Examples
    import paddle
    from paddlenlp.transformers import WordEmbedding

    word_embedding = WordEmbedding(
        vocab_size=30000,
        emb_dim=512,
        bos_id=0)
    batch_size = 5
    sequence_length = 10
    src_words = paddle.randint(low=3, high=30000, shape=[batch_size, sequence_length])
    src_emb = word_embedding(src_words)
- class PositionalEmbedding(emb_dim, max_length)[source]#
Bases: Layer
This layer produces sinusoidal positional embeddings of any length. In the forward() method, this layer looks up the embedding vector of each id provided by the input pos.
- Parameters:
emb_dim (int) -- The size of each embedding vector.
max_length (int) -- The maximum length of sequences.
- forward(pos)[source]#
Computes positional embedding.
- Parameters:
pos (Tensor) -- The input position ids with shape [batch_size, sequence_length], whose data type can be int or int64.
- Returns:
The positional embedding tensor of shape (batch_size, sequence_length, emb_dim), whose data type can be float32 or float64.
- Return type:
Tensor
Examples
    import paddle
    from paddlenlp.transformers import PositionalEmbedding

    pos_embedding = PositionalEmbedding(
        emb_dim=512,
        max_length=256)
    batch_size = 5
    pos = paddle.tile(paddle.arange(start=0, end=50), repeat_times=[batch_size, 1])
    pos_emb = pos_embedding(pos)
- class CrossEntropyCriterion(label_smooth_eps=None, pad_idx=0)[source]#
Bases: Layer
Computes the cross entropy loss for given input with or without label smoothing.
- Parameters:
label_smooth_eps (float, optional) -- The weight used to mix up the original ground-truth distribution and the fixed distribution. Defaults to None. If given, label smoothing will be applied to label.
pad_idx (int, optional) -- The token id used to pad variant-length sequences. Defaults to 0.
- forward(predict, label)[source]#
Computes cross entropy loss with or without label smoothing, as sketched after the example below.
- Parameters:
predict (Tensor) -- The prediction results of TransformerModel with shape [batch_size, sequence_length, vocab_size], whose data type can be float32 or float64.
label (Tensor) -- The label for the corresponding results with shape [batch_size, sequence_length, 1].
- Returns:
A tuple with items: (sum_cost, avg_cost, token_num).
With the corresponding fields:
sum_cost (Tensor): The sum of loss of the current batch, whose data type can be float32 or float64.
avg_cost (Tensor): The average loss of the current batch, whose data type can be float32 or float64. The relation between sum_cost and avg_cost can be described as:
\[avg\_cost = sum\_cost / token\_num\]
token_num (Tensor): The number of tokens of the current batch. Its data type can be float32 or float64.
- Return type:
tuple
Examples
    import paddle
    from paddlenlp.transformers import CrossEntropyCriterion

    criterion = CrossEntropyCriterion(label_smooth_eps=0.1, pad_idx=0)
    batch_size = 1
    seq_len = 2
    vocab_size = 30000
    predict = paddle.rand(shape=[batch_size, seq_len, vocab_size])
    label = paddle.randint(
        low=3,
        high=vocab_size,
        shape=[batch_size, seq_len, 1])
    criterion(predict, label)
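For intuition, the label smoothing controlled by label_smooth_eps mixes the one-hot ground-truth distribution with the fixed distribution before the cross entropy is taken. A minimal numpy sketch of this mixing, assuming the fixed distribution is uniform over the vocabulary (an assumption for illustration, not taken from this API), is:
    import numpy as np

    vocab_size, eps = 6, 0.1
    label = 2
    one_hot = np.eye(vocab_size)[label]
    # mix the hard target with a uniform distribution (assumed here)
    smoothed = (1.0 - eps) * one_hot + eps / vocab_size
    # cross entropy is then computed against the smoothed target
    predict = np.full(vocab_size, 1.0 / vocab_size)   # a dummy prediction
    loss = -(smoothed * np.log(predict)).sum()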
- class TransformerDecodeCell(decoder, word_embedding=None, pos_embedding=None, linear=None, dropout=0.1)[source]#
Bases: Layer
This layer wraps a Transformer decoder combined with an embedding layer and an output layer to produce logits from ids and positions.
- Parameters:
decoder (callable) -- Can be a paddle.nn.TransformerDecoder instance, or a wrapper that includes an embedding layer accepting ids and positions and an output layer transforming decoder output to logits.
word_embedding (callable, optional) -- Can be a WordEmbedding instance or a callable that accepts ids as arguments and returns embeddings. It can be None if decoder includes an embedding layer. Defaults to None.
pos_embedding (callable, optional) -- Can be a PositionalEmbedding instance or a callable that accepts positions as arguments and returns embeddings. It can be None if decoder includes a positional embedding layer. Defaults to None.
linear (callable, optional) -- Can be a paddle.nn.Linear instance or a callable to transform decoder output to logits.
dropout (float, optional) -- The dropout rate for the results of word_embedding and pos_embedding. Defaults to 0.1.
- forward(inputs, states, static_cache, trg_src_attn_bias, memory, **kwargs)[source]#
Produces logits.
- Parameters:
inputs (Tensor|tuple|list) -- A tuple/list that includes target ids and positions. If word_embedding is None, it should be a Tensor which is the input for the decoder.
states (list) -- A list in which each element is an instance of paddle.nn.MultiHeadAttention.Cache for the corresponding decoder layer. It can be produced by paddle.nn.TransformerDecoder.gen_cache.
static_cache (list) -- A list in which each element is an instance of paddle.nn.MultiHeadAttention.StaticCache for the corresponding decoder layer. It can be produced by paddle.nn.TransformerDecoder.gen_cache.
trg_src_attn_bias (Tensor) -- A tensor used in self attention to prevent attention to some unwanted positions, usually the subsequent positions. It is a tensor with shape broadcastable to [batch_size, n_head, target_length, target_length], where the unwanted positions have -INF values and the others have 0 values. The data type should be float32 or float64. It can be None when nothing needs to be prevented from being attended to.
memory (Tensor) -- The output of the Transformer encoder. It is a tensor with shape [batch_size, source_length, d_model] and its data type can be float32 or float64.
- Returns:
A tuple with items: (outputs, new_states).
With the corresponding fields:
outputs (Tensor): A float32 or float64 3D tensor representing logits shaped [batch_size, sequence_length, vocab_size].
new_states (Tensor): This output has the same structure and data type as states, while its length is one larger since it concatenates the intermediate results of the current step.
- Return type:
tuple
Examples
    import paddle
    from paddlenlp.transformers import TransformerDecodeCell
    from paddlenlp.transformers import TransformerBeamSearchDecoder

    def decoder():
        # build and return the decoder here
        pass

    cell = TransformerDecodeCell(decoder())
    decode = TransformerBeamSearchDecoder(
        cell, start_token=0, end_token=1, beam_size=4,
        var_dim_in_state=2)
- class TransformerBeamSearchDecoder(cell, start_token, end_token, beam_size, var_dim_in_state)[source]#
Bases: BeamSearchDecoder
This layer is a subclass of BeamSearchDecoder that adapts beam search to the Transformer decoder.
- Parameters:
cell (TransformerDecodeCell) -- An instance of TransformerDecodeCell.
start_token (int) -- The start token id.
end_token (int) -- The end token id.
beam_size (int) -- The beam width used in beam search.
var_dim_in_state (int) -- Indicates which dimension of states is variant.
- static tile_beam_merge_with_batch(t, beam_size)[source]#
Tiles the batch dimension of a tensor. Specifically, this function takes a tensor t shaped [batch_size, s0, s1, ...] composed of minibatch entries t[0], ..., t[batch_size - 1] and tiles it to have shape [batch_size * beam_size, s0, s1, ...] composed of minibatch entries t[0], t[0], ..., t[1], t[1], ..., where each minibatch entry is repeated beam_size times.
- Parameters:
t (list|tuple) -- A list of tensors with shape [batch_size, ...].
beam_size (int) -- The beam width used in beam search.
- Returns:
A tensor with shape [batch_size * beam_size, ...], whose data type is the same as t.
- Return type:
Tensor
Examples
    import paddle
    from paddlenlp.transformers import TransformerBeamSearchDecoder

    t = paddle.rand(shape=[10, 10])
    TransformerBeamSearchDecoder.tile_beam_merge_with_batch(t, beam_size=4)
- step(time, inputs, states, **kwargs)[source]#
Performs a beam search decoding step, which uses cell to get probabilities, then follows a beam search step to calculate scores and select candidate token ids.
- Parameters:
time (Tensor) -- An int64 tensor with shape [1] provided by the caller, representing the current time step number of decoding.
inputs (Tensor) -- A tensor variable. It is the same as initial_inputs returned by initialize() for the first decoding step, and next_inputs returned by step() for the others.
states (Tensor) -- A structure of tensor variables. It is the same as the initial_cell_states returned by initialize() for the first decoding step, and next_states returned by step() for the others.
kwargs (dict, optional) -- Additional keyword arguments provided by the caller dynamic_decode.
- Returns:
A tuple (beam_search_output, beam_search_state, next_inputs, finished). beam_search_state and next_inputs have the same structure, shape and data type as the input arguments states and inputs, respectively. beam_search_output is a namedtuple (with scores, predicted_ids and parent_ids as fields) of tensor variables, where scores, predicted_ids and parent_ids each have a tensor value shaped [batch_size, beam_size] with data type float32, int64 and int64, respectively. finished is a bool tensor with shape [batch_size, beam_size].
- Return type:
tuple
- class TransformerModel(src_vocab_size, trg_vocab_size, max_length, num_encoder_layers, num_decoder_layers, n_head, d_model, d_inner_hid, dropout, weight_sharing, attn_dropout=None, act_dropout=None, bos_id=0, eos_id=1, pad_id=None, activation='relu', normalize_before=True)[source]#
Bases: Layer
The Transformer model.
This model is a Paddle paddle.nn.Layer subclass. Use it as a regular Paddle Layer and refer to the Paddle documentation for all matters related to general usage and behavior.
- Parameters:
src_vocab_size (int) -- The size of source vocabulary.
trg_vocab_size (int) -- The size of target vocabulary.
max_length (int) -- The maximum length of input sequences.
num_encoder_layers (int) -- The number of sub-layers to be stacked in the encoder.
num_decoder_layers (int) -- The number of sub-layers to be stacked in the decoder.
n_head (int) -- The number of heads used in multi-head attention.
d_model (int) -- The dimension of word embeddings, which is also the last dimension of the input and output of multi-head attention, position-wise feed-forward networks, encoder and decoder.
d_inner_hid (int) -- The size of the hidden layer in position-wise feed-forward networks.
dropout (float) -- The dropout rate. Used for pre-process, activation and inside attention.
weight_sharing (bool) -- Whether to use weight sharing.
attn_dropout (float) -- The dropout probability used in MHA to drop some attention targets. If None, use the value of dropout. Defaults to None.
act_dropout (float) -- The dropout probability used after FFN activation. If None, use the value of dropout. Defaults to None.
bos_id (int, optional) -- The start token id, which is also used as the padding id. Defaults to 0.
eos_id (int, optional) -- The end token id. Defaults to 1.
pad_id (int, optional) -- The pad token id. Defaults to None. If it is None, bos_id will be used as pad_id.
activation (str, optional) -- The activation used in FFN. Defaults to "relu".
normalize_before (bool, optional) -- Whether to apply pre-normalization. Defaults to True.
- forward(src_word, trg_word)[source]#
The Transformer forward method. The inputs are source/target sequences, and it returns logits.
- Parameters:
src_word (Tensor) -- The ids of source sequence words. It is a tensor with shape [batch_size, source_sequence_length] and its data type can be int or int64.
trg_word (Tensor) -- The ids of target sequence words. It is a tensor with shape [batch_size, target_sequence_length] and its data type can be int or int64.
- Returns:
The output tensor of the final layer of the model, with shape [batch_size, sequence_length, vocab_size] and data type float32 or float64.
- Return type:
Tensor
Examples
    import paddle
    from paddlenlp.transformers import TransformerModel

    transformer = TransformerModel(
        src_vocab_size=30000,
        trg_vocab_size=30000,
        max_length=257,
        num_encoder_layers=6,
        num_decoder_layers=6,
        n_head=8,
        d_model=512,
        d_inner_hid=2048,
        dropout=0.1,
        weight_sharing=True,
        bos_id=0,
        eos_id=1)

    batch_size = 5
    seq_len = 10
    predict = transformer(
        src_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]),
        trg_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]))
- class InferTransformerModel(src_vocab_size, trg_vocab_size, max_length, num_encoder_layers, num_decoder_layers, n_head, d_model, d_inner_hid, dropout, weight_sharing, attn_dropout=None, act_dropout=None, bos_id=0, eos_id=1, pad_id=None, beam_size=4, max_out_len=256, output_time_major=False, beam_search_version='v1', activation='relu', normalize_before=True, **kwargs)[source]#
The Transformer model for auto-regressive generation.
- Parameters:
src_vocab_size (int) -- The size of source vocabulary.
trg_vocab_size (int) -- The size of target vocabulary.
max_length (int) -- The maximum length of input sequences.
num_encoder_layers (int) -- The number of sub-layers to be stacked in the encoder.
num_decoder_layers (int) -- The number of sub-layers to be stacked in the decoder.
n_head (int) -- The number of heads used in multi-head attention.
d_model (int) -- The dimension of word embeddings, which is also the last dimension of the input and output of multi-head attention, position-wise feed-forward networks, encoder and decoder.
d_inner_hid (int) -- The size of the hidden layer in position-wise feed-forward networks.
dropout (float) -- The dropout rate. Used for pre-process, activation and inside attention.
weight_sharing (bool) -- Whether to use weight sharing.
attn_dropout (float) -- The dropout probability used in MHA to drop some attention targets. If None, use the value of dropout. Defaults to None.
act_dropout (float) -- The dropout probability used after FFN activation. If None, use the value of dropout. Defaults to None.
bos_id (int, optional) -- The start token id, which is also used as the padding id. Defaults to 0.
eos_id (int, optional) -- The end token id. Defaults to 1.
pad_id (int, optional) -- The pad token id. Defaults to None. If it is None, bos_id will be used as pad_id.
beam_size (int, optional) -- The beam width for beam search. Defaults to 4.
max_out_len (int, optional) -- The maximum output length. Defaults to 256.
output_time_major (bool, optional) -- Indicates the data layout of the predicted Tensor. If False, the data layout is batch major with shape [batch_size, seq_len, beam_size]. If True, the data layout is time major with shape [seq_len, batch_size, beam_size]. Defaults to False.
beam_search_version (str) -- Specifies the beam search version. It should be one of [v1, v2]. If v2, alpha (defaults to 0.6) needs to be set for length penalty. Defaults to v1.
activation (str, optional) -- The activation used in FFN. Defaults to "relu".
normalize_before (bool, optional) -- Whether to apply pre-normalization. Defaults to True.
kwargs -- The keyword arguments can be rel_len and alpha:
rel_len (bool, optional): Indicates whether max_out_len is the length relative to that of the source text. Only works in v2 temporarily. It is suggested to set a small max_out_len and use rel_len=True. Defaults to False if not set.
alpha (float, optional): The power number in length penalty calculation (see the formula sketch after this list). Refer to GNMT. Only works in v2 temporarily. Defaults to 0.6 if not set.
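As context for alpha, the GNMT-style length penalty commonly used for beam search scores a candidate Y of length |Y| as below; whether PaddleNLP applies exactly this form is an assumption based on the GNMT reference.
\[lp(Y) = \left(\frac{5 + |Y|}{6}\right)^{\alpha}, \qquad score(Y) = \frac{\log P(Y|X)}{lp(Y)}\]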
- forward(src_word, trg_word=None)[source]#
The Transformer forward method.
- Parameters:
src_word (Tensor) -- The ids of source sequence words. It is a tensor with shape [batch_size, source_sequence_length] and its data type can be int or int64.
trg_word (Tensor) -- The ids of target sequence words. Normally, it should NOT be given. If it is given, forced decoding with the previous output tokens will be triggered. Defaults to None.
- Returns:
An int64 tensor indicating the predicted ids. Its shape is [batch_size, seq_len, beam_size] or [seq_len, batch_size, beam_size] according to output_time_major.
- Return type:
Tensor
Examples
    import paddle
    from paddlenlp.transformers import InferTransformerModel

    transformer = InferTransformerModel(
        src_vocab_size=30000,
        trg_vocab_size=30000,
        max_length=256,
        num_encoder_layers=6,
        num_decoder_layers=6,
        n_head=8,
        d_model=512,
        d_inner_hid=2048,
        dropout=0.1,
        weight_sharing=True,
        bos_id=0,
        eos_id=1,
        beam_size=4,
        max_out_len=256)

    batch_size = 5
    seq_len = 10
    transformer(
        src_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]))
- beam_search_v2(src_word, beam_size=4, max_len=None, alpha=0.6, trg_word=None, trg_length=None)[source]#
Beam search with two queues, alive and finished, each with its own capacity of beam_size. It consists of the steps grow_topk, grow_alive and grow_finish:
1. grow_topk selects the top 2*beam_size candidates to avoid all of them reaching EOS.
2. grow_alive selects the top beam_size non-EOS candidates as the inputs of the next decoding step.
3. grow_finish compares the already finished candidates in the finished queue with the newly finished candidates from grow_topk, and selects the top beam_size finished candidates.
A simplified sketch of this bookkeeping is shown below.
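The following is a minimal, single-step numpy sketch of the alive/finished bookkeeping described above, for one batch element. It only illustrates the three steps with made-up variable names; the actual method works on batched paddle Tensors inside the model and additionally applies the length penalty.
    import numpy as np

    # illustrative sketch of one beam_search_v2 step; not the actual implementation
    vocab_size, beam_size, eos_id = 8, 2, 1
    rng = np.random.default_rng(0)
    # per-beam log-probabilities for the next token, shape [beam_size, vocab_size]
    log_probs = np.log(rng.dirichlet(np.ones(vocab_size), size=beam_size))
    alive_scores = np.array([-0.5, -1.2])      # running scores of the alive beams
    finished_scores = np.array([-2.0])         # scores already in the finished queue

    # grow_topk: expand every alive beam and keep the top 2 * beam_size candidates
    cand_scores = (alive_scores[:, None] + log_probs).reshape(-1)
    topk = np.argsort(-cand_scores)[:2 * beam_size]
    topk_scores, topk_tokens = cand_scores[topk], topk % vocab_size

    # grow_alive: keep the top beam_size candidates that did NOT emit EOS
    is_eos = topk_tokens == eos_id
    next_alive_scores = topk_scores[~is_eos][:beam_size]

    # grow_finish: merge newly finished candidates into the finished queue
    merged = np.concatenate([finished_scores, topk_scores[is_eos]])
    next_finished_scores = np.sort(merged)[::-1][:beam_size]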