encoder

infer_transformer_encoder(input, attn_mask, q_weight, q_bias, k_weight, k_bias, v_weight, v_bias, attn_out_weight, attn_out_bias, norm1_weight, norm1_bias, norm2_weight, norm2_bias, ffn_inter_weight, ffn_inter_bias, ffn_out_weight, ffn_out_bias, n_head, size_per_head, n_layer=12, use_gelu=True, remove_padding=False, int8_mode=0, layer_idx=0, allow_gemm_test=False, use_trt_kernel=False, normalize_before=False)[source]

Fusion encoder API integrating encoder inference in FastGeneration. It accepts the weights and biases of TransformerEncoder, together with some other parameters, for inference.
encoder_layer_forward(self, src, src_mask, cache=None, sequence_id_offset=None, trt_seq_len=None)[source]

Redefines the forward function of paddle.nn.TransformerEncoderLayer to integrate FastGeneration for inference. The original forward function is not replaced unless enable_fast_encoder is called by objects of its base class. After the replacement, objects of paddle.nn.TransformerEncoderLayer still have the same member variables as before. After inference, disable_fast_encoder can be called to restore the forward functions of paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer.

- Parameters
src (Tensor) – The input of the Transformer encoder layer. It is a tensor with shape [batch_size, sequence_length, d_model]. The data type should be float32 or float64.

src_mask (Tensor, optional) – A tensor used in multi-head attention to prevent attention to some unwanted positions, usually the paddings or the subsequent positions. It is a tensor with shape [batch_size, 1, 1, sequence_length]. When the data type is bool, the unwanted positions have False values and the others have True values. When the data type is int, the unwanted positions have 0 values and the others have 1 values. When the data type is float, the unwanted positions have -INF values and the others have 0 values. It can be None when nothing needs to be prevented from being attended to. Defaults to None.
- Returns
A tensor that has the same shape and data type as enc_input, representing the output of the Transformer encoder layer. Or, if cache is not None, a tuple that, besides the encoder layer output, includes the new cache, which is the same as the input cache argument except that its incremental_cache has an incremental length. See paddle.nn.MultiHeadAttention.gen_cache and paddle.nn.MultiHeadAttention.forward for more details.

- Return type
src (Tensor|tuple)
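The three src_mask conventions above (bool, int, float) are interchangeable representations of the same padding information. As a minimal sketch, not part of this API, the following NumPy helper (the name to_additive_mask is illustrative) converts a per-token bool mask into the additive float form with the [batch_size, 1, 1, sequence_length] shape described above:

```python
import numpy as np

def to_additive_mask(pad_mask: np.ndarray) -> np.ndarray:
    """Convert a bool mask of shape [batch_size, sequence_length]
    (True = attend, False = block) into the additive float form of shape
    [batch_size, 1, 1, sequence_length]: 0.0 at wanted positions, -INF
    at unwanted ones, so it can be added to attention scores."""
    additive = np.where(pad_mask, 0.0, -np.inf).astype("float32")
    # Insert the two broadcast dimensions for heads and query positions.
    return additive[:, None, None, :]

# Two sequences of length 4; the second has one padded (blocked) position.
mask = np.array([[True, True, True, True],
                 [True, True, True, False]])
out = to_additive_mask(mask)  # shape (2, 1, 1, 4)
```

Adding this mask to the raw attention scores drives the softmax weight of blocked positions to zero, which is why 0.0 marks wanted positions and -INF marks unwanted ones.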
encoder_forward(self, src, src_mask=None, cache=None)[source]

Redefines the forward function of paddle.nn.TransformerEncoder to integrate FastGeneration for inference. The original forward function is not replaced unless enable_fast_encoder is called by objects of its base class. After the replacement, objects of paddle.nn.TransformerEncoder still have the same member variables as before. After inference, disable_fast_encoder can be called to restore the forward functions of paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer.

- Parameters
src (Tensor) – The input of the Transformer encoder. It is a tensor with shape [batch_size, sequence_length, d_model]. The data type should be float32 or float16.

src_mask (Tensor, optional) – A tensor used in multi-head attention to prevent attention to some unwanted positions, usually the paddings or the subsequent positions. It is a tensor with shape [batch_size, 1, 1, sequence_length]. The data type must be float; the unwanted positions have -INF (or other non-zero) values, and the wanted positions must be 0.0.
- Returns
A tensor that has the same shape and data type as src, representing the output of the Transformer encoder. Or, if cache is not None, a tuple that, besides the encoder output, includes the new cache, which is the same as the input cache argument except that the incremental_cache in it has an incremental length. See paddle.nn.MultiHeadAttention.gen_cache and paddle.nn.MultiHeadAttention.forward for more details.

- Return type
output (Tensor|tuple)
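Since encoder_forward accepts only the float form of src_mask, the mask is typically built from the real (unpadded) length of each sequence. A minimal NumPy sketch, assuming per-sequence lengths are known (the helper name length_to_float_mask is illustrative, not part of this API):

```python
import numpy as np

def length_to_float_mask(lengths, max_len):
    """Build the [batch_size, 1, 1, max_len] float mask described above:
    0.0 at real-token positions, -INF at padded positions."""
    positions = np.arange(max_len)                             # [max_len]
    # valid[b, t] is True while t is inside sequence b's real length.
    valid = positions[None, :] < np.asarray(lengths)[:, None]  # [batch, max_len]
    mask = np.where(valid, 0.0, -np.inf).astype("float32")
    return mask[:, None, None, :]

# Batch of two sequences padded to length 4; real lengths are 4 and 2.
m = length_to_float_mask([4, 2], max_len=4)  # shape (2, 1, 1, 4)
```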
enable_fast_encoder(self, use_fp16=False, encoder_lib=None)[source]

Compiles the fusion encoder operator integrated with FastGeneration using JIT (Just-In-Time) compilation, and replaces the forward functions of the paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer objects contained in self to support inference using FastGeneration.

Examples

from paddlenlp.ops import enable_fast_encoder, disable_fast_encoder

model.eval()
model = enable_fast_encoder(model)
enc_out = model(src, src_mask)
model = disable_fast_encoder(model)
disable_fast_encoder(self)[source]

Restores the original forward functions of the paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer objects contained in self.

Examples

from paddlenlp.ops import enable_fast_encoder, disable_fast_encoder

model.eval()
model = enable_fast_encoder(model)
enc_out = model(src, src_mask)
model = disable_fast_encoder(model)