encoder
- infer_transformer_encoder(input, attn_mask, q_weight, q_bias, k_weight, k_bias, v_weight, v_bias, attn_out_weight, attn_out_bias, norm1_weight, norm1_bias, norm2_weight, norm2_bias, ffn_inter_weight, ffn_inter_bias, ffn_out_weight, ffn_out_bias, n_head, size_per_head, n_layer=12, use_gelu=True, remove_padding=False, int8_mode=0, layer_idx=0, allow_gemm_test=False, use_trt_kernel=False, normalize_before=False)
Fused encoder API that integrates encoder inference with FastGeneration. It accepts the weights and biases of a TransformerEncoder, along with some other parameters, for inference.
- encoder_layer_forward(self, src, src_mask, cache=None, sequence_id_offset=None, trt_seq_len=None)
Redefines the forward function of paddle.nn.TransformerEncoderLayer to integrate FastGeneration for inference.
The original forward function is not replaced unless enable_fast_encoder is called by an object of its base class. After the replacement, objects of paddle.nn.TransformerEncoderLayer keep the same member variables as before.
After inference, disable_fast_encoder can be called to restore the forward function of paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer.
- Parameters:
src (Tensor) – The input of the Transformer encoder layer. It is a tensor with shape [batch_size, sequence_length, d_model]. The data type should be float32 or float64.
src_mask (Tensor, optional) – A tensor used in multi-head attention to prevent attention to unwanted positions, usually the paddings or the subsequent positions. It is a tensor with shape [batch_size, 1, 1, sequence_length]. When the data type is bool, the unwanted positions have False values and the others have True values. When the data type is int, the unwanted positions have 0 values and the others have 1 values. When the data type is float, the unwanted positions have -INF values and the others have 0 values. It can be None when no position needs to be masked. Defaults to None.
- Returns:
A tensor with the same shape and data type as src, representing the output of the Transformer encoder layer. If cache is not None, a tuple is returned instead: besides the encoder layer output, it includes the new cache, which is the same as the input cache argument except that incremental_cache has an incremental length. See paddle.nn.MultiHeadAttention.gen_cache and paddle.nn.MultiHeadAttention.forward for more details.
- Return type:
src (Tensor|tuple)
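The src_mask conventions described above can be sketched in plain Python. This is a minimal illustration, not part of paddlenlp: the helper names build_float_mask and bool_to_float_mask are hypothetical, and nested lists stand in for tensors. It builds the float form of the mask (0.0 for wanted positions, -INF for unwanted ones) with shape [batch_size, 1, 1, sequence_length] from per-sample sequence lengths.

```python
import math


def build_float_mask(seq_lens, max_len):
    """Build a float padding mask shaped [batch_size, 1, 1, max_len].

    Hypothetical helper: real positions get 0.0, padded positions get -inf.
    """
    mask = []
    for n in seq_lens:
        row = [0.0 if pos < n else -math.inf for pos in range(max_len)]
        mask.append([[row]])  # wrap twice to add the two singleton dimensions
    return mask


def bool_to_float_mask(bool_row):
    """Convert one row of the bool convention (True = keep) to the float one."""
    return [0.0 if keep else -math.inf for keep in bool_row]


# Two samples with 3 and 1 real tokens, padded to length 4.
mask = build_float_mask([3, 1], max_len=4)
```

Assuming Paddle is installed, a nested list like this could then be turned into a tensor with something like paddle.to_tensor(mask, dtype="float32") before being passed as src_mask.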
- encoder_forward(self, src, src_mask=None, cache=None)
Redefines the forward function of paddle.nn.TransformerEncoder to integrate FastGeneration for inference.
The original forward function is not replaced unless enable_fast_encoder is called by an object of its base class. After the replacement, objects of paddle.nn.TransformerEncoder keep the same member variables as before.
After inference, disable_fast_encoder can be called to restore the forward function of paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer.
- Parameters:
src (Tensor) – The input of the Transformer encoder. It is a tensor with shape [batch_size, sequence_length, d_model]. The data type should be float32 or float16.
src_mask (Tensor, optional) – A tensor used in multi-head attention to prevent attention to unwanted positions, usually the paddings or the subsequent positions. It is a tensor with shape [batch_size, 1, 1, sequence_length]. The data type must be float: the unwanted positions hold -INF or another non-zero value, and the wanted positions must be 0.0.
- Returns:
A tensor with the same shape and data type as src, representing the output of the Transformer encoder. If cache is not None, a tuple is returned instead: besides the encoder output, it includes the new cache, which is the same as the input cache argument except that incremental_cache in it has an incremental length. See paddle.nn.MultiHeadAttention.gen_cache and paddle.nn.MultiHeadAttention.forward for more details.
- Return type:
output (Tensor|tuple)
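Because the return value is either a tensor or a tuple, depending on whether cache was passed, calling code may want to normalize it. The following is a small sketch under that assumption; split_encoder_result is a hypothetical helper, not a paddlenlp function, and plain strings and lists stand in for the tensor and cache objects.

```python
def split_encoder_result(result):
    """Return (output, new_cache) for either return form.

    Hypothetical helper: new_cache is None when the encoder was called
    without a cache and therefore returned a bare output.
    """
    if isinstance(result, tuple):
        output, new_cache = result
        return output, new_cache
    return result, None


# Stand-in values; in real use these would be a Tensor and a cache object.
out_only = split_encoder_result("encoder_output")
out_with_cache = split_encoder_result(("encoder_output", ["layer0_cache"]))
```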
- enable_fast_encoder(self, use_fp16=False, encoder_lib=None)
Compiles the fused encoder operator integrated with FastGeneration just-in-time (JIT) and replaces the forward function of the paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer objects inherited from self to support inference using FastGeneration.
Examples
    from paddlenlp.ops import enable_fast_encoder, disable_fast_encoder
    model.eval()
    model = enable_fast_encoder(model)
    enc_out = model(src, src_mask)
    model = disable_fast_encoder(model)
- disable_fast_encoder(self)
Restores the original forward function of the paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer objects inherited from self.
Examples
    from paddlenlp.ops import enable_fast_encoder, disable_fast_encoder
    model.eval()
    model = enable_fast_encoder(model)
    enc_out = model(src, src_mask)
    model = disable_fast_encoder(model)
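The enable/disable pair described above follows a swap-and-restore pattern on forward. The sketch below shows that general pattern in plain Python; it is not the paddlenlp implementation. TinyEncoder, fast_forward, enable_fast, disable_fast, and the _saved_forward attribute are all hypothetical names chosen for illustration.

```python
from types import MethodType


class TinyEncoder:
    """Hypothetical stand-in for a layer that exposes a forward method."""

    def forward(self, x):
        return [v + 1 for v in x]

    def __call__(self, x):
        return self.forward(x)


def fast_forward(self, x):
    """Hypothetical replacement; it must produce the same result as forward."""
    return [v + 1 for v in x]


def enable_fast(model):
    # Save the original bound method so it can be restored later.
    model._saved_forward = model.forward
    # Bind the replacement to this instance only; the class is untouched.
    model.forward = MethodType(fast_forward, model)
    return model


def disable_fast(model):
    # Restore only if a replacement was installed earlier.
    if hasattr(model, "_saved_forward"):
        model.forward = model._saved_forward
        del model._saved_forward
    return model


model = enable_fast(TinyEncoder())
patched_out = model([1, 2])
model = disable_fast(model)
restored_out = model([1, 2])
```

Patching the instance rather than the class keeps other objects of the same class unaffected, which mirrors the statement above that the original forward is only replaced for objects on which the enabling call was made.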