encoder

infer_transformer_encoder(input, attn_mask, q_weight, q_bias, k_weight, k_bias, v_weight, v_bias, attn_out_weight, attn_out_bias, norm1_weight, norm1_bias, norm2_weight, norm2_bias, ffn_inter_weight, ffn_inter_bias, ffn_out_weight, ffn_out_bias, n_head, size_per_head, n_layer=12, use_gelu=True, remove_padding=False, int8_mode=0, layer_idx=0, allow_gemm_test=False, use_trt_kernel=False, normalize_before=False)[source]

Fusion encoder API integrating encoder inference in FastGeneration. It accepts the weights and biases of TransformerEncoder, together with some other parameters, for inference.
encoder_layer_forward(self, src, src_mask, cache=None, sequence_id_offset=None, trt_seq_len=None)[source]

Redefines the forward function of paddle.nn.TransformerEncoderLayer to integrate FastGeneration for inference. The original forward function is not replaced unless enable_fast_encoder is called by objects of its base class. After the replacement, objects of paddle.nn.TransformerEncoderLayer still have the same member variables as before. After inference, disable_fast_encoder can be called to restore the forward functions of paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer.

- Parameters
src (Tensor) – The input of the Transformer encoder layer. It is a tensor with shape [batch_size, sequence_length, d_model]. The data type should be float32 or float64.

src_mask (Tensor, optional) – A tensor used in multi-head attention to prevent attention to some unwanted positions, usually the paddings or the subsequent positions. It is a tensor with shape [batch_size, 1, 1, sequence_length]. When the data type is bool, the unwanted positions have False values and the others have True values. When the data type is int, the unwanted positions have 0 values and the others have 1 values. When the data type is float, the unwanted positions have -INF values and the others have 0 values. It can be None when nothing needs to be prevented from being attended to. Defaults to None.
- Returns
A tensor that has the same shape and data type as enc_input, representing the output of the Transformer encoder layer. Or, if cache is not None, a tuple that, besides the encoder layer output, includes the new cache, which is the same as the input cache argument except that its incremental_cache has an incremental length. See paddle.nn.MultiHeadAttention.gen_cache and paddle.nn.MultiHeadAttention.forward for more details.

- Return type
src (Tensor|tuple)
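The three src_mask conventions above (bool, int, float) are interchangeable representations of the same padding information. As a minimal sketch, not part of this API, the following NumPy helper (the name to_additive_mask is illustrative) converts a per-token bool mask into the additive float form with the [batch_size, 1, 1, sequence_length] shape described above:

```python
import numpy as np

def to_additive_mask(pad_mask: np.ndarray) -> np.ndarray:
    """Convert a bool mask of shape [batch_size, sequence_length]
    (True = attend, False = block) into the additive float form of shape
    [batch_size, 1, 1, sequence_length]: 0.0 at wanted positions, -INF
    at unwanted ones, so it can be added to attention scores."""
    additive = np.where(pad_mask, 0.0, -np.inf).astype("float32")
    # Insert the two broadcast dimensions for heads and query positions.
    return additive[:, None, None, :]

# Two sequences of length 4; the second has one padded (blocked) position.
mask = np.array([[True, True, True, True],
                 [True, True, True, False]])
out = to_additive_mask(mask)  # shape (2, 1, 1, 4)
```

Adding this mask to the raw attention scores drives the softmax weight of blocked positions to zero, which is why 0.0 marks wanted positions and -INF marks unwanted ones.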
encoder_forward(self, src, src_mask=None, cache=None)[source]

Redefines the forward function of paddle.nn.TransformerEncoder to integrate FastGeneration for inference. The original forward function is not replaced unless enable_fast_encoder is called by objects of its base class. After the replacement, objects of paddle.nn.TransformerEncoder still have the same member variables as before. After inference, disable_fast_encoder can be called to restore the forward functions of paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer.

- Parameters
src (Tensor) – The input of the Transformer encoder. It is a tensor with shape [batch_size, sequence_length, d_model]. The data type should be float32 or float16.

src_mask (Tensor, optional) – A tensor used in multi-head attention to prevent attention to some unwanted positions, usually the paddings or the subsequent positions. It is a tensor with shape [batch_size, 1, 1, sequence_length]. The data type must be float; the unwanted positions have -INF (or other non-zero) values, and the wanted positions must be 0.0.
- Returns
A tensor that has the same shape and data type as src, representing the output of the Transformer encoder. Or, if cache is not None, a tuple that, besides the encoder output, includes the new cache, which is the same as the input cache argument except that the incremental_cache in it has an incremental length. See paddle.nn.MultiHeadAttention.gen_cache and paddle.nn.MultiHeadAttention.forward for more details.

- Return type
output (Tensor|tuple)
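Since encoder_forward accepts only the float form of src_mask, the mask is typically built from the real (unpadded) length of each sequence. A minimal NumPy sketch, assuming per-sequence lengths are known (the helper name length_to_float_mask is illustrative, not part of this API):

```python
import numpy as np

def length_to_float_mask(lengths, max_len):
    """Build the [batch_size, 1, 1, max_len] float mask described above:
    0.0 at real-token positions, -INF at padded positions."""
    positions = np.arange(max_len)                             # [max_len]
    # valid[b, t] is True while t is inside sequence b's real length.
    valid = positions[None, :] < np.asarray(lengths)[:, None]  # [batch, max_len]
    mask = np.where(valid, 0.0, -np.inf).astype("float32")
    return mask[:, None, None, :]

# Batch of two sequences padded to length 4; real lengths are 4 and 2.
m = length_to_float_mask([4, 2], max_len=4)  # shape (2, 1, 1, 4)
```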
enable_fast_encoder(self, use_fp16=False, encoder_lib=None)[source]

Compiles the fusion encoder operator integrated with FastGeneration using JIT (Just-In-Time) compilation, and replaces the forward functions of the paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer objects contained in self to support inference using FastGeneration.

Examples

from paddlenlp.ops import enable_fast_encoder, disable_fast_encoder

model.eval()
model = enable_fast_encoder(model)
enc_out = model(src, src_mask)
model = disable_fast_encoder(model)
disable_fast_encoder(self)[source]

Restores the original forward functions of the paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer objects contained in self.

Examples

from paddlenlp.ops import enable_fast_encoder, disable_fast_encoder

model.eval()
model = enable_fast_encoder(model)
enc_out = model(src, src_mask)
model = disable_fast_encoder(model)