faster_transformer
class FasterTransformer(src_vocab_size, trg_vocab_size, max_length, num_encoder_layers, num_decoder_layers, n_head, d_model, d_inner_hid, dropout, weight_sharing, attn_dropout=None, act_dropout=None, bos_id=0, eos_id=1, decoding_strategy='beam_search', beam_size=4, topk=1, topp=0.0, max_out_len=256, diversity_rate=0.0, decoding_lib=None, use_fp16_decoding=False, enable_faster_encoder=False, use_fp16_encoder=False, rel_len=False, alpha=0.6)
Bases: paddlenlp.transformers.transformer.modeling.TransformerModel

FasterTransformer is a faster version of the Transformer model for generation. It uses a custom operator, based on and enhancing NVIDIA FasterTransformer, to do fast generation.
- Parameters
src_vocab_size (int) – The size of source vocabulary.
trg_vocab_size (int) – The size of target vocabulary.
max_length (int) – The maximum length of input sequences.
num_encoder_layers (int) – The number of sub-layers to be stacked in the encoder.
num_decoder_layers (int) – The number of sub-layers to be stacked in the decoder.
n_head (int) – The number of heads used in multi-head attention.
d_model (int) – The dimension for word embeddings, which is also the last dimension of the input and output of multi-head attention, position-wise feed-forward networks, encoder and decoder.
d_inner_hid (int) – Size of the hidden layer in position-wise feed-forward networks.
dropout (float) – The dropout rate. Used in pre-process, activation and attention.
weight_sharing (bool) – Whether to use weight sharing.
attn_dropout (float) – The dropout probability used in MHA to drop some attention targets. If None, the value of dropout is used. Defaults to None.
act_dropout (float) – The dropout probability used after FFN activation. If None, the value of dropout is used. Defaults to None.
bos_id (int, optional) – The start token id, which is also used as the padding id. Defaults to 0.
eos_id (int, optional) – The end token id. Defaults to 1.
decoding_strategy (str, optional) – The decoding strategy. It can be 'beam_search', 'beam_search_v2', 'topk_sampling' or 'topp_sampling'. For the beam search strategies, 'v2' selects the top beam_size * 2 beams and processes the top beam_size alive and finished beams among them separately, while 'v1' only selects the top beam_size beams and mixes up the alive and finished beams. 'v2' always searches more and gets better results, since the number of alive beams always stays beam_size, while in 'v1' it may decrease when the end token is met. However, 'v2' always generates longer results and thus may do more computation and be slower. Defaults to 'beam_search'.
beam_size (int, optional) – The beam width for beam search. Defaults to 4.
topk (int, optional) – The number of highest-probability tokens to keep for top-k sampling. Defaults to 1.
topp (float, optional) – The most probable tokens whose cumulative probability is not less than topp are kept for top-p sampling. Defaults to 0.0.
max_out_len (int, optional) – The maximum output length. Defaults to 256.
diversity_rate (float, optional) – Refer to A Simple, Fast Diverse Decoding Algorithm for Neural Generation (https://arxiv.org/abs/1611.08562) for details. A bigger diversity_rate leads to more diversity; diversity_rate == 0 is equivalent to naive beam search. Defaults to 0.0.
use_fp16_decoding (bool, optional) – Whether to use fp16 for decoding. Defaults to False.
enable_faster_encoder (bool, optional) – Whether to use the faster version of the encoder. This is an experimental option for now. Defaults to False.
use_fp16_encoder (bool, optional) – Whether to use fp16 for encoder. Only works when enable_faster_encoder is True. Defaults to False.
rel_len (bool, optional) – Indicates whether max_out_len is the length relative to that of the source text. Only works in 'v2' temporarily. It is suggested to set a small max_out_len and use rel_len=True. Defaults to False.
alpha (float, optional) – The power number in the length penalty calculation. Only works in 'v2' temporarily. Refer to GNMT. Defaults to 0.6.
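A minimal construction sketch (not from the original docs; the vocabulary sizes and other hyperparameters below are illustrative assumptions) showing how the decoding options above map to the constructor:

import paddle
from paddlenlp.ops import FasterTransformer

# Illustrative hyperparameters; leaving decoding_lib=None triggers JIT
# building of the custom decoding operator on first use.
transformer = FasterTransformer(
    src_vocab_size=30000,
    trg_vocab_size=30000,
    max_length=256,
    num_encoder_layers=6,
    num_decoder_layers=6,
    n_head=8,
    d_model=512,
    d_inner_hid=2048,
    dropout=0.1,
    weight_sharing=True,
    decoding_strategy='beam_search_v2',
    beam_size=4,
    max_out_len=64,
    rel_len=True,  # max_out_len is relative to the source length ('v2' only)
    alpha=0.6)
transformer.eval()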
forward(src_word, trg_word=None)
The Transformer forward method. The inputs are source/target sequences, and it returns logits.
- Parameters
src_word (Tensor) – The ids of source sequence words. It is a tensor with shape [batch_size, source_sequence_length] and its data type can be int or int64.
trg_word (Tensor) – The ids of target sequence words. It is a tensor with shape [batch_size, target_sequence_length] and its data type can be int or int64.
- Returns
The output tensor of the final layer of the model, with shape [batch_size, sequence_length, vocab_size] and data type float32 or float64.
- Return type
Tensor
Example
import paddle
from paddlenlp.transformers import TransformerModel

transformer = TransformerModel(
    src_vocab_size=30000,
    trg_vocab_size=30000,
    max_length=257,
    num_encoder_layers=6,
    num_decoder_layers=6,
    n_head=8,
    d_model=512,
    d_inner_hid=2048,
    dropout=0.1,
    weight_sharing=True,
    bos_id=0,
    eos_id=1)

batch_size = 5
seq_len = 10
predict = transformer(
    src_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]),
    trg_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]))
export_params(init_from_params, place)
This method is used to load a static graph from a dygraph checkpoint, or to export an inference model using the static graph. It does NOT support the faster encoder.
- Parameters
init_from_params (string) – The path to the dygraph checkpoint.
place (paddle.Place) – The place on which to execute the static graph.
Example
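A hedged sketch (the checkpoint path is hypothetical), assuming a FasterTransformer built inside the current static-graph program:

import paddle

paddle.enable_static()
place = paddle.CUDAPlace(0)
# `transformer` is assumed to be a FasterTransformer constructed under
# static graph mode; the checkpoint directory below is hypothetical.
transformer.export_params(
    init_from_params='./trained_models/step_final/',
    place=place)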
class TransformerGenerator(src_vocab_size, trg_vocab_size, max_length, num_encoder_layers, num_decoder_layers, n_head, d_model, d_inner_hid, dropout, weight_sharing, bos_id=0, eos_id=1, beam_size=4, max_out_len=256, **kwargs)
Bases: paddle.fluid.dygraph.layers.Layer
The Transformer model for auto-regressive generation with beam search. It wraps FasterTransformer and InferTransformerModel, and automatically chooses between FasterTransformer (with JIT building) and the slower InferTransformerModel.
- Parameters
src_vocab_size (int) – The size of source vocabulary.
trg_vocab_size (int) – The size of target vocabulary.
max_length (int) – The maximum length of input sequences.
num_encoder_layers (int) – The number of sub-layers to be stacked in the encoder.
num_decoder_layers (int) – The number of sub-layers to be stacked in the decoder.
n_head (int) – The number of heads used in multi-head attention.
d_model (int) – The dimension for word embeddings, which is also the last dimension of the input and output of multi-head attention, position-wise feed-forward networks, encoder and decoder.
d_inner_hid (int) – Size of the hidden layer in position-wise feed-forward networks.
dropout (float) – The dropout rate. Used in pre-process, activation and attention.
weight_sharing (bool) – Whether to use weight sharing.
bos_id (int, optional) – The start token id, which is also used as the padding id. Defaults to 0.
eos_id (int, optional) – The end token id. Defaults to 1.
beam_size (int, optional) – The beam width for beam search. Defaults to 4.
max_out_len (int, optional) – The maximum output length. Defaults to 256.
kwargs – The keyword arguments can be output_time_major, use_ft, use_fp16_decoding, beam_search_version, rel_len, alpha and diversity_rate (a construction sketch showing these options follows this parameter list):

output_time_major (bool, optional): Indicates the data layout of the predicted Tensor. If False, the data layout would be batch major with shape [batch_size, seq_len, beam_size]. If True, the data layout would be time major with shape [seq_len, batch_size, beam_size]. Defaults to False.

use_ft (bool, optional): Whether to use FasterTransformer for decoding. Defaults to True.

use_fp16_decoding (bool, optional): Whether to use fp16 for decoding. Only works when using FasterTransformer.

beam_search_version (str, optional): The strategy of beam search. It can be 'v1' or 'v2'. 'v2' selects the top beam_size * 2 beams and processes the top beam_size alive and finished beams among them separately, while 'v1' only selects the top beam_size beams and mixes up the alive and finished beams. 'v2' always searches more and gets better results, since the number of alive beams always stays beam_size, while in 'v1' it may decrease when the end token is met. However, 'v2' always generates longer results and thus may do more computation and be slower.

rel_len (bool, optional): Indicates whether max_out_len is the length relative to that of the source text. Only works in 'v2' temporarily. It is suggested to set a small max_out_len and use rel_len=True. Defaults to False.

alpha (float, optional): The power number in the length penalty calculation. Refer to GNMT. Only works in 'v2' temporarily. Defaults to 0.6.

diversity_rate (float, optional): Refer to A Simple, Fast Diverse Decoding Algorithm for Neural Generation (https://arxiv.org/abs/1611.08562) for details. A bigger diversity_rate leads to more diversity; diversity_rate == 0 is equivalent to naive beam search. Defaults to 0. NOTE: Only works when using FasterTransformer temporarily.
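A minimal construction sketch (hyperparameter values are illustrative assumptions, not from the original docs) showing how these keyword options are passed through:

from paddlenlp.ops import TransformerGenerator

# Illustrative values; with use_ft=True the wrapper falls back to the slower
# InferTransformerModel automatically if the custom op cannot be built.
transformer = TransformerGenerator(
    src_vocab_size=30000,
    trg_vocab_size=30000,
    max_length=256,
    num_encoder_layers=6,
    num_decoder_layers=6,
    n_head=8,
    d_model=512,
    d_inner_hid=2048,
    dropout=0.1,
    weight_sharing=True,
    bos_id=0,
    eos_id=1,
    beam_size=4,
    max_out_len=64,
    use_ft=True,
    beam_search_version='v2',
    rel_len=True,  # max_out_len is relative to the source length ('v2' only)
    alpha=0.6,
    output_time_major=False)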
forward(src_word, trg_word=None)
Performs decoding for the Transformer model.
- Parameters
src_word (Tensor) – The ids of source sequence words. It is a tensor with shape [batch_size, source_sequence_length] and its data type can be int or int64.
trg_word (Tensor) – The ids of target sequence words. Normally, it should NOT be given. If it is given, forced decoding with the previous output tokens will be triggered. Defaults to None.
- Returns
An int64 tensor of predicted ids. Its shape is [batch_size, seq_len, beam_size] or [seq_len, batch_size, beam_size] according to output_time_major. When using FasterTransformer and beam search v2, the beam dimension is doubled to include both the top beam_size alive and finished beams, thus the tensor shape is [batch_size, seq_len, beam_size * 2] or [seq_len, batch_size, beam_size * 2].
- Return type
Tensor
Example
import paddle
from paddlenlp.ops import TransformerGenerator

transformer = TransformerGenerator(
    src_vocab_size=30000,
    trg_vocab_size=30000,
    max_length=256,
    num_encoder_layers=6,
    num_decoder_layers=6,
    n_head=8,
    d_model=512,
    d_inner_hid=2048,
    dropout=0.1,
    weight_sharing=True,
    bos_id=0,
    eos_id=1,
    beam_size=4,
    max_out_len=256)

batch_size = 5
seq_len = 10
transformer(
    src_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]))
class FasterGPT(model, decoding_lib=None, use_fp16_decoding=False)
Bases: paddlenlp.transformers.gpt.modeling.GPTPretrainedModel
forward(input_ids, seq_len=None, attention_mask=None, top_k=4, top_p=0.0, max_length=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=0, decode_strategy='sample', num_return_sequences=1, **model_kwargs)
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
generate(input_ids, seq_len=None, attention_mask=None, top_k=4, top_p=0.0, max_length=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=0, decode_strategy='sample', num_return_sequences=1, **model_kwargs)
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
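A hedged usage sketch (the checkpoint name and decoding settings are assumptions, not from these docs), wrapping a pretrained GPT model for fast sampling:

import paddle
from paddlenlp.ops import FasterGPT
from paddlenlp.transformers import GPTLMHeadModel, GPTTokenizer

model_name = 'gpt2-en'  # assumed checkpoint name
tokenizer = GPTTokenizer.from_pretrained(model_name)
model = GPTLMHeadModel.from_pretrained(model_name)
model = FasterGPT(model, use_fp16_decoding=False)
model.eval()

ids = tokenizer('Where is the capital of the United States?')['input_ids']
input_ids = paddle.to_tensor([ids], dtype='int64')
out = model.generate(
    input_ids,
    top_k=4,
    max_length=32,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id)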
class FasterUnifiedTransformer(model, decoding_lib=None, use_fp16_decoding=False)
Bases: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerPretrainedModel
forward(input_ids, token_type_ids, attention_mask, seq_len=None, role_ids=None, position_ids=None, max_length=128, min_length=0, top_k=4, top_p=0.0, decode_strategy='sampling', bos_token_id=None, eos_token_id=None, pad_token_id=None, num_beams=4, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, **model_kwargs)
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
generate(input_ids, token_type_ids, attention_mask, seq_len=None, role_ids=None, position_ids=None, max_length=128, min_length=0, top_k=4, top_p=0.0, decode_strategy='sampling', bos_token_id=None, eos_token_id=None, pad_token_id=None, num_beams=4, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, **model_kwargs)
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
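A hedged usage sketch (the checkpoint name and dialogue input are assumptions, not from these docs), wrapping a pretrained dialogue model for fast generation:

from paddlenlp.ops import FasterUnifiedTransformer
from paddlenlp.transformers import (UnifiedTransformerLMHeadModel,
                                    UnifiedTransformerTokenizer)

model_name = 'unified_transformer-12L-cn'  # assumed checkpoint name
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name)
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name)
model = FasterUnifiedTransformer(model, use_fp16_decoding=False)
model.eval()

# Encode a single-turn dialogue history for generation.
inputs = tokenizer.dialogue_encode(
    '你好，今天天气怎么样？',
    add_start_token_as_response=True,
    return_tensors=True,
    is_split_into_words=False)
outputs = model.generate(
    input_ids=inputs['input_ids'],
    token_type_ids=inputs['token_type_ids'],
    attention_mask=inputs['attention_mask'],
    max_length=64,
    decode_strategy='sampling',
    top_k=5)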
class FasterUNIMOText(model, decoding_lib=None, use_fp16_decoding=False)
Bases: paddlenlp.transformers.unimo.modeling.UNIMOPretrainedModel
forward(input_ids, token_type_ids, attention_mask, seq_len=None, max_length=128, min_length=0, top_k=4, top_p=0.0, num_beams=4, decode_strategy='sampling', bos_token_id=None, eos_token_id=None, pad_token_id=None, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, position_ids=None, **model_kwargs)
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
generate(input_ids, token_type_ids, attention_mask, seq_len=None, max_length=128, min_length=0, top_k=4, top_p=0.0, num_beams=4, decode_strategy='sampling', bos_token_id=None, eos_token_id=None, pad_token_id=None, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, position_ids=None, **model_kwargs)
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
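A hedged usage sketch (the checkpoint name and source text are assumptions, not from these docs):

from paddlenlp.ops import FasterUNIMOText
from paddlenlp.transformers import UNIMOLMHeadModel, UNIMOTokenizer

model_name = 'unimo-text-1.0'  # assumed checkpoint name
tokenizer = UNIMOTokenizer.from_pretrained(model_name)
model = UNIMOLMHeadModel.from_pretrained(model_name)
model = FasterUNIMOText(model, use_fp16_decoding=False)
model.eval()

# Encode the source text for generation.
inputs = tokenizer.gen_encode(
    '深度学习是人工智能的核心技术。',
    return_tensors=True,
    is_split_into_words=False,
    add_start_token_for_decoding=True)
outputs = model.generate(
    input_ids=inputs['input_ids'],
    token_type_ids=inputs['token_type_ids'],
    attention_mask=inputs['attention_mask'],
    max_length=64,
    decode_strategy='sampling',
    top_k=5)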
class FasterBART(model, decoding_lib=None, use_fp16_decoding=False, enable_faster_encoder=True)
Bases: paddlenlp.transformers.bart.modeling.BartPretrainedModel
enable_faster_encoder_func(use_fp16=False, encoder_lib=None)
Compiles the fused encoder operator integrated from FasterTransformer using JIT (Just-In-Time) compilation, and replaces the forward function of the paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer objects inherited from self, to support inference using FasterTransformer.
Examples
from paddlenlp.ops import enable_faster_encoder, disable_faster_encoder

model.eval()
model = enable_faster_encoder(model)
enc_out = model(src, src_mask)
model = disable_faster_encoder(model)
forward(input_ids=None, encoder_output=None, seq_len=None, num_beams=4, top_k=1, top_p=0.0, decode_strategy='beam_search', bos_token_id=None, eos_token_id=None, pad_token_id=None, decoder_start_token_id=None, max_length=256, diversity_rate=0.0, length_penalty=0.6, num_return_sequences=1, early_stopping=False, forced_eos_token_id=None, **model_kwargs)
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
generate(input_ids=None, encoder_output=None, seq_len=None, num_beams=4, top_k=1, top_p=0.0, decode_strategy='beam_search', bos_token_id=None, eos_token_id=None, pad_token_id=None, decoder_start_token_id=None, max_length=256, diversity_rate=0.0, length_penalty=0.6, num_return_sequences=1, early_stopping=False, forced_eos_token_id=None, **model_kwargs)
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
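A hedged usage sketch (the checkpoint name and input text are assumptions, not from these docs):

import paddle
from paddlenlp.ops import FasterBART
from paddlenlp.transformers import BartForConditionalGeneration, BartTokenizer

model_name = 'bart-base'  # assumed checkpoint name
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)
model = FasterBART(model, use_fp16_decoding=False)
model.eval()

ids = tokenizer('He is a good <mask> and writes very well.')['input_ids']
input_ids = paddle.to_tensor([ids], dtype='int64')
outputs = model.generate(
    input_ids=input_ids,
    decode_strategy='beam_search',
    num_beams=4,
    max_length=32)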
class FasterMBART(model, decoding_lib=None, use_fp16_decoding=False)
Bases: paddlenlp.transformers.mbart.modeling.MBartPretrainedModel
forward(input_ids=None, encoder_output=None, seq_len=None, forced_bos_token_id=None, num_beams=4, top_k=1, top_p=0.0, decode_strategy='beam_search_v3', bos_token_id=None, eos_token_id=None, pad_token_id=None, decoder_start_token_id=None, max_length=256, diversity_rate=0.0, length_penalty=0.6, temperature=1.0, num_return_sequences=1, early_stopping=False, forced_eos_token_id=None, **model_kwargs)
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
generate(input_ids=None, encoder_output=None, seq_len=None, forced_bos_token_id=None, num_beams=4, top_k=1, top_p=0.0, decode_strategy='beam_search_v3', bos_token_id=None, eos_token_id=None, pad_token_id=None, decoder_start_token_id=None, max_length=256, diversity_rate=0.0, length_penalty=0.6, temperature=1.0, num_return_sequences=1, early_stopping=False, forced_eos_token_id=None, **model_kwargs)
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
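A hedged usage sketch (the checkpoint name, input text, and the target-language id lookup are assumptions, not from these docs):

import paddle
from paddlenlp.ops import FasterMBART
from paddlenlp.transformers import MBartForConditionalGeneration, MBartTokenizer

model_name = 'mbart-large-cc25'  # assumed checkpoint name
tokenizer = MBartTokenizer.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)
model = FasterMBART(model, use_fp16_decoding=False)
model.eval()

ids = tokenizer('PaddleNLP is a powerful NLP library.')['input_ids']
input_ids = paddle.to_tensor([ids], dtype='int64')
outputs = model.generate(
    input_ids=input_ids,
    # Assumed helper: maps a language code to its token id, used to force
    # the first generated token to the target language.
    forced_bos_token_id=tokenizer.lang_code_to_id['zh_CN'],
    num_beams=4,
    max_length=32)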