
class FasterTransformer(src_vocab_size, trg_vocab_size, max_length, num_encoder_layers, num_decoder_layers, n_head, d_model, d_inner_hid, dropout, weight_sharing, attn_dropout=None, act_dropout=None, bos_id=0, eos_id=1, pad_id=None, decoding_strategy='beam_search', beam_size=4, topk=1, topp=0.0, max_out_len=256, diversity_rate=0.0, decoding_lib=None, use_fp16_decoding=False, enable_fast_encoder=False, use_fp16_encoder=False, rel_len=False, alpha=0.6)[source]#

Bases: TransformerModel

FasterTransformer is a fast generation implementation of the Transformer model. It uses a custom op, based on and extending NVIDIA FasterTransformer, to speed up generation.

  • src_vocab_size (int) – The size of source vocabulary.

  • trg_vocab_size (int) – The size of target vocabulary.

  • max_length (int) – The maximum length of input sequences.

  • num_encoder_layers (int) – The number of sub-layers to be stacked in the encoder.

  • num_decoder_layers (int) – The number of sub-layers to be stacked in the decoder.

  • n_head (int) – The number of heads used in multi-head attention.

  • d_model (int) – The dimension for word embeddings, which is also the last dimension of the input and output of multi-head attention, position-wise feed-forward networks, encoder and decoder.

  • d_inner_hid (int) – Size of the hidden layer in position-wise feed-forward networks.

  • dropout (float) – Dropout rates. Used for pre-process, activation and inside attention.

  • weight_sharing (bool) – Whether to use weight sharing.

  • attn_dropout (float) – The dropout probability used in multi-head attention (MHA) to drop some attention targets. If None, the value of dropout is used. Defaults to None.

  • act_dropout (float) – The dropout probability used after the FFN activation. If None, the value of dropout is used. Defaults to None.

  • bos_id (int, optional) – The start token id. It is also used as the padding id when pad_id is None. Defaults to 0.

  • eos_id (int, optional) – The end token id. Defaults to 1.

  • pad_id (int, optional) – The pad token id. Defaults to None. If it’s None, the bos_id will be used as pad_id.

  • decoding_strategy (str, optional) – Indicates the decoding strategy. It can be ‘beam_search’, ‘beam_search_v2’, ‘topk_sampling’ or ‘topp_sampling’. For the beam search strategies, ‘v2’ selects the top beam_size * 2 beams and processes the top beam_size alive beams and the top beam_size finished beams among them separately, while ‘v1’ selects only the top beam_size beams and mixes alive and finished beams together. ‘v2’ always searches more and gets better results, since the number of alive beams always stays at beam_size, whereas the number of alive beams in ‘v1’ may decrease once the end token is reached. However, ‘v2’ always generates longer results, so it may require more computation and be slower. Defaults to ‘beam_search’.

  • beam_size (int, optional) – The beam width for beam search. Defaults to 4.

  • topk (int, optional) – The number of highest probability tokens to keep for top-k sampling. Defaults to 1.

  • topp (float, optional) – The most probable tokens whose cumulative probability is not less than topp are kept for top-p sampling. Defaults to 0.0.

  • max_out_len (int, optional) – The maximum output length. Defaults to 256.

  • diversity_rate (float, optional) – Refer to A Simple, Fast Diverse Decoding Algorithm for Neural Generation for details. A larger diversity_rate leads to more diverse results. Setting diversity_rate == 0 is equivalent to naive beam search. Defaults to 0 if not set.

  • use_fp16_decoding (bool, optional) – Whether to use fp16 for decoding. Defaults to False.

  • enable_fast_encoder (bool, optional) – Whether to use the fast version of the encoder. This is an experimental option for now. Defaults to False.

  • use_fp16_encoder (bool, optional) – Whether to use fp16 for the encoder. Only works when enable_fast_encoder is True. Defaults to False.

  • rel_len (bool, optional) – Indicates whether max_out_len is a length relative to that of the source text. Only works in v2 for now. It is suggested to set a small max_out_len and use rel_len=True. Defaults to False if not set.

  • alpha (float, optional) – The exponent used in the length penalty calculation. Only works in v2 for now. Refer to GNMT. Defaults to 0.6 if not set.
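To make the topk and topp parameters above concrete, here is a minimal pure-Python sketch of top-k and top-p (nucleus) filtering over a toy next-token distribution. The helper names `top_k_filter` and `top_p_filter` are hypothetical and are not part of PaddleNLP; the fused custom op may differ in details such as tie-breaking, so treat this only as an illustration of the idea.

```python
def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their probabilities."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:k]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def top_p_filter(probs, p):
    """Keep the most probable ids until their cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


# Toy next-token distribution over a 4-token vocabulary (hypothetical values).
probs = [0.5, 0.25, 0.15, 0.1]
top2 = top_k_filter(probs, 2)       # keeps token ids 0 and 1
nucleus = top_p_filter(probs, 0.7)  # 0.5 + 0.25 >= 0.7, so ids 0 and 1 are kept
```

Sampling then draws the next token from the renormalized distribution that each filter returns.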

forward(src_word, trg_word=None)[source]#

The Transformer forward method. It takes source and target sequences as input and returns logits.

  • src_word (Tensor) – The ids of the source sequence words. It is a tensor with shape [batch_size, source_sequence_length] and its data type can be int or int64.

  • trg_word (Tensor) – The ids of the target sequence words. It is a tensor with shape [batch_size, target_sequence_length] and its data type can be int or int64.


Output tensor of the final layer of the model whose data type can be float32 or float64 with shape [batch_size, sequence_length, vocab_size].

Return type:



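import paddle
from paddlenlp.transformers import TransformerModel

# Illustrative hyperparameters; the argument names follow the constructor signature.
transformer = TransformerModel(
    src_vocab_size=30000, trg_vocab_size=30000, max_length=257,
    num_encoder_layers=6, num_decoder_layers=6, n_head=8,
    d_model=512, d_inner_hid=2048, dropout=0.1,
    weight_sharing=True, bos_id=0, eos_id=1)

batch_size = 5
seq_len = 10
predict = transformer(
    src_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]),
    trg_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]))
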
export_params(init_from_params, place)[source]#

This method loads a static graph from a dygraph checkpoint, or exports an inference model using a static graph. It does NOT support the faster encoder.

  • init_from_params (string) – The path to dygraph checkpoint.

  • place (paddle.Place) – The place to execute static graph.


class TransformerGenerator(src_vocab_size, trg_vocab_size, max_length, num_encoder_layers, num_decoder_layers, n_head, d_model, d_inner_hid, dropout, weight_sharing, bos_id=0, eos_id=1, pad_id=None, beam_size=4, max_out_len=256, activation='relu', normalize_before=True, **kwargs)[source]#

Bases: Layer

The Transformer model for auto-regressive generation with beam search. It wraps FasterTransformer and InferTransformerModel, and automatically chooses between FasterTransformer (with JIT building) and the slower version, InferTransformerModel.

  • src_vocab_size (int) – The size of source vocabulary.

  • trg_vocab_size (int) – The size of target vocabulary.

  • max_length (int) – The maximum length of input sequences.

  • num_encoder_layers (int) – The number of sub-layers to be stacked in the encoder.

  • num_decoder_layers (int) – The number of sub-layers to be stacked in the decoder.

  • n_head (int) – The number of heads used in multi-head attention.

  • d_model (int) – The dimension for word embeddings, which is also the last dimension of the input and output of multi-head attention, position-wise feed-forward networks, encoder and decoder.

  • d_inner_hid (int) – Size of the hidden layer in position-wise feed-forward networks.

  • dropout (float) – Dropout rates. Used for pre-process, activation and inside attention.

  • weight_sharing (bool) – Whether to use weight sharing.

  • bos_id (int, optional) – The start token id, which is also used as the padding id. Defaults to 0.

  • eos_id (int, optional) – The end token id. Defaults to 1.

  • beam_size (int, optional) – The beam width for beam search. Defaults to 4.

  • max_out_len (int, optional) – The maximum output length. Defaults to 256.

  • activation (str, optional) – The activation used in FFN. Defaults to “relu”.

  • normalize_before (bool, optional) – Whether to apply pre-normalization. Defaults to True.

  • kwargs

    The keyword arguments can be output_time_major, use_ft, use_fp16_decoding, beam_search_version, rel_len, alpha and diversity_rate:

    • output_time_major (bool, optional): Indicates the data layout of the predicted Tensor. If False, the layout is batch major with shape [batch_size, seq_len, beam_size]. If True, the layout is time major with shape [seq_len, batch_size, beam_size]. Defaults to False.

    • use_ft (bool, optional): Whether to use FastGeneration for decoding. Defaults to True if not set.

    • use_fp16_decoding (bool, optional): Whether to use fp16 for decoding. Only works when using FastGeneration.

    • beam_search_version (str, optional): Indicates the beam search strategy. It can be ‘v1’ or ‘v2’. ‘v2’ selects the top beam_size * 2 beams and processes the top beam_size alive beams and the top beam_size finished beams among them separately, while ‘v1’ selects only the top beam_size beams and mixes alive and finished beams together. ‘v2’ always searches more and gets better results, since the number of alive beams always stays at beam_size, whereas the number of alive beams in ‘v1’ may decrease once the end token is reached. However, ‘v2’ always generates longer results, so it may require more computation and be slower.

    • rel_len (bool, optional): Indicates whether max_out_len is a length relative to that of the source text. Only works in v2 for now. It is suggested to set a small max_out_len and use rel_len=True. Defaults to False if not set.

    • alpha (float, optional): The exponent used in the length penalty calculation. Refer to GNMT. Only works in v2 for now. Defaults to 0.6 if not set.

    • diversity_rate (float, optional): Refer to A Simple, Fast Diverse Decoding Algorithm for Neural Generation for details. A larger diversity_rate leads to more diverse results. Setting diversity_rate == 0 is equivalent to naive beam search. Defaults to 0 if not set. NOTE: Only works when using FastGeneration for now.
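The alpha and rel_len arguments can be illustrated with a short sketch. It assumes the length penalty takes the form given in the GNMT paper, lp(Y) = ((5 + |Y|) / 6)^alpha, and it reads rel_len=True as "the output may be up to max_out_len tokens longer than the source". Both are assumptions about the underlying op, and the function names are hypothetical.

```python
def length_penalty(length, alpha=0.6):
    """GNMT-style length penalty: ((5 + length) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def normalized_score(sum_log_prob, length, alpha=0.6):
    """Divide the summed log-probability of a hypothesis by its length penalty."""
    return sum_log_prob / length_penalty(length, alpha)


def effective_max_len(src_len, max_out_len, rel_len=False):
    """Assumed reading of rel_len: max_out_len is added to the source length."""
    return src_len + max_out_len if rel_len else max_out_len


# With the same summed log-probability, the longer hypothesis is penalized less.
short_score = normalized_score(-6.0, length=5)
long_score = normalized_score(-6.0, length=10)
```

With alpha = 0 the penalty is always 1, so hypotheses are ranked by their raw summed log-probability, which favors shorter outputs.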

forward(src_word, trg_word=None)[source]#

Performs decoding for transformer model.

  • src_word (Tensor) – The ids of source sequence words. It is a tensor with shape [batch_size, source_sequence_length] and its data type can be int or int64.

  • trg_word (Tensor) – The ids of target sequence words. Normally, it should NOT be given. If it is given, forced decoding with the previous output tokens will be triggered. Defaults to None.


An int64 tensor containing the predicted ids. Its shape is [batch_size, seq_len, beam_size] or [seq_len, batch_size, beam_size], depending on output_time_major. However, when using FastGeneration with beam search v2, the beam dimension is doubled to include both the top beam_size alive beams and the top beam_size finished beams, so the tensor shape is [batch_size, seq_len, beam_size * 2] or [seq_len, batch_size, beam_size * 2].
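The shape rules above can be summarized in a small helper. This is only a sketch of the documented behaviour using plain Python lists; `predicted_ids_shape` and `select_beam` are hypothetical names, not PaddleNLP functions.

```python
def predicted_ids_shape(batch_size, seq_len, beam_size,
                        output_time_major=False, use_ft=True,
                        beam_search_version="v1"):
    """Return the documented shape of the predicted ids tensor."""
    # The beam dimension is doubled only for FastGeneration with beam search v2.
    doubled = use_ft and beam_search_version == "v2"
    beams = beam_size * 2 if doubled else beam_size
    if output_time_major:
        return [seq_len, batch_size, beams]
    return [batch_size, seq_len, beams]


def select_beam(ids, beam_idx=0):
    """From batch-major nested lists [batch, seq_len, beams], extract one beam per sample."""
    return [[step[beam_idx] for step in sample] for sample in ids]


# One sample, two time steps, two beams (toy values).
toy_ids = [[[1, 2], [3, 4]]]
first_beam = select_beam(toy_ids, 0)   # [[1, 3]]
second_beam = select_beam(toy_ids, 1)  # [[2, 4]]
```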

Return type:



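import paddle
from paddlenlp.ops import TransformerGenerator

# Illustrative hyperparameters; the argument names follow the constructor signature.
transformer = TransformerGenerator(
    src_vocab_size=30000, trg_vocab_size=30000, max_length=256,
    num_encoder_layers=6, num_decoder_layers=6, n_head=8,
    d_model=512, d_inner_hid=2048, dropout=0.1,
    weight_sharing=True, bos_id=0, eos_id=1,
    beam_size=4, max_out_len=256)

batch_size = 5
seq_len = 10
transformer(
    src_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]))
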
class FasterOPT(model, decoding_lib=None, use_fp16_decoding=False)[source]#

Bases: OPTPretrainedModel

forward(input_ids, seq_len=None, attention_mask=None, top_k=4, top_p=0.0, max_length=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=0, decode_strategy='sample', num_return_sequences=1, **model_kwargs)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

generate(input_ids, seq_len=None, attention_mask=None, top_k=4, top_p=0.0, max_length=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=0, decode_strategy='sample', num_return_sequences=1, **model_kwargs)#

The interface for generation task. This method can generate sequences by using decoding strategy. Currently, there are three decoding strategies supported: “greedy_search”, “sampling” and “beam_search”.

  • input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Default to None, which we will initialize it as a Tensor with shape [1, 1], filled with the value bos_token_id.

  • generation_config (GenerationConfig, optional) – The generation configuration to be used as the base parametrization for the generation call. **kwargs passed to generate that match the attributes of generation_config will override them. If generation_config is not provided, the default is used, which is loaded with the following priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Note that unspecified parameters inherit the default values of GenerationConfig, whose documentation should be checked when parameterizing generation.

  • stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from the arguments and the generation config. If a stopping criterion is passed that has already been created from the arguments or the generation config, an error is raised. This feature is intended for advanced users.

  • streamer (BaseStreamer, optional) – Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids) and the streamer is responsible for any further processing.

  • synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden this flag will be set to True under DeepSpeed ZeRO Stage 3 multiple GPUs environment to avoid hanging if one GPU finished generating before other GPUs. Otherwise it’ll be set to False.

  • kwargs (dict) – It can be used to specify additional kwargs passed to the model.


A tuple containing two elements, ids and scores. Each element is a Tensor.

With the fields:

  • ids (Tensor):

    The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. The data type is the same as that of input_ids.

  • scores (Tensor):

    The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, matching the parameters of the model.
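Because the first dimension of both outputs is batch_size * num_return_sequences, it is often convenient to regroup the rows per input. The sketch below works on plain Python lists (for example after calling .tolist() on the tensors) and assumes that the rows belonging to one input are stored contiguously. That ordering is an assumption, not something stated above, and `group_by_input` is a hypothetical helper.

```python
def group_by_input(rows, num_return_sequences):
    """Split [batch_size * n, ...] rows into batch_size groups of n rows each."""
    if num_return_sequences <= 0 or len(rows) % num_return_sequences != 0:
        raise ValueError("row count must be a multiple of num_return_sequences")
    return [rows[start:start + num_return_sequences]
            for start in range(0, len(rows), num_return_sequences)]


# Toy outputs for batch_size=2 and num_return_sequences=2.
ids = [[11, 12], [13, 14], [21, 22], [23, 24]]
scores = [[-0.1], [-0.4], [-0.2], [-0.3]]
grouped_ids = group_by_input(ids, 2)
grouped_scores = group_by_input(scores, 2)
```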

Return type:



import paddle
from paddlenlp.generation import GenerationConfig
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer,
)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
# The history reads: "Good morning, the air quality is nice today."
history = "早上好,今天空气质量不错。"
inputs = tokenizer.dialogue_encode(history, task_type='chitchat',
    add_start_token_as_response=True, return_tensors=True)

# Generate the sequence by using "greedy_search" strategy
# (call arguments follow the strategy named in each comment)
ids, scores = model.generate(**inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)
# [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)
# 是的  ("Yes")

# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling", top_k=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['天气好,心情也好', '你也是']  ("Nice weather, good mood too", "You too")

# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search", num_beams=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['是的', '嗯嗯']  ("Yes", "Mm-hm")

class FasterGPT(model, decoding_lib=None, use_fp16_decoding=False)[source]#

Bases: GPTPretrainedModel

forward(input_ids, seq_len=None, attention_mask=None, top_k=4, top_p=0.0, max_length=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=0, decode_strategy='sample', num_return_sequences=1, **model_kwargs)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

generate(input_ids, seq_len=None, attention_mask=None, top_k=4, top_p=0.0, max_length=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=0, decode_strategy='sample', num_return_sequences=1, **model_kwargs)#

The interface for generation task. This method can generate sequences by using decoding strategy. Currently, there are three decoding strategies supported: “greedy_search”, “sampling” and “beam_search”.

  • input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Default to None, which we will initialize it as a Tensor with shape [1, 1], filled with the value bos_token_id.

  • generation_config (GenerationConfig, optional) – The generation configuration to be used as the base parametrization for the generation call. **kwargs passed to generate that match the attributes of generation_config will override them. If generation_config is not provided, the default is used, which is loaded with the following priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Note that unspecified parameters inherit the default values of GenerationConfig, whose documentation should be checked when parameterizing generation.

  • stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from the arguments and the generation config. If a stopping criterion is passed that has already been created from the arguments or the generation config, an error is raised. This feature is intended for advanced users.

  • streamer (BaseStreamer, optional) – Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids) and the streamer is responsible for any further processing.

  • synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden this flag will be set to True under DeepSpeed ZeRO Stage 3 multiple GPUs environment to avoid hanging if one GPU finished generating before other GPUs. Otherwise it’ll be set to False.

  • kwargs (dict) – It can be used to specify additional kwargs passed to the model.


A tuple containing two elements, ids and scores. Each element is a Tensor.

With the fields:

  • ids (Tensor):

    The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. The data type is the same as that of input_ids.

  • scores (Tensor):

    The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, matching the parameters of the model.
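Each row of ids has the full sequence_length, so a generated row is usually cut at its stop token before it is converted back to text. Note that calling list.index directly raises ValueError when the stop token is absent, for instance when generation stopped at max_length. A defensive sketch follows; the helper name `truncate_at` and the token id 2 are hypothetical.

```python
def truncate_at(sequence_ids, stop_id):
    """Return the ids before the first stop_id, or all ids if it never occurs."""
    try:
        return sequence_ids[:sequence_ids.index(stop_id)]
    except ValueError:
        # No stop token was generated, so keep the whole sequence.
        return list(sequence_ids)


STOP_ID = 2  # hypothetical stop token id, used only for this illustration
finished = truncate_at([7, 8, 9, STOP_ID, 0, 0], STOP_ID)  # [7, 8, 9]
unfinished = truncate_at([7, 8, 9], STOP_ID)               # [7, 8, 9]
```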

Return type:



import paddle
from paddlenlp.generation import GenerationConfig
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer,
)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
# The history reads: "Good morning, the air quality is nice today."
history = "早上好,今天空气质量不错。"
inputs = tokenizer.dialogue_encode(history, task_type='chitchat',
    add_start_token_as_response=True, return_tensors=True)

# Generate the sequence by using "greedy_search" strategy
# (call arguments follow the strategy named in each comment)
ids, scores = model.generate(**inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)
# [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)
# 是的  ("Yes")

# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling", top_k=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['天气好,心情也好', '你也是']  ("Nice weather, good mood too", "You too")

# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search", num_beams=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['是的', '嗯嗯']  ("Yes", "Mm-hm")

class FasterUnifiedTransformer(model, decoding_lib=None, use_fp16_decoding=False)[source]#

Bases: UnifiedTransformerPretrainedModel

forward(input_ids, token_type_ids, attention_mask, seq_len=None, role_ids=None, position_ids=None, max_length=128, min_length=0, top_k=4, top_p=0.0, decode_strategy='sampling', bos_token_id=None, eos_token_id=None, pad_token_id=None, num_beams=4, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, **model_kwargs)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

generate(input_ids, token_type_ids, attention_mask, seq_len=None, role_ids=None, position_ids=None, max_length=128, min_length=0, top_k=4, top_p=0.0, decode_strategy='sampling', bos_token_id=None, eos_token_id=None, pad_token_id=None, num_beams=4, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, **model_kwargs)#

The interface for generation task. This method can generate sequences by using decoding strategy. Currently, there are three decoding strategies supported: “greedy_search”, “sampling” and “beam_search”.

  • input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Default to None, which we will initialize it as a Tensor with shape [1, 1], filled with the value bos_token_id.

  • generation_config (GenerationConfig, optional) – The generation configuration to be used as the base parametrization for the generation call. **kwargs passed to generate that match the attributes of generation_config will override them. If generation_config is not provided, the default is used, which is loaded with the following priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Note that unspecified parameters inherit the default values of GenerationConfig, whose documentation should be checked when parameterizing generation.

  • stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from the arguments and the generation config. If a stopping criterion is passed that has already been created from the arguments or the generation config, an error is raised. This feature is intended for advanced users.

  • streamer (BaseStreamer, optional) – Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids) and the streamer is responsible for any further processing.

  • synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden this flag will be set to True under DeepSpeed ZeRO Stage 3 multiple GPUs environment to avoid hanging if one GPU finished generating before other GPUs. Otherwise it’ll be set to False.

  • kwargs (dict) – It can be used to specify additional kwargs passed to the model.


A tuple containing two elements, ids and scores. Each element is a Tensor.

With the fields:

  • ids (Tensor):

    The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. The data type is the same as that of input_ids.

  • scores (Tensor):

    The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, matching the parameters of the model.

Return type:



import paddle
from paddlenlp.generation import GenerationConfig
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer,
)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
# The history reads: "Good morning, the air quality is nice today."
history = "早上好,今天空气质量不错。"
inputs = tokenizer.dialogue_encode(history, task_type='chitchat',
    add_start_token_as_response=True, return_tensors=True)

# Generate the sequence by using "greedy_search" strategy
# (call arguments follow the strategy named in each comment)
ids, scores = model.generate(**inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)
# [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)
# 是的  ("Yes")

# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling", top_k=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['天气好,心情也好', '你也是']  ("Nice weather, good mood too", "You too")

# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search", num_beams=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['是的', '嗯嗯']  ("Yes", "Mm-hm")

class FasterUNIMOText(model, decoding_lib=None, use_fp16_decoding=False, **kwargs)[source]#

Bases: UNIMOPretrainedModel

forward(input_ids, token_type_ids, attention_mask, seq_len=None, max_length=128, min_length=0, top_k=4, top_p=0.0, num_beams=4, decode_strategy='sampling', bos_token_id=None, eos_token_id=None, pad_token_id=None, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, position_ids=None, **model_kwargs)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

generate(input_ids, token_type_ids, attention_mask, seq_len=None, max_length=128, min_length=0, top_k=4, top_p=0.0, num_beams=4, decode_strategy='sampling', bos_token_id=None, eos_token_id=None, pad_token_id=None, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, position_ids=None, **model_kwargs)#

The interface for generation task. This method can generate sequences by using decoding strategy. Currently, there are three decoding strategies supported: “greedy_search”, “sampling” and “beam_search”.

  • input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Default to None, which we will initialize it as a Tensor with shape [1, 1], filled with the value bos_token_id.

  • generation_config (GenerationConfig, optional) – The generation configuration to be used as the base parametrization for the generation call. **kwargs passed to generate that match the attributes of generation_config will override them. If generation_config is not provided, the default is used, which is loaded with the following priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Note that unspecified parameters inherit the default values of GenerationConfig, whose documentation should be checked when parameterizing generation.

  • stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from the arguments and the generation config. If a stopping criterion is passed that has already been created from the arguments or the generation config, an error is raised. This feature is intended for advanced users.

  • streamer (BaseStreamer, optional) – Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids) and the streamer is responsible for any further processing.

  • synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden this flag will be set to True under DeepSpeed ZeRO Stage 3 multiple GPUs environment to avoid hanging if one GPU finished generating before other GPUs. Otherwise it’ll be set to False.

  • kwargs (dict) – It can be used to specify additional kwargs passed to the model.


A tuple containing two elements, ids and scores. Each element is a Tensor.

With the fields:

  • ids (Tensor):

    The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. The data type is the same as that of input_ids.

  • scores (Tensor):

    The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, matching the parameters of the model.

Return type:



import paddle
from paddlenlp.generation import GenerationConfig
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer,
)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
# The history reads: "Good morning, the air quality is nice today."
history = "早上好,今天空气质量不错。"
inputs = tokenizer.dialogue_encode(history, task_type='chitchat',
    add_start_token_as_response=True, return_tensors=True)

# Generate the sequence by using "greedy_search" strategy
# (call arguments follow the strategy named in each comment)
ids, scores = model.generate(**inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)
# [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)
# 是的  ("Yes")

# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling", top_k=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['天气好,心情也好', '你也是']  ("Nice weather, good mood too", "You too")

# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search", num_beams=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['是的', '嗯嗯']  ("Yes", "Mm-hm")

class FasterMIRO(model, decoding_lib=None, use_fp16_decoding=False, **kwargs)[source]#

Bases: UNIMOPretrainedModel

forward(input_ids, token_type_ids, attention_mask, seq_len=None, max_length=128, min_length=0, top_k=4, top_p=0.0, num_beams=4, decode_strategy='sampling', bos_token_id=None, eos_token_id=None, pad_token_id=None, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, position_ids=None, **model_kwargs)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

generate(input_ids, token_type_ids, attention_mask, seq_len=None, max_length=128, min_length=0, top_k=4, top_p=0.0, num_beams=4, decode_strategy='sampling', bos_token_id=None, eos_token_id=None, pad_token_id=None, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, position_ids=None, **model_kwargs)#

The interface for generation tasks. This method generates sequences using a decoding strategy. Three decoding strategies are currently supported: “greedy_search”, “sampling” and “beam_search”.

  • input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Defaults to None, in which case it is initialized as a Tensor with shape [1, 1] filled with the value bos_token_id.

  • generation_config (GenerationConfig, optional) – The generation configuration used as the base parametrization for the generation call. **kwargs passed to generate that match attributes of generation_config override them. If generation_config is not provided, the default is used, which is loaded with the following priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Unspecified parameters inherit the default values of GenerationConfig; check its documentation when parameterizing generation.

  • stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from the arguments and the generation config. Passing a stopping criterion that is already created from the arguments or the generation config raises an error. This feature is intended for advanced users.

  • streamer (BaseStreamer, optional) – Streamer object used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids), and the streamer is responsible for any further processing.

  • synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden, this flag is set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment to avoid hanging if one GPU finishes generating before the others. Otherwise it is set to False.

  • kwargs (dict) – It can be used to specify additional kwargs passed to the model.
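The streamer contract described above (tokens handed over through streamer.put(token_ids), with all further processing left to the streamer) can be sketched in plain Python. This is a minimal illustration only: ListStreamer and fake_generate are hypothetical names, the end() hook is an assumption not described here, and the real BaseStreamer interface may differ.

```python
class ListStreamer:
    """Hypothetical streamer that collects token ids pushed by a generation loop."""

    def __init__(self):
        self.chunks = []        # one entry per put() call
        self.finished = False

    def put(self, token_ids):
        # Called with the ids produced at one decoding step.
        self.chunks.append(list(token_ids))

    def end(self):
        # Assumed end-of-generation hook (not part of the text above).
        self.finished = True


def fake_generate(steps, streamer):
    """Stand-in for a generation loop; `steps` holds the ids emitted per step."""
    for token_ids in steps:
        streamer.put(token_ids)
    streamer.end()


streamer = ListStreamer()
fake_generate([[11], [42], [7]], streamer)
tokens = [tid for chunk in streamer.chunks for tid in chunk]
print(tokens)  # [11, 42, 7]
```

A streamer written this way can decode or forward each chunk as it arrives instead of waiting for the full sequence.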


A tuple containing two elements: ids and scores. Each element is a Tensor.

With the fields:

  • ids (Tensor):

    The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. The data type is the same as that of input_ids.

  • scores (Tensor):

    The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, matching the parameters of the model.

Return type: tuple[Tensor]

import paddle
from paddlenlp.generation import GenerationConfig
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer,
)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
history = "早上好,今天空气质量不错。"  # "Good morning, the air quality is nice today."
inputs = tokenizer.dialogue_encode(history, task_type='chitchat',
    add_start_token_as_response=True, return_tensors=True)

# Generate the sequence by using "greedy_search" strategy
ids, scores = model.generate(
    **inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)
# [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)
# 是的 ("Yes")

# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling", top_k=5, num_return_sequences=2)
ids, scores = model.generate(
    **inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['天气好,心情也好', '你也是'] ("Good weather, good mood too", "You too")

# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search", num_beams=5, num_return_sequences=2)
ids, scores = model.generate(
    **inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['是的', '嗯嗯'] ("Yes", "Mm-hmm")
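The example above trims each generated sequence with list.index, which raises ValueError when the separator id never appears (for instance, when generation stops at max_length first). A small defensive variant might look like the sketch below; truncate_at and the id value 2 are hypothetical and only stand in for tokenizer.sep_token_id.

```python
def truncate_at(ids, stop_id):
    """Return ids up to (not including) the first stop_id, or all ids if it is absent."""
    try:
        return ids[:ids.index(stop_id)]
    except ValueError:
        # No stop token was generated; keep the whole sequence.
        return ids


SEP = 2  # hypothetical separator id
with_sep = truncate_at([5, 9, 2, 0, 0], SEP)
without_sep = truncate_at([5, 9, 4], SEP)
print(with_sep)     # [5, 9]
print(without_sep)  # [5, 9, 4]
```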
class FasterBART(model, decoding_lib=None, use_fp16_decoding=False, enable_fast_encoder=True)[source]#

Bases: BartPretrainedModel

enable_faster_encoder_func(use_fp16=False, encoder_lib=None)#

Compiles the fused encoder operator integrated in FastGeneration using JIT (just-in-time) compilation, and replaces the forward function of the paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer objects inherited from self to support inference using FastGeneration.


from paddlenlp.ops import enable_fast_encoder, disable_fast_encoder

model = enable_fast_encoder(model)
enc_out = model(src, src_mask)
model = disable_fast_encoder(model)
forward(input_ids=None, encoder_output=None, seq_len=None, num_beams=4, top_k=1, top_p=0.0, temperature=1.0, decode_strategy='beam_search', bos_token_id=None, eos_token_id=None, pad_token_id=None, decoder_start_token_id=None, min_length=0, max_length=20, diversity_rate=0.0, length_penalty=0.6, num_return_sequences=1, early_stopping=False, forced_eos_token_id=None, **model_kwargs)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

generate(input_ids=None, encoder_output=None, seq_len=None, num_beams=4, top_k=1, top_p=0.0, temperature=1.0, decode_strategy='beam_search', bos_token_id=None, eos_token_id=None, pad_token_id=None, decoder_start_token_id=None, min_length=0, max_length=20, diversity_rate=0.0, length_penalty=0.6, num_return_sequences=1, early_stopping=False, forced_eos_token_id=None, **model_kwargs)#

The interface for generation tasks. This method generates sequences using a decoding strategy. Three decoding strategies are currently supported: “greedy_search”, “sampling” and “beam_search”.

  • input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Defaults to None, in which case it is initialized as a Tensor with shape [1, 1] filled with the value bos_token_id.

  • generation_config (GenerationConfig, optional) – The generation configuration used as the base parametrization for the generation call. **kwargs passed to generate that match attributes of generation_config override them. If generation_config is not provided, the default is used, which is loaded with the following priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Unspecified parameters inherit the default values of GenerationConfig; check its documentation when parameterizing generation.

  • stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from the arguments and the generation config. Passing a stopping criterion that is already created from the arguments or the generation config raises an error. This feature is intended for advanced users.

  • streamer (BaseStreamer, optional) – Streamer object used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids), and the streamer is responsible for any further processing.

  • synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden, this flag is set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment to avoid hanging if one GPU finishes generating before the others. Otherwise it is set to False.

  • kwargs (dict) – It can be used to specify additional kwargs passed to the model.
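The precedence rule for generation_config (keyword arguments that match its attributes override it, and the remaining keyword arguments go to the model) can be illustrated with a toy configuration. ToyGenerationConfig and resolve_config below are hypothetical stand-ins written for this sketch; they are not the library's GenerationConfig or its internal logic.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ToyGenerationConfig:
    """Hypothetical config with a few generation attributes."""
    decode_strategy: str = "greedy_search"
    top_k: int = 1
    num_beams: int = 1
    num_return_sequences: int = 1


def resolve_config(generation_config=None, **kwargs):
    """Apply matching kwargs on top of the config; return leftovers as model kwargs."""
    base = ToyGenerationConfig() if generation_config is None else generation_config
    known = {f.name for f in fields(base)}
    overrides = {k: v for k, v in kwargs.items() if k in known}
    model_kwargs = {k: v for k, v in kwargs.items() if k not in known}
    return replace(base, **overrides), model_kwargs


cfg = ToyGenerationConfig(decode_strategy="sampling", top_k=5)
resolved, model_kwargs = resolve_config(cfg, top_k=10, attention_mask="mask")
print(resolved.top_k, resolved.decode_strategy)  # 10 sampling
print(model_kwargs)                              # {'attention_mask': 'mask'}
```

The original cfg object is left untouched, so one base configuration can be reused across calls with different per-call overrides.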


A tuple containing two elements: ids and scores. Each element is a Tensor.

With the fields:

  • ids (Tensor):

    The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. The data type is the same as that of input_ids.

  • scores (Tensor):

    The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, matching the parameters of the model.

Return type: tuple[Tensor]

import paddle
from paddlenlp.generation import GenerationConfig
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer,
)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
history = "早上好,今天空气质量不错。"  # "Good morning, the air quality is nice today."
inputs = tokenizer.dialogue_encode(history, task_type='chitchat',
    add_start_token_as_response=True, return_tensors=True)

# Generate the sequence by using "greedy_search" strategy
ids, scores = model.generate(
    **inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)
# [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)
# 是的 ("Yes")

# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling", top_k=5, num_return_sequences=2)
ids, scores = model.generate(
    **inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['天气好,心情也好', '你也是'] ("Good weather, good mood too", "You too")

# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search", num_beams=5, num_return_sequences=2)
ids, scores = model.generate(
    **inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['是的', '嗯嗯'] ("Yes", "Mm-hmm")
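Because ids and scores are returned with a flattened first dimension of batch_size * num_return_sequences, callers with more than one input usually regroup the rows per input. The sketch below assumes that rows belonging to the same input are contiguous, which is an assumption to verify against the actual output; group_by_input is a hypothetical helper and the ids are toy values.

```python
def group_by_input(rows, num_return_sequences):
    """Split flat rows into consecutive groups of num_return_sequences."""
    if num_return_sequences <= 0 or len(rows) % num_return_sequences != 0:
        raise ValueError("row count must be a positive multiple of num_return_sequences")
    return [rows[i:i + num_return_sequences]
            for i in range(0, len(rows), num_return_sequences)]


# Toy ids for batch_size=2 inputs with num_return_sequences=2.
flat_ids = [[7, 8], [7, 9], [3, 4], [3, 5]]
grouped = group_by_input(flat_ids, 2)
print(grouped)  # [[[7, 8], [7, 9]], [[3, 4], [3, 5]]]
```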
class FasterMBART(model, decoding_lib=None, use_fp16_decoding=False, enable_fast_encoder=False)[source]#

Bases: MBartPretrainedModel

enable_faster_encoder_func(use_fp16=False, encoder_lib=None)#

Compiles the fused encoder operator integrated in FastGeneration using JIT (just-in-time) compilation, and replaces the forward function of the paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer objects inherited from self to support inference using FastGeneration.


from paddlenlp.ops import enable_fast_encoder, disable_fast_encoder

model = enable_fast_encoder(model)
enc_out = model(src, src_mask)
model = disable_fast_encoder(model)
forward(input_ids=None, encoder_output=None, seq_len=None, forced_bos_token_id=None, num_beams=4, top_k=1, top_p=0.0, decode_strategy='beam_search_v3', bos_token_id=None, eos_token_id=None, pad_token_id=None, decoder_start_token_id=None, max_length=256, diversity_rate=0.0, length_penalty=0.6, temperature=1.0, num_return_sequences=1, early_stopping=False, forced_eos_token_id=None, **model_kwargs)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

generate(input_ids=None, encoder_output=None, seq_len=None, forced_bos_token_id=None, num_beams=4, top_k=1, top_p=0.0, decode_strategy='beam_search_v3', bos_token_id=None, eos_token_id=None, pad_token_id=None, decoder_start_token_id=None, max_length=256, diversity_rate=0.0, length_penalty=0.6, temperature=1.0, num_return_sequences=1, early_stopping=False, forced_eos_token_id=None, **model_kwargs)#

The interface for generation tasks. This method generates sequences using a decoding strategy. Three decoding strategies are currently supported: “greedy_search”, “sampling” and “beam_search”.

  • input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Defaults to None, in which case it is initialized as a Tensor with shape [1, 1] filled with the value bos_token_id.

  • generation_config (GenerationConfig, optional) – The generation configuration used as the base parametrization for the generation call. **kwargs passed to generate that match attributes of generation_config override them. If generation_config is not provided, the default is used, which is loaded with the following priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Unspecified parameters inherit the default values of GenerationConfig; check its documentation when parameterizing generation.

  • stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from the arguments and the generation config. Passing a stopping criterion that is already created from the arguments or the generation config raises an error. This feature is intended for advanced users.

  • streamer (BaseStreamer, optional) – Streamer object used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids), and the streamer is responsible for any further processing.

  • synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden, this flag is set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment to avoid hanging if one GPU finishes generating before the others. Otherwise it is set to False.

  • kwargs (dict) – It can be used to specify additional kwargs passed to the model.
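As a rough picture of how custom stopping criteria complement the default ones, the sketch below runs a toy decoding loop that stops as soon as any criterion fires. MaxNewTokens, StopOnToken and run_loop are hypothetical; the call signature used here (a plain list of generated ids) is a simplification and not the signature of the library's stopping-criteria classes.

```python
class MaxNewTokens:
    """Hypothetical criterion: stop once `limit` tokens have been generated."""

    def __init__(self, limit):
        self.limit = limit

    def __call__(self, generated_ids):
        return len(generated_ids) >= self.limit


class StopOnToken:
    """Hypothetical criterion: stop when the last token equals `token_id`."""

    def __init__(self, token_id):
        self.token_id = token_id

    def __call__(self, generated_ids):
        return bool(generated_ids) and generated_ids[-1] == self.token_id


def run_loop(next_tokens, criteria):
    """Consume tokens one at a time until any criterion returns True."""
    generated = []
    for token in next_tokens:
        generated.append(token)
        if any(criterion(generated) for criterion in criteria):
            break
    return generated


stopped_on_token = run_loop([4, 8, 15, 16, 23], [MaxNewTokens(10), StopOnToken(15)])
stopped_on_length = run_loop([4, 8, 15, 16, 23], [MaxNewTokens(2), StopOnToken(15)])
print(stopped_on_token)   # [4, 8, 15]
print(stopped_on_length)  # [4, 8]
```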


A tuple containing two elements: ids and scores. Each element is a Tensor.

With the fields:

  • ids (Tensor):

    The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. The data type is the same as that of input_ids.

  • scores (Tensor):

    The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, matching the parameters of the model.

Return type: tuple[Tensor]

import paddle
from paddlenlp.generation import GenerationConfig
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer,
)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
history = "早上好,今天空气质量不错。"  # "Good morning, the air quality is nice today."
inputs = tokenizer.dialogue_encode(history, task_type='chitchat',
    add_start_token_as_response=True, return_tensors=True)

# Generate the sequence by using "greedy_search" strategy
ids, scores = model.generate(
    **inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)
# [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)
# 是的 ("Yes")

# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling", top_k=5, num_return_sequences=2)
ids, scores = model.generate(
    **inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['天气好,心情也好', '你也是'] ("Good weather, good mood too", "You too")

# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search", num_beams=5, num_return_sequences=2)
ids, scores = model.generate(
    **inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['是的', '嗯嗯'] ("Yes", "Mm-hmm")
class FasterGPTJ(model, decoding_lib=None, use_fp16_decoding=False)[source]#

Bases: GPTJPretrainedModel

forward(input_ids, seq_len=None, attention_mask=None, top_k=4, top_p=0.0, min_length=0, max_length=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=0, repetition_penalty=1.0, decode_strategy='sampling', num_return_sequences=1, **model_kwargs)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

generate(input_ids, seq_len=None, attention_mask=None, top_k=4, top_p=0.0, min_length=0, max_length=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=0, repetition_penalty=1.0, decode_strategy='sampling', num_return_sequences=1, **model_kwargs)#

The interface for generation tasks. This method generates sequences using a decoding strategy. Three decoding strategies are currently supported: “greedy_search”, “sampling” and “beam_search”.

  • input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Defaults to None, in which case it is initialized as a Tensor with shape [1, 1] filled with the value bos_token_id.

  • generation_config (GenerationConfig, optional) – The generation configuration used as the base parametrization for the generation call. **kwargs passed to generate that match attributes of generation_config override them. If generation_config is not provided, the default is used, which is loaded with the following priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Unspecified parameters inherit the default values of GenerationConfig; check its documentation when parameterizing generation.

  • stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from the arguments and the generation config. Passing a stopping criterion that is already created from the arguments or the generation config raises an error. This feature is intended for advanced users.

  • streamer (BaseStreamer, optional) – Streamer object used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids), and the streamer is responsible for any further processing.

  • synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden, this flag is set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment to avoid hanging if one GPU finishes generating before the others. Otherwise it is set to False.

  • kwargs (dict) – It can be used to specify additional kwargs passed to the model.
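To make the difference between two of the decoding strategies named above concrete, the sketch below performs a single decoding step over toy logits: greedy search takes the arg-max, while top-k sampling draws from the k highest-scoring tokens after temperature scaling. The helper names and logits are hypothetical, beam search is omitted for brevity, and this is a textbook illustration rather than the FastGeneration kernels.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_step(logits):
    """Pick the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_sample_step(logits, k, temperature=1.0, rng=random):
    """Sample one index among the k highest logits, weighted by softmax."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top], temperature)
    return rng.choices(top, weights=probs, k=1)[0]


logits = [0.1, 2.5, 0.3, 1.9, -1.0]  # toy scores for a 5-token vocabulary
rng = random.Random(0)
greedy_choice = greedy_step(logits)
sampled = {top_k_sample_step(logits, k=2, rng=rng) for _ in range(200)}
print(greedy_choice)  # 1
print(sampled <= {1, 3})  # True: only the two best tokens can be drawn
```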


A tuple containing two elements: ids and scores. Each element is a Tensor.

With the fields:

  • ids (Tensor):

    The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. The data type is the same as that of input_ids.

  • scores (Tensor):

    The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, matching the parameters of the model.

Return type: tuple[Tensor]

import paddle
from paddlenlp.generation import GenerationConfig
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer,
)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
history = "早上好,今天空气质量不错。"  # "Good morning, the air quality is nice today."
inputs = tokenizer.dialogue_encode(history, task_type='chitchat',
    add_start_token_as_response=True, return_tensors=True)

# Generate the sequence by using "greedy_search" strategy
ids, scores = model.generate(
    **inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)
# [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)
# 是的 ("Yes")

# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling", top_k=5, num_return_sequences=2)
ids, scores = model.generate(
    **inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['天气好,心情也好', '你也是'] ("Good weather, good mood too", "You too")

# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search", num_beams=5, num_return_sequences=2)
ids, scores = model.generate(
    **inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['是的', '嗯嗯'] ("Yes", "Mm-hmm")
class FasterCodeGen(model, decoding_lib=None, use_fp16_decoding=False)[source]#

Bases: CodeGenPreTrainedModel

forward(input_ids, seq_len=None, attention_mask=None, top_k=4, top_p=0.0, min_length=0, max_length=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=0, repetition_penalty=1.0, decode_strategy='sampling', num_return_sequences=1, **model_kwargs)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

generate(input_ids, seq_len=None, attention_mask=None, top_k=4, top_p=0.0, min_length=0, max_length=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=0, repetition_penalty=1.0, decode_strategy='sampling', num_return_sequences=1, **model_kwargs)#

The interface for generation tasks. This method generates sequences using a decoding strategy. Three decoding strategies are currently supported: “greedy_search”, “sampling” and “beam_search”.

  • input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Defaults to None, in which case it is initialized as a Tensor with shape [1, 1] filled with the value bos_token_id.

  • generation_config (GenerationConfig, optional) – The generation configuration used as the base parametrization for the generation call. **kwargs passed to generate that match attributes of generation_config override them. If generation_config is not provided, the default is used, which is loaded with the following priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Unspecified parameters inherit the default values of GenerationConfig; check its documentation when parameterizing generation.

  • stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from the arguments and the generation config. Passing a stopping criterion that is already created from the arguments or the generation config raises an error. This feature is intended for advanced users.

  • streamer (BaseStreamer, optional) – Streamer object used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids), and the streamer is responsible for any further processing.

  • synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden, this flag is set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment to avoid hanging if one GPU finishes generating before the others. Otherwise it is set to False.

  • kwargs (dict) – It can be used to specify additional kwargs passed to the model.
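The motivation for synced_gpus can be seen in a small simulation: when parameters are sharded across devices, every rank has to take part in each forward step, so a rank that finishes early must keep iterating in lockstep with the others. The sketch below is purely illustrative. run_ranks is a hypothetical function, and the stopping rule it uses in synced mode (stop once every rank has finished or max_length is reached) is an assumption about one possible implementation, not a description of the library's loop.

```python
def run_ranks(steps_needed, max_length, synced):
    """Return how many loop iterations each simulated rank executes.

    steps_needed[r] is the number of steps rank r needs to finish its own sequence.
    """
    iterations = [0] * len(steps_needed)
    for step in range(max_length):
        finished = [step >= need for need in steps_needed]
        if synced:
            # All ranks iterate together until every rank is finished.
            if all(finished):
                break
            active = list(range(len(steps_needed)))
        else:
            # Each rank leaves the loop as soon as it is finished.
            active = [rank for rank, done in enumerate(finished) if not done]
            if not active:
                break
        for rank in active:
            iterations[rank] += 1
    return iterations


unsynced = run_ranks([3, 5], max_length=8, synced=False)
synced = run_ranks([3, 5], max_length=8, synced=True)
print(unsynced)  # [3, 5]
print(synced)    # [5, 5]
```

In the unsynced run the ranks execute different numbers of steps, which is the situation that would leave a collective operation waiting on a rank that has already exited.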


A tuple containing two elements: ids and scores. Each element is a Tensor.

With the fields:

  • ids (Tensor):

    The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. The data type is the same as that of input_ids.

  • scores (Tensor):

    The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, matching the parameters of the model.

Return type: tuple[Tensor]

import paddle
from paddlenlp.generation import GenerationConfig
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer,
)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
history = "早上好,今天空气质量不错。"  # "Good morning, the air quality is nice today."
inputs = tokenizer.dialogue_encode(history, task_type='chitchat',
    add_start_token_as_response=True, return_tensors=True)

# Generate the sequence by using "greedy_search" strategy
ids, scores = model.generate(
    **inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)
# [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)
# 是的 ("Yes")

# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling", top_k=5, num_return_sequences=2)
ids, scores = model.generate(
    **inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['天气好,心情也好', '你也是'] ("Good weather, good mood too", "You too")

# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search", num_beams=5, num_return_sequences=2)
ids, scores = model.generate(
    **inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['是的', '嗯嗯'] ("Yes", "Mm-hmm")
class FasterPegasus(model, decoding_lib=None, use_fp16_decoding=False, enable_fast_encoder=False, **kwargs)[source]#

Bases: PegasusPretrainedModel

enable_faster_encoder_func(use_fp16=False, encoder_lib=None)#

Compiles the fused encoder operator integrated in FastGeneration using JIT (just-in-time) compilation, and replaces the forward function of the paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer objects inherited from self to support inference using FastGeneration.


from paddlenlp.ops import enable_fast_encoder, disable_fast_encoder

model = enable_fast_encoder(model)
enc_out = model(src, src_mask)
model = disable_fast_encoder(model)
forward(input_ids=None, encoder_output=None, seq_len=None, min_length=0, max_length=256, num_beams=4, decode_strategy='beam_search_v3', decoder_start_token_id=None, bos_token_id=None, eos_token_id=None, pad_token_id=None, diversity_rate=0.0, length_penalty=0.6, top_k=1, top_p=0.0, temperature=1.0, num_return_sequences=1, early_stopping=False, forced_bos_token_id=None, forced_eos_token_id=None, **model_kwargs)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

generate(input_ids=None, encoder_output=None, seq_len=None, min_length=0, max_length=256, num_beams=4, decode_strategy='beam_search_v3', decoder_start_token_id=None, bos_token_id=None, eos_token_id=None, pad_token_id=None, diversity_rate=0.0, length_penalty=0.6, top_k=1, top_p=0.0, temperature=1.0, num_return_sequences=1, early_stopping=False, forced_bos_token_id=None, forced_eos_token_id=None, **model_kwargs)#

The interface for generation tasks. This method generates sequences using a decoding strategy. Three decoding strategies are currently supported: “greedy_search”, “sampling” and “beam_search”.

  • input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Defaults to None, in which case it is initialized as a Tensor with shape [1, 1] filled with the value bos_token_id.

  • generation_config (GenerationConfig, optional) – The generation configuration used as the base parametrization for the generation call. **kwargs passed to generate that match attributes of generation_config override them. If generation_config is not provided, the default is used, which is loaded with the following priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Unspecified parameters inherit the default values of GenerationConfig; check its documentation when parameterizing generation.

  • stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from the arguments and the generation config. Passing a stopping criterion that is already created from the arguments or the generation config raises an error. This feature is intended for advanced users.

  • streamer (BaseStreamer, optional) – Streamer object used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids), and the streamer is responsible for any further processing.

  • synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden, this flag is set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment to avoid hanging if one GPU finishes generating before the others. Otherwise it is set to False.

  • kwargs (dict) – It can be used to specify additional kwargs passed to the model.


A tuple containing two elements: ids and scores. Each element is a Tensor.

With the fields:

  • ids (Tensor):

    The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. The data type is the same as that of input_ids.

  • scores (Tensor):

    The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, matching the parameters of the model.

Return type: tuple[Tensor]

import paddle
from paddlenlp.generation import GenerationConfig
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer,
)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
history = "早上好,今天空气质量不错。"  # "Good morning, the air quality is nice today."
inputs = tokenizer.dialogue_encode(history, task_type='chitchat',
    add_start_token_as_response=True, return_tensors=True)

# Generate the sequence by using "greedy_search" strategy
ids, scores = model.generate(
    **inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)
# [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)
# 是的 ("Yes")

# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling", top_k=5, num_return_sequences=2)
ids, scores = model.generate(
    **inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['天气好,心情也好', '你也是'] ("Good weather, good mood too", "You too")

# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search", num_beams=5, num_return_sequences=2)
ids, scores = model.generate(
    **inputs, generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['是的', '嗯嗯'] ("Yes", "Mm-hmm")
class FasterT5(model, decoding_lib=None, use_fp16_decoding=False)[source]#

Bases: T5PretrainedModel

forward(input_ids=None, encoder_output=None, seq_len=None, max_length=128, min_length=0, top_k=4, top_p=0.0, num_beams=4, decode_strategy='sampling', decoder_start_token_id=None, bos_token_id=None, eos_token_id=None, pad_token_id=None, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, **model_kwargs)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

generate(input_ids=None, encoder_output=None, seq_len=None, max_length=128, min_length=0, top_k=4, top_p=0.0, num_beams=4, decode_strategy='sampling', decoder_start_token_id=None, bos_token_id=None, eos_token_id=None, pad_token_id=None, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, **model_kwargs)

The interface for generation tasks. This method generates sequences using a decoding strategy. Three decoding strategies are currently supported: “greedy_search”, “sampling” and “beam_search”.
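To make the difference between the three strategy names concrete, here is a minimal, library-free sketch of a single decoding step under each one. It is a toy illustration, not the implementation used by generate: the helper names (greedy_step, top_k_sampling_step, beam_step) and the 4-token logits are made up for this example.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_step(logits):
    """greedy_search: always take the single most likely token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_sampling_step(logits, top_k, rng):
    """sampling: draw one token at random from the top_k most likely ones."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    candidates = ranked[:top_k]
    probs = softmax([logits[i] for i in candidates])
    return rng.choices(candidates, weights=probs, k=1)[0]


def beam_step(beams, logits_per_beam, num_beams):
    """beam_search: extend every beam by every token, keep the num_beams best.

    beams is a list of (token_list, cumulative_log_prob) pairs.
    """
    expanded = []
    for (tokens, score), logits in zip(beams, logits_per_beam):
        for token, prob in enumerate(softmax(logits)):
            expanded.append((tokens + [token], score + math.log(prob)))
    expanded.sort(key=lambda pair: pair[1], reverse=True)
    return expanded[:num_beams]


toy_logits = [0.1, 2.0, 1.5, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)

greedy_token = greedy_step(toy_logits)
sampled_token = top_k_sampling_step(toy_logits, top_k=2, rng=rng)
best_beams = beam_step([([], 0.0)], [toy_logits], num_beams=2)

print(greedy_token)                         # index of the highest logit
print(sampled_token)                        # one of the two highest-scoring tokens
print([tokens for tokens, _ in best_beams])
```

Greedy search is deterministic, sampling trades determinism for variety, and beam search keeps several partial hypotheses alive so that a slightly worse first token can still win overall.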

  • input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Defaults to None, in which case it is initialized as a Tensor with shape [1, 1] filled with the value bos_token_id.

  • generation_config (GenerationConfig, optional) – The generation configuration to be used as the base parametrization for the generation call. **kwargs passed to generate that match attributes of generation_config will override them. If generation_config is not provided, the default is used, which has the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Note that unspecified parameters inherit GenerationConfig’s default values, whose documentation should be checked when parameterizing generation.

  • stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from the arguments and a generation config. If a stopping criterion is passed that has already been created from the arguments or a generation config, an error is thrown. This feature is intended for advanced users.

  • streamer (BaseStreamer, optional) – Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids) and the streamer is responsible for any further processing.

  • synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden, this flag is set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment to avoid hanging if one GPU finishes generating before the others. Otherwise it is set to False.

  • kwargs (dict) – It can be used to specify additional kwargs passed to the model.
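The override rule described for generation_config (keyword arguments that match a config attribute win over the config, everything else is forwarded to the model) can be sketched as follows. ToyGenerationConfig and resolve_config are hypothetical stand-ins written for this illustration; they are not part of PaddleNLP, and the default values are placeholders borrowed from the generate signature above.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ToyGenerationConfig:
    """Hypothetical stand-in; the real GenerationConfig has many more fields."""
    decode_strategy: str = "sampling"
    top_k: int = 4
    num_beams: int = 4
    num_return_sequences: int = 1


def resolve_config(generation_config=None, **kwargs):
    """Return the effective config plus the kwargs left over for the model."""
    base = generation_config if generation_config is not None else ToyGenerationConfig()
    known = {f.name for f in fields(ToyGenerationConfig)}
    # kwargs that name a config attribute override the value stored in the config
    overrides = {k: v for k, v in kwargs.items() if k in known}
    # everything else is treated as an additional model input
    model_kwargs = {k: v for k, v in kwargs.items() if k not in known}
    return replace(base, **overrides), model_kwargs


config = ToyGenerationConfig(decode_strategy="beam_search", num_beams=5)
effective, model_kwargs = resolve_config(
    config,
    num_return_sequences=2,  # matches a config attribute, so it overrides
    attention_mask=None,     # hypothetical model input, passed through
)
print(effective)
print(model_kwargs)
```

The config object passed in is left unchanged; replace returns a new instance, so one base configuration can be reused across calls with different per-call overrides.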


A tuple containing two elements: ids and scores. Each element is a Tensor.

With the fields:

  • ids (Tensor):

    The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. The data type is the same as that of input_ids.

  • scores (Tensor):

    The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, matching the data type of the model parameters.
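Because the first dimension of both tensors is batch_size * num_return_sequences, it is often convenient to regroup the rows per input. The sketch below does this on plain Python lists (for example the result of ids.cpu().numpy().tolist()). The helper name group_by_input is hypothetical, and the code assumes that all candidates for the same input occupy adjacent rows, which should be checked against the model in use.

```python
def group_by_input(rows, num_return_sequences):
    """Split a flat list of rows into one chunk per input sequence.

    Assumes candidates for the same input are stored in adjacent rows.
    """
    if num_return_sequences <= 0:
        raise ValueError("num_return_sequences must be positive")
    if len(rows) % num_return_sequences != 0:
        raise ValueError("row count is not a multiple of num_return_sequences")
    return [
        rows[start:start + num_return_sequences]
        for start in range(0, len(rows), num_return_sequences)
    ]


# Made-up ids for batch_size=2 and num_return_sequences=2 (4 rows in total).
flat_ids = [[11, 12], [13, 14], [21, 22], [23, 24]]
grouped = group_by_input(flat_ids, num_return_sequences=2)
print(grouped)
# [[[11, 12], [13, 14]], [[21, 22], [23, 24]]]
```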

Return type: tuple[Tensor]



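import paddle
from paddlenlp.generation import GenerationConfig  # assumed import path
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer,
)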

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
history = "早上好,今天空气质量不错。"
inputs = tokenizer.dialogue_encode(history, task_type='chitchat',
    add_start_token_as_response=True, return_tensors=True)
# Generate the sequence by using the "greedy_search" strategy
ids, scores = model.generate(**inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)
# [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)
# 是的
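# Generate 2 sequences by using the "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling",
    top_k=5,
    num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)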
print(ids.shape, scores.shape)
# [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['天气好,心情也好', '你也是']
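# Generate 2 sequences by using the "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search",
    num_beams=5,
    num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)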
print(ids.shape, scores.shape)
# [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['是的', '嗯嗯']
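The example above cuts each sequence with sequence_ids.index(tokenizer.sep_token_id), which raises ValueError when the separator never appears (for instance when generation stops at max_length). A small defensive variant is sketched below; truncate_at is a hypothetical helper, not a PaddleNLP function, and the token ids are made up.

```python
def truncate_at(sequence_ids, stop_id):
    """Return the tokens before the first stop_id, or all tokens if it is absent."""
    try:
        return sequence_ids[:sequence_ids.index(stop_id)]
    except ValueError:
        # No stop token was generated, so keep the whole sequence.
        return list(sequence_ids)


SEP = 2  # made-up separator id for this illustration
print(truncate_at([7, 8, SEP, 0, 0], SEP))
# [7, 8]
print(truncate_at([7, 8, 9], SEP))
# [7, 8, 9]
```

In the example it would be used as truncate_at(sequence_ids, tokenizer.sep_token_id) before calling convert_ids_to_string.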