fast_transformer#
- class FasterTransformer(src_vocab_size, trg_vocab_size, max_length, num_encoder_layers, num_decoder_layers, n_head, d_model, d_inner_hid, dropout, weight_sharing, attn_dropout=None, act_dropout=None, bos_id=0, eos_id=1, pad_id=None, decoding_strategy='beam_search', beam_size=4, topk=1, topp=0.0, max_out_len=256, diversity_rate=0.0, decoding_lib=None, use_fp16_decoding=False, enable_fast_encoder=False, use_fp16_encoder=False, rel_len=False, alpha=0.6)[source]#
Bases:
TransformerModel
FasterTransformer is a fast version of the Transformer model for generation. It uses a custom operator, based on and enhancing NVIDIA FasterTransformer, to perform fast generation. A minimal construction sketch follows the parameter list below.
- Parameters:
src_vocab_size (int) – The size of source vocabulary.
trg_vocab_size (int) – The size of target vocabulary.
max_length (int) – The maximum length of input sequences.
num_encoder_layers (int) – The number of sub-layers to be stacked in the encoder.
num_decoder_layers (int) – The number of sub-layers to be stacked in the decoder.
n_head (int) – The number of heads used in multi-head attention.
d_model (int) – The dimension for word embeddings, which is also the last dimension of the input and output of multi-head attention, position-wise feed-forward networks, encoder and decoder.
d_inner_hid (int) – Size of the hidden layer in position-wise feed-forward networks.
dropout (float) – Dropout rates. Used for pre-process, activation and inside attention.
weight_sharing (bool) – Whether to use weight sharing.
attn_dropout (float) – The dropout probability used in MHA to drop some attention targets. If None, the value of dropout is used. Defaults to None.
act_dropout (float) – The dropout probability used after FFN activation. If None, the value of dropout is used. Defaults to None.
bos_id (int, optional) – The start token id, which is also used as the padding id. Defaults to 0.
eos_id (int, optional) – The end token id. Defaults to 1.
pad_id (int, optional) – The pad token id. Defaults to None. If it’s None, the bos_id will be used as pad_id.
decoding_strategy (str, optional) – Indicates the decoding strategy. It can be 'beam_search', 'beam_search_v2', 'topk_sampling' or 'topp_sampling'. For the beam search strategies, 'v2' selects the top beam_size * 2 beams and processes the top beam_size alive and finished beams among them separately, while 'v1' only selects the top beam_size beams and mixes up the alive and finished beams. 'v2' always searches more and gets better results, since the number of alive beams is always beam_size, whereas the number of alive beams in 'v1' might decrease when the end token is met. However, 'v2' always generates longer results and thus might do more computation and be slower. Defaults to 'beam_search'.
beam_size (int, optional) – The beam width for beam search. Defaults to 4.
topk (int, optional) – The number of highest probability tokens to keep for top-k sampling. Defaults to 1.
topp (float, optional) – The most probable tokens whose cumulative probability is not less than topp are kept for top-p sampling. Defaults to 0.0.
max_out_len (int, optional) – The maximum output length. Defaults to 256.
diversity_rate (float, optional) – Refer to A Simple, Fast Diverse Decoding Algorithm for Neural Generation for details. A bigger diversity_rate leads to more diversity. diversity_rate == 0 is equivalent to naive beam search. Defaults to 0 if not set.
use_fp16_decoding (bool, optional) – Whether to use fp16 for decoding. Defaults to False.
enable_fast_encoder (bool, optional) – Whether to use the fast version of encoder. This is experimental option for now. Defaults to False.
use_fp16_encoder (bool, optional) – Whether to use fp16 for encoder. Only works when enable_fast_encoder is True. Defaults to False.
rel_len (bool, optional) – Indicates whether max_out_len is the length relative to that of the source text. Only works in 'v2' temporarily. It is suggested to set a small max_out_len and use rel_len=True. Defaults to False if not set.
alpha (float, optional) – The power number in the length penalty calculation. Only works in 'v2' temporarily. Refer to GNMT. Defaults to 0.6 if not set.
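A minimal construction sketch, assuming the FastGeneration custom operators can be JIT-built in the current environment; the parameter values are illustrative, and the result of the call follows the configured decoding strategy.

import paddle
from paddlenlp.ops import FasterTransformer

# Illustrative values only; vocabulary sizes and model dimensions are assumptions.
model = FasterTransformer(
    src_vocab_size=30000,
    trg_vocab_size=30000,
    max_length=257,
    num_encoder_layers=6,
    num_decoder_layers=6,
    n_head=8,
    d_model=512,
    d_inner_hid=2048,
    dropout=0.1,
    weight_sharing=True,
    bos_id=0,
    eos_id=1,
    decoding_strategy="beam_search",
    beam_size=4,
    max_out_len=64)
model.eval()

# Decoding only needs the source ids; the target side is produced by the custom op.
src_word = paddle.randint(low=3, high=30000, shape=[5, 10], dtype="int64")
out = model(src_word=src_word)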
- forward(src_word, trg_word=None)[source]#
The Transformer forward method. The inputs are source/target sequences, and it returns logits.
- Parameters:
src_word (Tensor) – The ids of source sequence words. It is a tensor with shape [batch_size, source_sequence_length] and its data type can be int or int64.
trg_word (Tensor) – The ids of target sequence words. It is a tensor with shape [batch_size, target_sequence_length] and its data type can be int or int64.
- Returns:
Output tensor of the final layer of the model, whose data type can be float32 or float64, with shape [batch_size, sequence_length, vocab_size].
- Return type:
Tensor
Example
import paddle
from paddlenlp.transformers import TransformerModel

transformer = TransformerModel(
    src_vocab_size=30000,
    trg_vocab_size=30000,
    max_length=257,
    num_encoder_layers=6,
    num_decoder_layers=6,
    n_head=8,
    d_model=512,
    d_inner_hid=2048,
    dropout=0.1,
    weight_sharing=True,
    bos_id=0,
    eos_id=1)

batch_size = 5
seq_len = 10
predict = transformer(
    src_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]),
    trg_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]))
- export_params(init_from_params, place)[source]#
This method is used to load a static graph from a dygraph checkpoint, or to export an inference model using the static graph. It does NOT support the fast encoder.
- Parameters:
init_from_params (string) – The path to dygraph checkpoint.
place (paddle.Place) – The place to execute static graph.
Example
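A hedged sketch of the intended usage: build the FasterTransformer under static graph mode, then load a dygraph checkpoint into it with export_params. The construction arguments are illustrative and the checkpoint directory is a hypothetical placeholder.

import paddle
from paddlenlp.ops import FasterTransformer

# export_params works with the static graph, so switch out of dygraph mode first.
paddle.enable_static()
place = paddle.set_device("gpu")

# The construction arguments should mirror the trained dygraph model (values are illustrative).
transformer = FasterTransformer(
    src_vocab_size=30000,
    trg_vocab_size=30000,
    max_length=257,
    num_encoder_layers=6,
    num_decoder_layers=6,
    n_head=8,
    d_model=512,
    d_inner_hid=2048,
    dropout=0.1,
    weight_sharing=True,
    bos_id=0,
    eos_id=1)

# Load the dygraph checkpoint into the static graph; the path is a hypothetical placeholder.
transformer.export_params(
    init_from_params="./trained_models/step_final/",
    place=place)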
- class TransformerGenerator(src_vocab_size, trg_vocab_size, max_length, num_encoder_layers, num_decoder_layers, n_head, d_model, d_inner_hid, dropout, weight_sharing, bos_id=0, eos_id=1, pad_id=None, beam_size=4, max_out_len=256, activation='relu', normalize_before=True, **kwargs)[source]#
Bases:
Layer
The Transformer model for auto-regressive generation with beam search. It wraps FasterTransformer and InferTransformerModel, and automatically chooses to use FasterTransformer (with JIT building) or the slower version InferTransformerModel.
- Parameters:
src_vocab_size (int) – The size of source vocabulary.
trg_vocab_size (int) – The size of target vocabulary.
max_length (int) – The maximum length of input sequences.
num_encoder_layers (int) – The number of sub-layers to be stacked in the encoder.
num_decoder_layers (int) – The number of sub-layers to be stacked in the decoder.
n_head (int) – The number of heads used in multi-head attention.
d_model (int) – The dimension for word embeddings, which is also the last dimension of the input and output of multi-head attention, position-wise feed-forward networks, encoder and decoder.
d_inner_hid (int) – Size of the hidden layer in position-wise feed-forward networks.
dropout (float) – Dropout rates. Used for pre-process, activation and inside attention.
weight_sharing (bool) – Whether to use weight sharing.
bos_id (int, optional) – The start token id, which is also used as the padding id. Defaults to 0.
eos_id (int, optional) – The end token id. Defaults to 1.
beam_size (int, optional) – The beam width for beam search. Defaults to 4.
max_out_len (int, optional) – The maximum output length. Defaults to 256.
activation (str, optional) – The activation used in FFN. Defaults to “relu”.
normalize_before (bool, optional) – Whether to apply pre-normalization. Defaults to True.
kwargs –
The keyword arguments can be output_time_major, use_ft, use_fp16_decoding, beam_search_version, rel_len, alpha and diversity_rate (a usage sketch is shown after this parameter list):
- output_time_major (bool, optional): Indicates the data layout of the predicted Tensor. If False, the data layout would be batch major with shape [batch_size, seq_len, beam_size]. If True, the data layout would be time major with shape [seq_len, batch_size, beam_size]. Defaults to False.
- use_ft (bool, optional): Whether to use FastGeneration for decoding. Defaults to True if not set.
- use_fp16_decoding (bool, optional): Whether to use fp16 for decoding. Only works when using FastGeneration.
- beam_search_version (str, optional): Indicates the strategy of beam search. It can be 'v1' or 'v2'. 'v2' selects the top beam_size * 2 beams and processes the top beam_size alive and finished beams among them separately, while 'v1' only selects the top beam_size beams and mixes up the alive and finished beams. 'v2' always searches more and gets better results, since the number of alive beams is always beam_size, whereas the number of alive beams in 'v1' might decrease when the end token is met. However, 'v2' always generates longer results and thus might do more computation and be slower.
- rel_len (bool, optional): Indicates whether max_out_len is the length relative to that of the source text. Only works in 'v2' temporarily. It is suggested to set a small max_out_len and use rel_len=True. Defaults to False if not set.
- alpha (float, optional): The power number in the length penalty calculation. Refer to GNMT. Only works in 'v2' temporarily. Defaults to 0.6 if not set.
- diversity_rate (float, optional): Refer to A Simple, Fast Diverse Decoding Algorithm for Neural Generation (https://arxiv.org/abs/1611.08562) for details. A bigger diversity_rate leads to more diversity. diversity_rate == 0 is equivalent to naive beam search. Defaults to 0 if not set. NOTE: Only works when using FastGeneration temporarily.
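A hedged usage sketch of these keyword arguments, assuming the FastGeneration JIT build is available when use_ft=True; all values are illustrative.

import paddle
from paddlenlp.ops import TransformerGenerator

transformer = TransformerGenerator(
    src_vocab_size=30000,
    trg_vocab_size=30000,
    max_length=256,
    num_encoder_layers=6,
    num_decoder_layers=6,
    n_head=8,
    d_model=512,
    d_inner_hid=2048,
    dropout=0.1,
    weight_sharing=True,
    bos_id=0,
    eos_id=1,
    beam_size=4,
    max_out_len=64,
    # FastGeneration-related keyword arguments (illustrative values):
    use_ft=True,                # falls back to InferTransformerModel if the JIT build fails
    use_fp16_decoding=False,
    beam_search_version='v2',
    rel_len=True,               # max_out_len is relative to the source length in v2
    alpha=0.6,
    output_time_major=False)

batch_size = 5
seq_len = 10
out = transformer(
    src_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]))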
- forward(src_word, trg_word=None)[source]#
Performs decoding for the Transformer model.
- Parameters:
src_word (Tensor) – The ids of source sequence words. It is a tensor with shape [batch_size, source_sequence_length] and its data type can be int or int64.
trg_word (Tensor, optional) – The ids of target sequence words. Normally, it should NOT be given. If it is given, force decoding with the previous output tokens will be triggered. Defaults to None.
- Returns:
An int64 tensor indicating the predicted ids. Its shape is [batch_size, seq_len, beam_size] or [seq_len, batch_size, beam_size] according to output_time_major. When using FastGeneration and beam search v2, the beam dimension is doubled to include both the top beam_size alive and finished beams, so the tensor shape is [batch_size, seq_len, beam_size * 2] or [seq_len, batch_size, beam_size * 2].
- Return type:
Tensor
Example
import paddle
from paddlenlp.ops import TransformerGenerator

transformer = TransformerGenerator(
    src_vocab_size=30000,
    trg_vocab_size=30000,
    max_length=256,
    num_encoder_layers=6,
    num_decoder_layers=6,
    n_head=8,
    d_model=512,
    d_inner_hid=2048,
    dropout=0.1,
    weight_sharing=True,
    bos_id=0,
    eos_id=1,
    beam_size=4,
    max_out_len=256)

batch_size = 5
seq_len = 10
transformer(
    src_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len]))
- class FasterOPT(model, decoding_lib=None, use_fp16_decoding=False)[source]#
Bases:
OPTPretrainedModel
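As a hedged usage sketch (the checkpoint name and tokenizer pairing are assumptions, not taken from this page), FasterOPT wraps an already trained OPT model, which is then called directly for fast decoding:

import paddle
from paddlenlp.transformers import GPTTokenizer, OPTForCausalLM
from paddlenlp.ops import FasterOPT

# Hypothetical checkpoint; any compatible OPT causal-LM checkpoint should work.
model_name = "facebook/opt-125m"
model = OPTForCausalLM.from_pretrained(model_name)
tokenizer = GPTTokenizer.from_pretrained(model_name)
model.eval()

fast_opt = FasterOPT(model, use_fp16_decoding=False)

input_ids = paddle.to_tensor([tokenizer("Where is the capital of France?")["input_ids"]])
ids = fast_opt(
    input_ids,
    top_k=4,
    max_length=32,
    decode_strategy="sample")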
- forward(input_ids, seq_len=None, attention_mask=None, top_k=4, top_p=0.0, max_length=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=0, decode_strategy='sample', num_return_sequences=1, **model_kwargs)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- generate(input_ids, seq_len=None, attention_mask=None, top_k=4, top_p=0.0, max_length=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=0, decode_strategy='sample', num_return_sequences=1, **model_kwargs)#
The interface for the generation task. This method can generate sequences by using a decoding strategy. Currently, there are three decoding strategies supported: "greedy_search", "sampling" and "beam_search".
- Parameters:
input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Defaults to None, in which case it is initialized as a Tensor with shape [1, 1] filled with the value of bos_token_id.
generation_config (GenerationConfig, optional) – The generation configuration to be used as the base parametrization for the generation call. **kwargs passed to generate that match the attributes of generation_config will override them. If generation_config is not provided, the default will be used, with the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Note that unspecified parameters will inherit GenerationConfig's default values, whose documentation should be checked to parameterize generation.
stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criterion that was already created from the arguments or a generation config is passed, an error is thrown. This feature is intended for advanced users.
streamer (BaseStreamer, optional) – Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids), and the streamer is responsible for any further processing.
synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden, this flag will be set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment to avoid hanging if one GPU finishes generating before the others; otherwise it will be set to False.
kwargs (dict) – Can be used to specify additional kwargs passed to the model.
- Returns:
A tuple containing two elements: ids and scores. Each element is a Tensor.
With the fields:
- ids (Tensor):
The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. Its data type is the same as the input input_ids.
- scores (Tensor):
The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, the same as the parameters in the model.
- Return type:
tuple[Tensor]
Example
import paddle
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer
)

paddle.seed(2)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
history = "早上好,今天空气质量不错。"
inputs = tokenizer.dialogue_encode(history,
                                   task_type='chitchat',
                                   add_start_token_as_response=True,
                                   return_tensors=True)

# Generate the sequence by using "greedy_search" strategy
ids, scores = model.generate(**inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)  # [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)  # 是的

# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling", top_k=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)  # [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)  # ['天气好,心情也好', '你也是']

# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search", num_beams=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)  # [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)  # ['是的', '嗯嗯']
- class FasterGPT(model, decoding_lib=None, use_fp16_decoding=False)[source]#
Bases:
GPTPretrainedModel
- forward(input_ids, seq_len=None, attention_mask=None, top_k=4, top_p=0.0, max_length=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=0, decode_strategy='sample', num_return_sequences=1, **model_kwargs)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- generate(input_ids, seq_len=None, attention_mask=None, top_k=4, top_p=0.0, max_length=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=0, decode_strategy='sample', num_return_sequences=1, **model_kwargs)#
The interface for the generation task. This method can generate sequences by using a decoding strategy. Currently, there are three decoding strategies supported: "greedy_search", "sampling" and "beam_search".
- Parameters:
input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Defaults to None, in which case it is initialized as a Tensor with shape [1, 1] filled with the value of bos_token_id.
generation_config (GenerationConfig, optional) – The generation configuration to be used as the base parametrization for the generation call. **kwargs passed to generate that match the attributes of generation_config will override them. If generation_config is not provided, the default will be used, with the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Note that unspecified parameters will inherit GenerationConfig's default values, whose documentation should be checked to parameterize generation.
stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criterion that was already created from the arguments or a generation config is passed, an error is thrown. This feature is intended for advanced users.
streamer (BaseStreamer, optional) – Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids), and the streamer is responsible for any further processing.
synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden, this flag will be set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment to avoid hanging if one GPU finishes generating before the others; otherwise it will be set to False.
kwargs (dict) – Can be used to specify additional kwargs passed to the model.
- Returns:
A tuple containing two elements: ids and scores. Each element is a Tensor.
With the fields:
- ids (Tensor):
The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. Its data type is the same as the input input_ids.
- scores (Tensor):
The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, the same as the parameters in the model.
- Return type:
tuple[Tensor]
Example
import paddle
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer
)

paddle.seed(2)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
history = "早上好,今天空气质量不错。"
inputs = tokenizer.dialogue_encode(history,
                                   task_type='chitchat',
                                   add_start_token_as_response=True,
                                   return_tensors=True)

# Generate the sequence by using "greedy_search" strategy
ids, scores = model.generate(**inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)  # [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)  # 是的

# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling", top_k=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)  # [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)  # ['天气好,心情也好', '你也是']

# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search", num_beams=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)  # [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)  # ['是的', '嗯嗯']
- class FasterUnifiedTransformer(model, decoding_lib=None, use_fp16_decoding=False)[source]#
Bases:
UnifiedTransformerPretrainedModel
- forward(input_ids, token_type_ids, attention_mask, seq_len=None, role_ids=None, position_ids=None, max_length=128, min_length=0, top_k=4, top_p=0.0, decode_strategy='sampling', bos_token_id=None, eos_token_id=None, pad_token_id=None, num_beams=4, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, **model_kwargs)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- generate(input_ids, token_type_ids, attention_mask, seq_len=None, role_ids=None, position_ids=None, max_length=128, min_length=0, top_k=4, top_p=0.0, decode_strategy='sampling', bos_token_id=None, eos_token_id=None, pad_token_id=None, num_beams=4, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, **model_kwargs)#
The interface for the generation task. This method can generate sequences by using a decoding strategy. Currently, there are three decoding strategies supported: "greedy_search", "sampling" and "beam_search".
- Parameters:
input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Defaults to None, in which case it is initialized as a Tensor with shape [1, 1] filled with the value of bos_token_id.
generation_config (GenerationConfig, optional) – The generation configuration to be used as the base parametrization for the generation call. **kwargs passed to generate that match the attributes of generation_config will override them. If generation_config is not provided, the default will be used, with the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Note that unspecified parameters will inherit GenerationConfig's default values, whose documentation should be checked to parameterize generation.
stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criterion that was already created from the arguments or a generation config is passed, an error is thrown. This feature is intended for advanced users.
streamer (BaseStreamer, optional) – Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids), and the streamer is responsible for any further processing.
synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden, this flag will be set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment to avoid hanging if one GPU finishes generating before the others; otherwise it will be set to False.
kwargs (dict) – Can be used to specify additional kwargs passed to the model.
- Returns:
A tuple containing two elements: ids and scores. Each element is a Tensor.
With the fields:
- ids (Tensor):
The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. Its data type is the same as the input input_ids.
- scores (Tensor):
The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, the same as the parameters in the model.
- Return type:
tuple[Tensor]
Example
import paddle
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer
)

paddle.seed(2)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
history = "早上好,今天空气质量不错。"
inputs = tokenizer.dialogue_encode(history,
                                   task_type='chitchat',
                                   add_start_token_as_response=True,
                                   return_tensors=True)

# Generate the sequence by using "greedy_search" strategy
ids, scores = model.generate(**inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)  # [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)  # 是的

# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling", top_k=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)  # [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)  # ['天气好,心情也好', '你也是']

# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search", num_beams=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)  # [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)  # ['是的', '嗯嗯']
- class FasterUNIMOText(model, decoding_lib=None, use_fp16_decoding=False, **kwargs)[source]#
Bases:
UNIMOPretrainedModel
- forward(input_ids, token_type_ids, attention_mask, seq_len=None, max_length=128, min_length=0, top_k=4, top_p=0.0, num_beams=4, decode_strategy='sampling', bos_token_id=None, eos_token_id=None, pad_token_id=None, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, position_ids=None, **model_kwargs)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- generate(input_ids, token_type_ids, attention_mask, seq_len=None, max_length=128, min_length=0, top_k=4, top_p=0.0, num_beams=4, decode_strategy='sampling', bos_token_id=None, eos_token_id=None, pad_token_id=None, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, position_ids=None, **model_kwargs)#
The interface for the generation task. This method can generate sequences by using a decoding strategy. Currently, there are three decoding strategies supported: "greedy_search", "sampling" and "beam_search".
- Parameters:
input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Defaults to None, in which case it is initialized as a Tensor with shape [1, 1] filled with the value of bos_token_id.
generation_config (GenerationConfig, optional) – The generation configuration to be used as the base parametrization for the generation call. **kwargs passed to generate that match the attributes of generation_config will override them. If generation_config is not provided, the default will be used, with the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Note that unspecified parameters will inherit GenerationConfig's default values, whose documentation should be checked to parameterize generation.
stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criterion that was already created from the arguments or a generation config is passed, an error is thrown. This feature is intended for advanced users.
streamer (BaseStreamer, optional) – Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids), and the streamer is responsible for any further processing.
synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden, this flag will be set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment to avoid hanging if one GPU finishes generating before the others; otherwise it will be set to False.
kwargs (dict) – Can be used to specify additional kwargs passed to the model.
- Returns:
A tuple containing two elements: ids and scores. Each element is a Tensor.
With the fields:
- ids (Tensor):
The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. Its data type is the same as the input input_ids.
- scores (Tensor):
The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, the same as the parameters in the model.
- Return type:
tuple[Tensor]
Example
import paddle
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer
)

paddle.seed(2)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
history = "早上好,今天空气质量不错。"
inputs = tokenizer.dialogue_encode(history,
                                   task_type='chitchat',
                                   add_start_token_as_response=True,
                                   return_tensors=True)

# Generate the sequence by using "greedy_search" strategy
ids, scores = model.generate(**inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)  # [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)  # 是的

# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling", top_k=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)  # [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)  # ['天气好,心情也好', '你也是']

# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search", num_beams=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)  # [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)  # ['是的', '嗯嗯']
- class FasterMIRO(model, decoding_lib=None, use_fp16_decoding=False, **kwargs)[source]#
Bases:
UNIMOPretrainedModel
- forward(input_ids, token_type_ids, attention_mask, seq_len=None, max_length=128, min_length=0, top_k=4, top_p=0.0, num_beams=4, decode_strategy='sampling', bos_token_id=None, eos_token_id=None, pad_token_id=None, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, position_ids=None, **model_kwargs)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- generate(input_ids, token_type_ids, attention_mask, seq_len=None, max_length=128, min_length=0, top_k=4, top_p=0.0, num_beams=4, decode_strategy='sampling', bos_token_id=None, eos_token_id=None, pad_token_id=None, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, position_ids=None, **model_kwargs)#
The interface for the generation task. This method can generate sequences by using a decoding strategy. Currently, there are three decoding strategies supported: "greedy_search", "sampling" and "beam_search".
- Parameters:
input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Defaults to None, in which case it is initialized as a Tensor with shape [1, 1] filled with the value of bos_token_id.
generation_config (GenerationConfig, optional) – The generation configuration to be used as the base parametrization for the generation call. **kwargs passed to generate that match the attributes of generation_config will override them. If generation_config is not provided, the default will be used, with the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Note that unspecified parameters will inherit GenerationConfig's default values, whose documentation should be checked to parameterize generation.
stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criterion that was already created from the arguments or a generation config is passed, an error is thrown. This feature is intended for advanced users.
streamer (BaseStreamer, optional) – Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids), and the streamer is responsible for any further processing.
synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden, this flag will be set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment to avoid hanging if one GPU finishes generating before the others; otherwise it will be set to False.
kwargs (dict) – Can be used to specify additional kwargs passed to the model.
- Returns:
A tuple containing two elements: ids and scores. Each element is a Tensor.
With the fields:
- ids (Tensor):
The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. Its data type is the same as the input input_ids.
- scores (Tensor):
The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, the same as the parameters in the model.
- Return type:
tuple[Tensor]
Example
import paddle
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer
)

paddle.seed(2)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
history = "早上好,今天空气质量不错。"
inputs = tokenizer.dialogue_encode(history,
                                   task_type='chitchat',
                                   add_start_token_as_response=True,
                                   return_tensors=True)

# Generate the sequence by using "greedy_search" strategy
ids, scores = model.generate(**inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)  # [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)  # 是的

# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling", top_k=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)  # [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)  # ['天气好,心情也好', '你也是']

# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search", num_beams=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)  # [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)  # ['是的', '嗯嗯']
- class FasterBART(model, decoding_lib=None, use_fp16_decoding=False, enable_fast_encoder=True)[source]#
Bases:
BartPretrainedModel
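As a hedged usage sketch (the checkpoint name is an assumption), FasterBART wraps a trained BART model for fast generation; enable_fast_encoder=True additionally swaps in the fused FastGeneration encoder:

import paddle
from paddlenlp.transformers import BartForConditionalGeneration, BartTokenizer
from paddlenlp.ops import FasterBART

# Hypothetical checkpoint name.
model_name = "bart-base"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)
model.eval()

fast_bart = FasterBART(model, use_fp16_decoding=False, enable_fast_encoder=True)

inputs = tokenizer("He is a friendly person.")
input_ids = paddle.to_tensor([inputs["input_ids"]])
ids = fast_bart(
    input_ids,
    num_beams=4,
    max_length=20,
    decode_strategy="beam_search")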
- enable_faster_encoder_func(use_fp16=False, encoder_lib=None)#
Compiles the fused encoder operator integrated with FastGeneration using JIT (Just-In-Time) compilation, and replaces the forward function of the paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer objects inherited from self to support inference using FastGeneration.
Examples
from paddlenlp.ops import enable_fast_encoder, disable_fast_encoder

model.eval()
model = enable_fast_encoder(model)
enc_out = model(src, src_mask)
model = disable_fast_encoder(model)
- forward(input_ids=None, encoder_output=None, seq_len=None, num_beams=4, top_k=1, top_p=0.0, temperature=1.0, decode_strategy='beam_search', bos_token_id=None, eos_token_id=None, pad_token_id=None, decoder_start_token_id=None, min_length=0, max_length=20, diversity_rate=0.0, length_penalty=0.6, num_return_sequences=1, early_stopping=False, forced_eos_token_id=None, **model_kwargs)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- generate(input_ids=None, encoder_output=None, seq_len=None, num_beams=4, top_k=1, top_p=0.0, temperature=1.0, decode_strategy='beam_search', bos_token_id=None, eos_token_id=None, pad_token_id=None, decoder_start_token_id=None, min_length=0, max_length=20, diversity_rate=0.0, length_penalty=0.6, num_return_sequences=1, early_stopping=False, forced_eos_token_id=None, **model_kwargs)#
The interface for the generation task. This method can generate sequences by using a decoding strategy. Currently, there are three decoding strategies supported: "greedy_search", "sampling" and "beam_search".
- Parameters:
input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Defaults to None, in which case it is initialized as a Tensor with shape [1, 1] filled with the value of bos_token_id.
generation_config (GenerationConfig, optional) – The generation configuration to be used as the base parametrization for the generation call. **kwargs passed to generate that match the attributes of generation_config will override them. If generation_config is not provided, the default will be used, with the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Note that unspecified parameters will inherit GenerationConfig's default values, whose documentation should be checked to parameterize generation.
stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criterion that was already created from the arguments or a generation config is passed, an error is thrown. This feature is intended for advanced users.
streamer (BaseStreamer, optional) – Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids), and the streamer is responsible for any further processing.
synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden, this flag will be set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment to avoid hanging if one GPU finishes generating before the others; otherwise it will be set to False.
kwargs (dict) – Can be used to specify additional kwargs passed to the model.
- Returns:
A tuple containing two elements: ids and scores. Each element is a Tensor.
With the fields:
- ids (Tensor):
The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. Its data type is the same as the input input_ids.
- scores (Tensor):
The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, the same as the parameters in the model.
- Return type:
tuple[Tensor]
Example
import paddle
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer
)

paddle.seed(2)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
history = "早上好,今天空气质量不错。"
inputs = tokenizer.dialogue_encode(history,
                                   task_type='chitchat',
                                   add_start_token_as_response=True,
                                   return_tensors=True)

# Generate the sequence by using "greedy_search" strategy
ids, scores = model.generate(**inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)  # [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)  # 是的

# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling", top_k=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)  # [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)  # ['天气好,心情也好', '你也是']

# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search", num_beams=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)  # [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)  # ['是的', '嗯嗯']
- class FasterMBART(model, decoding_lib=None, use_fp16_decoding=False, enable_fast_encoder=False)[source]#
Bases:
MBartPretrainedModel
- enable_faster_encoder_func(use_fp16=False, encoder_lib=None)#
Compiles the fused encoder operator integrated with FastGeneration using JIT (Just-In-Time) compilation, and replaces the forward function of the paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer objects inherited from self to support inference using FastGeneration.
Examples
from paddlenlp.ops import enable_fast_encoder, disable_fast_encoder

model.eval()
model = enable_fast_encoder(model)
enc_out = model(src, src_mask)
model = disable_fast_encoder(model)
- forward(input_ids=None, encoder_output=None, seq_len=None, forced_bos_token_id=None, num_beams=4, top_k=1, top_p=0.0, decode_strategy='beam_search_v3', bos_token_id=None, eos_token_id=None, pad_token_id=None, decoder_start_token_id=None, max_length=256, diversity_rate=0.0, length_penalty=0.6, temperature=1.0, num_return_sequences=1, early_stopping=False, forced_eos_token_id=None, **model_kwargs)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- generate(input_ids=None, encoder_output=None, seq_len=None, forced_bos_token_id=None, num_beams=4, top_k=1, top_p=0.0, decode_strategy='beam_search_v3', bos_token_id=None, eos_token_id=None, pad_token_id=None, decoder_start_token_id=None, max_length=256, diversity_rate=0.0, length_penalty=0.6, temperature=1.0, num_return_sequences=1, early_stopping=False, forced_eos_token_id=None, **model_kwargs)#
The interface for the generation task. This method can generate sequences by using a decoding strategy. Currently, there are three decoding strategies supported: "greedy_search", "sampling" and "beam_search".
- Parameters:
input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Defaults to None, in which case it is initialized as a Tensor with shape [1, 1] filled with the value of bos_token_id.
generation_config (GenerationConfig, optional) – The generation configuration to be used as the base parametrization for the generation call. **kwargs passed to generate that match the attributes of generation_config will override them. If generation_config is not provided, the default will be used, with the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Note that unspecified parameters will inherit GenerationConfig's default values, whose documentation should be checked to parameterize generation.
stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criterion that was already created from the arguments or a generation config is passed, an error is thrown. This feature is intended for advanced users.
streamer (BaseStreamer, optional) – Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids), and the streamer is responsible for any further processing.
synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden, this flag will be set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment to avoid hanging if one GPU finishes generating before the others; otherwise it will be set to False.
kwargs (dict) – Can be used to specify additional kwargs passed to the model.
- Returns:
A tuple containing two elements: ids and scores. Each element is a Tensor.
With the fields:
- ids (Tensor):
The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. Its data type is the same as the input input_ids.
- scores (Tensor):
The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, the same as the parameters in the model.
- Return type:
tuple[Tensor]
Example
import paddle
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer
)

paddle.seed(2)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
history = "早上好,今天空气质量不错。"
inputs = tokenizer.dialogue_encode(history,
                                   task_type='chitchat',
                                   add_start_token_as_response=True,
                                   return_tensors=True)

# Generate the sequence by using "greedy_search" strategy
ids, scores = model.generate(**inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)  # [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)  # 是的

# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling", top_k=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)  # [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)  # ['天气好,心情也好', '你也是']

# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search", num_beams=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)  # [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)  # ['是的', '嗯嗯']
- class FasterGPTJ(model, decoding_lib=None, use_fp16_decoding=False)[source]#
Bases:
GPTJPretrainedModel
- forward(input_ids, seq_len=None, attention_mask=None, top_k=4, top_p=0.0, min_length=0, max_length=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=0, repetition_penalty=1.0, decode_strategy='sampling', num_return_sequences=1, **model_kwargs)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- generate(input_ids, seq_len=None, attention_mask=None, top_k=4, top_p=0.0, min_length=0, max_length=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=0, repetition_penalty=1.0, decode_strategy='sampling', num_return_sequences=1, **model_kwargs)#
The interface for the generation task. This method can generate sequences by using a decoding strategy. Currently, there are three decoding strategies supported: "greedy_search", "sampling" and "beam_search".
- Parameters:
input_ids (Tensor, optional) – The input sequence ids for the generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Defaults to None, in which case it is initialized as a Tensor with shape [1, 1] filled with the value of bos_token_id.
generation_config (GenerationConfig, optional) – The generation configuration to be used as the base parametrization for the generation call. **kwargs passed to generate that match the attributes of generation_config will override them. If generation_config is not provided, the default will be used, with the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Note that unspecified parameters will inherit GenerationConfig's default values, whose documentation should be checked to parameterize generation.
stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criterion that was already created from the arguments or a generation config is passed, an error is thrown. This feature is intended for advanced users.
streamer (BaseStreamer, optional) – Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids), and the streamer is responsible for any further processing.
synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden, this flag will be set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment to avoid hanging if one GPU finishes generating before the others; otherwise it will be set to False.
kwargs (dict) – Can be used to specify additional kwargs passed to the model.
- Returns:
A tuple containing two elements: ids and scores. Each element is a Tensor.
With the fields:
- ids (Tensor):
The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. Its data type is the same as the input input_ids.
- scores (Tensor):
The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, the same as the parameters in the model.
- Return type:
tuple[Tensor]
Example
import paddle
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer
)

paddle.seed(2)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
history = "早上好,今天空气质量不错。"
inputs = tokenizer.dialogue_encode(history,
                                   task_type='chitchat',
                                   add_start_token_as_response=True,
                                   return_tensors=True)

# Generate the sequence by using "greedy_search" strategy
ids, scores = model.generate(**inputs, decode_strategy="greedy_search")
print(ids.shape, scores.shape)  # [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)  # 是的

# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling", top_k=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)  # [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)  # ['天气好,心情也好', '你也是']

# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search", num_beams=5, num_return_sequences=2)
ids, scores = model.generate(**inputs, generation_config=generation_config)
print(ids.shape, scores.shape)  # [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)  # ['是的', '嗯嗯']
- class FasterCodeGen(model, decoding_lib=None, use_fp16_decoding=False)[source]#
Bases:
CodeGenPreTrainedModel
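As a hedged usage sketch (the checkpoint name is an assumption), FasterCodeGen wraps a trained CodeGen causal-LM model for fast sampling-based code completion:

import paddle
from paddlenlp.transformers import CodeGenForCausalLM, CodeGenTokenizer
from paddlenlp.ops import FasterCodeGen

# Hypothetical checkpoint name.
model_name = "Salesforce/codegen-350M-mono"
model = CodeGenForCausalLM.from_pretrained(model_name)
tokenizer = CodeGenTokenizer.from_pretrained(model_name)
model.eval()

fast_codegen = FasterCodeGen(model, use_fp16_decoding=False)

inputs = tokenizer("def hello_world():")
input_ids = paddle.to_tensor([inputs["input_ids"]])
ids = fast_codegen(
    input_ids,
    top_k=4,
    max_length=64,
    decode_strategy="sampling")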
- forward(input_ids, seq_len=None, attention_mask=None, top_k=4, top_p=0.0, min_length=0, max_length=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=0, repetition_penalty=1.0, decode_strategy='sampling', num_return_sequences=1, **model_kwargs)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- generate(input_ids, seq_len=None, attention_mask=None, top_k=4, top_p=0.0, min_length=0, max_length=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=0, repetition_penalty=1.0, decode_strategy='sampling', num_return_sequences=1, **model_kwargs)#
The interface for the generation task. This method generates sequences using a decoding strategy. Currently, three decoding strategies are supported: “greedy_search”, “sampling” and “beam_search”.
- Parameters:
input_ids (Tensor, optional) – The input sequence ids for generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Defaults to None, in which case it is initialized as a Tensor with shape [1, 1] filled with the value bos_token_id.
generation_config (GenerationConfig, optional) – The generation configuration used as the base parametrization for the generation call. **kwargs passed to generate that match attributes of generation_config will override them. If generation_config is not provided, the default is used, with the following loading priority: 1) the generation_config.json model file, if it exists; 2) the model configuration. Note that unspecified parameters inherit the default values of GenerationConfig, whose documentation should be checked to parameterize generation.
stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from the arguments and a generation config. If a stopping criterion that is already created by the arguments or a generation config is passed, an error is thrown. This feature is intended for advanced users.
streamer (BaseStreamer, optional) – Streamer object used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids), and the streamer is responsible for any further processing.
synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden, this flag is set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment, to avoid hanging if one GPU finishes generating before the others; otherwise it is set to False.
kwargs (dict) – Additional keyword arguments passed to the model.
- Returns:
A tuple containing two elements: ids and scores. Each element is a Tensor.
With the fields:
- ids (Tensor):
The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. The data type is the same as the input input_ids.
- scores (Tensor):
The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, the same as the parameters in the model.
- Return type:
tuple[Tensor]
Example
import paddle
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer,
)
# GenerationConfig is used in the later snippets; this import path is the one
# exposed by recent PaddleNLP releases and may vary across versions.
from paddlenlp.generation import GenerationConfig

paddle.seed(2)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
history = "早上好,今天空气质量不错。"
inputs = tokenizer.dialogue_encode(
    history,
    task_type='chitchat',
    add_start_token_as_response=True,
    return_tensors=True)
# Generate the sequence by using "greedy_search" strategy
ids, scores = model.generate(
    **inputs,
    decode_strategy="greedy_search")
print(ids.shape, scores.shape)
# [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)
# 是的
# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling",
    top_k=5,
    num_return_sequences=2)
ids, scores = model.generate(
    **inputs,
    generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['天气好,心情也好', '你也是']
# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search",
    num_beams=5,
    num_return_sequences=2)
ids, scores = model.generate(
    **inputs,
    generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['是的', '嗯嗯']
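The example above illustrates the generic generate API with UnifiedTransformer. Below is a minimal sketch of driving FasterCodeGen itself; the checkpoint name, the import path of the wrapper, and the exact output format are assumptions, while the constructor and forward arguments follow the signatures documented above.
import paddle
from paddlenlp.transformers import CodeGenForCausalLM, CodeGenTokenizer
from paddlenlp.ops import FasterCodeGen  # import path assumed for the wrapper class

paddle.seed(2)

# Checkpoint name is illustrative; use any CodeGen checkpoint available in PaddleNLP.
model_name = 'Salesforce/codegen-350M-mono'
tokenizer = CodeGenTokenizer.from_pretrained(model_name)
model = CodeGenForCausalLM.from_pretrained(model_name)
model.eval()

# Wrap the pretrained model with the fast decoding op (see the class signature above).
fast_model = FasterCodeGen(model, use_fp16_decoding=False)

prompt = "def hello_world():"
inputs = tokenizer(prompt, return_tensors="pd")
# Arguments mirror FasterCodeGen.forward above; the returned value should follow the
# Returns description of generate (ids and scores), but treat this as a sketch only.
output = fast_model(
    input_ids=inputs["input_ids"],
    top_k=4,
    top_p=0.0,
    max_length=64,
    decode_strategy="sampling")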
- class FasterPegasus(model, decoding_lib=None, use_fp16_decoding=False, enable_fast_encoder=False, **kwargs)[source]#
Bases:
PegasusPretrainedModel
- enable_faster_encoder_func(use_fp16=False, encoder_lib=None)#
Compiles the fused encoder operator integrated with FastGeneration via JIT (Just-In-Time) compilation, and replaces the forward function of the paddle.nn.TransformerEncoder and paddle.nn.TransformerEncoderLayer objects contained in self so that inference can use FastGeneration.
Examples
from paddlenlp.ops import enable_fast_encoder, disable_fast_encoder

model.eval()
model = enable_fast_encoder(model)
enc_out = model(src, src_mask)
model = disable_fast_encoder(model)
- forward(input_ids=None, encoder_output=None, seq_len=None, min_length=0, max_length=256, num_beams=4, decode_strategy='beam_search_v3', decoder_start_token_id=None, bos_token_id=None, eos_token_id=None, pad_token_id=None, diversity_rate=0.0, length_penalty=0.6, top_k=1, top_p=0.0, temperature=1.0, num_return_sequences=1, early_stopping=False, forced_bos_token_id=None, forced_eos_token_id=None, **model_kwargs)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- generate(input_ids=None, encoder_output=None, seq_len=None, min_length=0, max_length=256, num_beams=4, decode_strategy='beam_search_v3', decoder_start_token_id=None, bos_token_id=None, eos_token_id=None, pad_token_id=None, diversity_rate=0.0, length_penalty=0.6, top_k=1, top_p=0.0, temperature=1.0, num_return_sequences=1, early_stopping=False, forced_bos_token_id=None, forced_eos_token_id=None, **model_kwargs)#
The interface for the generation task. This method generates sequences using a decoding strategy. Currently, three decoding strategies are supported: “greedy_search”, “sampling” and “beam_search”.
- Parameters:
input_ids (Tensor, optional) – The input sequence ids for generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Defaults to None, in which case it is initialized as a Tensor with shape [1, 1] filled with the value bos_token_id.
generation_config (GenerationConfig, optional) – The generation configuration used as the base parametrization for the generation call. **kwargs passed to generate that match attributes of generation_config will override them. If generation_config is not provided, the default is used, with the following loading priority: 1) the generation_config.json model file, if it exists; 2) the model configuration. Note that unspecified parameters inherit the default values of GenerationConfig, whose documentation should be checked to parameterize generation.
stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from the arguments and a generation config. If a stopping criterion that is already created by the arguments or a generation config is passed, an error is thrown. This feature is intended for advanced users.
streamer (BaseStreamer, optional) – Streamer object used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids), and the streamer is responsible for any further processing.
synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden, this flag is set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment, to avoid hanging if one GPU finishes generating before the others; otherwise it is set to False.
kwargs (dict) – Additional keyword arguments passed to the model.
- Returns:
A tuple containing two elements: ids and scores. Each element is a Tensor.
With the fields:
- ids (Tensor):
The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. The data type is the same as the input input_ids.
- scores (Tensor):
The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, the same as the parameters in the model.
- Return type:
tuple[Tensor]
Example
import paddle
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer,
)
# GenerationConfig is used in the later snippets; this import path is the one
# exposed by recent PaddleNLP releases and may vary across versions.
from paddlenlp.generation import GenerationConfig

paddle.seed(2)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
history = "早上好,今天空气质量不错。"
inputs = tokenizer.dialogue_encode(
    history,
    task_type='chitchat',
    add_start_token_as_response=True,
    return_tensors=True)
# Generate the sequence by using "greedy_search" strategy
ids, scores = model.generate(
    **inputs,
    decode_strategy="greedy_search")
print(ids.shape, scores.shape)
# [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)
# 是的
# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling",
    top_k=5,
    num_return_sequences=2)
ids, scores = model.generate(
    **inputs,
    generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['天气好,心情也好', '你也是']
# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search",
    num_beams=5,
    num_return_sequences=2)
ids, scores = model.generate(
    **inputs,
    generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['是的', '嗯嗯']
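Again, the example above only illustrates the generic API. A minimal sketch of summarization with FasterPegasus follows; the checkpoint name, tokenizer class, and the wrapper's import path are assumptions (substitute any Pegasus checkpoint shipped with PaddleNLP), while the constructor and forward arguments follow the signatures documented above.
from paddlenlp.transformers import PegasusForConditionalGeneration, PegasusChineseTokenizer
from paddlenlp.ops import FasterPegasus  # import path assumed for the wrapper class

# Checkpoint name is illustrative; substitute a Pegasus checkpoint you have available.
model_name = 'IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese'
tokenizer = PegasusChineseTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)
model.eval()

# Wrap with the fast generation op; enable_fast_encoder additionally fuses the encoder.
fast_model = FasterPegasus(model, use_fp16_decoding=False, enable_fast_encoder=False)

article = "今天上午,市气象台发布了高温预警,提醒市民注意防暑降温。"
inputs = tokenizer(article, return_tensors="pd")
# Arguments mirror FasterPegasus.forward above; beam search is the default strategy.
output = fast_model(
    input_ids=inputs["input_ids"],
    num_beams=4,
    decode_strategy="beam_search_v3",
    max_length=64)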
- class FasterT5(model, decoding_lib=None, use_fp16_decoding=False)[source]#
Bases:
T5PretrainedModel
- forward(input_ids=None, encoder_output=None, seq_len=None, max_length=128, min_length=0, top_k=4, top_p=0.0, num_beams=4, decode_strategy='sampling', decoder_start_token_id=None, bos_token_id=None, eos_token_id=None, pad_token_id=None, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, **model_kwargs)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- generate(input_ids=None, encoder_output=None, seq_len=None, max_length=128, min_length=0, top_k=4, top_p=0.0, num_beams=4, decode_strategy='sampling', decoder_start_token_id=None, bos_token_id=None, eos_token_id=None, pad_token_id=None, diversity_rate=0.0, temperature=1.0, num_return_sequences=1, length_penalty=0.6, early_stopping=False, forced_eos_token_id=None, **model_kwargs)#
The interface for the generation task. This method generates sequences using a decoding strategy. Currently, three decoding strategies are supported: “greedy_search”, “sampling” and “beam_search”.
- Parameters:
input_ids (Tensor, optional) – The input sequence ids for generation. It is a Tensor with shape [batch_size, sequence_length]. The data type should be int32 or int64. Defaults to None, in which case it is initialized as a Tensor with shape [1, 1] filled with the value bos_token_id.
generation_config (GenerationConfig, optional) – The generation configuration used as the base parametrization for the generation call. **kwargs passed to generate that match attributes of generation_config will override them. If generation_config is not provided, the default is used, with the following loading priority: 1) the generation_config.json model file, if it exists; 2) the model configuration. Note that unspecified parameters inherit the default values of GenerationConfig, whose documentation should be checked to parameterize generation.
stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from the arguments and a generation config. If a stopping criterion that is already created by the arguments or a generation config is passed, an error is thrown. This feature is intended for advanced users.
streamer (BaseStreamer, optional) – Streamer object used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids), and the streamer is responsible for any further processing.
synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden, this flag is set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment, to avoid hanging if one GPU finishes generating before the others; otherwise it is set to False.
kwargs (dict) – Additional keyword arguments passed to the model.
- Returns:
A tuple containing two elements: ids and scores. Each element is a Tensor.
With the fields:
- ids (Tensor):
The ids of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, sequence_length]. The data type is the same as the input input_ids.
- scores (Tensor):
The scores of the generated sequences. It is a Tensor with shape [batch_size * num_return_sequences, 1]. The data type is float32 or float64, the same as the parameters in the model.
- Return type:
tuple[Tensor]
Example
import paddle
from paddlenlp.transformers import (
    UnifiedTransformerLMHeadModel,
    UnifiedTransformerTokenizer,
)
# GenerationConfig is used in the later snippets; this import path is the one
# exposed by recent PaddleNLP releases and may vary across versions.
from paddlenlp.generation import GenerationConfig

paddle.seed(2)

# Initialize the model and tokenizer
model_name_or_path = 'unified_transformer-12L-cn-luge'
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

# Prepare the model inputs.
history = "早上好,今天空气质量不错。"
inputs = tokenizer.dialogue_encode(
    history,
    task_type='chitchat',
    add_start_token_as_response=True,
    return_tensors=True)
# Generate the sequence by using "greedy_search" strategy
ids, scores = model.generate(
    **inputs,
    decode_strategy="greedy_search")
print(ids.shape, scores.shape)
# [1, 3] [1, 1]
sequence_ids = ids.cpu().numpy().tolist()[0]
sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
response = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
print(response)
# 是的
# Generate 2 sequences by using "sampling" strategy (top_k=5)
generation_config = GenerationConfig(
    decode_strategy="sampling",
    top_k=5,
    num_return_sequences=2)
ids, scores = model.generate(
    **inputs,
    generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 7] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['天气好,心情也好', '你也是']
# Generate 2 sequences by using "beam_search" strategy (num_beams=5)
generation_config = GenerationConfig(
    decode_strategy="beam_search",
    num_beams=5,
    num_return_sequences=2)
ids, scores = model.generate(
    **inputs,
    generation_config=generation_config)
print(ids.shape, scores.shape)
# [2, 3] [2, 1]
response = []
for sequence_ids in ids.cpu().numpy().tolist():
    sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)
# ['是的', '嗯嗯']
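Finally, a minimal sketch for FasterT5; the checkpoint name and the wrapper's import path are assumptions, while the constructor and forward arguments follow the signatures documented above.
import paddle
from paddlenlp.transformers import T5ForConditionalGeneration, T5Tokenizer
from paddlenlp.ops import FasterT5  # import path assumed for the wrapper class

paddle.seed(2)

# Checkpoint name is illustrative; substitute a T5 checkpoint you have available.
model_name = 't5-base'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.eval()

# Wrap the pretrained model with the fast decoding op (see the class signature above).
fast_model = FasterT5(model, use_fp16_decoding=False)

text = "translate English to German: The house is wonderful."
inputs = tokenizer(text, return_tensors="pd")
# Arguments mirror FasterT5.forward above; sampling with top_k is the default strategy.
output = fast_model(
    input_ids=inputs["input_ids"],
    top_k=4,
    decode_strategy="sampling",
    max_length=64)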