decoding#
- convert_params(fast_model, model, fuse_qkv=1, use_fp16=False, restore_data=False)[source]#
Convert parameters included in a Transformer layer (nn.TransformerEncoder and gpt.modeling.TransformerDecoder) from original models to the format of faster models.
- Parameters:
fast_model (Layer) – The faster model object.
model (Layer) – The Transformer layer. Currently it can be an instance of nn.TransformerEncoder or gpt.modeling.TransformerDecoder; nn.TransformerDecoder will be supported soon.
fuse_qkv (int) – 0 for no fuse, 1 for fuse, 2 for fuse and delete the unfused parameters. If the environment variable PPFG_QKV_MEM_OPT is set and the q/k/v weights are fused, it will try to delete the original unfused weights. Note that rollback to the original model is no longer guaranteed if the faster model fails after the original weights have been deleted. Default to 1.
use_fp16 (bool) – Whether to use float16. The default dtype may be given higher priority in the future. Default to False.
restore_data (bool) – If False, the weight values need to be reloaded. It should be True for models whose weights have already been loaded. Default to False.
- Returns:
- Each value is a list containing the converted parameters of all layers. Other parameters not included in the Transformer module, such as embeddings, can be converted by appending to the returned dict params directly, e.g. params['word_emb'].append(), which performs CPU/GPU and fp32/fp16 transfer automatically (a usage sketch follows this entry).
- Return type:
defaultdict
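The sketch below illustrates one way convert_params might be used. The import path, the fast_model object, and the embedding handling are assumptions for illustration only, not part of this API.

```python
import paddle.nn as nn

# The import path below is an assumption; convert_params may live under a
# different module in your PaddleNLP version.
from paddlenlp.ops.fast_transformer.transformer.decoding import convert_params

# Original Paddle encoder whose Transformer-layer weights will be converted.
encoder_layer = nn.TransformerEncoderLayer(512, 8, 2048)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
word_emb = nn.Embedding(30000, 512)

# `fast_model` stands for the FastGeneration counterpart built elsewhere (placeholder).
params = convert_params(fast_model, encoder, fuse_qkv=1, use_fp16=False, restore_data=True)

# Parameters outside the Transformer layers (e.g. embeddings) can be appended directly;
# CPU/GPU and fp32/fp16 transfer is handled automatically.
params["word_emb"].append(word_emb.weight)
```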
- class InferTransformerDecoding(decoder, word_embedding, positional_embedding, linear, num_decoder_layers, n_head, d_model, bos_id=0, eos_id=1, decoding_strategy='beam_search', beam_size=4, topk=1, topp=0.0, max_out_len=256, diversity_rate=0.0, decoding_lib=None, use_fp16_decoding=False, rel_len=False, alpha=0.6)[source]#
- class FTParaConf(tensor_para_size=None, layer_para_size=None, layer_para_batch_size=1)[source]#
Configurations for model parallel in FastGeneration. Currently only GPT is supported. Please refer to Megatron for details.
- Parameters:
tensor_para_size (int, optional) – The size for tensor parallel. If it is 1, tensor parallel would not be used. Default to 1.
layer_para_size (int, optional) – The size for layer parallel. If it is 1, layer parallel would not be used. Default to 1.
layer_para_batch_size (int, optional) – The local batch size for pipeline parallel. It is suggested to use batch_size // layer_para_size. Default to 1.
- is_last_group()[source]#
For layer parallel, only the processes corresponding to the last layer group can get the prediction results. This method checks whether the current process belongs to the last layer group.
- is_load(i, num_layer)[source]#
Whether or not the given Transformer layer should be loaded into the current parallel model. For layer parallel, there is no need to load layers belonging to other layer groups (see the sketch after this entry).
- Parameters:
i (int) – The index of Transformer layer.
num_layer (int) – The number of Transformer layers.
- Returns:
- Indicates whether or not the given Transformer layer should be loaded into the current parallel model.
- Return type:
bool
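For illustration, a hedged sketch of how is_load might drive layer loading under layer parallel; load_layer_weights is a hypothetical helper and the import path is an assumption.

```python
from paddlenlp.ops import get_ft_para_conf  # import path is an assumption

conf = get_ft_para_conf()
num_layer = 24
for i in range(num_layer):
    # Only load the Transformer layers that belong to this process's layer group.
    if conf.is_load(i, num_layer):
        load_layer_weights(i)  # hypothetical helper: load weights for layer i on this process
```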
- slice_weight(weight, axis, phase=1, out_param=False)[source]#
Get the weight slice for tensor parallel; a conceptual sketch follows this entry.
- Parameters:
weight (Tensor or ndarray) – The weight or bias to be sliced.
axis (int) – The axis to perform slice.
phase (int, optional) – 0 is used when creating the partial model during initialization and from_pretrained, while 1 is used when converting parameters to FastGeneration. No slice is performed if it is 1, since parameters have already been sliced in phase=0. Default to 1.
out_param (bool, optional) – If True, weight should be a Parameter and the output is forced to be a Parameter. Default to False.
- Returns:
The sliced weight.
- Return type:
Tensor or ndarray
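As a conceptual illustration (not the actual implementation), tensor parallel keeps only the current rank's shard of a weight along the chosen axis; the NumPy sketch below mirrors what slice_weight produces.

```python
import numpy as np

tensor_para_size = 2
tensor_para_rank = 0  # rank of this process within the tensor-parallel group

# e.g. an FFN weight of shape [d_model, 4 * d_model]
weight = np.random.rand(512, 2048).astype("float32")

# Keep only this rank's shard along axis=1.
shard = np.split(weight, tensor_para_size, axis=1)[tensor_para_rank]
print(shard.shape)  # (512, 1024)
```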
- set_partial_model(is_partial_model)[source]#
Set whether or not the current model is a partial model, i.e. does not hold complete parameters.
- Parameters:
is_partial_model (bool) – Whether the current model is a partial model without complete parameters.
- fit_partial_model(model, state_to_load)[source]#
Slice every value included in state_to_load according to the shape of the corresponding parameter in model. This is used in from_pretrained to get sliced parameter values.
- Parameters:
model (PretrainedModel) – The model to use.
state_to_load (dict) – The state dict including complete parameter values of model.
- Returns:
The state dict containing the adjusted values.
- Return type:
dict
- get_ft_para_conf()[source]#
Get settings for model parallel.
- Returns:
The settings for model parallel.
- Return type:
FTParaConf
- enable_ft_para(tensor_para_size=None, layer_para_size=None, layer_para_batch_size=1)[source]#
Enable model parallel with the given settings in FastGeneration. Currently only GPT is supported. Please refer to Megatron for details. A usage sketch follows the parameter list.
- Parameters:
tensor_para_size (int, optional) – The size for tensor parallel. If it is 1, tensor parallel would not be used. When it is None, the tensor parallel size would be set as world_size / layer_para_size. Default to None.
layer_para_size (int, optional) – The size for layer parallel. If it is 1, layer parallel would not be used. When it is None, it would be set as 1. Default to None.
layer_para_batch_size (int, optional) – The local batch size for pipeline parallel. It is suggested to use batch_size // layer_para_size. Default to 1.
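A minimal usage sketch, assuming the script is launched with one process per GPU (e.g. via paddle.distributed.launch) and that enable_ft_para and get_ft_para_conf are importable from paddlenlp.ops (the import path is an assumption).

```python
from paddlenlp.ops import enable_ft_para, get_ft_para_conf  # import path is an assumption

# 2-way tensor parallel x 2-way layer (pipeline) parallel over 4 processes.
enable_ft_para(tensor_para_size=2, layer_para_size=2, layer_para_batch_size=1)

# ... build the GPT model and run generation here ...

# Only processes in the last layer group obtain the prediction results.
if get_ft_para_conf().is_last_group():
    pass  # gather / print the generated results
```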
- class InferOptDecoding(model: OPTForCausalLM, decoding_lib=None, use_fp16_decoding=False)[source]#
Extract inference model parameters and feed them into the CUDA decoder.
- forward(input_ids, mem_seq_len, attention_mask=None, topk=4, topp=0.0, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, max_out_len=256, temperature=1)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- class InferGptDecoding(model, decoding_lib=None, use_fp16_decoding=False)[source]#
- forward(input_ids, mem_seq_len, attention_mask=None, topk=4, topp=0.0, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, max_out_len=256, temperature=1)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- class InferUnifiedDecoding(model, decoding_lib=None, use_fp16_decoding=False, logits_mask=None, n_head=8, hidden_dims=512, size_per_head=64, n_layer=6, unk_id=0, mask_id=30000, normalize_before=True, hidden_act='gelu')[source]#
- forward(input_ids, attn_mask, memory_seq_lens, type_id, decoder_type_id, role_id=None, decoder_role_id=None, position_id=None, decoder_position_id=None, beam_size=4, topk=4, topp=0.0, decoding_strategy='greedy_search', max_out_len=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=1.0, length_penalty=1.0, diversity_rate=0.0, pos_bias=True, rel_len=False, early_stopping=False, min_length=0)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- class InferMIRODecoding(model, decoding_lib=None, use_fp16_decoding=False, logits_mask=None, n_head=8, hidden_dims=512, size_per_head=64, n_layer=6, unk_id=0, mask_id=30000, normalize_before=True, hidden_act='relu')[source]#
- forward(input_ids, attn_mask, memory_seq_lens, type_id, decoder_type_id, role_id=None, decoder_role_id=None, position_id=None, decoder_position_id=None, beam_size=4, topk=4, topp=0.0, decoding_strategy='greedy_search', max_out_len=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=1.0, length_penalty=1.0, diversity_rate=0.0, pos_bias=True, rel_len=False, early_stopping=False, min_length=0)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- class InferBartDecoding(model, decoding_lib=None, use_fp16_decoding=False)[source]#
- forward(enc_output, memory_seq_lens, beam_size=4, top_k=1, top_p=0.0, temperature=1.0, decoding_strategy='beam_search_v3', max_out_len=256, min_out_len=256, diversity_rate=0.0, rel_len=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, alpha=0.6, early_stopping=False)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- class InferMBartDecoding(model, decoding_lib=None, use_fp16_decoding=False, hidden_act='gelu')[source]#
- forward(enc_output, memory_seq_lens, trg_word=None, beam_size=4, top_k=1, top_p=0.0, decoding_strategy='beam_search_v3', max_out_len=256, diversity_rate=0.0, rel_len=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, alpha=0.6, temperature=1.0, early_stopping=False)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- convert_gptj_params(fast_model, model, fuse_qkv=1, use_fp16=False, restore_data=False, permutation=None)[source]#
Convert parameters included in a Transformer layer from original models to the format of faster models.
- Parameters:
fast_model (Layer) – The faster model object.
model (Layer) – The Transformer layer.
fuse_qkv (int) – 0 for no fuse, 1 for fuse, 2 for fuse and delete the unfused parameters. If the environment variable PPFG_QKV_MEM_OPT is set and the q/k/v weights are fused, it will try to delete the original unfused weights. Note that rollback to the original model is no longer guaranteed if the faster model fails after the original weights have been deleted. Default to 1.
use_fp16 (bool) – Whether to use float16. The default dtype may be given higher priority in the future. Default to False.
restore_data (bool) – If False, the weight values need to be reloaded. It should be True for models whose weights have already been loaded. Default to False.
- Returns:
- Each value is a list containing the converted parameters of all layers. Other parameters not included in the Transformer module, such as embeddings, can be converted by appending to the returned dict params directly, e.g. params['word_emb'].append(), which performs CPU/GPU and fp32/fp16 transfer automatically.
- Return type:
defaultdict
- class InferGptJDecoding(model, decoding_lib=None, use_fp16_decoding=False, transpose_qkv=False)[source]#
- forward(input_ids, mem_seq_len, attention_mask=None, topk=4, topp=0.0, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, max_out_len=256, temperature=1, repetition_penalty=1.0, min_length=0)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- class InferPegasusDecoding(model, decoding_lib=None, use_fp16_decoding=False, hidden_act='gelu')[source]#
- forward(enc_output, memory_seq_lens, beam_size=4, top_k=1, top_p=0.0, decoding_strategy='beam_search_v3', max_out_len=256, min_out_len=256, diversity_rate=0.0, rel_len=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, alpha=0.6, temperature=1.0, early_stopping=False, forced_eos_token_id=None)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments
- class InferT5Decoding(model, decoding_lib=None, use_fp16_decoding=False)[source]#
- forward(enc_output, memory_seq_lens, beam_size=4, top_k=1, top_p=0.0, decoding_strategy='beam_search_v3', max_out_len=256, diversity_rate=0.0, rel_len=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, alpha=0.6, temperature=1.0, early_stopping=False)[source]#
Defines the computation performed at every call. Should be overridden by all subclasses.
- Parameters:
*inputs (tuple) – unpacked tuple arguments
**kwargs (dict) – unpacked dict arguments