decoding
- convert_params(fast_model, model, fuse_qkv=1, use_fp16=False, restore_data=False)[source]
  Convert parameters included in a Transformer layer (nn.TransformerEncoder and gpt.modeling.TransformerDecoder) from the original model to the format of faster models.
  - Parameters:
    - fast_model (Layer) -- The faster model object.
    - model (Layer) -- The Transformer layer. Currently it can be an instance of nn.TransformerEncoder or gpt.modeling.TransformerDecoder; nn.TransformerDecoder will be supported soon.
    - fuse_qkv (int) -- 0 for no fuse, 1 for fuse, 2 for fuse and delete the unfused parameters. If the environment variable PPFG_QKV_MEM_OPT is set and the q/k/v weights are fused, it will try to delete the original unfused weights. Note that once the original weights are deleted, rollback to the original model is no longer guaranteed if the faster model fails. Default to 1.
    - use_fp16 (bool) -- Whether to use float16. Maybe we should use the default dtype as the highest priority later. Default to False.
    - restore_data (bool) -- If False, the weight values need to be reloaded. It should be True for models whose weights have been loaded. Default to False.
  - Returns:
    Each value is a list including the converted parameters in all layers. Other parameters not included in the Transformer module, such as embeddings, can be converted by appending them to the returned dict params through params['word_emb'].append() directly, which performs CPU/GPU and fp32/fp16 transfer automatically.
  - Return type:
    defaultdict
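A minimal usage sketch follows, assuming convert_params is imported from this module; fast_encoder and word_emb_weight are placeholder names for an already constructed faster-model object and an embedding weight, not part of this API.

```python
# Hedged sketch: fast_encoder and word_emb_weight are illustrative placeholders.
import paddle.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# fast_encoder stands for the corresponding faster-model object.
params = convert_params(fast_encoder, encoder, fuse_qkv=1, use_fp16=False,
                        restore_data=True)

# Embeddings are not part of the Transformer layers; hand them over through
# the returned defaultdict, which transfers device/dtype automatically.
params['word_emb'].append(word_emb_weight)
```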
- class InferTransformerDecoding(decoder, word_embedding, positional_embedding, linear, num_decoder_layers, n_head, d_model, bos_id=0, eos_id=1, decoding_strategy='beam_search', beam_size=4, topk=1, topp=0.0, max_out_len=256, diversity_rate=0.0, decoding_lib=None, use_fp16_decoding=False, rel_len=False, alpha=0.6)[source]
- class FTParaConf(tensor_para_size=None, layer_para_size=None, layer_para_batch_size=1)[source]
  Configurations for model parallelism in FastGeneration. Currently only GPT is supported. Please refer to Megatron for details.
  - Parameters:
    - tensor_para_size (int, optional) -- The size for tensor parallelism. If it is 1, tensor parallelism is not used. Default to 1.
    - layer_para_size (int, optional) -- The size for layer parallelism. If it is 1, layer parallelism is not used. Default to 1.
    - layer_para_batch_size (int, optional) -- The local batch size for pipeline parallelism. It is suggested to use batch_size // layer_para_size. Default to 1.
- is_last_group()[source]
  For layer parallelism, only the process corresponding to the last layer group can get the prediction results. This method checks whether the current process corresponds to the last layer group.
- is_load(i, num_layer)[source]
  Whether the given Transformer layer should be loaded into the current parallel model. For layer parallelism, layers belonging to other layer groups need not be loaded (see the sketch after this entry).
  - Parameters:
    - i (int) -- The index of the Transformer layer.
    - num_layer (int) -- The number of Transformer layers.
  - Returns:
    Whether the given Transformer layer should be loaded into the current parallel model.
  - Return type:
    bool
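A hedged sketch of that check inside a loading loop; get_ft_para_conf is documented later in this module, while load_layer_weights is a placeholder helper introduced only for illustration.

```python
# Hedged sketch: skip layers owned by other layer groups when loading.
conf = get_ft_para_conf()
num_layers = 24  # total Transformer layers in the model (example value)
for i in range(num_layers):
    if conf.is_load(i, num_layers):
        load_layer_weights(i)  # placeholder helper for this illustration
```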
- slice_weight(weight, axis, phase=1, out_param=False)[source]
  Get the weight slice for tensor parallelism (see the sketch after this entry).
  - Parameters:
    - weight (Tensor or ndarray) -- The weight or bias to be sliced.
    - axis (int) -- The axis along which to slice.
    - phase (int, optional) -- 0 is used when creating the partial model at initialization and in from_pretrained, while 1 is used when converting parameters to FastGeneration. No slice is performed if it is 1, since the parameters have already been sliced in phase=0. Default to 1.
    - out_param (bool, optional) -- If True, weight should be a Parameter, and the output is forced to be a Parameter. Default to False.
  - Returns:
    The sliced weight.
  - Return type:
    Tensor or ndarray
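For intuition, here is a minimal NumPy sketch of even slicing along one axis, the core operation behind tensor-parallel partitioning; slice_for_rank is an illustrative helper, not part of this API.

```python
import numpy as np

def slice_for_rank(weight, axis, rank, tensor_para_size):
    """Return this rank's even chunk of `weight` along `axis` (illustration only)."""
    chunk = weight.shape[axis] // tensor_para_size
    index = [slice(None)] * weight.ndim
    index[axis] = slice(rank * chunk, (rank + 1) * chunk)
    return weight[tuple(index)]

w = np.arange(24, dtype=np.float32).reshape(4, 6)
print(slice_for_rank(w, axis=1, rank=0, tensor_para_size=2).shape)  # (4, 3)
```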
- set_partial_model(is_partial_model)[source]
  Set whether the current model holds only partial (sliced) parameters rather than the complete set.
  - Parameters:
    - is_partial_model (bool) -- Whether the current model holds only partial parameters.
- fit_partial_model(model, state_to_load)[source]
  Slice every value in state_to_load according to the shape of the corresponding parameter in model. This is used in from_pretrained to get sliced parameter values (a sketch follows this entry).
  - Parameters:
    - model (PretrainedModel) -- The model to use.
    - state_to_load (dict) -- The state dict including the complete parameter values of the model.
  - Returns:
    The state dict containing the adjusted values.
  - Return type:
    dict
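A hedged sketch of the shape-matching idea, reusing the slice_for_rank helper from the previous sketch; the actual implementation may differ in details.

```python
def fit_state_dict(model_state, full_state, rank, tensor_para_size):
    # For each target parameter, slice the full value along the first axis
    # whose size disagrees with the target shape (illustration only).
    fitted = {}
    for name, target in model_state.items():
        value = full_state[name]
        for axis, (t, f) in enumerate(zip(target.shape, value.shape)):
            if t != f:
                value = slice_for_rank(value, axis, rank, tensor_para_size)
                break
        fitted[name] = value
    return fitted
```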
- get_ft_para_conf()[source]
  Get the settings for model parallelism.
  - Returns:
    The settings for model parallelism.
  - Return type:
    FTParaConf
- enable_ft_para(tensor_para_size=None, layer_para_size=None, layer_para_batch_size=1)[source]
  Enable model parallelism in FastGeneration with the given settings. Currently only GPT is supported. Please refer to Megatron for details (a usage sketch follows this entry).
  - Parameters:
    - tensor_para_size (int, optional) -- The size for tensor parallelism. If it is 1, tensor parallelism is not used. When it is None, the tensor parallel size is set to world_size / layer_para_size. Default to None.
    - layer_para_size (int, optional) -- The size for layer parallelism. If it is 1, layer parallelism is not used. When it is None, it is set to 1. Default to None.
    - layer_para_batch_size (int, optional) -- The local batch size for pipeline parallelism. It is suggested to use batch_size // layer_para_size. Default to 1.
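A hedged sketch of enabling 2-way tensor and 2-way layer parallelism across four ranks; model construction and generation are elided.

```python
# Call before building the GPT model; launch one process per GPU (4 here).
enable_ft_para(tensor_para_size=2, layer_para_size=2, layer_para_batch_size=1)

# ... build/load the GPT model and run generation here ...

# Only processes in the last layer group receive the prediction results.
if get_ft_para_conf().is_last_group():
    pass  # gather/print results on these ranks
```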
- class InferOptDecoding(model: OPTForCausalLM, decoding_lib=None, use_fp16_decoding=False)[source]
  Extract the inference model parameters and feed them into the CUDA decoder.
- forward(input_ids, mem_seq_len, attention_mask=None, topk=4, topp=0.0, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, max_out_len=256, temperature=1)[source]
  Defines the computation performed at every call. Should be overridden by all subclasses.
  - Parameters:
    - *inputs (tuple) -- unpacked tuple arguments
    - **kwargs (dict) -- unpacked dict arguments
- class InferGptDecoding(model, decoding_lib=None, use_fp16_decoding=False)[source]
- forward(input_ids, mem_seq_len, attention_mask=None, topk=4, topp=0.0, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, max_out_len=256, temperature=1)[source]
  Defines the computation performed at every call. Should be overridden by all subclasses.
  - Parameters:
    - *inputs (tuple) -- unpacked tuple arguments
    - **kwargs (dict) -- unpacked dict arguments
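A hedged sketch of driving this decoder-only forward; decoding stands for a constructed InferGptDecoding instance, and the token ids below are toy values, not real vocabulary entries.

```python
import paddle

input_ids = paddle.to_tensor([[2, 318, 257]], dtype="int32")  # toy prompt ids
mem_seq_len = paddle.to_tensor([3], dtype="int32")            # prompt lengths

# topk=4 enables top-k sampling; topp=0.0 leaves nucleus sampling off.
output_ids = decoding(input_ids, mem_seq_len, topk=4, topp=0.0,
                      max_out_len=64, temperature=1.0)
```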
- class InferUnifiedDecoding(model, decoding_lib=None, use_fp16_decoding=False, logits_mask=None, n_head=8, hidden_dims=512, size_per_head=64, n_layer=6, unk_id=0, mask_id=30000, normalize_before=True, hidden_act='gelu')[source]
- forward(input_ids, attn_mask, memory_seq_lens, type_id, decoder_type_id, role_id=None, decoder_role_id=None, position_id=None, decoder_position_id=None, beam_size=4, topk=4, topp=0.0, decoding_strategy='greedy_search', max_out_len=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=1.0, length_penalty=1.0, diversity_rate=0.0, pos_bias=True, rel_len=False, early_stopping=False, min_length=0)[source]
  Defines the computation performed at every call. Should be overridden by all subclasses.
  - Parameters:
    - *inputs (tuple) -- unpacked tuple arguments
    - **kwargs (dict) -- unpacked dict arguments
- class InferMIRODecoding(model, decoding_lib=None, use_fp16_decoding=False, logits_mask=None, n_head=8, hidden_dims=512, size_per_head=64, n_layer=6, unk_id=0, mask_id=30000, normalize_before=True, hidden_act='relu')[source]
- forward(input_ids, attn_mask, memory_seq_lens, type_id, decoder_type_id, role_id=None, decoder_role_id=None, position_id=None, decoder_position_id=None, beam_size=4, topk=4, topp=0.0, decoding_strategy='greedy_search', max_out_len=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=1.0, length_penalty=1.0, diversity_rate=0.0, pos_bias=True, rel_len=False, early_stopping=False, min_length=0)[source]
  Defines the computation performed at every call. Should be overridden by all subclasses.
  - Parameters:
    - *inputs (tuple) -- unpacked tuple arguments
    - **kwargs (dict) -- unpacked dict arguments
- class InferBartDecoding(model, decoding_lib=None, use_fp16_decoding=False)[source]
- forward(enc_output, memory_seq_lens, beam_size=4, top_k=1, top_p=0.0, temperature=1.0, decoding_strategy='beam_search_v3', max_out_len=256, min_out_len=256, diversity_rate=0.0, rel_len=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, alpha=0.6, early_stopping=False)[source]
  Defines the computation performed at every call. Should be overridden by all subclasses.
  - Parameters:
    - *inputs (tuple) -- unpacked tuple arguments
    - **kwargs (dict) -- unpacked dict arguments
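A hedged sketch of a sequence-to-sequence call with beam search; decoding stands for a constructed InferBartDecoding instance, and enc_output/memory_seq_lens are assumed to come from the BART encoder.

```python
# Beam search over the encoder output; alpha is the length penalty.
output = decoding(enc_output, memory_seq_lens,
                  beam_size=4,
                  decoding_strategy="beam_search_v3",
                  alpha=0.6,
                  early_stopping=False)
```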
- class InferMBartDecoding(model, decoding_lib=None, use_fp16_decoding=False, hidden_act='gelu')[source]
- forward(enc_output, memory_seq_lens, trg_word=None, beam_size=4, top_k=1, top_p=0.0, decoding_strategy='beam_search_v3', max_out_len=256, diversity_rate=0.0, rel_len=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, alpha=0.6, temperature=1.0, early_stopping=False)[source]
  Defines the computation performed at every call. Should be overridden by all subclasses.
  - Parameters:
    - *inputs (tuple) -- unpacked tuple arguments
    - **kwargs (dict) -- unpacked dict arguments
- convert_gptj_params(fast_model, model, fuse_qkv=1, use_fp16=False, restore_data=False, permutation=None)[source]
  Convert parameters included in a Transformer layer from the original model to the format of faster models.
  - Parameters:
    - fast_model (Layer) -- The faster model object.
    - model (Layer) -- The Transformer layer.
    - fuse_qkv (int) -- 0 for no fuse, 1 for fuse, 2 for fuse and delete the unfused parameters. If the environment variable PPFG_QKV_MEM_OPT is set and the q/k/v weights are fused, it will try to delete the original unfused weights. Note that once the original weights are deleted, rollback to the original model is no longer guaranteed if the faster model fails. Default to 1.
    - use_fp16 (bool) -- Whether to use float16. Maybe we should use the default dtype as the highest priority later. Default to False.
    - restore_data (bool) -- If False, the weight values need to be reloaded. It should be True for models whose weights have been loaded. Default to False.
  - Returns:
    Each value is a list including the converted parameters in all layers. Other parameters not included in the Transformer module, such as embeddings, can be converted by appending them to the returned dict params through params['word_emb'].append() directly, which performs CPU/GPU and fp32/fp16 transfer automatically.
  - Return type:
    defaultdict
- class InferGptJDecoding(model, decoding_lib=None, use_fp16_decoding=False, transpose_qkv=False)[source]
- forward(input_ids, mem_seq_len, attention_mask=None, topk=4, topp=0.0, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, max_out_len=256, temperature=1, repetition_penalty=1.0, min_length=0)[source]
  Defines the computation performed at every call. Should be overridden by all subclasses.
  - Parameters:
    - *inputs (tuple) -- unpacked tuple arguments
    - **kwargs (dict) -- unpacked dict arguments
- class InferPegasusDecoding(model, decoding_lib=None, use_fp16_decoding=False, hidden_act='gelu')[source]
- forward(enc_output, memory_seq_lens, beam_size=4, top_k=1, top_p=0.0, decoding_strategy='beam_search_v3', max_out_len=256, min_out_len=256, diversity_rate=0.0, rel_len=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, alpha=0.6, temperature=1.0, early_stopping=False, forced_eos_token_id=None)[source]
  Defines the computation performed at every call. Should be overridden by all subclasses.
  - Parameters:
    - *inputs (tuple) -- unpacked tuple arguments
    - **kwargs (dict) -- unpacked dict arguments
- class InferT5Decoding(model, decoding_lib=None, use_fp16_decoding=False)[source]
- forward(enc_output, memory_seq_lens, beam_size=4, top_k=1, top_p=0.0, decoding_strategy='beam_search_v3', max_out_len=256, diversity_rate=0.0, rel_len=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, alpha=0.6, temperature=1.0, early_stopping=False)[source]
  Defines the computation performed at every call. Should be overridden by all subclasses.
  - Parameters:
    - *inputs (tuple) -- unpacked tuple arguments
    - **kwargs (dict) -- unpacked dict arguments