decoding#

convert_params(fast_model, model, fuse_qkv=1, use_fp16=False, restore_data=False)[source]#

Convert parameters included in a Transformer layer (nn.TransformerEncoder or gpt.modeling.TransformerDecoder) from the original model to the format used by the faster model.

Parameters:
  • fast_model (Layer) -- The faster model object.

  • model (Layer) -- The Transformer layer. Currently it can be an instance of nn.TransformerEncoder or gpt.modeling.TransformerDecoder; nn.TransformerDecoder will be supported soon.

  • fuse_qkv (int) -- 0 for no fuse, 1 for fuse, 2 for fuse and delete the unfused parameters. If the environment variable PPFG_QKV_MEM_OPT is set and the q/k/v weights are fused, it will try to delete the original unfused weights. Note that rollback to the original model is no longer guaranteed on failure of the faster model if the original weights have been deleted. Defaults to 1.

  • use_fp16 (bool) -- Whether to use float16. (The default dtype may be given the highest priority in a future version.) Defaults to False.

  • restore_data (bool) -- If False, the weight values need to be reloaded. It should be True for models whose weights have already been loaded. Defaults to False.

Returns:

A defaultdict in which each value is a list of the converted parameters across all layers. For other parameters not included in the Transformer module, such as embeddings, you can convert them through the returned dict directly, e.g. params['word_emb'].append(), which performs the CPU/GPU and fp32/fp16 transfer automatically.

Return type:

defaultdict
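
A minimal usage sketch (the import path, the prebuilt `fast_model`/`model` objects, and the `word_embedding` attribute are assumptions for illustration):

```python
# Hedged sketch: hand the weights of a trained Transformer over to the
# corresponding faster model. Adapt the import path to your install.
from paddlenlp.ops.fast_transformer.transformer.decoding import convert_params

params = convert_params(
    fast_model,          # the faster model object, built elsewhere
    model,               # an nn.TransformerEncoder instance, weights loaded
    fuse_qkv=1,          # fuse q/k/v but keep the unfused originals
    use_fp16=False,
    restore_data=True,   # weights are already loaded, so restore their values
)

# Parameters outside the Transformer layers (e.g. embeddings) go through the
# returned defaultdict; append() does the CPU/GPU and fp32/fp16 transfer.
params["word_emb"].append(model.word_embedding.weight)  # hypothetical attribute
```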

class InferBase(use_fp16_decoding)[source]#
class InferTransformerDecoding(decoder, word_embedding, positional_embedding, linear, num_decoder_layers, n_head, d_model, bos_id=0, eos_id=1, decoding_strategy='beam_search', beam_size=4, topk=1, topp=0.0, max_out_len=256, diversity_rate=0.0, decoding_lib=None, use_fp16_decoding=False, rel_len=False, alpha=0.6)[source]#
forward(enc_output, memory_seq_lens, trg_word=None)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) -- unpacked tuple arguments

  • **kwargs (dict) -- unpacked dict arguments
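
A hedged sketch of building the fast decoder from a trained Transformer and running beam search (every attribute of the hypothetical `transformer` object, and the shapes of `enc_output`/`memory_seq_lens`, are assumptions):

```python
# Hedged sketch; not the canonical wiring.
fast_decoding = InferTransformerDecoding(
    decoder=transformer.decoder,                    # hypothetical attributes
    word_embedding=transformer.trg_word_embedding,
    positional_embedding=transformer.trg_pos_embedding,
    linear=transformer.linear,
    num_decoder_layers=6,
    n_head=8,
    d_model=512,
    decoding_strategy="beam_search",
    beam_size=4,
    max_out_len=256,
)

# enc_output: [batch_size, src_len, d_model]; memory_seq_lens: [batch_size]
ids = fast_decoding(enc_output, memory_seq_lens)
```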

class FTParaConf(tensor_para_size=None, layer_para_size=None, layer_para_batch_size=1)[source]#

Configuration for model parallelism in FastGeneration. Currently only GPT is supported. Please refer to Megatron for details.

Parameters:
  • tensor_para_size (int, optional) -- The size of tensor parallelism. If it is 1, tensor parallelism is not used. Defaults to 1.

  • layer_para_size (int, optional) -- The size of layer (pipeline) parallelism. If it is 1, layer parallelism is not used. Defaults to 1.

  • layer_para_batch_size (int, optional) -- The local batch size for pipeline parallelism. It is suggested to use batch_size // layer_para_size. Defaults to 1.
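
For example, a hedged configuration for 8 ranks split into 4-way tensor parallelism and 2-way layer parallelism could look like this:

```python
# Hedged sketch: 4-way tensor parallel x 2-way layer parallel (8 ranks).
conf = FTParaConf(
    tensor_para_size=4,
    layer_para_size=2,
    layer_para_batch_size=1,  # local batch size for the pipeline stages
)
```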

is_last_group()[source]#

For layer parallelism, only the process corresponding to the last layer group can get the prediction results. This method checks whether the current process corresponds to the last layer group.
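
A hedged usage sketch (the `output_ids` tensor and the tokenizer are hypothetical post-processing objects):

```python
# Only the last layer group holds valid predictions; skip output elsewhere.
conf = get_ft_para_conf()
if conf.is_last_group():
    print(tokenizer.convert_ids_to_tokens(output_ids[0].tolist()))
```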

is_load(i, num_layer)[source]#

Whether or not the given Transformer layer should be loaded into the current parallel model. For layer parallelism, there is no need to load the layers of other layer groups.

Parameters:
  • i (int) -- The index of the Transformer layer.

  • num_layer (int) -- The number of Transformer layers.

Returns:

Whether or not the given Transformer layer should be loaded into the current parallel model.

Return type:

bool
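
A hedged sketch of the intended pattern (`num_layer` and `load_layer` are hypothetical names):

```python
# Skip Transformer layers owned by other layer-parallel groups.
conf = get_ft_para_conf()
for i in range(num_layer):
    if not conf.is_load(i, num_layer):
        continue  # this layer belongs to another layer group
    load_layer(i)  # hypothetical per-layer weight loading routine
```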

slice_weight(weight, axis, phase=1, out_param=False)[source]#

Get the weight slice for tensor parallelism.

Parameters:
  • weight (Tensor or ndarray) -- The weight or bias to be sliced.

  • axis (int) -- The axis along which to slice.

  • phase (int, optional) -- 0 is used when creating the partial model during initialization and from_pretrained, while 1 is used when converting parameters to FastGeneration. No slicing is performed when it is 1, since the parameters have already been sliced in phase=0. Defaults to 1.

  • out_param (bool, optional) -- If True, weight should be a Parameter, and the output is forced to be a Parameter. Defaults to False.

Returns:

The sliced weight.

Return type:

Tensor or ndarray
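
A hedged sketch, assuming the slice splits the axis evenly across tensor-parallel ranks:

```python
import numpy as np

# Column-slice a weight while creating the partial model (phase=0).
# With tensor_para_size=4, each rank would keep a quarter of the 1536
# columns, i.e. a local shape of (512, 384).
weight = np.random.rand(512, 1536).astype("float32")  # toy weight
local_weight = get_ft_para_conf().slice_weight(weight, axis=1, phase=0)
```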

set_partial_model(is_partial_model)[source]#

This is used to set whether or not the current model holds only partial parameters (i.e. whether its parameters have been sliced for model parallelism).

Parameters:

is_partial_model (bool) -- Whether the current model holds only a partial set of the parameters rather than the complete parameters.

fit_partial_model(model, state_to_load)[source]#

Slice every value in state_to_load according to the shape of the corresponding parameter in model. This is used in from_pretrained to get sliced parameter values.

Parameters:
  • model (PretrainedModel) -- The model to use.

  • state_to_load (dict) -- The state dict including the complete parameter values of model.

Returns:

The state dict containing the adjusted values.

Return type:

dict
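
A hedged sketch of loading a complete checkpoint into a partial model (the checkpoint path is hypothetical):

```python
import paddle

# Slice a full state dict down to the shapes of this rank's parameters.
state_dict = paddle.load("gpt_full.pdparams")
state_dict = get_ft_para_conf().fit_partial_model(model, state_dict)
model.set_state_dict(state_dict)
```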

get_ft_para_conf()[source]#

Get the settings for model parallelism.

Returns:

The settings for model parallelism.

Return type:

FTParaConf

enable_ft_para(tensor_para_size=None, layer_para_size=None, layer_para_batch_size=1)[source]#

Enable model parallelism in FastGeneration with the given settings. Currently only GPT is supported. Please refer to Megatron for details.

Parameters:
  • tensor_para_size (int, optional) -- The size of tensor parallelism. If it is 1, tensor parallelism is not used. If it is None, the tensor parallel size is set to world_size / layer_para_size. Defaults to None.

  • layer_para_size (int, optional) -- The size of layer (pipeline) parallelism. If it is 1, layer parallelism is not used. If it is None, it is set to 1. Defaults to None.

  • layer_para_batch_size (int, optional) -- The local batch size for pipeline parallelism. It is suggested to use batch_size // layer_para_size. Defaults to 1.
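
A hedged sketch (the import paths, the launcher invocation, and the checkpoint name are assumptions):

```python
# Run with a distributed launcher, e.g.:
#   python -m paddle.distributed.launch --gpus "0,1,2,3" infer.py
from paddlenlp.ops import enable_ft_para
from paddlenlp.transformers import GPTLMHeadModel

# Enable parallelism before building the model so that parameters are
# created (and sliced) for the current rank.
enable_ft_para(tensor_para_size=2, layer_para_size=2)
model = GPTLMHeadModel.from_pretrained("gpt2-en")
```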

class InferOptDecoding(model: OPTForCausalLM, decoding_lib=None, use_fp16_decoding=False)[source]#

Extract the inference model parameters and feed them into the CUDA decoder.

forward(input_ids, mem_seq_len, attention_mask=None, topk=4, topp=0.0, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, max_out_len=256, temperature=1)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) -- unpacked tuple arguments

  • **kwargs (dict) -- unpacked dict arguments

class InferGptDecoding(model, decoding_lib=None, use_fp16_decoding=False)[source]#
forward(input_ids, mem_seq_len, attention_mask=None, topk=4, topp=0.0, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, max_out_len=256, temperature=1)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) -- unpacked tuple arguments

  • **kwargs (dict) -- unpacked dict arguments
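
A hedged generation sketch (the `gpt_model` object and the token ids are hypothetical):

```python
import paddle

decoding = InferGptDecoding(model=gpt_model, use_fp16_decoding=False)

input_ids = paddle.to_tensor([[464, 3280, 318]], dtype="int32")  # toy prompt
mem_seq_len = paddle.to_tensor([3], dtype="int32")               # prompt length

output_ids = decoding(
    input_ids,
    mem_seq_len,
    topk=4,           # sample from the top-4 candidates
    topp=0.0,
    max_out_len=64,
    eos_token_id=50256,
)
```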

class InferUnifiedDecoding(model, decoding_lib=None, use_fp16_decoding=False, logits_mask=None, n_head=8, hidden_dims=512, size_per_head=64, n_layer=6, unk_id=0, mask_id=30000, normalize_before=True, hidden_act='gelu')[source]#
forward(input_ids, attn_mask, memory_seq_lens, type_id, decoder_type_id, role_id=None, decoder_role_id=None, position_id=None, decoder_position_id=None, beam_size=4, topk=4, topp=0.0, decoding_strategy='greedy_search', max_out_len=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=1.0, length_penalty=1.0, diversity_rate=0.0, pos_bias=True, rel_len=False, early_stopping=False, min_length=0)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) -- unpacked tuple arguments

  • **kwargs (dict) -- unpacked dict arguments

class InferMIRODecoding(model, decoding_lib=None, use_fp16_decoding=False, logits_mask=None, n_head=8, hidden_dims=512, size_per_head=64, n_layer=6, unk_id=0, mask_id=30000, normalize_before=True, hidden_act='relu')[source]#
forward(input_ids, attn_mask, memory_seq_lens, type_id, decoder_type_id, role_id=None, decoder_role_id=None, position_id=None, decoder_position_id=None, beam_size=4, topk=4, topp=0.0, decoding_strategy='greedy_search', max_out_len=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=1.0, length_penalty=1.0, diversity_rate=0.0, pos_bias=True, rel_len=False, early_stopping=False, min_length=0)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) -- unpacked tuple arguments

  • **kwargs (dict) -- unpacked dict arguments

class InferBartDecoding(model, decoding_lib=None, use_fp16_decoding=False)[source]#
forward(enc_output, memory_seq_lens, beam_size=4, top_k=1, top_p=0.0, temperature=1.0, decoding_strategy='beam_search_v3', max_out_len=256, min_out_len=256, diversity_rate=0.0, rel_len=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, alpha=0.6, early_stopping=False)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) -- unpacked tuple arguments

  • **kwargs (dict) -- unpacked dict arguments
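
A hedged sketch of encoder-decoder beam search (the `bart_model` object, its encoder attribute, and `pad_id` are hypothetical):

```python
import paddle

decoding = InferBartDecoding(model=bart_model)

# Run the encoder once, then let the fast decoder generate from its output.
enc_output = bart_model.bart.encoder(input_ids)  # hypothetical attribute path
memory_seq_lens = paddle.sum(paddle.cast(input_ids != pad_id, "int32"), axis=-1)

output_ids = decoding(
    enc_output,
    memory_seq_lens,
    beam_size=4,
    decoding_strategy="beam_search_v3",
    max_out_len=128,
    alpha=0.6,   # length penalty factor
)
```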

class InferMBartDecoding(model, decoding_lib=None, use_fp16_decoding=False, hidden_act='gelu')[source]#
forward(enc_output, memory_seq_lens, trg_word=None, beam_size=4, top_k=1, top_p=0.0, decoding_strategy='beam_search_v3', max_out_len=256, diversity_rate=0.0, rel_len=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, alpha=0.6, temperature=1.0, early_stopping=False)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) -- unpacked tuple arguments

  • **kwargs (dict) -- unpacked dict arguments

convert_gptj_params(fast_model, model, fuse_qkv=1, use_fp16=False, restore_data=False, permutation=None)[source]#

Convert parameters included in a Transformer layer from the original model to the format used by the faster model.

Parameters:
  • fast_model (Layer) -- The faster model object.

  • model (Layer) -- The Transformer layer.

  • fuse_qkv (int) -- 0 for no fuse, 1 for fuse, 2 for fuse and delete the unfused parameters. If the environment variable PPFG_QKV_MEM_OPT is set and the q/k/v weights are fused, it will try to delete the original unfused weights. Note that rollback to the original model is no longer guaranteed on failure of the faster model if the original weights have been deleted. Defaults to 1.

  • use_fp16 (bool) -- Whether to use float16. (The default dtype may be given the highest priority in a future version.) Defaults to False.

  • restore_data (bool) -- If False, the weight values need to be reloaded. It should be True for models whose weights have already been loaded. Defaults to False.

Returns:

A defaultdict in which each value is a list of the converted parameters across all layers. For other parameters not included in the Transformer module, such as embeddings, you can convert them through the returned dict directly, e.g. params['word_emb'].append(), which performs the CPU/GPU and fp32/fp16 transfer automatically.

Return type:

defaultdict

class InferGptJDecoding(model, decoding_lib=None, use_fp16_decoding=False, transpose_qkv=False)[source]#
forward(input_ids, mem_seq_len, attention_mask=None, topk=4, topp=0.0, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, max_out_len=256, temperature=1, repetition_penalty=1.0, min_length=0)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) -- unpacked tuple arguments

  • **kwargs (dict) -- unpacked dict arguments

class InferPegasusDecoding(model, decoding_lib=None, use_fp16_decoding=False, hidden_act='gelu')[source]#
forward(enc_output, memory_seq_lens, beam_size=4, top_k=1, top_p=0.0, decoding_strategy='beam_search_v3', max_out_len=256, min_out_len=256, diversity_rate=0.0, rel_len=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, alpha=0.6, temperature=1.0, early_stopping=False, forced_eos_token_id=None)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) -- unpacked tuple arguments

  • **kwargs (dict) -- unpacked dict arguments

class InferT5Decoding(model, decoding_lib=None, use_fp16_decoding=False)[source]#
forward(enc_output, memory_seq_lens, beam_size=4, top_k=1, top_p=0.0, decoding_strategy='beam_search_v3', max_out_len=256, diversity_rate=0.0, rel_len=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, alpha=0.6, temperature=1.0, early_stopping=False)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) -- unpacked tuple arguments

  • **kwargs (dict) -- unpacked dict arguments