decoding#

convert_params(fast_model, model, fuse_qkv=1, use_fp16=False, restore_data=False)[source]#

Convert the parameters of a Transformer layer (nn.TransformerEncoder or gpt.modeling.TransformerDecoder) from the original model into the format used by the faster model.

Parameters:
  • fast_model (Layer) – The faster model object.

  • model (Layer) – The Transformer layer. Currently it can be an instance of nn.TransformerEncoder or gpt.modeling.TransformerDecoder; nn.TransformerDecoder will be supported soon.

  • fuse_qkv (int) – 0 for no fusion, 1 for fusion, 2 for fusion plus deletion of the unfused parameters. If the environment variable PPFG_QKV_MEM_OPT is set and the q/k/v weights are fused, the original unfused weights are deleted. Note that once the original weights are deleted, rollback to the original model is no longer guaranteed if the faster model fails. Defaults to 1.

  • use_fp16 (bool) – Whether to use float16. (The default dtype may be given priority over this flag in a future release.) Defaults to False.

  • restore_data (bool) – If False, the weight values need to be reloaded afterwards; set it to True for models whose weights have already been loaded. Defaults to False.

Returns:

Each value is a list of the converted parameters across all layers. For parameters not included in the Transformer module that also need conversion, such as embeddings, you can append them through the returned dict directly, e.g. params['word_emb'].append(), which performs the CPU/GPU and fp32/fp16 transfer automatically (see the sketch below).

Return type:

defaultdict
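
A minimal usage sketch (not from the original documentation) follows; the import path, the fast_model/model objects, and the embedding attribute path are assumptions:

    # Sketch only: `fast_model` and `model` are assumed to be already-built,
    # architecturally matching Layer objects; the import path is assumed.
    from paddlenlp.ops.fast_transformer.transformer.decoding import convert_params

    # Convert the Transformer-layer parameters of a weight-loaded model,
    # fusing q/k/v (fuse_qkv=1) and keeping fp32 weights.
    params = convert_params(fast_model, model, fuse_qkv=1, use_fp16=False,
                            restore_data=True)

    # Parameters outside the Transformer layers (e.g. embeddings) can be appended
    # through the returned defaultdict; CPU/GPU and fp32/fp16 transfer is handled
    # automatically, as described above.
    params["word_emb"].append(model.embeddings.word_embeddings.weight)  # attribute path illustrative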

class InferBase(use_fp16_decoding)[source]#
class InferTransformerDecoding(decoder, word_embedding, positional_embedding, linear, num_decoder_layers, n_head, d_model, bos_id=0, eos_id=1, decoding_strategy='beam_search', beam_size=4, topk=1, topp=0.0, max_out_len=256, diversity_rate=0.0, decoding_lib=None, use_fp16_decoding=False, rel_len=False, alpha=0.6)[source]#
forward(enc_output, memory_seq_lens, trg_word=None)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments
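
A rough construction/call sketch for InferTransformerDecoding, following the signatures above; the layer objects, sizes, and values are illustrative stand-ins for the components of an already-trained Transformer:

    import paddle
    import paddle.nn as nn

    # Illustrative sizes and freshly initialized layers; in practice these would
    # come from a trained model whose weights are then converted.
    d_model, n_head, num_layers, vocab_size, max_len = 512, 8, 6, 30000, 256
    decoder = nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model, n_head, dim_feedforward=2048), num_layers)
    word_embedding = nn.Embedding(vocab_size, d_model)
    positional_embedding = nn.Embedding(max_len, d_model)
    linear = nn.Linear(d_model, vocab_size)

    fast_decoding = InferTransformerDecoding(
        decoder, word_embedding, positional_embedding, linear,
        num_decoder_layers=num_layers, n_head=n_head, d_model=d_model,
        decoding_strategy="beam_search", beam_size=4, max_out_len=64)

    # enc_output: [batch_size, src_len, d_model]; memory_seq_lens: [batch_size].
    enc_output = paddle.randn([2, 16, d_model])
    memory_seq_lens = paddle.to_tensor([16, 12], dtype="int32")
    output_ids = fast_decoding(enc_output, memory_seq_lens)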

class FTParaConf(tensor_para_size=None, layer_para_size=None, layer_para_batch_size=1)[source]#

Configurations for model parallelism in FastGeneration. Currently only GPT is supported. Please refer to Megatron for details.

Parameters:
  • tensor_para_size (int, optional) – The size of tensor parallelism. If it is 1, tensor parallelism is not used. Defaults to 1.

  • layer_para_size (int, optional) – The size of layer parallelism. If it is 1, layer parallelism is not used. Defaults to 1.

  • layer_para_batch_size (int, optional) – The local batch size for pipeline parallelism. It is suggested to use batch_size // layer_para_size. Defaults to 1.

is_last_group()[source]#

For layer parallelism, only the process corresponding to the last layer group gets the prediction results. This method checks whether the current process corresponds to the last layer group.

is_load(i, num_layer)[source]#

Whether or not the given Transformer layer should be loaded into the current parallel model. For layer parallelism, layers belonging to other layer groups do not need to be loaded (see the sketch below).

Parameters:
  • i (int) – The index of Transformer layer.

  • num_layer (int) – The number of Transformer layers.

Returns:

Whether or not the given Transformer layer should be loaded into the current parallel model.

Return type:

bool
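
A short sketch (loop and names are illustrative) of how is_load and is_last_group might gate per-layer loading and result handling under layer parallelism:

    # Illustrative only: skip weights that belong to other layer groups.
    para_conf = get_ft_para_conf()   # documented below; returns the current FTParaConf
    num_layer = 24
    for i in range(num_layer):
        if not para_conf.is_load(i, num_layer):
            continue                 # this rank does not hold Transformer layer i
        # ... load/convert the weights of Transformer layer i here ...

    # Only the last layer group receives the prediction results.
    if para_conf.is_last_group():
        print("this rank will hold the generated ids")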

slice_weight(weight, axis, phase=1, out_param=False)[source]#

Get the weight slice for tensor parallel.

Parameters:
  • weight (Tensor or ndarray) – The weight or bias to be sliced.

  • axis (int) – The axis to perform slice.

  • phase (int, optional) – 0 is used when creating the partial model during initialization and from_pretrained, while 1 is used when converting parameters to FastGeneration. No slicing is performed when it is 1, since the parameters were already sliced in phase 0 (illustrated below). Defaults to 1.

  • out_param (bool, optional) – If True, weight should be a Parameter and the output is forced to be a Parameter. Defaults to False.

Returns:

The sliced weight.

Return type:

Tensor or ndarray
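
To make the phase-0 slicing concrete, here is a small numpy illustration (not the real call) of what splitting a weight along an axis for tensor parallelism amounts to; the rank and size values are made up:

    import numpy as np

    # A fake feed-forward weight of shape [hidden, 4 * hidden].
    hidden = 8
    weight = np.arange(hidden * 4 * hidden, dtype="float32").reshape(hidden, 4 * hidden)

    tensor_para_size, tensor_para_rank = 2, 0   # made-up parallel settings
    # Slicing along axis=1 keeps only this rank's column block, which is what
    # slice_weight(weight, axis=1, phase=0) is described to do at build time.
    cols = weight.shape[1] // tensor_para_size
    local_weight = weight[:, tensor_para_rank * cols:(tensor_para_rank + 1) * cols]
    assert local_weight.shape == (hidden, 2 * hidden)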

set_partial_model(is_partial_model)[source]#

Set whether or not the current model is a partial model, i.e. does not hold the complete parameters.

Parameters:

is_partial_model (bool) – Whether or not the current model is a partial model that holds only part of the parameters.

fit_partial_model(model, state_to_load)[source]#

Slice every value included in state_to_load according to the shape of the corresponding parameter in model. This is used in from_pretrained to get sliced parameter values (see the sketch below).

Parameters:
  • model (PretrainedModel) – The model to use.

  • state_to_load (dict) – The state dict including complete parameter values of model.

Returns:

The state dict containing the adjusted values.

Return type:

dict
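
A hedged sketch of where fit_partial_model sits in a loading flow; the checkpoint path is a placeholder and model stands for a model already built with partial (sliced) parameters under parallelism:

    import paddle

    # `model` is assumed to be a PretrainedModel built with partial (sliced)
    # parameters; "full_model.pdparams" is a placeholder checkpoint path.
    state_to_load = paddle.load("full_model.pdparams")  # complete, unsliced weights

    # Slice every value to the shape of the corresponding parameter in `model`,
    # then load the adjusted state dict as usual.
    sliced_state = get_ft_para_conf().fit_partial_model(model, state_to_load)
    model.set_state_dict(sliced_state)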

get_ft_para_conf()[source]#

Get settings for model parallel.

Returns:

The settings for model parallel.

Return type:

FTParaConf

enable_ft_para(tensor_para_size=None, layer_para_size=None, layer_para_batch_size=1)[source]#

Enable model parallelism in FastGeneration with the given settings. Currently only GPT is supported. Please refer to Megatron for details, and see the sketch after the parameter list below for a minimal usage example.

Parameters:
  • tensor_para_size (int, optional) – The size of tensor parallelism. If it is 1, tensor parallelism is not used. If it is None, the tensor parallel size is set to world_size / layer_para_size. Defaults to None.

  • layer_para_size (int, optional) – The size of layer parallelism. If it is 1, layer parallelism is not used. If it is None, it is set to 1. Defaults to None.

  • layer_para_batch_size (int, optional) – The local batch size for pipeline parallelism. It is suggested to use batch_size // layer_para_size. Defaults to 1.
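
A minimal sketch of enabling model parallelism for GPT generation; the model/tokenizer names are assumptions, and the script must be launched with multiple processes (e.g. via paddle.distributed.launch) so that the world size matches the parallel sizes:

    # Assumed imports and pretrained name; run under a multi-process launcher.
    from paddlenlp.transformers import GPTLMHeadModel, GPTTokenizer

    # Enable parallelism before building the model so it is created as a partial model.
    enable_ft_para(tensor_para_size=2, layer_para_size=2, layer_para_batch_size=1)

    tokenizer = GPTTokenizer.from_pretrained("gpt2-en")
    model = GPTLMHeadModel.from_pretrained("gpt2-en")
    # ... run FastGeneration-based generation with `model` here ...

    # Only the last layer group holds the prediction results.
    if get_ft_para_conf().is_last_group():
        print("this rank receives the generated ids")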

class InferOptDecoding(model: OPTForCausalLM, decoding_lib=None, use_fp16_decoding=False)[source]#

Extract the inference model parameters and feed them into the CUDA decoder.

forward(input_ids, mem_seq_len, attention_mask=None, topk=4, topp=0.0, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, max_out_len=256, temperature=1)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

class InferGptDecoding(model, decoding_lib=None, use_fp16_decoding=False)[source]#
forward(input_ids, mem_seq_len, attention_mask=None, topk=4, topp=0.0, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, max_out_len=256, temperature=1)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments
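
A rough call-pattern sketch for InferGptDecoding following the forward signature above; the GPT model construction and the token/id values are illustrative assumptions:

    import paddle
    from paddlenlp.transformers import GPTLMHeadModel  # model class assumed

    model = GPTLMHeadModel.from_pretrained("gpt2-en")   # pretrained name assumed
    fast_gpt = InferGptDecoding(model, use_fp16_decoding=False)

    # input_ids: [batch_size, seq_len]; mem_seq_len: the real (unpadded) lengths.
    input_ids = paddle.to_tensor([[464, 3290, 318, 257]], dtype="int64")
    mem_seq_len = paddle.to_tensor([4], dtype="int32")
    out_ids = fast_gpt(input_ids, mem_seq_len, topk=4, topp=0.0,
                       max_out_len=32, eos_token_id=50256, pad_token_id=50256)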

class InferUnifiedDecoding(model, decoding_lib=None, use_fp16_decoding=False, logits_mask=None, n_head=8, hidden_dims=512, size_per_head=64, n_layer=6, unk_id=0, mask_id=30000, normalize_before=True, hidden_act='gelu')[source]#
forward(input_ids, attn_mask, memory_seq_lens, type_id, decoder_type_id, role_id=None, decoder_role_id=None, position_id=None, decoder_position_id=None, beam_size=4, topk=4, topp=0.0, decoding_strategy='greedy_search', max_out_len=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=1.0, length_penalty=1.0, diversity_rate=0.0, pos_bias=True, rel_len=False, early_stopping=False, min_length=0)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

class InferMIRODecoding(model, decoding_lib=None, use_fp16_decoding=False, logits_mask=None, n_head=8, hidden_dims=512, size_per_head=64, n_layer=6, unk_id=0, mask_id=30000, normalize_before=True, hidden_act='relu')[source]#
forward(input_ids, attn_mask, memory_seq_lens, type_id, decoder_type_id, role_id=None, decoder_role_id=None, position_id=None, decoder_position_id=None, beam_size=4, topk=4, topp=0.0, decoding_strategy='greedy_search', max_out_len=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=1.0, length_penalty=1.0, diversity_rate=0.0, pos_bias=True, rel_len=False, early_stopping=False, min_length=0)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

class InferBartDecoding(model, decoding_lib=None, use_fp16_decoding=False)[source]#
forward(enc_output, memory_seq_lens, beam_size=4, top_k=1, top_p=0.0, temperature=1.0, decoding_strategy='beam_search_v3', max_out_len=256, min_out_len=256, diversity_rate=0.0, rel_len=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, alpha=0.6, early_stopping=False)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

class InferMBartDecoding(model, decoding_lib=None, use_fp16_decoding=False, hidden_act='gelu')[source]#
forward(enc_output, memory_seq_lens, trg_word=None, beam_size=4, top_k=1, top_p=0.0, decoding_strategy='beam_search_v3', max_out_len=256, diversity_rate=0.0, rel_len=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, alpha=0.6, temperature=1.0, early_stopping=False)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

convert_gptj_params(fast_model, model, fuse_qkv=1, use_fp16=False, restore_data=False, permutation=None)[source]#

Convert the parameters of a Transformer layer from the original model into the format used by the faster model.

Parameters:
  • fast_model (Layer) – The faster model object.

  • model (Layer) – The Transformer layer.

  • fuse_qkv (int) – 0 for no fusion, 1 for fusion, 2 for fusion plus deletion of the unfused parameters. If the environment variable PPFG_QKV_MEM_OPT is set and the q/k/v weights are fused, the original unfused weights are deleted. Note that once the original weights are deleted, rollback to the original model is no longer guaranteed if the faster model fails. Defaults to 1.

  • use_fp16 (bool) – Whether to use float16. (The default dtype may be given priority over this flag in a future release.) Defaults to False.

  • restore_data (bool) – If False, the weight values need to be reloaded afterwards; set it to True for models whose weights have already been loaded. Defaults to False.

Returns:

Each value is a list of the converted parameters across all layers. For parameters not included in the Transformer module that also need conversion, such as embeddings, you can append them through the returned dict directly, e.g. params['word_emb'].append(), which performs the CPU/GPU and fp32/fp16 transfer automatically.

Return type:

defaultdict

class InferGptJDecoding(model, decoding_lib=None, use_fp16_decoding=False, transpose_qkv=False)[source]#
forward(input_ids, mem_seq_len, attention_mask=None, topk=4, topp=0.0, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, max_out_len=256, temperature=1, repetition_penalty=1.0, min_length=0)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

class InferPegasusDecoding(model, decoding_lib=None, use_fp16_decoding=False, hidden_act='gelu')[source]#
forward(enc_output, memory_seq_lens, beam_size=4, top_k=1, top_p=0.0, decoding_strategy='beam_search_v3', max_out_len=256, min_out_len=256, diversity_rate=0.0, rel_len=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, alpha=0.6, temperature=1.0, early_stopping=False, forced_eos_token_id=None)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

class InferT5Decoding(model, decoding_lib=None, use_fp16_decoding=False)[source]#
forward(enc_output, memory_seq_lens, beam_size=4, top_k=1, top_p=0.0, decoding_strategy='beam_search_v3', max_out_len=256, diversity_rate=0.0, rel_len=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, alpha=0.6, temperature=1.0, early_stopping=False)[source]#

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments