decoding

convert_params(faster_model, model, fuse_qkv=1, use_fp16=False, restore_data=False)[source]

Converts the parameters of a Transformer layer (nn.TransformerEncoder or gpt.modeling.TransformerDecoder) from the original model's format to the format used by the faster model.

Parameters
  • faster_model (Layer) – The faster model object.

  • model (Layer) – The Transformer layer. Currently it can be an instance of nn.TransformerEncoder or gpt.modeling.TransformerDecoder; support for nn.TransformerDecoder is planned.

  • fuse_qkv (int) – 0 to leave q/k/v unfused, 1 to fuse them, 2 to fuse them and delete the unfused parameters. If the environment variable PPFG_QKV_MEM_OPT is set and the q/k/v weights are fused, the function tries to delete the original unfused weights. Note that once the original weights are deleted, rolling back to the original model when the faster model fails is no longer guaranteed. Defaults to 1.

  • use_fp16 (bool) – Whether to use float16. A later version may give the default dtype the highest priority instead. Defaults to False.

  • restore_data (bool) – If False, the weight values need to be reloaded. Set it to True for models whose weights have already been loaded. Defaults to False.

Returns

A dict in which each value is a list of the converted parameters across all layers. Parameters outside the Transformer module being converted, such as embeddings, can be added through the returned dict params by calling params['word_emb'].append() directly, which handles the CPU/GPU and fp32/fp16 transfer automatically.

Return type

defaultdict
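The q/k/v fusion and the shape of the returned dictionary can be pictured with a small pure-Python sketch. It is not the library's implementation: `fuse_qkv_weights` and the key `qkv_weight` are hypothetical names, weights are nested lists rather than tensors, and concatenating along the output axis is an assumption. A plain `defaultdict` also does not perform the device or dtype transfer described above.

```python
from collections import defaultdict


def fuse_qkv_weights(q, k, v):
    """Concatenate three [in_dim][out_dim] matrices along the output axis.

    Hypothetical helper: the real conversion works on tensors and may use
    a different layout.
    """
    if not (len(q) == len(k) == len(v)):
        raise ValueError("q, k and v must have the same number of rows")
    return [q_row + k_row + v_row for q_row, k_row, v_row in zip(q, k, v)]


# Two toy layers, each with 2x2 q/k/v weights (values are arbitrary).
layers = [
    {"q": [[1, 2], [3, 4]], "k": [[5, 6], [7, 8]], "v": [[9, 10], [11, 12]]},
    {"q": [[0, 1], [1, 0]], "k": [[2, 2], [2, 2]], "v": [[3, 3], [3, 3]]},
]

# Each value is a list with one entry per layer.
params = defaultdict(list)
for layer in layers:
    fused = fuse_qkv_weights(layer["q"], layer["k"], layer["v"])
    params["qkv_weight"].append(fused)  # "qkv_weight" is an illustrative key

# Parameters outside the Transformer layers go under their own key.
params["word_emb"].append([[0.1, 0.2], [0.3, 0.4]])

print(len(params["qkv_weight"]))   # number of converted layers
print(params["qkv_weight"][0][0])  # first row of the first fused matrix
```

Because the container is a `defaultdict(list)`, appending under a new key needs no prior initialization.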

class InferTransformerDecoding(decoder, word_embedding, positional_embedding, linear, num_decoder_layers, n_head, d_model, bos_id=0, eos_id=1, decoding_strategy='beam_search', beam_size=4, topk=1, topp=0.0, max_out_len=256, diversity_rate=0.0, decoding_lib=None, use_fp16_decoding=False, rel_len=False, alpha=0.6)[source]
forward(enc_output, memory_seq_lens, trg_word=None)[source]

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

class FTParaConf(tensor_para_size=None, layer_para_size=None, layer_para_batch_size=1)[source]

Configuration for model parallelism in FasterTransformer. Currently only GPT is supported. Refer to Megatron for details.

Parameters
  • tensor_para_size (int, optional) – The tensor parallel size. If it is 1, tensor parallel is not used. Defaults to 1.

  • layer_para_size (int, optional) – The layer parallel size. If it is 1, layer parallel is not used. Defaults to 1.

  • layer_para_batch_size (int, optional) – The local batch size for pipeline parallel. The suggested value is batch_size // layer_para_size. Defaults to 1.

is_last_group()[source]

With layer parallel, only the process that holds the last layer group obtains the prediction results. This method checks whether the current process is that one.
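As a rough illustration of the check, the sketch below maps a process rank to a layer group. The free functions and the rank layout (consecutive ranks share a layer group, so the group index is `rank // tensor_para_size`) are assumptions made for this example, not something this page specifies.

```python
def layer_para_rank(rank, tensor_para_size):
    """Index of the layer group a process belongs to (assumed layout)."""
    return rank // tensor_para_size


def is_last_group(rank, tensor_para_size, layer_para_size):
    """True when the process holds the last layer group."""
    return layer_para_rank(rank, tensor_para_size) == layer_para_size - 1


# 4 processes: 2-way tensor parallel combined with 2-way layer parallel.
for rank in range(4):
    print(rank, is_last_group(rank, tensor_para_size=2, layer_para_size=2))
```

Under this layout, ranks 2 and 3 would be the ones to read predictions from.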

is_load(i, num_layer)[source]

Whether the given Transformer layer should be loaded into the current parallel model. With layer parallel, there is no need to load the layers of other layer groups.

Parameters
  • i (int) – The index of Transformer layer.

  • num_layer (int) – The number of Transformer layers.

Returns

Whether the given Transformer layer should be loaded into the current parallel model.

Return type

bool
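One way to picture the layer-to-group assignment is the sketch below. It assumes contiguous groups of `ceil(num_layer / layer_para_size)` layers and takes the group index as an explicit `layer_para_rank` argument. Both are assumptions for illustration; the method documented above takes only `i` and `num_layer`.

```python
import math


def is_load(i, num_layer, layer_para_rank, layer_para_size):
    """Return True if layer i falls in the group owned by layer_para_rank.

    Assumes contiguous groups of ceil(num_layer / layer_para_size) layers.
    """
    if not 0 <= i < num_layer:
        raise IndexError("layer index out of range")
    layers_per_group = math.ceil(num_layer / layer_para_size)
    start = layers_per_group * layer_para_rank
    end = min(start + layers_per_group, num_layer)
    return start <= i < end


# Layers held by the second of two groups in a 6-layer model.
loaded = [i for i in range(6) if is_load(i, 6, layer_para_rank=1, layer_para_size=2)]
print(loaded)
```

With `layer_para_size=1` the single group covers every layer, so nothing is skipped.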

slice_weight(weight, axis, phase=1, out_param=False)[source]

Get the weight slice for tensor parallel.

Parameters
  • weight (Tensor or ndarray) – The weight or bias to be sliced.

  • axis (int) – The axis along which to slice.

  • phase (int, optional) – 0 is used when creating a partial model during initialization and in from_pretrained; 1 is used when converting parameters to FasterTransformer. With phase=1 no slicing is performed, since the parameters were already sliced in phase=0. Defaults to 1.

  • out_param (bool, optional) – If True, weight should be a Parameter and the output is forced to be a Parameter. Defaults to False.

Returns

The sliced weight.

Return type

Tensor or ndarray
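The slicing itself can be sketched in pure Python on nested lists. This is a stand-in, not the library method: `rank` and `tensor_para_size` are passed explicitly here (hypothetical arguments), an even split is assumed, and `phase=1` simply returns the input on the assumption that it was already sliced in phase 0.

```python
def slice_weight(weight, axis, rank, tensor_para_size, phase=1):
    """Return this rank's slice of a 1-D or 2-D nested-list weight."""
    if phase == 1 or tensor_para_size == 1:
        return weight  # nothing to do: already sliced, or no tensor parallel
    is_2d = isinstance(weight[0], list)
    if axis == 0:
        length = len(weight)
    elif axis == 1 and is_2d:
        length = len(weight[0])
    else:
        raise ValueError("axis out of range for this weight")
    if length % tensor_para_size != 0:
        raise ValueError("axis length must be divisible by tensor_para_size")
    step = length // tensor_para_size
    start, end = rank * step, (rank + 1) * step
    if axis == 0:
        return weight[start:end]
    return [row[start:end] for row in weight]


full_weight = [[1, 2, 3, 4], [5, 6, 7, 8]]
# Column slice held by rank 1 of 2 when building the partial model.
print(slice_weight(full_weight, axis=1, rank=1, tensor_para_size=2, phase=0))
# In phase 1 the input is returned unchanged.
print(slice_weight(full_weight, axis=1, rank=1, tensor_para_size=2, phase=1))
```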

set_partial_model(is_partial_model)[source]

Sets whether or not the current model has the complete parameters.

Parameters

is_partial_model (bool) – It is used to set whether or not the current model has complete parameters.

fit_partial_model(model, state_to_load)[source]

Slices every value in state_to_load according to the shape of the corresponding parameter in model. This is used in from_pretrained to obtain the sliced parameter values.

Parameters
  • model (PretrainedModel) – The model to use.

  • state_to_load (dict) – The state dict including complete parameter values of model.

Returns

The state dict containing the adjusted values.

Return type

dict
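A minimal sketch of shape-driven fitting follows. It assumes nested-list values, a hypothetical `param_shapes` mapping standing in for the model, an explicit `rank` argument, and at most one differing axis per value; the library's actual logic may differ.

```python
def shape_of(value):
    """Shape of a 1-D or 2-D nested list."""
    if value and isinstance(value[0], list):
        return (len(value), len(value[0]))
    return (len(value),)


def fit_partial_state(param_shapes, state_to_load, rank):
    """Slice each full value down to the shape its parameter expects."""
    fitted = {}
    for name, value in state_to_load.items():
        full, target = shape_of(value), param_shapes.get(name)
        if target is None or full == target:
            fitted[name] = value  # unknown name or already the right shape
            continue
        diff = [ax for ax, (f, t) in enumerate(zip(full, target)) if f != t]
        if len(full) != len(target) or len(diff) != 1:
            raise ValueError(f"cannot fit {name}: {full} -> {target}")
        axis, size = diff[0], target[diff[0]]
        start = rank * size  # this rank's offset along the sliced axis
        if axis == 0:
            fitted[name] = value[start:start + size]
        else:
            fitted[name] = [row[start:start + size] for row in value]
    return fitted


state = {
    "linear.weight": [[1, 2, 3, 4], [5, 6, 7, 8]],
    "linear.bias": [1, 2, 3, 4],
    "norm.weight": [1, 1],
}
shapes = {"linear.weight": (2, 2), "linear.bias": (2,), "norm.weight": (2,)}
fitted = fit_partial_state(shapes, state, rank=1)
print(fitted)
```

Values whose shape already matches, such as `norm.weight` here, pass through untouched.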

get_ft_para_conf()[source]

Get settings for model parallel.

Returns

The settings for model parallel.

Return type

FTParaConf

enable_ft_para(tensor_para_size=None, layer_para_size=None, layer_para_batch_size=1)[source]

Enables model parallelism in FasterTransformer with the given settings. Currently only GPT is supported. Refer to Megatron for details.

Parameters
  • tensor_para_size (int, optional) – The tensor parallel size. If it is 1, tensor parallel is not used. When it is None, it is set to world_size / layer_para_size. Defaults to None.

  • layer_para_size (int, optional) – The layer parallel size. If it is 1, layer parallel is not used. When it is None, it is set to 1. Defaults to None.

  • layer_para_batch_size (int, optional) – The local batch size for pipeline parallel. The suggested value is batch_size // layer_para_size. Defaults to 1.
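The way the None defaults resolve can be sketched as follows. `resolve_para_sizes` and `suggested_layer_para_batch_size` are hypothetical helpers, `world_size` is passed explicitly, and the check that the two sizes multiply to `world_size` is an assumption added for this example.

```python
def resolve_para_sizes(world_size, tensor_para_size=None, layer_para_size=None):
    """Fill in None values following the rules described above."""
    if layer_para_size is None:
        layer_para_size = 1
    if tensor_para_size is None:
        tensor_para_size = world_size // layer_para_size
    # Assumed consistency check, not stated in the documentation.
    if tensor_para_size * layer_para_size != world_size:
        raise ValueError("tensor_para_size * layer_para_size must equal world_size")
    return tensor_para_size, layer_para_size


def suggested_layer_para_batch_size(batch_size, layer_para_size):
    """Suggested local batch size for pipeline parallel."""
    return batch_size // layer_para_size


print(resolve_para_sizes(8))                     # both left as None
print(resolve_para_sizes(8, layer_para_size=2))  # only layer_para_size given
print(suggested_layer_para_batch_size(16, 2))
```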

class InferGptDecoding(model, decoding_lib=None, use_fp16_decoding=False)[source]
forward(input_ids, mem_seq_len, attention_mask=None, topk=4, topp=0.0, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, max_out_len=256, temperature=1)[source]

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

class InferUnifiedDecoding(model, decoding_lib=None, use_fp16_decoding=False, logits_mask=None, n_head=8, hidden_dims=512, size_per_head=64, n_layer=6, unk_id=0, mask_id=30000, normalize_before=True, hidden_act='gelu')[source]
forward(input_ids, attn_mask, memory_seq_lens, type_id, decoder_type_id, role_id=None, decoder_role_id=None, position_id=None, decoder_position_id=None, beam_size=4, topk=4, topp=0.0, decoding_strategy='greedy_search', max_out_len=256, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, temperature=1.0, length_penalty=1.0, diversity_rate=0.0, pos_bias=True, rel_len=False, early_stopping=False, min_length=0)[source]

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

class InferBartDecoding(model, decoding_lib=None, use_fp16_decoding=False)[source]
forward(enc_output, memory_seq_lens, beam_size=4, top_k=1, top_p=0.0, decoding_strategy='beam_search_v3', max_out_len=256, diversity_rate=0.0, rel_len=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, forced_eos_token_id=None, alpha=0.6, early_stopping=False)[source]

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments

class InferMBartDecoding(model, decoding_lib=None, use_fp16_decoding=False, hidden_act='gelu')[source]
forward(enc_output, memory_seq_lens, trg_word=None, beam_size=4, top_k=1, top_p=0.0, decoding_strategy='beam_search_v3', max_out_len=256, diversity_rate=0.0, rel_len=False, bos_token_id=None, eos_token_id=None, pad_token_id=None, alpha=0.6, temperature=1.0, early_stopping=False)[source]

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters
  • *inputs (tuple) – unpacked tuple arguments

  • **kwargs (dict) – unpacked dict arguments