Precision Alignment

1. Overview

1.1 Background

Model precision alignment verifies that the same model produces stable, consistent outputs under identical environments and parameter configurations. It is a prerequisite for subsequent work and provides a solid foundation for data analysis, decision-making, and system optimization.

1.2 Prerequisites

Based on precision alignment acceptance criteria, the following preparations are recommended:

  • Prepare training/validation datasets for model training and evaluation.

  • Prepare the PyTorch model implementation to serve as the precision baseline.

  • Prepare validation hardware: for fp16 model parameters, use V100, A100, or similar cards; for bf16 parameters, use A100 or similar cards.

2. Workflow

The overall workflow includes model structure alignment, small dataset preparation, initial forward alignment, loss function alignment, optimizer alignment, learning rate alignment, regularization strategy alignment, initial backward alignment, training data alignment, and training alignment. For large models using parallel strategies, additional steps include parallel model structure alignment, parallel forward alignment, and parallel backward alignment.

2.1 Process Overview

The overall workflow for model precision validation is shown below:

[Figure: align_workflow, the model precision alignment workflow diagram]

3. Model Alignment Process

3.1 Model Structure Alignment

Model structure alignment consists of three main steps:

  • Network structure code conversion

  • Weight conversion

  • Model architecture validation

3.1.1 Network Structure Code Conversion

【Basic Process】

PyTorch APIs are generally similar to PaddlePaddle APIs. Refer to the PyTorch Latest Release vs Paddle Develop API Mapping Table for manual conversion of some network code.

【Automatic Code Conversion Tool】

The PaConvert automatic code conversion tool converts training or inference code written for other deep learning frameworks into PaddlePaddle code, which makes model code migration quick and largely automatic.

PaConvert currently supports only PyTorch code; support for other deep learning frameworks is planned. During conversion it tries to preserve the original code style and structure while replacing the source framework's API calls with their PaddlePaddle equivalents.

【Large Model Network Structure Examples】

3.1.2 Weight Conversion

【Basic Process】

After completing the network code conversion, model weights need to be converted.

import copy
import json
import os
import shutil

import paddle
import torch
from safetensors.numpy import save_file
from safetensors.torch import load_file

from paddlenlp.transformers import AutoConfig, Qwen2MoeForCausalLM
from paddlenlp.utils.log import logger


def execute_cmd(cmd, file_path):
    cmd = cmd + " " + file_path
    os.system(cmd)


def convert_from_torch_to_paddle(torch_path=None, paddle_path=None, torch_prefix_key="model.", paddle_class=Qwen2MoeForCausalLM, delete_after_convert=False):
    assert torch_path is not None
    if paddle_path is None:
        paddle_path = torch_path + "-paddle"
    os.makedirs(paddle_path, exist_ok=True)

    config = AutoConfig.from_pretrained(torch_path)
    name_mappings = paddle_class._get_name_mappings(config=config)

    paddle_prefix_key = paddle_class.base_model_prefix + "."

    index_path = os.path.join(torch_path, "model.safetensors.index.json")
    if os.path.exists(index_path):
        # Sharded checkpoint: rename the keys in the index and convert every shard.
        with open(index_path) as f:
            index = json.load(f)
        dst_index = copy.deepcopy(index)

        for key in list(dst_index["weight_map"].keys()):
            paddle_key = key.replace(torch_prefix_key, paddle_prefix_key)
            dst_index["weight_map"][paddle_key] = dst_index["weight_map"].pop(key)

        files = set(index["weight_map"].values())
        logger.info(files)

        for file_name in sorted(os.listdir(torch_path)):
            # skip hidden files
            if file_name.startswith("."):
                continue

            logger.info(file_name)
            if file_name in files:
                # convert safetensors to safetensors(paddle)
                convert_safetensors_from_torch_to_paddle(file_name,
                                                        torch_path,
                                                        paddle_path,
                                                        torch_prefix_key,
                                                        paddle_prefix_key,
                                                        name_mappings,
                                                        delete_after_convert=delete_after_convert)
            else:
                # copy config.json and other files
                shutil.copy(os.path.join(torch_path, file_name), os.path.join(paddle_path, file_name))

        with open(os.path.join(paddle_path, "model.safetensors.index.json"), "w") as f:
            json.dump(dst_index, f, indent=2)
    else:
        # Single-file checkpoint: only model.safetensors needs converting.
        for file_name in sorted(os.listdir(torch_path)):
            # skip hidden files
            if file_name.startswith("."):
                continue

            logger.info(file_name)
            if file_name == "model.safetensors":
                convert_safetensors_from_torch_to_paddle(file_name,
                                                        torch_path,
                                                        paddle_path,
                                                        torch_prefix_key,
                                                        paddle_prefix_key,
                                                        name_mappings,
                                                        delete_after_convert=delete_after_convert)
            else:
                # copy config.json and other files
                shutil.copy(os.path.join(torch_path, file_name), os.path.join(paddle_path, file_name))

    # PaddleNLP reads "dtype" where the PyTorch config stores "torch_dtype".
    execute_cmd(cmd="sed -i -e  's/torch_dtype/dtype/g' ",
                file_path=os.path.join(paddle_path, "config.json"))


def convert_safetensors_from_torch_to_paddle(file_name, torch_path, paddle_path, torch_prefix_key, paddle_prefix_key, name_mappings, delete_after_convert=False):
    tensors = load_file(os.path.join(torch_path, file_name))

    # Map each Paddle parameter name to whether its weight must be transposed.
    transpose_state_dict = {m.target_name: m.action == "transpose" for m in name_mappings}

    for key in list(tensors.keys()):
        paddle_key = key.replace(torch_prefix_key, paddle_prefix_key)
        logger.info("{} {}".format(key, tensors[key].shape))
        t = tensors.pop(key).cuda()
        if transpose_state_dict[paddle_key]:
            t = t.t().contiguous()
        # Hand the tensor from PyTorch to Paddle through DLPack, then store it as NumPy.
        capsule = torch.utils.dlpack.to_dlpack(t)
        t = paddle.utils.dlpack.from_dlpack(capsule)
        tensors[paddle_key] = t.numpy()

        # Alternative conversion path, disabled in the original script:
        # tensors[paddle_key] = paddle.to_tensor(tensors.pop(key).cuda().float().cpu().numpy(), dtype="bfloat16").numpy()
        logger.info("{} {}".format(paddle_key, tensors[paddle_key].shape))

    save_file(tensors, os.path.join(paddle_path, file_name), metadata={"format": "np"})
    if delete_after_convert:
        os.remove(os.path.join(torch_path, file_name))


# torch_path is required; the output goes to torch_path + "-paddle" unless paddle_path is given.
# Pass the paddle_class that matches the checkpoint being converted.
convert_from_torch_to_paddle(torch_path="/root/code/PaddleNLP/ckpt/Qwen/Qwen2-0.5B", paddle_class=Qwen2MoeForCausalLM)
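The key renaming that the script applies to model.safetensors.index.json can be tried in isolation. The following is a minimal sketch on a toy index, not the script's own code: the helper name rename_weight_map and the shard file names are made up for illustration, and the "qwen2." prefix is assumed from the Qwen2 name mappings. Unlike the script, which uses str.replace, this variant rewrites only a leading prefix.

```python
import copy


def rename_weight_map(index, torch_prefix="model.", paddle_prefix="qwen2."):
    """Return a copy of a safetensors index with parameter names re-prefixed."""
    dst = copy.deepcopy(index)
    for key in list(dst["weight_map"].keys()):
        # Rewrite only a leading prefix; keys without it keep their name.
        if key.startswith(torch_prefix):
            new_key = paddle_prefix + key[len(torch_prefix):]
        else:
            new_key = key
        dst["weight_map"][new_key] = dst["weight_map"].pop(key)
    return dst


# Toy index; the shard file names are hypothetical.
toy_index = {
    "metadata": {"total_size": 0},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
    },
}

renamed = rename_weight_map(toy_index)
for name, shard in sorted(renamed["weight_map"].items()):
    print(name, "->", shard)
```

The shard file each parameter lives in is unchanged; only the parameter names differ between the two indexes.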

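The model class must implement the _get_name_mappings method, which marks the linear-layer weights that need transposing. PyTorch's nn.Linear stores its weight as [out_features, in_features], whereas Paddle's nn.Linear stores it as [in_features, out_features], so these weights are transposed during conversion. Refer to the Qwen model structure: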

PaddlePaddle/PaddleNLP

# Excerpt from PaddleNLP; imports and the rest of the class are omitted.
class Qwen2PretrainedModel(PretrainedModel):
    @classmethod
    def _get_name_mappings(cls, config: Qwen2Config) -> list[StateDictNameMapping]:
        mappings: list[StateDictNameMapping] = []
        model_mappings = [
            ["embed_tokens.weight"],
            ["norm.weight"],
        ]
        for layer_index in range(config.num_hidden_layers):
            # Entries tagged "transpose" are linear weights whose layout differs between frameworks.
            layer_mappings = [
                [f"layers.{layer_index}.self_attn.q_proj.weight", None, "transpose"],
                [f"layers.{layer_index}.self_attn.k_proj.weight", None, "transpose"],
                [f"layers.{layer_index}.self_attn.v_proj.weight", None, "transpose"],
                [f"layers.{layer_index}.self_attn.q_proj.bias", None],
                [f"layers.{layer_index}.self_attn.k_proj.bias", None],
                [f"layers.{layer_index}.self_attn.v_proj.bias", None],
                [f"layers.{layer_index}.self_attn.o_proj.weight", None, "transpose"],
                [f"layers.{layer_index}.mlp.up_proj.weight", None, "transpose"],
                [f"layers.{layer_index}.mlp.gate_proj.weight", None, "transpose"],
                [f"layers.{layer_index}.mlp.down_proj.weight", None, "transpose"],
                [f"layers.{layer_index}.self_attn.rotary_emb.inv_freq"],
                [f"layers.{layer_index}.input_layernorm.weight"],
                [f"layers.{layer_index}.post_attention_layernorm.weight"],
            ]
            model_mappings.extend(layer_mappings)

        init_name_mappings(mappings=model_mappings)
        # Add the base-model prefixes unless the checkpoint is a bare "Qwen2Model"
        if "Qwen2Model" not in config.architectures:
            for mapping in model_mappings:
                mapping[0] = "model." + mapping[0]
                mapping[1] = "qwen2." + mapping[1]
            if not config.tie_word_embeddings:
                model_mappings.append(["lm_head.weight", "lm_head.weight", "transpose"])

        mappings = [StateDictNameMapping(*mapping, index=index) for index, mapping in enumerate(model_mappings)]
        return mappings
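Why the linear weights carry the "transpose" action can be illustrated with plain NumPy. This is a minimal sketch that mimics the two layouts with matrix products rather than calling either framework: torch.nn.Linear computes x @ W.T + b with W shaped [out_features, in_features], while paddle.nn.Linear computes x @ W + b with W shaped [in_features, out_features]. The shapes and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features = 4, 3

x = rng.standard_normal((2, in_features)).astype("float32")
w_torch = rng.standard_normal((out_features, in_features)).astype("float32")  # PyTorch layout
b = rng.standard_normal(out_features).astype("float32")

# PyTorch-style linear: multiply by the transposed weight.
y_torch = x @ w_torch.T + b

# Converted weight in Paddle layout, as produced by .t().contiguous() in the script.
w_paddle = np.ascontiguousarray(w_torch.T)
y_paddle = x @ w_paddle + b

max_diff = float(np.max(np.abs(y_torch - y_paddle)))
print(w_torch.shape, w_paddle.shape, max_diff)
```

Biases, embeddings, and normalization weights have the same layout in both frameworks, which is why their mappings carry no action.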

3.1.3 Model Network Correctness Verification

【Basic Process】

  1. Define the PyTorch model, load its weights, fix the random seed, generate random input with NumPy, convert it to a PyTorch tensor, feed it into the network, and record the output.

  2. Define the PaddlePaddle model, load the converted weights, fix the random seed, generate the same random input with NumPy, convert it to a PaddlePaddle tensor, feed it into the network, and record the output.

  3. Compute the diff between the two outputs; if it is below the threshold, verification succeeds.

【Example Code】

import numpy as np
import paddle
import torch
from transformers import Qwen2Config as Qwen2Config_hf
from transformers import Qwen2ForCausalLM as Qwen2ForCausalLM_hf

from paddlenlp.transformers import Qwen2Config, Qwen2ForCausalLM


def eval_model_convert():
    # Feed identical token IDs to both frameworks.
    paddle_input_ids = paddle.to_tensor([[0, 345, 232, 328, 740, 140, 1695, 69, 6078, 1588, 2]])
    torch_input_ids = torch.LongTensor([[0, 345, 232, 328, 740, 140, 1695, 69, 6078, 1588, 2]])

    # paddle model
    paddle_ckpt_path = "Qwen/Qwen2-0.5B"
    config_paddle = Qwen2Config.from_pretrained(paddle_ckpt_path)
    model_paddle = Qwen2ForCausalLM.from_pretrained(paddle_ckpt_path, config=config_paddle, dtype="float32")

    # torch model
    torch_ckpt_path = "/root/.cache/modelscope/hub/Qwen/Qwen2-0___5B"
    config_torch = Qwen2Config_hf.from_pretrained(torch_ckpt_path, trust_remote_code=True)
    config_torch.dtype = "float32"
    model_torch = Qwen2ForCausalLM_hf.from_pretrained(torch_ckpt_path, config=config_torch, trust_remote_code=True)

    model_paddle.eval()
    model_torch.eval()

    # Index 0 of each output is the logits tensor.
    out_paddle = model_paddle(paddle_input_ids)[0]
    out_torch = model_torch(torch_input_ids, return_dict=False)[0]

    assert np.allclose(out_paddle.numpy(), out_torch.detach().numpy(), rtol=1e-5, atol=1e-3)


eval_model_convert()
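The example above hard-codes its input IDs, while steps 1 and 2 describe seeded random inputs generated with NumPy. A minimal sketch of that variant follows. The helper names make_input_ids and max_abs_diff are hypothetical, and vocab_size=1000 is a placeholder; in practice the model's own config.vocab_size would be used.

```python
import numpy as np


def make_input_ids(seed=42, batch_size=1, seq_len=11, vocab_size=1000):
    """Generate reproducible token IDs; the same seed always yields the same array."""
    rng = np.random.default_rng(seed)
    return rng.integers(low=0, high=vocab_size, size=(batch_size, seq_len), dtype=np.int64)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two outputs."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.max(np.abs(a - b)))


# Both frameworks must receive the same array, so generate it once (or with the same seed).
ids_for_torch = make_input_ids()
ids_for_paddle = make_input_ids()
print(np.array_equal(ids_for_torch, ids_for_paddle))

# The array would then be wrapped per framework, for example with
# torch.from_numpy(ids_for_torch) and paddle.to_tensor(ids_for_paddle).
```

Generating the input once in NumPy and converting it for each framework avoids relying on the two frameworks' random number generators, which do not produce the same sequence for the same seed.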

【Notes】

  • When verifying forward alignment, call model.eval() so that layers such as BatchNorm and Dropout run in inference mode and introduce no randomness.

  • For reproducibility, fix the random seeds whenever random numbers are involved.

  • The output diff can be calculated with np.max(np.abs(o1 - o2)). Generally, if diff < 1e-5, the forward pass is considered correct. If the diff is large, use binary search over the network's intermediate outputs to locate the operation where the two models first diverge.

  • Set environment variables to avoid operator randomness:

# General environment variables
export NVIDIA_TF32_OVERRIDE=0
export FLAGS_embedding_deterministic=1
export FLAGS_cudnn_deterministic=1

# Distributed training environment variables
export Flags_mp_aysnc_allreduce=1
export Flags_skip_mp_c_identity=1
export FLAGS_shard_norm_align_dp=0
export FLAGS_shard_use_reduce=1
export FLAGS_sync_before_allreduce=1
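The binary search mentioned in the notes can be sketched as follows. This is an illustration under one stated assumption: once a layer's output diverges, every later layer's output diverges too, so the per-layer diffs are monotone with respect to the threshold. The function name first_divergent_layer and the recorded diff values are hypothetical; in practice diff_at(i) would run both models up to layer i and compare the outputs.

```python
def first_divergent_layer(num_layers, diff_at, threshold=1e-5):
    """Return the index of the first layer whose diff exceeds threshold, or None."""
    lo, hi = 0, num_layers
    while lo < hi:
        mid = (lo + hi) // 2
        if diff_at(mid) > threshold:
            hi = mid        # divergence is at mid or earlier
        else:
            lo = mid + 1    # layers up to mid still match
    return lo if lo < num_layers else None


# Made-up per-layer diffs standing in for real comparisons.
recorded_diffs = [0.0, 2e-7, 5e-7, 3e-3, 4e-3, 9e-3]
queried = []


def diff_at(i):
    queried.append(i)       # track which layers actually had to be compared
    return recorded_diffs[i]


first_bad = first_divergent_layer(len(recorded_diffs), diff_at)
print(first_bad, queried)   # prints: 3 [3, 1, 2]
```

Only three of the six layers are compared here; for a deep network the saving over a layer-by-layer scan grows with depth.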

3.1.4 Distributed Network Alignment

【Basic Process】

The basic process is the same as in section 3.1.3. In addition, create the distributed environment during model initialization and launch the script with paddle.distributed.launch. Example command:

python -m paddle.distributed.launch --devices 0,1 compare_torch_with_paddle.py

【Example Code】

import numpy as np
import paddle
import torch
from paddle.distributed import fleet
from transformers import Qwen2Config as Qwen2Config_hf
from transformers import Qwen2ForCausalLM as Qwen2ForCausalLM_hf

from paddlenlp.transformers import Qwen2Config, Qwen2ForCausalLM


def eval_model_convert_parallel(mp_degree=1):
    paddle_input_ids = paddle.to_tensor([[0, 345, 232, 328, 740, 140, 1695, 69, 6078, 1588, 2]])
    torch_input_ids = torch.LongTensor([[0, 345, 232, 328, 740, 140, 1695, 69, 6078, 1588, 2]])

    # Create the distributed environment; only the model-parallel degree varies.
    strategy = fleet.DistributedStrategy()
    strategy.hybrid_configs = {
        "dp_degree": 1,
        "mp_degree": mp_degree,
        "pp_degree": 1,
        "sharding_degree": 1,
    }
    fleet.init(is_collective=True, strategy=strategy)
    hcg = fleet.get_hybrid_communicate_group()

    # paddle model
    paddle_ckpt_path = "Qwen/Qwen2-0.5B"
    config_paddle = Qwen2Config.from_pretrained(paddle_ckpt_path)
    config_paddle.tensor_parallel_degree = hcg.get_model_parallel_world_size()
    config_paddle.tensor_parallel_rank = hcg.get_model_parallel_rank()
    config_paddle.tensor_parallel_output = False
    model_paddle = Qwen2ForCausalLM.from_pretrained(paddle_ckpt_path, config=config_paddle, dtype="float32")

    # torch model
    torch_ckpt_path = "/root/.cache/modelscope/hub/Qwen/Qwen2-0___5B"
    config_torch = Qwen2Config_hf.from_pretrained(torch_ckpt_path, trust_remote_code=True)
    config_torch.dtype = "float32"
    model_torch = Qwen2ForCausalLM_hf.from_pretrained(torch_ckpt_path, config=config_torch, trust_remote_code=True)

    model_paddle.eval()
    model_torch.eval()

    # Manual verification
    out_paddle = model_paddle(paddle_input_ids)[0]
    out_torch = model_torch(torch_input_ids, return_dict=False)[0]
    assert np.allclose(out_paddle.numpy(), out_torch.detach().numpy(), rtol=1e-5, atol=1e-4)


eval_model_convert_parallel(mp_degree=2)
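Why a model-parallel forward pass should match the single-card result, up to rounding, can be illustrated with NumPy. This sketch assumes the usual tensor-parallel scheme, in which a linear weight is split by columns (outputs concatenated) or by rows (partial outputs summed); it does not call Paddle and is not how PaddleNLP implements its parallel layers. All shapes and names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mp_degree = 2
x = rng.standard_normal((2, 8))
w = rng.standard_normal((8, 6))  # Paddle layout: [in_features, out_features]

y_full = x @ w  # single-card reference

# Column-parallel: each rank holds a slice of the output features.
col_shards = np.split(w, mp_degree, axis=1)
y_col = np.concatenate([x @ shard for shard in col_shards], axis=-1)

# Row-parallel: each rank holds a slice of the input features; partial results are summed,
# which is the role an all-reduce plays across real devices.
row_shards = np.split(w, mp_degree, axis=0)
x_shards = np.split(x, mp_degree, axis=1)
y_row = sum(xs @ ws for xs, ws in zip(x_shards, row_shards))

col_diff = float(np.max(np.abs(y_full - y_col)))
row_diff = float(np.max(np.abs(y_full - y_row)))
print(col_diff, row_diff)
```

The row-parallel sum adds the partial products in a different order than the single matrix product does, so a tiny nonzero diff is expected there; this is one reason the acceptance criterion is a tolerance rather than exact equality.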

【Notes】

  • Set environment variables to avoid operator randomness:

# General environment variables
export NVIDIA_TF32_OVERRIDE=0
export FLAGS_embedding_deterministic=1
export FLAGS_cudnn_deterministic=1

# Distributed training environment variables
export Flags_mp_aysnc_allreduce=1
export Flags_skip_mp_c_identity=1
export FLAGS_shard_norm_align_dp=0
export FLAGS_shard_use_reduce=1
export FLAGS_sync_before_allreduce=1

3.2 Forward & Backward Alignment - Alignment Tool Verification

【Basic Process】

Instead of verifying manually, you can use the automated tool PaDiff, a model precision alignment tool for PaddlePaddle and PyTorch. It takes Paddle or Torch models, compares their intermediate training results and final weights, and reports where the first precision diff occurs.

PaDiff: PaddlePaddle/PaDiff

【Usage】

import numpy as np
import paddle
import torch
from padiff import auto_diff
from transformers import Qwen2Config as Qwen2Config_hf
from transformers import Qwen2ForCausalLM as Qwen2ForCausalLM_hf

from paddlenlp.transformers import Qwen2Config, Qwen2ForCausalLM


def eval_model_convert():
    paddle_input_ids = paddle.to_tensor([[0, 345, 232, 328, 740, 140, 1695, 69, 6078, 1588, 2]])
    torch_input_ids = torch.LongTensor([[0, 345, 232, 328, 740, 140, 1695, 69, 6078, 1588, 2]])

    # paddle model
    paddle_ckpt_path = "Qwen/Qwen2-0.5B"
    config_paddle = Qwen2Config.from_pretrained(paddle_ckpt_path)
    model_paddle = Qwen2ForCausalLM.from_pretrained(paddle_ckpt_path, config=config_paddle, dtype="float32")

    # torch model
    torch_ckpt_path = "/root/.cache/modelscope/hub/Qwen/Qwen2-0___5B"
    config_torch = Qwen2Config_hf.from_pretrained(torch_ckpt_path, trust_remote_code=True)
    config_torch.dtype = "float32"
    model_torch = Qwen2ForCausalLM_hf.from_pretrained(torch_ckpt_path, config=config_torch, trust_remote_code=True)

    model_paddle.eval()
    model_torch.eval()

    # Manual verification
    out_paddle = model_paddle(paddle_input_ids)[0]
    out_torch = model_torch(torch_input_ids, return_dict=False)[0]
    assert np.allclose(out_paddle.numpy(), out_torch.detach().numpy(), rtol=1e-5, atol=1e-4)

    # Use padiff for verification: one input dict per model, torch first, then paddle
    inp = ({"input_ids": torch_input_ids,
            "use_cache": False,
            "output_attentions": False,
            "output_hidden_states": False,
            "return_dict": False},
        {"input_ids": paddle_input_ids})
    # diff_phase can be forward, backward or both
    # rtol was written as 1e3 in the original listing; 1e-3 is assumed to be the intended value
    auto_diff(model_torch, model_paddle, inp, atol=1e-4, rtol=1e-3, auto_init=False, diff_phase="both", compare_mode="strict")


eval_model_convert()
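How atol and rtol combine in the manual np.allclose check can be made explicit. NumPy accepts a pair of values when |actual - desired| <= atol + rtol * |desired| holds element-wise. The sketch below re-implements that rule in a hypothetical helper, within_tolerance, with made-up numbers; whether PaDiff applies exactly the same formula is an assumption that should be checked against its documentation.

```python
import numpy as np


def within_tolerance(actual, desired, atol=1e-4, rtol=1e-5):
    """Element-wise |actual - desired| <= atol + rtol * |desired|, reduced with all()."""
    actual = np.asarray(actual, dtype=np.float64)
    desired = np.asarray(desired, dtype=np.float64)
    return bool(np.all(np.abs(actual - desired) <= atol + rtol * np.abs(desired)))


reference = np.array([1.0, 100.0, 0.0])
close = reference + np.array([5e-5, 5e-4, 5e-5])  # inside the bound for every element
far = reference + np.array([5e-5, 5e-3, 5e-5])    # second element exceeds 1e-4 + 1e-5 * 100

print(within_tolerance(close, reference))            # prints: True
print(within_tolerance(far, reference))              # prints: False
print(within_tolerance(far, reference, rtol=1e3))    # prints: True
```

The last call shows that a relative tolerance far above 1 accepts almost any nonzero reference value, so rtol is normally a small number; atol is what protects elements whose reference value is near zero.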

Precision alignment reference (verification standard):

Model              Size   Logits diff (float32)   Loss diff (float32)   Each tensor in all layers (float32)
Qwen/Qwen2-0.5B    0.5B   1e-4                    1e-5                  1e-4
Qwen/Qwen2-1.5B    1.5B   1e-3                    1e-5                  1e-3
Qwen/Qwen2-7B      7B     1e-3                    1e-5                  1e-3
Qwen/Qwen1.5-14B   14B    1e-4                    1e-5                  1e-4
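The reference values above can be turned into an automated check. This is a minimal sketch: the dictionary copies the thresholds from the table, while the function name check_against_reference, the metric keys, and the measured values are made up for illustration. Treating a diff equal to the threshold as passing is also an assumption.

```python
# Thresholds copied from the reference table (all float32).
REFERENCE_THRESHOLDS = {
    "Qwen/Qwen2-0.5B": {"logits": 1e-4, "loss": 1e-5, "layer_tensors": 1e-4},
    "Qwen/Qwen2-1.5B": {"logits": 1e-3, "loss": 1e-5, "layer_tensors": 1e-3},
    "Qwen/Qwen2-7B": {"logits": 1e-3, "loss": 1e-5, "layer_tensors": 1e-3},
    "Qwen/Qwen1.5-14B": {"logits": 1e-4, "loss": 1e-5, "layer_tensors": 1e-4},
}


def check_against_reference(model_name, measured):
    """Return, per metric, whether the measured diff is within the reference threshold."""
    limits = REFERENCE_THRESHOLDS[model_name]
    return {name: measured[name] <= limit for name, limit in limits.items()}


# Hypothetical measured diffs, not real results.
measured = {"logits": 3e-5, "loss": 2e-6, "layer_tensors": 4e-4}
report = check_against_reference("Qwen/Qwen2-0.5B", measured)
print(report)
```

With these made-up numbers the logits and loss pass, while the per-layer tensor diff exceeds the 1e-4 limit and would need investigating.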

3.3 Model Training Alignment

【Basic Process】

After completing previous steps, proceed to full-data training alignment:

  1. Prepare train/eval data, data loaders, and model

  2. Initialize model

  3. Load configuration and start training to obtain final model and evaluation metrics.

【Notes】

  1. 【Strongly Recommended】Complete backward alignment before training alignment. Training alignment still involves several uncertain factors: dataset differences, behavioral differences between PaddlePaddle and the reference code in training mode, and parameter initialization.

  2. During training alignment, some output difference is acceptable. For example, on the SST-2 classification task, a difference below 0.15% is considered normal. Adjust diff_threshold in ReprodDiffHelper.report as needed.

  3. Training fluctuations are normal. If the final convergence differs, check the following:

  • Check that modules with hyperparameters, such as Dropout and BatchNorm, are configured identically in both implementations.

  • Generate a pretrained model using reference code, convert to PaddlePaddle model, and compare convergence curves.

  • Use reference code’s DataLoader output for training to exclude data loading effects.
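The metric comparison and the convergence-curve comparison described in the notes can be sketched in a few lines. The function names and all numbers are hypothetical, and the 0.15% threshold is interpreted here as an absolute difference of 0.0015 in a metric expressed as a fraction, which is an assumption about the intended meaning.

```python
def metric_within_threshold(paddle_metric, reference_metric, diff_threshold=0.0015):
    """True when the final metrics differ by less than diff_threshold (absolute)."""
    return abs(paddle_metric - reference_metric) < diff_threshold


def max_curve_gap(paddle_losses, reference_losses):
    """Largest per-step gap between two loss curves logged at the same steps."""
    if len(paddle_losses) != len(reference_losses):
        raise ValueError("loss curves must be logged at the same steps")
    return max(abs(p - r) for p, r in zip(paddle_losses, reference_losses))


# Made-up accuracies and loss curves, not real results.
print(metric_within_threshold(0.9231, 0.9220))  # prints: True
print(metric_within_threshold(0.9231, 0.9200))  # prints: False
gap = max_curve_gap([2.31, 1.87, 1.52], [2.30, 1.90, 1.50])
print(round(gap, 4))
```

A small final-metric gap combined with a large gap early in the curves usually points at initialization or data order rather than at the model code, which is why both views are worth checking.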

References:

  1. PaddlePaddle/PaDiff

  2. PaddlePaddle/models