# LLaMA
This document demonstrates how to build and run the LLaMA series of large models in PaddleNLP.
## Model Introduction
The LLaMA series of large models is an open and efficient collection of foundational language models released by Meta AI.
- **Llama 2**: In July 2023, Meta released the Llama 2 series in 7B, 13B, and 70B sizes (a 34B variant was trained but not publicly released). Its license permits commercial use, lowering the cost for startups to build ChatGPT-style chatbots.
- **Llama 3**: On April 19, 2024, Meta launched the Llama 3 series with 8B and 70B versions; a 400B-class Llama 3 model was still in training at the time. This generation shows across-the-board improvements on a wide range of benchmarks.
- **Llama 3.1**: On July 23, 2024, Meta released the Llama 3.1 series in 8B, 70B, and 405B sizes, further improving model performance and efficiency.
## Verified Models

| Model |
|---|
| meta-llama/Llama-2-7b-chat |
| meta-llama/Llama-2-13b-chat |
| meta-llama/Llama-2-70b-chat |
| meta-llama/Meta-Llama-3-8B-Instruct |
| meta-llama/Meta-Llama-3-70B-Instruct |
| meta-llama/Meta-Llama-3.1-8B-Instruct |
| meta-llama/Meta-Llama-3.1-70B-Instruct |
| meta-llama/Meta-Llama-3.1-405B-Instruct |
| meta-llama/Llama-3.2-3B-Instruct |
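To smoke-test one of these checkpoints outside the `predictor.py` pipeline, a minimal sketch using PaddleNLP's auto classes (the same API as the pre-download snippet further below) might look like this; the model choice and generation arguments here are illustrative, not prescribed by this document:

```python
# Minimal sketch: load a verified checkpoint and generate once.
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # any model from the table above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="bfloat16")

inputs = tokenizer("Hello, who are you?", return_tensors="pd")  # "pd" = Paddle tensors
output_ids, _ = model.generate(**inputs, max_new_tokens=64)     # PaddleNLP returns (ids, scores)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```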
## Verified Pre-quantized Models

| Model |
|---|
| meta-llama/Meta-Llama-3-8B-Instruct-A8W8C8 |
| meta-llama/Meta-Llama-3-8B-Instruct-A8W8-FP8 |
| meta-llama/Meta-Llama-3.1-8B-Instruct-A8W8C8 |
| meta-llama/Meta-Llama-3.1-8B-Instruct-A8W8-FP8 |

In these names, `A8W8` denotes INT8 activations and weights, a trailing `C8` adds an INT8 KV cache, and `A8W8-FP8` uses the FP8 format for activations and weights.
## Model Inference

The examples below use meta-llama/Meta-Llama-3-8B-Instruct (single GPU) and meta-llama/Meta-Llama-3.1-405B-Instruct (multi-GPU). In all commands, `--inference_model 1` switches to the high-performance fused inference operators and `--append_attn 1` enables the append-attention kernels.
### BF16 Inference

```shell
# Dynamic graph inference
python ./predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --dtype bfloat16 --mode dynamic --inference_model 1 --append_attn 1
# Export the dynamic graph to a static graph
python predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --append_attn 1
# Static graph inference
python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --append_attn 1
```
### WINT8 Inference

```shell
# Dynamic graph inference
python predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --dtype bfloat16 --mode dynamic --inference_model 1 --append_attn 1 --quant_type weight_only_int8
# Export the dynamic graph to a static graph
python predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --append_attn 1 --quant_type weight_only_int8
# Static graph inference
python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --append_attn 1 --quant_type weight_only_int8
```
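Conceptually, `weight_only_int8` stores each linear layer's weights as INT8 with a per-output-channel scale and dequantizes them back to the compute dtype inside the matmul, while activations stay in BF16. A rough NumPy illustration of that idea (a sketch, not PaddleNLP's fused kernel):

```python
import numpy as np

def quantize_weight_int8(w):
    # w: [in_features, out_features]; one scale per output channel
    scale = np.abs(w).max(axis=0) / 127.0
    w_i8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_i8, scale

def wint8_matmul(x, w_i8, scale):
    # Dequantize the INT8 weights on the fly, then matmul in float
    return x @ (w_i8.astype(np.float32) * scale)

w = np.random.randn(16, 8).astype(np.float32)
x = np.random.randn(2, 16).astype(np.float32)
w_i8, scale = quantize_weight_int8(w)
print(np.max(np.abs(x @ w - wint8_matmul(x, w_i8, scale))))  # small quantization error
```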
The quantized inference below requires a model produced by following the Large Model Quantization Tutorial (e.g., checkpoints/llama_ptq_ckpts), or one of the pre-quantized models listed above (e.g., meta-llama/Meta-Llama-3-8B-Instruct-A8W8C8).
### INT8-A8W8 Inference

```shell
# Dynamic graph inference
python predict/predictor.py --model_name_or_path checkpoints/llama_ptq_ckpts --dtype bfloat16 --mode dynamic --inference_model 1 --append_attn 1 --quant_type a8w8
# Export the dynamic graph to a static graph
python predict/export_model.py --model_name_or_path checkpoints/llama_ptq_ckpts --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --append_attn 1 --quant_type a8w8
# Static graph inference
python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --append_attn 1 --quant_type a8w8
```
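In `a8w8` mode both activations and weights are quantized to INT8, so the GEMM itself runs on INT8 inputs with INT32 accumulation and is rescaled back to float afterwards; the scales come from the PTQ calibration step. A schematic NumPy version of the arithmetic (again a sketch of the idea, not the actual kernel):

```python
import numpy as np

def quant_int8(t, axis):
    scale = np.abs(t).max(axis=axis, keepdims=True) / 127.0
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.random.randn(2, 16).astype(np.float32)   # activations (one row per token)
w = np.random.randn(16, 8).astype(np.float32)   # linear-layer weights
x_i8, sx = quant_int8(x, axis=1)                # per-token activation scales
w_i8, sw = quant_int8(w, axis=0)                # per-output-channel weight scales
y_i32 = x_i8.astype(np.int32) @ w_i8.astype(np.int32)  # INT8 GEMM, INT32 accumulate
y = y_i32.astype(np.float32) * sx * sw                  # rescale back to float
print(np.max(np.abs(y - x @ w)))                # small quantization error
```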
### INT8-A8W8C8 Inference

```shell
# Dynamic graph inference
python predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct-A8W8C8 --dtype bfloat16 --mode dynamic --inference_model 1 --append_attn 1 --quant_type a8w8 --cachekv_int8_type static
# Export the dynamic graph to a static graph
python predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct-A8W8C8 --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --append_attn 1 --quant_type a8w8 --cachekv_int8_type static
# Static graph inference
python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --append_attn 1 --quant_type a8w8 --cachekv_int8_type static
```
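The extra `--cachekv_int8_type static` flag additionally stores the attention KV cache in INT8, using scales calibrated offline ("static") rather than computed per batch. The write/read step amounts to something like this sketch (the scale value is illustrative):

```python
import numpy as np

k_scale = 0.05  # per-layer/head scale from offline calibration (illustrative)
k = np.random.randn(1, 4, 128).astype(np.float32) * 0.04           # new K-cache entries
k_i8 = np.clip(np.round(k / k_scale), -127, 127).astype(np.int8)   # write to INT8 cache
k_restored = k_i8.astype(np.float32) * k_scale                     # read back for attention
print(np.max(np.abs(k - k_restored)))  # rounding error bounded by k_scale / 2
```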
### FP8-A8W8 Inference

```shell
# Dynamic graph inference
python predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct-A8W8-FP8 --dtype bfloat16 --mode dynamic --inference_model 1 --append_attn 1 --quant_type a8w8_fp8
# Export the dynamic graph to a static graph
python predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct-A8W8-FP8 --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --append_attn 1 --quant_type a8w8_fp8
# Static graph inference
python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --append_attn 1 --quant_type a8w8_fp8
```
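`a8w8_fp8` keeps the same activation/weight quantization structure but uses the FP8 (E4M3) format, trading the uniform INT8 grid for floating-point spacing. As a rough round-trip illustration using the `ml_dtypes` package (an assumption for demonstration only; PaddleNLP's FP8 path runs in fused GPU kernels):

```python
import numpy as np
import ml_dtypes  # assumed installed; provides NumPy FP8 dtypes

w = np.random.randn(16, 8).astype(np.float32)
w_fp8 = w.astype(ml_dtypes.float8_e4m3fn)            # quantize to FP8 (E4M3)
err = np.max(np.abs(w - w_fp8.astype(np.float32)))   # error scales with magnitude
print(err)
```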
### 405B INT8-A8W8C8 TP8 Inference

Due to the model size, first execute the following script to pre-download the model:

```python
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer
from paddlenlp.generation import GenerationConfig

# Pre-download the 405B weights, tokenizer, and generation config
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-405B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-405B-Instruct")
generation_config = GenerationConfig.from_pretrained("meta-llama/Meta-Llama-3.1-405B-Instruct")
```
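For example, save the snippet as `download.py` (an illustrative name) and run it once; the weights are then cached locally (by default under `~/.paddlenlp`), so the eight-GPU export below reads them from the cache instead of re-downloading:

```shell
python download.py
```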
The commands below use fake parameters via `--use_fake_parameter`. To run inference on a correctly quantized model, please follow the Large Model Quantization Tutorial to produce real quantized weights.
```shell
# Export the model (set paddle.set_device("cpu") in predict/export_model.py to export via host memory)
python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3.1-405B-Instruct --output_path /path/to/a8w8c8_tp8 --inference_model 1 --append_attn 1 --dtype bfloat16 --quant_type a8w8 --cachekv_int8_type static --use_fake_parameter 1
# Static graph inference
python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" predict/predictor.py --model_name_or_path /path/to/a8w8c8_tp8 --mode static --inference_model 1 --append_attn 1 --dtype bfloat16 --quant_type a8w8 --cachekv_int8_type static
```