High-Performance Deployment of Static Graphs#
This deployment tool is based on NVIDIA Triton, designed specifically for server-side large model serving. It provides service interfaces supporting gRPC and HTTP protocols, along with streaming token output capabilities. The underlying inference engine supports continuous batching, weight-only int8, post-training quantization (PTQ), and other acceleration strategies, delivering an easy-to-use and high-performance deployment experience.
Quick Deployment of Static Graphs#
This method only supports the one-click runnable models listed here, and starts an inference service instantly.
To avoid long download times for large models, we provide an automatic download script that can start the service once the download completes. After entering the container, download the static graph according to your single-node or multi-node model scenario.
- MODEL_PATH: specifies the model storage path (user-definable)
- model_name: specifies the model name to download; supported models can be found in this document
Notes:
- Ensure --shm-size is at least 5 GB, otherwise the service may fail to start
- Verify the model's environment and hardware requirements before deployment; refer to the documentation
A100 Deployment Example
export MODEL_PATH=${MODEL_PATH:-$PWD}
export model_name=${model_name:-"deepseek-ai/DeepSeek-R1-Distill-Llama-8B/weight_only_int8"}
docker run --rm --gpus all --shm-size 32G --network=host --privileged --cap-add=SYS_PTRACE \
-v $MODEL_PATH:/models -e "model_name=${model_name}" \
-dit ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlenlp:llm-serving-cuda124-cudnn9-v2.1 /bin/bash \
-c -ex 'start_server $model_name && tail -f /dev/null'
Deployment Environment Preparation#
Basic Environment#
This serving deployment tool currently supports Linux only. Please make sure the system has a properly configured GPU environment before deployment.
1. Install Docker: refer to Install Docker Engine to install Docker for your Linux platform.
2. Install NVIDIA Container Toolkit: refer to Installing the NVIDIA Container Toolkit to learn about and install it.
3. After installing the NVIDIA Container Toolkit, refer to Running a Sample Workload with Docker to verify that it works properly.
Prepare Deployment Images#
For deployment convenience, we provide images for CUDA 12.4 and CUDA 11.8. You can either pull the images directly or build custom images from our provided Dockerfile.
CUDA Version | Supported SM Architectures | Image Address | Typical Supported Devices |
---|---|---|---|
CUDA 11.8 | 70, 75, 80, 86 | ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlenlp:llm-serving-cuda118-cudnn8-v2.1 | V100, T4, A100, A30, A10 |
CUDA 12.4 | 80, 86, 89, 90 | ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlenlp:llm-serving-cuda124-cudnn9-v2.1 | A100, A30, A10, L20, H20, H100 |
Prepare Models#
This deployment tool provides an efficient deployment solution for PaddleNLP static graph models. For model static graph export instructions, please refer to: DeepSeek, LLaMA, Qwen, Mixtral…
The exported models can be placed in any directory, for example /home/workspace/models_dir:
cd /home/workspace/models_dir
# The exported model directory structure is shown below. PaddleNLP-exported static graph models are supported as-is, without modifying the directory structure
# /opt/output/Serving/models
# ├── config.json # Model configuration file
# ├── xxxx.model # Vocabulary model file
# ├── special_tokens_map.json # Vocabulary configuration file
# ├── tokenizer_config.json # Vocabulary configuration file
# └── rank_0 # Directory storing model structure and weight files
# ├── model.pdiparams
# └── model.pdmodel or model.json # Paddle 3.0 version uses model.json, Paddle 2.x version uses model.pdmodel
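As a convenience, the expected layout can be verified programmatically before mounting the directory. The snippet below is a minimal sketch of ours, not part of the deployment tool, and the check_model_dir helper name is our own:
import os

def check_model_dir(model_dir: str) -> None:
    # Top-level configuration and tokenizer files
    for name in ("config.json", "special_tokens_map.json", "tokenizer_config.json"):
        assert os.path.isfile(os.path.join(model_dir, name)), f"missing {name}"
    # Per-rank model structure and weight files
    rank_dir = os.path.join(model_dir, "rank_0")
    assert os.path.isdir(rank_dir), "missing rank_0 directory"
    assert os.path.isfile(os.path.join(rank_dir, "model.pdiparams")), "missing model.pdiparams"
    # Paddle 3.0 exports model.json; Paddle 2.x exports model.pdmodel
    assert any(os.path.isfile(os.path.join(rank_dir, f)) for f in ("model.json", "model.pdmodel")), \
        "missing model.json / model.pdmodel"

check_model_dir("/home/workspace/models_dir")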
Static Graph Download#
In addition to automatic download by setting model_name at startup, the service provides a script for manual download. During deployment, the environment variable MODEL_DIR must be set to the model download storage path.
Script location: /opt/output/download_model.py
python download_model.py \
--model_name $model_name \
--dir $MODEL_PATH \
--nnodes 2 \
--mode "master" \
--speculate_model_path $MODEL_PATH
Single-node model download, taking the DeepSeek-R1 weight_only_int4 model as an example:
export MODEL_PATH=${MODEL_PATH:-$PWD}
export model_name="deepseek-ai/DeepSeek-R1/weight_only_int4"
python download_model.py --model_name $model_name --dir $MODEL_PATH --nnodes 1
Multi-node model download, taking the 2-node DeepSeek-R1 weight_only_int8 model as an example:
Node 1 (Master node)
export MODEL_PATH=${MODEL_PATH:-$PWD}
export model_name="deepseek-ai/DeepSeek-R1-2nodes/weight_only_int8"
python download_model.py --model_name $model_name --dir $MODEL_PATH --nnodes 2 --mode "master"
Node 2 (Slave Node)
export MODEL_PATH=${MODEL_PATH:-$PWD}
export model_name="deepseek-ai/DeepSeek-R1-2nodes/weight_only_int8"
python download_model.py --model_name $model_name --dir $MODEL_PATH --nnodes 2 --mode "slave"
Parameter Description
Field Name | Type | Description | Mandatory | Default Value |
---|---|---|---|---|
model_name | str | Specified model name for download. Supported models can be found in documentation | No | deepseek-ai/DeepSeek-R1/weight_only_int4 |
dir | str | Model storage path | No | downloads |
nnodes | int | Number of nodes | No | 1 |
mode | str | Download mode, used to distinguish nodes in multi-node setups; only "master" and "slave" are supported | No | - |
speculate_model_path | str | Speculative decoding model storage path | No | None |
Create Container#
Before creating the container, check the Docker version and GPU environment to ensure Docker supports the --gpus all parameter.
Mount the model directory into the container. The default mount path is /models/; it can be customized via the MODEL_DIR environment variable at service startup.
docker run --gpus all \
--name paddlenlp_serving \
--privileged \
--cap-add=SYS_PTRACE \
--network=host \
--shm-size=5G \
-v /home/workspace/models_dir:/models/ \
-dit ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlenlp:llm-serving-cuda124-cudnn9-v2.1 bash
# Enter container to verify GPU environment and model mount status
docker exec -it paddlenlp_serving /bin/bash
nvidia-smi
ls /models/
Start Service#
Configure Parameters#
Set the following environment variables according to your requirements and hardware:
# Single/Multi-GPU Inference Configuration. Modify as needed.
## For single-GPU inference, using GPU 0, set the following environment variables.
export MP_NUM=1
export CUDA_VISIBLE_DEVICES=0
## For multi-GPU inference, the model must have been exported for 2 GPUs; additionally, set the following environment variables.
# export MP_NUM=2
# export CUDA_VISIBLE_DEVICES=0,1
# If the deployment scenario does not require streaming Token returns, configure the following switch
# The service will return all generated Tokens for each request at once
# Reduces pressure on the service to send Tokens incrementally
# Disabled by default
# export DISABLE_STREAMING=1
# Service port configuration. Modify HEALTH_HTTP_PORT, SERVICE_GRPC_PORT, METRICS_HTTP_PORT, INTER_PROC_PORT and SERVICE_HTTP_PORT as needed. (Verify port availability first)
export HEALTH_HTTP_PORT="8110" # Health check service http port (currently only used for health checks)
export SERVICE_GRPC_PORT="8811" # Model serving grpc port
export METRICS_HTTP_PORT="8722" # Monitoring metrics port for model service
export INTER_PROC_PORT="8813" # Internal communication port for model service
export SERVICE_HTTP_PORT="9965" # HTTP port for service requests. Defaults to -1 (only GRPC supported) if not configured
# MAX_SEQ_LEN: The service will reject requests where input token count exceeds MAX_SEQ_LEN and return error
# MAX_DEC_LEN: The service will reject requests with max_dec_len/min_dec_len exceeding this parameter and return error
export MAX_SEQ_LEN=8192
export MAX_DEC_LEN=1024
export BATCH_SIZE="48" # Set maximum Batch Size - maximum concurrent requests the model can handle, should not exceed 128
export BLOCK_BS="5" # Maximum Query Batch Size for cached Blocks. Reduce this value if encountering out of memory errors
export BLOCK_RATIO="0.75" # Generally can be set to (average input tokens)/(average input + output tokens)
export MAX_CACHED_TASK_NUM="128" # Maximum length of service cache queue. New requests will be rejected when queue reaches limit, default 128
# To enable the HTTP interface, configure the following parameter
export PUSH_MODE_HTTP_WORKERS="1" # Number of HTTP service processes. Effective when SERVICE_HTTP_PORT is configured; can be set up to 8, default 1
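Since the port comment above asks you to verify availability first, here is a minimal Python sketch of ours (not part of the toolkit) that checks whether the configured ports are still free:
import socket

def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    # Binding fails if another process already listens on the port
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

ports = {"HEALTH_HTTP_PORT": 8110, "SERVICE_GRPC_PORT": 8811,
         "METRICS_HTTP_PORT": 8722, "INTER_PROC_PORT": 8813,
         "SERVICE_HTTP_PORT": 9965}
for name, port in ports.items():
    print(name, port, "free" if port_is_free(port) else "IN USE")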
Multi-Machine Parameter Configuration#
The following service parameters are required in addition to the single-machine configuration:
export POD_IPS=10.0.0.1,10.0.0.2
export POD_0_IP=10.0.0.1
export MP_NNODE=2 # Number of nodes set to 2 indicates 2-machine service
export MP_NUM=16 # Model parallelism degree: 2 nodes x 8 GPUs = 16
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
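As a quick sanity check (our own sketch, not part of the service), the parallelism settings should satisfy MP_NUM = MP_NNODE x GPUs per node:
import os

mp_num = int(os.environ["MP_NUM"])        # 16
mp_nnode = int(os.environ["MP_NNODE"])    # 2 nodes
gpus_per_node = len(os.environ["CUDA_VISIBLE_DEVICES"].split(","))  # 8 GPUs
assert mp_num == mp_nnode * gpus_per_node, \
    f"MP_NUM ({mp_num}) must equal MP_NNODE ({mp_nnode}) x GPUs per node ({gpus_per_node})"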
For more configuration parameters, please refer to Model Configuration Parameters.
Service Startup#
We provide two deployment options for model serving:
1. Deploy with models pre-saved at a specified path
2. Auto-download static graph deployment
Single-machine Startup#
Deploy with pre-saved models at specified path
export MODEL_DIR=${MODEL_DIR:-"/models"}
start_server
# Before restarting the service, stop it using stop_server
Auto-download static graph deployment
model_name specifies the model to download; for supported models, see the documentation.
model_name="deepseek-ai/DeepSeek-R1-2nodes/weight_only_int8"
start_server $model_name
# Before restarting the service, stop it using stop_server
Multi-machine Startup#
Sequential Startup#
1. Start the master node service
2. Start the services on the other nodes sequentially
The startup command is the same as for single-machine deployment.
MPI Startup#
MPI startup requires passwordless SSH to be configured between the machines in advance.
mpirun start_server
# Stop service
mpirun stop_server
Service Health Check#
The port is the HEALTH_HTTP_PORT specified at service startup. Please ensure the service IP and port are correct before testing.
Liveness probe: (Check if service can accept requests)
http://127.0.0.1:8110/v2/health/live
Readiness probe: (Check if model is ready for inference)
http://127.0.0.1:8110/v2/health/ready
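As a sketch, both probes can be polled with a few lines of Python (assuming the HEALTH_HTTP_PORT of 8110 configured above; Triton-style health endpoints answer HTTP 200 when the probe passes):
import requests

base = "http://127.0.0.1:8110"
for probe in ("live", "ready"):
    r = requests.get(f"{base}/v2/health/{probe}", timeout=5)
    print(probe, "OK" if r.status_code == 200 else f"failed ({r.status_code})")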
Service Testing#
For multi-machine deployments, run the tests on the master node or replace the IP with the master node's IP.
HTTP Invocation#
import uuid
import json
import requests
ip = "127.0.0.1"
service_http_port = "9965" # Service configuration
url = f"http://{ip}:{service_http_port}/v1/chat/completions"
req_id = str(uuid.uuid1())
data_single = {
"text": "Hello, how are you?",
"req_id": req_id,
"max_dec_len": 64,
"stream": True,
}
# Stream per token
res = requests.post(url, json=data_single, stream=True)
for line in res.iter_lines():
print(json.loads(line))
# Multi-turn dialogue
data_multi = {
"messages": [
{"role": "user", "content": "Hello, who are you"},
{"role": "system", "content": "I'm a helpful AI assistant."},
{"role": "user", "content": "List 3 countries and their capitals."},
],
"req_id": req_id,
"max_dec_len": 64,
"stream": True,
}
# Stream per token
res = requests.post(url, json=data_multi, stream=True)
for line in res.iter_lines():
print(json.loads(line))
For more request parameters, please refer to Request Parameters
Response Examples#
When stream is True, streaming returns:
If normal, returns {'token': xxx, 'is_end': xxx, 'send_idx': xxx, ..., 'error_msg': '', 'error_code': 0}
If error occurs, returns {'error_msg': xxx, 'error_code': xxx} with error_msg not empty and error_code non-zero
When stream is False, non-streaming returns:
If normal, returns {'tokens_all': xxx, ..., 'error_msg': '', 'error_code': 0}
If error occurs, returns {'error_msg': xxx, 'error_code': xxx} with error_msg not empty and error_code non-zero
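Putting the two cases together, a minimal sketch for consuming a streaming response might look as follows (field names are taken from the examples above; the endpoint and port follow the HTTP Invocation section, and we assume token carries the decoded text):
import json
import requests

res = requests.post("http://127.0.0.1:9965/v1/chat/completions",
                    json={"text": "Hello, how are you?", "max_dec_len": 64, "stream": True},
                    stream=True)
pieces = []
for line in res.iter_lines():
    if not line:
        continue
    msg = json.loads(line)
    if msg.get("error_code", 0) != 0:  # non-zero error_code signals failure
        raise RuntimeError(msg["error_msg"])
    pieces.append(msg["token"])
    if msg.get("is_end"):              # is_end marks the final token of the reply
        break
print("".join(pieces))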
OpenAI Client#
We provide support for the OpenAI client. Usage is as follows:
import openai
ip = "127.0.0.1"
service_http_port = "9965" # Service configuration
client = openai.Client(base_url=f"http://{ip}:{service_http_port}/v1", api_key="EMPTY_API_KEY")  # the client appends /completions and /chat/completions itself
# Non-streaming response
response = client.completions.create(
model="default",
prompt="Hello, how are you?",
max_tokens=50,
stream=False,
)
print(response)
print("\n")
# Streaming response
response = client.completions.create(
model="default",
prompt="Hello, how are you?",
max_tokens=100,
stream=True,
)
for chunk in response:
if chunk.choices[0].text is not None:
print(chunk.choices[0].text, end='')
print("\n")
# Chat completion
# Non-streaming response
response = client.chat.completions.create(
model="default",
messages=[
{"role": "user", "content": "Hello, who are you"},
{"role": "system", "content": "I'm a helpful AI assistant."},
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
stream=False,
)
print(response)
print("\n")
# Streaming response
response = client.chat.completions.create(
model="default",
messages=[
{"role": "user", "content": "Hello, who are you"},
{"role": "system", "content": "I'm a helpful AI assistant."},
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
stream=True,
)
for chunk in response:
if chunk.choices[0].delta.content is not None:
print(chunk.choices[0].delta.content, end='')
print("\n")
Creating Your Own Image Based on Dockerfile#
To facilitate building custom services, we provide a Dockerfile for creating your own image.
git clone https://github.com/PaddlePaddle/PaddleNLP.git
cd PaddleNLP/llm/server
docker build --network=host -f ./dockerfiles/Dockerfile_serving_cuda124_cudnn9 -t llm-serving-cu124-self .
After creating your own image, you can create a container based on it.
Model Configuration Parameters#
Field Name | Type | Description | Required | Default Value | Remarks |
---|---|---|---|---|---|
MP_NNODE | int | Number of nodes | No | 1 | Should match the number of machines |
MP_NUM | int | Model parallelism degree | No | 8 | CUDA_VISIBLE_DEVICES must be configured with corresponding GPU count |
CUDA_VISIBLE_DEVICES | str | GPU indices | No | 0,1,2,3,4,5,6,7 | |
POD_IPS | str | IP addresses of multi-node cluster | No | None | Required for multi-node, example: "10.0.0.1,10.0.0.2" |
POD_0_IP | str | Master node IP in multi-node cluster | No | None | Required for multi-node, example: "10.0.0.1" (must exist in POD_IPS) |
HEALTH_HTTP_PORT | int | Health check service HTTP port | Yes | None | Currently only used for health checks (pre 3.0.0 images use HTTP_PORT) |
SERVICE_GRPC_PORT | int | Model serving GRPC port | Yes | None | (pre 3.0.0 images use GRPC_PORT) |
METRICS_HTTP_PORT | int | Monitoring metrics HTTP port | Yes | None | (pre 3.0.0 images use METRICS_PORT) |
INTER_PROC_PORT | int | Internal process communication port | No | 56666 | (pre 3.0.0 images use INTER_QUEUE_PORT) |
SERVICE_HTTP_PORT | int | HTTP port for service requests | No | 9965 | (pre 3.0.0 images use PUSH_MODE_HTTP_PORT) |
DISABLE_STREAMING | int | Disable streaming response | No | 0 | |
MAX_SEQ_LEN | int | Maximum input sequence length | No | 8192 | Requests exceeding this limit will be rejected with error |
MAX_DEC_LEN | int | Maximum decoder sequence length | No | 1024 | Requests with max_dec_len/min_dec_len exceeding this will be rejected |
BATCH_SIZE | int | Maximum batch size | No | 50 | Maximum concurrent inputs the model can handle, cannot exceed 128 |
BLOCK_BS | int | Maximum query batch size for cached blocks | No | 50 | Reduce this value if encountering out of memory errors |
BLOCK_RATIO | float | Block allocation ratio between encoder and decoder | No | 0.75 | Generally set to (average input tokens)/(average input + output tokens) |
MAX_CACHED_TASK_NUM | int | Maximum cached tasks in queue | No | 128 | New requests will be rejected when queue reaches limit |
PUSH_MODE_HTTP_WORKERS | int | Number of HTTP service workers | No | 1 | Effective when configured, increase for high concurrency (max recommended 8) |
USE_WARMUP | int | Enable warmup | No | 0 | |
USE_HF_TOKENIZER | int | Use HuggingFace tokenizer | No | 0 | |
USE_CACHE_KV_INT8 | int | Enable INT8 for KV Cache | No | 0 | Set to 1 for c8 quantized models |
MODEL_DIR | str | Model file path | No | /models/ | |
model_name | str | Model name | No | None | Used for static graph model downloads (see static_models.md) |
OUTPUT_LOG_TO_CONSOLE | str | Redirect logs to console | No | 0 | |
GPU Memory Configuration Recommendations#
- BLOCK_BS: determines the number of cache KV blocks. Total supported tokens = BLOCK_BS * INFER_MODEL_MAX_SEQ_LEN. Example: for an 8K model with BLOCK_BS=40, 40 * 8K = 320K tokens are supported. For a 32K model, the recommendation is BLOCK_BS = 40/(32K/8K) = 10 (considering input length variations, this may be reduced to 8-9).
- BLOCK_RATIO: block allocation ratio between encoder and decoder. Recommended value: (avg_input_len + 128)/(avg_input_len + avg_output_len) * EXTEND_RATIO, with EXTEND_RATIO in 1~1.3. Example: avg_input=300, avg_output=1500 → (300+128)/(300+1500) ≈ 0.24.
- BATCH_SIZE: generally < TOTAL_TOKENS / (avg_input_len + avg_output_len); see the worked sketch after this list.
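Worked through as code, the three recommendations look like this (a sketch of the arithmetic only; compare the results against the table below):
# 8K model with BLOCK_BS=40 -> total token budget
infer_model_max_seq_len = 8 * 1024
block_bs = 40
total_tokens = block_bs * infer_model_max_seq_len        # 40 * 8K = 320K tokens

# BLOCK_RATIO for avg_input=300, avg_output=1500
avg_input_len, avg_output_len = 300, 1500
extend_ratio = 1.0                                       # choose within 1~1.3
block_ratio = (avg_input_len + 128) / (avg_input_len + avg_output_len) * extend_ratio
print(round(block_ratio, 2))                             # ≈ 0.24

# Upper bound for BATCH_SIZE
print(total_tokens // (avg_input_len + avg_output_len))  # 182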
GPU Memory | Deployed Model | Static Graph Weights | Nodes | Quantization Type | Context Length | MTP Enabled | MTP Quant Type | Recommended BLOCK_BS |
---|---|---|---|---|---|---|---|---|
80GB | DeepSeek-V3/R1 | deepseek-ai/DeepSeek-R1/a8w8_fp8_wint4 | 1 | a8w8_fp8_wint4 | 8K | No | - | 40 |
80GB | DeepSeek-V3/R1 | deepseek-ai/DeepSeek-R1-MTP/a8w8_fp8_wint4 | 1 | a8w8_fp8_wint4 | 8K | Yes | a8w8_fp8 | 36 |
80GB | DeepSeek-V3/R1 | deepseek-ai/DeepSeek-R1/weight_only_int4 | 1 | weight_only_int4 | 8K | No | - | 40 |
80GB | DeepSeek-V3/R1 | deepseek-ai/DeepSeek-R1-MTP/weight_only_int4 | 1 | weight_only_int4 | 8K | Yes | weight_only_int8 | 36 |
80GB | DeepSeek-V3/R1 | deepseek-ai/DeepSeek-R1-2nodes/a8w8_fp8 | 2 | a8w8_fp8 | 8K | No | - | 50 |
80GB | DeepSeek-V3/R1 | deepseek-ai/DeepSeek-R1-MTP-2nodes/a8w8_fp8 | 2 | a8w8_fp8 | 8K | Yes | a8w8_fp8 | 36 |
80GB | DeepSeek-V3/R1 | deepseek-ai/DeepSeek-R1-2nodes/weight_only_int8 | 2 | weight_only_int8 | 8K | No | - | 40 |
80GB | DeepSeek-V3/R1 | deepseek-ai/DeepSeek-R1-MTP-2nodes/weight_only_int8 | 2 | weight_only_int8 | 8K | Yes | weight_only_int8 | 36 |
Request Parameters#
Field Name | Type | Description | Required | Default Value | Remarks |
---|---|---|---|---|---|
req_id | str | Request ID (unique identifier) | No | Random ID | Duplicate req_id will return error |
text | str | Input text | No | None | Either text or messages must be provided |
messages | list | Multi-turn conversation context | No | None | Passed as a list of messages |
max_dec_len | int | Maximum generated tokens | No | max_seq_len - input_tokens | Requests exceeding limit will return error |
min_dec_len | int | Minimum generated tokens | No | 1 | |
topp | float | Top-p sampling (0-1) | No | 0.7 | Higher values increase randomness |
temperature | float | Temperature (must be > 0) | No | 0.95 | Lower values reduce randomness |
frequency_score | float | Frequency penalty | No | 0 | |
penalty_score | float | Repetition penalty | No | 1 | |
presence_score | float | Presence penalty | No | 0 | |
stream | bool | Stream response | No | False | |
timeout | int | Request timeout (seconds) | No | 300 | |
return_usage | bool | Return input/output token counts | No | False |
The service supports both gRPC and HTTP requests. The stream parameter applies only to HTTP requests.
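For illustration only, an HTTP request body combining several of the optional fields above might look like this (values are arbitrary):
import requests

payload = {
    "text": "Write a short poem about the sea.",
    "max_dec_len": 128,
    "min_dec_len": 8,
    "topp": 0.8,
    "temperature": 0.7,
    "penalty_score": 1.1,
    "stream": False,
    "return_usage": True,  # include input/output token counts in the response
}
# POST to the SERVICE_HTTP_PORT endpoint shown in the HTTP Invocation section
res = requests.post("http://127.0.0.1:9965/v1/chat/completions", json=payload)
print(res.json())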