DeepSeek R1 at 70B Parameters: Up to a 15x Local Inference Speedup with prima.cpp – A Game-Changer for Private Deployment!


Project Address: https://github.com/Lizonghang/prima.cpp
Paper Address: https://arxiv.org/abs/2504.08791

With the rapid advancement of large language model (LLM) technology, many companies have launched powerful and efficient cloud-based inference services. These cloud services provide users with convenient, low-cost solutions, covering extensive corpora and robust computational capabilities. However, cloud deployment requires a continuous internet connection, which may raise concerns about privacy leaks, data security, and network dependency.

To meet the needs of enterprise and individual users who demand higher data security and offline availability, local deployment solutions for LLMs have emerged. Local deployment not only enhances data control but also improves response speed and system customization. However, the high hardware requirements of traditional large models pose challenges for local deployment, especially on resource-constrained devices. To address this issue, researchers and developers are committed to creating lightweight deployment solutions, making it possible to run LLMs on low-cost devices.

llama.cpp is a high-performance open-source inference framework written in C/C++ that runs the LLaMA family of models efficiently on devices without GPUs. By introducing the GGUF file format and a range of quantization techniques, it significantly reduces memory usage and computational demands. At the time of writing, it has garnered 78.3k stars on GitHub.

Recently, a new inference framework, prima.cpp, has been open-sourced. This project was jointly developed by researchers from Mohamed bin Zayed University of Artificial Intelligence and the University of Electronic Science and Technology of China. It aims to efficiently run larger models on resource-limited everyday devices. Experiments show that the project successfully deployed the 70B-scale large language model DeepSeek R1. Compared to the llama.cpp solution, it achieves up to 15x speed improvements on some models and demonstrates significant advancements in memory management.

Highlights and Innovations of prima.cpp

According to the April 2025 paper PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters, prima.cpp introduces several innovations:

  • Memory Savings and Cost Reduction: It uses mmap to load model weights on demand, much like buffering while streaming a movie, which reduces memory pressure. Combined with an intelligent “division of labor” mechanism, this makes it possible to run ultra-large models on home computers (see the sketch after this list).
  • CPU and GPU Collaboration: Instead of relying solely on GPUs, it assigns work to both CPUs and GPUs according to each device’s capabilities, like an efficient team where each member plays to their strengths, improving overall speed.
  • High-Speed “Pipeline” Mechanism: It introduces a piped-ring architecture akin to an assembly line, prefetching the data the model will need next to avoid stalls and keep inference running smoothly.
  • Intelligent Load Scheduling: The system automatically evaluates each machine’s “capability score” (processing power and memory) and allocates tasks accordingly. Stronger devices handle heavier workloads, while weaker ones assist.
  • Compatibility with Mainstream Models and Formats: Whether you prefer Meta’s LLaMA or Chinese models such as DeepSeek and Qwen, prima.cpp supports them, along with a range of quantization formats.
  • Multi-System Support for Broader Deployment: prima.cpp can be deployed on both macOS and Linux, and the team is working on a Windows deployment solution.
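
The on-demand loading is ordinary mmap at work. As a minimal, Linux-only sketch (assuming the llama-cli binary and model path used later in this tutorial), you can watch the model file appear as a memory-mapped region and see that only touched pages become resident:

# Run a short generation in the background (paths follow the tutorial below)
./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 512 -p "hello" -n 16 &
PID=$!
sleep 10
# The .gguf file shows up as a memory-mapped region of the process
grep gguf /proc/$PID/maps | head -n 3
# Resident memory (VmRSS) stays far below the total mapped size (VmSize),
# because model pages that have not been touched yet are not loaded into RAM
grep -E 'VmSize|VmRSS' /proc/$PID/status
wait $PID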

Currently Supported Popular Models:

DeepSeek

DeepSeek R1 is a large language model series released by the DeepSeek AI team; the distilled variants supported here range from 7B to 70B. Distillation compresses the models and improves efficiency. Built on Qwen and LLaMA architectures, they offer strong comprehension while remaining hardware-friendly, which makes them a good match for a high-performance local deployment framework like prima.cpp, balancing performance and resource usage.

  • DeepSeek R1-7B (Q4K, Q6K, Q80): deepseek-ai.DeepSeek-R1-Distill-Qwen-7B
  • DeepSeek R1-8B (Q4K, Q6K, Q80): deepseek-ai.DeepSeek-R1-Distill-Llama-8B
  • DeepSeek R1-14B (Q4K, Q6K, Q80): deepseek-ai.DeepSeek-R1-Distill-Qwen-14B
  • DeepSeek R1-32B (Q4K, Q6K, Q80): deepseek-ai.DeepSeek-R1-Distill-Qwen-32B
  • DeepSeek R1-70B (Q4K, Q6K, Q80): DeepSeek-R1-Distill-Llama-70B

Llama

LLaMA is a series of large language models from Meta (Facebook’s parent company), known for open weights and strong performance, and widely used for natural language understanding, dialogue generation, programming assistance, and more. It spans sizes from small to ultra-large, such as 7B, 30B, and 70B, catering to various computational environments. LLaMA is extremely popular in the global community, almost the “default choice” for locally deployed AI.

  • Llama 3-8B (Q4K, Q6K, Q80): Meta-Llama-3-8B-Instruct
  • Llama 3-14B (Q4K, Q6K, Q80): Llama-3-14B-Instruct-v1
  • Llama 1-30B (Q4K, Q6K, Q80): upstage-llama-30b-instruct-2048
  • Llama 3-45B (Q4K, Q6K, Q80): Llama-3-pruned-45B-Drobeta-Turnu-Severin
  • Llama 3-60B (Q4K, Q6K, Q80): nyun-llama3-60B
  • Llama 1-65B (Q4K, Q6K, Q80): llama-65b
  • Llama 3-70B (Q4K, Q6K, Q80): Meta-Llama-3-70B-Instruct

Qwen 2.5 / QwQ

Qwen 2.5 is a new-generation large model family developed by Alibaba’s Tongyi Lab, optimized for Chinese comprehension while retaining strong multilingual capabilities. QwQ-32B is its reasoning-focused sibling, built on the Qwen 2.5-32B base and tuned for step-by-step problem solving, yet still practical to run locally once quantized. Both models excel in Chinese Q&A, customer service, education, and other fields.

  • Qwen 2.5-7B (Q4K, Q6K, Q80): Qwen2.5-7B-Instruct
  • Qwen 2.5-14B (Q4K, Q6K, Q80): Qwen2.5-14B-Instruct
  • Qwen 2.5-32B (Q4K, Q6K, Q80): Qwen2.5-32B-Instruct
  • Qwen 2.5-72B (Q4K, Q6K, Q80): Qwen2.5-72B-Instruct
  • QwQ-32B (Q4K, Q6K, Q80): qwq-32b

Application Scenarios of prima.cpp

  • End Users: Can deploy advanced 30B to 70B large models on ordinary devices for localized intelligent assistant functions, such as natural language conversations, schedule management, and home automation control.
  • Developers: Can run large models in local environments for testing, debugging, and fine-tuning, reducing reliance on cloud services, lowering development costs, and protecting sensitive data privacy.
  • Educational Institutions and Researchers: Can use prima.cpp on ordinary computing devices for teaching and research on large models, promoting the popularization of AI education and deeper scientific exploration.
  • Small and Medium Enterprises (SMEs): Can deploy large language models internally for customer service, document generation, data analysis, and other scenarios, improving efficiency and ensuring data security.

Performance of prima.cpp

Official Test Devices:

| Device        | D1          | D2        | D3        | D4                |
|---------------|-------------|-----------|-----------|-------------------|
| Hardware      | Mac M1      | Laptop    | Desktop   | Mate40Pro         |
| OS            | macOS (UMA) | Linux     | Linux     | Linux (HarmonyOS) |
| CPU           | Apple M1    | Intel i9  | Intel i9  | Kirin 9000        |
| CPU Cores     | 8           | 8         | 16        | 8                 |
| RAM (Avail.)  | 2.4 GiB     | 4.1 GiB   | 9.7 GiB   | 1.9 GiB           |
| Disk Speed    | 0.72 GB/s   | 2.98 GB/s | 3.17 GB/s | 1.37 GB/s         |
| GPU Type      | Apple Metal | 3070      | 2080Ti    | –                 |
| VRAM (Avail.) | –           | 8 GiB     | 11 GiB    | –                 |

Inference Speed Comparison Between llama.cpp and prima.cpp

The results (per-token latency, lower is better) show that llama.cpp is faster on small models, but prima.cpp pulls far ahead on ultra-large models, and its latency grows much more gently as the parameter count increases.

| Model                         | llama.cpp (ms/token) | prima.cpp (ms/token) |
|-------------------------------|----------------------|----------------------|
| Llama 3-8B                    | 15                   | 54                   |
| Llama 3-14B                   | 20                   | 65                   |
| Llama 1-30B                   | 202                  | 72                   |
| Llama 3-45B                   | 328                  | 233                  |
| Llama 3-60B                   | 7965                 | 468                  |
| Llama 1-65B                   | 8807                 | 569                  |
| Llama 3-70B                   | 10120                | 674                  |
| Qwen-2.5-7B                   | 14                   | 44                   |
| DeepSeek-R1-Distill-Qwen-7B   | 14                   | 52                   |
| DeepSeek-R1-Distill-Llama-8B  | 14                   | 59                   |
| Qwen-2.5-14B                  | 23                   | 65                   |
| DeepSeek-R1-Distill-Qwen-14B  | 24                   | 76                   |
| Qwen-2.5-32B and QwQ-32B      | 224                  | 89                   |
| DeepSeek-R1-Distill-Qwen-32B  | 232                  | 93                   |
| DeepSeek-R1-Distill-Llama-70B | 10978                | 724                  |
| Qwen-2.5-72B                  | 12227                | 867                  |
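
For reference, the headline 15x figure follows directly from the 70B rows above; a quick sanity check (assuming bc is available):

# Per-token latency ratio at the 70B scale
echo "scale=1; 10120/674" | bc    # Llama 3-70B: prints 15.0
echo "scale=1; 10978/724" | bc    # DeepSeek-R1-Distill-Llama-70B: prints 15.1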

Deployment Tutorial

Since the project is new, it currently only supports Linux and macOS (Android requires a Linux emulation layer such as Termux). Anyone with experience building C/C++ projects can get started quickly, though in some regions downloading the dependencies and models may require a VPN or proxy.
Note: Run everything from an SSD; HDDs are too slow. Currently, prima.cpp's GPU acceleration only supports CUDA-based (NVIDIA) GPUs.

Environment Dependencies:

  • gcc >= 9.4.0 (common C/C++ compiler that turns source code into runnable binaries)
    Download: https://ftp.gnu.org/gnu/gcc/gcc-9.5.0/
  • make >= 4.2.1 (build automation tool, the “construction foreman” that drives compilation)
    Download: https://ftp.gnu.org/gnu/make/
  • cmake >= 3.16.3 (configures the project “blueprint” and works together with make)
    Download: https://cmake.org/files/v3.18/
  • fio >= 3.16 (used to benchmark disk read/write speed)
    Download: https://sourceforge.net/projects/fio/files/fio-3.16-amd64-jessie/download
  • zmq >= 4.3.2 (ZeroMQ messaging library for efficient inter-device communication)
    Download: https://zeromq.org/languages/cplusplus/
  • HiGHS >= 1.9.0 (optimization solver used by the intelligent task scheduler to distribute workloads)
    Download: https://github.com/ERGO-Code/HiGHS/releases
  • CUDA (optional, for NVIDIA GPUs)

On Linux (Ubuntu), run:

sudo apt update -y && sudo apt install -y gcc-9 make cmake fio git wget libzmq3-dev
# ----- Install HiGHS separately -----
git clone https://github.com/ERGO-Code/HiGHS.git
cd HiGHS
mkdir build && cd build
cmake ..
make -j$(nproc)
sudo make install

On macOS, run:

brew install gcc make cmake fio git wget highs zeromq
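
After installing on either platform, it is worth confirming the toolchain meets the minimum versions listed above (the pkg-config query assumes the libzmq development package is installed):

gcc --version | head -n 1       # expect >= 9.4.0
make --version | head -n 1      # expect >= 4.2.1
cmake --version | head -n 1     # expect >= 3.16.3
fio --version                   # expect >= 3.16
pkg-config --modversion libzmq  # expect >= 4.3.2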

Download prima.cpp and Initialize the Build:

git clone https://github.com/Lizonghang/prima.cpp.git
cd prima.cpp
make -j$(nproc)  # Default build, using all CPU cores
# ---------------------- Custom Build Options ----------------------
# If building on rank 0 device, add USE_HIGHS=1:
make USE_HIGHS=1 -j$(nproc)
# USE_HIGHS=1 enables the HIGHS solver (for mathematical optimization)
# -j$(nproc) uses all available CPU cores for parallel compilation

# If CUDA is installed, enable GPU acceleration with GGML_CUDA=1:
make GGML_CUDA=1 -j$(nproc)
# GGML_CUDA=1 enables CUDA backend for accelerated operations

# macOS users may disable Metal for stability with large models:
make LLAMA_NO_METAL=1 -j$(nproc)
# LLAMA_NO_METAL=1 disables macOS Metal acceleration

# Enable debug mode with LLAMA_DEBUG=1:
make LLAMA_DEBUG=1 -j$(nproc)
# LLAMA_DEBUG=1 enables debug output for development and troubleshooting
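
If the build succeeds, the llama-cli binary appears in the project root. A quick sanity check (a sketch, assuming the default build above):

ls -lh llama-cli            # the main inference binary produced by the build
./llama-cli -h | head -n 5  # print the first lines of the built-in help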

Download a Compatible Model:

For example, the officially recommended model:
https://huggingface.co/Qwen/QwQ-32B-GGUF

mkdir download
wget https://huggingface.co/Qwen/QwQ-32B-GGUF/resolve/main/qwq-32b-q4_k_m.gguf -P download/
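
The Q4_K_M file is large (on the order of 20 GB), so if the download is interrupted you can resume it and then sanity-check the result:

# Resume an interrupted download (-c continues from where it stopped)
wget -c https://huggingface.co/Qwen/QwQ-32B-GGUF/resolve/main/qwq-32b-q4_k_m.gguf -P download/
# Confirm the file is present and roughly the expected size (~20 GB)
ls -lh download/qwq-32b-q4_k_m.gguf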

Single-Machine Deployment (Using llama.cpp for Testing):

./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 -p "what is edge AI?" -n 256 -ngl 30
# -ngl 30 only takes effect when a GPU is available

llama-cli Quick Reference Table

| Parameter        | Example Value                | Description                                                    |
|------------------|------------------------------|----------------------------------------------------------------|
| -m               | download/qwq-32b-q4_k_m.gguf | Model path (.gguf format)                                      |
| -c               | 1024                         | Context window size (token count)                              |
| -p               | "what is edge AI?"           | Prompt (input text for generation)                             |
| -n               | 256                          | Max tokens to generate                                         |
| -ngl             | 30                           | Layers offloaded to GPU (more = faster, but requires VRAM)     |
| --temp           | 0.8                          | Temperature (lower = more deterministic)                       |
| --top-k          | 40                           | Sample from the top K most probable tokens                     |
| --top-p          | 0.95                         | Nucleus sampling: keep the smallest token set whose cumulative probability reaches P |
| --repeat_penalty | 1.1                          | Penalize repetition in the output                              |
| --seed           | 42                           | Random seed for reproducibility                                |
| --threads        | 8                            | CPU threads to use                                             |
| --color          | (no value)                   | Enable colored output                                          |
| --interactive    | (no value)                   | Enable multi-turn chat mode                                    |
| --keep           | 0                            | Tokens to retain (for context management)                      |
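
Putting several of these flags together, a more fully specified single-machine run might look like the sketch below (the values are illustrative, and exact flag spellings can vary slightly between llama.cpp-based builds):

./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 -n 256 -ngl 30 \
    --temp 0.8 --top-k 40 --top-p 0.95 --seed 42 --threads 8 --color \
    -p "what is edge AI?"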

Distributed Multi-Device Deployment (prima.cpp’s Key Feature)

Official Test Devices:

| Device Name | Approx. Model/Specs      | Assigned IP |
|-------------|--------------------------|-------------|
| Mac M1      | MacBook Air/Pro M1       | 192.168.1.2 |
| Laptop      | Mid-range Windows laptop | 192.168.1.3 |
| Desktop     | Mid-range desktop (GPU)  | 192.168.1.4 |
| Mate40 Pro  | Huawei flagship phone    | 192.168.1.5 |

Requirements:

  • All devices must be on the same WLAN to communicate.
  • Disable firewalls, or open the required ports (e.g., 9000, 10000), to avoid communication failures (see the example after this list).
  • For Android, use Termux to emulate Linux:
    Download: https://github.com/termux/termux-app/releases
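
On Ubuntu, for example, opening the ports with ufw rather than disabling the firewall entirely might look like this (the port numbers are the examples from above; adjust them to your setup):

sudo ufw allow 9000/tcp
sudo ufw allow 10000/tcp
sudo ufw status        # confirm the rules are active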

All four devices must be set up with the same environment (and model file) as in the Single-Machine Deployment section above.

Run on Each Device:

# D0 (Head Device):
./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 -n 256 -p "what is edge AI?" --world 4 --rank 0 --master 192.168.1.2 --next 192.168.1.3 --prefetch

# D1:
./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 --world 4 --rank 1 --master 192.168.1.2 --next 192.168.1.4 --prefetch --gpu-mem 8

# D2:
./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 --world 4 --rank 2 --master 192.168.1.2 --next 192.168.1.5 --prefetch --gpu-mem 11

# D3:
./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 --world 4 --rank 3 --master 192.168.1.2 --next 192.168.1.2 --prefetch

D0 is the master device. Once launched, prima.cpp analyzes each device’s capabilities and assigns workloads (e.g., how many model layers each device processes, and how many run on GPUs).

Communication follows a ring topology (e.g., D0 → D1 → D2 → D3 → D0). The paper used a 6-device network topology (see the diagram in the paper).
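
Before launching, it can help to verify that each device can actually reach its --next neighbor over the LAN. A quick sketch using the example IPs and ports above, run from the head device:

ping -c 3 192.168.1.3     # basic reachability of the next device in the ring
nc -zv 192.168.1.3 9000   # check that the communication port is open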

Manual Layer Distribution Control:

Use -lw (or --layer-window) and -ngl to specify layer distribution:

  • -lw: Sets the number of layers per device (comma-separated, ordered by rank).
    Examples: "8,8,8,8", "4,4,4,4", "16,16,24,8".
  • -ngl: Sets the number of GPU-offloaded layers on each device.
# On head device (rank 0, no GPU):
./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 -n 256 -p "what is edge AI?" --world 4 --rank 0 --master 192.168.1.2 --next 192.168.1.3 --prefetch -lw "16,16,16,16"

# On worker device (rank 1, 8 GiB VRAM):
./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 --world 4 --rank 1 --master 192.168.1.2 --next 192.168.1.4 --prefetch -ngl 16

# On worker device (rank 2, 11 GiB VRAM):
./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 --world 4 --rank 2 --master 192.168.1.2 --next 192.168.1.5 --prefetch -ngl 16

# On worker device (rank 3, no GPU):
./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 --world 4 --rank 3 --master 192.168.1.2 --next 192.168.1.2 --prefetch

Deployment on Virtual Machines (Docker)

Official Test Setup:

  • Host: 32 CPU cores, 32 GiB RAM, 32 GiB VRAM.
  • Four homogeneous nodes simulated via Docker, each allocated 8 CPU cores, 8 GiB RAM, and 8 GiB VRAM.

Pull 4 Docker Images:

sudo docker run -dit --name prima-v1 --memory=8gb --memory-swap=8gb --cpus 8 --cpuset-cpus="0-7" --network host --gpus all prima.cpp:1.0.1-cuda
sudo docker run -dit --name prima-v2 --memory=8gb --memory-swap=8gb --cpus 8 --cpuset-cpus="8-15" --network host --gpus all prima.cpp:1.0.1-cuda
sudo docker run -dit --name prima-v3 --memory=8gb --memory-swap=8gb --cpus 8 --cpuset-cpus="16-23" --network host --gpus all prima.cpp:1.0.1-cuda
sudo docker run -dit --name prima-v4 --memory=8gb --memory-swap=8gb --cpus 8 --cpuset-cpus="24-31" --network host --gpus all prima.cpp:1.0.1-cuda
# For non-GPU setups, remove `--gpus all`
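
Once started, confirm that all four containers are running:

sudo docker ps --filter "name=prima-v"   # should list prima-v1 through prima-v4 with STATUS "Up"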

Copy the Model to All Containers:

cd prima.cpp/download
sudo docker cp qwq-32b-q4_k_m.gguf prima-v1:/root/prima.cpp/download/
sudo docker cp qwq-32b-q4_k_m.gguf prima-v2:/root/prima.cpp/download/
sudo docker cp qwq-32b-q4_k_m.gguf prima-v3:/root/prima.cpp/download/
sudo docker cp qwq-32b-q4_k_m.gguf prima-v4:/root/prima.cpp/download/
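
The rebuild and inference steps below are executed inside each container; open a shell in a container with, for example:

sudo docker exec -it prima-v1 bash   # repeat for prima-v2, prima-v3, prima-v4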

Rebuild prima.cpp in Containers (Non-GPU):

cd ./prima.cpp && make clean
make -j$(nproc)  # If not rank 0
make USE_HIGHS=1 -j$(nproc)  # If rank 0

Start Inference:

cd ./prima.cpp
# (prima-v1)
./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 -n 256 -p "what is edge AI?" --world 4 --rank 0 --prefetch --gpu-mem 8
# (prima-v2)
./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 --world 4 --rank 1 --prefetch --gpu-mem 8
# (prima-v3)
./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 --world 4 --rank 2 --prefetch --gpu-mem 8
# (prima-v4)
./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 --world 4 --rank 3 --prefetch --gpu-mem 8
# For non-GPU, remove `--gpu-mem 8`

Start Chat Mode:

Add -cnv to the head device’s command to enable conversation (chat) mode. Type quit or exit to leave the chat.

./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 -n 256 -p "what is edge AI?" --world 4 --rank 0 --master 192.168.1.2 --next 192.168.1.3 --prefetch -lw "16,16,16,16" -cnv

Notes:

  • Prefetching: By default, prima.cpp only advises the OS to prefetch upcoming layer weights. To force explicit prefetching, add --force, but this may introduce extra latency, so enable it only after testing.

Current Limitations:

  • Limited device compatibility.
  • Redundant configuration steps across devices.
  • Ring communication is sequential, with no fault tolerance if a device fails.
  • The full model weights must be stored on every device.
