Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Transfer Engine, the core of Mooncake, is now open-sourced! This repository also hosts the technical report and the open-sourced traces.

🔄 Updates

Nov 28, 2024: We open-sourced Transfer Engine, the central component of Mooncake. We also provide two demonstrations of Transfer Engine: a P2P Store and a vLLM integration.
July 9, 2024: We open-sourced the trace as a jsonl file!
June 27, 2024: We published a series of Chinese blog posts with more discussion on zhihu 1, 2, 3, 4.
June 26, 2024: Initial technical report release.

🎉 Overview

Mooncake features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated cache of KVCache.

The core of Mooncake is its KVCache-centric scheduler, which maximizes overall effective throughput while meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake must handle highly overloaded scenarios. To mitigate the overload, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake achieves up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake's innovative architecture enables Kimi to handle 75% more requests.
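As a rough illustration of what a prediction-based early rejection check could look like, the sketch below admits a request only if its estimated TTFT and the predicted decoding-stage time between tokens (TBT) both stay within their SLOs. The function name, its parameters, and the linear prefill-time estimate are assumptions made for illustration; they are not taken from Mooncake's scheduler.

```python
from dataclasses import dataclass


@dataclass
class Request:
    input_tokens: int  # prompt length in tokens


def should_accept(
    req: Request,
    prefill_queue_tokens: int,      # tokens already waiting for prefill (hypothetical input)
    prefill_tokens_per_s: float,    # assumed prefill speed of the cluster
    predicted_decode_tbt_ms: float, # predicted TBT once this request reaches decoding
    ttft_slo_ms: float,
    tbt_slo_ms: float,
) -> bool:
    """Reject early if either latency SLO is predicted to be violated."""
    # Estimated TTFT: queued prefill work plus this request's own prefill.
    est_ttft_ms = (prefill_queue_tokens + req.input_tokens) / prefill_tokens_per_s * 1000.0
    if est_ttft_ms > ttft_slo_ms:
        return False
    # Rejecting here avoids spending prefill compute on a request
    # that the decoding stage would not be able to serve in time.
    if predicted_decode_tbt_ms > tbt_slo_ms:
        return False
    return True
```

The point of checking the decoding-stage prediction before prefill starts is that a request rejected after prefill has already consumed resources that other requests could have used.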

🧩 Components

The foundation of Mooncake is Transfer Engine, which supports rapid, reliable and flexible data transfer over TCP, RDMA, NVIDIA GPUDirect-based RDMA and NVMe over Fabric (NVMe-oF) protocols. Compared with gloo (used by Distributed PyTorch) and TCP, Mooncake Transfer Engine has the lowest I/O latency.
Based on Transfer Engine, we implemented the P2P Store library, which supports sharing temporary objects (e.g., checkpoint files) among nodes in a cluster. It avoids bandwidth saturation on a single machine.
Additionally, we modified vLLM to integrate Transfer Engine, which makes prefill-decode disaggregation more efficient by utilizing RDMA devices.
In the future, we plan to build Mooncake Store on top of Transfer Engine to support pooled KVCache for more flexible P/D disaggregation.

🔥 Show Cases

Use Transfer Engine Standalone (Guide)

Transfer Engine is a high-performance data transfer framework. It provides a unified interface for transferring data from DRAM, VRAM or NVMe while hiding hardware-specific details. It supports TCP, RDMA (InfiniBand/RoCEv2/eRDMA/NVIDIA GPUDirect) and NVMe over Fabric (NVMe-oF) protocols.

Highlights

Efficient use of multiple RDMA NIC devices. Transfer Engine supports the use of multiple RDMA NIC devices to achieve the aggregation of transfer bandwidth.
Topology-aware path selection. Transfer Engine can select optimal devices based on the location (NUMA affinity, etc.) of both source and destination.
More robust against temporary network errors. Once a transmission fails, Transfer Engine automatically tries alternative paths for data delivery.
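The last two highlights can be pictured with a small sketch: prefer NICs on the same NUMA node as the buffer, and fall back to the next candidate when a send fails. This is only an illustration of the idea; the `Nic` type, the device names, and both functions are hypothetical and are not part of the Transfer Engine API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Nic:
    name: str       # hypothetical device name
    numa_node: int  # NUMA node the NIC is attached to


def rank_nics(nics: List[Nic], buffer_numa_node: int) -> List[Nic]:
    """Order NICs so that those local to the buffer's NUMA node come first."""
    # sorted() is stable, so the original order is kept within each group.
    return sorted(nics, key=lambda nic: nic.numa_node != buffer_numa_node)


def transfer_with_failover(
    nics: List[Nic], buffer_numa_node: int, send: Callable[[Nic], None]
) -> str:
    """Try each NIC in preference order; return the name of the one that worked."""
    for nic in rank_nics(nics, buffer_numa_node):
        try:
            send(nic)
            return nic.name
        except ConnectionError:
            continue  # temporary failure: try the next path
    raise ConnectionError("all paths failed")
```

A real implementation would also consider the destination's topology and spread traffic across several NICs at once; the sketch covers only source-side ordering and retry.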

Performance

With 40 GB of data (equivalent to the size of the KVCache generated by 128k tokens in the LLaMA3-70B model), Mooncake Transfer Engine delivers up to 87 GB/s and 190 GB/s of bandwidth in 4×200 Gbps and 8×400 Gbps RoCE networks respectively, which are about 2.4x and 4.6x faster than the TCP protocol.
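The 40 GB figure can be reproduced with a short calculation. The sketch below assumes the publicly documented LLaMA3-70B shape (80 layers, 8 key/value heads under grouped-query attention, head dimension 128) and 16-bit cache entries; these values are assumptions not stated in this README. It then estimates how long moving 40 GB would take at the reported bandwidths.

```python
def kvcache_bytes(
    tokens: int,
    layers: int = 80,         # assumed LLaMA3-70B depth
    kv_heads: int = 8,        # assumed number of key/value heads (GQA)
    head_dim: int = 128,      # assumed per-head dimension
    bytes_per_elem: int = 2,  # assumed 16-bit cache entries
) -> int:
    """KVCache size: one key and one value vector per layer, KV head and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealized transfer time, ignoring setup cost and contention."""
    return size_gb / bandwidth_gb_per_s


size_gib = kvcache_bytes(128 * 1024) / 2**30
print(f"KVCache for 128k tokens: {size_gib:.1f} GiB")

for label, bandwidth in (("4x200 Gbps RoCE", 87.0), ("8x400 Gbps RoCE", 190.0)):
    print(f"{label}: {transfer_seconds(40.0, bandwidth):.2f} s at {bandwidth:.0f} GB/s")
```

Under these assumptions the cache works out to 40 GiB, and the idealized transfer times are roughly 0.46 s and 0.21 s for the two network configurations.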

P2P Store (Guide)

P2P Store is built on the Transfer Engine and supports sharing temporary objects between peer nodes in a cluster. P2P Store is ideal for scenarios like checkpoint transfer, where data needs to be rapidly and efficiently shared across a cluster. P2P Store has been used in the checkpoint transfer service of Moonshot AI.

Highlights

Decentralized architecture. P2P Store leverages a pure client-side architecture with global metadata managed by the etcd service.
Efficient data distribution. Designed to enhance the efficiency of large-scale data distribution, P2P Store avoids bandwidth saturation issues by allowing replicated nodes to share data directly. This reduces the CPU/RDMA NIC pressures of data providers (e.g., trainers).

Performance

Thanks to the high performance of Transfer Engine, P2P Store can also distribute objects while fully utilizing the hardware's incoming bandwidth (e.g., in the benchmark with a 25 Gbps NIC, the throughput of get replica is about 3.1 GB/s).
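To see why 3.1 GB/s counts as full utilization, convert the NIC's line rate from bits to bytes. The helper name below is hypothetical; the numbers are the ones quoted above.

```python
def line_rate_gb_per_s(gbps: float) -> float:
    """Convert a link speed in gigabits per second to gigabytes per second."""
    return gbps / 8


nic_limit = line_rate_gb_per_s(25)  # 25 Gbps NIC
utilization = 3.1 / nic_limit       # reported get-replica throughput

print(f"line rate: {nic_limit:.3f} GB/s, utilization: {utilization:.1%}")
```

A 25 Gbps link carries at most 3.125 GB/s, so the reported throughput is within about 1% of the line rate.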

vLLM Integration (Guide)

To optimize LLM inference, the vLLM community is working on supporting disaggregated prefilling (PR 8498). This feature allows the prefill phase and the decode phase to run in separate processes. vLLM uses nccl and gloo as the transport layer by default, but it currently cannot efficiently decouple the two phases across different machines.

We have implemented a vLLM integration that uses Transfer Engine as the network layer instead of nccl and gloo to support inter-node KVCache transfer. Transfer Engine provides a simpler interface and makes more efficient use of RDMA devices. In the future, we plan to build Mooncake Store on top of Transfer Engine to support pooled prefill/decode disaggregation.

Performance

By supporting Topology Aware Path Selection and multi-card bandwidth aggregation, TTFT of vLLM with Transfer Engine is up to 33% lower than traditional TCP-based transports. In the future, we will further improve TTFT through GPUDirect RDMA and zero-copy.

| Backend/Setting | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) |
|---|---|---|---|---|---|
| Transfer Engine (RDMA) | 12.07 | 2046.78 | 1165.25 | 678.74 | 4576.57 |
| TCP | 12.06 | 2045.51 | 1925.52 | 1011.58 | 8149.52 |

Click here to access detailed benchmark results.
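The relative TTFT reduction can be computed directly from the table's TTFT columns. The snippet below is a small worked calculation using only those numbers; the variable and function names are illustrative.

```python
# (TCP, Transfer Engine over RDMA) TTFT values in milliseconds, from the table.
ttft_ms = {
    "mean": (1925.52, 1165.25),
    "median": (1011.58, 678.74),
    "p99": (8149.52, 4576.57),
}


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


reductions = {
    metric: relative_reduction(tcp, rdma) for metric, (tcp, rdma) in ttft_ms.items()
}

for metric, value in reductions.items():
    print(f"{metric} TTFT reduction: {value:.1%}")
```

With these inputs the median TTFT drops by about 33%, while the mean and P99 drop by about 39% and 44% respectively.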

More advanced features are coming soon, so stay tuned!

🚀 Quick Start

Preparation

In order to install and use Mooncake, some preparation is required.

RDMA Driver & SDK (e.g., Mellanox OFED).
Linux-x86_64 with gcc, g++ (9.4+) and cmake (3.16+).
Python (3.10 or above)

In addition, to support more features of Mooncake Transfer Engine, we recommend installing the following components:

CUDA 12.1 or above, including NVIDIA GPUDirect Storage support, if you want to build with -DUSE_CUDA. You may install them from here.

```
# Adding CUDA to PATH
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_PATH=/usr/local/cuda
```

Go 1.20+, if you want to build with -DWITH_P2P_STORE. You may download it from here.
Rust toolchain, if you want to build with -DWITH_RUST_EXAMPLE.

Installation

Init source code

```
git clone https://github.com/kvcache-ai/Mooncake.git
cd Mooncake
```

Install dependencies
```
bash dependencies.sh
```

Compile Mooncake and examples

```
mkdir build
cd build
cmake .. # (optional) Specify build options like -D
make -j
```

🛣️ Incoming Milestones

First release of Mooncake and integration with the latest vLLM
Share KV caches across multiple serving engines
User and developer documentation

📦 Open Source Trace

```
{
    "timestamp": 27482,
    "input_length": 6955,
    "output_length": 52,
    "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2353, 2354]
}
{
    "timestamp": 30535,
    "input_length": 6472,
    "output_length": 26,
    "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2366]
}
```

The above presents two samples from our trace dataset. The trace includes the timing of request arrivals, the number of input tokens, the number of output tokens, and the remapped block hash. To protect our customers' privacy, we applied several mechanisms to remove user-related information while preserving the dataset's utility for simulated evaluation. More descriptions of the trace (e.g., up to 50% cache hit ratio) can be found in Section 4 of the paper's Version 3.
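As an illustration of how the trace might be used in a simulation, the sketch below parses the two sample records in jsonl form (one JSON object per line) and counts how many leading block hashes the second request shares with the first. Treating a shared leading run of `hash_ids` as reusable prefix cache is an assumption of this sketch, and the helper names are hypothetical.

```python
import json

# The two sample records from above, written one per line as in a jsonl file.
SAMPLE = """\
{"timestamp": 27482, "input_length": 6955, "output_length": 52, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2353, 2354]}
{"timestamp": 30535, "input_length": 6472, "output_length": 26, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2366]}
"""


def load_trace(text: str) -> list:
    """Parse jsonl text into a list of request records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def shared_prefix_blocks(a: list, b: list) -> int:
    """Number of leading block hashes that two requests have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


first, second = load_trace(SAMPLE)
hit_blocks = shared_prefix_blocks(first["hash_ids"], second["hash_ids"])
hit_ratio = hit_blocks / len(second["hash_ids"])

print(f"second request reuses {hit_blocks} of {len(second['hash_ids'])} blocks")
```

In these samples the second request shares its first 12 blocks with the earlier one. The block counts (14 and 13) are also consistent with one hash per 512 input tokens, rounded up, though that block size is inferred from the two samples rather than stated here.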

📑 Citation

Please cite our paper if you find the paper or the trace useful:

```
@article{qin2024mooncake,
  title        = {Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving},
  author       = {Ruoyu Qin and Zheming Li and Weiran He and Mingxing Zhang and Yongwei Wu and Weimin Zheng and Xinran Xu},
  year         = {2024},
  url          = {https://arxiv.org/abs/2407.00079}
}
```
