I. Prefill-Decode Disaggregation (PD 分离)

1. Motivation (动机)

In standard LLM serving, prefill and decode run on the same GPU and interfere with each other:

Prefill is compute-bound (计算密集型): it processes hundreds of prompt tokens in parallel in one long iteration, which saturates the GPU's compute units.
Decode is memory-bandwidth-bound (内存带宽密集型): each step reads the model weights and the full KV cache to produce just 1 new token, so the compute units sit mostly idle while waiting on memory.
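
The compute-bound versus bandwidth-bound split can be made concrete with a rough arithmetic-intensity estimate. The sketch below is a simplification rather than a profiler result: it counts only the weight matrix multiplications (about 2 FLOPs per parameter per token), ignores attention and KV cache traffic, and uses approximate public H100 SXM figures as assumed hardware limits.

```python
# Rough roofline check: FLOPs performed per byte of weights read.
# Assumptions (not from the source): FP16 weights, matmul-only cost model,
# approximate H100 SXM peak numbers.
BYTES_PER_PARAM = 2            # FP16/BF16
PEAK_FLOPS = 989e12            # approx. dense BF16 FLOP/s (assumed)
PEAK_BW = 3.35e12              # approx. HBM bytes/s (assumed)


def arithmetic_intensity(tokens_per_pass: int) -> float:
    """FLOPs per weight byte when `tokens_per_pass` tokens share one weight read."""
    flops_per_param = 2 * tokens_per_pass   # one multiply + one add per token
    return flops_per_param / BYTES_PER_PARAM


ridge = PEAK_FLOPS / PEAK_BW   # intensity needed to become compute-bound

cases = [("decode, batch 1", 1), ("decode, batch 32", 32), ("prefill, 512 tokens", 512)]
for name, tokens in cases:
    ai = arithmetic_intensity(tokens)
    bound = "compute-bound" if ai >= ridge else "bandwidth-bound"
    print(f"{name:22s} intensity={ai:6.0f} FLOP/B  ridge={ridge:.0f}  -> {bound}")
```

Under these assumptions a single decode step sits far below the ridge point, while a 512-token prefill sits above it, which is the asymmetry the two bullets describe.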

Running them together causes prefill-decode interference (干扰):

A large prefill blocks decode iterations → spikes in Inter-Token Latency (ITL, 令牌间延迟).
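The two phases want different batching: decode needs large batches to amortize each read of the weights, while a single long prompt can already saturate compute during prefill, so one shared batching policy suits neither.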

PD disaggregation puts them on separate GPU pools so each can be tuned independently.
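
To illustrate the ITL spike, here is a deterministic toy timeline with one decode stream and one prompt arriving mid-stream. The step times and the function name `inter_token_latencies` are illustrative assumptions, not measurements from a real serving stack.

```python
# Deterministic toy timeline: one decode stream, one prompt arriving mid-stream.
# All timings are illustrative assumptions.
DECODE_STEP_MS = 20.0          # one decode iteration
PREFILL_MS_PER_TOKEN = 0.5     # prefill cost per prompt token


def inter_token_latencies(n_tokens: int, prefill_at_step: int,
                          prompt_len: int, coupled: bool) -> list:
    """Return the gap (ms) before each generated token of the decode stream."""
    gaps = []
    for step in range(n_tokens):
        gap = DECODE_STEP_MS
        # On a shared GPU the prefill iteration runs first and delays this token.
        if coupled and step == prefill_at_step:
            gap += prompt_len * PREFILL_MS_PER_TOKEN
        gaps.append(gap)
    return gaps


coupled = inter_token_latencies(10, prefill_at_step=4, prompt_len=2048, coupled=True)
split = inter_token_latencies(10, prefill_at_step=4, prompt_len=2048, coupled=False)
print(f"coupled       max ITL = {max(coupled):.0f} ms")
print(f"disaggregated max ITL = {max(split):.0f} ms")
```

In this model the shared GPU stalls the decode stream for the full prefill duration, while the separate decode pool keeps a flat 20 ms step.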

2. Architecture (架构)

1) Two Pools (两个资源池)

| Pool | Role | Bottleneck | Optimal hardware |
| --- | --- | --- | --- |
| Prefill pool (预填充池) | Process prompt tokens, build KV cache | Compute (FLOPs) | High-FLOP GPUs (e.g. H100 SXM) |
| Decode pool (解码池) | Autoregressive token generation | Memory bandwidth | GPUs with high memory bandwidth, or more GPUs |
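
Because the two pools have different bottlenecks, their sizes follow from the traffic mix. The helper below is a hypothetical back-of-the-envelope sizing rule (workers needed = arrival rate × service time); the name `pool_sizes`, the per-token times, and the `decode_batch` parameter are assumptions for illustration only.

```python
import math


# Hypothetical sizing helper: workers needed = arrival rate x service time,
# divided by how many requests one decode worker serves concurrently.
def pool_sizes(req_per_s: float, prompt_len: int, new_tokens: int,
               prefill_s_per_token: float = 0.0005,
               decode_s_per_token: float = 0.020,
               decode_batch: int = 16) -> tuple:
    prefill_busy = req_per_s * prompt_len * prefill_s_per_token
    decode_busy = req_per_s * new_tokens * decode_s_per_token / decode_batch
    return math.ceil(prefill_busy), math.ceil(decode_busy)


# Long prompts, short answers (summarization-like traffic).
print(pool_sizes(req_per_s=10, prompt_len=3900, new_tokens=100))
# Short prompts, long answers (chat-like traffic).
print(pool_sizes(req_per_s=10, prompt_len=300, new_tokens=750))
```

The two traffic mixes call for opposite prefill-to-decode ratios, which a single coupled pool cannot express.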

2) KV Cache Transfer (KV缓存传输)

After prefill completes, the KV cache must be migrated from the prefill GPU to the decode GPU. This is the central engineering challenge of PD disaggregation.

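$T_{\text{transfer}} = \frac{2 \times n_{\text{layers}} \times d_{\text{model}} \times L_{\text{prompt}} \times b}{\text{NVLink / RDMA bandwidth}}$

Where $L_{\text{prompt}}$ is the prompt length (提示长度), $b$ is the number of bytes per element (2 for FP16), and the factor 2 counts keys and values. The expression assumes full multi-head attention; with grouped-query attention, $d_{\text{model}}$ is replaced by $n_{\text{kv}} \times d_{\text{head}}$, the number of KV heads times the head dimension. For a 70B-class model with 80 layers and $d_{\text{model}} = 8192$, a 4096-token prompt produces about 10.7 GB of KV cache in FP16 under full multi-head attention, or about 1.3 GB with 8 KV heads of dimension 128. Transfer latency adds directly to TTFT (首个令牌时间).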
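
The KV cache size can be evaluated directly. The sketch below writes the model width as `n_kv_heads * head_dim` and includes an explicit bytes-per-element factor; both the function names and the FP16 assumption are choices made for this example, and the model shape (80 layers, 64 heads of dimension 128) is a typical 70B-class configuration used for illustration.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   prompt_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes of K and V tensors for one sequence across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * prompt_len * bytes_per_elem


def transfer_ms(n_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time in milliseconds, ignoring protocol overhead."""
    return n_bytes / (bandwidth_gb_per_s * 1e9) * 1000


# 80 layers, d_model = 8192 = 64 heads x 128 dims, FP16, 4096-token prompt.
mha = kv_cache_bytes(80, 64, 128, 4096)   # every head keeps its own K/V
gqa = kv_cache_bytes(80, 8, 128, 4096)    # 8 shared KV heads (grouped-query attention)
for name, size in [("MHA", mha), ("GQA", gqa)]:
    print(f"{name}: {size / 1e9:.2f} GB, {transfer_ms(size, 200):.1f} ms at 200 GB/s")
```

The attention layout matters as much as the link speed: sharing KV heads cuts the payload, and therefore the transfer time, by the ratio of attention heads to KV heads.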

Transfer methods (传输方式):

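NVLink: within a node, ~600 GB/s on A100-class GPUs (about 900 GB/s on H100), negligible latency.
RDMA over InfiniBand: across nodes, ~200–400 Gb/s per port (25–50 GB/s); nodes with several NICs aggregate multiple ports.
TCP/IP: fallback, much slower, not recommended.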
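
To compare the transfer methods on equal footing, the sketch below computes the ideal time to move one payload over each link. The bandwidths are nominal, assumed figures (a single 400 Gb/s InfiniBand port is 50 GB/s; the TCP entry assumes 100 GbE), and the payload size is a hypothetical example; real throughput is lower than these peaks.

```python
# Time to move a KV cache over different links.
# Bandwidths are nominal, assumed figures; real throughput is lower.
LINKS_GB_PER_S = {
    "NVLink (intra-node)": 600.0,
    "RDMA, one 400 Gb/s port": 50.0,
    "TCP over 100 GbE (assumed)": 12.5,
}


def transfer_times_ms(kv_bytes: float) -> dict:
    """Ideal transfer time per link in milliseconds."""
    return {name: kv_bytes / (bw * 1e9) * 1000 for name, bw in LINKS_GB_PER_S.items()}


kv_bytes = 1.34e9   # example payload of about 1.34 GB (hypothetical)
for name, ms in transfer_times_ms(kv_bytes).items():
    print(f"{name:28s} {ms:7.1f} ms")
```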

3. Runnable Example (可运行示例)

```python
# Simulates PD-disaggregated scheduling with KV transfer cost.
# No external dependencies required.

import threading
import time
from queue import Queue

PREFILL_TIME_PER_TOKEN = 0.0005   # seconds per token (compute-bound)
DECODE_TIME_PER_TOKEN  = 0.020    # seconds per token (memory-bound)
KV_TRANSFER_GBPS       = 200      # simulated interconnect bandwidth (GB/s)
# 2 (K+V) × 80 layers × d_model 8192 × 2 bytes (FP16), full multi-head attention
BYTES_PER_KV_TOKEN     = 2 * 80 * 8192 * 2


class Request:
    def __init__(self, req_id: str, prompt_len: int, max_new_tokens: int):
        self.req_id = req_id
        self.prompt_len = prompt_len
        self.max_new_tokens = max_new_tokens


def kv_transfer_seconds(prompt_len: int) -> float:
    """Time to move one request's KV cache at the simulated bandwidth."""
    return BYTES_PER_KV_TOKEN * prompt_len / (KV_TRANSFER_GBPS * 1e9)


def prefill_worker(req: Request, kv_queue: Queue):
    """Prefill pool: process prompt, produce KV cache."""
    t0 = time.perf_counter()
    time.sleep(req.prompt_len * PREFILL_TIME_PER_TOKEN)   # simulate compute
    prefill_ms = (time.perf_counter() - t0) * 1000

    # Simulate KV cache transfer
    kv_bytes = BYTES_PER_KV_TOKEN * req.prompt_len
    transfer_s = kv_transfer_seconds(req.prompt_len)
    time.sleep(transfer_s)

    print(f"[Prefill→Transfer] {req.req_id}: "
          f"prefill={prefill_ms:.1f}ms  transfer={transfer_s * 1000:.1f}ms  "
          f"KV={kv_bytes / 1e6:.1f}MB")
    kv_queue.put(req)    # hand off to decode pool


def decode_worker(kv_queue: Queue):
    """Decode pool: consume KV cache, generate tokens."""
    while True:
        req = kv_queue.get()
        if req is None:          # sentinel: no more work
            break
        t0 = time.perf_counter()
        time.sleep(req.max_new_tokens * DECODE_TIME_PER_TOKEN)
        decode_ms = (time.perf_counter() - t0) * 1000
        # Estimated TTFT = prefill + KV transfer + first decode step
        # (queueing delay in the decode pool is not included).
        ttft = (req.prompt_len * PREFILL_TIME_PER_TOKEN
                + kv_transfer_seconds(req.prompt_len)
                + DECODE_TIME_PER_TOKEN) * 1000
        print(f"[Decode Done]     {req.req_id}: "
              f"decode={decode_ms:.1f}ms  est.TTFT={ttft:.1f}ms")
        kv_queue.task_done()


if __name__ == "__main__":
    requests = [
        Request("R1", prompt_len=512,  max_new_tokens=50),
        Request("R2", prompt_len=2048, max_new_tokens=20),
        Request("R3", prompt_len=256,  max_new_tokens=100),
    ]

    kv_queue: Queue = Queue()

    # Start decode pool (always listening)
    decoder = threading.Thread(target=decode_worker, args=(kv_queue,), daemon=True)
    decoder.start()

    # Prefill pool: one thread per request, so all prefills run concurrently
    threads = [threading.Thread(target=prefill_worker, args=(r, kv_queue))
               for r in requests]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    kv_queue.join()      # wait until every handed-off request is decoded
    kv_queue.put(None)   # signal decoder to exit
    decoder.join()
```

4. Benefits and Trade-offs (优缺点)

| Aspect | Coupled (耦合) | Disaggregated (分离) |
| --- | --- | --- |
| TTFT | Higher under load (prefill competes with decode for the GPU) | Lower queueing delay (dedicated prefill GPUs), plus KV transfer time |
| ITL | Spiky under long prompts | Stable (no prefill interference) |
| Independent scaling (独立扩缩容) | No | Yes, scale each pool by workload |
| KV transfer overhead | None | Adds latency on long prompts |
| Hardware cost | Lower | Higher (more GPUs) |

5. Key Formula: Transfer Latency (传输延迟)

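For a 70B-class model with full multi-head attention (80 layers, $d_{\text{model}} = 8192$, FP16) and a 4096-token prompt over a 200 GB/s link:

$T_{\text{transfer}} \approx \frac{10.7\,\text{GB}}{200\,\text{GB/s}} \approx 54\,\text{ms}$

This 54 ms is added directly to TTFT, which is the central cost of PD disaggregation. Models with grouped-query attention, such as LLaMA-3 70B with 8 KV heads, keep about 8× less KV cache (about 1.3 GB here), which shrinks the transfer to roughly 7 ms at the same bandwidth.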

Related concepts:

Chunked Prefill (分块预填充): an alternative to PD disaggregation that interleaves prefill chunks with decode steps on the same GPU; lower cost but less isolation.
Continuous Batching (连续批处理): iteration-level scheduling; used within each pool in PD disaggregation.
KV Cache Migration (KV缓存迁移): the core engineering problem of moving large tensors across GPUs with minimal TTFT penalty.
Mooncake / Splitwise / DistServe: systems from industry and research that implement PD disaggregation.
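
The chunked-prefill alternative can be sketched with a toy single-GPU model: the prompt is split into fixed-size chunks and one decode step runs between chunks. The timings and helper names below are assumptions for illustration; real schedulers fuse chunks and decode tokens into the same batch rather than strictly alternating.

```python
# Toy model of chunked prefill on one GPU: the prompt is processed in fixed-size
# chunks, and one decode step runs between chunks. Timings are assumptions.
DECODE_STEP_MS = 20.0
PREFILL_MS_PER_TOKEN = 0.5


def chunk_sizes(prompt_len: int, chunk: int) -> list:
    """Split a prompt into chunks of at most `chunk` tokens."""
    return [min(chunk, prompt_len - start) for start in range(0, prompt_len, chunk)]


def worst_decode_gap_ms(prompt_len: int, chunk: int) -> float:
    """Longest wait between two tokens of a concurrent decode stream."""
    largest = max(chunk_sizes(prompt_len, chunk))
    return DECODE_STEP_MS + largest * PREFILL_MS_PER_TOKEN


def prefill_finish_ms(prompt_len: int, chunk: int) -> float:
    """Time until the prompt is fully prefilled, including interleaved decode steps."""
    n_chunks = len(chunk_sizes(prompt_len, chunk))
    return prompt_len * PREFILL_MS_PER_TOKEN + (n_chunks - 1) * DECODE_STEP_MS


for chunk in (2048, 512, 128):
    print(f"chunk={chunk:5d}  worst ITL={worst_decode_gap_ms(2048, chunk):7.1f} ms  "
          f"prefill done at {prefill_finish_ms(2048, chunk):7.1f} ms")
```

Smaller chunks bound the decode stall but push out the prompt's own completion time, which is the "less isolation" trade-off compared with separate pools.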