I. Chunked Prefill (分块预填充)#

1. Background#

In Large Language Model (大语言模型) inference, processing a request has two phases:

Prefill (预填充): The model processes all input prompt tokens in parallel, building the KV Cache (键值缓存). This is compute-bound (计算密集型).
Decode (解码): The model generates one output token per iteration, autoregressively. This is memory-bandwidth-bound (内存带宽密集型).

Without chunked prefill, a single long-prompt prefill request can monopolize the GPU for many milliseconds, blocking decode iterations for other requests and inflating the Time to First Token (TTFT, 首个令牌时间) and Inter-Token Latency (ITL, 令牌间延迟) for concurrent users.

It improves GPU utilization, reduces latency, increases throughput, and ensures fair scheduling across requests.

2. What Is Chunked Prefill?#

Chunked Prefill splits a long prefill sequence into fixed-size pieces called chunks, In the same GPU iteration, it processes: - decode tokens from some requests - prefill chunks from other requests.

Key Insight: Instead of “finish all prefill, then decode,” we interleave them so the GPU is always doing useful, mixed work.

1) Core Idea#

\text{Iteration}_t = \underbrace{\text{Prefill Chunk}_k}_{\text{partial prompt}} + \underbrace{\text{Decode Tokens}_{r_1, r_2, \ldots}}_{\text{running requests}}

Each iteration processes:

A chunk of C tokens from one (or more) prefilling requests.
All current decode tokens from already-started requests.

2) Chunk Size (块大小)#

chunk_size (e.g., 512 or 1024 tokens) is a tunable hyperparameter (超参数):

Chunk size	Effect
Smaller	Lower TTFT jitter, better fairness, more scheduling overhead
Larger	Higher GPU utilization, less overhead, longer decode stalls

3. Algorithm#

1) Scheduler Logic (调度器逻辑)#

1
# Pseudocode — runs once per GPU iteration
2
def schedule_iteration(prefill_queue, decode_queue, chunk_size):
3
    batch = []
4

5
    # Step 1: Add decode tokens for all running requests
6
    for req in decode_queue:
7
        batch.append(DecodeToken(req))          # 1 token per running request
8

9
    # Step 2: Fill remaining compute budget with one prefill chunk
10
    if prefill_queue:
11
        req = prefill_queue[0]
12
        start = req.processed_tokens            # where we left off
13
        end   = min(start + chunk_size, len(req.prompt))
14
        batch.append(PrefillChunk(req, start, end))
15
        req.processed_tokens = end
16
        if req.processed_tokens == len(req.prompt):
17
            prefill_queue.pop(0)                # prefill done → move to decode_queue
18

19
    return batch

2) KV Cache Construction (键值缓存构建)#

Because prefill is chunked, the KV cache is filled incrementally:

\text{KV}[0:n] = \text{KV}[0:C] \;\|\; \text{KV}[C:2C] \;\|\; \cdots \;\|\; \text{KV}[\lfloor n/C \rfloor \cdot C : n]

Each chunk appends its computed key-value pairs to the existing cache. This is correct because attention (注意力机制) only requires the cache to be causally complete — past tokens never need updating.

4. Runnable Example#

The following standalone example simulates chunked prefill scheduling:

1
# Simulates chunked prefill scheduling on a toy model.
2
# Requires: pip install numpy
3

4
import numpy as np
5

6
CHUNK_SIZE = 4          # tokens processed per prefill chunk per iteration
7
VOCAB = 32              # toy vocabulary size
8
D_MODEL = 8             # tiny embedding dimension
9

10
class Request:
11
    """A single inference request (推理请求)."""
12
    def __init__(self, req_id: str, prompt_tokens: list[int]):
13
        self.req_id = req_id
14
        self.prompt = prompt_tokens
15
        self.processed = 0          # how many prompt tokens have been prefilled
16
        self.kv_cache: list = []    # accumulated KV pairs (simulated as ints)
17
        self.output_tokens: list[int] = []
18
        self.is_decoding = False
19

20
    def is_prefill_done(self) -> bool:
21
        return self.processed >= len(self.prompt)
22

23
def fake_attention(tokens: list[int], kv_cache: list) -> list:
24
    """Simulates attention output — returns dummy KV entries."""
25
    return [t * 2 for t in tokens]   # fake KV = token_id * 2
26

27
def fake_decode(kv_cache: list, rng: np.random.Generator) -> int:
28
    """Samples a token given the completed KV cache."""
29
    return int(rng.integers(0, VOCAB))
30

31
def run_chunked_prefill(requests: list[Request], max_new_tokens: int = 5):
32
    rng = np.random.default_rng(42)
33
    prefill_queue = list(requests)
34
    decode_queue: list[Request] = []
35
    iteration = 0
36

37
    while prefill_queue or decode_queue:
38
        iteration += 1
39
        print(f"\n--- Iteration {iteration} ---")
40

41
        # ── Decode step (解码步骤): one token per running request ──────────
42
        for req in list(decode_queue):
43
            tok = fake_decode(req.kv_cache, rng)
44
            req.output_tokens.append(tok)
45
            print(f"  [Decode] {req.req_id}: generated token {tok}")
46
            if len(req.output_tokens) >= max_new_tokens:
47
                decode_queue.remove(req)
48
                print(f"  [Done]   {req.req_id} finished.")
49

50
        # ── Prefill chunk (预填充分块): one chunk from the head of the queue ─
51
        if prefill_queue:
52
            req = prefill_queue[0]
53
            start = req.processed
54
            end   = min(start + CHUNK_SIZE, len(req.prompt))
55
            chunk = req.prompt[start:end]
56
            kv    = fake_attention(chunk, req.kv_cache)
57
            req.kv_cache.extend(kv)
58
            req.processed = end
59
            print(f"  [Prefill] {req.req_id}: tokens [{start}:{end}] → KV len={len(req.kv_cache)}")
60

61
            if req.is_prefill_done():
62
                prefill_queue.pop(0)
63
                decode_queue.append(req)
64
                print(f"  [Ready]  {req.req_id}: prefill complete → decode queue")
65

66
if __name__ == "__main__":
67
    reqs = [
68
        Request("R1", list(range(10))),   # 10-token prompt → needs 3 chunks
69
        Request("R2", list(range(6))),    # 6-token prompt  → needs 2 chunks
70
    ]
71
    run_chunked_prefill(reqs, max_new_tokens=3)

Expected output (abridged):

1
--- Iteration 1 ---
2
  [Prefill] R1: tokens [0:4] → KV len=4
3

4
--- Iteration 2 ---
5
  [Prefill] R1: tokens [4:8] → KV len=8
6

7
--- Iteration 3 ---
8
  [Prefill] R1: tokens [8:10] → KV len=12
9
  [Ready]  R1: prefill complete → decode queue
10

11
--- Iteration 4 ---
12
  [Decode] R1: generated token 17
13
  [Prefill] R2: tokens [0:4] → KV len=4
14
...

5. Benefits and Trade-offs (优缺点)#

Aspect	Without Chunked Prefill	With Chunked Prefill
TTFT (首个令牌时间)	Unpredictable, spikes on long prompts	More predictable, bounded by chunk size
Decode stalls (解码停顿)	Long stalls during large prefills	Eliminated — decode runs every iteration
GPU utilization (GPU利用率)	Suboptimal during pure decode	Higher — compute + memory-BW co-utilized
Implementation complexity	Simple	Moderate (state tracking per request)

6. Key Formula — Prefill Iterations per Request#

N_{\text{iters}} = \left\lceil \frac{L}{C} \right\rceil

Where $L$ is the prompt length (提示长度) and $C$ is the chunk size (块大小). The request enters the decode queue after $N_{\text{iters}}$ prefill iterations.

PagedAttention (分页注意力) — manages KV cache memory in pages; works naturally with chunked prefill.
Continuous Batching (连续批处理) — iterates dynamically; chunked prefill is one scheduling strategy on top.
Speculative Decoding (推测解码) — orthogonal technique to speed up the decode phase.
TTFT / ITL / TBT — latency metrics affected by chunked prefill scheduling.