I. Continuous Batching (连续批处理)

1. Background

Traditional LLM serving uses static batching (静态批处理): a fixed group of requests is loaded together, the GPU runs until every request in the batch finishes, then the next batch starts. Two problems arise:

Padding waste (填充浪费): short requests must be padded to match the longest request in the batch, wasting compute.
Head-of-line blocking (队头阻塞): new requests wait for the entire current batch to complete, even if most slots are already idle.

Continuous Batching (also called iteration-level scheduling, 迭代级调度) solves both by scheduling at the granularity of a single iteration rather than a whole batch.

2. Core Idea

As soon as one request finishes, its GPU slot is freed and a new request is admitted at the very next iteration, without waiting for the rest of the batch.

The scheduler runs once per forward pass (前向传播). It looks at:

Which running requests still need decode steps.
Whether any free capacity exists to admit a new prefill request.

This means the batch composition changes every iteration, hence “continuous.”

3. Algorithm

1) Iteration-Level Scheduler (迭代级调度器)

```python
# Standalone simulation of continuous batching.
# No external dependencies required.

from collections import deque


class Request:
    """One inference request (推理请求)."""

    def __init__(self, req_id: str, prompt_len: int, max_new_tokens: int):
        self.req_id = req_id
        self.prompt_len = prompt_len
        self.max_new_tokens = max_new_tokens
        self.generated = 0          # tokens generated so far
        self.prefill_done = False

    def is_done(self) -> bool:
        return self.prefill_done and self.generated >= self.max_new_tokens

    def step(self):
        """Simulate one forward step: prefill on the first call, then one decoded token per call."""
        if not self.prefill_done:
            self.prefill_done = True    # single-step prefill (simplified)
        else:
            self.generated += 1


def run_continuous_batching(
    all_requests: list[Request],
    max_batch_size: int = 3,
    max_iterations: int = 20,
):
    waiting_queue = deque(all_requests)   # requests not yet admitted
    running: list[Request] = []           # currently active requests
    finished: list[Request] = []

    for iteration in range(1, max_iterations + 1):
        # ── 1. Evict finished requests (移除完成的请求) ──────────────────
        done = [r for r in running if r.is_done()]
        for r in done:
            running.remove(r)
            finished.append(r)
            print(f"  [Done]    {r.req_id} finished at iteration {iteration}")

        # ── 2. Admit new requests to fill freed slots (补充新请求) ────────
        while waiting_queue and len(running) < max_batch_size:
            new_req = waiting_queue.popleft()
            running.append(new_req)
            print(f"  [Admit]   {new_req.req_id} admitted at iteration {iteration}")

        if not running:
            print(f"Iteration {iteration}: all done.")
            break

        # ── 3. Step every running request (执行一步) ──────────────────────
        print(f"\n--- Iteration {iteration} | batch={[r.req_id for r in running]} ---")
        for r in running:
            r.step()

    print("\n=== Summary ===")
    for r in finished:
        print(f"  {r.req_id}: prompt={r.prompt_len}, generated={r.generated}")


if __name__ == "__main__":
    requests = [
        Request("R1", prompt_len=10, max_new_tokens=2),
        Request("R2", prompt_len=8,  max_new_tokens=5),
        Request("R3", prompt_len=12, max_new_tokens=3),
        Request("R4", prompt_len=6,  max_new_tokens=2),
        Request("R5", prompt_len=9,  max_new_tokens=4),
    ]
    run_continuous_batching(requests, max_batch_size=3)
```

Expected output (abridged):

```
  [Admit]   R1 admitted at iteration 1
  ...
--- Iteration 1 | batch=['R1', 'R2', 'R3'] ---
  ...
--- Iteration 3 | batch=['R1', 'R2', 'R3'] ---
  [Done]    R1 finished at iteration 4
  [Admit]   R4 admitted at iteration 4    ← slot freed, new request admitted immediately
--- Iteration 4 | batch=['R2', 'R3', 'R4'] ---
...
```
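
To make the benefit concrete, the sketch below counts forward passes for the same five requests under both policies. It assumes the same simplified cost model as the simulation (one iteration for prefill, one per generated token); the function names `static_iterations` and `continuous_iterations` are illustrative and do not come from any serving framework.

```python
def static_iterations(max_new_tokens: list[int], batch_size: int) -> int:
    """Iterations under static batching: each batch runs until its longest request ends."""
    total = 0
    for start in range(0, len(max_new_tokens), batch_size):
        batch = max_new_tokens[start:start + batch_size]
        total += 1 + max(batch)  # 1 prefill iteration + decode steps of the straggler
    return total


def continuous_iterations(max_new_tokens: list[int], batch_size: int) -> int:
    """Iterations under continuous batching: freed slots are refilled every iteration."""
    waiting = list(max_new_tokens)
    running: list[int] = []  # remaining iterations per active request
    iterations = 0
    while waiting or running:
        # Admit waiting requests into free slots (FIFO).
        while waiting and len(running) < batch_size:
            running.append(1 + waiting.pop(0))  # prefill + decode steps
        # One forward pass: every active request advances by one step.
        running = [left - 1 for left in running]
        running = [left for left in running if left > 0]
        iterations += 1
    return iterations


workload = [2, 5, 3, 2, 4]  # max_new_tokens of R1..R5 from the example
print(static_iterations(workload, batch_size=3))      # 11
print(continuous_iterations(workload, batch_size=3))  # 9
```

Under this toy model the saving comes entirely from R4 and R5 starting while R2 is still decoding, instead of waiting for the first batch to drain.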

2) Key Invariant (关键不变式)

At every iteration $t$:

$|\text{running}_t| \leq B_{\max}$

where $B_{\max}$ is the maximum batch size (最大批大小), constrained by GPU memory (显存) available for KV Cache (键值缓存).
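
As a rough illustration of how KV cache memory bounds $B_{\max}$, the sketch below uses the usual per-token KV size estimate (a key and a value vector per layer and per KV head). The model dimensions and the 20 GiB budget are hypothetical values chosen for round numbers, and the bound is a worst case that reserves the full sequence length for every request.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer and head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_batch_size(kv_budget_bytes: int, max_seq_len: int, per_token_bytes: int) -> int:
    """Worst-case bound: every request is assumed to reach max_seq_len tokens."""
    return kv_budget_bytes // (max_seq_len * per_token_bytes)


# Hypothetical model: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes per element).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)  # 524288 bytes = 0.5 MiB per token

# Hypothetical budget: 20 GiB left for KV cache, sequences capped at 2048 tokens.
print(max_batch_size(20 * 1024**3, max_seq_len=2048, per_token_bytes=per_token))  # 20
```

Because most requests stop well before the length cap, a scheduler that allocates cache as sequences grow can usually keep more requests running than this worst-case figure suggests.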

4. No Padding Needed

In static batching, all sequences in a batch are packed into one rectangular tensor, so shorter sequences must be filled up to the longest length with padding tokens (填充令牌):

```
Static:  [A A A A _ _]   ← _ = wasted padding
         [B B _ _ _ _]
         [C C C C C C]
```

In continuous batching with PagedAttention (分页注意力), each request owns its own KV cache pages. The forward pass uses variable-length attention — no padding:

```
Continuous:  [A A A A]   ← exact length, no padding
             [B B]
             [C C C C C C]
```
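
The difference can be quantified with a small sketch. It packs the three sequences from the diagrams into one flat token list with cumulative offsets, a layout commonly used by variable-length attention kernels, and compares the slot count with the padded layout. The helper names `padded_slots` and `pack` are illustrative; no real attention kernel is invoked.

```python
from itertools import accumulate


def padded_slots(seq_lens: list[int]) -> int:
    """Token slots in a rectangular [batch, max_len] tensor."""
    return len(seq_lens) * max(seq_lens)


def pack(sequences: list[list[str]]) -> tuple[list[str], list[int]]:
    """Concatenate sequences and return (flat_tokens, cumulative_offsets)."""
    flat = [tok for seq in sequences for tok in seq]
    offsets = [0, *accumulate(len(seq) for seq in sequences)]
    return flat, offsets


sequences = [["A"] * 4, ["B"] * 2, ["C"] * 6]
flat, offsets = pack(sequences)

print(offsets)                                    # [0, 4, 6, 12]
print(padded_slots([len(s) for s in sequences]))  # 18 slots, 6 of them padding
print(len(flat))                                  # 12 slots, none wasted
# Sequence i occupies flat[offsets[i]:offsets[i + 1]].
print(flat[offsets[1]:offsets[2]])                # ['B', 'B']
```

For this batch, padding accounts for 6 of 18 slots, so a third of the padded tensor carries no useful tokens.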

5. Comparison Table

Property	Static Batching (静态批处理)	Continuous Batching (连续批处理)
Scheduling granularity (调度粒度)	Per-batch	Per-iteration
Padding (填充)	Required	Not needed (with variable-length attention)
GPU idle time (GPU空闲时间)	High (finished slots wait for stragglers)	Low (freed slots are refilled each iteration)
Latency for new requests (新请求延迟)	Full batch wait	Next iteration once a slot is free
Implementation complexity (实现复杂度)	Simple	Moderate
Typical throughput gain (吞吐量提升)	Baseline	Often 2–5× higher (workload-dependent)

6. Key Metrics (关键指标)

$\text{Throughput (吞吐量)} = \frac{\text{Total output tokens}}{\text{Wall-clock time}}$

$\text{TTFT (首个令牌时间)} = t_{\text{first token}} - t_{\text{request arrival}}$

$\text{ITL (令牌间延迟)} = \frac{t_{\text{last token}} - t_{\text{first token}}}{\text{output tokens} - 1}$

Continuous batching primarily improves throughput and reduces queuing latency (排队延迟) for newly arriving requests.
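
The three definitions translate directly into code. The sketch below computes them from per-token timestamps; the `RequestTrace` record and the sample timestamps are made up for illustration, and throughput is measured from the earliest arrival to the last emitted token.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical per-request log: arrival time and emission time of each output token."""
    arrival: float
    token_times: list[float]


def ttft(trace: RequestTrace) -> float:
    """Time to first token."""
    return trace.token_times[0] - trace.arrival


def itl(trace: RequestTrace) -> float:
    """Mean gap between consecutive output tokens (needs at least two tokens)."""
    n = len(trace.token_times)
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def throughput(traces: list[RequestTrace]) -> float:
    """Output tokens per second over the wall-clock span of all traces."""
    start = min(t.arrival for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return sum(len(t.token_times) for t in traces) / (end - start)


traces = [
    RequestTrace(arrival=0.0, token_times=[0.5, 1.0, 1.5]),
    RequestTrace(arrival=1.0, token_times=[1.5, 2.0, 2.5, 3.0, 3.5]),
]
print(ttft(traces[0]))     # 0.5
print(itl(traces[1]))      # 0.5
print(throughput(traces))  # 8 tokens over 3.5 s, about 2.29 tokens/s
```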

Related concepts:

Chunked Prefill (分块预填充): splits a long prefill into chunks so it can share iterations with ongoing decodes; a scheduling strategy that sits on top of continuous batching.
PagedAttention (分页注意力): stores each request's KV cache in fixed-size pages, so requests of different lengths share memory with little fragmentation; a key enabler of memory-efficient continuous batching.
Preemption (抢占): when the KV cache is full, the scheduler may evict a low-priority request, free its cache, and recompute it later.
vLLM: the open-source serving framework that popularized continuous batching combined with PagedAttention.
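
Preemption by recomputation can be sketched in a few lines. This is a toy model under stated assumptions, not how any particular framework implements it: the cache budget is counted in tokens, only running requests hold cache, and the most recently admitted request is treated as the lowest priority. The function `decode_step` and the dict layout are hypothetical.

```python
from collections import deque


def decode_step(running: list[dict], waiting: deque, capacity: int) -> list[str]:
    """Advance every running request by one token, preempting if the cache would overflow.

    Each request is a dict {"id": str, "tokens": int}. Only running requests hold
    cache, so moving a request back to `waiting` models freeing its KV cache.
    """
    preempted = []
    # After this step each running request holds one more token.
    while running and sum(r["tokens"] + 1 for r in running) > capacity:
        victim = running.pop()        # assumed policy: newest admission goes first
        waiting.appendleft(victim)    # re-admitted later; its KV cache is recomputed
        preempted.append(victim["id"])
    for r in running:
        r["tokens"] += 1
    return preempted


running = [{"id": "R1", "tokens": 5}, {"id": "R2", "tokens": 4}, {"id": "R3", "tokens": 3}]
waiting = deque()
print(decode_step(running, waiting, capacity=14))  # ['R3']
print([(r["id"], r["tokens"]) for r in running])   # [('R1', 6), ('R2', 5)]
```

Placing the victim at the front of the waiting queue means it is the first to be re-admitted once capacity frees up, at the cost of repeating its prefill.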