I. Transformer: Complete Learning Handbook

Overview: The Transformer (变换器) is the architecture behind most modern large language models, including GPT, BERT, T5, and LLaMA. Introduced in "Attention Is All You Need" (Vaswani et al., 2017), it replaces recurrence with Self-Attention (自注意力机制). All positions in a sequence can then be processed in parallel during training, and any two positions are connected in a single attention step, so long-range dependencies no longer have to survive a long chain of recurrent updates. This handbook covers each component from first principles and ends with end-to-end training and inference code.

1. Architecture Overview (架构总览)

A standard Encoder-Decoder Transformer (编码器-解码器变换器) consists of:

Input Tokens
     ↓
[Token Embedding + Positional Encoding]
     ↓
┌─────────────────────────────────┐
│  Encoder (编码器)  × N layers    │
│  ┌──────────────────────────┐   │
│  │ Multi-Head Self-Attention│   │
│  │ Add & Norm               │   │
│  │ Feed-Forward Network     │   │
│  │ Add & Norm               │   │
│  └──────────────────────────┘   │
└─────────────────────────────────┘
     ↓  (encoder output = memory)
┌─────────────────────────────────┐
│  Decoder (解码器)  × N layers    │
│  ┌──────────────────────────┐   │
│  │ Masked Self-Attention    │   │
│  │ Add & Norm               │   │
│  │ Cross-Attention          │   │
│  │ Add & Norm               │   │
│  │ Feed-Forward Network     │   │
│  │ Add & Norm               │   │
│  └──────────────────────────┘   │
└─────────────────────────────────┘
     ↓
Linear + Softmax → Output Probabilities

2. Scaled Dot-Product Attention (缩放点积注意力)

1) The Formula

The core attention operation:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Q (Query, 查询): What is each token looking for?
K (Key, 键): How can this token be found by others?
V (Value, 值): What does this token actually offer?
$\sqrt{d_k}$ (scaling factor, 缩放因子): Prevents softmax saturation when $d_k$ is large
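
To make the formula concrete, the following is a minimal sketch in plain Python (lists instead of tensors, so it runs without PyTorch). The toy matrices `Q`, `K`, and `V` are made-up values chosen only for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for small list-of-lists matrices."""
    d_k = len(Q[0])
    weights = []
    for q in Q:
        # One scaled dot product per key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    # Each output row is a weighted average of the value rows
    output = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)]
        for w_row in weights
    ]
    return output, weights

# Toy values (illustrative only): 2 queries, 3 keys/values, d_k = d_v = 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, w = attention(Q, K, V)
```

Each row of `w` sums to 1, and the first query gives equal weight to the first and third keys because both have the same dot product with it.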

2) Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

def scaled_dot_product_attention(
    Q: torch.Tensor,   # (batch, heads, seq_q, d_k)
    K: torch.Tensor,   # (batch, heads, seq_k, d_k)
    V: torch.Tensor,   # (batch, heads, seq_k, d_v)
    mask: torch.Tensor = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    d_k = Q.size(-1)

    # Step 1: Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # scores shape: (batch, heads, seq_q, seq_k)

    # Step 2: Apply mask (positions where mask == 0 get -inf, so softmax → 0)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Step 3: Softmax over key dimension
    attn_weights = F.softmax(scores, dim=-1)   # (batch, heads, seq_q, seq_k)

    # Step 4: Weighted sum of values
    output = torch.matmul(attn_weights, V)     # (batch, heads, seq_q, d_v)

    return output, attn_weights

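Note: The scaling by $\sqrt{d_k}$ matters. If the components of Q and K have roughly unit variance, their dot product has variance of about $d_k$, so unscaled scores grow with $d_k$. Large scores push softmax into a saturated region where one weight is close to 1 and the gradients are extremely small (梯度消失), which slows or stalls training.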
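
The variance argument can be checked empirically. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealization rather than a property of trained models; the sample count and seed are arbitrary.

```python
import math
import random

def dot_product_variance(d_k, n_samples=2000, seed=0):
    """Empirical variance of q·k before and after dividing by sqrt(d_k)."""
    rng = random.Random(seed)
    raw, scaled = [], []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        s = sum(a * b for a, b in zip(q, k))
        raw.append(s)
        scaled.append(s / math.sqrt(d_k))

    def var(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)

    return var(raw), var(scaled)

raw_var, scaled_var = dot_product_variance(d_k=64)
# raw_var comes out close to d_k (64); scaled_var comes out close to 1
```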

3. Multi-Head Attention (多头注意力)

1) Motivation

A single attention head produces one set of attention weights per query, so it blends every kind of relationship into a single weighted average. Multi-Head Attention (多头注意力) runs $h$ heads in parallel, each with its own learned projections into a lower-dimensional subspace, so different heads can focus on different aspects (syntax, semantics, coreference, etc.). The head outputs are then concatenated and projected.

$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$

$\text{head}_i = \text{Attention}(QW_i^Q,\ KW_i^K,\ VW_i^V)$
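
The dimension bookkeeping behind these formulas can be sketched as follows, assuming bias-free projections and $d_k = d_v = d_{model} / h$. The helper name `multi_head_shapes` is hypothetical and used only for illustration.

```python
def multi_head_shapes(d_model: int, num_heads: int) -> dict:
    """Per-head dimension and projection parameter counts (no biases assumed)."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_k = d_model // num_heads
    return {
        "d_k": d_k,
        # Each head has its own W_i^Q, W_i^K, W_i^V of shape (d_model, d_k)
        "per_head_qkv_params": 3 * d_model * d_k,
        # Stacking all heads gives three (d_model, d_model) matrices
        "qkv_params": 3 * d_model * d_model,
        # W^O maps the concatenated heads back to d_model
        "output_proj_params": d_model * d_model,
        "total_params": 4 * d_model * d_model,
    }

base = multi_head_shapes(512, 8)
single = multi_head_shapes(512, 1)
```

Under these assumptions the total parameter count does not depend on the number of heads: splitting into more heads only changes how the same projection matrices are partitioned.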

2) Implementation

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads   # Dimension per head

        # Linear projections for Q, K, V, and output
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

        # Kept for API symmetry; this minimal forward pass does not apply
        # dropout to the attention weights.
        self.dropout = nn.Dropout(dropout)

    def split_heads(self, x: torch.Tensor) -> torch.Tensor:
        """(batch, seq, d_model) → (batch, heads, seq, d_k)"""
        batch, seq, _ = x.size()
        x = x.view(batch, seq, self.num_heads, self.d_k)
        return x.transpose(1, 2)   # (batch, heads, seq, d_k)

    def forward(
        self,
        query: torch.Tensor,    # (batch, seq_q, d_model)
        key: torch.Tensor,      # (batch, seq_k, d_model)
        value: torch.Tensor,    # (batch, seq_k, d_model)
        mask: torch.Tensor = None,
    ) -> torch.Tensor:
        # Project inputs to Q, K, V
        Q = self.split_heads(self.W_q(query))   # (batch, heads, seq_q, d_k)
        K = self.split_heads(self.W_k(key))     # (batch, heads, seq_k, d_k)
        V = self.split_heads(self.W_v(value))   # (batch, heads, seq_k, d_k)

        # Scaled dot-product attention
        attn_output, _ = scaled_dot_product_attention(Q, K, V, mask)
        # attn_output: (batch, heads, seq_q, d_k)

        # Concatenate heads
        batch, _, seq_q, _ = attn_output.size()
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.view(batch, seq_q, self.d_model)
        # attn_output: (batch, seq_q, d_model)

        # Final linear projection
        return self.W_o(attn_output)

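4. Position-wise Feed-Forward Network (逐位置前馈网络)

The feed-forward network is applied independently and identically to each position. It is a two-layer MLP (多层感知机) that expands to an inner dimension $d_{ff}$ (typically $4 \times d_{model}$) and projects back to $d_{model}$: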

$\text{FFN}(x) = \max(0,\ xW_1 + b_1) W_2 + b_2$

class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        # Standard expansion: d_ff = 4 * d_model
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        x = self.linear1(x)       # (batch, seq, d_ff)
        x = F.relu(x)             # ReLU activation (or GELU in modern variants)
        x = self.dropout(x)
        x = self.linear2(x)       # (batch, seq, d_model)
        return x
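
"Position-wise" means the same weights are applied to every position with no interaction between positions. The plain-Python sketch below illustrates this with made-up toy weights (`W1`, `b1`, `W2`, `b2` are illustrative values, with $d_{model} = 2$ and $d_{ff} = 4$): reversing the input sequence simply reverses the outputs.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(x, W, b):
    """x: (d_in,), W: list of d_in rows of length d_out, b: (d_out,). Returns x W + b."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position."""
    return linear(relu(linear(x, W1, b1)), W2, b2)

# Toy weights (illustrative only)
W1 = [[1.0, -1.0, 0.5, 0.0], [0.0, 1.0, -0.5, 2.0]]
b1 = [0.0, 0.1, 0.0, -1.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
b2 = [0.0, 0.0]

sequence = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
outputs = [ffn(x, W1, b1, W2, b2) for x in sequence]
reversed_outputs = [ffn(x, W1, b1, W2, b2) for x in reversed(sequence)]
```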

5. Positional Encoding (位置编码)

Self-attention has no built-in notion of token order, and the Transformer has no recurrence to supply one, so positional information must be injected explicitly. The original paper uses sinusoidal encoding (正弦编码):

$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$

$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_seq_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        # Build the positional encoding table once
        pe = torch.zeros(max_seq_len, d_model)                    # (max_len, d_model)
        position = torch.arange(0, max_seq_len).unsqueeze(1)      # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )

        pe[:, 0::2] = torch.sin(position * div_term)   # Even indices
        pe[:, 1::2] = torch.cos(position * div_term)   # Odd indices
        pe = pe.unsqueeze(0)                            # (1, max_len, d_model)

        # Register as buffer (not a parameter, so not updated during training)
        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        x = x + self.pe[:, :x.size(1)]   # Add positional encoding
        return self.dropout(x)
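
As a sanity check on the formulas, the sketch below computes the encoding for individual positions in plain Python and confirms that the `exp`/`log` form used for `div_term` in the class above is the same quantity as $1 / 10000^{2i/d_{model}}$. The helper name `positional_encoding` is hypothetical.

```python
import math

def positional_encoding(pos: int, d_model: int) -> list:
    """Sinusoidal encoding for one position, following the two formulas above."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        # i steps over even indices, so it plays the role of 2i in the formula
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)
        if i + 1 < d_model:
            pe[i + 1] = math.cos(angle)
    return pe

pe0 = positional_encoding(0, 8)   # position 0: sin(0) = 0 and cos(0) = 1 in every pair
pe1 = positional_encoding(1, 8)

# The two ways of writing the frequency term agree
i = 4
direct = 1.0 / (10000 ** (i / 8))
via_exp = math.exp(i * (-math.log(10000.0) / 8))
```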

Note: BERT, RoBERTa, and GPT-2 use learned positional embeddings (可学习位置嵌入) instead of fixed sinusoids. More recent models such as LLaMA and Mistral use RoPE (Rotary Position Embedding, 旋转位置编码), which rotates query and key vectors by a position-dependent angle so that attention scores depend on the relative offset between positions.
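
The relative-position property of RoPE can be illustrated with a deliberately simplified sketch: a single 2-D query/key pair and one rotation frequency. Real implementations rotate many 2-D pairs with different frequencies; the function names and the value `theta=0.1` here are assumptions made for illustration only.

```python
import math

def rotate(vec2, position, theta=0.1):
    """Rotate a 2-D vector by the angle position * theta."""
    x, y = vec2
    a = position * theta
    return (x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a))

def rope_score(q, k, q_pos, k_pos, theta=0.1):
    """Dot product of a rotated query with a rotated key."""
    rq = rotate(q, q_pos, theta)
    rk = rotate(k, k_pos, theta)
    return rq[0] * rk[0] + rq[1] * rk[1]

q, k = (1.0, 2.0), (0.5, -1.0)
score_a = rope_score(q, k, q_pos=3, k_pos=1)    # offset 2
score_b = rope_score(q, k, q_pos=10, k_pos=8)   # same offset, shifted by 7
score_c = rope_score(q, k, q_pos=3, k_pos=2)    # offset 1
```

`score_a` and `score_b` agree because only the difference between the two positions survives in the dot product, while `score_c` differs because the offset changed.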

6. Add & Norm: Residual Connection + Layer Normalization

Each sub-layer is wrapped with a residual connection (残差连接) and Layer Normalization (层归一化):

$\text{LayerNorm}(x + \text{Sublayer}(x))$

class AddNorm(nn.Module):
    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, sublayer_output: torch.Tensor) -> torch.Tensor:
        # Post-norm (original paper): LayerNorm(x + Dropout(Sublayer(x))).
        # A pre-norm variant would instead compute x + Dropout(Sublayer(LayerNorm(x))),
        # which means normalizing before the sublayer is called.
        return self.norm(x + self.dropout(sublayer_output))

Note: The original paper uses Post-LN (后归一化), which normalizes after adding the residual. Later models such as GPT-2 and LLaMA use Pre-LN (前归一化), which normalizes the input to each sublayer and leaves the residual path un-normalized. Pre-LN tends to train more stably in deep stacks and is now the more common choice.
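
The difference between the two orderings can be sketched in plain Python. The layer norm here omits the learnable gain and bias, and `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or FFN: scale and shift."""
    return [2.0 * v + 1.0 for v in x]

def post_ln_block(x):
    # Original paper: LayerNorm(x + Sublayer(x))
    return layer_norm([a + b for a, b in zip(x, toy_sublayer(x))])

def pre_ln_block(x):
    # Pre-LN: x + Sublayer(LayerNorm(x)); the residual path is not normalized
    return [a + b for a, b in zip(x, toy_sublayer(layer_norm(x)))]

x = [1.0, 2.0, 4.0, 9.0]
post = post_ln_block(x)
pre = pre_ln_block(x)
```

In the post-norm block the output is always re-normalized, so its mean is zero. In the pre-norm block the input `x` passes through unchanged on the residual path, so the output keeps the scale of `x` plus the sublayer's contribution.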

7. Encoder Layer (编码器层)

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.ff        = FeedForward(d_model, d_ff, dropout)
        self.norm1     = nn.LayerNorm(d_model)
        self.norm2     = nn.LayerNorm(d_model)
        self.dropout   = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, src_mask: torch.Tensor = None) -> torch.Tensor:
        # Self-attention + residual + norm
        attn_out = self.self_attn(x, x, x, src_mask)
        x = self.norm1(x + self.dropout(attn_out))

        # Feed-forward + residual + norm
        ff_out = self.ff(x)
        x = self.norm2(x + self.dropout(ff_out))

        return x


class Encoder(nn.Module):
    def __init__(self, num_layers: int, d_model: int, num_heads: int, d_ff: int, dropout: float):
        super().__init__()
        self.layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, src_mask: torch.Tensor = None) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x, src_mask)
        return self.norm(x)

8. Decoder Layer (解码器层)

The decoder has three sub-layers: masked self-attention, cross-attention over encoder output, and feed-forward.

class DecoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn  = MultiHeadAttention(d_model, num_heads, dropout)  # Masked
        self.cross_attn = MultiHeadAttention(d_model, num_heads, dropout)  # Cross
        self.ff         = FeedForward(d_model, d_ff, dropout)
        self.norm1      = nn.LayerNorm(d_model)
        self.norm2      = nn.LayerNorm(d_model)
        self.norm3      = nn.LayerNorm(d_model)
        self.dropout    = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,           # Decoder input  (batch, tgt_seq, d_model)
        memory: torch.Tensor,      # Encoder output (batch, src_seq, d_model)
        tgt_mask: torch.Tensor = None,   # Causal mask for decoder self-attention
        src_mask: torch.Tensor = None,   # Padding mask for cross-attention
    ) -> torch.Tensor:
        # 1. Masked self-attention (prevents attending to future tokens)
        attn1 = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn1))

        # 2. Cross-attention: queries from the decoder, keys/values from encoder memory
        attn2 = self.cross_attn(x, memory, memory, src_mask)
        x = self.norm2(x + self.dropout(attn2))

        # 3. Feed-forward
        ff_out = self.ff(x)
        x = self.norm3(x + self.dropout(ff_out))

        return x


class Decoder(nn.Module):
    def __init__(self, num_layers: int, d_model: int, num_heads: int, d_ff: int, dropout: float):
        super().__init__()
        self.layers = nn.ModuleList([
            DecoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, memory, tgt_mask=None, src_mask=None):
        for layer in self.layers:
            x = layer(x, memory, tgt_mask, src_mask)
        return self.norm(x)

9. Masks (掩码)

1) Padding Mask (填充掩码)

Prevents attention over <PAD> tokens:

def make_pad_mask(seq: torch.Tensor, pad_idx: int = 0) -> torch.Tensor:
    """
    seq: (batch, seq_len), integer token IDs
    Returns: (batch, 1, 1, seq_len) boolean mask, True where NOT padding
    """
    return (seq != pad_idx).unsqueeze(1).unsqueeze(2)

2) Causal Mask / Look-ahead Mask (因果掩码)

Prevents decoder positions from attending to future positions:

def make_causal_mask(seq_len: int, device: torch.device) -> torch.Tensor:
    """
    Returns a boolean lower-triangular mask of shape (1, 1, seq_len, seq_len).
    Position i can attend to positions 0..i only.
    """
    # dtype=torch.bool so the result can be combined with the boolean padding mask via `&`
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device))
    return mask.unsqueeze(0).unsqueeze(0)   # (1, 1, seq_len, seq_len)
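
How the two masks combine for a decoder input can be sketched with nested lists in plain Python. The token IDs below are hypothetical (a BOS token, two ordinary tokens, and one PAD with index 0), and the helper names are chosen for illustration.

```python
def causal_mask(seq_len):
    """True where query position i may attend to key position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def pad_mask(token_ids, pad_idx=0):
    """True where the key position holds a real token."""
    return [t != pad_idx for t in token_ids]

def combined_mask(token_ids, pad_idx=0):
    """Logical AND of the causal mask and the (broadcast) padding mask."""
    keep = pad_mask(token_ids, pad_idx)
    causal = causal_mask(len(token_ids))
    n = len(token_ids)
    return [[causal[i][j] and keep[j] for j in range(n)] for i in range(n)]

tokens = [1, 7, 9, 0]   # hypothetical IDs: BOS, two tokens, PAD
mask = combined_mask(tokens)
# Row i lists which key positions query i may attend to:
# [[True, False, False, False],
#  [True, True,  False, False],
#  [True, True,  True,  False],
#  [True, True,  True,  False]]
```

The padding mask removes a whole key column, while the causal mask removes everything above the diagonal.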

10. Complete Transformer Model (完整模型)

class Transformer(nn.Module):
    def __init__(
        self,
        src_vocab_size: int,
        tgt_vocab_size: int,
        d_model: int      = 512,
        num_heads: int    = 8,
        num_layers: int   = 6,
        d_ff: int         = 2048,
        max_seq_len: int  = 512,
        dropout: float    = 0.1,
        pad_idx: int      = 0,
    ):
        super().__init__()
        self.pad_idx = pad_idx
        self.d_model = d_model

        # Embeddings
        self.src_embedding = nn.Embedding(src_vocab_size, d_model, padding_idx=pad_idx)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model, padding_idx=pad_idx)
        self.pos_encoding  = PositionalEncoding(d_model, max_seq_len, dropout)

        # Encoder & Decoder
        self.encoder = Encoder(num_layers, d_model, num_heads, d_ff, dropout)
        self.decoder = Decoder(num_layers, d_model, num_heads, d_ff, dropout)

        # Output projection
        self.fc_out = nn.Linear(d_model, tgt_vocab_size)

        # Weight initialization
        self._init_weights()

    def _init_weights(self):
        # Xavier init for every weight matrix (biases and LayerNorm vectors are skipped)
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def encode(self, src: torch.Tensor, src_mask: torch.Tensor = None) -> torch.Tensor:
        # Embeddings are scaled by sqrt(d_model) before adding positions, as in the paper
        x = self.pos_encoding(self.src_embedding(src) * math.sqrt(self.d_model))
        return self.encoder(x, src_mask)

    def decode(
        self,
        tgt: torch.Tensor,
        memory: torch.Tensor,
        tgt_mask: torch.Tensor = None,
        src_mask: torch.Tensor = None,
    ) -> torch.Tensor:
        x = self.pos_encoding(self.tgt_embedding(tgt) * math.sqrt(self.d_model))
        return self.decoder(x, memory, tgt_mask, src_mask)

    def forward(
        self,
        src: torch.Tensor,   # (batch, src_len)
        tgt: torch.Tensor,   # (batch, tgt_len)
    ) -> torch.Tensor:
        # Build masks
        src_mask = make_pad_mask(src, self.pad_idx)
        tgt_pad_mask = make_pad_mask(tgt, self.pad_idx)
        tgt_causal   = make_causal_mask(tgt.size(1), tgt.device)
        # Both operands must be boolean for `&`; .bool() guards against a float causal mask.
        # Broadcasts (batch, 1, 1, tgt_len) & (1, 1, tgt_len, tgt_len) → (batch, 1, tgt_len, tgt_len)
        tgt_mask     = tgt_pad_mask & tgt_causal.bool()

        # Forward pass
        memory = self.encode(src, src_mask)
        output = self.decode(tgt, memory, tgt_mask, src_mask)

        # Project to vocabulary
        return self.fc_out(output)   # (batch, tgt_len, tgt_vocab_size)

11. Training (训练)

1) Hyperparameters & Setup

import torch
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

# ---- Hyperparameters ----
SRC_VOCAB  = 8000
TGT_VOCAB  = 8000
D_MODEL    = 256
NUM_HEADS  = 8
NUM_LAYERS = 4
D_FF       = 1024
MAX_LEN    = 128
DROPOUT    = 0.1
PAD_IDX    = 0
BOS_IDX    = 1
EOS_IDX    = 2
BATCH_SIZE = 32
EPOCHS     = 20
LR         = 1e-4
DEVICE     = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ---- Model ----
model = Transformer(
    src_vocab_size=SRC_VOCAB,
    tgt_vocab_size=TGT_VOCAB,
    d_model=D_MODEL,
    num_heads=NUM_HEADS,
    num_layers=NUM_LAYERS,
    d_ff=D_FF,
    max_seq_len=MAX_LEN,
    dropout=DROPOUT,
    pad_idx=PAD_IDX,
).to(DEVICE)

print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
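
As a cross-check on the printed parameter count, the total can be estimated by hand. The sketch below assumes the layer definitions given in this handbook: bias-free attention projections, feed-forward and output layers with biases, LayerNorm with gain and bias, and a final LayerNorm at the end of both stacks. The function name is hypothetical.

```python
def transformer_param_count(src_vocab, tgt_vocab, d_model, num_layers, d_ff):
    """Estimated trainable parameters for the encoder-decoder model described here."""
    embeddings = (src_vocab + tgt_vocab) * d_model
    attention = 4 * d_model * d_model                    # W_q, W_k, W_v, W_o (no bias)
    feed_forward = 2 * d_model * d_ff + d_ff + d_model   # two Linear layers with bias
    layer_norm = 2 * d_model                             # gain and bias
    # Encoder layer: 1 attention + FFN + 2 norms; plus one final norm for the stack
    encoder = num_layers * (attention + feed_forward + 2 * layer_norm) + layer_norm
    # Decoder layer: 2 attentions + FFN + 3 norms; plus one final norm for the stack
    decoder = num_layers * (2 * attention + feed_forward + 3 * layer_norm) + layer_norm
    output_proj = d_model * tgt_vocab + tgt_vocab
    return embeddings + encoder + decoder + output_proj

total = transformer_param_count(8000, 8000, 256, 4, 1024)
print(f"{total:,}")   # 13,513,536
```

The sinusoidal table is a buffer rather than a parameter, so it does not contribute. With an 8,000-token vocabulary, the two embedding tables and the output projection account for a little under half of the total.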

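2) Learning Rate Scheduler: Warmup (学习率预热)

The original paper uses a custom schedule that increases the learning rate linearly for the first warmup steps and then decays it in proportion to the inverse square root of the step number:

$lr = d_{model}^{-0.5} \cdot \min(\text{step}^{-0.5},\ \text{step} \cdot \text{warmup}^{-1.5})$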
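
A few values of this schedule are worth computing before training. The sketch below evaluates the formula directly, using the handbook's `D_MODEL = 256` and a warmup of 4000 steps as defaults; the function name `noam_lr` is an informal label, not something defined elsewhere in this handbook.

```python
def noam_lr(step: int, d_model: int = 256, warmup: int = 4000) -> float:
    """Learning rate from the schedule above. step must be >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

peak = noam_lr(4000)        # the two branches of min() meet at step == warmup
halfway = noam_lr(2000)     # linear warmup: half the peak
later = noam_lr(16000)      # inverse-sqrt decay: 4x the steps gives half the peak
```

With these defaults the peak is about 9.9e-4. Note that a run of 20 epochs over 2,000 samples with batch size 32 takes only 63 × 20 = 1,260 optimizer steps, so under those settings training would end while still inside the warmup phase; a smaller warmup value may be more appropriate for short demonstration runs.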

class WarmupScheduler:
    def __init__(self, optimizer, d_model: int, warmup_steps: int = 4000):
        self.optimizer = optimizer
        self.d_model = d_model
        self.warmup_steps = warmup_steps
        self.step_num = 0

    def step(self):
        self.step_num += 1
        lr = self.d_model ** (-0.5) * min(
            self.step_num ** (-0.5),
            self.step_num * self.warmup_steps ** (-1.5)
        )
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr

# lr=0 is only a placeholder; the scheduler overwrites it on every step
optimizer  = optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9)
scheduler  = WarmupScheduler(optimizer, d_model=D_MODEL, warmup_steps=4000)
criterion  = nn.CrossEntropyLoss(ignore_index=PAD_IDX, label_smoothing=0.1)

Note: Label Smoothing (标签平滑) with label_smoothing=0.1 distributes 10% of the probability mass uniformly across all tokens instead of concentrating it on the correct token. This regularizes the model and prevents overconfidence.
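
The effect on the target distribution and the loss can be sketched in plain Python. This follows the mixture described in the note, (1 − ε) on the correct class plus ε spread uniformly over all classes, and is meant as an illustration rather than a reproduction of the library's internals. The logits are made-up values.

```python
import math

def smoothed_targets(num_classes, target, eps=0.1):
    """(1 - eps) on the correct class plus eps / num_classes on every class."""
    return [
        (1.0 - eps) * (1.0 if c == target else 0.0) + eps / num_classes
        for c in range(num_classes)
    ]

def cross_entropy(logits, target_dist):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    log_probs = [v - log_z for v in logits]
    return -sum(t * lp for t, lp in zip(target_dist, log_probs))

logits = [2.0, 0.5, -1.0, 0.0]   # toy logits; class 0 is the correct one
dist = smoothed_targets(4, target=0, eps=0.1)
hard = cross_entropy(logits, smoothed_targets(4, target=0, eps=0.0))
soft = cross_entropy(logits, dist)
```

With four classes the correct class receives 0.9 + 0.1/4 = 0.925 and each other class 0.025. The smoothed loss is higher than the hard-label loss here because some target mass now sits on classes the model considers unlikely, which is what discourages overconfident predictions.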

3) Dummy Dataset for Demonstration

class Seq2SeqDataset(Dataset):
    """
    Minimal demo dataset of random token IDs. Replace with real tokenized data.
    Each sample is (src_ids, tgt_ids). tgt_ids is wrapped as BOS ... EOS so that
    the teacher-forcing shift in the training loop lines up with real usage.
    """
    def __init__(self, size=1000, src_vocab=8000, tgt_vocab=8000,
                 src_len=20, tgt_len=22):
        bos = torch.tensor([BOS_IDX])
        eos = torch.tensor([EOS_IDX])
        self.data = [
            (
                # IDs start at 3 to avoid the PAD/BOS/EOS special tokens
                torch.randint(3, src_vocab, (src_len,)),
                # tgt_len counts the BOS and EOS tokens
                torch.cat([bos, torch.randint(3, tgt_vocab, (tgt_len - 2,)), eos]),
            )
            for _ in range(size)
        ]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


def collate_fn(batch):
    """Pad sequences in a batch to the same length."""
    src_batch, tgt_batch = zip(*batch)
    src_padded = torch.nn.utils.rnn.pad_sequence(src_batch, batch_first=True, padding_value=PAD_IDX)
    tgt_padded = torch.nn.utils.rnn.pad_sequence(tgt_batch, batch_first=True, padding_value=PAD_IDX)
    return src_padded, tgt_padded


train_dataset = Seq2SeqDataset(size=2000)
train_loader  = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                           shuffle=True, collate_fn=collate_fn)

4) Training Loop (训练循环)

def train_epoch(model, loader, optimizer, scheduler, criterion, device):
    model.train()
    total_loss = 0.0
    total_tokens = 0

    for batch_idx, (src, tgt) in enumerate(loader):
        src = src.to(device)         # (batch, src_len)
        tgt = tgt.to(device)         # (batch, tgt_len)

        # Teacher forcing (教师强制):
        #   Input  to decoder: tgt[:, :-1]  (all but last token)
        #   Target from model: tgt[:, 1:]   (all but first token = BOS)
        tgt_input  = tgt[:, :-1]
        tgt_target = tgt[:, 1:]

        # Forward pass
        logits = model(src, tgt_input)
        # logits: (batch, tgt_len-1, tgt_vocab_size)

        # Reshape for cross-entropy
        logits_flat  = logits.reshape(-1, logits.size(-1))  # (batch*(tgt-1), vocab)
        targets_flat = tgt_target.reshape(-1)               # (batch*(tgt-1),)

        loss = criterion(logits_flat, targets_flat)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Gradient clipping
        optimizer.step()
        scheduler.step()

        # Track metrics (loss is a per-token mean, so weight it by the token count)
        non_pad = (tgt_target != PAD_IDX).sum().item()
        total_loss   += loss.item() * non_pad
        total_tokens += non_pad

        if batch_idx % 50 == 0:
            print(f"  Batch {batch_idx}/{len(loader)}  "
                  f"Loss: {loss.item():.4f}  "
                  f"LR: {optimizer.param_groups[0]['lr']:.6f}")

    return total_loss / total_tokens


def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss = 0.0
    total_tokens = 0

    with torch.no_grad():
        for src, tgt in loader:
            src = src.to(device)
            tgt = tgt.to(device)
            tgt_input  = tgt[:, :-1]
            tgt_target = tgt[:, 1:]

            logits = model(src, tgt_input)
            loss   = criterion(logits.reshape(-1, logits.size(-1)), tgt_target.reshape(-1))

            non_pad = (tgt_target != PAD_IDX).sum().item()
            total_loss   += loss.item() * non_pad
            total_tokens += non_pad

    return total_loss / total_tokens


# ---- Main Training Loop ----
best_loss = float('inf')

for epoch in range(1, EPOCHS + 1):
    train_loss = train_epoch(model, train_loader, optimizer, scheduler, criterion, DEVICE)
    # val_loss = evaluate(model, val_loader, criterion, DEVICE)

    print(f"\nEpoch {epoch}/{EPOCHS}  Train Loss: {train_loss:.4f}  "
          f"Perplexity: {math.exp(train_loss):.2f}")

    # Save a checkpoint only when the tracked loss improves.
    # No validation set is defined here, so training loss stands in;
    # compare val_loss instead once a val_loader exists.
    if train_loss < best_loss:
        best_loss = train_loss
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': train_loss,
        }, 'transformer_best.pt')
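
When reading the printed loss and perplexity, a useful reference point is a model that predicts every token with equal probability. The sketch below computes that baseline for a vocabulary of 8,000 tokens. It ignores label smoothing and the special tokens, so it should be treated as an approximate yardstick rather than an exact floor or ceiling.

```python
import math

def uniform_baseline(vocab_size: int):
    """Cross-entropy (in nats) and perplexity of a uniform prediction."""
    loss = math.log(vocab_size)   # -log(1 / vocab_size)
    return loss, math.exp(loss)

loss, ppl = uniform_baseline(8000)
# loss is about 8.99 nats; perplexity equals the vocabulary size
```

On the random dummy data there is nothing to learn beyond memorization, so a training loss that starts near this value is expected. On real data, a loss that stays near `log(vocab_size)` suggests the model is not learning.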

12. Inference: Greedy Decoding (贪婪解码)

The simplest decoding strategy: at each step, pick the token with the highest probability.

def greedy_decode(
    model: Transformer,
    src: torch.Tensor,         # (1, src_len), a single example
    max_len: int = 50,
    bos_idx: int = BOS_IDX,
    eos_idx: int = EOS_IDX,
    device: torch.device = DEVICE,
) -> list[int]:
    model.eval()
    src = src.to(device)

    with torch.no_grad():
        # Step 1: Encode source sequence once
        src_mask = make_pad_mask(src, PAD_IDX)
        memory = model.encode(src, src_mask)   # (1, src_len, d_model)

        # Step 2: Initialize decoder input with BOS token
        tgt = torch.tensor([[bos_idx]], device=device)   # (1, 1)
        output_tokens = []

        for _ in range(max_len):
            # Build causal mask for current target length
            tgt_mask = make_causal_mask(tgt.size(1), device)

            # Decode one step (re-runs the decoder over the whole prefix; no KV cache)
            dec_out = model.decode(tgt, memory, tgt_mask, src_mask)
            # dec_out: (1, tgt_len, d_model)

            # Project and take argmax of last position
            logits     = model.fc_out(dec_out[:, -1, :])   # (1, vocab)
            next_token = logits.argmax(dim=-1).item()

            output_tokens.append(next_token)

            if next_token == eos_idx:
                break

            # Append predicted token and continue
            tgt = torch.cat([tgt, torch.tensor([[next_token]], device=device)], dim=1)

    return output_tokens


# Example usage
src_example = torch.randint(3, SRC_VOCAB, (1, 15))
predicted = greedy_decode(model, src_example, max_len=50)
print("Predicted token IDs:", predicted)

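13. Inference: Beam Search (束搜索)

Beam search keeps the top-k candidate sequences (ranked by cumulative log-probability) at each step instead of committing to a single token. It usually finds higher-probability outputs than greedy decoding, at roughly k times the decoding cost.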

from dataclasses import dataclass, field

@dataclass(order=True)
class BeamHypothesis:
    score: float
    tokens: list[int] = field(compare=False)


def beam_search_decode(
    model: Transformer,
    src: torch.Tensor,
    beam_size: int   = 4,
    max_len: int     = 50,
    bos_idx: int     = BOS_IDX,
    eos_idx: int     = EOS_IDX,
    device: torch.device = DEVICE,
    length_penalty: float = 0.6,
) -> list[int]:
    model.eval()
    src = src.to(device)

    with torch.no_grad():
        src_mask = make_pad_mask(src, PAD_IDX)
        memory   = model.encode(src, src_mask)

        # Initialize beam with BOS token
        beams     = [BeamHypothesis(score=0.0, tokens=[bos_idx])]
        completed = []

        for step in range(max_len):
            all_candidates = []

            for beam in beams:
                # Finished hypotheses are set aside and not expanded further
                if beam.tokens[-1] == eos_idx:
                    completed.append(beam)
                    continue

                tgt = torch.tensor([beam.tokens], device=device)
                tgt_mask = make_causal_mask(tgt.size(1), device)

                dec_out = model.decode(tgt, memory, tgt_mask, src_mask)
                logits  = model.fc_out(dec_out[:, -1, :])           # (1, vocab)
                log_probs = F.log_softmax(logits, dim=-1).squeeze(0) # (vocab,)

                # Expand top-k tokens
                topk_log_probs, topk_ids = log_probs.topk(beam_size)

                for log_prob, token_id in zip(topk_log_probs.tolist(),
                                               topk_ids.tolist()):
                    new_score  = beam.score + log_prob
                    new_tokens = beam.tokens + [token_id]
                    all_candidates.append(
                        BeamHypothesis(score=new_score, tokens=new_tokens)
                    )

            if not all_candidates:
                break

            # Keep top beam_size candidates, ranked by length-normalized log-probability
            all_candidates.sort(key=lambda h: h.score / (len(h.tokens) ** length_penalty),
                                 reverse=True)
            beams = all_candidates[:beam_size]

        # Return best completed hypothesis (or best incomplete beam)
        all_hyps = completed + beams
        best = max(all_hyps, key=lambda h: h.score / (len(h.tokens) ** length_penalty))
        return best.tokens[1:]   # Strip BOS


predicted_beam = beam_search_decode(model, src_example, beam_size=4)
print("Beam search tokens:", predicted_beam)
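
Why keeping several candidates can help is easiest to see on a toy problem. The sketch below replaces the model with a small hand-written table of next-token probabilities (entirely made up for illustration) and runs the same search with beam sizes 1 and 2. All sequences have the same length, so no length penalty is applied.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix generated so far
NEXT_TOKEN_PROBS = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(beam_size, steps=2):
    """Return the best sequence found and its probability."""
    beams = [((), 0.0)]   # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, p in NEXT_TOKEN_PROBS[prefix].items():
                candidates.append((prefix + (token,), score + math.log(p)))
        # Keep only the beam_size highest-scoring prefixes
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    best_prefix, best_score = beams[0]
    return best_prefix, math.exp(best_score)

greedy_seq, greedy_prob = beam_search(beam_size=1)   # beam size 1 is greedy decoding
beam_seq, beam_prob = beam_search(beam_size=2)
```

Greedy decoding commits to "A" because it is the most likely first token and ends with a sequence of probability 0.33. With a beam of 2, the prefix "B" survives the first step and leads to the sequence ("B", "x") with probability 0.36.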

14. Saving & Loading Checkpoints (保存与加载)

# ---- Save ----
torch.save({
    'model_state_dict'    : model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'epoch'               : epoch,
    'loss'                : train_loss,
    'config': {
        'src_vocab': SRC_VOCAB, 'tgt_vocab': TGT_VOCAB,
        'd_model': D_MODEL, 'num_heads': NUM_HEADS,
        'num_layers': NUM_LAYERS, 'd_ff': D_FF,
        # max_seq_len must match on load: the positional-encoding buffer
        # is stored in the state dict with shape (1, max_seq_len, d_model)
        'max_seq_len': MAX_LEN, 'dropout': DROPOUT, 'pad_idx': PAD_IDX,
    }
}, 'transformer_checkpoint.pt')

# ---- Load ----
checkpoint = torch.load('transformer_checkpoint.pt', map_location=DEVICE)
cfg = checkpoint['config']

model = Transformer(
    src_vocab_size=cfg['src_vocab'],
    tgt_vocab_size=cfg['tgt_vocab'],
    d_model=cfg['d_model'],
    num_heads=cfg['num_heads'],
    num_layers=cfg['num_layers'],
    d_ff=cfg['d_ff'],
    max_seq_len=cfg['max_seq_len'],
    dropout=cfg['dropout'],
    pad_idx=cfg['pad_idx'],
).to(DEVICE)
model.load_state_dict(checkpoint['model_state_dict'])

# Rebuild the optimizer around the new model's parameters before restoring its state
optimizer = optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9)
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
model.eval()
print(f"Loaded checkpoint from epoch {checkpoint['epoch']}")

15. Key Design Decisions & Modern Variants (关键设计决策与现代变体)

| Component | Original Paper | Modern Practice |
|---|---|---|
| Positional Encoding | Sinusoidal (fixed) | Learned embeddings (BERT) / RoPE (LLaMA) |
| Normalization | Post-LN (后归一化) | Pre-LN (前归一化), more stable |
| Activation | ReLU | GELU (GPT) / SwiGLU (LLaMA) |
| Attention | Full multi-head attention | GQA / MQA (grouped-query / multi-query), faster inference |
| Vocab size | ~37,000 | 32k–128k+ with BPE/SentencePiece |
| Weight tying | Embedding layers and pre-softmax projection share one weight matrix | Tie input & output embeddings (GPT-2) |
| KV Cache | None | KV Cache (KV 缓存) for autoregressive inference |
| Context length | 512 | 4k–128k+ with sliding window or ALiBi |

💡 One-line Takeaway
A Transformer is a stack of Multi-Head Attention (多头注意力) + Feed-Forward (前馈网络) blocks tied together by Residual Connections (残差连接) + LayerNorm (层归一化) — master scaled_dot_product_attention, understand causal masking, use warmup scheduling, and switch from greedy to beam search for better output quality.