
V. Neural Network Modules — nn.Module

1. nn.Module

The base class for all neural networks in PyTorch. It manages parameters and sub-modules and defines the forward-pass logic.
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)
Note: Only implement __init__ and forward. The backward pass is handled automatically by autograd.
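A quick way to see what nn.Module does behind the scenes: assigning a sub-module to an attribute registers its parameters automatically, so parameters() can find all of them. A minimal sanity check using the MLP above:

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

model = MLP()
# Assigning nn.Linear to an attribute auto-registers its weight and bias,
# so parameters() yields all four tensors (fc1.weight/bias, fc2.weight/bias).
n_params = sum(p.numel() for p in model.parameters())
# 784*256 + 256 + 256*10 + 10 = 203530
```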

2. nn.Linear()

Fully Connected Layer, i.e. an affine transformation y = xWᵀ + b. The most fundamental learnable layer.
fc = nn.Linear(in_features=128, out_features=64, bias=True)
x = torch.rand(32, 128)
out = fc(x) # shape [32, 64]
Note: The weight is stored as [out_features, in_features]. bias=False is commonly paired with BatchNorm, whose learned shift makes the bias redundant.
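To make the weight layout concrete, a small sketch that reproduces the forward pass by hand (same fc as above):

```python
import torch
import torch.nn as nn

fc = nn.Linear(in_features=128, out_features=64, bias=True)
x = torch.rand(32, 128)
out = fc(x)

# Weight is stored as [out_features, in_features], so the forward pass
# computes x @ W.T + b.
manual = x @ fc.weight.T + fc.bias
```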

3. nn.Conv2d()

2D Convolutional Layer. Extracts local spatial features; the core building block of convolutional neural networks (CNNs).
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)
x = torch.rand(8, 3, 224, 224)
out = conv(x) # [8, 64, 224, 224]
Note: With stride=1 and an odd kernel_size, padding=kernel_size//2 preserves the feature-map size ("same" padding).
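The output size follows floor((H + 2*padding - kernel_size) / stride) + 1. For instance, switching the stride to 2 halves the spatial dimensions:

```python
import torch
import torch.nn as nn

# Output size per dim: (H + 2*padding - kernel_size) // stride + 1
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=2, padding=1)
x = torch.rand(8, 3, 224, 224)
out = conv(x)
# (224 + 2*1 - 3) // 2 + 1 = 112, so spatial dims shrink 224 -> 112
```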

4. nn.BatchNorm2d()

Normalizes each channel over the mini-batch. Accelerates training and mitigates vanishing gradients.
bn = nn.BatchNorm2d(num_features=64)
x = torch.rand(8, 64, 28, 28)
out = bn(x)
# Standard order: Conv → BN → ReLU
Note: BN is unstable when batch_size=1. Switch to GroupNorm or LayerNorm in that case.
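To see the normalization at work, a small check: in train mode, each channel of the output has roughly zero mean regardless of the input's scale and shift:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=64)
x = torch.rand(8, 64, 28, 28) * 5 + 3   # arbitrary scale and shift

bn.train()
out = bn(x)
# In train mode each channel is normalized over (batch, H, W),
# so the per-channel mean of the output is ~0.
channel_mean = out.mean(dim=(0, 2, 3))
```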

5. nn.Dropout()

During training, randomly zeros a fraction p of activations, a regularization technique to prevent overfitting; the surviving activations are scaled by 1/(1-p) so the expected value is unchanged.
dropout = nn.Dropout(p=0.5)
x = torch.rand(4, 128)
out = dropout(x)       # train mode: ~50% of elements zeroed, survivors scaled by 2
dropout.eval()
out_eval = dropout(x)  # eval mode: identity, equal to x
Note: Forgetting model.eval() before inference is one of the most common PyTorch bugs: Dropout stays active and predictions become non-deterministic.
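PyTorch uses inverted dropout: survivors are scaled by 1/(1-p) at train time, so no rescaling is needed at inference. Feeding a tensor of ones makes this visible:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(4, 128)

drop.train()
out = drop(x)
# Inverted dropout: every surviving element is scaled by 1/(1-p) = 2,
# so the output contains only 0.0 and 2.0, and E[out] stays 1.0.

drop.eval()
out_eval = drop(x)   # identity in eval mode
```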

6. nn.Sequential()

Chains a series of layers in order, executing each forward call sequentially. Simplifies model definition.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 10)
)
out = model(x)
Note: Use OrderedDict to name layers: nn.Sequential(OrderedDict([('fc', nn.Linear(...))])).
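The OrderedDict variant from the note, spelled out; named layers then become attributes of the container (model.fc1, model.fc2):

```python
import torch
import torch.nn as nn
from collections import OrderedDict

model = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(784, 256)),
    ('act', nn.ReLU()),
    ('fc2', nn.Linear(256, 10)),
]))
out = model(torch.rand(2, 784))
# Named sub-modules are reachable as attributes, e.g. model.fc1
```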

7. nn.ModuleList() / nn.ModuleDict()

Registers sub-modules as a list or dictionary so that their parameters are correctly tracked and saved.
x = torch.rand(2, 64)
layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(6)])
for layer in layers:
    x = torch.relu(layer(x))
heads = nn.ModuleDict({
    'cls': nn.Linear(64, 10),
    'reg': nn.Linear(64, 1)
})
Note: A plain Python list or dict is not registered, so parameters() will miss its contents!
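A minimal demonstration of the pitfall: two otherwise identical modules, one using nn.ModuleList and one a plain list (the class names Good/Bad are just for illustration):

```python
import torch
import torch.nn as nn

class Good(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])

class Bad(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(8, 8) for _ in range(3)]  # plain list: NOT registered

good_n = sum(p.numel() for p in Good().parameters())   # 3 * (8*8 + 8) = 216
bad_n = sum(p.numel() for p in Bad().parameters())     # 0 -- an optimizer would see nothing
```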

8. nn.Embedding()

Maps integer indices to dense vectors. The standard word-embedding lookup table in NLP.
vocab_size, embed_dim = 10000, 128
emb = nn.Embedding(vocab_size, embed_dim)
ids = torch.randint(0, vocab_size, (16, 50)) # [batch, seq_len]
out = emb(ids) # [16, 50, 128]
Note: padding_idx marks a padding token whose embedding row is initialized to zeros and excluded from gradient updates.
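A short check of the padding_idx behavior described above: the padding row starts at zero and receives a zero gradient:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=100, embedding_dim=16, padding_idx=0)
ids = torch.tensor([[0, 5, 7]])   # index 0 is the padding token
out = emb(ids)
out.sum().backward()

pad_row = emb.weight[0].detach()   # initialized to zeros
pad_grad = emb.weight.grad[0]      # gradient for the padding row is always zero
```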

9. nn.LSTM() / nn.GRU()

Long Short-Term Memory and Gated Recurrent Unit: the classic recurrent layers for sequence data.
lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2, batch_first=True, dropout=0.2)
x = torch.rand(8, 50, 128) # [batch, seq, feat]
out, (h, c) = lstm(x)
Note: batch_first=True sets the input format to [B, T, F], which is more intuitive. The default is [T, B, F].
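Shapes worth memorizing: out holds the last layer's hidden state at every step, while h and c hold every layer's final state, and they stay [num_layers, batch, hidden] even with batch_first=True:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2, batch_first=True)
x = torch.rand(8, 50, 128)                 # [batch, seq, feat]
out, (h, c) = lstm(x)
# out:  per-step hidden states of the LAST layer   -> [batch, seq, hidden]
# h, c: final hidden/cell state of EVERY layer     -> [num_layers, batch, hidden]
#       (NOT batch-first, even with batch_first=True)
```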

10. nn.MultiheadAttention()

Multi-head Self-Attention, the core component of the Transformer architecture.
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.rand(4, 100, 512)
out, weights = attn(query=x, key=x, value=x)
Note: Use key_padding_mask to mask padding tokens; use attn_mask for causal masking in decoders.
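A sketch of the causal mask from the note: a boolean upper-triangular attn_mask (True = position blocked) keeps position i from attending to any future position:

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.rand(2, 10, 512)

# Causal mask: True above the diagonal means "may not attend there".
T = x.size(1)
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
out, weights = attn(query=x, key=x, value=x, attn_mask=causal)
# weights (averaged over heads by default) is [batch, T, T];
# everything above the diagonal is zeroed by the mask.
```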
💡 One-line Takeaway
Every custom network inherits from nn.Module; use ModuleList/Dict (not plain lists) to ensure parameters are tracked.

Neural Network Modules
https://lxy-alexander.github.io/blog/posts/pytorch/api/05neural-network-modules/
Author: Alexander Lee
Published: 2026-03-12
License: CC BY-NC-SA 4.0