Activation Functions & Loss Functions

VI. Activation Functions & Loss Functions
1. nn.ReLU() / F.relu()
Rectified Linear Unit: max(0, x). Alleviates vanishing gradients; the most widely used activation function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 64)
out1 = F.relu(x)              # functional call
relu = nn.ReLU(inplace=True)
out2 = relu(x)                # module call (can go in Sequential)
```

Note: inplace=True saves memory but modifies the original tensor in place; be careful when using autograd hooks.
2. nn.GELU() / nn.SiLU()
GELU: the standard Transformer activation. SiLU (Swish): used in EfficientNet and mobile models.

```python
import torch
import torch.nn as nn

gelu = nn.GELU()
silu = nn.SiLU()
x = torch.randn(4, 64)
print(gelu(x).shape)  # [4, 64]
print(silu(x).shape)  # [4, 64]
```

Note: BERT/GPT default to GELU; SiLU tends to perform better in mobile models.
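Both activations have simple closed forms: SiLU is x·σ(x), and exact GELU is x·Φ(x) with Φ the Gaussian CDF. A quick sketch checking those definitions against the functional API:

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 64)

# SiLU (Swish) is defined as x * sigmoid(x)
assert torch.allclose(F.silu(x), x * torch.sigmoid(x), atol=1e-6)

# Exact GELU is x * Phi(x), where Phi is the standard normal CDF
phi = 0.5 * (1 + torch.erf(x / 2 ** 0.5))
assert torch.allclose(F.gelu(x), x * phi, atol=1e-5)
```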
3. nn.Softmax() / F.softmax()
Converts logits to a probability distribution summing to 1; the standard output layer for multi-class classification.
```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])
probs = F.softmax(logits, dim=0)  # tensor([0.659, 0.242, 0.099])
```

Note: Do NOT manually add Softmax when using CrossEntropyLoss; it already applies it internally.
4. nn.CrossEntropyLoss()
Multi-class Cross-Entropy Loss: internally fuses LogSoftmax + NLLLoss for numerical stability.
```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(8, 10, requires_grad=True)  # requires_grad so backward() works
labels = torch.randint(0, 10, (8,))
loss = criterion(logits, labels)
loss.backward()
```

Note: The label_smoothing parameter (PyTorch ≥ 1.10) effectively prevents overconfidence and improves generalization.
5. nn.BCEWithLogitsLoss()
Binary / Multi-label Classification Loss. More numerically stable than applying Sigmoid followed by BCELoss.
```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()
logits = torch.randn(8, 1)  # no sigmoid needed
targets = torch.randint(0, 2, (8, 1)).float()
loss = criterion(logits, targets)
```

Note: For multi-label classification, targets is a float matrix with each position an independent 0/1 label, not a class index.
6. nn.MSELoss() / nn.L1Loss()
Mean Squared Error and Mean Absolute Error: for continuous-value prediction (regression).
```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
mae = nn.L1Loss()
pred = torch.rand(4, 1)
target = torch.rand(4, 1)
print(mse(pred, target))
print(mae(pred, target))
```

Note: L1 is more robust to outliers. SmoothL1Loss (Huber loss) combines the strengths of both and is recommended for object detection.
7. nn.KLDivLoss()
KL Divergence Loss: measures the difference between two probability distributions. Used in knowledge distillation and VAEs.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)

kl = nn.KLDivLoss(reduction='batchmean')
log_p = F.log_softmax(student_logits, dim=-1)  # input: log-probabilities
q = F.softmax(teacher_logits, dim=-1)          # target: probabilities
loss = kl(log_p, q)
```

Note: The input must be log-probabilities and the target must be probabilities; this matches the mathematical definition.
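In knowledge distillation, both distributions are usually softened with a temperature before computing the KL term. A minimal sketch of that common recipe; the temperature T = 2.0 and random logits here are hypothetical placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 2.0  # temperature: hypothetical choice, typically 1-10
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)

kl = nn.KLDivLoss(reduction='batchmean')
log_p = F.log_softmax(student_logits / T, dim=-1)  # softened student log-probs
q = F.softmax(teacher_logits / T, dim=-1)          # softened teacher probs
# scale by T^2 so gradient magnitudes stay comparable across temperatures
distill_loss = kl(log_p, q) * T * T
```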
8. nn.LayerNorm()
Normalizes over the last N dimensions. Independent of batch size; the standard normalization in Transformers.
```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(normalized_shape=512)
x = torch.rand(4, 100, 512)
out = ln(x)  # shape [4, 100, 512]
```

Note: Outperforms BatchNorm for variable-length NLP sequences and small batch sizes.
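What "normalizes over the last N dimensions" means can be checked by hand: subtract the per-position mean over the last dimension, divide by the standard deviation, then apply the learned affine parameters. A sketch reproducing nn.LayerNorm manually:

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(normalized_shape=512)
x = torch.randn(4, 100, 512)

# manual computation over the last dimension
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)  # biased variance, as LayerNorm uses
manual = (x - mean) / torch.sqrt(var + ln.eps)
manual = manual * ln.weight + ln.bias  # learned affine (gamma, beta)

assert torch.allclose(ln(x), manual, atol=1e-5)
```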
💡 One-line Takeaway
Use CrossEntropyLoss for multi-class, BCEWithLogitsLoss for multi-label, and SmoothL1 for regression; never apply Softmax before CrossEntropyLoss.
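The pieces above slot into one standard training step. A minimal sketch with a hypothetical 10-class classifier (the layer sizes, learning rate, and label_smoothing value are illustrative choices, not from the original post):

```python
import torch
import torch.nn as nn

# hypothetical 10-class classifier on 64-dim features
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # requires PyTorch >= 1.10
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 64)
labels = torch.randint(0, 10, (8,))

logits = model(x)                 # raw logits: no Softmax before CrossEntropyLoss
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```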