Activation Functions & Loss Functions

VI. Activation Functions & Loss Functions

1. nn.ReLU() / F.relu()

Rectified Linear Unit: max(0, x). Cheap to compute and alleviates vanishing gradients; the most widely used activation function.
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 64)
out1 = F.relu(x)              # functional call
relu = nn.ReLU(inplace=True)
out2 = relu(x)                # module call (can go in Sequential)
Note: inplace=True saves memory but overwrites the input tensor — avoid it when that tensor is still needed for the backward pass (e.g. by autograd hooks).
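To make the in-place behaviour concrete, a minimal sketch (the tensor values are illustrative): the module returns the very same tensor object it was given, with the negatives already zeroed out.

```python
import torch
import torch.nn as nn

x = torch.tensor([-1.0, 2.0, -3.0])
relu = nn.ReLU(inplace=True)
out = relu(x)

print(out is x)  # True: the module returned the same tensor object
print(x)         # tensor([0., 2., 0.]) — the negatives were overwritten
```

Because `out` aliases `x`, any later read of `x` sees the activated values, which is exactly what trips up hooks that expect the pre-activation input.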

2. nn.GELU() / nn.SiLU()

GELU is the standard Transformer activation; SiLU (Swish) is used in EfficientNet and other mobile models.
gelu = nn.GELU()
silu = nn.SiLU()
x = torch.randn(4, 64)
print(gelu(x).shape) # [4, 64]
print(silu(x).shape) # [4, 64]
Note: BERT/GPT default to GELU; SiLU tends to perform better in mobile-oriented models.
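A quick numerical check of the two definitions (the input shape is illustrative): SiLU is exactly x·sigmoid(x), and exact GELU is x·Φ(x) with Φ the standard normal CDF.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 64)

# SiLU (Swish) is exactly x * sigmoid(x)
assert torch.allclose(nn.SiLU()(x), x * torch.sigmoid(x), atol=1e-6)

# Exact GELU (the default, approximate='none') is x * Phi(x),
# where Phi is the standard normal CDF
phi = 0.5 * (1.0 + torch.erf(x / 2.0 ** 0.5))
assert torch.allclose(nn.GELU()(x), x * phi, atol=1e-6)
```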

3. nn.Softmax() / F.softmax()

Converts logits to a probability distribution summing to 1; the standard output layer for multi-class classification.
logits = torch.tensor([2.0, 1.0, 0.1])
probs = F.softmax(logits, dim=0) # tensor([0.659, 0.242, 0.099])
Note: Do NOT manually add Softmax when using CrossEntropyLoss — it already includes it internally.
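A small sketch of why the note matters (the logits are the same illustrative values as above): passing pre-softmaxed probabilities into cross_entropy applies softmax a second time and yields a different, wrong loss.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1]])
label = torch.tensor([0])

correct = F.cross_entropy(logits, label)                   # raw logits in
wrong = F.cross_entropy(F.softmax(logits, dim=-1), label)  # softmax applied twice

print(round(correct.item(), 3))  # 0.417, i.e. -log(0.659)
```

The "wrong" value is noticeably larger because the already-flattened probabilities get flattened again before the log is taken.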

4. nn.CrossEntropyLoss()

Multi-class cross-entropy loss — internally fuses LogSoftmax + NLLLoss for numerical stability.
criterion = nn.CrossEntropyLoss()
logits = torch.randn(8, 10, requires_grad=True)
labels = torch.randint(0, 10, (8,))
loss = criterion(logits, labels)
loss.backward()
Note: The label_smoothing parameter (PyTorch ≥ 1.10) effectively prevents overconfidence and improves generalization.
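As a sketch of the label_smoothing parameter (0.1 is just an illustrative value), the option plugs straight into the constructor:

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 10, requires_grad=True)
labels = torch.randint(0, 10, (8,))

# 0.1 means 10% of the target mass is spread uniformly over all
# classes, softening the one-hot target
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
loss = criterion(logits, labels)
loss.backward()
```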

5. nn.BCEWithLogitsLoss()

Binary / multi-label classification loss. More numerically stable than applying Sigmoid followed by BCELoss.
criterion = nn.BCEWithLogitsLoss()
logits = torch.rand(8, 1) # no sigmoid needed
targets = torch.randint(0, 2, (8, 1)).float()
loss = criterion(logits, targets)
Note: For multi-label classification, targets is a float matrix in which each entry is an independent 0/1 label — not a class index.
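A minimal multi-label sketch (5 labels per sample is an assumed shape): the target is a multi-hot float matrix, and every position is scored independently.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()
logits = torch.randn(8, 5)                     # 5 independent labels per sample
targets = torch.randint(0, 2, (8, 5)).float()  # multi-hot matrix, not class indices
loss = criterion(logits, targets)
```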

6. nn.MSELoss() / nn.L1Loss()

Mean Squared Error and Mean Absolute Error — for continuous-value prediction (regression).
mse = nn.MSELoss()
mae = nn.L1Loss()
pred = torch.rand(4, 1)
target = torch.rand(4, 1)
print(mse(pred, target))
print(mae(pred, target))
Note: L1 is more robust to outliers. SmoothL1Loss (Huber loss) combines the strengths of both — recommended for object detection.
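A small worked sketch of how SmoothL1Loss behaves at both ends (beta=1.0 is the default threshold; the values are chosen for illustration):

```python
import torch
import torch.nn as nn

huber = nn.SmoothL1Loss(beta=1.0)  # quadratic for |error| < beta, linear beyond
pred = torch.tensor([0.0, 0.0])
target = torch.tensor([0.5, 10.0])

# per-element: 0.5 * 0.5**2 = 0.125 (MSE-like), 10.0 - 0.5 = 9.5 (L1-like)
print(huber(pred, target))  # tensor(4.8125) — the mean of the two
```

The large error contributes linearly rather than quadratically, which is exactly why it tolerates outliers better than MSE.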

7. nn.KLDivLoss()

KL divergence loss — measures the difference between two probability distributions. Used in knowledge distillation and VAEs.
kl = nn.KLDivLoss(reduction='batchmean')
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
log_p = F.log_softmax(student_logits, dim=-1)  # input: log-probabilities
q = F.softmax(teacher_logits, dim=-1)          # target: probabilities
loss = kl(log_p, q)
Note: Input must be log-probabilities; target must be probabilities. This matches the mathematical definition.
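For knowledge distillation specifically, a temperature is usually added; a sketch under assumed values (T=2.0 and the shapes are illustrative, and the T² rescaling follows the common Hinton-style convention):

```python
import torch
import torch.nn.functional as F

T = 2.0  # distillation temperature (illustrative choice)
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)

log_p = F.log_softmax(student_logits / T, dim=-1)  # input: log-probabilities
q = F.softmax(teacher_logits / T, dim=-1)          # target: probabilities
# T**2 keeps the gradient magnitudes comparable to the hard-label loss
loss = F.kl_div(log_p, q, reduction='batchmean') * T ** 2
```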

8. nn.LayerNorm()

Normalizes over the last N dimensions. Independent of batch size — the standard normalization in Transformers.
ln = nn.LayerNorm(normalized_shape=512)
x = torch.rand(4, 100, 512)
out = ln(x) # shape [4, 100, 512]
Note: Outperforms BatchNorm for variable-length NLP sequences and small batch sizes.
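To see what "normalizes over the last N dimensions" means in numbers, a quick check (shapes as in the snippet above): every (batch, position) slice ends up with mean ≈ 0 and unit variance over its 512 features, regardless of the other samples in the batch.

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(512)
x = torch.randn(4, 100, 512)
out = ln(x)

# Each (batch, position) slice is normalized independently over the 512 features
print(out.mean(dim=-1).abs().max() < 1e-4)  # True: per-slice mean ~ 0
```

This per-position independence is why the layer is unaffected by batch size, unlike BatchNorm, which pools statistics across the batch.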
💡 One-line Takeaway
Use CrossEntropyLoss for multi-class, BCEWithLogitsLoss for multi-label, and SmoothL1 for regression — and never apply Softmax before CrossEntropy.

Activation Functions & Loss Functions
https://lxy-alexander.github.io/blog/posts/pytorch/api/06activation-functions--loss-functions/
Author: Alexander Lee
Published at: 2026-03-12
License: CC BY-NC-SA 4.0