Activation Functions & Loss Functions

VI. Activation Functions & Loss Functions
1. nn.ReLU() / F.relu()
Rectified Linear Unit: max(0, x). Alleviates vanishing gradients; the most widely used activation function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 64)
out1 = F.relu(x)              # functional call
relu = nn.ReLU(inplace=True)
out2 = relu(x)                # module call (can go in Sequential)
```

Note: inplace=True saves memory but modifies the original tensor in place; be careful when using autograd hooks.
2. nn.GELU() / nn.SiLU()
GELU: the standard Transformer activation. SiLU (Swish): used in EfficientNet and mobile models.

```python
import torch
import torch.nn as nn

gelu = nn.GELU()
silu = nn.SiLU()
x = torch.randn(4, 64)
print(gelu(x).shape)  # [4, 64]
print(silu(x).shape)  # [4, 64]
```

Note: BERT/GPT default to GELU; SiLU tends to perform better in mobile models.
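Both activations have simple closed forms: SiLU is x·σ(x), and exact GELU is x·Φ(x) with Φ the Gaussian CDF. A quick sketch checking those definitions against the functional API:

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 64)

# SiLU (Swish) is defined as x * sigmoid(x)
assert torch.allclose(F.silu(x), x * torch.sigmoid(x), atol=1e-6)

# Exact GELU is x * Phi(x), where Phi is the standard normal CDF
phi = 0.5 * (1 + torch.erf(x / 2 ** 0.5))
assert torch.allclose(F.gelu(x), x * phi, atol=1e-5)
```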
3. nn.Softmax() / F.softmax()
Converts logits to a probability distribution summing to 1; the standard output layer for multi-class classification.
```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])
probs = F.softmax(logits, dim=0)  # tensor([0.659, 0.242, 0.099])
```

Note: Do NOT manually add Softmax when using CrossEntropyLoss; it already applies it internally.
4. nn.CrossEntropyLoss()
Multi-class Cross-Entropy Loss: internally fuses LogSoftmax + NLLLoss for numerical stability.
```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(8, 10, requires_grad=True)  # requires_grad so backward() works
labels = torch.randint(0, 10, (8,))
loss = criterion(logits, labels)
loss.backward()
```

Note: The label_smoothing parameter (PyTorch ≥ 1.10) effectively prevents overconfidence and improves generalization.
5. nn.BCEWithLogitsLoss()
Binary / Multi-label Classification Loss. More numerically stable than applying Sigmoid followed by BCELoss.
```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()
logits = torch.randn(8, 1)  # no sigmoid needed
targets = torch.randint(0, 2, (8, 1)).float()
loss = criterion(logits, targets)
```

Note: For multi-label classification, targets is a float matrix with each position an independent 0/1 label, not a class index.
6. nn.MSELoss() / nn.L1Loss()
Mean Squared Error and Mean Absolute Error: for continuous-value prediction (regression).
```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
mae = nn.L1Loss()
pred = torch.rand(4, 1)
target = torch.rand(4, 1)
print(mse(pred, target))
print(mae(pred, target))
```

Note: L1 is more robust to outliers. SmoothL1Loss (Huber loss) combines the strengths of both and is recommended for object detection.
7. nn.KLDivLoss()
KL Divergence Loss: measures the difference between two probability distributions. Used in knowledge distillation and VAEs.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)

kl = nn.KLDivLoss(reduction='batchmean')
log_p = F.log_softmax(student_logits, dim=-1)  # input: log-probabilities
q = F.softmax(teacher_logits, dim=-1)          # target: probabilities
loss = kl(log_p, q)
```

Note: The input must be log-probabilities and the target must be probabilities; this matches the mathematical definition.
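In knowledge distillation, both distributions are usually softened with a temperature before computing the KL term. A minimal sketch of that common recipe; the temperature T = 2.0 and random logits here are hypothetical placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 2.0  # temperature: hypothetical choice, typically 1-10
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)

kl = nn.KLDivLoss(reduction='batchmean')
log_p = F.log_softmax(student_logits / T, dim=-1)  # softened student log-probs
q = F.softmax(teacher_logits / T, dim=-1)          # softened teacher probs
# scale by T^2 so gradient magnitudes stay comparable across temperatures
distill_loss = kl(log_p, q) * T * T
```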
8. nn.LayerNorm()
Normalizes over the last N dimensions. Independent of batch size; the standard normalization in Transformers.
```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(normalized_shape=512)
x = torch.rand(4, 100, 512)
out = ln(x)  # shape [4, 100, 512]
```

Note: Outperforms BatchNorm for variable-length NLP sequences and small batch sizes.
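What "normalizes over the last N dimensions" means can be checked by hand: subtract the per-position mean over the last dimension, divide by the standard deviation, then apply the learned affine parameters. A sketch reproducing nn.LayerNorm manually:

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(normalized_shape=512)
x = torch.randn(4, 100, 512)

# manual computation over the last dimension
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)  # biased variance, as LayerNorm uses
manual = (x - mean) / torch.sqrt(var + ln.eps)
manual = manual * ln.weight + ln.bias  # learned affine (gamma, beta)

assert torch.allclose(ln(x), manual, atol=1e-5)
```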
💡 One-line Takeaway
Use CrossEntropyLoss for multi-class, BCEWithLogitsLoss for multi-label, and SmoothL1 for regression; never apply Softmax before CrossEntropyLoss.
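The pieces above slot into one standard training step. A minimal sketch with a hypothetical 10-class classifier (the layer sizes, learning rate, and label_smoothing value are illustrative choices, not from the original post):

```python
import torch
import torch.nn as nn

# hypothetical 10-class classifier on 64-dim features
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # requires PyTorch >= 1.10
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 64)
labels = torch.randint(0, 10, (8,))

logits = model(x)                 # raw logits: no Softmax before CrossEntropyLoss
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```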