Optimizers & Learning Rate Schedulers

VII. Optimizers & Learning Rate Schedulers
1. torch.optim.SGD()
Stochastic gradient descent. Supports momentum, weight decay, and Nesterov momentum.
```python
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    weight_decay=1e-4,
    nesterov=True)
```
Note: SGD with momentum is still common in CV; its final accuracy sometimes surpasses Adam's.
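To make these knobs concrete, here is a minimal sketch of the update rule on plain Python floats, simplified from the update documented for torch.optim.SGD (dampening omitted). sgd_step is a hypothetical helper for illustration, not a PyTorch API:

```python
def sgd_step(param, grad, buf, lr=0.01, momentum=0.9,
             weight_decay=1e-4, nesterov=True):
    """One SGD update on plain floats, mirroring torch.optim.SGD's math."""
    grad = grad + weight_decay * param        # L2 penalty folded into the gradient
    buf = momentum * buf + grad               # velocity (momentum buffer)
    step = grad + momentum * buf if nesterov else buf
    return param - lr * step, buf
```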
2. torch.optim.Adam()
Adaptive Moment Estimation; combines ideas from AdaGrad and RMSProp. A solid default optimizer for most tasks.
```python
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    weight_decay=1e-4)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```
Note: For NLP/Transformer workloads, prefer AdamW (decoupled weight decay).
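What betas actually controls is easiest to see in the update rule. A minimal sketch on plain floats (weight decay omitted; adam_step is a hypothetical helper, not a PyTorch API):

```python
def adam_step(param, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One Adam update on plain floats (t is the 1-based step count)."""
    b1, b2 = betas
    m = b1 * m + (1 - b1) * grad         # first moment: running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2    # second moment: running mean of squared gradients
    m_hat = m / (1 - b1 ** t)            # bias correction for zero-initialized moments
    v_hat = v / (1 - b2 ** t)
    return param - lr * m_hat / (v_hat ** 0.5 + eps), m, v
```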
3. torch.optim.AdamW()
Improved Adam with correctly decoupled L2 regularization (true weight decay). The go-to optimizer for training Transformers.
```python
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-5,
    weight_decay=0.01)  # standard config for BERT/GPT fine-tuning
```
Note: In the original Adam, the L2 penalty is entangled with the adaptive learning-rate scaling; AdamW fixes this by decoupling weight decay from the gradient update.
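A minimal sketch of the difference, building on the adam_step sketch above (simplified for illustration; adamw_step is a hypothetical helper, not the real implementation):

```python
def adamw_step(param, m_hat, v_hat, lr=5e-5, weight_decay=0.01, eps=1e-8):
    """AdamW: decay acts on the weights directly, outside the adaptive scaling."""
    param = param * (1 - lr * weight_decay)            # decoupled weight decay
    return param - lr * m_hat / (v_hat ** 0.5 + eps)   # usual Adam step

# Classic Adam + L2 instead adds weight_decay * param to the gradient *before*
# the moments are computed, so the penalty gets rescaled by 1 / sqrt(v_hat).
```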
4. optimizer.zero_grad()
Clears the gradient buffers of all parameters. Call it once per iteration, before the backward pass, because PyTorch accumulates gradients by default.
```python
for epoch in range(10):
    for x, y in dataloader:
        optimizer.zero_grad()      # 1. clear gradients
        pred = model(x)            # 2. forward
        loss = criterion(pred, y)
        loss.backward()            # 3. backward
        optimizer.step()           # 4. update
```
Note: zero_grad(set_to_none=True) uses less memory and is recommended on PyTorch ≥ 1.7 (it became the default in PyTorch 2.0).
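The flip side of explicit zeroing is that skipping it lets gradients sum across batches, which is the standard trick for simulating a larger batch size. A rough sketch, reusing model, criterion, optimizer, and dataloader from the snippets above and assuming accum_steps evenly divides the number of batches:

```python
accum_steps = 4
optimizer.zero_grad(set_to_none=True)
for i, (x, y) in enumerate(dataloader):
    loss = criterion(model(x), y) / accum_steps  # scale so the sum matches one big batch
    loss.backward()                              # grads keep summing into .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```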
5. lr_scheduler.StepLR()
Multiplies the learning rate by gamma every step_size epochs (stepwise decay).

```python
from torch.optim import lr_scheduler

scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
scheduler.step()  # call at the end of each epoch
```
Note: In modern PyTorch (≥ 1.1), call optimizer.step() before scheduler.step(), as sketched below.
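A minimal sketch of that ordering inside a full loop, reusing model, criterion, optimizer, and dataloader from the earlier snippets:

```python
for epoch in range(90):
    for x, y in dataloader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()      # update the weights first...
    scheduler.step()          # ...then advance the schedule, once per epoch
```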
6. lr_scheduler.CosineAnnealingLR()
Cosine annealing decay: the learning rate follows a cosine curve from the initial lr down to eta_min over T_max epochs, often giving smoother convergence than stepwise decay.

```python
scheduler = lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=50, eta_min=1e-6)
```
Note: Combine with warm restarts (CosineAnnealingWarmRestarts) to escape local optima; see the sketch below.
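A sketch of the warm-restart variant; the values are illustrative. T_0 is the length of the first cycle in epochs, and T_mult=2 doubles each subsequent cycle (10, 20, 40, ... epochs):

```python
scheduler = lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2, eta_min=1e-6)
```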
7. lr_scheduler.OneCycleLR()
Super-convergence training strategy: the learning rate rises and then falls within a single cycle, which can substantially shorten training.

```python
scheduler = lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01,
    steps_per_epoch=len(dataloader), epochs=10)
scheduler.step()  # call after every optimizer step (not every epoch)
```
Note: Set max_lr to the highest stable LR found by a learning rate finder; see the sketch below.
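PyTorch itself ships no LR finder, but the underlying idea is a simple range test: sweep the LR exponentially upward over a few dozen batches and note where the loss starts to blow up, then pick max_lr a bit below that point. A rough sketch of one such sweep, reusing model, criterion, optimizer, and dataloader from above:

```python
lr_lo, lr_hi, steps = 1e-7, 1.0, 100
for i, (x, y) in zip(range(steps), dataloader):
    lr = lr_lo * (lr_hi / lr_lo) ** (i / (steps - 1))  # exponential sweep
    for group in optimizer.param_groups:
        group["lr"] = lr
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    print(f"lr={lr:.2e}  loss={loss.item():.4f}")
```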
8. torch.optim.LBFGS()
Quasi-Newton second-order optimizer, suited to small datasets. Requires a closure function that re-evaluates the loss.

```python
optimizer = torch.optim.LBFGS(model.parameters(), lr=1)

def closure():
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    return loss

optimizer.step(closure)
```
Note: Preferred for Neural Style Transfer and other small-scale tasks that demand high-precision convergence.
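For tougher problems, LBFGS exposes a few extra knobs worth knowing. These keyword arguments do exist on torch.optim.LBFGS, but the values below are illustrative, not a recommendation:

```python
optimizer = torch.optim.LBFGS(
    model.parameters(),
    lr=1.0,
    max_iter=20,                     # inner iterations per step(closure) call
    history_size=100,                # memory used by the quasi-Newton approximation
    line_search_fn="strong_wolfe")   # often more stable than a fixed step size
```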
💡 One-line Takeaway
Default to AdamW + CosineAnnealingLR for Transformers, and SGD + StepLR for classic CNN image tasks.