
VII. Optimizers & Learning Rate Schedulers#

1. torch.optim.SGD()#

Stochastic Gradient Descent. Supports momentum, weight decay, and Nesterov momentum.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9,
    weight_decay=1e-4, nesterov=True
)
Note: SGD+momentum is still common in CV; final accuracy sometimes surpasses Adam.
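For intuition, here is a rough sketch of the momentum update (dampening and weight decay omitted); sgd_momentum_step is an illustrative helper, not part of torch.optim.
import torch

def sgd_momentum_step(param, grad, velocity, lr=0.01, momentum=0.9, nesterov=True):
    # velocity is a running blend of past gradients; it smooths the update direction
    velocity = momentum * velocity + grad
    update = grad + momentum * velocity if nesterov else velocity  # Nesterov looks ahead
    return param - lr * update, velocity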

2. torch.optim.Adam()#

Adaptive Moment Estimation. Combines the advantages of AdaGrad and RMSProp. A sensible default optimizer for most tasks.
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-4
)
optimizer.zero_grad()
loss.backward()
optimizer.step()
Note: For NLP/Transformer scenarios, use AdamW (decoupled weight decay).

3. torch.optim.AdamW()#

Adam with decoupled weight decay: the decay is applied directly to the parameters instead of being added to the gradient as an L2 term. The go-to optimizer for training Transformers.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=5e-5, weight_decay=0.01
)  # standard config for BERT/GPT fine-tuning
Note: In the original Adam, the L2 penalty is added to the gradient and is therefore rescaled by the adaptive learning rate; AdamW fixes this by applying weight decay directly to the parameters.
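The difference is easiest to see side by side. A minimal sketch of the two update rules (bias correction omitted); adam_l2_step and adamw_step are illustrative helpers, not library code.
import torch

def adam_l2_step(p, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, wd=1e-2, eps=1e-8):
    grad = grad + wd * p                      # L2 term folded into the gradient...
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    p = p - lr * m / (v.sqrt() + eps)         # ...so the decay gets rescaled by v
    return p, m, v

def adamw_step(p, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, wd=1e-2, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    p = p - lr * (m / (v.sqrt() + eps) + wd * p)  # decoupled decay, independent of v
    return p, m, v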

4. optimizer.zero_grad()#

Clears the gradient buffers of all parameters. Because PyTorch accumulates gradients across backward() calls, it must be called once per iteration, before the backward pass.
for epoch in range(10):
    for x, y in dataloader:
        optimizer.zero_grad()        # 1. clear
        pred = model(x)              # 2. forward
        loss = criterion(pred, y)
        loss.backward()              # 3. backward
        optimizer.step()             # 4. update
Note: zero_grad(set_to_none=True) (available since PyTorch 1.7, the default since 2.0) sets gradients to None instead of zeroing them, which lowers memory use and gives a modest speedup.
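For example, only the keyword argument changes in the loop above:
optimizer.zero_grad(set_to_none=True)  # grads become None rather than zero-filled tensors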

5. lr_scheduler.StepLR()#

Multiplies the learning rate by gamma every step_size epochs (stepwise decay).
from torch.optim import lr_scheduler
scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# Call at end of each epoch:
scheduler.step()
Note: Since PyTorch 1.1.0, call optimizer.step() before scheduler.step(); the reverse order skips the first value of the schedule and triggers a warning.
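A minimal sketch of where the two calls go, assuming model, criterion, dataloader, and an optimizer with base lr=0.01 are already defined:
scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
for epoch in range(90):
    for x, y in dataloader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()       # once per batch
    scheduler.step()           # once per epoch, after optimizer.step()
# LR: 1e-2 for epochs 0-29, 1e-3 for 30-59, 1e-4 for 60-89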

6. lr_scheduler.CosineAnnealingLR()#

Cosine annealing: the LR decays from its initial value to eta_min along a cosine curve over T_max steps. Often converges more smoothly than stepwise decay.
scheduler = lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=50, eta_min=1e-6
)
Note: Use the warm-restart variant (CosineAnnealingWarmRestarts) to periodically reset the LR, which can help escape poor local minima.
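A minimal sketch of the warm-restart variant: T_0 is the length of the first cycle and T_mult stretches each subsequent cycle.
scheduler = lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2, eta_min=1e-6
)  # cycles of 10, 20, 40, ... epochs, each restarting from the base LR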

7. lr_scheduler.OneCycleLR()#

Super-convergence training strategy: the LR ramps up to max_lr and then anneals back down within a single cycle spanning the whole run, which can significantly shorten training time.
scheduler = lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01,
    steps_per_epoch=len(loader), epochs=10
)
scheduler.step() # call after every step (not epoch)
Note: Set max_lr to the highest stable LR found by a learning rate finder.
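A minimal sketch of the per-batch call pattern, assuming model, criterion, loader, and optimizer are already defined:
scheduler = lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, steps_per_epoch=len(loader), epochs=10
)
for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()   # once per batch; total calls must match epochs * steps_per_epoch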

8. torch.optim.LBFGS()#

Quasi-Newton second-order optimizer. Suited to small, full-batch problems. Requires a closure that re-evaluates the model and returns the loss, since L-BFGS may evaluate the objective multiple times per parameter update.
optimizer = torch.optim.LBFGS(model.parameters(), lr=1)

def closure():
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    return loss

optimizer.step(closure)
Note: Preferred for Neural Style Transfer and other small-scale tasks that need high-precision convergence.
💡 One-line Takeaway
Default to AdamW + CosineAnnealingLR for Transformers, and SGD + StepLR for classic CNN image tasks.
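As a concrete starting point, a minimal sketch of the Transformer-side default, assuming model and num_epochs are defined (the hyperparameters are illustrative, not prescriptive):
import torch
from torch.optim import lr_scheduler

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=1e-6)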
