GPU Acceleration & Distributed Training

X. GPU Acceleration & Distributed Training
1. Tensor.to() / .cuda() / .cpu()
Moves tensors or models to a specified device (GPU/CPU). The fundamental operation for GPU training.

```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
x = x.to(device)
result = output.cpu().numpy()  # move back to CPU
```

Note: Model and data must be on the same device; mixing CPU and GPU tensors raises a RuntimeError.
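As a runnable illustration of the full round trip (the `Linear` model and shapes here are made up for the example): a tensor that is part of the autograd graph must also be `detach()`ed before `numpy()`.

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.nn.Linear(3, 2).to(device)   # parameters move to the device
x = torch.randn(4, 3).to(device)           # input must live on the same device
out = model(x)
arr = out.detach().cpu().numpy()           # detach from autograd, then back to host
```

`.cpu()` is a no-op when the tensor already lives on the CPU, so this pattern is device-agnostic.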
2. torch.cuda.amp — Automatic Mixed Precision
Automatically switches between FP16 and FP32, reducing VRAM usage and accelerating training.
```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
optimizer.zero_grad()
with autocast():
    output = model(x)
    loss = criterion(output, y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```
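The autocast mechanics can be tried without a GPU: `torch.autocast` also supports `device_type='cpu'` with bfloat16. A minimal sketch (the model and shapes are illustrative, not from the original):

```python
import torch

model = torch.nn.Linear(8, 4)
x = torch.randn(2, 8)
with torch.autocast(device_type='cpu', dtype=torch.bfloat16):
    out = model(x)  # matmul-heavy ops run in bfloat16 under autocast
```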
Note: GradScaler prevents FP16 gradient underflow. Recommended for all modern GPU training; recent PyTorch versions expose the same APIs under torch.amp.
3. nn.DataParallel()
Single-machine, multi-GPU data-parallel training. Automatically splits batches across GPUs and aggregates gradients.
```python
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to('cuda')

# Access the original model
sd = model.module.state_dict() if isinstance(model, nn.DataParallel) else model.state_dict()
```

Note: DataParallel's efficiency is limited by the Python GIL. For large-scale training, use DistributedDataParallel (DDP).
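A related pitfall: checkpoints saved from a DataParallel-wrapped model have every key prefixed with `module.`, so loading them into an unwrapped model fails. A sketch of the usual fix (the `Linear` model is just a stand-in):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
# simulate a checkpoint written from a DataParallel-wrapped model
ckpt = {'module.' + k: v for k, v in model.state_dict().items()}

# strip the 'module.' prefix before loading into a plain model
clean = {k.removeprefix('module.'): v for k, v in ckpt.items()}
model.load_state_dict(clean)
```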
4. nn.parallel.DistributedDataParallel() — DDP
Distributed Data Parallel: one process per GPU. Communication efficiency far exceeds DataParallel.
```python
import os
import torch.distributed as dist

dist.init_process_group('nccl')
local_rank = int(os.environ['LOCAL_RANK'])
model = model.to(local_rank)
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```
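DDP parallelizes only the model; the input pipeline needs a DistributedSampler so each rank draws a distinct shard. This sketch runs without an initialized process group by passing `num_replicas` and `rank` explicitly (in real DDP code they come from the process group, and the dataset here is a placeholder):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(10).float())
sampler = DistributedSampler(dataset, num_replicas=2, rank=0)  # this rank's shard
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

sampler.set_epoch(0)  # call once per epoch so shuffling differs across epochs
batches = list(loader)
```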
Note: Launch with `torchrun --nproc_per_node=4 train.py`. Pair with DistributedSampler so each process sees a distinct data shard.
5. torch.cuda.memory_summary()
Prints detailed GPU VRAM usage to help diagnose out-of-memory (OOM) issues.
```python
print(torch.cuda.memory_summary())

alloc = torch.cuda.memory_allocated()
total = torch.cuda.get_device_properties(0).total_memory
print(f'{alloc/1e9:.1f}GB used')
```
Note: After an OOM, torch.cuda.empty_cache() releases cached blocks back to the driver, but it cannot free memory that is still in active use.
6. torch.compile() (PyTorch 2.0+)
Captures the model's computation graph and compiles it into optimized (e.g. Triton-generated) kernels, often substantially accelerating training and inference.
```python
model = torch.compile(model)

# Different modes
model = torch.compile(model, mode='reduce-overhead', fullgraph=True)
```

Note: The first run incurs compilation (warm-up) overhead. fullgraph=True forbids graph breaks for maximum performance.
💡 One-line Takeaway
Use AMP for every GPU training job; prefer DDP over DataParallel for multi-GPU; add torch.compile() as a one-line speed boost in PyTorch 2.x.
https://lxy-alexander.github.io/blog/posts/pytorch/api/10gpu-acceleration--distributed-training/