
X. GPU Acceleration & Distributed Training#

1. Tensor.to() / .cuda() / .cpu()#

Moves tensors or models to a specified device (GPU/CPU); the fundamental operation for GPU training.
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
x = x.to(device)
result = output.cpu().numpy()  # move back to CPU before converting to NumPy
Note: The model and its data must live on the same device; mixing CPU and GPU tensors raises a RuntimeError.
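A minimal device-agnostic sketch (the `nn.Linear` model here is illustrative — substitute your own):

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(4, 2).to(device)    # move parameters to the chosen device
x = torch.randn(3, 4, device=device)  # create the input on the same device

out = model(x)  # works: model and data share a device

# move back to CPU before NumPy conversion
arr = out.detach().cpu().numpy()
print(arr.shape)  # (3, 2)
```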

2. torch.cuda.amp — Automatic Mixed Precision#

Automatically switches between FP16 and FP32 per operation, reducing VRAM usage and accelerating training.
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
optimizer.zero_grad()
with autocast():
    output = model(x)
    loss = criterion(output, y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
Note: GradScaler scales the loss to prevent FP16 gradient underflow. Recommended for all modern GPU training.
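The same mechanism is also exposed under the newer `torch.autocast` API, which additionally supports CPU with bfloat16 — a runnable sketch of that variant (no GradScaler is needed for bf16, since it keeps FP32's exponent range):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)
x = torch.randn(4, 8)

# On CPU, autocast lowers eligible ops (e.g. linear/matmul) to bfloat16.
with torch.autocast('cpu', dtype=torch.bfloat16):
    out = model(x)

print(out.dtype, out.shape)
```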

3. nn.DataParallel()#

Single-machine, multi-GPU data-parallel training. Automatically splits each batch across GPUs and aggregates the gradients.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to('cuda')
# Access the original (unwrapped) model
sd = model.module.state_dict() if isinstance(model, nn.DataParallel) else model.state_dict()
Note: DataParallel is single-process, so its efficiency is limited by the Python GIL. For large-scale training, use DistributedDataParallel (DDP).
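The unwrap-before-saving pattern matters in practice: checkpoints saved from a wrapped model carry a `module.` key prefix that a plain model cannot load. A runnable sketch (falls back to the bare model when fewer than two GPUs are visible):

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# Wrap only when multiple GPUs are present; otherwise keep the bare model.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).to('cuda')

# Unwrap before saving so the checkpoint keys have no 'module.' prefix
# and can later be loaded into a plain (non-DataParallel) model.
to_save = model.module if isinstance(model, nn.DataParallel) else model

buf = io.BytesIO()  # in-memory stand-in for a checkpoint file
torch.save(to_save.state_dict(), buf)
buf.seek(0)

state = torch.load(buf)
print(list(state.keys()))  # ['weight', 'bias'] — no 'module.' prefix
```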

4. nn.parallel.DistributedDataParallel() — DDP#

Distributed data parallelism with one process per GPU; its communication efficiency far exceeds DataParallel's.
import os
import torch.distributed as dist

dist.init_process_group('nccl')
local_rank = int(os.environ['LOCAL_RANK'])
model = model.to(local_rank)
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
Note: Launch with torchrun --nproc_per_node=4 train.py. Pair with DistributedSampler.
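DistributedSampler is what keeps the ranks from seeing the same data: each rank gets a disjoint shard of the dataset. A single-process sketch with `num_replicas`/`rank` passed explicitly (normally they are inferred from the initialized process group):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8).float())

# Pretend we are rank 0 of a 2-GPU job; shuffle disabled to show the sharding.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=False)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

print(list(sampler))  # [0, 2, 4, 6] — rank 0's shard of the 8 samples

# With shuffle=True, call sampler.set_epoch(epoch) at the start of each
# epoch so every epoch gets a different shuffle across ranks.
```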

5. torch.cuda.memory_summary()#

Prints detailed GPU VRAM usage to help diagnose Out-of-Memory (OOM) issues.
print(torch.cuda.memory_summary())
alloc = torch.cuda.memory_allocated()
total = torch.cuda.get_device_properties(0).total_memory
print(f'{alloc/1e9:.1f}GB used')
Note: After an OOM, torch.cuda.empty_cache() releases memory cached by the allocator, but it cannot free memory held by live tensors.
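A small helper along these lines (the `vram_report` name is illustrative) distinguishes memory held by live tensors from memory merely cached by the allocator, and degrades gracefully on CPU-only machines:

```python
import torch

def vram_report(device: int = 0) -> str:
    """One-line VRAM summary, or a note when no GPU is visible."""
    if not torch.cuda.is_available():
        return 'no CUDA device'
    alloc = torch.cuda.memory_allocated(device)    # held by live tensors
    reserved = torch.cuda.memory_reserved(device)  # cached by the allocator
    total = torch.cuda.get_device_properties(device).total_memory
    return (f'{alloc / 1e9:.2f}GB allocated / '
            f'{reserved / 1e9:.2f}GB reserved / {total / 1e9:.2f}GB total')

print(vram_report())
```

The gap between "reserved" and "allocated" is exactly what empty_cache() can return to the driver.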

6. torch.compile() (PyTorch 2.0+)#

Compiles the model into optimized kernels via graph capture (TorchDynamo) and code generation (TorchInductor/Triton), often significantly accelerating training and inference.
model = torch.compile(model)
# Different modes
model = torch.compile(model, mode='reduce-overhead', fullgraph=True)
Note: The first run pays a compilation (warmup) overhead. fullgraph=True forbids graph breaks for maximum performance.
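A compiled model is numerically a drop-in replacement for the original. The sketch below uses the debug backend `backend='eager'` so it runs even without a GPU/Triton toolchain — drop that argument for real training:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))

# 'eager' captures the graph with TorchDynamo but skips code generation,
# which makes the example portable; the default backend is 'inductor'.
compiled = torch.compile(model, backend='eager')

x = torch.randn(2, 8)
out_eager = model(x)
out_compiled = compiled(x)  # first call triggers graph capture (warmup)

print(torch.allclose(out_eager, out_compiled))  # True — same numerics
```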
💡 One-line Takeaway
Use AMP for every GPU training job; prefer DDP over DataParallel for multi-GPU; add torch.compile() as a one-line speed boost in PyTorch 2.x.

GPU Acceleration & Distributed Training
https://lxy-alexander.github.io/blog/posts/pytorch/api/10gpu-acceleration--distributed-training/
Author: Alexander Lee
Published: 2026-03-12
License: CC BY-NC-SA 4.0