
X. GPU Acceleration & Distributed Training#

1. Tensor.to() / .cuda() / .cpu()#

Moves tensors or models to a specified device (GPU/CPU); the fundamental operation for GPU training.
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
x = x.to(device)
result = output.cpu().numpy()  # move back to CPU before converting to NumPy
Note: The model and its data must live on the same device; mixing CPU and GPU tensors raises a RuntimeError.
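A minimal device-agnostic sketch (the `nn.Linear` model here is illustrative — substitute your own):

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(4, 2).to(device)    # move parameters to the chosen device
x = torch.randn(3, 4, device=device)  # create the input on the same device

out = model(x)  # works: model and data share a device

# move back to CPU before NumPy conversion
arr = out.detach().cpu().numpy()
print(arr.shape)  # (3, 2)
```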

2. torch.cuda.amp — Automatic Mixed Precision#

Automatically switches between FP16 and FP32 per operation, reducing VRAM usage and accelerating training.
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
optimizer.zero_grad()
with autocast():
    output = model(x)
    loss = criterion(output, y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
Note: GradScaler scales the loss to prevent FP16 gradient underflow. Recommended for all modern GPU training.
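The same mechanism is also exposed under the newer `torch.autocast` API, which additionally supports CPU with bfloat16 — a runnable sketch of that variant (no GradScaler is needed for bf16, since it keeps FP32's exponent range):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)
x = torch.randn(4, 8)

# On CPU, autocast lowers eligible ops (e.g. linear/matmul) to bfloat16.
with torch.autocast('cpu', dtype=torch.bfloat16):
    out = model(x)

print(out.dtype, out.shape)
```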

3. nn.DataParallel()#

Single-machine, multi-GPU data-parallel training. Automatically splits each batch across GPUs and aggregates the gradients.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to('cuda')
# Access the original (unwrapped) model
sd = model.module.state_dict() if isinstance(model, nn.DataParallel) else model.state_dict()
Note: DataParallel is single-process, so its efficiency is limited by the Python GIL. For large-scale training, use DistributedDataParallel (DDP).
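The unwrap-before-saving pattern matters in practice: checkpoints saved from a wrapped model carry a `module.` key prefix that a plain model cannot load. A runnable sketch (falls back to the bare model when fewer than two GPUs are visible):

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# Wrap only when multiple GPUs are present; otherwise keep the bare model.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).to('cuda')

# Unwrap before saving so the checkpoint keys have no 'module.' prefix
# and can later be loaded into a plain (non-DataParallel) model.
to_save = model.module if isinstance(model, nn.DataParallel) else model

buf = io.BytesIO()  # in-memory stand-in for a checkpoint file
torch.save(to_save.state_dict(), buf)
buf.seek(0)

state = torch.load(buf)
print(list(state.keys()))  # ['weight', 'bias'] — no 'module.' prefix
```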

4. nn.parallel.DistributedDataParallel() — DDP#

Distributed data parallelism with one process per GPU; its communication efficiency far exceeds DataParallel's.
import os
import torch.distributed as dist

dist.init_process_group('nccl')
local_rank = int(os.environ['LOCAL_RANK'])
model = model.to(local_rank)
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
Note: Launch with torchrun --nproc_per_node=4 train.py. Pair with DistributedSampler.
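DistributedSampler is what keeps the ranks from seeing the same data: each rank gets a disjoint shard of the dataset. A single-process sketch with `num_replicas`/`rank` passed explicitly (normally they are inferred from the initialized process group):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8).float())

# Pretend we are rank 0 of a 2-GPU job; shuffle disabled to show the sharding.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=False)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

print(list(sampler))  # [0, 2, 4, 6] — rank 0's shard of the 8 samples

# With shuffle=True, call sampler.set_epoch(epoch) at the start of each
# epoch so every epoch gets a different shuffle across ranks.
```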

5. torch.cuda.memory_summary()#

Prints detailed GPU VRAM usage to help diagnose Out-of-Memory (OOM) issues.
print(torch.cuda.memory_summary())
alloc = torch.cuda.memory_allocated()
total = torch.cuda.get_device_properties(0).total_memory
print(f'{alloc/1e9:.1f}GB used')
Note: After an OOM, torch.cuda.empty_cache() releases memory cached by the allocator, but it cannot free memory held by live tensors.
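A small helper along these lines (the `vram_report` name is illustrative) distinguishes memory held by live tensors from memory merely cached by the allocator, and degrades gracefully on CPU-only machines:

```python
import torch

def vram_report(device: int = 0) -> str:
    """One-line VRAM summary, or a note when no GPU is visible."""
    if not torch.cuda.is_available():
        return 'no CUDA device'
    alloc = torch.cuda.memory_allocated(device)    # held by live tensors
    reserved = torch.cuda.memory_reserved(device)  # cached by the allocator
    total = torch.cuda.get_device_properties(device).total_memory
    return (f'{alloc / 1e9:.2f}GB allocated / '
            f'{reserved / 1e9:.2f}GB reserved / {total / 1e9:.2f}GB total')

print(vram_report())
```

The gap between "reserved" and "allocated" is exactly what empty_cache() can return to the driver.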

6. torch.compile() (PyTorch 2.0+)#

Compiles the model into optimized kernels via graph capture (TorchDynamo) and code generation (TorchInductor/Triton), often significantly accelerating training and inference.
model = torch.compile(model)
# Different modes
model = torch.compile(model, mode='reduce-overhead', fullgraph=True)
Note: The first run pays a compilation (warmup) overhead. fullgraph=True forbids graph breaks for maximum performance.
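A compiled model is numerically a drop-in replacement for the original. The sketch below uses the debug backend `backend='eager'` so it runs even without a GPU/Triton toolchain — drop that argument for real training:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))

# 'eager' captures the graph with TorchDynamo but skips code generation,
# which makes the example portable; the default backend is 'inductor'.
compiled = torch.compile(model, backend='eager')

x = torch.randn(2, 8)
out_eager = model(x)
out_compiled = compiled(x)  # first call triggers graph capture (warmup)

print(torch.allclose(out_eager, out_compiled))  # True — same numerics
```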
💡 One-line Takeaway
Use AMP for every GPU training job; prefer DDP over DataParallel for multi-GPU; add torch.compile() as a one-line speed boost in PyTorch 2.x.

GPU Acceleration & Distributed Training
https://lxy-alexander.github.io/blog/posts/pytorch/api/10gpu-acceleration--distributed-training/
Author: Alexander Lee
Published: 2026-03-12
License: CC BY-NC-SA 4.0