XV. Utilities & Performance Tips
1. torch.no_grad() vs torch.inference_mode()
inference_mode is more aggressive than no_grad: it skips tensor version counting entirely, making pure inference faster.

```python
with torch.no_grad():
    out1 = model(x)

@torch.inference_mode()
def predict(x):
    return model(x)
```

Note: PyTorch 1.9+ recommends inference_mode for pure inference; it is faster than no_grad.
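The difference goes beyond speed. A minimal sketch (tensor names are illustrative) of the behavioral gap: outputs produced under no_grad can re-enter autograd later as constants, while inference-mode tensors cannot participate in gradient-recording computations at all.

```python
import torch

x = torch.ones(3)
w = torch.ones(3, requires_grad=True)

with torch.no_grad():
    a = x * 2   # ordinary tensor, simply not tracked

with torch.inference_mode():
    b = x * 2   # "inference tensor": no version counter, no autograd metadata

(a * w).sum().backward()      # fine: a re-enters autograd as a constant
# (b * w).sum().backward()    # RuntimeError: inference tensors cannot be
                              # used in autograd-recording computations
```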
2. torch.Tensor.pin_memory()
Pins a CPU tensor in page-locked memory, dramatically accelerating CPU→GPU data transfer.

```python
loader = DataLoader(dataset, pin_memory=True, num_workers=4)
for x, y in loader:
    x = x.to('cuda', non_blocking=True)  # async transfer
```

Note: pin_memory=True + non_blocking=True lets CPU data loading overlap with GPU computation (pipelining).
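The DataLoader flag is the common path, but the tensor method itself can also be called directly. A small sketch:

```python
import torch

t = torch.randn(1024, 1024)      # ordinary pageable CPU tensor
t_pinned = t.pin_memory()        # copies it into page-locked memory
print(t_pinned.is_pinned())      # True

if torch.cuda.is_available():
    # non_blocking=True only truly overlaps with compute when the
    # source tensor is pinned; otherwise the copy falls back to sync.
    g = t_pinned.to('cuda', non_blocking=True)
```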
3. torch.utils.checkpoint.checkpoint()
Gradient checkpointing trades recomputation for VRAM savings; it can save 50%+ VRAM for very large models.

```python
from torch.utils.checkpoint import checkpoint

def forward(self, x):
    x = checkpoint(self.heavy_block, x)  # intermediate activations not saved
    return self.head(x)
```

Note: Trades ~30% extra training time for drastically reduced VRAM, enabling training of models that would otherwise OOM.
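A self-contained sketch of the same pattern (layer sizes are arbitrary); passing use_reentrant=False is the variant recommended in recent PyTorch releases:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.heavy_block = nn.Sequential(
            nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
        self.head = nn.Linear(512, 10)

    def forward(self, x):
        # Activations inside heavy_block are recomputed during backward
        # instead of being stored.
        x = checkpoint(self.heavy_block, x, use_reentrant=False)
        return self.head(x)

out = Net()(torch.randn(4, 512, requires_grad=True))
out.sum().backward()
```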
4. torch.nn.functional.one_hot()
Converts integer class indices to one-hot encoded tensors.

```python
labels = torch.tensor([0, 2, 1, 3])
one_hot = F.one_hot(labels, num_classes=4).float()
# tensor([[1,0,0,0],[0,0,1,0],[0,1,0,0],[0,0,0,1]])
```

Note: The output is a LongTensor by default. Call .float() before it participates in loss computation.
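One place the .float() conversion matters is when the one-hot tensor enters a loss directly, e.g. a hand-rolled cross-entropy. A sketch:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 4)
labels = torch.tensor([0, 2, 1, 3])
targets = F.one_hot(labels, num_classes=4).float()

# Cross-entropy written against the one-hot floats; numerically this
# matches F.cross_entropy(logits, labels).
loss = -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```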
5. torch.nn.functional.cosine_similarity()
Computes cosine similarity between two batches of vectors. A core metric in contrastive learning (SimCLR, CLIP).

```python
a = torch.randn(8, 128)
b = torch.randn(8, 128)
sim = F.cosine_similarity(a, b, dim=-1)  # shape [8], range [-1, 1]

# Similarity matrix for contrastive learning: normalize first
a_n, b_n = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
mat = a_n @ b_n.T  # [8, 8]
```

Note: Equivalent to (a/||a||) · (b/||b||). In contrastive learning, L2-normalize first, then take dot products.
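As a sketch of how the similarity matrix is typically consumed, here is a minimal InfoNCE-style loss in the SimCLR/CLIP spirit (the temperature value is illustrative, and matched pairs are assumed to share an index):

```python
import torch
import torch.nn.functional as F

a = F.normalize(torch.randn(8, 128), dim=-1)   # e.g. image embeddings
b = F.normalize(torch.randn(8, 128), dim=-1)   # e.g. text embeddings

temperature = 0.07                 # illustrative value
logits = (a @ b.T) / temperature   # [8, 8] cosine similarities
targets = torch.arange(8)          # matched pairs sit on the diagonal
loss = F.cross_entropy(logits, targets)
```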
6. nn.SyncBatchNorm.convert_sync_batchnorm()
Replaces every BatchNorm layer with cross-GPU synchronized BatchNorm. Essential for DDP training with BN.

```python
model = MyModel()
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)  # before DDP wrap
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```

Note: Convert before wrapping with DDP. Without SyncBN, each GPU computes its own BN statistics, which becomes inaccurate when per-GPU batches are small.
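For context, a minimal sketch of where the conversion sits in a torchrun-launched script (assumes MyModel from the snippet above and the environment variables that torchrun sets):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])  # set by torchrun
torch.cuda.set_device(local_rank)

model = MyModel().to(local_rank)            # MyModel as defined above
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)  # before DDP wrap
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```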
💡 One-line Takeaway
Performance stack: pin_memory + non_blocking → AMP → torch.compile → gradient checkpointing (if OOM).
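A hedged sketch wiring the middle of that stack together: autocast-based AMP plus torch.compile (PyTorch 2.x) on top of the pinned, non-blocking data path from tip 2. The model and optimizer here are illustrative stand-ins.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()
model = torch.compile(model)               # PyTorch 2.x
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()
criterion = nn.CrossEntropyLoss()

for x, y in loader:                        # the pin_memory DataLoader from tip 2
    x = x.to('cuda', non_blocking=True)
    y = y.to('cuda', non_blocking=True)
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
```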
🎯 Master Summary: 120 APIs in 6 Core Concepts
1. Tensor Ops: Create (tensor/zeros/rand) → Shape (reshape/permute/cat) → Math (matmul/clamp/topk)
2. Autograd: requires_grad → backward() → no_grad / detach
3. Networks: nn.Module → Layers (Linear/Conv2d/LSTM/Attention) → Norm + Dropout
4. Training: Loss → Optimizer (AdamW) → Scheduler → clip_grad_norm_
5. Data: Dataset → transforms → DataLoader → pretrained models
6. Deploy: state_dict → TorchScript / ONNX → AMP / DDP / torch.compile