
XV. Utilities & Performance Tips (实用工具与性能技巧)#

1. torch.no_grad() vs torch.inference_mode()#

`inference_mode` is more aggressive than `no_grad`: it also skips tensor version counting entirely, making it faster for pure inference.

```python
with torch.no_grad():
    out1 = model(x)

@torch.inference_mode()
def predict(x):
    return model(x)
```

Note: PyTorch 1.9+ recommends `inference_mode` for pure inference; it is faster than `no_grad`.
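A minimal runnable sketch of the practical difference (the toy model and input are illustrative): tensors produced under `inference_mode` are flagged as inference tensors and can never re-enter autograd, while `no_grad` outputs can.

```python
import torch

model = torch.nn.Linear(4, 2)
x = torch.randn(3, 4)

# no_grad: no graph is built, but version counters still update.
with torch.no_grad():
    out1 = model(x)

# inference_mode: additionally marks outputs as inference tensors.
with torch.inference_mode():
    out2 = model(x)

print(out1.requires_grad)        # False
print(torch.is_inference(out2))  # True
print(torch.is_inference(out1))  # False
```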

2. torch.Tensor.pin_memory()#

Pins a CPU tensor in page-locked memory (页锁定内存), dramatically accelerating CPU→GPU data transfer.

```python
loader = DataLoader(dataset, pin_memory=True, num_workers=4)
for x, y in loader:
    x = x.to('cuda', non_blocking=True)  # async transfer
```

Note: `pin_memory=True` + `non_blocking=True` enables CPU data loading to overlap with GPU computation (pipelining, 流水线).
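The DataLoader flag maps onto the tensor-level API `Tensor.pin_memory()`. A small guarded sketch (the helper name is made up for illustration; pinning requires a CUDA runtime, so this falls back to a no-op on CPU-only machines):

```python
import torch

def make_pinned_batch(batch: torch.Tensor) -> torch.Tensor:
    """Pin a CPU tensor so the CUDA driver can copy it asynchronously."""
    if torch.cuda.is_available():
        return batch.pin_memory()
    return batch  # pinning needs a CUDA runtime; no-op otherwise

batch = torch.randn(32, 3, 32, 32)
pinned = make_pinned_batch(batch)
# On a CUDA machine pinned.is_pinned() is True, and the subsequent
# pinned.to('cuda', non_blocking=True) returns before the copy finishes.
```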

3. torch.utils.checkpoint.checkpoint()#

Gradient Checkpointing (梯度检查点): trades recomputation for VRAM savings; it can cut activation memory by 50%+ for very large models.

```python
from torch.utils.checkpoint import checkpoint

def forward(self, x):
    x = checkpoint(self.heavy_block, x, use_reentrant=False)  # no intermediate activations saved
    return self.head(x)
```

Note: costs roughly 30% extra training time in exchange for drastically reduced VRAM, enabling training of models that would otherwise OOM.
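A self-contained version of the same pattern that runs end to end (the `Net` module and its sizes are illustrative; `use_reentrant=False` is the variant recent PyTorch releases recommend):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.heavy_block = nn.Sequential(
            nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
        self.head = nn.Linear(64, 10)

    def forward(self, x):
        # Activations inside heavy_block are dropped in the forward pass
        # and recomputed during backward.
        x = checkpoint(self.heavy_block, x, use_reentrant=False)
        return self.head(x)

net = Net()
x = torch.randn(8, 64, requires_grad=True)
loss = net(x).sum()
loss.backward()          # reruns heavy_block's forward internally
print(x.grad.shape)      # torch.Size([8, 64])
```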

4. torch.nn.functional.one_hot()#

Converts integer class indices to one-hot (独热编码) tensors.

```python
labels = torch.tensor([0, 2, 1, 3])
one_hot = F.one_hot(labels, num_classes=4).float()
# tensor([[1,0,0,0],[0,0,1,0],[0,1,0,0],[0,0,0,1]])
```

Note: the output defaults to `LongTensor`; call `.float()` before using it in loss computation.
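One place the `.float()` conversion matters is when building soft targets on top of the one-hot matrix, e.g. this hand-rolled label-smoothing sketch (note that `F.cross_entropy` also accepts a `label_smoothing=` argument directly):

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([0, 2, 1, 3])
one_hot = F.one_hot(labels, num_classes=4).float()

# Label smoothing: move eps of the probability mass off the true class
# and spread it uniformly over all 4 classes.
eps = 0.1
smoothed = one_hot * (1 - eps) + eps / 4
print(smoothed.sum(dim=1))  # each row still sums to 1.0
```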

5. torch.nn.functional.cosine_similarity()#

Computes cosine similarity (余弦相似度) between two batches of vectors. Core metric in contrastive learning (对比学习) — SimCLR, CLIP.

```python
a = torch.randn(8, 128)
b = torch.randn(8, 128)
sim = F.cosine_similarity(a, b, dim=-1)  # shape [8], range [-1, 1]

# Similarity matrix for contrastive learning: L2-normalize first,
# otherwise a @ b.T is a plain dot product, not cosine similarity
a = F.normalize(a, dim=-1)
b = F.normalize(b, dim=-1)
mat = a @ b.T  # [8, 8]
```

Note: equivalent to (a/||a||) · (b/||b||). In contrastive learning, L2-normalize first, then use the dot product.
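To connect this to SimCLR/CLIP-style training, here is a minimal InfoNCE sketch under the assumption that matched rows are positive pairs (the names `info_nce`, `za`, `zb`, and `tau` are illustrative, not a library API):

```python
import torch
import torch.nn.functional as F

def info_nce(za: torch.Tensor, zb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Row i of za and row i of zb are positives; all other rows are negatives."""
    za = F.normalize(za, dim=-1)
    zb = F.normalize(zb, dim=-1)
    logits = za @ zb.T / tau            # [N, N] cosine similarities / temperature
    targets = torch.arange(za.size(0))  # positive pair i <-> i on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item() > 0)  # True
```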

6. nn.SyncBatchNorm.convert_sync_batchnorm()#

Replaces every BatchNorm layer with cross-GPU synchronized BatchNorm (跨GPU同步批归一化). Essential for DDP training with BN.

```python
model = MyModel()
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)  # before DDP wrap
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```

Note: convert before wrapping with DDP. Without SyncBN, each GPU computes BN statistics from its local mini-batch only, which skews the estimates at small per-GPU batch sizes.
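The conversion itself needs no initialized process group, so you can sanity-check it on a single machine (the toy model below is illustrative):

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

# Swap every BatchNorm*d for SyncBatchNorm; weights and running stats are copied.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(type(model[1]).__name__)  # SyncBatchNorm
# In a real multi-GPU run, wrap *after* conversion:
# model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```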
💡 One-line Takeaway
Performance stack: `pin_memory` + `non_blocking` → AMP → `torch.compile` → gradient checkpointing (if OOM).
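As a hedged illustration of the AMP step in that stack (CPU bfloat16 autocast is shown so the snippet runs anywhere; on GPU you would use `device_type='cuda'`, typically with a `GradScaler` for fp16):

```python
import torch

model = torch.nn.Linear(64, 64)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x = torch.randn(8, 64)

# Ops inside the autocast region run in reduced precision where safe.
with torch.autocast(device_type='cpu', dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()  # the matmul runs in bfloat16
loss.backward()
opt.step()
opt.zero_grad()
print(torch.isfinite(loss).item())  # True
```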

🎯 Master Summary — 120 APIs in 6 Core Concepts

1. Tensor Ops: Create (tensor/zeros/rand) → Shape (reshape/permute/cat) → Math (matmul/clamp/topk)
2. Autograd: requires_grad → backward() → no_grad / detach
3. Networks: nn.Module → Layers (Linear/Conv2d/LSTM/Attention) → Norm + Dropout
4. Training: Loss → Optimizer (AdamW) → Scheduler → clip_grad_norm_
5. Data: Dataset → transforms → DataLoader → pretrained models
6. Deploy: state_dict → TorchScript / ONNX → AMP / DDP / torch.compile
https://lxy-alexander.github.io/blog/posts/pytorch/api/15utilities--performance-tips/
Author: Alexander Lee
Published: 2026-03-12
License: CC BY-NC-SA 4.0