XV. Utilities & Performance Tips
1. torch.no_grad() vs torch.inference_mode()
inference_mode is more aggressive than no_grad: it skips tensor version counting entirely, making pure inference faster.

```python
with torch.no_grad():
    out1 = model(x)

@torch.inference_mode()
def predict(x):
    return model(x)
```

Note: PyTorch 1.9+ recommends inference_mode for pure inference; it is faster than no_grad.
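The difference goes beyond speed. A minimal sketch (tensor names are illustrative) of the behavioral gap: outputs produced under no_grad can re-enter autograd later as constants, while inference-mode tensors cannot participate in gradient-recording computations at all.

```python
import torch

x = torch.ones(3)
w = torch.ones(3, requires_grad=True)

with torch.no_grad():
    a = x * 2   # ordinary tensor, simply not tracked

with torch.inference_mode():
    b = x * 2   # "inference tensor": no version counter, no autograd metadata

(a * w).sum().backward()      # fine: a re-enters autograd as a constant
# (b * w).sum().backward()    # RuntimeError: inference tensors cannot be
                              # used in autograd-recording computations
```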
2. torch.Tensor.pin_memory()
Pins a CPU tensor in page-locked memory, dramatically accelerating CPU→GPU data transfer.

```python
loader = DataLoader(dataset, pin_memory=True, num_workers=4)
for x, y in loader:
    x = x.to('cuda', non_blocking=True)  # async transfer
```

Note: pin_memory=True + non_blocking=True lets CPU data loading overlap with GPU computation (pipelining).
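The DataLoader flag is the common path, but the tensor method itself can also be called directly. A small sketch:

```python
import torch

t = torch.randn(1024, 1024)      # ordinary pageable CPU tensor
t_pinned = t.pin_memory()        # copies it into page-locked memory
print(t_pinned.is_pinned())      # True

if torch.cuda.is_available():
    # non_blocking=True only truly overlaps with compute when the
    # source tensor is pinned; otherwise the copy falls back to sync.
    g = t_pinned.to('cuda', non_blocking=True)
```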
3. torch.utils.checkpoint.checkpoint()
Gradient checkpointing trades recomputation for VRAM savings; it can save 50%+ VRAM for very large models.

```python
from torch.utils.checkpoint import checkpoint

def forward(self, x):
    x = checkpoint(self.heavy_block, x)  # intermediate activations not saved
    return self.head(x)
```

Note: Trades ~30% extra training time for drastically reduced VRAM, enabling training of models that would otherwise OOM.
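A self-contained sketch of the same pattern (layer sizes are arbitrary); passing use_reentrant=False is the variant recommended in recent PyTorch releases:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.heavy_block = nn.Sequential(
            nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
        self.head = nn.Linear(512, 10)

    def forward(self, x):
        # Activations inside heavy_block are recomputed during backward
        # instead of being stored.
        x = checkpoint(self.heavy_block, x, use_reentrant=False)
        return self.head(x)

out = Net()(torch.randn(4, 512, requires_grad=True))
out.sum().backward()
```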
4. torch.nn.functional.one_hot()
Converts integer class indices to one-hot encoded tensors.

```python
labels = torch.tensor([0, 2, 1, 3])
one_hot = F.one_hot(labels, num_classes=4).float()
# tensor([[1,0,0,0],[0,0,1,0],[0,1,0,0],[0,0,0,1]])
```

Note: The output is a LongTensor by default. Call .float() before it participates in loss computation.
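One place the .float() conversion matters is when the one-hot tensor enters a loss directly, e.g. a hand-rolled cross-entropy. A sketch:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 4)
labels = torch.tensor([0, 2, 1, 3])
targets = F.one_hot(labels, num_classes=4).float()

# Cross-entropy written against the one-hot floats; numerically this
# matches F.cross_entropy(logits, labels).
loss = -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```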
5. torch.nn.functional.cosine_similarity()
Computes cosine similarity between two batches of vectors. A core metric in contrastive learning (SimCLR, CLIP).

```python
a = torch.randn(8, 128)
b = torch.randn(8, 128)
sim = F.cosine_similarity(a, b, dim=-1)  # shape [8], range [-1, 1]

# Similarity matrix for contrastive learning: normalize first
a_n, b_n = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
mat = a_n @ b_n.T  # [8, 8]
```

Note: Equivalent to (a/||a||) · (b/||b||). In contrastive learning, L2-normalize first, then take dot products.
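As a sketch of how the similarity matrix is typically consumed, here is a minimal InfoNCE-style loss in the SimCLR/CLIP spirit (the temperature value is illustrative, and matched pairs are assumed to share an index):

```python
import torch
import torch.nn.functional as F

a = F.normalize(torch.randn(8, 128), dim=-1)   # e.g. image embeddings
b = F.normalize(torch.randn(8, 128), dim=-1)   # e.g. text embeddings

temperature = 0.07                 # illustrative value
logits = (a @ b.T) / temperature   # [8, 8] cosine similarities
targets = torch.arange(8)          # matched pairs sit on the diagonal
loss = F.cross_entropy(logits, targets)
```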
6. nn.SyncBatchNorm.convert_sync_batchnorm()
Replaces every BatchNorm layer with cross-GPU synchronized BatchNorm. Essential for DDP training with BN.

```python
model = MyModel()
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)  # before DDP wrap
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```

Note: Convert before wrapping with DDP. Without SyncBN, each GPU computes its own BN statistics, which becomes inaccurate when per-GPU batches are small.
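For context, a minimal sketch of where the conversion sits in a torchrun-launched script (assumes MyModel from the snippet above and the environment variables that torchrun sets):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])  # set by torchrun
torch.cuda.set_device(local_rank)

model = MyModel().to(local_rank)            # MyModel as defined above
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)  # before DDP wrap
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```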
💡 One-line Takeaway
Performance stack: pin_memory + non_blocking → AMP → torch.compile → gradient checkpointing (if OOM).
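A hedged sketch wiring the middle of that stack together: autocast-based AMP plus torch.compile (PyTorch 2.x) on top of the pinned, non-blocking data path from tip 2. The model and optimizer here are illustrative stand-ins.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()
model = torch.compile(model)               # PyTorch 2.x
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()
criterion = nn.CrossEntropyLoss()

for x, y in loader:                        # the pin_memory DataLoader from tip 2
    x = x.to('cuda', non_blocking=True)
    y = y.to('cuda', non_blocking=True)
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
```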
🎯 Master Summary: 120 APIs in 6 Core Concepts
1. Tensor Ops: Create (tensor/zeros/rand) → Shape (reshape/permute/cat) → Math (matmul/clamp/topk)
2. Autograd: requires_grad → backward() → no_grad / detach
3. Networks: nn.Module → Layers (Linear/Conv2d/LSTM/Attention) → Norm + Dropout
4. Training: Loss → Optimizer (AdamW) → Scheduler → clip_grad_norm_
5. Data: Dataset → transforms → DataLoader → pretrained models
6. Deploy: state_dict → TorchScript / ONNX → AMP / DDP / torch.compile