Advanced Features & Utilities

XI. Advanced Features & Utilities (高级特性与实用工具)
1. torch.nn.utils.clip_grad_norm_()
Clips the global L2 norm of all parameter gradients to prevent Gradient Explosion (梯度爆炸). Essential for RNN/Transformer training.
```python
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```
Note: `max_norm=1.0` is standard for Transformers. Must be called after `backward()` and before `step()`.

2. torch.nn.utils.weight_norm()
Decomposes parameters into direction and magnitude, accelerating convergence. Used in WaveNet and generative models.
```python
from torch.nn.utils import weight_norm, remove_weight_norm

wn_conv = weight_norm(nn.Conv1d(64, 64, 3, padding=1))
remove_weight_norm(wn_conv)  # merge before deployment
```
Note: Always call `remove_weight_norm` before deployment to merge the decomposed parameters back into a single weight.

3. torch.nn.functional.interpolate()
Upsamples or downsamples feature maps using interpolation modes such as bilinear, nearest, and bicubic.
```python
x = torch.rand(1, 64, 28, 28)
up = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
print(up.shape)  # [1, 64, 56, 56]
```
Note: `align_corners=False` matches TensorFlow's default behavior, which matters when migrating models.

4. torch.nn.functional.grid_sample()
Samples from a feature map at normalized grid coordinates. The core of Spatial Transformer Networks (空间变换网络, STN).
```python
theta = torch.eye(2, 3, dtype=torch.float).unsqueeze(0)  # identity affine transform
grid = F.affine_grid(theta, x.size())
out = F.grid_sample(x, grid, mode='bilinear')
```
Note: Grid coordinates are normalized to [-1, 1]. `padding_mode='reflection'` gives more natural results at image borders.

5. torch.einsum()
Einstein Summation (爱因斯坦求和): expresses complex tensor operations as a concise string equation.
```python
c = torch.einsum('ij,jk->ik', a, b)  # matrix multiplication

# Attention scores: Q:[B,H,L,D], K:[B,H,M,D]
scores = torch.einsum('bhld,bhmd->bhlm', Q, K)  # [B,H,L,M]
```
Note: Mastering `einsum` dramatically simplifies Transformer and graph neural network code.

6. torch.profiler.profile()
Performance profiler that records per-operator CPU/GPU time and memory usage to locate bottlenecks.
```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x)
print(prof.key_averages().table(sort_by='cuda_time_total'))
```
Note: Export a Chrome trace with `prof.export_chrome_trace()` for visual bottleneck analysis in a browser.

7. torch.nn.init.*
Provides Xavier, Kaiming, Orthogonal and other Parameter Initialization (参数初始化) strategies. Directly impacts training stability.
```python
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
        nn.init.zeros_(m.bias)

model.apply(init_weights)
```
Note: Sigmoid/Tanh → Xavier; ReLU family → Kaiming; RNN/LSTM recurrent weights → Orthogonal initialization.
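As a quick check of the rules of thumb above, a minimal sketch (hypothetical 256-unit linear layer) showing that fan_out Kaiming gives weights with std ≈ √(2/fan_out), while orthogonal init produces an orthonormal matrix:

```python
import torch
import torch.nn as nn

lin = nn.Linear(256, 256, bias=False)  # hypothetical layer size

nn.init.kaiming_normal_(lin.weight, mode='fan_out', nonlinearity='relu')
# Kaiming (fan_out) samples from N(0, 2 / fan_out): std ≈ sqrt(2 / 256) ≈ 0.088
kaiming_std = lin.weight.std().item()

nn.init.orthogonal_(lin.weight)
# An orthogonal matrix satisfies W @ W.T = I
eye_err = (lin.weight @ lin.weight.T - torch.eye(256)).abs().max().item()
print(f'kaiming std: {kaiming_std:.3f}, orthogonality error: {eye_err:.1e}')
```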
8. torch.Tensor.item()
Converts a single-element Tensor to a Python scalar. Commonly used to log loss values.
```python
loss = criterion(output, label)
loss_val = loss.item()  # detaches from the graph
print(f'Loss: {loss_val:.4f}')

# WRONG:   total_loss += loss        (graph keeps growing → OOM)
# CORRECT: total_loss += loss.item()
```
Note: Accumulating loss tensors directly keeps every iteration's computation graph alive, eventually causing OOM. Always use `.item()`.

9. torch.Tensor.clone()
Creates a deep copy (深拷贝) of a Tensor with fully independent data, while preserving gradient propagation.
```python
x = torch.rand(3, requires_grad=True)
y = x.clone()           # gradients still propagate back to x
z = x.detach().clone()  # detached from the graph

buf = torch.empty(3)
buf.copy_(x)            # in-place copy
```
Note: Use `clone().detach()` when copying parameters to a Target Network (目标网络) in RL or for momentum (EMA) updates.

10. torch.where()
Conditional selection: returns values from `x` where the condition is True, otherwise from `y`. A vectorized if-else.
```python
x = torch.tensor([-1., 2., -3., 4.])
y = torch.zeros_like(x)
out = torch.where(x > 0, x, y)  # tensor([0., 2., 0., 4.]), a manual ReLU
```
Note: Both branches participate in gradient computation; the unselected branch receives a zero gradient.
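To illustrate the gradient behavior just noted, a small sketch: positions where a branch was not selected receive exactly zero gradient.

```python
import torch

x = torch.tensor([-1., 2., -3., 4.], requires_grad=True)
out = torch.where(x > 0, x, torch.zeros_like(x))
out.sum().backward()

# Gradient is 1 where x was selected (x > 0) and 0 elsewhere
print(x.grad)  # tensor([0., 1., 0., 1.])
```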
11. torch.gather()
Gathers values from a source Tensor along a dimension according to an index Tensor, enabling irregular indexing (不规则索引) such as in NMS.
```python
logits = torch.rand(4, 10)
targets = torch.tensor([3, 7, 1, 5]).unsqueeze(1)  # [4, 1]
scores = logits.gather(dim=1, index=targets)       # [4, 1]
```
Note: A core tool for sequence decoding, top-k sampling, and Q-value selection in Reinforcement Learning (强化学习).
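The Q-value selection mentioned in the note can be sketched like this (hypothetical batch of Q-values and actions):

```python
import torch

# Hypothetical batch: Q-values for 3 states x 4 actions, and the actions taken
q_values = torch.tensor([[1., 2., 3., 4.],
                         [5., 6., 7., 8.],
                         [9., 0., 1., 2.]])
actions = torch.tensor([2, 0, 3])

# Pick each row's Q-value at the taken action: Q(s_i, a_i)
q_taken = q_values.gather(dim=1, index=actions.unsqueeze(1)).squeeze(1)
print(q_taken)  # tensor([3., 5., 2.])
```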
12. torch.scatter_()
Scatters values from `src` into `self` at the positions specified by `index` (in-place).
```python
y = torch.zeros(4, 5)
labels = torch.tensor([[2], [0], [4], [1]])
y.scatter_(dim=1, index=labels, value=1.0)  # one-hot encoding
```
Note: `scatter_add_` implements segment sum, the fundamental primitive for Graph Neural Network (图神经网络) message aggregation.

13. torch.masked_fill()
Fills positions where the mask is True with a specified value. Essential for Attention Masking (注意力掩码).
```python
L = 5
mask = torch.triu(torch.ones(L, L), diagonal=1).bool()
scores = torch.rand(L, L)
scores = scores.masked_fill(mask, float('-inf'))  # causal mask
```
Note: The Transformer decoder's Causal Self-Attention (因果自注意力) relies on this to block information from future positions.
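To complete the picture, a sketch applying softmax to the masked scores: the `-inf` entries become exactly zero attention weight, so each position attends only to itself and earlier positions.

```python
import torch
import torch.nn.functional as F

L = 5
mask = torch.triu(torch.ones(L, L), diagonal=1).bool()
scores = torch.rand(L, L).masked_fill(mask, float('-inf'))
attn = F.softmax(scores, dim=-1)

# Row 0 can only attend to position 0, so attn[0] is [1, 0, 0, 0, 0];
# every row still sums to 1, and all masked (future) positions are exactly 0
print(attn)
```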
14. torch.nn.utils.rnn.pad_sequence()
Pads a list of variable-length sequences to a uniform-length tensor for NLP batch processing.
```python
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]
padded = pad_sequence(seqs, batch_first=True)  # shape [3, 3]
```
Note: Combine with `pack_padded_sequence` so the LSTM skips computation at padding positions.

15. nn.TransformerEncoder()
Multi-layer Transformer Encoder with built-in Multi-head Attention, FFN, residual connections, and LayerNorm.
```python
enc_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
out = encoder(src, src_key_padding_mask=m)
```
Note: `norm_first=True` (Pre-LN) trains more stably and is recommended for large models.

16. torch.Tensor.contiguous()
Returns a Contiguous Memory (连续内存) copy of the Tensor. Returns itself (zero overhead) if already contiguous.
```python
x = torch.rand(4, 5, 6)
y = x.permute(2, 0, 1)    # non-contiguous view
print(y.is_contiguous())  # False
z = y.contiguous()
w = z.view(6, -1)         # view is safe now
```
Note: In most cases `reshape` handles this automatically; call `contiguous()` manually only when contiguous memory is truly required.

17. torch.Tensor.type() / .to(dtype)
Converts the Tensor's Data Type (数据类型): float32 ↔ float16 ↔ int64, etc.
```python
x = torch.tensor([1, 2, 3])
f = x.float()                  # int64 → float32
h = f.half()                   # float32 → float16
l = x.long()                   # → int64
y = x.to(dtype=torch.float32)  # recommended form
```
Note: Cross-entropy labels need `long()`; normalized image pixels need `float()`; inference acceleration uses `half()`.

💡 One-line Takeaway
The advanced toolkit:
`clip_grad_norm_` (stability), `einsum` (clarity), `gather`/`scatter` (indexing), `masked_fill` (attention), `item()` (memory safety).