NumPy Statistical Analysis

IV. NumPy Statistical Analysis
NumPy's statistical functions let you summarize an array in one line of code. Most of them accept an axis argument: without it, they operate on all elements; with it, they reduce along the specified dimension.
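As a minimal sketch of what "reduce along the specified dimension" means, the snippet below compares result shapes with and without axis. The array name `scores` and its values are illustrative assumptions; `keepdims` is an optional NumPy argument that the text above does not cover.

```python
import numpy as np

# Hypothetical 2x3 array: 2 rows, 3 columns
scores = np.array([[1, 2, 3],
                   [4, 5, 6]])

total = np.sum(scores)               # no axis: a single scalar over all elements
per_column = np.sum(scores, axis=0)  # axis 0 is removed, leaving shape (3,)
per_row = np.sum(scores, axis=1)     # axis 1 is removed, leaving shape (2,)

# keepdims=True keeps the reduced axis with length 1, which helps with broadcasting
per_row_kept = np.sum(scores, axis=1, keepdims=True)  # shape (2, 1)

print(total, per_column.shape, per_row.shape, per_row_kept.shape)
```

A useful rule of thumb: the axis you pass is the one that disappears from the result's shape.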
1. sum() / mean(): Total & Average
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
np.sum(a)           # 21 (sum of all elements)
np.sum(a, axis=0)   # [5 7 9] (column sums)
np.sum(a, axis=1)   # [ 6 15] (row sums)
np.mean(a)          # 3.5
np.mean(a, axis=0)  # [2.5 3.5 4.5]
Note: axis=0 collapses the rows (the operation runs down each column); axis=1 collapses the columns (the operation runs across each row).
2. min() / max(): Extreme Values
np.min(a)           # 1
np.max(a)           # 6
np.min(a, axis=1)   # [1 4] (min of each row)
np.max(a, axis=0)   # [4 5 6] (max of each column)
np.ptp(a)           # 5 (peak-to-peak range: max - min)
3. argmin() / argmax(): Index of Extreme Values
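Core idea: Both functions return the index of the minimum or maximum element, not the value itself. When the extreme value occurs more than once, the index of the first occurrence is returned.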
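The difference between an index and a value can be sketched as follows. The arrays `readings` and `grid` are made-up examples, and `np.unravel_index` is brought in here as an assumption about how you might turn a flattened index back into row and column coordinates.

```python
import numpy as np

# Hypothetical readings; the maximum value 9 appears twice
readings = np.array([3, 9, 4, 9, 1])

idx = np.argmax(readings)   # position of the first maximum
value = readings[idx]       # index back into the array to get the value itself

# Without axis, argmax on a 2-D array returns an index into the flattened array
grid = np.array([[1, 7, 3],
                 [4, 5, 6]])
flat_idx = np.argmax(grid)
row, col = np.unravel_index(flat_idx, grid.shape)  # convert to (row, column)

print(idx, value, flat_idx, (row, col))
```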
b = np.array([3, 1, 4, 1, 5, 9, 2])
np.argmin(b)  # 1 (index of the first occurrence of the minimum value 1)
np.argmax(b)  # 5 (index of the maximum value 9)
# Along an axis (a is the 2x3 array from section 1)
np.argmax(a, axis=0)  # [1 1 1] (row index of the max in each column)
4. std() / var(): Spread Measures
Core idea: Measure how spread out the data is. Variance is the mean squared deviation from the mean, and standard deviation = sqrt(variance).
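To make the definitions concrete, the sketch below computes the variance and standard deviation by hand and compares them with NumPy's results. The values in `data` are chosen purely for illustration, and the variable names are assumptions rather than anything NumPy defines.

```python
import numpy as np

# Illustrative values only
data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

mean = data.sum() / data.size
squared_dev = (data - mean) ** 2   # squared distance of each value from the mean

pop_var = squared_dev.sum() / data.size            # divide by N (ddof=0)
sample_var = squared_dev.sum() / (data.size - 1)   # divide by N - 1 (ddof=1)
pop_std = np.sqrt(pop_var)                         # std is the square root of the variance

print(pop_var, pop_std, sample_var)
print(np.isclose(pop_var, np.var(data)),
      np.isclose(sample_var, np.var(data, ddof=1)))
```

The only difference between the population and sample versions is the divisor, which is what the ddof argument controls.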
a = np.array([2, 4, 4, 4, 5, 5, 7, 9])
np.std(a)  # 2.0 (population standard deviation)
np.var(a)  # 4.0 (population variance)
# Sample std/var: use ddof=1
np.std(a, ddof=1)  # 2.138...
np.var(a, ddof=1)  # 4.571...
Note: The default ddof=0 gives population statistics (divide by N). Use ddof=1 for sample statistics (divide by N - 1), which is common in data analysis.
5. cumsum() / cumprod(): Cumulative Functions
Core idea: These functions return running totals: each output element is the sum (or product) of all elements up to and including that position.
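One way to see what a running total is: build it with an explicit loop and compare it with `np.cumsum`. The array `daily_sales` is a hypothetical example, and using `np.diff` to undo the accumulation is an illustrative addition, not something the text above relies on.

```python
import numpy as np

daily_sales = np.array([1, 2, 3, 4])   # hypothetical values

running = np.cumsum(daily_sales)

# The same result built step by step
manual = []
total = 0
for x in daily_sales:
    total += x            # add the current element to everything seen so far
    manual.append(total)

# np.diff with prepend=0 recovers the original values from the running sum
recovered = np.diff(running, prepend=0)

print(running, manual, recovered)
print(running[-1] == daily_sales.sum())   # the last running total is the overall sum
```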
a = np.array([1, 2, 3, 4])
np.cumsum(a)   # [ 1  3  6 10] (running sum)
np.cumprod(a)  # [ 1  2  6 24] (running product)
# 2-D with axis
m = np.array([[1, 2], [3, 4]])
np.cumsum(m, axis=0)  # [[1 2] [4 6]] (cumulative sum down each column)
6. median() / percentile(): Percentile Stats
a = np.array([1, 2, 3, 4, 5])
np.median(a)                    # 3.0 (middle value)
np.percentile(a, 25)            # 2.0 (25th percentile, the first quartile)
np.percentile(a, [25, 50, 75])  # [2. 3. 4.]
7. Quick Comparison Table
| Function | Returns | axis support? |
|---|---|---|
| sum() | Total of elements | ✅ |
| mean() | Average value | ✅ |
| min() / max() | Smallest / largest value | ✅ |
| argmin() / argmax() | Index of min / max | ✅ |
| std() | Standard deviation | ✅ |
| var() | Variance | ✅ |
| cumsum() | Running sum array | ✅ |
| median() | Middle value | ✅ |
| percentile(a, q) | q-th percentile | ✅ |
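The axis column in the table can be checked with a short sketch like the one below. The array `table` and the dictionary `reducers` are illustrative names, not part of NumPy.

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])   # hypothetical 2x3 input

reducers = {
    "sum": np.sum, "mean": np.mean, "min": np.min, "max": np.max,
    "argmin": np.argmin, "argmax": np.argmax,
    "std": np.std, "var": np.var, "median": np.median,
}

# Each reduction drops axis 0, so every result has shape (3,)
shapes = {name: fn(table, axis=0).shape for name, fn in reducers.items()}

# percentile takes q as its second positional argument
q50 = np.percentile(table, 50, axis=0)

# cumsum is not a reduction: the output keeps the input's shape
cum_shape = np.cumsum(table, axis=0).shape

print(shapes)
print(q50, cum_shape)
```

Note that cumsum accepts axis but does not shrink the array, which is the one row in the table where "axis support" does not mean "reduces a dimension".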
💡 One-line Takeaway
Always specify axis for multi-dimensional arrays, and use ddof=1 when computing sample (not population) statistics.