“Vectorize your code” is standard Python performance advice. But vectorization isn’t one thing — it’s a family of techniques with very different performance profiles depending on what you’re doing. NumPy, pandas, Polars, and Numba all “vectorize” operations, but through completely different mechanisms.
This post benchmarks all four approaches on realistic operations and gives you a decision framework based on what actually matters: dataset size, operation type, and memory constraints.
Table of contents
- What “Vectorization” Actually Means
- The Benchmark Suite
- Benchmark 1: Elementwise Arithmetic
- Benchmark 2: Conditional Assignment
- Benchmark 3: GroupBy + Aggregation
- Benchmark 4: Rolling Window
- Benchmark 5: Sequential Dependency
- Decision Framework
- Memory Considerations by Approach
- The Practical Conclusion
- Related posts
What “Vectorization” Actually Means
In Python performance, “vectorized” means different things at different layers:
NumPy vectorization: Operations applied to entire arrays at once using optimized C/Fortran code. The Python interpreter is invoked once per operation, not once per element. The compiled loops can also use SIMD (Single Instruction, Multiple Data), where one CPU instruction processes several values at once (typically 4-8, depending on dtype and register width).
Pandas vectorization: Built on NumPy, adds column-level operations on DataFrames. The overhead comes from metadata management, index alignment, and dtype handling.
Polars: Rust-based, multithreaded, Arrow-native. Operations run in parallel across CPU cores. No Python interpreter in the hot path.
Numba JIT: Compiles Python functions to machine code on first call using LLVM. Handles loops, conditionals, and operations that can’t be expressed as array operations. The “escape hatch” for truly sequential algorithms.
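To make the "once per operation, not once per element" distinction concrete, here is a minimal sketch comparing an explicit Python loop with the equivalent NumPy expression. The array size and the helper name loop_version are illustrative assumptions, not part of the benchmark suite.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the comparison is repeatable
a = rng.random(1_000)
b = rng.random(1_000)
c = rng.random(1_000)


def loop_version(a, b, c):
    """Hypothetical helper: the interpreter executes the loop body once per element."""
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = a[i] * b[i] + c[i]
    return out


# Vectorized: two array-level calls (multiply, then add);
# the per-element iteration happens inside compiled code.
vectorized = a * b + c

looped = loop_version(a, b, c)
```

Both versions produce the same values; only the place where the iteration happens differs, and that is where the speed gap comes from.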
The Benchmark Suite
Operations representative of real data engineering work:
- Elementwise arithmetic: (a * b + c) / d on 1D arrays
- Conditional assignment: np.where(condition, x, y) equivalent
- GroupBy + aggregation: Sum by category
- Rolling window: 30-period moving average
- Sequential dependency: Running cumulative sum with conditional reset
Dataset sizes: 100K, 1M, 10M, 100M elements.
Machine: AMD Ryzen 9 5900X (12C/24T), 64GB RAM, Python 3.12.
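The post does not show its timing code, so the following is only one plausible way to collect numbers like these: run each candidate a few times after a warm-up call and keep the best wall-clock time. The helper name best_of and its default arguments are assumptions.

```python
import time


def best_of(fn, *args, repeats=5, warmup=1):
    """Hypothetical helper: fastest wall-clock time of fn(*args), in seconds."""
    for _ in range(warmup):
        fn(*args)  # absorbs one-off costs such as JIT compilation or cold caches
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Usage: time a cheap built-in on a small list
elapsed = best_of(sum, list(range(10_000)))
```

Taking the minimum rather than the mean reduces noise from other processes. The warm-up call matters most for Numba, whose first call includes compilation.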
Benchmark 1: Elementwise Arithmetic
Simple math: result[i] = (a[i] * b[i] + c[i]) / d[i]
import numpy as np
import pandas as pd
import polars as pl
from numba import njit
import time

n = 10_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
c = np.random.rand(n).astype(np.float32)
d = np.random.rand(n).astype(np.float32) + 0.1  # avoid div-by-zero

# NumPy
result_np = (a * b + c) / d

# Pandas (wraps NumPy)
s_a, s_b, s_c, s_d = pd.Series(a), pd.Series(b), pd.Series(c), pd.Series(d)
result_pd = (s_a * s_b + s_c) / s_d

# Polars
df = pl.DataFrame({'a': a, 'b': b, 'c': c, 'd': d})
result_pl = df.select((pl.col('a') * pl.col('b') + pl.col('c')) / pl.col('d'))

# Numba (prange required for parallel=True to engage parallelism)
from numba import prange

@njit(parallel=True)
def compute_numba(a, b, c, d):
    result = np.empty(len(a), dtype=np.float32)
    for i in prange(len(a)):
        result[i] = (a[i] * b[i] + c[i]) / d[i]
    return result

_ = compute_numba(a, b, c, d)  # warm up JIT
result_nb = compute_numba(a, b, c, d)

Results (10M elements, seconds):
| Method | Time | Time relative to NumPy (lower is faster) |
|---|---|---|
| Pure Python loop | 12.4s | 103x |
| NumPy | 0.12s | 1x (baseline) |
| pandas Series | 0.18s | 1.5x |
| Polars (single column) | 0.08s | 0.67x |
| Numba (parallel=True) | 0.03s | 0.25x |
For simple elementwise ops: Numba parallel wins, followed by Polars. NumPy is the baseline. pandas adds overhead from index tracking.
The Numba advantage shrinks for smaller arrays, where thread start-up and the one-time JIT compilation cost dominate, and the parallel speedup disappears for operations that can’t be parallelized.
Benchmark 2: Conditional Assignment
Equivalent to: if amount > 1000, set category = 'high', elif > 100, set 'medium', else 'low'
amounts = np.random.exponential(500, n).astype(np.float32)

# NumPy: the first matching condition wins
conditions = [amounts > 1000, amounts > 100]
choices = [2, 1]  # 2=high, 1=medium, 0=low
result_np = np.select(conditions, choices, default=0)

# Pandas: pd.cut with right-closed bins (-inf, 100], (100, 1000], (1000, inf) gives the same categories
pd_amounts = pd.Series(amounts)
result_pd = pd.cut(pd_amounts, bins=[-np.inf, 100, 1000, np.inf], labels=[0, 1, 2])

# Polars
df_pl = pl.DataFrame({'amount': amounts})
result_pl = df_pl.with_columns(
    pl.when(pl.col('amount') > 1000).then(2)
    .when(pl.col('amount') > 100).then(1)
    .otherwise(0)
    .alias('category')
)

# Numba
@njit(parallel=True)
def categorize_numba(amounts):
    result = np.empty(len(amounts), dtype=np.int8)
    for i in prange(len(amounts)):
        if amounts[i] > 1000:
            result[i] = 2
        elif amounts[i] > 100:
            result[i] = 1
        else:
            result[i] = 0
    return result

Results (10M elements):
| Method | Time |
|---|---|
| NumPy np.select | 0.09s |
| pandas pd.cut | 0.21s |
| Polars when/then | 0.06s |
| Numba parallel | 0.02s |
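Pattern holds: Numba → Polars → NumPy → pandas. In absolute terms Polars is slightly closer to NumPy here (0.03s apart versus 0.04s in Benchmark 1), though the ratio is unchanged at 1.5x; branch-heavy operations don’t benefit from SIMD as much.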
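Because the thresholds here are ordered, the same categories can also be produced by binning rather than by chained conditions. The sketch below uses np.digitize as an alternative to np.select (it is not one of the benchmarked variants) and checks the boundary values 100 and 1000, where an off-by-one in the bin edges would show up.

```python
import numpy as np

amounts = np.array([50.0, 100.0, 100.5, 1000.0, 1000.5], dtype=np.float32)

# Chained conditions, as in the benchmark: the first matching condition wins
via_select = np.select([amounts > 1000, amounts > 100], [2, 1], default=0)

# Binning: right=True closes each bin on the right, so 100 -> 0 and 1000 -> 1
via_digitize = np.digitize(amounts, bins=[100, 1000], right=True)
```

Both arrays come out as 0, 0, 1, 1, 2 for these inputs. Whether binning is faster than np.select at 10M elements was not measured here.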
Benchmark 3: GroupBy + Aggregation
Sum value grouped by category (10K unique categories in 10M rows).
categories = np.random.randint(0, 10_000, n).astype(np.int32)
values = np.random.rand(n).astype(np.float32)

# Pandas
df_pd = pd.DataFrame({'cat': categories, 'val': values})
result_pd = df_pd.groupby('cat')['val'].sum()

# Polars
df_pl = pl.DataFrame({'cat': categories, 'val': values})
result_pl = df_pl.group_by('cat').agg(pl.col('val').sum())

# NumPy (manual groupby using sort + reduceat)
sort_idx = np.argsort(categories)
sorted_cats = categories[sort_idx]
sorted_vals = values[sort_idx]
unique_cats, counts = np.unique(sorted_cats, return_counts=True)
split_indices = np.concatenate([[0], np.cumsum(counts[:-1])])
# reduceat sums each contiguous run of sorted values, one run per category
result_np = np.add.reduceat(sorted_vals, split_indices)

# Numba groupby (direct indexing; assumes integer categories in [0, n_cats))
@njit
def groupby_sum_numba(cats, vals, n_cats):
    result = np.zeros(n_cats, dtype=np.float64)
    for i in range(len(cats)):
        result[cats[i]] += vals[i]
    return result

Results (10M rows, 10K groups):
| Method | Time |
|---|---|
| NumPy (sort + reduceat) | 0.82s |
| pandas groupby | 0.31s |
| Polars group_by | 0.09s |
| Numba (sequential) | 0.18s |
Polars wins decisively for groupby. This is where its hash-based aggregation and multithreading shine. pandas is about 2.6x faster than the NumPy version here (pandas groupby is a well-optimized C extension). Numba comes second but trails Polars: its loop is single-threaded, and scattering writes across the result array is less cache-friendly than a streaming pass.
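When the group keys are already small non-negative integers, as in this benchmark, plain NumPy can skip the sort entirely. The sketch below uses np.bincount with weights; it was not part of the timings above, and it assumes every key lies in [0, n_cats).

```python
import numpy as np

cats = np.array([0, 2, 1, 2, 0, 2], dtype=np.int32)
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
n_cats = 3  # assumed known up front; keys must lie in [0, n_cats)

# bincount accumulates weights[i] into slot cats[i];
# minlength keeps groups with no rows as 0.0 instead of dropping them
group_sums = np.bincount(cats, weights=vals, minlength=n_cats)
```

The result is float64 regardless of the dtype of vals. For string keys or sparse integer keys this trick does not apply directly, and a hash-based groupby in pandas or Polars is the better fit.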
Benchmark 4: Rolling Window
30-period rolling mean. Each output depends on the current input and the previous 29, so adjacent windows overlap and a running sum can reuse most of the work.
# Pandas
pd_series = pd.Series(values)
result_pd = pd_series.rolling(30, min_periods=1).mean()

# NumPy (using stride tricks for vectorized sliding window)
# Note: padding the front with 29 NaNs and reducing with nanmean reproduces the
# growing windows that pandas/Polars give with min_periods=1
from numpy.lib.stride_tricks import sliding_window_view
windows = sliding_window_view(np.pad(values, (29, 0), constant_values=np.nan), 30)
result_np = np.nanmean(windows, axis=1)  # nanmean ignores the NaN padding

# Polars (recent Polars releases rename min_periods to min_samples)
df_pl = pl.DataFrame({'val': values})
result_pl = df_pl.with_columns(
    pl.col('val').rolling_mean(30, min_periods=1)
)

# Numba (sequential running sum; this loop can't use prange)
@njit
def rolling_mean_numba(arr, window):
    result = np.empty(len(arr), dtype=np.float64)
    cumsum = 0.0
    for i in range(len(arr)):
        cumsum += arr[i]
        if i >= window:
            cumsum -= arr[i - window]
        result[i] = cumsum / min(i + 1, window)
    return result

Results (10M elements, window=30):
| Method | Time |
|---|---|
| NumPy stride_tricks | 1.2s |
| pandas rolling | 0.24s |
| Polars rolling_mean | 0.18s |
| Numba sequential | 0.09s |
Numba wins because its O(n) single pass with a running sum is optimal. Polars can’t parallelize rolling fully. NumPy’s sliding_window_view itself is a zero-copy view, but reducing it does O(n × window) work, and np.nanmean copies the windowed data into temporaries of that size to mask the NaNs, hence the slower timing and the higher memory use.
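The running-sum idea behind the Numba version can also be written in plain NumPy with a prefix sum, which avoids the O(n × window) work of the strided view. This is a sketch that was not part of the timings; the helper name rolling_mean_cumsum is an assumption, and it uses growing windows at the start to match min_periods=1.

```python
import numpy as np


def rolling_mean_cumsum(arr, window):
    """Hypothetical helper: growing-window rolling mean in O(n) via a prefix sum."""
    arr = np.asarray(arr, dtype=np.float64)  # float64 limits rounding error in the prefix sum
    prefix = np.concatenate(([0.0], np.cumsum(arr)))  # prefix[k] = sum of arr[:k]
    ends = np.arange(1, len(arr) + 1)      # exclusive end index of each window
    starts = np.maximum(ends - window, 0)  # windows are shorter at the start of the series
    return (prefix[ends] - prefix[starts]) / (ends - starts)


values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
result = rolling_mean_cumsum(values, window=3)  # [1.0, 1.5, 2.0, 3.0, 4.0]
```

Prefix sums over very long series accumulate rounding error, so treat this as an illustration of the algorithm rather than a drop-in replacement for a library rolling function.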
Benchmark 5: Sequential Dependency
Cumulative sum with conditional reset — common in financial time series (e.g., “cumulative returns since last drawdown”).
# Pure Python would be: for each element, if condition: reset sum
# This cannot be expressed as a single array operation in NumPy/pandas/Polars

# The only good option: Numba
@njit
def cumsum_with_reset(values, reset_condition):
    result = np.empty(len(values), dtype=np.float64)
    running_sum = 0.0
    for i in range(len(values)):
        if reset_condition[i]:
            running_sum = 0.0
        running_sum += values[i]
        result[i] = running_sum
    return result

# Numba: 0.04s for 10M elements
# pandas apply: 8.2s
# polars apply (map_elements in current Polars): 6.1s

Of the approaches timed here, Numba is the only viable option for sequential dependencies: the per-element apply fallbacks are roughly 150-200x slower.
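For this particular reset pattern, a composition of several NumPy operations (rather than a single one) can reproduce the loop, which is handy as a cross-check when Numba is unavailable. This is a sketch that was not part of the timings above, and the function names are assumptions. Recurrences where each output feeds nonlinearly into the next generally have no such reformulation.

```python
import numpy as np


def cumsum_with_reset_loop(values, reset):
    """Plain-Python reference: zero the running sum wherever reset is True."""
    out = np.empty(len(values), dtype=np.float64)
    running = 0.0
    for i in range(len(values)):
        if reset[i]:
            running = 0.0
        running += values[i]
        out[i] = running
    return out


def cumsum_with_reset_np(values, reset):
    """Hypothetical NumPy-only variant: global prefix sum minus the total before the last reset."""
    values = np.asarray(values, dtype=np.float64)
    reset = np.asarray(reset, dtype=bool)
    total = np.cumsum(values)
    before = np.concatenate(([0.0], total[:-1]))  # before[i] = sum of values[:i]
    idx = np.arange(len(values))
    # Index of the most recent reset at or before each position, -1 if none yet
    last_reset = np.maximum.accumulate(np.where(reset, idx, -1))
    offset = np.where(last_reset >= 0, before[np.maximum(last_reset, 0)], 0.0)
    return total - offset


vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
reset = np.array([False, False, True, False, True])
result = cumsum_with_reset_np(vals, reset)  # [1.0, 3.0, 3.0, 7.0, 5.0]
```

Subtracting two large prefix sums can lose precision on long series, which the running-sum loop avoids by restarting from zero at each reset.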
Decision Framework
Operation type?
├─ Pure elementwise (a op b, no conditionals)
│   └─ Use NumPy or Polars (both fast, Polars wins on large data)
│
├─ Conditional elementwise (where/select/when-then)
│   └─ Use NumPy np.select() or Polars when/then
│       For complex logic: Numba @njit(parallel=True)
│
├─ GroupBy + aggregation
│   └─ Use Polars group_by (fastest by far)
│       Fallback: pandas groupby (still good)
│
├─ Window/rolling operations
│   └─ Small windows (< 100): pandas rolling or Polars rolling_*
│       Sequential pattern: Numba @njit (can't parallelize anyway)
│
└─ Sequential dependency (each output depends on previous)
    └─ Numba @njit: nothing else is close
        Second option: Cython
        Avoid: pandas apply (roughly 200x slower in Benchmark 5)

Memory Considerations by Approach
| Approach | Intermediate allocations | Notes |
|---|---|---|
| NumPy | 1-3x input size | Each operation creates a new array |
| pandas | 2-5x input size | Metadata, index, dtype overhead |
| Polars lazy | <1x (streaming) | Predicate/projection pushdown |
| Numba | ~0 extra | In-place where possible |
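For memory-constrained environments, the Polars lazy API (scan_* + .collect()) is the strongest option of the four: predicate and projection pushdown let it skip rows and columns that the query never needs.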
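Within NumPy itself, the "each operation creates a new array" cost can be reduced by reusing one buffer through the out= argument of ufuncs. A minimal sketch, with small arrays standing in for the benchmark inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = (rng.random(1_000, dtype=np.float32) for _ in range(3))
d = rng.random(1_000, dtype=np.float32) + np.float32(0.1)  # avoid division by zero

# Plain expression: can allocate intermediate arrays for a*b and (a*b + c)
expected = (a * b + c) / d

# Buffer-reusing version: one allocation, then every step writes into it
buf = np.empty_like(a)
np.multiply(a, b, out=buf)
np.add(buf, c, out=buf)
np.divide(buf, d, out=buf)
```

The trade-off is readability: the three-call form is harder to scan than the one-line expression, so it is probably only worth it when the intermediates are large enough to matter.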
The Practical Conclusion
For a typical data engineering pipeline:
- Columnar transforms: Polars for data >1M rows, pandas for smaller
- Complex aggregations: Polars group_by
- Time series rolling: Polars if simple, Numba if sequential
- Custom ML features with loops: Numba
- Quick scripts/exploration: NumPy for raw arrays, pandas for tabular
The hierarchy isn’t fixed. For specific operations, the rankings shift. Profile your actual workload — don’t assume Polars always wins because it’s newer, or NumPy always wins because it’s foundational. The numbers tell the truth.
Related posts
- Polars vs Pandas Benchmark — When to Switch and When to Stay — detailed DataFrame operation benchmarks that complement the vectorization comparisons here
- Python Time Series at Scale — Lessons from Processing 400M Financial Records — applying these vectorization choices to a real production pipeline at ECB scale