
Vectorization in Python — NumPy vs Pandas vs Polars vs Numba

Posted on: April 14, 2025 at 10:00 AM

“Vectorize your code” is standard Python performance advice. But vectorization isn’t one thing — it’s a family of techniques with very different performance profiles depending on what you’re doing. NumPy, pandas, Polars, and Numba all “vectorize” operations, but through completely different mechanisms.

This post benchmarks all four approaches on realistic operations and gives you a decision framework based on what actually matters: dataset size, operation type, and memory constraints.


What “Vectorization” Actually Means

In Python performance, “vectorized” means different things at different layers:

NumPy vectorization: Operations applied to entire arrays at once using optimized C/Fortran code. The Python interpreter is invoked once per operation, not once per element. Enables SIMD (Single Instruction Multiple Data) — one CPU instruction processes 4-8 values simultaneously.

Pandas vectorization: Built on NumPy, adds column-level operations on DataFrames. The overhead comes from metadata management, index alignment, and dtype handling.

Polars: Rust-based, multithreaded, Arrow-native. Operations run in parallel across CPU cores. No Python interpreter in the hot path.

Numba JIT: Compiles Python functions to machine code on first call using LLVM. Handles loops, conditionals, and operations that can’t be expressed as array operations. The “escape hatch” for truly sequential algorithms.
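To make the NumPy definition above concrete, here's a minimal sketch: both snippets compute the same elementwise sum, but the loop crosses the Python interpreter once per element while the array expression crosses it once in total.

```python
import numpy as np

n = 100_000
a = np.arange(n, dtype=np.float64)
b = np.arange(n, dtype=np.float64)

# Interpreted loop: the Python interpreter runs once per element,
# with two __getitem__ calls and one __setitem__ call each iteration.
loop_result = np.empty_like(a)
for i in range(n):
    loop_result[i] = a[i] + b[i]

# Vectorized: one Python-level call; the inner loop runs in compiled C
# over contiguous memory, where SIMD can apply.
vec_result = a + b
```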

The Benchmark Suite

Operations representative of real data engineering work:

  1. Elementwise arithmetic: (a * b + c) / d on 1D arrays
  2. Conditional assignment: np.where(condition, x, y) equivalent
  3. GroupBy + aggregation: Sum by category
  4. Rolling window: 30-period moving average
  5. Sequential dependency: Running cumulative sum with conditional reset

Dataset sizes: 100K, 1M, 10M, 100M elements.

Machine: AMD Ryzen 9 5900X (12C/24T), 64GB RAM, Python 3.12.

Benchmark 1: Elementwise Arithmetic

Simple math: result[i] = (a[i] * b[i] + c[i]) / d[i]

```python
import numpy as np
import pandas as pd
import polars as pl
from numba import njit, prange
import time

n = 10_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
c = np.random.rand(n).astype(np.float32)
d = np.random.rand(n).astype(np.float32) + 0.1  # avoid div-by-zero

# NumPy
result_np = (a * b + c) / d

# Pandas (wraps NumPy)
s_a, s_b, s_c, s_d = pd.Series(a), pd.Series(b), pd.Series(c), pd.Series(d)
result_pd = (s_a * s_b + s_c) / s_d

# Polars
df = pl.DataFrame({'a': a, 'b': b, 'c': c, 'd': d})
result_pl = df.select((pl.col('a') * pl.col('b') + pl.col('c')) / pl.col('d'))

# Numba (prange is required for parallel=True to actually parallelize the loop)
@njit(parallel=True)
def compute_numba(a, b, c, d):
    result = np.empty(len(a), dtype=np.float32)
    for i in prange(len(a)):
        result[i] = (a[i] * b[i] + c[i]) / d[i]
    return result

_ = compute_numba(a, b, c, d)  # warm up: first call triggers JIT compilation
result_nb = compute_numba(a, b, c, d)
```

Results (10M elements, seconds):

| Method | Time | Relative to NumPy |
| --- | --- | --- |
| Pure Python loop | 12.4s | 103x slower |
| NumPy | 0.12s | 1x (baseline) |
| pandas Series | 0.18s | 1.5x |
| Polars (single column) | 0.08s | 0.67x |
| Numba (parallel=True) | 0.03s | 0.25x |

For simple elementwise ops: Numba parallel wins, followed by Polars. NumPy is the baseline. pandas adds overhead from index tracking.

The Numba advantage shrinks for smaller arrays (JIT overhead) and disappears for operations that can’t parallelize.

Benchmark 2: Conditional Assignment

Equivalent to: if amount > 1000, set category = 'high', elif > 100, set 'medium', else 'low'

```python
# NumPy: np.select
amounts = np.random.exponential(500, n).astype(np.float32)
conditions = [amounts > 1000, amounts > 100]
choices = [2, 1]  # 2=high, 1=medium, 0=low
result_np = np.select(conditions, choices, default=0)

# Pandas: pd.cut expresses the same thresholds as bin edges
pd_amounts = pd.Series(amounts)
result_pd = pd.cut(pd_amounts, bins=[-np.inf, 100, 1000, np.inf], labels=[0, 1, 2])

# Polars
df_pl = pl.DataFrame({'amount': amounts})
result_pl = df_pl.with_columns(
    pl.when(pl.col('amount') > 1000).then(2)
    .when(pl.col('amount') > 100).then(1)
    .otherwise(0)
    .alias('category')
)

# Numba
@njit(parallel=True)
def categorize_numba(amounts):
    result = np.empty(len(amounts), dtype=np.int8)
    for i in prange(len(amounts)):
        if amounts[i] > 1000:
            result[i] = 2
        elif amounts[i] > 100:
            result[i] = 1
        else:
            result[i] = 0
    return result
```

Results (10M elements):

| Method | Time |
| --- | --- |
| NumPy np.select | 0.09s |
| pandas pd.cut | 0.21s |
| Polars when/then | 0.06s |
| Numba parallel | 0.02s |

The pattern holds: Numba → Polars → NumPy → pandas. Polars is closer to NumPy here because branch-heavy operations benefit less from SIMD.
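For two or three branches, nested np.where is a common alternative to np.select; a small sketch of the equivalence:

```python
import numpy as np

amounts = np.array([50.0, 500.0, 5000.0], dtype=np.float32)

# np.select checks conditions in priority order; nested np.where
# encodes the same priority explicitly, without building a condition list.
by_select = np.select([amounts > 1000, amounts > 100], [2, 1], default=0)
by_where = np.where(amounts > 1000, 2, np.where(amounts > 100, 1, 0))
```

Both forms evaluate every condition over the whole array, so for deeply nested logic the Numba loop above, which short-circuits per element, tends to win.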

Benchmark 3: GroupBy + Aggregation

Sum value grouped by category (10K unique categories in 10M rows).

```python
categories = np.random.randint(0, 10_000, n).astype(np.int32)
values = np.random.rand(n).astype(np.float32)

# Pandas
df_pd = pd.DataFrame({'cat': categories, 'val': values})
result_pd = df_pd.groupby('cat')['val'].sum()

# Polars
df_pl = pl.DataFrame({'cat': categories, 'val': values})
result_pl = df_pl.group_by('cat').agg(pl.col('val').sum())

# NumPy (manual groupby: sort, then one reduceat call for the per-group sums)
sort_idx = np.argsort(categories)
sorted_cats = categories[sort_idx]
sorted_vals = values[sort_idx]
unique_cats, counts = np.unique(sorted_cats, return_counts=True)
split_indices = np.concatenate([[0], np.cumsum(counts[:-1])])
result_np = np.add.reduceat(sorted_vals, split_indices)
# The argsort dominates the cost here

# Numba groupby (direct scatter-add into a dense result array)
@njit
def groupby_sum_numba(cats, vals, n_cats):
    result = np.zeros(n_cats, dtype=np.float64)
    for i in range(len(cats)):
        result[cats[i]] += vals[i]
    return result
```

Results (10M rows, 10K groups):

| Method | Time |
| --- | --- |
| NumPy (sort + reduceat) | 0.82s |
| pandas groupby | 0.31s |
| Polars group_by | 0.09s |
| Numba (sequential) | 0.18s |

Polars wins decisively for groupby. This is where its multithreaded hash-based aggregation shines. pandas is roughly 3x faster than NumPy here (its groupby is a well-optimized C extension). Numba struggles because groupby requires random memory access, which has poor cache behavior.
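One caveat on the NumPy timing: when the keys are already dense non-negative integers, as in this benchmark, np.bincount with weights computes a groupby-sum in a single C pass with no sort at all, and is usually much faster than the sort-based approach. A sketch:

```python
import numpy as np

categories = np.array([0, 2, 1, 2, 0], dtype=np.int32)
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)

# sums[k] = values[categories == k].sum(), computed in one pass
sums = np.bincount(categories, weights=values, minlength=3)
```

This only works for non-negative integer keys and only for sum/count-style aggregations, which is why the general sort-based version is what the benchmark measures.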

Benchmark 4: Rolling Window

30-period rolling mean. The operation has a sequential dependency — each output depends on the previous 29 inputs.

```python
# Pandas
pd_series = pd.Series(values)
result_pd = pd_series.rolling(30, min_periods=1).mean()

# NumPy (sliding_window_view over a NaN-padded copy; nanmean then reproduces
# the growing-window behavior of min_periods=1 for the first 29 outputs)
from numpy.lib.stride_tricks import sliding_window_view
windows = sliding_window_view(np.pad(values, (29, 0), constant_values=np.nan), 30)
result_np = np.nanmean(windows, axis=1)

# Polars
df_pl = pl.DataFrame({'val': values})
result_pl = df_pl.with_columns(
    pl.col('val').rolling_mean(30, min_periods=1)
)

# Numba (truly sequential running sum — O(n), one pass)
@njit
def rolling_mean_numba(arr, window):
    result = np.empty(len(arr), dtype=np.float64)
    cumsum = 0.0
    for i in range(len(arr)):
        cumsum += arr[i]
        if i >= window:
            cumsum -= arr[i - window]
        result[i] = cumsum / min(i + 1, window)
    return result
```

Results (10M elements, window=30):

| Method | Time |
| --- | --- |
| NumPy stride_tricks | 1.2s |
| pandas rolling | 0.24s |
| Polars rolling_mean | 0.18s |
| Numba sequential | 0.09s |

Numba wins because its O(n) single-pass running sum is optimal. Polars can't fully parallelize a rolling window. NumPy's sliding_window_view creates a zero-copy view (no extra memory), but reducing with np.nanmean along axis 1 walks the strided view in a cache-unfriendly order, hence the slower timing.
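The same O(n) running-sum idea is also expressible in pure NumPy via cumsum; a sketch that matches the min_periods=1 semantics of the other implementations (with the caveat that one long cumsum accumulates floating-point error, a weakness the Numba running sum shares):

```python
import numpy as np

def rolling_mean_cumsum(arr, window):
    # c[i] = arr[0] + ... + arr[i]
    c = np.cumsum(arr, dtype=np.float64)
    out = np.empty(len(arr), dtype=np.float64)
    k = min(window, len(arr))
    out[:k] = c[:k] / np.arange(1, k + 1)               # growing windows
    out[window:] = (c[window:] - c[:-window]) / window  # full windows
    return out
```

Two vectorized passes instead of one sequential loop, and no 30-wide window materialization, so it avoids the cache problem that hurts the stride-tricks version.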

Benchmark 5: Sequential Dependency

Cumulative sum with conditional reset — common in financial time series (e.g., “cumulative returns since last drawdown”).

```python
# Pure Python would be: for each element, if condition: reset sum.
# This cannot be expressed as a single array operation in NumPy/pandas/Polars.
# The only good option: Numba.
@njit
def cumsum_with_reset(values, reset_condition):
    result = np.empty(len(values), dtype=np.float64)
    running_sum = 0.0
    for i in range(len(values)):
        if reset_condition[i]:
            running_sum = 0.0
        running_sum += values[i]
        result[i] = running_sum
    return result

# Timings for 10M elements:
#   Numba:        0.04s
#   pandas apply: 8.2s
#   Polars apply: 6.1s
```

Numba is the only viable option for sequential dependencies.
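A useful habit when shipping Numba kernels: keep an undecorated pure-Python copy of the same logic as a test oracle. It is slow, but it runs without a Numba install and pins down the reset-before-add semantics (the `_ref` name here is just an illustrative convention):

```python
import numpy as np

def cumsum_with_reset_ref(values, reset_condition):
    # Same semantics as the @njit kernel: the reset happens *before* the
    # current element is added, so result[i] == values[i] at a reset point.
    out = np.empty(len(values), dtype=np.float64)
    running = 0.0
    for i in range(len(values)):
        if reset_condition[i]:
            running = 0.0
        running += values[i]
        out[i] = running
    return out

vals = np.array([1.0, 2.0, 3.0, 4.0])
reset = np.array([False, False, True, False])
ref = cumsum_with_reset_ref(vals, reset)  # [1.0, 3.0, 3.0, 7.0]
```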

Decision Framework

```text
Operation type?
├─ Pure elementwise (a op b, no conditionals)
│  └─ Use NumPy or Polars (both fast; Polars wins on large data)
├─ Conditional elementwise (where/select/when-then)
│  └─ Use NumPy np.select() or Polars when/then
│     For complex logic: Numba @njit(parallel=True)
├─ GroupBy + aggregation
│  └─ Use Polars group_by (fastest by far)
│     Fallback: pandas groupby (still good)
├─ Window/rolling operations
│  └─ Small windows (< 100): pandas rolling or Polars rolling_*
│     Sequential pattern: Numba @njit (can't parallelize anyway)
└─ Sequential dependency (each output depends on the previous)
   └─ Numba @njit — nothing else is close
      Second option: Cython
      Avoid: pandas apply (10-50x slower)
```

Memory Considerations by Approach

| Approach | Intermediate allocations | Notes |
| --- | --- | --- |
| NumPy | 1-3x input size | Each operation creates a new array |
| pandas | 2-5x input size | Metadata, index, dtype overhead |
| Polars lazy | <1x (streaming) | Predicate/projection pushdown |
| Numba | ~0 extra | In-place where possible |

For memory-constrained environments: Polars lazy API (scan_* + .collect()) is unmatched.

The Practical Conclusion

For a typical data engineering pipeline, the hierarchy isn't fixed — the rankings shift with the operation. Profile your actual workload: don't assume Polars always wins because it's newer, or that NumPy always wins because it's foundational. The numbers tell the truth.