Every data engineer has written it. Some have shipped it to production. The crime: iterating over a pandas DataFrame with .iterrows(). In this post, I’ll benchmark five different approaches to the same problem, explain why the performance differs so dramatically, and give you a decision tree for choosing the right method.
Table of contents
Open Table of contents
- The Problem: Categorizing Financial Transactions
- Method 1: .iterrows() — The Antipattern
- Method 2: .itertuples() — The Better Iterator
- Method 3: .apply() — Middle Ground
- Method 4: np.where() and np.select() — Vectorized
- Method 5: Pandas .loc[] with Boolean Masks — Most Readable Vectorized
- Benchmark Summary
- When You Actually Need a Loop
- The Decision Tree
- Memory Considerations
The Problem: Categorizing Financial Transactions
We’ll use a realistic example: categorizing 1 million bank transactions based on amount and merchant type. Not a toy example — the kind of thing that runs daily in fintech pipelines.
```python
import pandas as pd
import numpy as np
import time

# Generate 1M transactions
np.random.seed(42)
n = 1_000_000

df = pd.DataFrame({
    'amount': np.random.uniform(0.01, 10_000, n),
    'merchant_type': np.random.choice(
        ['retail', 'food', 'transport', 'entertainment', 'other'], n
    ),
    'is_international': np.random.choice([True, False], n, p=[0.15, 0.85])
})
```

The task: add a `category` column with values `'high_value'`, `'medium_value'`, or `'low_value'` based on amount, and add a `fee` column equal to 5% of the amount for international transactions.
Method 1: .iterrows() — The Antipattern
```python
def categorize_iterrows(df):
    categories = []
    fees = []
    for idx, row in df.iterrows():
        if row['amount'] > 1000:
            categories.append('high_value')
        elif row['amount'] > 100:
            categories.append('medium_value')
        else:
            categories.append('low_value')

        fee = row['amount'] * 0.05 if row['is_international'] else 0
        fees.append(fee)

    df['category'] = categories
    df['fee'] = fees
    return df
```

**Time on 1M rows: ~180 seconds**
Yes, three minutes for a million rows. This is not a joke.
Why Is .iterrows() So Slow?
`.iterrows()` converts each row into a pandas Series object on every iteration. This involves:

- Creating a new Series object (memory allocation)
- Boxing each value into a Python object (int64 → Python int, float64 → Python float)
- A hash-based column lookup for every `row['column']` access
- Python interpreter overhead for the loop itself
For 1M rows, that’s 1M object allocations, ~3M value boxing operations, and ~2M dict lookups. In a language with a GIL and no JIT (base CPython), this is genuinely expensive.
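A minimal sketch makes the per-row cost visible, and also exposes a related gotcha: when column dtypes are mixed, `.iterrows()` upcasts values to fit them into one Series.

```python
import pandas as pd

# Tiny frame mixing int and float columns
df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]})

# Every iteration allocates a brand-new Series object...
_, row = next(df.iterrows())
assert isinstance(row, pd.Series)

# ...and mixed dtypes are upcast: the int column comes back as a float
assert isinstance(row['a'], float)
```

The dtype upcasting is a correctness issue on top of the performance one: code that expects integers can silently receive floats.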
Method 2: .itertuples() — The Better Iterator
```python
def categorize_itertuples(df):
    categories = []
    fees = []
    for row in df.itertuples(index=False):
        if row.amount > 1000:
            categories.append('high_value')
        elif row.amount > 100:
            categories.append('medium_value')
        else:
            categories.append('low_value')

        fee = row.amount * 0.05 if row.is_international else 0
        fees.append(fee)

    df = df.copy()
    df['category'] = categories
    df['fee'] = fees
    return df
```

**Time on 1M rows: ~12 seconds**
itertuples() returns named tuples, not Series objects. Named tuples are subclasses of Python tuples — attribute access is a fast tuple.__getitem__ by index, not a dict lookup. The speedup comes from avoiding Series creation and index alignment per row, not from avoiding value boxing (individual cell values are still Python objects). Result: 15x faster than iterrows().
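Both properties are easy to verify directly:

```python
import pandas as pd

df = pd.DataFrame({'amount': [250.0, 1500.0], 'is_international': [True, False]})

row = next(df.itertuples(index=False))

# Namedtuples subclass tuple: attribute access is positional, not hash-based
assert isinstance(row, tuple)
assert row.amount == 250.0

# Unlike iterrows, each column keeps its own dtype
assert bool(row.is_international) is True
```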
Still not great. We haven’t touched vectorization yet.
Method 3: .apply() — Middle Ground
```python
def categorize_apply(df):
    def process_row(row):
        category = 'high_value' if row['amount'] > 1000 else (
            'medium_value' if row['amount'] > 100 else 'low_value'
        )
        fee = row['amount'] * 0.05 if row['is_international'] else 0
        return pd.Series({'category': category, 'fee': fee})

    result = df.apply(process_row, axis=1)
    return pd.concat([df, result], axis=1)
```

**Time on 1M rows: ~95 seconds**
Worse than itertuples! apply(axis=1) has the overhead of both iteration and creating a Series for each result. Use apply() for column-level operations (axis=0), not row-level.
The one case where apply() shines is complex string operations or calling external functions where vectorization isn’t possible.
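For illustration, here is the shape of such a per-value transform. The `normalize` helper is hypothetical, and note that a case this simple does have vectorized `.str` equivalents; the pattern matters when the logic gets genuinely custom.

```python
import pandas as pd

df = pd.DataFrame({'merchant': ["acme corp", "bob's diner"]})

# Custom per-value logic: capitalize each word of the merchant name
def normalize(name):
    return ' '.join(word.capitalize() for word in name.split())

df['merchant_clean'] = df['merchant'].apply(normalize)
print(df['merchant_clean'].tolist())   # ['Acme Corp', "Bob's Diner"]
```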
Method 4: np.where() and np.select() — Vectorized
```python
def categorize_vectorized(df):
    df = df.copy()

    # np.select for multi-condition categorization
    conditions = [
        df['amount'] > 1000,
        df['amount'] > 100
    ]
    choices = ['high_value', 'medium_value']
    df['category'] = np.select(conditions, choices, default='low_value')

    # np.where for the binary fee calculation
    df['fee'] = np.where(df['is_international'], df['amount'] * 0.05, 0)

    return df
```

**Time on 1M rows: ~0.08 seconds (80 ms)**
2,250x faster than iterrows. We went from 3 minutes to 80 milliseconds.
What Makes This Fast?
`np.select()` and `np.where()` operate on entire NumPy arrays at once, entirely in C. There is no Python interpreter overhead per element. The CPU can vectorize these operations using SIMD instructions (AVX2's 256-bit registers handle four float64 values per instruction), and memory access is sequential and cache-friendly, unlike the scattered object graph that per-row Python iteration walks.
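A quick, unscientific way to see the gap on your own hardware is a small `timeit` comparison (absolute numbers will vary by machine):

```python
import numpy as np
import timeit

a = np.random.uniform(0, 10_000, 100_000)

# One pass over a contiguous float64 buffer, entirely in C
vec = timeit.timeit(lambda: np.where(a > 1000, a * 0.05, 0.0), number=10)

# The same logic with per-element Python-level branching
loop = timeit.timeit(lambda: [x * 0.05 if x > 1000 else 0.0 for x in a],
                     number=10)

print(f"vectorized: {vec:.4f}s  python loop: {loop:.4f}s")
```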
Method 5: Pandas .loc[] with Boolean Masks — Most Readable Vectorized
```python
def categorize_masked(df):
    df = df.copy()

    # Assign the default, then override with conditions
    df['category'] = 'low_value'
    df.loc[df['amount'] > 100, 'category'] = 'medium_value'
    df.loc[df['amount'] > 1000, 'category'] = 'high_value'

    df['fee'] = 0.0
    intl = df['is_international']
    df.loc[intl, 'fee'] = df.loc[intl, 'amount'] * 0.05

    return df
```

**Time on 1M rows: ~0.15 seconds (150 ms)**
Slightly slower than np.select() (boolean mask operations in pandas have some overhead), but arguably more readable. Same order of magnitude.
Benchmark Summary
| Method | Time (1M rows) | Speedup vs iterrows |
|---|---|---|
| `.iterrows()` | 180,000 ms | 1x (baseline) |
| `.apply(axis=1)` | 95,000 ms | 1.9x |
| `.itertuples()` | 12,000 ms | 15x |
| `.loc[]` + boolean masks | 150 ms | 1,200x |
| `np.where()` / `np.select()` | 80 ms | 2,250x |
When You Actually Need a Loop
Vectorization isn’t always possible. Legitimate use cases for row iteration:
**Sequential dependencies:** Computing a running balance where each row depends on the previous result. In this case, use Numba's `@jit`:
```python
from numba import jit
import numpy as np

@jit(nopython=True)
def running_balance(amounts, fees):
    balance = np.zeros(len(amounts))
    balance[0] = amounts[0] - fees[0]
    for i in range(1, len(amounts)):
        balance[i] = balance[i-1] + amounts[i] - fees[i]
    return balance
```

**Calling external APIs per row:** No way around it — each API call is independent I/O. Use asyncio with batching instead of sequential loops.
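A minimal sketch of that batching pattern, with a hypothetical `lookup_merchant` coroutine standing in for a real API client:

```python
import asyncio

# Hypothetical async lookup; a real client would issue an HTTP request here
async def lookup_merchant(merchant_id):
    await asyncio.sleep(0)            # stands in for network latency
    return f"merchant-{merchant_id}"

async def enrich(ids, batch_size=100):
    results = []
    for i in range(0, len(ids), batch_size):
        batch = ids[i:i + batch_size]
        # One batch of independent calls runs concurrently
        results.extend(
            await asyncio.gather(*(lookup_merchant(m) for m in batch))
        )
    return results

names = asyncio.run(enrich(list(range(250))))
```

Batching caps concurrency so you don't open thousands of simultaneous connections against the remote service.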
**Complex business logic with multiple if/elif branches:** Consider `pd.cut()` and `Series.map()` first. If the logic is genuinely impossible to vectorize, `itertuples()` is your fallback.
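For the categorization task from earlier, `pd.cut()` expresses the same thresholds declaratively (bins are right-inclusive by default, which matches the strict `>` comparisons here):

```python
import numpy as np
import pandas as pd

amounts = pd.Series([50.0, 500.0, 5000.0])

# Same thresholds as the if/elif chain, as bin edges
category = pd.cut(
    amounts,
    bins=[-np.inf, 100, 1000, np.inf],
    labels=['low_value', 'medium_value', 'high_value'],
)
labels = category.astype(str).tolist()
print(labels)   # ['low_value', 'medium_value', 'high_value']
```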
The Decision Tree
```
Need row-level operation?
├─ Can it be expressed as column comparisons / arithmetic?
│   └─ YES → np.where() / np.select() / boolean masks  [BEST]
├─ Does each row call an external function you can't change?
│   ├─ Does each call depend on previous results?
│   │   └─ YES → Numba @jit or Cython
│   └─ Are calls independent?
│       └─ YES → itertuples() or apply() on chunks with multiprocessing
└─ Complex sequential logic with state?
    └─ Numba @jit (orders of magnitude faster than pure Python loops)
```

Memory Considerations
Vectorized operations aren’t free on memory. np.select() with 1M rows creates intermediate boolean arrays (~1MB each, since NumPy booleans are 1 byte). For truly large DataFrames (100M+ rows), consider:
- **Chunked processing:** `pd.read_csv(chunksize=100_000)`
- **Polars:** lazy evaluation avoids materializing all intermediate results
- **Dask:** distributes pandas operations across cores/machines
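A sketch of the chunked approach, with an in-memory `StringIO` standing in for a large file on disk:

```python
import io
import numpy as np
import pandas as pd

# In-memory stand-in for a large CSV file
csv = io.StringIO(
    "amount,is_international\n"
    "50.0,True\n"
    "2500.0,False\n"
    "120.0,True\n"
)

# Process in fixed-size chunks so only one chunk is in memory at a time
parts = []
for chunk in pd.read_csv(csv, chunksize=2):
    chunk['fee'] = np.where(chunk['is_international'],
                            chunk['amount'] * 0.05, 0.0)
    parts.append(chunk)

result = pd.concat(parts, ignore_index=True)
print(result['fee'].tolist())
```

Each chunk is a regular DataFrame, so the vectorized methods from earlier apply unchanged inside the loop.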
The right tool depends on your data size. For anything fitting in RAM, pandas + NumPy vectorization is almost always sufficient — and the bottleneck is rarely computation once you stop iterating.
Stop using .iterrows(). Your future self (and your on-call rotation) will thank you.