
Pandas Performance — Stop Using .iterrows() (with Benchmarks)

Posted on: March 14, 2024 at 10:00 AM

Every data engineer has written it. Some have shipped it to production. The crime: iterating over a pandas DataFrame with .iterrows(). In this post, I’ll benchmark five different approaches to the same problem, explain why the performance differs so dramatically, and give you a decision tree for choosing the right method.


The Problem: Categorizing Financial Transactions

We’ll use a realistic example: categorizing 1 million bank transactions based on amount and merchant type. Not a toy example — the kind of thing that runs daily in fintech pipelines.

import pandas as pd
import numpy as np
import time

# Generate 1M transactions
np.random.seed(42)
n = 1_000_000
df = pd.DataFrame({
    'amount': np.random.uniform(0.01, 10_000, n),
    'merchant_type': np.random.choice(
        ['retail', 'food', 'transport', 'entertainment', 'other'], n
    ),
    'is_international': np.random.choice([True, False], n, p=[0.15, 0.85])
})

The task: add a category column with values 'high_value', 'medium_value', or 'low_value' based on amount, and apply a 5% fee multiplier to international transactions.

Method 1: .iterrows() — The Antipattern

def categorize_iterrows(df):
    categories = []
    fees = []
    for idx, row in df.iterrows():
        if row['amount'] > 1000:
            categories.append('high_value')
        elif row['amount'] > 100:
            categories.append('medium_value')
        else:
            categories.append('low_value')
        fee = row['amount'] * 0.05 if row['is_international'] else 0
        fees.append(fee)
    df['category'] = categories
    df['fee'] = fees
    return df

Time on 1M rows: ~180 seconds

Yes, three minutes for a million rows. This is not a joke.

Why Is .iterrows() So Slow?

.iterrows() converts each row into a pandas Series object on every iteration. This involves:

  1. Creating a new Python object (memory allocation)
  2. Boxing each value into a Python object (np.int64 → Python int, np.float64 → Python float)
  3. Hash-based column lookup for each row['column'] access
  4. Python interpreter overhead for the loop itself

For 1M rows, that’s 1M object allocations, ~3M value boxing operations, and ~2M dict lookups. In a language with a GIL and no JIT (base CPython), this is genuinely expensive.
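The dtype coercion is easy to see on a tiny frame. A quick sketch: in a mixed-dtype DataFrame, the column itself stays a typed NumPy array, but the per-row Series that .iterrows() builds falls back to object dtype, boxing every value.

```python
import pandas as pd

# Tiny mixed-dtype frame to illustrate the per-row Series conversion
df = pd.DataFrame({'amount': [12.5], 'merchant_type': ['food']})

print(df['amount'].dtype)   # float64: the column is a typed NumPy array
for _, row in df.iterrows():
    print(row.dtype)        # object: every value boxed into a Python object
```

This is also why .iterrows() can silently change types on you: a row pulled from a mixed frame no longer remembers that 'amount' was float64.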

Method 2: .itertuples() — The Better Iterator

def categorize_itertuples(df):
    categories = []
    fees = []
    for row in df.itertuples(index=False):
        if row.amount > 1000:
            categories.append('high_value')
        elif row.amount > 100:
            categories.append('medium_value')
        else:
            categories.append('low_value')
        fee = row.amount * 0.05 if row.is_international else 0
        fees.append(fee)
    df = df.copy()
    df['category'] = categories
    df['fee'] = fees
    return df

Time on 1M rows: ~12 seconds

itertuples() returns named tuples, not Series objects. Named tuples are subclasses of Python tuples — attribute access is a fast tuple.__getitem__ by index, not a dict lookup. The speedup comes from avoiding Series creation and index alignment per row, not from avoiding value boxing (individual cell values are still Python objects). Result: 15x faster than iterrows().
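A minimal look at what .itertuples() actually hands you (the default tuple class is named Pandas unless you pass name=):

```python
import pandas as pd

df = pd.DataFrame({'amount': [42.0, 1500.0], 'is_international': [True, False]})

# Each yielded row is a namedtuple; fields are plain attributes
first = next(df.itertuples(index=False))
print(type(first).__name__)  # Pandas
print(first.amount * 0.05)   # 2.1
```

Because the fields are positional tuple slots, `first.amount` compiles down to an index lookup rather than the hash-based `row['amount']` that .iterrows() forces.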

Still not great. We haven’t touched vectorization yet.

Method 3: .apply() — Middle Ground

def categorize_apply(df):
    def process_row(row):
        category = 'high_value' if row['amount'] > 1000 else (
            'medium_value' if row['amount'] > 100 else 'low_value'
        )
        fee = row['amount'] * 0.05 if row['is_international'] else 0
        return pd.Series({'category': category, 'fee': fee})

    result = df.apply(process_row, axis=1)
    return pd.concat([df, result], axis=1)

Time on 1M rows: ~95 seconds

Worse than itertuples! apply(axis=1) has the overhead of both iteration and creating a Series for each result. Use apply() for column-level operations (axis=0), not row-level.

The one case where apply() shines is complex string operations or calling external functions where vectorization isn’t possible.
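For instance, a per-value cleanup function applied column-wise with Series.map() keeps the custom logic in one place without the axis=1 penalty. A sketch (normalize_merchant is a hypothetical stand-in for logic that resists vectorization):

```python
import pandas as pd

# Hypothetical per-value cleanup, a stand-in for genuinely non-vectorizable logic
def normalize_merchant(name: str) -> str:
    parts = name.strip().lower().split('_')
    return ' '.join(p for p in parts if p)

s = pd.Series(['  FOOD_delivery', 'retail__ONLINE'])
print(s.map(normalize_merchant).tolist())  # ['food delivery', 'retail online']
```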

Method 4: np.where() and np.select() — Vectorized

def categorize_vectorized(df):
    df = df.copy()
    # np.select for multi-condition categorization
    conditions = [
        df['amount'] > 1000,
        df['amount'] > 100
    ]
    choices = ['high_value', 'medium_value']
    df['category'] = np.select(conditions, choices, default='low_value')
    # np.where for binary fee calculation
    df['fee'] = np.where(df['is_international'], df['amount'] * 0.05, 0)
    return df

Time on 1M rows: ~0.08 seconds (80ms)

2,250x faster than iterrows. We went from 3 minutes to 80 milliseconds.

What Makes This Fast?

np.select() and np.where() operate on entire NumPy arrays at once, entirely in C. No Python interpreter overhead per element. The CPU can vectorize these operations using SIMD instructions (AVX2 on modern CPUs processes 8 floats simultaneously). Memory access is sequential (cache-friendly), unlike the scattered object graph of Python objects.
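The semantics are worth internalizing: each condition is a whole boolean array, and for every element the first condition that matches wins. A small standalone example:

```python
import numpy as np

amounts = np.array([2500.0, 550.0, 12.0])

# Conditions are whole boolean arrays; the first match wins per element
conditions = [amounts > 1000, amounts > 100]
choices = ['high_value', 'medium_value']
print(np.select(conditions, choices, default='low_value'))
# ['high_value' 'medium_value' 'low_value']
```

Note the ordering matters: 2500 satisfies both conditions but takes 'high_value' because that condition comes first, exactly mirroring an if/elif chain.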

Method 5: Pandas .loc[] with Boolean Masks — Most Readable Vectorized

def categorize_masked(df):
    df = df.copy()
    # Assign default, then override with conditions
    df['category'] = 'low_value'
    df.loc[df['amount'] > 100, 'category'] = 'medium_value'
    df.loc[df['amount'] > 1000, 'category'] = 'high_value'
    df['fee'] = 0.0
    df.loc[df['is_international'], 'fee'] = df.loc[df['is_international'], 'amount'] * 0.05
    return df

Time on 1M rows: ~0.15 seconds (150ms)

Slightly slower than np.select() (boolean mask operations in pandas have some overhead), but arguably more readable. Same order of magnitude.

Benchmark Summary

Method                      Time (1M rows)   Speedup vs iterrows
.iterrows()                 180,000 ms       1x (baseline)
.apply(axis=1)              95,000 ms        1.9x
.itertuples()               12,000 ms        15x
.loc[] + boolean masks      150 ms           1,200x
np.where() / np.select()    80 ms            2,250x
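If you want to reproduce these numbers yourself, a minimal harness like the following works; exact timings depend heavily on hardware and pandas version, so treat the table above as illustrative. This sketch uses a smaller n so it finishes quickly:

```python
import time

import numpy as np
import pandas as pd

def bench(fn, df, label):
    # Single-shot wall-clock timing; good enough for order-of-magnitude gaps
    start = time.perf_counter()
    fn(df)
    print(f'{label}: {time.perf_counter() - start:.3f}s')

np.random.seed(42)
n = 100_000  # smaller than the post's 1M so the slow methods finish quickly
df = pd.DataFrame({
    'amount': np.random.uniform(0.01, 10_000, n),
    'is_international': np.random.choice([True, False], n, p=[0.15, 0.85]),
})

def vectorized(df):
    out = df.copy()
    out['category'] = np.select(
        [out['amount'] > 1000, out['amount'] > 100],
        ['high_value', 'medium_value'], default='low_value')
    out['fee'] = np.where(out['is_international'], out['amount'] * 0.05, 0)
    return out

bench(vectorized, df, 'np.select/np.where')
```

For differences this large (milliseconds vs minutes), a single run is plenty; for finer comparisons, reach for timeit and take the minimum of several repeats.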

When You Actually Need a Loop

Vectorization isn’t always possible. Legitimate use cases for row iteration:

Sequential dependencies: Computing a running balance where each row depends on the previous result. In this case, use Numba’s @jit:

from numba import jit
import numpy as np

@jit(nopython=True)
def running_balance(amounts, fees):
    balance = np.zeros(len(amounts))
    balance[0] = amounts[0] - fees[0]
    for i in range(1, len(amounts)):
        balance[i] = balance[i-1] + amounts[i] - fees[i]
    return balance
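As a sanity check, here is the same recurrence in plain NumPy, so it runs without Numba installed. This particular recurrence happens to collapse to a cumulative sum, which makes it easy to verify; genuinely stateful logic (say, a balance clamped at zero after each step) has no such closed form and is where the JIT-compiled loop earns its keep.

```python
import numpy as np

amounts = np.array([100.0, 50.0, 200.0])
fees = np.array([5.0, 0.0, 10.0])

# Pure-Python reference for the @jit loop above
balance = np.zeros(len(amounts))
balance[0] = amounts[0] - fees[0]
for i in range(1, len(amounts)):
    balance[i] = balance[i - 1] + amounts[i] - fees[i]

# This simple recurrence equals a cumulative sum, a useful cross-check
assert np.allclose(balance, np.cumsum(amounts - fees))
print(balance)  # [ 95. 145. 335.]
```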

Calling external APIs per row: No way around it — each API call is independent I/O. Use asyncio with batching instead of sequential loops.
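A minimal shape for that pattern, with a semaphore bounding concurrency (fake_enrich is a stand-in for a real HTTP call via something like aiohttp):

```python
import asyncio

async def fake_enrich(txn_id: int) -> dict:
    # Stand-in for a real API call; just yields control to the event loop
    await asyncio.sleep(0)
    return {'txn_id': txn_id, 'enriched': True}

async def enrich_all(txn_ids, max_concurrency=20):
    # Bound in-flight requests so you don't flood the upstream service
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(txn_id):
        async with sem:
            return await fake_enrich(txn_id)

    return await asyncio.gather(*(bounded(t) for t in txn_ids))

results = asyncio.run(enrich_all(range(5)))
print(len(results))  # 5
```

The win over a sequential loop is overlap: while one request waits on the network, others proceed, so total time approaches the slowest batch rather than the sum of all calls.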

Complex business logic with multiple if/elif branches: Consider pd.cut() and Series.map() first. If the logic is genuinely impossible to vectorize, itertuples() is your fallback.
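For threshold-based branching like our categorization task, pd.cut() replaces the whole if/elif chain with bin edges. With the default right-closed bins, the boundaries below behave exactly like the strict `>` comparisons used earlier (100 and 1000 fall in the lower bucket either way):

```python
import numpy as np
import pandas as pd

amounts = pd.Series([12.0, 550.0, 2500.0])

# Bin edges mirror the if/elif thresholds: (0, 100], (100, 1000], (1000, inf]
category = pd.cut(
    amounts,
    bins=[0, 100, 1000, np.inf],
    labels=['low_value', 'medium_value', 'high_value'],
)
print(category.tolist())  # ['low_value', 'medium_value', 'high_value']
```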

The Decision Tree

Need row-level operation?
├─ Can it be expressed as column comparisons / arithmetic?
│ └─ YES → np.where() / np.select() / boolean masks [BEST]
├─ Does each row call an external function you can't change?
│ ├─ Does each call depend on previous results?
│ │ └─ YES → Numba @jit or Cython
│ └─ Are calls independent?
│ └─ YES → itertuples() or apply() on chunks with multiprocessing
└─ Complex sequential logic with state?
└─ Numba @jit (orders of magnitude faster than pure Python loops)

Memory Considerations

Vectorized operations aren’t free on memory. np.select() with 1M rows creates intermediate boolean arrays (~1MB each, since NumPy booleans are 1 byte). For truly large DataFrames (100M+ rows), consider processing in chunks, or reaching for out-of-core tools such as Dask.

The right tool depends on your data size. For anything fitting in RAM, pandas + NumPy vectorization is almost always sufficient — and the bottleneck is rarely computation once you stop iterating.

Stop using .iterrows(). Your future self (and your on-call rotation) will thank you.