Python Time Series at Scale — Lessons from Processing 400M Financial Records

Posted on: July 22, 2024 at 10:00 AM

The European Central Bank’s SPACE platform (Statistical Data Warehouse) processes statistical data from all Eurozone institutions — hundreds of millions of time series covering everything from bank balance sheets to money market volumes. Working on the data validation and enrichment pipeline for this system taught me more about Python memory management and time series processing than any tutorial ever could.

This post distills the most impactful lessons. I've applied everything here to production pipelines processing 400M+ records daily on modest hardware.


The Problem Space

Financial time series have specific characteristics that make them challenging: millions of distinct series, sparse and irregular observations, and frequent revisions.

A naive pandas approach (load everything into memory, operate, save) stops working around 10-20M rows on a standard machine. At 400M, you need to be deliberate about every operation.

Lesson 1: dtypes Are Everything

The most impactful optimization is always dtype selection. Pandas defaults to float64 and object — rarely the right choice:

```python
import pandas as pd
import numpy as np

# BAD: default dtypes for a typical financial time series DataFrame
df = pd.read_csv('timeseries.csv')
print(df.dtypes)
# date         object   ← ~50 bytes/row
# series_id    object   ← ~50 bytes/row
# value        float64  ← 8 bytes/row
# status       object   ← ~50 bytes/row
# revision     int64    ← 8 bytes/row
print(df.memory_usage(deep=True).sum() / 1e9, "GB")
# 1.2 GB for 10M rows

# GOOD: explicit dtype specification at read time
dtype_map = {
    'series_id': 'category',  # repeating string → category
    'value': 'float32',       # 4 bytes instead of 8 (sufficient for most financial values)
    'status': 'category',     # e.g., 'A', 'P', 'R' → category
    'revision': 'int16',      # revision numbers rarely exceed 32,767
}
df = pd.read_csv(
    'timeseries.csv',
    dtype=dtype_map,
    parse_dates=['date'],
)
# Date-only data: pandas ≥ 2.0 accepts only s/ms/us/ns units
# (NumPy's datetime64[D] is not a valid pandas dtype)
df['date'] = df['date'].astype('datetime64[s]')
print(df.memory_usage(deep=True).sum() / 1e9, "GB")
# 0.18 GB (6.5x reduction)
```

The category dtype stores repeated strings as an integer index into a lookup table. For a column with 1000 unique series_id values across 10M rows, instead of storing 10M strings (50 bytes each = 500MB), you store 10M integers (1-4 bytes each = 10-40MB) plus a 1000-entry lookup table.
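To see that effect in isolation, here is a small self-contained comparison. The ID format and cardinality are made up for the demo:

```python
import numpy as np
import pandas as pd

# Illustrative only: 1M rows drawn from 1,000 unique IDs
rng = np.random.default_rng(0)
ids = "SERIES." + pd.Series(rng.integers(0, 1000, size=1_000_000)).astype(str)

object_mb = ids.memory_usage(deep=True) / 1e6
category_mb = ids.astype('category').memory_usage(deep=True) / 1e6
print(f"object:   {object_mb:.1f} MB")   # one Python string object per row
print(f"category: {category_mb:.1f} MB") # int16 codes + 1,000-entry lookup table
```

With 1,000 categories the codes fit in int16, so the column shrinks by more than an order of magnitude.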

A coarser datetime unit doesn't save memory: every datetime64 variant is 64-bit, so day precision (NumPy's datetime64[D]) and second precision (datetime64[s], the coarsest unit pandas 2.x accepts) still use 8 bytes per value. The win is semantic: it signals intent and avoids accidental sub-day arithmetic. For date-only time series, nanosecond precision is conceptual waste even though the storage is the same.

Memory Usage Reference

| dtype | Bytes/element | Use when |
| --- | --- | --- |
| float64 | 8 | High-precision calculations |
| float32 | 4 | Most financial values (6-7 significant digits sufficient) |
| int64 | 8 | Large integers (IDs, counts) |
| int32 | 4 | Values fitting in [-2B, 2B] |
| int16 | 2 | Small integers (-32,768 to 32,767) |
| int8 | 1 | Categorical integers (0-127) |
| bool | 1 | Boolean flags |
| category | varies | Columns with repeated values |
| datetime64[ns] | 8 | Timestamps with sub-second precision |
| datetime64[D] | 8 | Date-only (NumPy; same storage as ns, signals intent) |
| object | 50+ | Avoid; always replace |
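When the DataFrame already exists in memory (say, handed over by an upstream step rather than read from CSV), the same idea applies after the fact; a minimal sketch using `pd.to_numeric`'s `downcast` option:

```python
import pandas as pd

# Toy frame with pandas' default 8-byte dtypes
df = pd.DataFrame({'value': [1.5, 2.5, 3.5], 'revision': [0, 1, 2]})

# Downcast each column to the smallest dtype that holds its actual values
df['value'] = pd.to_numeric(df['value'], downcast='float')        # float64 → float32
df['revision'] = pd.to_numeric(df['revision'], downcast='integer')  # int64 → int8
print(df.dtypes)
```

Note that `downcast='integer'` picks the smallest width for the values currently present, so it is only safe when future values stay in the same range.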

Lesson 2: Chunked Processing with Generator Pipelines

For files larger than available RAM, never load everything at once. Process in chunks:

```python
def process_timeseries_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Process a single chunk: a pure function, no side effects."""
    # Validate
    chunk = chunk.dropna(subset=['series_id', 'date'])
    chunk = chunk[chunk['value'].between(-1e15, 1e15)]
    # Enrich with derived columns
    chunk['year'] = chunk['date'].dt.year
    chunk['quarter'] = chunk['date'].dt.quarter
    # Compute moving average within each series_id group
    chunk = chunk.sort_values(['series_id', 'date'])
    chunk['value_ma4'] = (
        chunk.groupby('series_id', observed=True)['value']
        .transform(lambda x: x.rolling(4, min_periods=1).mean())
    )
    return chunk


def chunked_pipeline(filepath: str, chunksize: int = 500_000):
    """Generator that yields processed chunks."""
    reader = pd.read_csv(
        filepath,
        dtype=dtype_map,
        parse_dates=['date'],
        chunksize=chunksize,
    )
    for chunk in reader:
        chunk['date'] = chunk['date'].astype('datetime64[s]')
        yield process_timeseries_chunk(chunk)


# Write results incrementally using a PyArrow writer
# (pandas has no append mode for Parquet)
import pyarrow as pa
import pyarrow.parquet as pq

output_path = 'processed.parquet'
writer = None
for chunk in chunked_pipeline('timeseries.csv'):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        writer = pq.ParquetWriter(output_path, table.schema)
    writer.write_table(table)
if writer:
    writer.close()
```

The chunksize of 500,000 rows balances throughput (larger chunks mean fewer I/O round trips) against memory (smaller chunks mean lower peak usage). For our dtype-optimized DataFrame, 500K rows is roughly 20MB, comfortable on any machine with 4GB RAM.

Critical caveat: Rolling window calculations across chunk boundaries are incorrect. A 4-period moving average at the start of chunk 2 doesn’t include the last 3 rows of chunk 1. Solutions:

  1. Overlap chunks: Add the last N rows of the previous chunk to the beginning of the next
  2. Process per-group: Group by series_id first, then process each group
  3. Use Polars (discussed below) with lazy evaluation
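Option 1 can be sketched as a thin wrapper around the chunk reader (`chunks_with_overlap` is a hypothetical helper, not part of the pipeline above). The carried-over rows appear in two consecutive chunks, so duplicates must be dropped before writing:

```python
import pandas as pd

def chunks_with_overlap(reader, overlap: int = 3):
    """Prepend the last `overlap` rows of each series from the previous
    chunk, so rolling windows at a chunk start see enough history."""
    tail = None
    for chunk in reader:
        if tail is not None:
            chunk = pd.concat([tail, chunk], ignore_index=True)
        # carry the last `overlap` rows per series into the next chunk
        tail = chunk.groupby('series_id', observed=True).tail(overlap)
        yield chunk
```

A 4-period moving average needs `overlap=3` rows of history per series.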

Lesson 3: Parquet Over CSV — Always

For any data that isn’t final output to humans, use Parquet:

| Format | Read speed | Write speed | Size | dtype preservation |
| --- | --- | --- | --- | --- |
| CSV | 1x | 1x | 1x | No (everything is a string) |
| Parquet (snappy) | 5-10x | 3-5x | 0.3x | Yes |
| Parquet (zstd) | 4-8x | 2-4x | 0.2x | Yes |
| Feather/Arrow | 10-20x | 8-15x | 0.5x | Yes |

Parquet stores metadata including column dtypes, enabling pandas to read with correct types without specifying dtype manually. It also supports predicate pushdown — reading only the rows that match a filter without loading the full file:

```python
import pyarrow.parquet as pq

# Read only Q4 2023 without loading the full dataset
table = pq.read_table(
    'timeseries.parquet',
    filters=[('year', '=', 2023), ('quarter', '=', 4)],
)
df = table.to_pandas()
```

For our 400M record dataset: CSV is ~45GB, Parquet (zstd) is ~9GB. Read time drops from ~8 minutes to ~45 seconds.

Lesson 4: Vectorized Date Operations

Date parsing is surprisingly expensive. Avoid it in inner loops:

```python
# BAD: string parsing on every row
df['year'] = df['date_str'].apply(lambda x: int(x[:4]))
df['month'] = df['date_str'].apply(lambda x: int(x[5:7]))

# GOOD: parse once, extract with vectorized accessors
df['date'] = pd.to_datetime(df['date_str'], format='%Y-%m-%d')
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

# EVEN BETTER for grouping: pre-compute a period column once,
# since every .dt accessor call allocates a new array
df['period'] = df['date'].dt.to_period('M')  # 2024-07, 2024-08, ...
monthly = df.groupby('period')['value'].mean()
```

For time series alignment (resampling to a standard frequency), pandas .resample() is highly optimized:

```python
# Resample irregular observations to monthly, forward-filling short gaps
monthly = (
    df.set_index('date')
    .groupby('series_id', observed=True)['value']
    .resample('MS')    # Month Start
    .last()            # last observed value in each period
    # ffill per series; a plain .ffill() would leak values
    # across series boundaries in the MultiIndex result
    .groupby(level='series_id')
    .ffill(limit=3)    # forward fill up to 3 periods for short gaps
)
```

Lesson 5: The Sparse Data Problem

Financial reporting data is inherently sparse: many series have only a fraction of their periods populated. Storing a NaN for every missing period wastes both memory and compute:

```python
# For very sparse data: pandas SparseDtype stores only the
# non-null values plus their positions
df['value'] = df['value'].astype(pd.SparseDtype('float32', fill_value=np.nan))
print(df.memory_usage(deep=True)['value'] / 1e6, "MB")
# Each stored value costs ~8 bytes (float32 value + int32 position),
# so this pays off once well under half of the entries are populated
```

For extremely sparse series (< 10% populated), consider a different storage model — store only the observations that exist, indexed by (series_id, date), rather than a full period×series matrix.
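As a rough illustration of that tradeoff, a synthetic comparison assuming 5% density (the numbers are illustrative, not from the ECB pipeline):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_periods, n_series, density = 240, 1_000, 0.05  # 5% populated

# Full period × series matrix: every cell stored, observed or not
matrix = pd.DataFrame(
    np.where(rng.random((n_periods, n_series)) < density, 1.0, np.nan).astype('float32')
)

# Observation-only long format: one row per (period, series) that exists
long = (
    matrix.melt(var_name='series_id', value_name='value', ignore_index=False)
    .dropna(subset=['value'])
)

print(f"matrix: {matrix.memory_usage(deep=True).sum() / 1e6:.2f} MB")
print(f"long:   {long.memory_usage(deep=True).sum() / 1e6:.2f} MB")
```

At 5% density the long format wins even though each observation now carries its own index and series label; the crossover point depends on how wide those key columns are.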

Lesson 6: Memory Profiling Before Optimizing

Don’t guess — measure. memory_profiler is invaluable:

```python
# pip install memory-profiler
from memory_profiler import profile

@profile
def process_large_file():
    df = pd.read_csv('timeseries.csv', dtype=dtype_map)
    # ... operations
    result = df.groupby('series_id')['value'].agg(['mean', 'std'])
    return result

# Run: python -m memory_profiler my_script.py
```

Output shows memory usage line-by-line. In our pipeline, the groupby().agg() operation was creating 3 intermediate DataFrames (the groups, the aggregation result, and the final concatenation) — not obvious from the code alone.
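When memory_profiler isn't available, the standard library's tracemalloc gives a coarser but dependency-free view of the same question; a sketch on synthetic data:

```python
import tracemalloc
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'series_id': rng.integers(0, 100, size=500_000),
    'value': rng.random(500_000).astype('float32'),
})

# Snapshot allocations around the suspect operation
tracemalloc.start()
result = df.groupby('series_id')['value'].agg(['mean', 'std'])
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak extra memory during groupby/agg: {peak / 1e6:.1f} MB")
```

tracemalloc only counts allocations made after `start()`, so this reports the transient cost of the operation rather than total process memory.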

Real Numbers: Before and After

Pipeline that validates and enriches 400M time series observations:

| Metric | Before | After |
| --- | --- | --- |
| Peak memory | 48 GB | 8 GB |
| Runtime | 4.5 hours | 38 minutes |
| Max file size processed | ~50M rows | Unlimited (chunked) |
| Server cost/run | $45 (128GB RAM) | $3 (16GB RAM) |

The 7x improvement came from: dtype optimization (3x), chunked processing (enables large files), Parquet I/O (2x), and eliminating redundant operations found via profiling.

When to Move to Polars

Polars is a Rust-based DataFrame library that outperforms pandas in most benchmarks, especially for large datasets. The Python API is similar enough to migrate gradually:

```python
import polars as pl

# Polars: lazy evaluation (nothing is computed until .collect())
result = (
    pl.scan_parquet('timeseries.parquet')  # doesn't load into memory
    .filter(pl.col('year') == 2023)
    .group_by(['series_id', 'quarter'])
    .agg([
        pl.col('value').mean().alias('mean_value'),
        pl.col('value').std().alias('std_value'),
    ])
    .collect()  # execute the optimized query plan
)
```

Polars handles the chunking internally with streaming mode, parallelizes across cores automatically, and avoids the GIL entirely (Rust threads). For our 400M record use case, Polars would cut runtime further to ~12 minutes with no manual chunking.

The tradeoff: smaller ecosystem, different API, some pandas-specific features don’t exist yet. For new pipelines: Polars. For existing pandas code: optimize pandas first.

The Golden Rules

  1. Specify dtypes at read time — never let pandas guess
  2. Use Parquet for everything except final human-readable output
  3. Chunk large files — never load more than fits in ~25% of available RAM
  4. Profile before optimizing — you’ll always be surprised
  5. Vectorize everything — if you’re writing a Python for-loop over rows, there’s a better way
  6. Category dtype for string columns with repeated values
  7. Coarse datetime units for date-only series: datetime64[s] in pandas (which accepts only s/ms/us/ns), datetime64[D] in NumPy (signals intent and avoids sub-day arithmetic bugs)

Processing at scale is mostly about being respectful of memory. The algorithm is often secondary to avoiding unnecessary allocations.