The European Central Bank’s SPACE platform (Statistical Data Warehouse) processes statistical data from all Eurozone institutions — hundreds of millions of time series covering everything from bank balance sheets to money market volumes. Working on the data validation and enrichment pipeline for this system taught me more about Python memory management and time series processing than any tutorial ever could.
This post distills the most impactful lessons; everything here has been applied to production pipelines processing 400M+ records daily on modest hardware.
Table of contents
- The Problem Space
- Lesson 1: dtypes Are Everything
- Lesson 2: Chunked Processing with Generator Pipelines
- Lesson 3: Parquet Over CSV — Always
- Lesson 4: Vectorized Date Operations
- Lesson 5: The Sparse Data Problem
- Lesson 6: Memory Profiling Before Optimizing
- Real Numbers: Before and After
- When to Move to Polars
- The Golden Rules
- Related posts
The Problem Space
Financial time series have specific characteristics that make them challenging:
- Irregular timestamps: Submissions arrive late, corrected, or in non-standard formats
- Sparse data: A reporting entity might have data for 60% of expected periods (the rest is null)
- Multiple granularities: Monthly, quarterly, and annual series mixed together
- Revision history: The same period might have multiple vintages (preliminary → revised → final)
- Reference data joins: Each observation links to organizational hierarchies that change over time
A naive pandas approach (load everything into memory, operate, save) stops working around 10-20M rows on a standard machine. At 400M, you need to be deliberate about every operation.
Lesson 1: dtypes Are Everything
The most impactful optimization is always dtype selection. Pandas defaults to float64 and object — rarely the right choice:
```python
import pandas as pd
import numpy as np

# BAD: default dtypes for a typical financial time series DataFrame
df = pd.read_csv('timeseries.csv')
print(df.dtypes)
# date       object   ← 50 bytes/row
# series_id  object   ← 50 bytes/row
# value      float64  ← 8 bytes/row
# status     object   ← 50 bytes/row
# revision   int64    ← 8 bytes/row

print(df.memory_usage(deep=True).sum() / 1e9, "GB")
# 1.2 GB for 10M rows

# GOOD: explicit dtype specification at read time
dtype_map = {
    'series_id': 'category',  # repeating string → category
    'value': 'float32',       # 4 bytes instead of 8 (sufficient for most financial values)
    'status': 'category',     # e.g., 'A', 'P', 'R' → category
    'revision': 'int16',      # revision numbers rarely exceed 32,767
}

df = pd.read_csv(
    'timeseries.csv',
    dtype=dtype_map,
    parse_dates=['date'],
)
df['date'] = df['date'].dt.normalize()  # date only: strip time-of-day (pandas has no [D] resolution)

print(df.memory_usage(deep=True).sum() / 1e9, "GB")
# 0.18 GB — 6.5x reduction
```

The category dtype stores repeated strings as an integer index into a lookup table. For a column with 1000 unique series_id values across 10M rows, instead of storing 10M strings (50 bytes each = 500MB), you store 10M integers (1-4 bytes each = 10-40MB) plus a 1000-entry lookup table.
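The effect is easy to verify on synthetic data. A minimal sketch (the SER… IDs are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical illustration: 1M rows drawn from 1,000 unique series IDs
ids = np.random.choice([f"SER{i:04d}" for i in range(1000)], size=1_000_000)

as_object = pd.Series(ids, dtype='object')
as_category = pd.Series(ids, dtype='category')

print(f"object:   {as_object.memory_usage(deep=True) / 1e6:.0f} MB")
print(f"category: {as_category.memory_usage(deep=True) / 1e6:.0f} MB")
```

With 1,000 categories, the codes fit in 2-byte integers, so the category column is well over an order of magnitude smaller.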
NumPy's datetime64[D] (day precision) still uses 8 bytes internally (all datetime64 variants are 64-bit), so the gain is intent rather than storage. Pandas itself does not support day resolution; as of pandas 2.0 its coarsest unit is datetime64[s]. For date-only series, normalize timestamps to midnight with .dt.normalize() or cast to datetime64[s]: either signals intent and avoids accidental sub-day arithmetic, whereas nanosecond precision on date-only data is conceptual waste even though the storage is the same.
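A concrete sketch of both options (note that the datetime64[s] cast requires pandas ≥ 2.0; the example dates are made up):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2024-07-15 09:30:00', '2024-07-16 00:00:00']))

day_only = s.dt.normalize()                    # midnight timestamps, still datetime64[ns]
second_res = day_only.astype('datetime64[s]')  # coarser unit, pandas >= 2.0

print(day_only.dtype, second_res.dtype)
```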
Memory Usage Reference
| dtype | Bytes/element | Use when |
|---|---|---|
| float64 | 8 | High-precision calculations |
| float32 | 4 | Most financial values (6-7 sig digits sufficient) |
| int64 | 8 | Large integers (IDs, counts) |
| int32 | 4 | Values fitting in [-2B, 2B] |
| int16 | 2 | Small integers (-32768 to 32767) |
| int8 | 1 | Categorical integers (0-127) |
| bool | 1 | Boolean flags |
| category | varies | Columns with repeated values |
| datetime64[ns] | 8 | Timestamps with sub-second precision |
| datetime64[D] | 8 | Date-only (NumPy only; in pandas, normalize or cast to [s]) |
| object | 50+ | Avoid — always replace |
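These rules can be wrapped in a small helper that downcasts a whole DataFrame. A sketch: the `shrink` name and the 50% cardinality threshold are my own choices, not a pandas API:

```python
import pandas as pd

def shrink(df: pd.DataFrame, cat_threshold: float = 0.5) -> pd.DataFrame:
    """Downcast numeric columns and convert low-cardinality strings to category."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_float_dtype(s):
            out[col] = pd.to_numeric(s, downcast='float')    # float64 → float32 where safe
        elif pd.api.types.is_integer_dtype(s):
            out[col] = pd.to_numeric(s, downcast='integer')  # int64 → int8/16/32 where safe
        elif s.dtype == object and s.nunique() < cat_threshold * len(s):
            out[col] = s.astype('category')                  # repeated strings → integer codes
    return out
```

A helper like this is handy after any read that couldn't take a dtype map up front, but an explicit dtype map at read time remains the better default because it avoids the float64/object intermediates entirely.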
Lesson 2: Chunked Processing with Generator Pipelines
For files larger than available RAM, never load everything at once. Process in chunks:
```python
def process_timeseries_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Process a single chunk — pure function, no side effects."""
    # Validate
    chunk = chunk.dropna(subset=['series_id', 'date'])
    chunk = chunk[chunk['value'].between(-1e15, 1e15)].copy()

    # Enrich with derived columns
    chunk['year'] = chunk['date'].dt.year
    chunk['quarter'] = chunk['date'].dt.quarter

    # Compute moving average within each series_id group
    chunk = chunk.sort_values(['series_id', 'date'])
    chunk['value_ma4'] = (
        chunk.groupby('series_id')['value']
        .transform(lambda x: x.rolling(4, min_periods=1).mean())
    )

    return chunk


def chunked_pipeline(filepath: str, chunksize: int = 500_000):
    """Generator that yields processed chunks."""
    reader = pd.read_csv(
        filepath,
        dtype=dtype_map,
        parse_dates=['date'],
        chunksize=chunksize,
    )
    for chunk in reader:
        chunk['date'] = chunk['date'].dt.normalize()  # date-only
        yield process_timeseries_chunk(chunk)


# Write results incrementally using a PyArrow writer
# (pandas has no append mode for Parquet)
import pyarrow as pa
import pyarrow.parquet as pq

output_path = 'processed.parquet'
writer = None

for chunk in chunked_pipeline('timeseries.csv'):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        writer = pq.ParquetWriter(output_path, table.schema)
    writer.write_table(table)

if writer:
    writer.close()
```

A chunksize of 500,000 rows balances throughput (larger chunks = fewer I/O round trips) against memory (smaller chunks = lower peak usage). For our dtype-optimized DataFrame, 500K rows is roughly 20 MB, comfortable on any machine with 4 GB of RAM.
Critical caveat: Rolling window calculations across chunk boundaries are incorrect. A 4-period moving average at the start of chunk 2 doesn’t include the last 3 rows of chunk 1. Solutions:
- Overlap chunks: Add the last N rows of the previous chunk to the beginning of the next
- Process per-group: group by series_id first, then process each group
- Use Polars (discussed below) with lazy evaluation
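The overlap approach can be sketched as a thin wrapper around any chunk iterator. This assumes a single pre-sorted series (per-series data would need a group-aware tail); the `overlapping_chunks` helper is my own illustration, not a pandas API:

```python
import pandas as pd

def overlapping_chunks(chunks, overlap: int = 3):
    """Prefix each chunk with the last `overlap` rows of the previous one.

    Yields (chunk, carried): the caller computes rolling windows on the
    augmented chunk, then drops the first `carried` rows, which were already
    emitted as part of the previous chunk.
    """
    tail = None
    for chunk in chunks:
        carried = 0 if tail is None else len(tail)
        if tail is not None:
            chunk = pd.concat([tail, chunk], ignore_index=True)
        tail = chunk.tail(overlap)
        yield chunk, carried
```

With overlap set to window size minus one, the stitched-together result matches a rolling window computed over the full series in one pass.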
Lesson 3: Parquet Over CSV — Always
For any data that isn’t final output to humans, use Parquet:
| Format | Read speed | Write speed | Size | dtype preservation |
|---|---|---|---|---|
| CSV | 1x | 1x | 1x | No (everything is string) |
| Parquet (snappy) | 5-10x | 3-5x | 0.3x | Yes |
| Parquet (zstd) | 4-8x | 2-4x | 0.2x | Yes |
| Feather/Arrow | 10-20x | 8-15x | 0.5x | Yes |
Parquet stores metadata including column dtypes, enabling pandas to read with correct types without specifying dtype manually. It also supports predicate pushdown — reading only the rows that match a filter without loading the full file:
```python
import pyarrow.parquet as pq

# Read only Q4 2023 without loading the full dataset
table = pq.read_table(
    'timeseries.parquet',
    filters=[('year', '=', 2023), ('quarter', '=', 4)],
)
df = table.to_pandas()
```

For our 400M record dataset: CSV is ~45GB, Parquet (zstd) is ~9GB. Read time drops from ~8 minutes to ~45 seconds.
Lesson 4: Vectorized Date Operations
Date parsing is surprisingly expensive. Avoid it in inner loops:
```python
# BAD: string parsing on every row
df['year'] = df['date_str'].apply(lambda x: int(x[:4]))
df['month'] = df['date_str'].apply(lambda x: int(x[5:7]))

# GOOD: parse once, extract with vectorized accessors
df['date'] = pd.to_datetime(df['date_str'], format='%Y-%m-%d')
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

# EVEN BETTER: precompute a period column for grouping.
# The .dt accessors allocate a new array on every call; for repeated
# groupby operations, compute the key once and reuse it:
df['period'] = df['date'].dt.to_period('M')  # '2024-07', '2024-08', etc.
monthly = df.groupby('period')['value'].mean()
```

For time series alignment (resampling to a standard frequency), pandas .resample() is highly optimized:
```python
# Resample irregular observations to monthly, forward-filling short gaps
monthly = (
    df.set_index('date')
    .groupby('series_id')['value']
    .resample('MS')              # Month Start
    .last()                      # last observed value in each period
    .groupby(level='series_id')  # fill within each series so values
    .ffill(limit=3)              # never leak across series boundaries
)
```

Lesson 5: The Sparse Data Problem
Financial reporting data is inherently sparse. A series might have 60% of periods populated. Storing NaN for every missing period wastes both memory and compute:
```python
# For very sparse data: pandas SparseDtype
# stores only the non-fill values plus their integer positions
df['value'] = df['value'].astype(pd.SparseDtype('float32', fill_value=np.nan))
print(df.memory_usage(deep=True)['value'] / 1e6, "MB")
# Note: each stored value also carries a 4-byte position index, so
# SparseDtype only pays off when most values are the fill_value
```

For extremely sparse series (< 10% populated), consider a different storage model: store only the observations that exist, indexed by (series_id, date), rather than a full period×series matrix.
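A toy comparison of the two storage models. The sizes and ~5% fill rate are made up for illustration, but at that sparsity the long format wins comfortably:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_obs = 10_000  # ~5% of a 200-series × 1000-period matrix

# Long format: one row per observation that actually exists
long = pd.DataFrame({
    'series_id': rng.integers(0, 200, n_obs).astype('int16'),
    'period': rng.integers(0, 1000, n_obs).astype('int16'),
    'value': rng.normal(size=n_obs).astype('float32'),
}).drop_duplicates(['series_id', 'period'])

# Dense format: full period × series matrix, mostly NaN
dense = long.pivot(index='period', columns='series_id', values='value')

print(f"long:  {long.memory_usage(deep=True).sum() / 1e6:.2f} MB")
print(f"dense: {dense.memory_usage(deep=True).sum() / 1e6:.2f} MB")
```

The long format also keeps validation simple: every row is a real observation, so there is no need to distinguish "missing because never reported" from "missing because not yet due".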
Lesson 6: Memory Profiling Before Optimizing
Don’t guess — measure. memory_profiler is invaluable:
```python
# pip install memory-profiler
from memory_profiler import profile

@profile
def process_large_file():
    df = pd.read_csv('timeseries.csv', dtype=dtype_map)
    # ... operations
    result = df.groupby('series_id')['value'].agg(['mean', 'std'])
    return result

# Run: python -m memory_profiler my_script.py
```

Output shows memory usage line by line. In our pipeline, the groupby().agg() operation was creating three intermediate DataFrames (the groups, the aggregation result, and the final concatenation), which was not obvious from the code alone.
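When installing memory_profiler isn't an option, the stdlib tracemalloc gives a coarser but dependency-free view of peak usage (toy data here stands in for the real CSV; tracemalloc only sees allocations routed through Python's allocator, which recent NumPy versions report to it):

```python
import tracemalloc

import numpy as np
import pandas as pd

tracemalloc.start()

df = pd.DataFrame({
    'series_id': np.random.choice(['A', 'B', 'C'], 100_000),
    'value': np.random.normal(size=100_000),
})
result = df.groupby('series_id')['value'].agg(['mean', 'std'])

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
```

The gap between peak and current is exactly the transient intermediates you are hunting for.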
Real Numbers: Before and After
Pipeline that validates and enriches 400M time series observations:
| Metric | Before | After |
|---|---|---|
| Peak memory | 48 GB | 8 GB |
| Runtime | 4.5 hours | 38 minutes |
| Max file size processed | ~50M rows | Unlimited (chunked) |
| Server cost/run | $45 (128GB RAM) | $3 (16GB RAM) |
The 7x improvement came from: dtype optimization (3x), chunked processing (enables large files), Parquet I/O (2x), and eliminating redundant operations found via profiling.
When to Move to Polars
Polars is a Rust-based DataFrame library that outperforms pandas in most benchmarks, especially for large datasets. The Python API is similar enough to migrate gradually:
```python
import polars as pl

# Polars: lazy evaluation (nothing is computed until .collect())
result = (
    pl.scan_parquet('timeseries.parquet')  # doesn't load into memory
    .filter(pl.col('year') == 2023)
    .group_by(['series_id', 'quarter'])
    .agg([
        pl.col('value').mean().alias('mean_value'),
        pl.col('value').std().alias('std_value'),
    ])
    .collect()  # execute the query
)
```

Polars handles the chunking internally with streaming mode, parallelizes across cores automatically, and avoids the GIL entirely (Rust threads). For our 400M record use case, Polars would cut runtime further to ~12 minutes with no manual chunking.
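For side-by-side reference while migrating, here is the same aggregation in eager pandas (toy data standing in for the Parquet file; column names follow the query above):

```python
import pandas as pd

df = pd.DataFrame({
    'series_id': ['S1', 'S1', 'S2', 'S2', 'S2'],
    'year': [2023, 2023, 2022, 2023, 2023],
    'quarter': [4, 4, 1, 1, 1],
    'value': [1.0, 3.0, 9.0, 2.0, 4.0],
})

# Filter, then named aggregation per (series_id, quarter) group
result = (
    df[df['year'] == 2023]
    .groupby(['series_id', 'quarter'], as_index=False)['value']
    .agg(mean_value='mean', std_value='std')
)
print(result)
```

The pandas version materializes the filtered frame before grouping; the Polars query plans the whole chain and can push the filter down into the scan.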
The tradeoff: smaller ecosystem, different API, some pandas-specific features don’t exist yet. For new pipelines: Polars. For existing pandas code: optimize pandas first.
The Golden Rules
- Specify dtypes at read time — never let pandas guess
- Use Parquet for everything except final human-readable output
- Chunk large files — never load more than fits in ~25% of available RAM
- Profile before optimizing — you’ll always be surprised
- Vectorize everything — if you’re writing a Python for-loop over rows, there’s a better way
- Category dtype for string columns with repeated values
- Normalized dates (or datetime64[s] in pandas ≥ 2.0) for date-only series: signals intent and avoids sub-day arithmetic bugs
Processing at scale is mostly about being respectful of memory. The algorithm is often secondary to avoiding unnecessary allocations.
Related posts
- Vectorization in Python — NumPy vs Pandas vs Polars vs Numba — systematic benchmarks for the vectorization techniques used throughout this pipeline
- Polars vs Pandas Benchmark — When to Switch and When to Stay — comparative benchmarks on the kind of aggregation and resampling work common in financial time series