The European Central Bank’s SPACE platform (Statistical Data Warehouse) processes statistical data from all Eurozone institutions — hundreds of millions of time series covering everything from bank balance sheets to money market volumes. Working on the data validation and enrichment pipeline for this system taught me more about Python memory management and time series processing than any tutorial ever could.
This post distills the most impactful lessons; everything here has been applied to production pipelines processing 400M+ records daily on modest hardware.
Table of contents
- The Problem Space
- Lesson 1: dtypes Are Everything
- Lesson 2: Chunked Processing with Generator Pipelines
- Lesson 3: Parquet Over CSV — Always
- Lesson 4: Vectorized Date Operations
- Lesson 5: The Sparse Data Problem
- Lesson 6: Memory Profiling Before Optimizing
- Real Numbers: Before and After
- When to Move to Polars
- The Golden Rules
- Related posts
The Problem Space
Financial time series have specific characteristics that make them challenging:
- Irregular timestamps: Submissions arrive late, corrected, or in non-standard formats
- Sparse data: A reporting entity might have data for 60% of expected periods (the rest is null)
- Multiple granularities: Monthly, quarterly, and annual series mixed together
- Revision history: The same period might have multiple vintages (preliminary → revised → final)
- Reference data joins: Each observation links to organizational hierarchies that change over time
A naive pandas approach (load everything into memory, operate, save) stops working around 10-20M rows on a standard machine. At 400M, you need to be deliberate about every operation.
Lesson 1: dtypes Are Everything
The most impactful optimization is always dtype selection. Pandas defaults to float64 and object — rarely the right choice:
```python
import pandas as pd
import numpy as np

# BAD: default dtypes for a typical financial time series DataFrame
df = pd.read_csv('timeseries.csv')
print(df.dtypes)
# date       object   ← 50 bytes/row
# series_id  object   ← 50 bytes/row
# value      float64  ← 8 bytes/row
# status     object   ← 50 bytes/row
# revision   int64    ← 8 bytes/row

print(df.memory_usage(deep=True).sum() / 1e9, "GB")
# 1.2 GB for 10M rows

# GOOD: explicit dtype specification at read time
dtype_map = {
    'series_id': 'category',  # repeating string → category
    'value': 'float32',       # 4 bytes instead of 8 (sufficient for most financial values)
    'status': 'category',     # e.g., 'A', 'P', 'R' → category
    'revision': 'int16',      # revision numbers rarely exceed 32,767
}

df = pd.read_csv(
    'timeseries.csv',
    dtype=dtype_map,
    parse_dates=['date'],
)
df['date'] = df['date'].dt.normalize()  # date only: strip time-of-day (pandas has no [D] resolution)

print(df.memory_usage(deep=True).sum() / 1e9, "GB")
# 0.18 GB — 6.5x reduction
```

The category dtype stores repeated strings as an integer index into a lookup table. For a column with 1000 unique series_id values across 10M rows, instead of storing 10M strings (50 bytes each = 500MB), you store 10M integers (1-4 bytes each = 10-40MB) plus a 1000-entry lookup table.
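The effect is easy to verify on synthetic data. A minimal sketch (the SER… IDs are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical illustration: 1M rows drawn from 1,000 unique series IDs
ids = np.random.choice([f"SER{i:04d}" for i in range(1000)], size=1_000_000)

as_object = pd.Series(ids, dtype='object')
as_category = pd.Series(ids, dtype='category')

print(f"object:   {as_object.memory_usage(deep=True) / 1e6:.0f} MB")
print(f"category: {as_category.memory_usage(deep=True) / 1e6:.0f} MB")
```

With 1,000 categories, the codes fit in 2-byte integers, so the category column is well over an order of magnitude smaller.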
NumPy's datetime64[D] (day precision) still uses 8 bytes internally (all datetime64 variants are 64-bit), so the gain is intent rather than storage. Pandas itself does not support day resolution; as of pandas 2.0 its coarsest unit is datetime64[s]. For date-only series, normalize timestamps to midnight with .dt.normalize() or cast to datetime64[s]: either signals intent and avoids accidental sub-day arithmetic, whereas nanosecond precision on date-only data is conceptual waste even though the storage is the same.
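A concrete sketch of both options (note that the datetime64[s] cast requires pandas ≥ 2.0; the example dates are made up):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2024-07-15 09:30:00', '2024-07-16 00:00:00']))

day_only = s.dt.normalize()                    # midnight timestamps, still datetime64[ns]
second_res = day_only.astype('datetime64[s]')  # coarser unit, pandas >= 2.0

print(day_only.dtype, second_res.dtype)
```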
Memory Usage Reference
| dtype | Bytes/element | Use when |
|---|---|---|
| float64 | 8 | High-precision calculations |
| float32 | 4 | Most financial values (6-7 sig digits sufficient) |
| int64 | 8 | Large integers (IDs, counts) |
| int32 | 4 | Values fitting in [-2B, 2B] |
| int16 | 2 | Small integers (-32768 to 32767) |
| int8 | 1 | Categorical integers (0-127) |
| bool | 1 | Boolean flags |
| category | varies | Columns with repeated values |
| datetime64[ns] | 8 | Timestamps with sub-second precision |
| datetime64[D] | 8 | Date-only (NumPy only; in pandas, normalize or cast to [s]) |
| object | 50+ | Avoid — always replace |
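These rules can be wrapped in a small helper that downcasts a whole DataFrame. A sketch: the `shrink` name and the 50% cardinality threshold are my own choices, not a pandas API:

```python
import pandas as pd

def shrink(df: pd.DataFrame, cat_threshold: float = 0.5) -> pd.DataFrame:
    """Downcast numeric columns and convert low-cardinality strings to category."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_float_dtype(s):
            out[col] = pd.to_numeric(s, downcast='float')    # float64 → float32 where safe
        elif pd.api.types.is_integer_dtype(s):
            out[col] = pd.to_numeric(s, downcast='integer')  # int64 → int8/16/32 where safe
        elif s.dtype == object and s.nunique() < cat_threshold * len(s):
            out[col] = s.astype('category')                  # repeated strings → integer codes
    return out
```

A helper like this is handy after any read that couldn't take a dtype map up front, but an explicit dtype map at read time remains the better default because it avoids the float64/object intermediates entirely.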
Lesson 2: Chunked Processing with Generator Pipelines
For files larger than available RAM, never load everything at once. Process in chunks:
```python
def process_timeseries_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Process a single chunk — pure function, no side effects."""
    # Validate
    chunk = chunk.dropna(subset=['series_id', 'date'])
    chunk = chunk[chunk['value'].between(-1e15, 1e15)].copy()

    # Enrich with derived columns
    chunk['year'] = chunk['date'].dt.year
    chunk['quarter'] = chunk['date'].dt.quarter

    # Compute moving average within each series_id group
    chunk = chunk.sort_values(['series_id', 'date'])
    chunk['value_ma4'] = (
        chunk.groupby('series_id')['value']
        .transform(lambda x: x.rolling(4, min_periods=1).mean())
    )

    return chunk


def chunked_pipeline(filepath: str, chunksize: int = 500_000):
    """Generator that yields processed chunks."""
    reader = pd.read_csv(
        filepath,
        dtype=dtype_map,
        parse_dates=['date'],
        chunksize=chunksize,
    )
    for chunk in reader:
        chunk['date'] = chunk['date'].dt.normalize()  # date-only
        yield process_timeseries_chunk(chunk)


# Write results incrementally using a PyArrow writer
# (pandas has no append mode for Parquet)
import pyarrow as pa
import pyarrow.parquet as pq

output_path = 'processed.parquet'
writer = None

for chunk in chunked_pipeline('timeseries.csv'):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        writer = pq.ParquetWriter(output_path, table.schema)
    writer.write_table(table)

if writer:
    writer.close()
```

A chunksize of 500,000 rows balances throughput (larger chunks = fewer I/O round trips) against memory (smaller chunks = lower peak usage). For our dtype-optimized DataFrame, 500K rows is roughly 20 MB, comfortable on any machine with 4 GB of RAM.
Critical caveat: Rolling window calculations across chunk boundaries are incorrect. A 4-period moving average at the start of chunk 2 doesn’t include the last 3 rows of chunk 1. Solutions:
- Overlap chunks: Add the last N rows of the previous chunk to the beginning of the next
- Process per-group: group by series_id first, then process each group
- Use Polars (discussed below) with lazy evaluation
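The overlap approach can be sketched as a thin wrapper around any chunk iterator. This assumes a single pre-sorted series (per-series data would need a group-aware tail); the `overlapping_chunks` helper is my own illustration, not a pandas API:

```python
import pandas as pd

def overlapping_chunks(chunks, overlap: int = 3):
    """Prefix each chunk with the last `overlap` rows of the previous one.

    Yields (chunk, carried): the caller computes rolling windows on the
    augmented chunk, then drops the first `carried` rows, which were already
    emitted as part of the previous chunk.
    """
    tail = None
    for chunk in chunks:
        carried = 0 if tail is None else len(tail)
        if tail is not None:
            chunk = pd.concat([tail, chunk], ignore_index=True)
        tail = chunk.tail(overlap)
        yield chunk, carried
```

With overlap set to window size minus one, the stitched-together result matches a rolling window computed over the full series in one pass.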
Lesson 3: Parquet Over CSV — Always
For any data that isn’t final output to humans, use Parquet:
| Format | Read speed | Write speed | Size | dtype preservation |
|---|---|---|---|---|
| CSV | 1x | 1x | 1x | No (everything is string) |
| Parquet (snappy) | 5-10x | 3-5x | 0.3x | Yes |
| Parquet (zstd) | 4-8x | 2-4x | 0.2x | Yes |
| Feather/Arrow | 10-20x | 8-15x | 0.5x | Yes |
Parquet stores metadata including column dtypes, enabling pandas to read with correct types without specifying dtype manually. It also supports predicate pushdown — reading only the rows that match a filter without loading the full file:
```python
import pyarrow.parquet as pq

# Read only Q4 2023 without loading the full dataset
table = pq.read_table(
    'timeseries.parquet',
    filters=[('year', '=', 2023), ('quarter', '=', 4)],
)
df = table.to_pandas()
```

For our 400M record dataset: CSV is ~45GB, Parquet (zstd) is ~9GB. Read time drops from ~8 minutes to ~45 seconds.
Lesson 4: Vectorized Date Operations
Date parsing is surprisingly expensive. Avoid it in inner loops:
```python
# BAD: string parsing on every row
df['year'] = df['date_str'].apply(lambda x: int(x[:4]))
df['month'] = df['date_str'].apply(lambda x: int(x[5:7]))

# GOOD: parse once, extract with vectorized accessors
df['date'] = pd.to_datetime(df['date_str'], format='%Y-%m-%d')
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

# EVEN BETTER: precompute a period column for grouping.
# The .dt accessors allocate a new array on every call; for repeated
# groupby operations, compute the key once and reuse it:
df['period'] = df['date'].dt.to_period('M')  # '2024-07', '2024-08', etc.
monthly = df.groupby('period')['value'].mean()
```

For time series alignment (resampling to a standard frequency), pandas .resample() is highly optimized:
```python
# Resample irregular observations to monthly, forward-filling short gaps
monthly = (
    df.set_index('date')
    .groupby('series_id')['value']
    .resample('MS')              # Month Start
    .last()                      # last observed value in each period
    .groupby(level='series_id')  # fill within each series so values
    .ffill(limit=3)              # never leak across series boundaries
)
```

Lesson 5: The Sparse Data Problem
Financial reporting data is inherently sparse. A series might have 60% of periods populated. Storing NaN for every missing period wastes both memory and compute:
```python
# For very sparse data: pandas SparseDtype
# stores only the non-fill values plus their integer positions
df['value'] = df['value'].astype(pd.SparseDtype('float32', fill_value=np.nan))
print(df.memory_usage(deep=True)['value'] / 1e6, "MB")
# Note: each stored value also carries a 4-byte position index, so
# SparseDtype only pays off when most values are the fill_value
```

For extremely sparse series (< 10% populated), consider a different storage model: store only the observations that exist, indexed by (series_id, date), rather than a full period×series matrix.
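A toy comparison of the two storage models. The sizes and ~5% fill rate are made up for illustration, but at that sparsity the long format wins comfortably:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_obs = 10_000  # ~5% of a 200-series × 1000-period matrix

# Long format: one row per observation that actually exists
long = pd.DataFrame({
    'series_id': rng.integers(0, 200, n_obs).astype('int16'),
    'period': rng.integers(0, 1000, n_obs).astype('int16'),
    'value': rng.normal(size=n_obs).astype('float32'),
}).drop_duplicates(['series_id', 'period'])

# Dense format: full period × series matrix, mostly NaN
dense = long.pivot(index='period', columns='series_id', values='value')

print(f"long:  {long.memory_usage(deep=True).sum() / 1e6:.2f} MB")
print(f"dense: {dense.memory_usage(deep=True).sum() / 1e6:.2f} MB")
```

The long format also keeps validation simple: every row is a real observation, so there is no need to distinguish "missing because never reported" from "missing because not yet due".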
Lesson 6: Memory Profiling Before Optimizing
Don’t guess — measure. memory_profiler is invaluable:
```python
# pip install memory-profiler
from memory_profiler import profile

@profile
def process_large_file():
    df = pd.read_csv('timeseries.csv', dtype=dtype_map)
    # ... operations
    result = df.groupby('series_id')['value'].agg(['mean', 'std'])
    return result

# Run: python -m memory_profiler my_script.py
```

Output shows memory usage line by line. In our pipeline, the groupby().agg() operation was creating three intermediate DataFrames (the groups, the aggregation result, and the final concatenation), which was not obvious from the code alone.
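When installing memory_profiler isn't an option, the stdlib tracemalloc gives a coarser but dependency-free view of peak usage (toy data here stands in for the real CSV; tracemalloc only sees allocations routed through Python's allocator, which recent NumPy versions report to it):

```python
import tracemalloc

import numpy as np
import pandas as pd

tracemalloc.start()

df = pd.DataFrame({
    'series_id': np.random.choice(['A', 'B', 'C'], 100_000),
    'value': np.random.normal(size=100_000),
})
result = df.groupby('series_id')['value'].agg(['mean', 'std'])

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
```

The gap between peak and current is exactly the transient intermediates you are hunting for.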
Real Numbers: Before and After
Pipeline that validates and enriches 400M time series observations:
| Metric | Before | After |
|---|---|---|
| Peak memory | 48 GB | 8 GB |
| Runtime | 4.5 hours | 38 minutes |
| Max file size processed | ~50M rows | Unlimited (chunked) |
| Server cost/run | $45 (128GB RAM) | $3 (16GB RAM) |
The 7x improvement came from: dtype optimization (3x), chunked processing (enables large files), Parquet I/O (2x), and eliminating redundant operations found via profiling.
When to Move to Polars
Polars is a Rust-based DataFrame library that outperforms pandas in most benchmarks, especially for large datasets. The Python API is similar enough to migrate gradually:
```python
import polars as pl

# Polars: lazy evaluation (nothing is computed until .collect())
result = (
    pl.scan_parquet('timeseries.parquet')  # doesn't load into memory
    .filter(pl.col('year') == 2023)
    .group_by(['series_id', 'quarter'])
    .agg([
        pl.col('value').mean().alias('mean_value'),
        pl.col('value').std().alias('std_value'),
    ])
    .collect()  # execute the query
)
```

Polars handles the chunking internally with streaming mode, parallelizes across cores automatically, and avoids the GIL entirely (Rust threads). For our 400M record use case, Polars would cut runtime further to ~12 minutes with no manual chunking.
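For side-by-side reference while migrating, here is the same aggregation in eager pandas (toy data standing in for the Parquet file; column names follow the query above):

```python
import pandas as pd

df = pd.DataFrame({
    'series_id': ['S1', 'S1', 'S2', 'S2', 'S2'],
    'year': [2023, 2023, 2022, 2023, 2023],
    'quarter': [4, 4, 1, 1, 1],
    'value': [1.0, 3.0, 9.0, 2.0, 4.0],
})

# Filter, then named aggregation per (series_id, quarter) group
result = (
    df[df['year'] == 2023]
    .groupby(['series_id', 'quarter'], as_index=False)['value']
    .agg(mean_value='mean', std_value='std')
)
print(result)
```

The pandas version materializes the filtered frame before grouping; the Polars query plans the whole chain and can push the filter down into the scan.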
The tradeoff: smaller ecosystem, different API, some pandas-specific features don’t exist yet. For new pipelines: Polars. For existing pandas code: optimize pandas first.
The Golden Rules
- Specify dtypes at read time — never let pandas guess
- Use Parquet for everything except final human-readable output
- Chunk large files — never load more than fits in ~25% of available RAM
- Profile before optimizing — you’ll always be surprised
- Vectorize everything — if you’re writing a Python for-loop over rows, there’s a better way
- Category dtype for string columns with repeated values
- Normalized dates (or datetime64[s] in pandas ≥ 2.0) for date-only series: signals intent and avoids sub-day arithmetic bugs
Processing at scale is mostly about being respectful of memory. The algorithm is often secondary to avoiding unnecessary allocations.
Related posts
- Vectorization in Python — NumPy vs Pandas vs Polars vs Numba — systematic benchmarks for the vectorization techniques used throughout this pipeline
- Polars vs Pandas Benchmark — When to Switch and When to Stay — comparative benchmarks on the kind of aggregation and resampling work common in financial time series