Statistical data quality is a different discipline from application testing. When you're validating financial time series — interest rates, bank balance sheet positions, money market volumes — the failures are subtle. A value that's technically valid (a non-null number in the right range) can be statistically impossible given the surrounding context.

This post documents the validation approaches I developed working on a large-scale statistical data platform processing reporting data from Eurozone financial institutions. The patterns are general; the failure modes are specific to financial time series at scale.
## The Domain
European Central Bank statistical platforms like SPACE receive reporting data from central banks, commercial banks, and financial institutions across the Eurozone. This includes:
- Bank balance sheet statistics (assets, liabilities, by category and maturity)
- Interest rate statistics (lending rates, deposit rates by segment)
- Money market transaction volumes (repo, unsecured, OIS)
- Securities data (holdings, issuance)
The data comes from hundreds of reporting entities, at different frequencies (monthly, quarterly, annual), with complex reference data relationships (organizational hierarchies, instrument classifications). Total volume: hundreds of millions of time series, billions of observations.
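The validators later in this post consume that context through a per-series metadata record. A minimal sketch of what such a record might hold (the field names here are illustrative, not the platform's actual schema):

```python
from dataclasses import dataclass

@dataclass
class SeriesMetadata:
    series_id: str         # e.g. an SDMX-style series key
    frequency: str         # 'monthly', 'quarterly', 'annual'
    expected_sign: str     # 'positive', 'negative', 'any'
    reporting_entity: str  # links into the organizational hierarchy
    instrument_class: str  # instrument classification code
```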
## Failure Modes Unique to Financial Time Series
### 1. Sign Errors

Financial data has natural sign conventions. A bank's loan portfolio is positive (an asset). Deposits are positive (a liability). Net positions can be positive or negative. Errors happen when a reporter confuses these conventions:
```python
import pandas as pd

# Example: a reporting entity submits deposits as negative (confusion with net position).
# Valid in range? Yes (it's a number). Correct? No.

def validate_sign_convention(series: pd.Series, expected_sign: str) -> pd.Series:
    """
    expected_sign: 'positive', 'negative', 'any'
    Returns boolean Series (True = passes validation).
    """
    if expected_sign == 'positive':
        return series.isna() | (series >= 0)
    elif expected_sign == 'negative':
        return series.isna() | (series <= 0)
    return pd.Series(True, index=series.index)
```
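A quick sanity check with invented values, reproducing the deposits-as-negative case from the comment above:

```python
deposits = pd.Series([120.5, 118.2, -121.0])  # third value submitted with the wrong sign
print(validate_sign_convention(deposits, 'positive'))
# 0     True
# 1     True
# 2    False
# dtype: bool
```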
### 2. Magnitude Jumps (Non-Structural)

A balance sheet position that jumps 500% in one month without a structural break (merger, reporting methodology change) is almost certainly an error:
```python
def detect_magnitude_jumps(
    series: pd.Series,
    threshold_pct: float = 100.0,
    min_absolute: float = 1e6,  # Ignore tiny position changes
) -> pd.Series:
    """Returns boolean mask where True = suspicious jump detected."""
    pct_change = series.pct_change().abs() * 100
    abs_change = series.diff().abs()

    # Jump is suspicious if BOTH the percentage AND absolute change are large
    suspicious = (pct_change > threshold_pct) & (abs_change > min_absolute)

    return suspicious
```

The key insight: percentage-only thresholds generate false positives for small positions (a $1M → $2M move is 100% but often legitimate). Requiring both percentage AND absolute thresholds reduces noise.
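To make that concrete, a sketch with invented numbers; the first series would trip a percentage-only rule but passes the combined one:

```python
small = pd.Series([100_000.0, 300_000.0])  # +200%, but only $200k in absolute terms
large = pd.Series([50e6, 155e6])           # +210% and $105M in one step

print(detect_magnitude_jumps(small))  # all False: the absolute threshold suppresses the alert
print(detect_magnitude_jumps(large))  # second observation flagged
```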
### 3. Revision Magnitude
Financial data is frequently revised — preliminary → revised → final. But revisions should be small:
```python
def validate_revision(
    original: pd.Series,
    revised: pd.Series,
    max_revision_pct: float = 5.0,  # 5% revision is normal; 50% is suspicious
) -> pd.Series:
    """Flag revisions larger than threshold as potentially erroneous."""
    revision_pct = ((revised - original) / original.abs() * 100).abs()
    return revision_pct > max_revision_pct
```

For interest rates, a revision from 2.50% to 2.55% is fine. A revision from 2.50% to 25.0% is an error (likely a decimal point shift — a common data entry mistake).
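For instance, with invented values, the decimal-shift case looks like this:

```python
original = pd.Series([2.50, 3.10, 1.75])
revised = pd.Series([2.55, 31.0, 1.74])  # middle value: decimal point shifted one place

print(validate_revision(original, revised))
# 0    False   (2.0% revision: normal)
# 1     True   (900% revision: flagged)
# 2    False
# dtype: bool
```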
### 4. Additive Consistency

Many financial series have additive constraints: sub-items should sum to the total. This sounds obvious, but it fails frequently in practice:
```python
def validate_additivity(
    total: pd.Series,
    components: list[pd.Series],
    tolerance_pct: float = 0.01,  # 0.01% rounding tolerance
) -> pd.Series:
    """Check that components sum to total within tolerance (True = discrepancy flagged)."""
    computed_total = sum(components)
    discrepancy_pct = ((total - computed_total) / total.abs() * 100).abs()
    return discrepancy_pct > tolerance_pct
```

At scale, additivity failures are common because:
- Different systems compute the total and components independently
- Rounding rules differ between systems
- Data comes from different sources that haven’t been reconciled
The tolerance_pct parameter is important — requiring exact equality generates false positives from legitimate floating-point differences.
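A small example with invented numbers: a one-cent rounding difference passes, while a genuinely missing component fails:

```python
total = pd.Series([1000.00, 1000.00])
components = [pd.Series([400.00, 400.00]),
              pd.Series([600.01, 500.00])]  # period 0: 1 cent off; period 1: 100 missing

print(validate_additivity(total, components))
# 0    False  (0.001% discrepancy, within tolerance)
# 1     True  (10% discrepancy, flagged)
# dtype: bool
```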
### 5. Temporal Consistency

Certain values should be monotonically non-decreasing (cumulative totals) or bounded (rates within a plausible percentage band):
```python
def validate_temporal_constraints(
    series: pd.Series,
    constraint: str,  # 'non_decreasing', 'non_negative', 'rate'
) -> pd.Series:
    """Returns boolean mask where True = constraint violation."""
    if constraint == 'non_decreasing':
        return series.diff().fillna(0) < -1e-9  # Allow tiny rounding errors
    elif constraint == 'non_negative':
        return series < 0
    elif constraint == 'rate':
        # Interest/exchange rates: should be between -100 and 100 (as percentage)
        return (series < -100) | (series > 100)
    return pd.Series(False, index=series.index)
```

### 6. Missing Value Patterns
In financial reporting, missing values have meaning:
- Structural zero: The entity has no activity in this category (legitimate zero)
- Not applicable: The series doesn’t apply to this entity (should be null)
- Missing: The entity failed to report (data quality issue)
- Not yet reported: Future periods are legitimately null
Distinguishing these requires metadata:
```python
def classify_missing_values(
    series: pd.Series,
    expected_frequency: str,  # 'monthly', 'quarterly'
    reporting_deadline: pd.Timestamp,
    current_date: pd.Timestamp,
) -> pd.DataFrame:
    """
    Returns DataFrame with 'classification' for each null period.
    """
    null_periods = series[series.isna()]
    classifications = []

    for period in null_periods.index:
        if period > current_date:
            classification = 'future'
        elif period > reporting_deadline:
            classification = 'awaited'
        elif is_structurally_zero_period(period, expected_frequency):  # Domain-specific helper (not shown)
            classification = 'structural'
        else:
            classification = 'missing'  # Potential data quality issue

        classifications.append({
            'period': period,
            'classification': classification,
        })

    return pd.DataFrame(classifications)
```

## Building the Validation Pipeline
The architecture: each series passes through a sequence of validators, accumulating a quality report:
```python
from dataclasses import dataclass, field
from typing import Callable

import pandas as pd

@dataclass
class ValidationIssue:
    rule_id: str
    severity: str  # 'error', 'warning', 'info'
    description: str
    affected_periods: list
    details: dict = field(default_factory=dict)

@dataclass
class ValidationResult:
    series_id: str
    passed: bool
    issues: list[ValidationIssue] = field(default_factory=list)

    @property
    def errors(self):
        return [i for i in self.issues if i.severity == 'error']

    @property
    def warnings(self):
        return [i for i in self.issues if i.severity == 'warning']

class ValidationPipeline:
    def __init__(self):
        self.validators: list[Callable] = []

    def add_rule(self, validator_fn: Callable):
        self.validators.append(validator_fn)
        return self  # Enable chaining

    def run(self, series_id: str, series: pd.Series, metadata: dict) -> ValidationResult:
        result = ValidationResult(series_id=series_id, passed=True)

        for validator in self.validators:
            issues = validator(series, metadata)
            result.issues.extend(issues)
            if any(i.severity == 'error' for i in issues):
                result.passed = False

        return result

# Build the pipeline for bank balance sheet data
def build_balance_sheet_pipeline() -> ValidationPipeline:
    pipeline = ValidationPipeline()

    def check_completeness(series, meta):
        missing = series.isna().sum()
        if missing / len(series) > 0.2:  # >20% missing = error
            return [ValidationIssue(
                rule_id='COMPL_001',
                severity='error',
                description=f'Series has {missing / len(series) * 100:.1f}% missing values',
                affected_periods=list(series[series.isna()].index),
            )]
        return []

    def check_sign(series, meta):
        expected_sign = meta.get('expected_sign', 'positive')
        passes = validate_sign_convention(series, expected_sign)
        if (~passes).any():  # any values that FAIL validation?
            return [ValidationIssue(
                rule_id='SIGN_001',
                severity='error',
                description=f'Sign convention violation: expected {expected_sign}',
                affected_periods=list(series[~passes].index),
            )]
        return []

    def check_jumps(series, meta):
        suspicious = detect_magnitude_jumps(series, threshold_pct=200.0)
        if suspicious.any():
            return [ValidationIssue(
                rule_id='JUMP_001',
                severity='warning',
                description='Large magnitude jumps detected',
                affected_periods=list(series[suspicious].index),
                details={'max_jump_pct': float(series.pct_change().abs().max() * 100)},
            )]
        return []

    pipeline.add_rule(check_completeness)
    pipeline.add_rule(check_sign)
    pipeline.add_rule(check_jumps)
    return pipeline
```
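Wiring it together: a minimal run against a single invented series (the series key and metadata here are illustrative, not real platform identifiers):

```python
idx = pd.date_range('2024-01-01', periods=6, freq='MS')
series = pd.Series([500e6, 510e6, -505e6, 515e6, None, 520e6], index=idx)

pipeline = build_balance_sheet_pipeline()
result = pipeline.run('BSI.M.DE.EXAMPLE', series, {'expected_sign': 'positive'})

print(result.passed)  # False: the negative value triggers a SIGN_001 error
for issue in result.issues:
    print(issue.rule_id, issue.severity, issue.description)
```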
## Statistical Outlier Detection

For series without known constraints, statistical methods catch anomalies:
```python
def detect_statistical_outliers(
    series: pd.Series,
    method: str = 'iqr',
    sensitivity: float = 1.5,
) -> pd.Series:
    """
    Returns boolean mask of outlier positions.
    method: 'iqr' (interquartile range) or 'zscore'
    """
    clean = series.dropna()

    if method == 'iqr':
        Q1, Q3 = clean.quantile(0.25), clean.quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - sensitivity * IQR
        upper_bound = Q3 + sensitivity * IQR
        return (series < lower_bound) | (series > upper_bound)

    elif method == 'zscore':
        mean, std = clean.mean(), clean.std()
        z_scores = (series - mean) / std
        return z_scores.abs() > sensitivity  # sensitivity = threshold in std devs

    raise ValueError(f"Unknown method: {method}")
```

For financial data, IQR is preferred over z-score because financial distributions are fat-tailed — z-score would flag too many legitimate extreme values.
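A quick comparison on synthetic fat-tailed data, using a Student's t distribution with 2 degrees of freedom as a stand-in for heavy-tailed financial returns:

```python
import numpy as np

rng = np.random.default_rng(42)
returns = pd.Series(rng.standard_t(df=2, size=1000))

iqr_flags = detect_statistical_outliers(returns, method='iqr', sensitivity=1.5)
z_flags = detect_statistical_outliers(returns, method='zscore', sensitivity=3.0)
print(iqr_flags.sum(), z_flags.sum())  # the two methods flag noticeably different sets
```

Note that sensitivity means different things per method: IQR multiples for 'iqr', standard deviations for 'zscore'.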
## Reporting Quality Metrics
At scale, you need aggregate quality metrics, not just per-series results:
```python
def compute_quality_metrics(results: list[ValidationResult]) -> dict:
    total = len(results)
    passed = sum(1 for r in results if r.passed)
    error_counts = {}
    warning_counts = {}

    for result in results:
        for issue in result.errors:
            error_counts[issue.rule_id] = error_counts.get(issue.rule_id, 0) + 1
        for issue in result.warnings:
            warning_counts[issue.rule_id] = warning_counts.get(issue.rule_id, 0) + 1
    return {
        'total_series': total,
        'passed': passed,
        'pass_rate': passed / total * 100,
        'failed': total - passed,
        'top_errors': sorted(error_counts.items(), key=lambda x: -x[1])[:10],
        'top_warnings': sorted(warning_counts.items(), key=lambda x: -x[1])[:10],
    }
```
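Fed with results from the pipeline above, a daily report might be produced like this (load_series and series_ids are hypothetical placeholders for whatever storage layer backs the platform):

```python
results = [pipeline.run(sid, load_series(sid), get_series_metadata(sid))
           for sid in series_ids]  # load_series, series_ids: hypothetical helpers

metrics = compute_quality_metrics(results)
print(f"{metrics['pass_rate']:.1f}% of {metrics['total_series']:,} series passed")
print(metrics['top_errors'])  # e.g. [('SIGN_001', 42), ('COMPL_001', 17)] (illustrative)
```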
## Performance at Scale

Running this pipeline on 400M observations requires careful batching:
```python
import pyarrow.parquet as pq

def run_pipeline_at_scale(
    data_path: str,
    pipeline: ValidationPipeline,
) -> pd.DataFrame:
    """Process all series one Parquet row group at a time."""
    all_results = []

    # pd.read_parquet doesn't support chunksize — read row groups via pyarrow.
    # Assumes the file is written so that each series falls within a single row group.
    pf = pq.ParquetFile(data_path)
    for row_group_idx in range(pf.metadata.num_row_groups):
        batch_df = pf.read_row_group(
            row_group_idx, columns=['series_id', 'date', 'value']
        ).to_pandas()
        for series_id, group in batch_df.groupby('series_id'):
            series = group.set_index('date')['value'].sort_index()
            metadata = get_series_metadata(series_id)  # Reference-data lookup (not shown)
            result = pipeline.run(series_id, series, metadata)
            all_results.append({
                'series_id': series_id,
                'passed': result.passed,
                'error_count': len(result.errors),
                'warning_count': len(result.warnings),
            })

    return pd.DataFrame(all_results)
```

At 400M observations distributed across millions of series, each series is small enough to process in memory. The bottleneck is I/O — Parquet with predicate pushdown (filter by series_id) reduces this dramatically.
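For targeted re-validation, the predicate pushdown mentioned above might look like this (a sketch; the file name and series keys are invented):

```python
import pyarrow.parquet as pq

# Row groups whose series_id statistics cannot match the filter are skipped on disk
table = pq.read_table(
    'observations.parquet',
    columns=['series_id', 'date', 'value'],
    filters=[('series_id', 'in', ['BSI.M.DE.EXAMPLE', 'MIR.M.FR.EXAMPLE'])],
)
df = table.to_pandas()
```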
## Lessons Learned
- **Statistical validity ≠ domain validity**: A number can be syntactically correct and statistically plausible but violate domain knowledge (wrong sign, impossible revision, additivity failure). Domain rules catch what statistics miss.
- **Tolerance parameters matter enormously**: Too strict → alert fatigue. Too loose → real errors slip through. Calibrate against known-good historical data.
- **False positives have cost**: Analysts investigating false alarms develop alert fatigue. A 99% specificity rule at 1M checks/day generates 10,000 false positives. Design rules with precision as a first-class concern.
- **Metadata is as important as data**: Knowing that a series is quarterly vs monthly, that an entity had a merger in a specific period, that a methodology change was approved — this context is what turns statistical anomalies into actionable issues.
- **Quality monitoring is continuous**: Data quality degrades over time as reporting entities change systems, staff, and practices. Automated validation must run on every new submission, not just at initial onboarding.
The investment in data quality infrastructure pays compound returns. Every downstream analysis — forecasts, regulatory reports, research — is only as good as the data it’s built on.