Tag: data-engineering
All the articles with the tag "data-engineering".
Python vs JavaScript DataFrames in the Browser — Live Benchmarks with No Backend
Posted on: March 22, 2026 at 10:00 AM
Both Python (pandas via Pyodide WebAssembly) and JavaScript (Arquero) can process DataFrames entirely in the browser. This post runs the same groupby, filter, and pivot benchmarks in both (live, client-side, no server needed) and measures the real tradeoffs.
Financial Time Series Validation — QA Lessons from a European Central Bank Platform
Posted on: November 10, 2025 at 10:00 AM
Data quality lessons from building the validation pipeline for a statistical platform processing hundreds of millions of financial time series. Covers the specific failure modes that occur at scale, validation strategies that work, and anomaly detection approaches for financial data.
MongoDB Aggregation Pipelines — The Analytics Engine Inside Your Document Database
Posted on: June 18, 2025 at 10:00 AM
Most engineers use MongoDB for CRUD and reach for PostgreSQL the moment they need analytics. But the aggregation pipeline is a full transformation engine, with composable stages, multi-collection joins, and array unpacking. Here's what it can actually do, using real financial event data as the example.
Polars vs Pandas — A Benchmark That Changed How I Process Data
Posted on: October 14, 2024 at 10:00 AM
Comprehensive benchmarks comparing Polars and pandas across groupby, join, filter, and window operations on datasets from 1M to 100M rows. Polars wins by 5-20x in most scenarios; here's what that means for your data pipelines.
Data Science Fundamentals — Why Choosing the Right Average Matters More Than You Think
Posted on: July 23, 2024 at 10:00 AM
A companion to my technical article on measures of central tendency: arithmetic, geometric, and harmonic means, median, mode, and when each one is correct. Understanding which average to use, and which one to distrust, is the foundation of honest data analysis.
Python Time Series at Scale — Lessons from Processing 400M Financial Records
Posted on: July 22, 2024 at 10:00 AM
Real-world lessons from building a time series pipeline that processes 400 million financial data points daily. Covers memory layout, chunked processing, dtype optimization, and the specific pandas/NumPy patterns that keep memory under control at scale.
Pandas Performance — Stop Using .iterrows() (with Benchmarks)
Posted on: March 14, 2024 at 10:00 AM
Benchmarking five approaches to row-level operations in pandas, from the naive .iterrows() to fully vectorized NumPy operations, with real timing numbers. Shows 100-1000x speedups using vectorization and explains why Python's object model makes .iterrows() so slow.