Posts for: #Data-Engineering

PostgreSQL for Production: The Generalist’s Database

PostgreSQL appears in every example stack across these articles. Not by accident. It's the generalist's database: it handles relational data, JSON documents, full-text search, vector embeddings, time-series, and geospatial data without a specialized database for each.

One database to learn deeply beats five databases known shallowly - especially when AI-assisted development makes human verification the bottleneck.

Why PostgreSQL Over Specialized Databases

For structured data: PostgreSQL’s ACID compliance and relational model work.

For semi-structured data: JSONB columns with indexing eliminate the need for MongoDB.
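
To make that concrete, here's a rough sketch of the pattern (the table and column names are illustrative, not taken from the full post):

```sql
-- Illustrative only: a JSONB column plus a GIN index covers most document-store use cases.
CREATE TABLE events (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    payload    jsonb NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- The GIN index accelerates containment (@>) and key-existence (?) queries on the document.
CREATE INDEX events_payload_gin ON events USING GIN (payload);

-- Query the documents the way you would in a document database.
SELECT id, payload->>'user_id' AS user_id
FROM events
WHERE payload @> '{"type": "signup"}';
```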

[Read more]

The Three Truths of Data-Oriented Development: Lessons from Production AI Systems

Mike Acton’s 2014 CppCon talk on data-oriented design fundamentally changed how I approach software engineering. After building AI systems serving millions of users, these principles have proven even more critical in production environments where data volume, transformation pipelines, and hardware constraints dominate success metrics.

Rather than frame these as “lies to avoid,” I’ve found greater value in articulating them as positive truths to embrace. These three principles have guided every production system I’ve architected, particularly in AI/ML contexts where data-oriented thinking isn’t optional—it’s fundamental.

[Read more]

FFmpeg for AI Training Data: Jump-Cut Automation

Video data represents one of the richest sources for training AI models, from action recognition to content moderation systems. However, raw video often contains significant noise - dead air, redundant frames, and irrelevant segments. Here’s a production-tested approach to automated video processing that has streamlined our training data preparation.

The Challenge: Extracting Signal from Video Noise

When building datasets for video understanding models, we frequently encounter:

  • Long pauses that add no informational value
  • Redundant segments that can skew model training
  • Inconsistent formats that break processing pipelines
  • Massive file sizes that inflate storage costs

The solution? Automated jump-cut processing with intelligent backup strategies.
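
The full pipeline, including the backup strategy, is in the post. As a minimal sketch of the core idea, one ffmpeg pass can log silent spans and a second pass can drop them from both streams; the file names, thresholds, and cut timestamps below are placeholders, not values from the post:

```bash
# Illustrative sketch; thresholds, file names, and timestamps are placeholders.

# Pass 1: log spans quieter than -35 dB that last at least 1 second.
ffmpeg -i input.mp4 -af silencedetect=noise=-35dB:d=1.0 -f null - 2> silence.log

# Pass 2: drop one detected span (12.3s to 15.8s here) from video and audio,
# then regenerate timestamps so the result plays as a clean jump cut.
ffmpeg -i input.mp4 \
  -vf "select='not(between(t,12.3,15.8))',setpts=N/FRAME_RATE/TB" \
  -af "aselect='not(between(t,12.3,15.8))',asetpts=N/SR/TB" \
  output.mp4
```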

[Read more]

Makefiles for ML Pipelines: Reproducible Builds That Scale

In the era of complex ML pipelines, where data processing, model training, and deployment involve dozens of interdependent steps, Makefiles provide a battle-tested solution for orchestration. While newer tools promise simplicity through abstraction, Makefiles offer transparency, portability, and power that modern AI systems demand.

Why Makefiles Excel in AI/ML Workflows

Modern ML projects involve intricate dependency chains:

  • Raw data → Cleaned data → Features → Training → Evaluation → Deployment
  • Model artifacts depend on specific data versions
  • Experiments must be reproducible across environments
  • Partial re-runs save computational resources

Makefiles handle these challenges elegantly through their fundamental design: declarative dependency management with intelligent rebuild detection.
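
The post walks through a full pipeline; as a minimal sketch of the pattern (script and file paths here are hypothetical), each stage declares its inputs and Make rebuilds only what is stale:

```make
# Illustrative only; recipe lines must start with a tab.
data/clean.parquet: data/raw.csv scripts/clean.py
	python scripts/clean.py $< $@

data/features.parquet: data/clean.parquet scripts/featurize.py
	python scripts/featurize.py $< $@

models/model.pkl: data/features.parquet scripts/train.py
	python scripts/train.py $< $@

reports/eval.json: models/model.pkl scripts/evaluate.py
	python scripts/evaluate.py $< $@

.PHONY: all
all: reports/eval.json
```

Touch data/raw.csv and "make all" re-runs every downstream stage; touch only scripts/train.py and just training and evaluation re-run, which is the partial re-run behavior listed above.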

[Read more]