Posts for: #Monitoring

Monitoring File Handles with 1975 Technology

Your process is leaking file handles. You need to track which processes are consuming handles over time, spot anomalies, and correlate with system behavior. Modern observability platforms want you to install 200MB Docker images, connect to cloud services, and pay subscription fees.

Or you could use six shell scripts totaling 150 lines.

The Tools

  • collect - Sample file handle counts every 5 minutes
  • avg - Calculate statistics (count, average, min, max)
  • graph - ASCII chart of handle counts over time
  • spikes - Find anomalies (2x average or custom threshold)
  • top - Show processes by handle consumption
  • timeline - Aggregate by time buckets (hourly, daily)
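
To make the statistics and anomaly rules above concrete, here is a minimal sketch in Python rather than the post's shell. It is not the author's code: the function names, the (timestamp, count) sample format, and the Linux-only /proc/<pid>/fd counting are all assumptions made for illustration.

```python
import os
from statistics import mean


def count_handles(pid):
    """Count open file descriptors for a process.

    Assumption: Linux, where /proc/<pid>/fd holds one entry per open descriptor.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def summarize(samples):
    """Return count, average, min and max for (timestamp, handle_count) samples."""
    counts = [count for _, count in samples]
    return {
        "count": len(counts),
        "average": mean(counts),
        "min": min(counts),
        "max": max(counts),
    }


def find_spikes(samples, factor=2.0, threshold=None):
    """Return samples above a custom threshold, or above factor x the average."""
    if not samples:
        return []
    if threshold is not None:
        limit = threshold
    else:
        limit = factor * mean(count for _, count in samples)
    return [(ts, count) for ts, count in samples if count > limit]
```

With samples such as `[("10:00", 100), ("10:05", 110), ("10:10", 90), ("10:15", 500)]`, the average is 200, so the default 2x rule flags only the 500 reading. Passing `threshold=105` flags both the 110 and the 500 readings.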

[Read more]

Happy Hashes: Know What’s Actually Running in Production

“It works on my machine.” “I thought we deployed that fix.” “Which commit is in prod?” “Is staging up to date?”

Version tags like v1.2.3 can end up pointing at different commits over time. Tags move. Tags get retagged. Git hashes don’t. A commit hash is a cryptographic digest of the commit’s contents and history, so the same hash means identical code.

The solution: Every service exposes a /version endpoint returning its git hash. Instantly verify what’s deployed.
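
As a rough illustration of such an endpoint, the sketch below uses only the Python standard library. It assumes the build has already placed the hash in an environment variable; the variable name `GIT_HASH`, the JSON key `git_hash`, and port 8000 are hypothetical choices, not details taken from the post.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def version_payload(environ=None):
    """Build the /version response body from the environment.

    Assumption: the build step exported the commit hash as GIT_HASH.
    """
    environ = os.environ if environ is None else environ
    return {"git_hash": environ.get("GIT_HASH", "unknown")}


class VersionHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /version is served; every other path gets a 404.
        if self.path != "/version":
            self.send_error(404)
            return
        body = json.dumps(version_payload()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the server until interrupted (hypothetical default port)."""
    HTTPServer(("0.0.0.0", port), VersionHandler).serve_forever()
```

Calling `serve()` and then requesting `/version` returns a small JSON document containing the hash. Comparing that value across environments answers the "which commit is in prod?" question without shell access to the host.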

Backend: Capture Hash at Build Time

Docker images don’t contain .git directories. Capture the hash during build and bake it into the image:

[Read more]

Learning from Failed Experiments: The Path to Production AI Success

Our failures teach us more than our successes. The teams that excel aren’t those that avoid failure - they’re those that fail fast, learn systematically, and iterate relentlessly.

Reframing Failure in AI Development

In traditional software, bugs are failures. In AI development, most experiments fail, and that’s not just acceptable - it’s essential. The key distinction is between three kinds of failure:

  • Productive failures: Experiments that conclusively prove an approach won’t work
  • Wasteful failures: Repeated mistakes from not capturing lessons learned
  • System failures: Production issues that impact users

Each requires different responses and offers different learning opportunities.

[Read more]