PostgreSQL for Production: The Generalist’s Database
PostgreSQL appears in every example stack across these articles. Not by accident. It’s the generalist’s database: it handles relational data, JSON documents, full-text search, vector embeddings, time-series, and geospatial data without a specialized database for each.
One database to learn deeply beats five databases known shallowly. Especially when AI-assisted development makes human verification the bottleneck.
Why PostgreSQL Over Specialized Databases
For structured data: PostgreSQL’s ACID compliance and relational model work.
For semi-structured data: JSONB columns with indexing eliminate the need for MongoDB.
For full-text search: Built-in text search capabilities handle most use cases without Elasticsearch.
For vector embeddings: pgvector extension supports cosine similarity without Pinecone.
For time-series: Partitioning and continuous aggregates cover basic time-series needs without InfluxDB.
For caching: Materialized views handle read-heavy precomputation without a separate Redis (though Redis is still useful for ephemeral data).
You learn one database deeply instead of six databases shallowly. When AI generates queries, you can verify them. When performance degrades, you can optimize. When things break, you can fix them.
JSONB: When Schema Flexibility Matters
AI systems evolve. Model outputs change. Features get added. Rigid schemas become migration nightmares.
JSONB provides schema flexibility with indexing:
CREATE TABLE model_outputs (
id SERIAL PRIMARY KEY,
model_name TEXT NOT NULL,
input_data JSONB NOT NULL,
prediction JSONB NOT NULL,
metadata JSONB NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
-- Index specific JSON fields
CREATE INDEX idx_model_version ON model_outputs ((metadata->>'model_version'));
CREATE INDEX idx_prediction_class ON model_outputs ((prediction->>'class'));
-- Query with JSON operators
SELECT * FROM model_outputs
WHERE (prediction->>'confidence')::float > 0.9
AND metadata->>'model_version' = 'v2.1.0';
Schema changes without migrations. Add fields by inserting JSON with new structure. Query existing and new fields simultaneously.
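A hypothetical new model version can start emitting extra fields (the calibration key below is invented for illustration); older rows simply return NULL for keys they lack:
-- New field appears in the JSON payload; no ALTER TABLE needed
INSERT INTO model_outputs (model_name, input_data, prediction, metadata)
VALUES (
    'classifier-v3',
    '{"text": "example input"}',
    '{"class": "spam", "confidence": 0.97, "calibration": 0.88}',
    '{"model_version": "v3.0.0"}'
);
-- Rows without the new key yield NULL and drop out of the filter
SELECT id, prediction->>'class' AS class
FROM model_outputs
WHERE (prediction->>'calibration')::float > 0.8;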
Connection Pooling at Scale
PostgreSQL connections are expensive. Each connection consumes memory. Default limit: 100 connections.
Production systems exceed this easily:
10 API servers × 20 connections each = 200 connections needed
Solution: PgBouncer
# pgbouncer.ini
[databases]
myapp = host=postgres.internal port=5432 dbname=production
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
1000 client connections map onto 20 PostgreSQL connections: a 50x reduction. Transaction pooling hands a server connection to a client only for the duration of each transaction, so session-level features (SET, LISTEN/NOTIFY, session advisory locks) don’t carry over between transactions; use pool_mode = session if you depend on them.
Application connects to PgBouncer instead of PostgreSQL directly:
# Instead of postgres.internal:5432
DATABASE_URL = "postgresql://pgbouncer.internal:6432/myapp"
Query Performance: N+1 Problem
Common pattern that kills performance:
# Anti-pattern: N+1 queries
users = db.query("SELECT id FROM users WHERE active = true")
for user in users:
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user.id)
    # Process orders
100 users = 101 queries (1 for users, 100 for orders).
Fix with JOIN:
query = """
SELECT u.id, u.email, o.id as order_id, o.total, o.status
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.active = true
"""
results = db.query(query)
One query. Same data. Roughly 100x fewer round trips.
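If the application needs one record per user rather than repeated user columns, the regrouping can also stay in SQL. A sketch using json_agg (column names taken from the query above):
-- One row per user, orders nested as a JSON array (empty array when none)
SELECT u.id, u.email,
       COALESCE(
         json_agg(
           json_build_object('order_id', o.id, 'total', o.total, 'status', o.status)
           ORDER BY o.id
         ) FILTER (WHERE o.id IS NOT NULL),
         '[]'
       ) AS orders
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.active = true
GROUP BY u.id, u.email;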
Indexing Strategy
Columns used in WHERE clauses, JOIN conditions, and ORDER BY are candidates for an index:
-- Slow query
SELECT * FROM model_logs
WHERE model_name = 'classifier-v2'
AND created_at > '2025-12-01'
ORDER BY created_at DESC
LIMIT 100;
-- Add indexes
CREATE INDEX idx_model_logs_name_created
ON model_logs (model_name, created_at DESC);
Use EXPLAIN ANALYZE to verify index usage:
EXPLAIN ANALYZE
SELECT * FROM model_logs
WHERE model_name = 'classifier-v2'
AND created_at > '2025-12-01'
ORDER BY created_at DESC
LIMIT 100;
Look for “Index Scan” rather than “Seq Scan”. On a large table, a Seq Scan usually means a missing index (on small tables the planner may prefer a sequential scan even when an index exists).
Partitioning for Large Tables
Tables past roughly 100M rows often benefit from partitioning:
-- Partition by time for logs/events
CREATE TABLE model_logs (
id BIGSERIAL,
model_name TEXT,
created_at TIMESTAMP NOT NULL,
data JSONB
) PARTITION BY RANGE (created_at);
-- Create partitions
CREATE TABLE model_logs_2025_12
PARTITION OF model_logs
FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');
CREATE TABLE model_logs_2026_01
PARTITION OF model_logs
FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
Queries targeting specific time ranges scan only relevant partitions. Deleting old data becomes DROP TABLE instead of slow DELETE operations.
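Retiring a month of logs, for example, is a metadata-level operation:
-- Detach the old partition, then drop it: no row-by-row DELETE, no bloat
ALTER TABLE model_logs DETACH PARTITION model_logs_2025_12;
DROP TABLE model_logs_2025_12;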
Full-Text Search
-- Add text search column
ALTER TABLE documents ADD COLUMN search_vector tsvector;
-- Generate search vectors
UPDATE documents
SET search_vector = to_tsvector('english', coalesce(title, '') || ' ' || coalesce(content, ''));
-- Index for fast search
CREATE INDEX idx_documents_search ON documents USING gin(search_vector);
-- Search query
SELECT * FROM documents
WHERE search_vector @@ to_tsquery('english', 'machine & learning')
ORDER BY ts_rank(search_vector, to_tsquery('english', 'machine & learning')) DESC
LIMIT 10;
Handles most full-text search needs without Elasticsearch overhead.
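On PostgreSQL 12 and later, a generated column can keep the vector in sync automatically instead of relying on periodic UPDATEs. A sketch (named search_vector_auto here to avoid clashing with the column added above; index it with GIN the same way):
-- Recomputed on every INSERT/UPDATE; no trigger or batch job needed
ALTER TABLE documents
ADD COLUMN search_vector_auto tsvector
GENERATED ALWAYS AS (
    to_tsvector('english', coalesce(title, '') || ' ' || coalesce(content, ''))
) STORED;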
Vector Embeddings with pgvector
-- Enable the pgvector extension (the package must be installed on the server)
CREATE EXTENSION vector;
-- Table with embeddings
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT,
embedding vector(512) -- 512-dimensional vectors
);
-- Index for similarity search
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- Similarity search (query_embedding is a bind parameter supplied by the application)
SELECT id, content, 1 - (embedding <=> query_embedding) as similarity
FROM documents
ORDER BY embedding <=> query_embedding
LIMIT 10;
Semantic search without external vector database. Works for millions of embeddings (though specialized vector DBs scale better for billions).
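Two ivfflat details worth knowing: the index clusters best when built after the table already holds data, and recall at query time depends on how many lists each query probes (default 1):
-- Probe more lists per query: better recall, slower search
SET ivfflat.probes = 10;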
Materialized Views for Performance
Expensive queries that don’t need real-time data:
-- Create materialized view
CREATE MATERIALIZED VIEW daily_metrics AS
SELECT
date_trunc('day', created_at) as day,
model_name,
COUNT(*) as predictions,
AVG((prediction->>'confidence')::float) as avg_confidence
FROM model_outputs
GROUP BY 1, 2;
-- Index it (unique: required for REFRESH ... CONCURRENTLY)
CREATE UNIQUE INDEX ON daily_metrics (day, model_name);
-- Refresh nightly via cron
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_metrics;
Dashboard queries hit materialized view (fast) instead of scanning billions of rows (slow).
Transaction Isolation for Consistency
AI systems often need consistent reads across multiple tables:
with db.transaction(isolation='repeatable read'):
    # Read model config
    config = db.query("SELECT * FROM model_configs WHERE active = true")
    # Read feature stats (guaranteed same snapshot)
    stats = db.query("SELECT * FROM feature_stats WHERE model_id = ?", config.id)
    # Compute using consistent snapshot
    result = compute(config, stats)
Without proper isolation, config might update between reads, causing inconsistent results.
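The same guarantee in plain SQL (the model_id value is a placeholder):
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM model_configs WHERE active = true;
-- Sees the same snapshot as the SELECT above, even if another session
-- commits changes to model_configs in between
SELECT * FROM feature_stats WHERE model_id = 42;
COMMIT;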
Monitoring Critical Queries
-- Enable pg_stat_statements
CREATE EXTENSION pg_stat_statements;
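-- Note: pg_stat_statements must also be listed in shared_preload_libraries
-- (postgresql.conf) and the server restarted before it collects statistics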
-- Find slow queries
SELECT
query,
calls,
mean_exec_time,
max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Find queries causing most load
SELECT
query,
calls,
total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
Optimize the queries that matter most.
Backup and Recovery
Continuous archiving with WAL:
# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://backups/wal/%f'
Every transaction’s WAL is archived to S3. Point-in-time recovery to any moment the archive covers.
Daily snapshots:
#!/bin/bash
# Daily backup script: custom-format dump, shipped to S3
set -euo pipefail
STAMP=$(date +%Y%m%d)
pg_dump -Fc myapp > "backup-${STAMP}.dump"
aws s3 cp "backup-${STAMP}.dump" s3://backups/daily/
Full backup daily. WAL for recovery between snapshots.
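The restore side mirrors the archive path. A minimal sketch of the recovery settings (PostgreSQL 12+; the target time is illustrative), combined with an empty recovery.signal file in the data directory:
# postgresql.conf on the instance being restored
restore_command = 'aws s3 cp s3://backups/wal/%f %p'
recovery_target_time = '2026-01-15 09:30:00'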
Replication for High Availability
# primary postgresql.conf
wal_level = replica
max_wal_senders = 3
# replica postgresql.conf
hot_standby = on
Setup streaming replication:
# On replica
pg_basebackup -h primary -D /var/lib/postgresql/data -U replication -P -R   # -R writes standby.signal and primary_conninfo
Primary handles writes. Replicas handle reads. Automatic failover with tools like Patroni.
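To verify that streaming is healthy, the primary exposes per-replica state in pg_stat_replication:
-- Run on the primary: one row per connected replica
SELECT client_addr, state, replay_lag FROM pg_stat_replication;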
When to Use Something Else
PostgreSQL isn’t optimal for:
- Billions of time-series points: Use TimescaleDB (a PostgreSQL extension) or InfluxDB
- Petabyte-scale analytics: Use BigQuery or Snowflake
- Sub-millisecond caching: Use Redis
- Billions of vector embeddings: Use Pinecone or Weaviate
- Graph traversals: Use Neo4j
But for 90% of use cases, PostgreSQL handles it. Start with PostgreSQL. Move to a specialized database only when concrete metrics show PostgreSQL can’t meet the requirement.
The Complete Knowledge Principle
Knowing PostgreSQL deeply means knowing:
- Query optimization (EXPLAIN, indexes, query plans)
- Schema design (normalization, partitioning)
- Transaction isolation levels
- Backup and recovery procedures
- Replication setup
- Performance tuning (shared_buffers, work_mem)
- Extension ecosystem (pgvector, PostGIS, pg_stat_statements)
This knowledge enables effective review of AI-generated code. When AI suggests a query, you immediately know whether it will perform. When AI designs a schema, you catch normalization issues. When AI configures replication, you spot the mistakes.
One database, known deeply, verified quickly, run reliably.