The Three Truths of Data-Oriented Development: Lessons from Production AI Systems
Mike Acton’s 2014 CppCon talk on data-oriented design fundamentally changed how I approach software engineering. After building AI systems that serve millions of users, I’ve found these principles even more critical in production environments, where data volume, transformation pipelines, and hardware constraints determine success.
Rather than frame these as “lies to avoid,” I’ve found greater value in articulating them as positive truths to embrace. These three principles have guided every production system I’ve architected, particularly in AI/ML contexts where data-oriented thinking isn’t optional—it’s fundamental.
Truth 1: Programming is Data Transformation#
Core Principle: Every program exists to transform data from one form to another. Code is the mechanism; data is the substance.
The Production Reality#
In AI systems, this truth becomes undeniable. Consider a typical ML inference pipeline:
# Surface-level view: "We're running a model"
# Reality: We're transforming data through multiple representations
def process_inference_request(raw_input: bytes) -> Dict[str, float]:
"""
Data transformation pipeline for model inference.
Each step transforms data representation for the next stage.
"""
# Transform 1: bytes → structured dict
parsed_input = json.loads(raw_input)
# Transform 2: dict → validated schema
validated = InputSchema.validate(parsed_input)
# Transform 3: schema → normalized tensors
normalized = feature_normalizer.transform(validated)
# Transform 4: tensors → model input format
model_input = preprocessor.prepare_batch([normalized])
# Transform 5: model input → raw predictions
raw_output = model.predict(model_input)
# Transform 6: raw predictions → business metrics
predictions = postprocessor.interpret(raw_output)
return predictions
Each transformation stage has distinct characteristics:
- Input format: JSON, protobuf, tensor shapes
- Memory layout: Contiguous arrays, sparse matrices, GPU memory
- Numerical representation: int8 quantization, float32 precision
- Access patterns: Sequential scans, random lookups, batched operations
Understanding these data transformations—not just the algorithmic logic—determines system performance.
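A quick way to keep these characteristics visible is to inspect them between stages. The following is a minimal sketch assuming NumPy arrays; the helper name describe_stage is my own illustration, not part of the pipeline above:
import numpy as np
def describe_stage(name: str, arr: np.ndarray) -> None:
    """Print the representation details that actually drive performance."""
    print(
        f"{name}: dtype={arr.dtype}, shape={arr.shape}, "
        f"contiguous={arr.flags['C_CONTIGUOUS']}, "
        f"zero_ratio={np.mean(arr == 0):.2f}"
    )
# Example: inspect the tensor handed to the model
describe_stage('model_input', np.zeros((32, 512), dtype=np.float16))
Logging this between stages makes it obvious when a transformation silently densifies a sparse array or upcasts everything to float64.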
Practical Application in AI Systems#
Understand Your Data Deeply
Before architecting a solution, characterize your data:
def analyze_training_data(dataset_path: Path) -> StorageStrategy:
"""
Profile dataset characteristics to inform architecture decisions.
"""
profile = {
'volume': count_samples(dataset_path),
'dimensionality': measure_feature_space(dataset_path),
'sparsity': calculate_zero_ratio(dataset_path),
'distribution': analyze_value_distributions(dataset_path),
'access_pattern': determine_typical_queries(dataset_path),
'growth_rate': project_future_volume(dataset_path),
}
# Architecture follows data characteristics
if profile['sparsity'] > 0.9:
return SparseMatrixStrategy(profile)
elif profile['access_pattern'] == 'sequential':
return ColumnStoreStrategy(profile)
else:
return determine_optimal_layout(profile)
Different Data Demands Different Solutions
One of my most expensive lessons: attempting to force a single storage solution across diverse datasets.
# Anti-pattern: One-size-fits-all approach
class UniversalDataStore:
"""This seemed elegant. It was expensive."""
def store(self, data: Any) -> None:
# Generic storage that performs poorly for all use cases
pass
# Reality: Data characteristics drive storage decisions
class ImageEmbeddingStore:
"""Optimized for 512-dim float vectors, cosine similarity"""
def __init__(self):
self.index = FaissIVFIndex(dims=512, metric='cosine')
class UserEventStore:
"""Optimized for time-series writes, range queries"""
def __init__(self):
self.db = TimescaleDB()
class DocumentStore:
"""Optimized for full-text search, relevance ranking"""
def __init__(self):
self.index = ElasticsearchIndex()
Each solution reflects the data’s actual usage patterns, not an idealized abstraction.
Hardware Determines Solution Cost
In production AI, hardware awareness isn’t optional:
def optimize_batch_inference(
model: Model,
inputs: List[Input],
hardware_profile: HardwareProfile
) -> List[Output]:
"""
Batch size and processing strategy determined by hardware constraints.
"""
if hardware_profile.has_gpu:
# GPU throughput maximized by large batches
optimal_batch = min(
hardware_profile.gpu_memory // model.memory_per_sample,
128 # Sweet spot for most models
)
return gpu_batch_process(model, inputs, optimal_batch)
else:
# CPU optimized for smaller batches, parallel processing
optimal_batch = hardware_profile.cpu_cores * 2
return cpu_parallel_process(model, inputs, optimal_batch)
The same model, serving identical predictions, can vary roughly 10x in cost depending on how well the deployment matches the hardware.
Preserve Data Context
Over-simplification destroys valuable information:
# Anti-pattern: Discarding context
def prepare_text(text: str) -> str:
"""Simple preprocessing that loses critical information"""
return text.lower().strip()
# Better: Preserve context for downstream decisions
def prepare_text_with_context(text: str) -> PreparedText:
"""Retain information that may matter for some use cases"""
return PreparedText(
original=text,
normalized=text.lower().strip(),
metadata={
            'had_leading_whitespace': text[:1].isspace(),
            'had_mixed_case': text != text.lower() and text != text.upper(),
'length_original': len(text),
'detected_language': detect_language(text),
}
)
Different downstream consumers need different aspects of the data. Preserve optionality.
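As a small illustration (the consumer functions here are hypothetical), two downstream consumers can take different views of the same PreparedText without re-parsing the raw input:
def index_for_search(prepared: PreparedText) -> str:
    # Search indexing only needs the normalized form
    return prepared.normalized
def needs_translation(prepared: PreparedText) -> bool:
    # Routing needs metadata the "simple" pipeline would have discarded
    return prepared.metadata['detected_language'] != 'en'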
Truth 2: Design Around Actual Data Transformations, Not Idealized Models#
Core Principle: Ground your architecture in concrete data transformations rather than abstract world models. Engineering beats philosophy.
The Abstraction Trap in AI Systems#
Early in my career, I built what I thought was an elegant “universal feature processor”:
# The Attractive But Wrong Approach
class FeatureProcessor:
"""A beautiful abstraction that didn't survive production"""
def process(self, feature: Feature) -> ProcessedFeature:
# Generic processing pipeline for "any feature"
validated = self.validate(feature)
normalized = self.normalize(validated)
transformed = self.transform(normalized)
return transformed
This abstraction promised simplicity. It delivered complexity. Why?
- Image features need spatial locality preservation
- Text features need sequence order maintenance
- Numerical features need distribution-aware scaling
- Categorical features need cardinality-based encoding
The “universal” abstraction obscured these critical differences, creating a maintenance nightmare.
The Data-Grounded Alternative#
# Engineering-Based Approach: Explicit About Data Transformations
class ImageFeatureProcessor:
"""Optimized for spatial data, preserves locality"""
def process(self, image: np.ndarray) -> np.ndarray:
# Explicit transformations matching data structure
resized = cv2.resize(image, (224, 224))
normalized = (resized - self.mean) / self.std
# Spatial structure preserved throughout
return normalized
class TextFeatureProcessor:
"""Optimized for sequential data, preserves order"""
def process(self, text: str) -> List[int]:
# Explicit sequence transformations
tokens = self.tokenizer.tokenize(text)
ids = self.vocab.convert_tokens_to_ids(tokens)
# Sequence order preserved throughout
return ids
class NumericalFeatureProcessor:
"""Optimized for distributional properties"""
def process(self, values: np.ndarray) -> np.ndarray:
# Explicit statistical transformations
z_scores = (values - self.mean) / self.std
# Distribution shape preserved throughout
return z_scores
Each processor reflects the actual data transformations required. No forced abstraction.
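The list above also called out categorical features. As a hedged sketch (the class name, cardinality threshold, and unseen-value bucket are my own illustrative choices, not from the original system), a cardinality-aware processor might look like this:
from typing import List
import numpy as np
class CategoricalFeatureProcessor:
    """Encoding strategy chosen by cardinality, not by a universal rule"""
    def __init__(self, vocabulary: List[str], one_hot_max: int = 50):
        self.index = {value: i for i, value in enumerate(vocabulary)}
        # Assumption: small vocabularies one-hot cleanly; large ones go to embeddings
        self.use_one_hot = len(vocabulary) <= one_hot_max
    def process(self, value: str) -> np.ndarray:
        idx = self.index.get(value, len(self.index))  # unseen values share one extra bucket
        if self.use_one_hot:
            encoded = np.zeros(len(self.index) + 1, dtype=np.float32)
            encoded[idx] = 1.0
            return encoded
        # High cardinality: return the index for a learned embedding lookup downstream
        return np.array([idx], dtype=np.int64)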
Production Example: Feature Store Architecture#
When building feature stores for ML pipelines, data transformation patterns should drive design:
class FeatureStore:
"""Architecture follows data access patterns, not abstract ideals"""
def __init__(self):
# Real-time features: low-latency point queries
self.online_store = RedisCluster()
# Historical features: batch scan operations
self.offline_store = ParquetFiles()
# Streaming features: append-only time series
self.streaming_store = KafkaTopics()
def get_online_features(self, entity_id: str) -> Dict:
"""Point query optimized for sub-5ms latency"""
return self.online_store.get(entity_id)
def get_offline_features(self, entity_ids: List[str]) -> pd.DataFrame:
"""Batch query optimized for throughput over latency"""
return self.offline_store.read_batch(entity_ids)
def stream_features(self, window: TimeWindow) -> Iterator:
"""Streaming access for real-time aggregations"""
return self.streaming_store.consume(window)
Three storage backends, each optimized for its actual access pattern. No unified abstraction pretending they’re the same.
Maintainability Through Clarity#
Contrary to common belief, explicit data transformations improve maintainability:
# When debugging, this clarity is invaluable
def preprocess_for_model(raw_data: Dict) -> ModelInput:
"""
Explicit transformation pipeline.
Each step documented, each can be inspected independently.
"""
# Step 1: Extract fields (data: dict → data: typed)
extracted = extract_required_fields(raw_data)
# Step 2: Validate ranges (data: typed → data: validated)
validated = validate_field_ranges(extracted)
# Step 3: Normalize distributions (data: validated → data: normalized)
normalized = normalize_to_standard_distribution(validated)
# Step 4: Convert to tensors (data: normalized → data: tensor)
tensorized = convert_to_model_format(normalized)
return tensorized
# Each transformation is a clear checkpoint for debugging
# When failures occur, the exact transformation stage is immediately apparent
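To make those checkpoints operational, each stage can be wrapped so a failure names the exact transformation that broke. A minimal sketch, with run_pipeline and the error message format as my own assumptions:
from typing import Any, Callable, List, Tuple
def run_pipeline(raw_data: Any, stages: List[Tuple[str, Callable[[Any], Any]]]) -> Any:
    """Apply each named transformation; on failure, report the exact stage."""
    data = raw_data
    for name, stage in stages:
        try:
            data = stage(data)
        except Exception as exc:
            raise RuntimeError(f"transformation '{name}' failed: {exc}") from exc
    return data
# Usage with the steps defined above:
# model_input = run_pipeline(raw_data, [
#     ('extract', extract_required_fields),
#     ('validate', validate_field_ranges),
#     ('normalize', normalize_to_standard_distribution),
#     ('tensorize', convert_to_model_format),
# ])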
Truth 3: Hardware is the Platform#
Core Principle: Software runs on hardware, not abstractions. Understanding the execution environment is essential for production systems.
Hardware Reality in AI Systems#
AI workloads expose hardware constraints more dramatically than traditional software:
class ModelServing:
"""Hardware awareness determines serving strategy"""
def __init__(self, hardware_profile: HardwareProfile):
self.hardware = hardware_profile
self.strategy = self._select_strategy()
def _select_strategy(self) -> ServingStrategy:
"""Different hardware demands different approaches"""
if self.hardware.has_gpu:
# GPU: Maximize throughput with large batches
# Memory architecture: Optimize for coalesced access
# Instruction set: Leverage CUDA kernels
return GPUBatchStrategy(
batch_size=self._calculate_optimal_batch(),
memory_layout='NCHW', # GPU-friendly layout
precision='float16', # Tensor core acceleration
)
elif self.hardware.has_avx512:
# CPU with AVX-512: Vectorized operations
# Memory architecture: Optimize for cache locality
# Instruction set: Leverage SIMD
return CPUVectorizedStrategy(
vector_width=512,
memory_layout='NHWC', # CPU cache-friendly
precision='int8', # Quantized for speed
)
else:
# Generic CPU: Optimize for memory efficiency
# Standard instruction set only
return CPUScalarStrategy(
batch_size=1,
memory_layout='optimal_for_cache',
precision='float32',
)
Memory Architecture Matters#
One of our most impactful optimizations came from understanding memory layout:
# Original: Generic memory layout
features = np.array([
[f1_sample1, f2_sample1, f3_sample1], # Sample 1
[f1_sample2, f2_sample2, f3_sample2], # Sample 2
# ... (Row-major: samples contiguous)
])
# Problem: Feature scaling iterates inefficiently
for feature_idx in range(features.shape[1]):
features[:, feature_idx] = scale(features[:, feature_idx])
# Non-contiguous memory access, cache-inefficient
# Solution: Transposed layout for feature-wise operations
features = np.ascontiguousarray(features.T)  # Copy so each feature's values are contiguous
for feature_values in features:
feature_values[:] = scale(feature_values)
# Contiguous access, 3x faster on our hardware
This optimization is invisible at the algorithm level. It’s pure hardware awareness.
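The effect is easy to reproduce. Here is a self-contained timing sketch; the 3x figure above is specific to our hardware, so treat your own numbers as the ground truth:
import time
import numpy as np
samples = np.random.rand(20000, 512)          # row-major: samples contiguous
by_feature = np.ascontiguousarray(samples.T)  # copied so each feature is contiguous
start = time.perf_counter()
for j in range(samples.shape[1]):
    column = samples[:, j]
    samples[:, j] = (column - column.mean()) / column.std()
middle = time.perf_counter()
for row in by_feature:
    row[:] = (row - row.mean()) / row.std()
end = time.perf_counter()
print(f"strided columns: {middle - start:.3f}s, contiguous rows: {end - middle:.3f}s")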
Endianness in Distributed Systems#
When distributing training across heterogeneous clusters:
class ModelCheckpoint:
"""Checkpoint format must handle hardware differences"""
def save(self, model: Model, path: Path) -> None:
"""Save with explicit byte order for cross-platform compatibility"""
checkpoint = {
'weights': model.get_weights(),
'config': model.get_config(),
'metadata': {
'byte_order': sys.byteorder,
'float_format': 'IEEE754',
'version': MODEL_VERSION,
}
}
# Save with platform-independent format
with path.open('wb') as f:
            # Protocol 4 keeps the container portable; byte order is recorded in metadata above
pickle.dump(checkpoint, f, protocol=4)
def load(self, path: Path) -> Model:
"""Load with byte order verification"""
with path.open('rb') as f:
checkpoint = pickle.load(f)
# Verify or convert byte order
if checkpoint['metadata']['byte_order'] != sys.byteorder:
checkpoint = self._convert_byte_order(checkpoint)
return Model.from_checkpoint(checkpoint)
Ignoring endianness cost us two days debugging “corrupted” model weights that were just byte-swapped.
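The _convert_byte_order helper referenced above is not shown; as a minimal sketch, assuming the weights arrive as a dict of NumPy arrays, the swap is one expression per array:
import numpy as np
def swap_weight_byte_order(weights: dict) -> dict:
    """Byte-swap each array and relabel its dtype so values read back correctly."""
    return {
        name: arr.byteswap().view(arr.dtype.newbyteorder())
        for name, arr in weights.items()
    }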
Performance Characteristics#
Understanding hardware performance profiles prevents expensive mistakes:
def optimize_data_loading(
dataset_size: int,
hardware: HardwareProfile
) -> DataLoader:
"""
Data loading strategy depends on hardware bottlenecks.
"""
# Measure actual bottlenecks
disk_bandwidth = hardware.measure_disk_throughput()
network_bandwidth = hardware.measure_network_throughput()
cpu_decompression = hardware.measure_decompression_speed()
if disk_bandwidth > network_bandwidth:
# Network bottleneck: fetch compressed, decompress locally
return NetworkDataLoader(
compression='gzip',
            prefetch_buffer=network_bandwidth * 2,  # buffer ~2 seconds of data at measured bandwidth
)
elif cpu_decompression < disk_bandwidth:
# CPU bottleneck: store uncompressed, trade disk for CPU
return DiskDataLoader(
compression=None,
prefetch_buffer=disk_bandwidth * 2,
)
else:
# Balanced system: optimize for cache efficiency
return MemoryMappedLoader(
mmap_threshold=hardware.l3_cache_size,
)
Different hardware profiles demand different strategies. No universal solution exists.
Integrating the Three Truths#
These truths reinforce each other in production systems:
Case Study: Real-time Feature Computation#
class RealtimeFeatureEngine:
"""
Demonstrates all three truths in one system.
"""
def __init__(self, hardware: HardwareProfile):
# Truth 3: Hardware determines architecture
self.hardware = hardware
# Truth 1: Design around data transformations
self.transformations = self._configure_transformations()
# Truth 2: Explicit about actual data flow
self.pipeline = self._build_pipeline()
def _configure_transformations(self) -> List[Transform]:
"""Define explicit data transformations"""
return [
# Transform 1: Events → Structured records
ParseEventTransform(schema=self.event_schema),
# Transform 2: Records → Time windows
WindowTransform(window_size=self.hardware.optimal_window()),
# Transform 3: Windows → Aggregated features
AggregateTransform(agg_functions=self.aggregations),
# Transform 4: Features → Model input format
SerializeTransform(format=self.model_input_format),
]
def _build_pipeline(self) -> Pipeline:
"""Pipeline optimized for hardware characteristics"""
if self.hardware.has_gpu:
# GPU: Batch processing for throughput
return GPUBatchPipeline(
batch_size=self.hardware.optimal_batch_size(),
transformations=self.transformations,
)
else:
# CPU: Stream processing for latency
return CPUStreamPipeline(
parallel_workers=self.hardware.cpu_cores,
transformations=self.transformations,
)
def compute_features(self, events: List[Event]) -> Features:
"""
Truth 1: Focus on data transformation
Truth 2: No abstract models, just concrete transformations
Truth 3: Hardware-aware execution
"""
return self.pipeline.process(events)
Practical Impact on Production Systems#
Applying these three truths has consistently improved system performance:
Before Data-Oriented Thinking#
- Feature engineering pipeline: 45 minutes for 1M samples
- Model serving latency: P99 = 250ms
- Infrastructure cost: $12K/month
- Development velocity: 2-week iterations
After Data-Oriented Thinking#
- Feature engineering pipeline: 4 minutes for 1M samples (10x improvement)
- Model serving latency: P99 = 18ms (14x improvement)
- Infrastructure cost: $3K/month (4x reduction)
- Development velocity: 2-day iterations (7x improvement)
The improvements stem from:
- Understanding data transformations eliminated unnecessary processing
- Avoiding premature abstractions reduced complexity and bugs
- Hardware-aware design maximized resource utilization
Programming is data transformation. Design around actual data transformations, not idealized abstractions. Hardware is the platform. Before architecting a system, answer three questions:
- What data transformations does this system perform?
- What are the actual data characteristics and access patterns?
- What hardware will execute these transformations?
Architecture follows from honest answers.
Acknowledgments: These principles are grounded in Mike Acton’s 2014 CppCon talk “Data-Oriented Design and C++”. The insights presented here represent my application of his principles to production AI systems over the past decade.