Most System Design Mistakes Hide Between the Boxes

Five gaps your architecture diagrams don't show

May 30, 2026

A user clicks Place Order.

The API returns 200 OK. The order appears in the database. The checkout page shows a success message. From the outside, everything looks fine.

Then support gets a ticket.

The customer never received a confirmation email. Inventory never moved. The warehouse system never saw the order. Analytics never recorded the purchase.

Now the team has a strange problem.

The order exists in one place, but not in the rest of the system.

The order does not get lost in the boxes on the diagram. Not in the clean arrows between services. Not in the happy path everyone reviewed during planning.

It gets lost in the gaps between components, where one operation succeeds, another fails, and the system enters a state nobody designed for.

Most system design mistakes hide between the boxes.

They show up when a database commit succeeds but an event never publishes. When a replica lags behind the primary. When one request triggers ten downstream workflows. When a queue keeps accepting work faster than consumers can process it. When a harmless schema change breaks a forgotten consumer.

Most teams already have databases, queues, caches, APIs, logs, dashboards, and cloud infrastructure. The harder part is understanding the failure patterns that appear when those tools interact.

These five concepts help build that instinct.

1. The Dual-Write Problem

The dual-write problem happens when one business action needs to update two different systems.

A common example is saving an order to a database and publishing an OrderCreated event to a broker.

The flow looks simple: save the order, then publish the event.

You start with code like this:
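A minimal sketch of that kind of handler, not any particular codebase. It assumes a SQLite database standing in for the order store and a hypothetical `FlakyBroker` class standing in for a real broker client. The names `place_order`, `create_schema`, and `total_cents` are illustrative.

```python
import sqlite3
import uuid


class FlakyBroker:
    """Hypothetical stand-in for a message broker client."""

    def __init__(self, fail=False):
        self.fail = fail
        self.published = []

    def publish(self, topic, payload):
        if self.fail:
            raise ConnectionError("broker timeout")
        self.published.append((topic, payload))


def create_schema(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, total_cents INTEGER)"
    )


def place_order(conn, broker, total_cents):
    order_id = str(uuid.uuid4())
    # Write 1: the order is committed to the database.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
    # Write 2: the event goes to a different system. If this raises,
    # the order above is already committed and nothing rolls it back.
    broker.publish("OrderCreated", {"order_id": order_id, "total_cents": total_cents})
    return order_id
```

Calling `place_order` with `FlakyBroker(fail=True)` leaves one row in `orders` and nothing in the broker. Whether the handler surfaces that error or swallows it, the order row stays committed.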

This code looks reasonable. It may work for months. It may pass every test. It may survive low traffic without obvious problems.

Then production does what production always does.

The database write succeeds, but the event publish fails.

Now the order exists, but nobody else knows about it. Inventory does not reserve stock. Email does not send confirmation. Shipping does not prepare fulfillment. Analytics does not count the sale.

The system has split into two versions of reality.

One part says, “The order exists.”

Another part says, “I never heard about it.”

That is the dual-write problem.

What’s at stake

The scary part is not the failure itself. The scary part is how quiet the failure can be.

The API may still return success. The database may look correct. The logs may show one small broker timeout buried under thousands of successful requests. But downstream systems now depend on an event that never arrived.

That creates missing workflows, broken reports, confused support teams, and customer experiences that feel random.

This is why “just publish an event” is not a complete architecture decision.

Event-driven systems need more than publishing events. They need a reliable way to record that something happened and make sure the rest of the system eventually learns about it.

Walk through a solution

The database write and the event publish should not live as two unrelated operations. They represent one business fact:

The order was created.

A common solution is the Transactional Outbox Pattern.

Instead of saving the order and publishing the event directly, the service saves the order and writes an event record into an outbox table in the same database transaction.

The code becomes:
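A minimal sketch of that change, assuming SQLite and a hypothetical `outbox` table layout (`seq`, `event_id`, `event_type`, `payload`, `published_at`). The column names are illustrative, not a standard.

```python
import json
import sqlite3
import uuid


def create_schema(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, total_cents INTEGER)"
    )
    # Hypothetical outbox layout: published_at stays NULL until a
    # separate publisher has sent the event to the broker.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
        "event_id TEXT NOT NULL UNIQUE, "
        "event_type TEXT NOT NULL, "
        "payload TEXT NOT NULL, "
        "published_at TEXT)"
    )


def place_order(conn, total_cents):
    order_id = str(uuid.uuid4())
    payload = json.dumps({"order_id": order_id, "total_cents": total_cents})
    # One transaction: both rows commit, or neither does.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (event_id, event_type, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "OrderCreated", payload),
        )
    return order_id
```

The request handler no longer talks to the broker at all. It only writes rows.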

Now the order and the event record commit together.

If the transaction fails, neither one gets saved. If the transaction succeeds, the system has a durable record that an event must be published.

A separate process reads from the outbox table and sends events to the broker.

This turns a fragile two-step operation into a retryable workflow.

The request path no longer depends on the broker being available at the exact moment the user clicks Place Order. It only needs to persist the event intent safely.

Even if the broker is down, slow, or unreachable, the intent survives in the database, and the publisher will get to it.
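The publisher side can be sketched in a few lines. This assumes the same hypothetical `outbox` layout and a `publish` callable that wraps whatever broker client you use. Both are illustrative, not a specific library's API.

```python
import json
import sqlite3


def create_outbox(conn):
    # Hypothetical schema: seq preserves insertion order,
    # published_at stays NULL until the event has been sent.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
        "event_id TEXT NOT NULL UNIQUE, "
        "event_type TEXT NOT NULL, "
        "payload TEXT NOT NULL, "
        "published_at TEXT)"
    )


def publish_pending(conn, publish, batch_size=100):
    """Send unpublished rows in insertion order; return how many were sent."""
    rows = conn.execute(
        "SELECT seq, event_id, event_type, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY seq LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent = 0
    for seq, event_id, event_type, payload in rows:
        try:
            publish(event_id, event_type, json.loads(payload))
        except Exception:
            # Broker is down or slow: stop here and leave the remaining
            # rows pending so the next run retries them.
            break
        # Mark the row only after the broker accepted the event.
        with conn:
            conn.execute(
                "UPDATE outbox SET published_at = datetime('now') WHERE seq = ?",
                (seq,),
            )
        sent += 1
    return sent
```

In practice something calls `publish_pending` on a loop or a schedule. A failed run changes nothing in the table, so the next run picks up where the last one stopped.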

How to apply this tomorrow

For brokers and webhooks, an outbox-style publisher often works well. For external APIs, you may also need idempotency keys, retry policies, and compensation logic.

For the most critical path you find (for example, order creation), sketch how you would replace the direct publish with an outbox table write in the same transaction, plus a small background publisher that drains that table.

Realistic considerations

The outbox pattern improves reliability, but it does not remove all complexity. It only changes the shape of the problem.

The publisher might publish the event and crash before marking it as published. When it restarts, it may publish the same event again.

The outbox does not guarantee exactly-once processing. It guarantees the event intent is durably recorded with the business write. Publishing and consumption are still usually at-least-once, so consumers must remain idempotent.

This is why idempotency matters in event-driven systems. You cannot assume each message arrives only once.
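One common way to make a consumer idempotent is to record each handled event ID in the same transaction as the side effect. A minimal sketch, assuming SQLite, a hypothetical inventory consumer, and events that carry a unique `event_id`:

```python
import sqlite3


def create_schema(conn):
    # The primary key on event_id is what rejects duplicate deliveries.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reservations (order_id TEXT, quantity INTEGER)"
    )


def handle_order_created(conn, event_id, order_id, quantity):
    """Reserve stock once per event_id; return False for a duplicate delivery."""
    try:
        with conn:
            # If this event_id was already handled, the insert fails and
            # the whole transaction rolls back before any stock moves.
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
            conn.execute(
                "INSERT INTO reservations (order_id, quantity) VALUES (?, ?)",
                (order_id, quantity),
            )
    except sqlite3.IntegrityError:
        return False
    return True
```

Delivering the same event twice reserves stock once. The second call returns `False` and changes nothing.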

You also need monitoring. If the outbox table grows, your publisher may be falling behind. That means the main system accepts orders faster than the rest of the platform can react to them.
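A backlog check can be as small as one query. This sketch assumes the outbox rows also carry a `created_at` Unix timestamp. The thresholds in `backlog_is_unhealthy` are placeholders, not recommendations.

```python
import sqlite3
import time


def create_outbox(conn):
    # Hypothetical, trimmed-down outbox: only the columns the check needs.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        "event_id TEXT PRIMARY KEY, "
        "created_at REAL NOT NULL, "
        "published_at REAL)"
    )


def outbox_backlog(conn, now=None):
    """Return (pending_count, age_in_seconds_of_oldest_pending_row)."""
    now = time.time() if now is None else now
    count, oldest = conn.execute(
        "SELECT COUNT(*), MIN(created_at) FROM outbox WHERE published_at IS NULL"
    ).fetchone()
    age = 0.0 if oldest is None else now - oldest
    return count, age


def backlog_is_unhealthy(conn, max_pending=1000, max_age_seconds=60, now=None):
    # Placeholder thresholds: tune them to how stale downstream systems can be.
    count, age = outbox_backlog(conn, now)
    return count > max_pending or age > max_age_seconds
```

The age of the oldest pending row is often the more useful signal. A large count that drains quickly is fine. A single event stuck for ten minutes is not.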

The dual-write problem teaches you how systems split when two operations pretend to be one. But once you have a reliable event stream, new problems emerge when different views of the data fall out of sync.

The next concept teaches you why even successful writes can still produce stale user experiences.

2. Read-Write Splitting

Reads and writes do not scale the same way.

Reads are easier to multiply. You can add read replicas, cache responses, create materialized views, use search indexes, or push static assets to a CDN.

Writes are harder because a write changes state. Once state changes, the system must decide who owns the truth, how conflicts get resolved, and when other copies catch up.

That is why scaling reads often feels simple, while scaling writes exposes deeper design questions.

Most systems begin with one database.

This works for a while. Then the product grows.

Dashboards query large tables. Mobile apps poll more often. Search pages scan too much data. Internal tools run heavy reports. The primary database starts doing too much work.

So the team adds read replicas.

Read traffic now spreads across multiple machines.

That helps.

But it introduces a new problem: replication lag.

The primary database gets the write first. Replicas catch up later. That delay may be milliseconds, seconds, or, during incidents, minutes.
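A toy model makes the lag visible. This is an in-memory simulation, not how any real database replicates. The `Primary` and `Replica` classes are invented for illustration.

```python
class Primary:
    """Toy primary: applies writes and appends them to a replication log."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered (key, value) entries for replicas to replay

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Toy replica: replays the primary's log only when catch_up() runs."""

    def __init__(self, primary):
        self.primary = primary
        self.data = {}
        self.applied = 0  # number of log entries replayed so far

    def lag(self):
        # Lag measured in unapplied log entries, not in time.
        return len(self.primary.log) - self.applied

    def catch_up(self):
        for key, value in self.primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.primary.log)

    def read(self, key):
        return self.data.get(key)


primary = Primary()
replica = Replica(primary)

primary.write("order-1", "PAID")
stale = replica.read("order-1")   # replica has not replayed the write yet
replica.catch_up()
fresh = replica.read("order-1")   # now it matches the primary
```

Between the write and `catch_up()`, the replica answers as if the order does not exist. The write succeeded. The read is still stale.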