Everything Started with the Promise of Loosely Coupled Systems
Lessons from using Event-Driven Architecture
It all started with good intentions.
We wanted to build a scalable, loosely coupled system, so we turned to Event-Driven Architecture (EDA), which promises independence between services, easier scaling, and cleaner boundaries.
And then… things got messy.
The Dream vs. Reality
What we expected:
Clean communication through events.
Independent teams deploying without coordination.
Clear ownership and boundaries.
What we got:
Hidden service dependencies.
Long, fragile chains of events.
Debugging nightmares across multiple services and queues.
Where It Broke
One-Way Events, Two-Way Needs
Events are great at broadcasting. Not so great when Service A needs a response from Service B. You end up patching with HTTP calls, callbacks, or workarounds that defeat the purpose.
Eventual Consistency ≠ Business Acceptable
Some flows must be strongly consistent, like payments or inventory. Waiting for a series of async events isn't always good enough.
The Cascade Trap
Events triggering more events, triggering more events. No one knows the full chain. One failure creates a mess five services deep.
Realistic Considerations
Trade-offs
You gain scalability and flexibility at the cost of traceability and operational overhead.
Decoupling helps teams move faster, but slows down debugging and recovery.
Failure Scenarios
A single unprocessed event (e.g., due to a deserialization error or misconfiguration) can silently break downstream workflows.
Retry storms or duplicate events can cause side effects like double charging or inventory mismatches.
Timing Issues
Services may act on events before upstream processing is complete. Out-of-order delivery leads to inconsistent states.
Race conditions can emerge when multiple consumers process related events without coordination.
Edge Cases
A service goes down after emitting an event but before persisting its internal state, causing partial failures that are hard to roll back.
Consumers that assume a particular event order, or the presence of other events, end up with brittle logic.
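One way to make a consumer less sensitive to ordering is to carry a per-entity version number on each event and skip anything older than what has already been applied. Here is a minimal sketch, assuming a hypothetical `InventoryEvent` shape whose `version` field the producer increments per SKU (neither name comes from our actual system):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InventoryEvent:
    # Hypothetical event shape; the producer is assumed to bump `version` per SKU.
    sku: str
    version: int
    quantity: int


class InventoryProjection:
    """Keeps the latest known quantity per SKU and ignores stale events."""

    def __init__(self) -> None:
        self._state: dict[str, InventoryEvent] = {}

    def apply(self, event: InventoryEvent) -> bool:
        """Apply the event; return False if it was stale and skipped."""
        current = self._state.get(event.sku)
        if current is not None and event.version <= current.version:
            return False  # out-of-order or duplicate delivery
        self._state[event.sku] = event
        return True

    def quantity(self, sku: str) -> int:
        return self._state[sku].quantity
```

With this guard, receiving version 2 before version 1 leaves the projection at version 2 instead of rolling it back. It only works if the producer can assign versions reliably, which is itself a form of coordination.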
When EDA Shines
Event-driven architecture isn't bad; it's just often misused. Here are situations where it works exceptionally well:
Fan-out Notifications
One service publishes an event; multiple consumers handle side effects (e.g., audit logging, metrics, emails) independently.
Decoupled Feature Rollouts
Want to add a new service without touching existing ones? Subscribe to an existing event and start processing, with no changes to publishers.
IoT & Streaming
High-throughput environments where some latency is acceptable and services process data in near real time (e.g., telemetry, log pipelines).
Orchestration vs. Choreography
While choreography offers flexibility, orchestration gives clarity.
Use choreography for simple event fan-out scenarios. Use orchestration when the order, consistency, or recovery of operations matters.
Favor orchestration when you need:
Deterministic flow
Easier debugging
Strong error recovery
Knowing who's in charge is better, especially when things go wrong.
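To make "who's in charge" concrete, here is a rough sketch of an orchestrator that runs steps in a fixed order and, when one fails, undoes the completed steps in reverse. This pattern is often called a saga. The `Step` type and `run_saga` function are hypothetical names for illustration, not taken from any library:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # does the work
    compensate: Callable[[], None]  # undoes the work if a later step fails


def run_saga(steps: list[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed: list[Step] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception as exc:
            print(f"step {step.name!r} failed: {exc}; compensating")
            for done in reversed(completed):
                done.compensate()
            return False
    return True
```

If a "reserve inventory" step succeeds and a "charge card" step raises, the orchestrator calls the inventory step's `compensate` and returns `False`. The whole flow lives in one place, which is the clarity you trade flexibility for. A production version would also need to persist progress so it can resume after a crash.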
When Choreography Still Works:
Services are truly independent.
You want to enable plug-and-play consumers.
Failure doesn't require coordination (e.g., analytics, logs).
Additional Design Considerations Nobody Tells You Early
You need compensating transactions to handle multi-step workflows that span services. This helps maintain eventual consistency without distributed locks.
You need to make sure event consumers are idempotent: able to handle duplicate messages safely without repeating side effects. This is critical for retry handling and at-least-once delivery guarantees.
Events change. Always design with backward compatibility in mind. You need to use versioning, enforce schema validation, and coordinate schema evolution to prevent breaking consumers.
You need to secure event channels with authentication and authorization.
You need to validate that events come from trusted sources to avoid processing injected or malformed messages.
You need to make coupling obvious, which means documenting service dependencies:
Use architectural diagrams.
Maintain producer-consumer documentation.
Treat your event contracts as seriously as API specs.
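Of the items above, idempotency is the easiest to show in code. Here is a minimal sketch, assuming every event carries a unique `event_id` field (a hypothetical name) and using an in-memory set where a real system would need a durable store:

```python
from typing import Callable


class IdempotentConsumer:
    """Wraps a handler so each event ID triggers its side effect at most once."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        # In-memory for illustration only; a real consumer would use a durable
        # store, ideally updated in the same transaction as the side effect.
        self._seen: set[str] = set()

    def handle(self, event: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        event_id = event["event_id"]  # hypothetical field name
        if event_id in self._seen:
            return False  # duplicate delivery, skip the side effect
        self._handler(event)
        self._seen.add(event_id)  # mark as seen only after the handler succeeds
        return True
```

Delivering the same payment event twice then charges once. Note the ordering: the ID is recorded after the handler runs, so a crash between the two lines can still cause a repeat, which is why the durable version needs both writes in one transaction.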
Tooling That Saves You
EDA without strong observability is chaos. Tools and techniques that help:
Distributed Tracing (e.g., OpenTelemetry, Sentry, Zipkin)
Helps track a request across multiple services, even through async events.
Logging Layer (e.g., ELK, Datadog Logs, Fluentd)
Enables searching and correlating events with logs from distributed services.
Dead Letter Queues (e.g., AWS SQS DLQ, RabbitMQ DLX, Kafka DLQ)
Capture failed event deliveries. Crucial for diagnosing silent failures. Combined with replay, they let you restore state or recover from bugs.
Schema Contracts (e.g., AsyncAPI, Protobuf, Contract Tests)
Document events like you would REST APIs. Avoid guesswork and breakages. Schema enforcement helps prevent data evolution issues.
Monitoring (e.g., Grafana, Prometheus, Datadog)
Keep visibility into your system health, alert on abnormal patterns, and track metrics like queue depth, event lag, and failure rates.
Together, these layers ensure visibility, traceability, and recovery in event-driven systems.
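The dead letter idea is simple enough to sketch without a broker. The example below is an in-memory illustration, not the API of SQS, RabbitMQ, or Kafka; the `consume` function and its parameters are hypothetical:

```python
import json
from collections import deque
from typing import Callable


def consume(
    raw_messages: list[bytes],
    handler: Callable[[dict], None],
    dead_letters: deque,
    max_attempts: int = 3,
) -> int:
    """Process messages; park any that keep failing instead of dropping them."""
    processed = 0
    for raw in raw_messages:
        last_error = None
        for _attempt in range(max_attempts):
            try:
                handler(json.loads(raw))
                processed += 1
                break
            except Exception as exc:  # includes deserialization errors
                last_error = exc
        else:
            # Every attempt failed: keep the payload and the reason for replay.
            dead_letters.append({"payload": raw, "error": repr(last_error)})
    return processed
```

The point is that a message the consumer cannot handle ends up somewhere visible, with its error attached, rather than vanishing. Alerting on the size of `dead_letters` is what turns a silent failure into a noisy one.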
Postmortem Snapshot
We once had a refund service consume OrderCancelled events, but due to a config typo, it ignored 15% of messages. Users weren't refunded, and we only discovered it a week later via support tickets.
Takeaway: fire-and-forget works—until someone forgets to monitor.
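A gap like this can be caught by periodically comparing what was published with what the consumer actually handled. Here is a minimal sketch, assuming both sides can list event IDs for the same time window; the `reconcile` function and its threshold are hypothetical, not what we ran at the time:

```python
def reconcile(
    published_ids: set[str],
    processed_ids: set[str],
    max_missing_ratio: float = 0.01,
) -> tuple[set[str], bool]:
    """Return the IDs never processed and whether the gap warrants an alert."""
    missing = published_ids - processed_ids
    ratio = len(missing) / len(published_ids) if published_ids else 0.0
    return missing, ratio > max_missing_ratio
```

Run against a window where 15 of 100 cancellations never produced a refund, this returns those 15 IDs and flags an alert, days before the support tickets would have.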
Lessons Learned
Don't Use EDA as a Default
Use it when the domain fits: fan-out use cases, audit logs, or workflows where latency isn't critical. Not every service needs to be decoupled.
Favor Simplicity Over Indirection
EDA can feel smart, but every layer of indirection is a future maintenance cost. A direct API call might be the better choice.
Make Coupling Obvious
EDA hides coupling behind queues. That makes local reasoning hard. Be explicit about what depends on what.
Final Thought
Complexity makes you look smart.
Simplicity makes you move fast, and makes you money.
Before going event-driven, ask:
Does this need async processing?
Can I tolerate delay and retries?
Can I explain the flow to a new engineer in 5 minutes?
Use events when they help. Don't build your system around them by default.
Start with clarity, not cleverness.
Thank you for reading System Design Classroom. If you like this post, share it with your friends!