System Design Classroom

Exactly! The closer our tests mirror real user behavior, the more confidence we have in our system’s reliability.

Thanks, Petar

Fabien Ninoles

Mar 1

Thanks a lot for this article! Testing in production is something I push my team to aim for all the time. I still accept dev environments (I named them dev to emphasize they should be used for development rather than for testing), but I make it clear that if they cannot show that a feature is working in prod, the feature is not ready to be released.

One thing that I found missing here is patterns to implement those elements. Devs usually have little knowledge of how to implement feature flags, rollback strategies, A/B testing, parallel runs, etc. What they are missing is usually on the design side: proper rollback is not just a matter of the build pipeline; the code and the data should also support it. The same goes for canaries (how do you run two different versions of the same service in the same environment?), parallel runs (useless if you don't have proper observability, but what does that mean?), chaos engineering (does your service implement a proper health check? Do you handle timeouts? Do you have a circuit breaker?), etc.

It would be very interesting to visit those patterns in your channel. I think they are a major part of good system design, and too many teams relegate those issues to the "DevOps/SRE team" rather than building them themselves, which is quite unfortunate given how much those things can improve the developer's quality of life.
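As one illustration of the design-level patterns mentioned above, here is a minimal circuit breaker sketch in Python. It is not taken from the article: the class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all assumptions made for the example, and the wrapped function is expected to enforce its own timeout.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after repeated failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, call skipped")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a placeholder for your own client function, and treat `CircuitOpenError` as a signal to fall back or fail fast.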

You are right! I fully support the idea of the Dev team owning this effort. Of course, they will need help, but for me the definition of DONE is working in production.

I'll do my best to cover some of the topics, although I think I already covered some of them in previous posts ;).

Curious—what’s your biggest challenge when getting teams to own these patterns rather than relying on SREs?

Thanks for the great comment!

Rafa Páez

Mar 16

That meme is so good because it's the reality: I rarely test my code (okay, except for some unit tests) but when I do it, I do it in production.

Mar 17

Classics never die ;)

Daniel Moka

Mar 10

The trick of quality and confident testing is to get as close as possible to how the user would use our apps. Great article my friend.

Mar 11

Spot on, Daniel.

The closer to the real environment and real user behavior, the better.

Marcos F. Lobo 🗻🧭

Mar 4

Very good one Raúl, thanks for sharing.

The strategy I like the most is what you call here "Synthetic transactions". I call it "Synthetic tests".

The main benefit of running these continuously is that you will notice if something is wrong with your system before the customer does.

Without testing in production, you stay in a reactive mode. By testing in production, you shift left to a proactive mode, reducing the impact on your customers.

One important question to deal with is: do you use a "fake" customer to perform the synthetic tests, or a real customer? 🧐

What do you guys think?

Mar 5

Fake vs. Real is always a good question. I prefer to use the term "Controlled" instead of the term "Fake."

A Controlled Customer is one I can manipulate and change without fear of messing up someone's account, yet one that behaves like a real customer.
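To make the idea of a synthetic test run against a controlled customer concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the customer ID, the `fake_checkout` stand-in (which you would replace with a real call to your production system), and the `ProbeResult` shape.

```python
import time
from dataclasses import dataclass
from typing import Callable

# Hypothetical controlled customer: an account reserved for probes.
CONTROLLED_CUSTOMER_ID = "synthetic-customer-001"


@dataclass
class ProbeResult:
    name: str
    ok: bool
    latency_s: float
    error: str = ""


def run_synthetic_check(name: str, transaction: Callable[[], None]) -> ProbeResult:
    """Run one synthetic transaction and record outcome and latency."""
    start = time.monotonic()
    try:
        transaction()
    except Exception as exc:
        return ProbeResult(name, False, time.monotonic() - start, str(exc))
    return ProbeResult(name, True, time.monotonic() - start)


def fake_checkout(customer_id: str) -> str:
    """Stand-in for a real production call; replace with your own client."""
    if not customer_id.startswith("synthetic-"):
        raise ValueError("probe must use a controlled customer")
    return "order-accepted"


def checkout_transaction() -> None:
    """The user journey under test, executed as the controlled customer."""
    status = fake_checkout(CONTROLLED_CUSTOMER_ID)
    if status != "order-accepted":
        raise RuntimeError(f"unexpected status: {status}")


if __name__ == "__main__":
    print(run_synthetic_check("checkout", checkout_transaction))
```

In practice a scheduler would call `run_synthetic_check` every few minutes and alert when `ok` is false or latency drifts, which is what lets you notice a problem before a customer does.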

Saurabh Dashora

Mar 4

Great post Raul.

With the right approach, testing in production is often very useful. It can reveal issues that could never be found in lower environments.

Also, thanks for the mention!

Mar 5 (edited)

Spot on, Saurabh.

Thanks!

Vladimir Poplavskij

Mar 3

Testing only in production is bad practice. But I agree that checking in production also makes sense.

Mar 5

Right, the idea is to use these strategies to augment, not replace.

Thanks, Vladimir.

John

Good article, thank you. Testing in production (or TIP) is an important part of the release cycle. Production is quite different from QA/test: different machines, user IDs, security, file systems, firewalls, access, etc., and this often needs reiterating. Where separate config files define these different system configurations, the first time the production config files are tested is when the software is deployed. I have seen errors in config that are difficult to spot, and problems in release instructions around ensuring that all the appropriate system channels and security have been set up. These issues are *only* identified after deployment. It is key to confirm each of the elements in the config (routes, file systems, users, etc.) post-deployment, at a time that is least disruptive to the business. It is also key to review this plan prior to deployment to ensure that each potential issue is covered.
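A post-deployment confirmation of config elements like the one described above could be sketched roughly as follows. The config layout (`paths` and `routes` keys) and the function names are hypothetical, and a real check would also cover users and permissions.

```python
import os
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def verify_config(config, path_exists=os.path.exists, can_connect=can_connect):
    """Check each configured element and return a list of failure messages.

    The checkers are injectable so the logic can be exercised without
    touching a real file system or network.
    """
    failures = []
    for path in config.get("paths", []):
        if not path_exists(path):
            failures.append(f"missing path: {path}")
    for host, port in config.get("routes", []):
        if not can_connect(host, port):
            failures.append(f"unreachable route: {host}:{port}")
    return failures


if __name__ == "__main__":
    # Hypothetical production config; replace with your deployed values.
    deployed = {"paths": ["/tmp"], "routes": [("localhost", 8080)]}
    for problem in verify_config(deployed):
        print(problem)
```

Running a script like this right after deployment, during a quiet window, turns the manual checklist into something repeatable, and an empty result list is the signal that every configured element was confirmed.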

You're right, John; configuration mismatches are one of the biggest sources of production failures.

Thanks for adding this!

Fabien Ninoles

My greatest challenge? Making them see that it's possible and that they can do it. Most of them haven't been exposed to that kind of code enough to have an idea of how it could be done. So it becomes an afterthought: rather than having code designed to be observable, they try to add observability on top of an opaque(?) code design. It's very similar to (I would even say exactly the same as) why TDD requires starting with the test first. Adding tests to untestable code is very hard.
