11 Comments

The more closely our tests resemble the way our customers use our system, the better. So testing in production is important, but we must make sure not to pollute the environment with wrong statistics and data.

Love this article, Raul!

Exactly! The closer our tests mirror real user behavior, the more confidence we have in our system’s reliability.

Thanks, Petar

Thanks a lot for this article! Testing in production is something I have pushed my team to aim for all the time. I still accept dev environments (I named them dev to emphasize that they should be used for development rather than for testing), but I make it clear that if they cannot show that a feature is working in prod, the feature is not ready to be released.

One thing that I found missing here is patterns for implementing those elements. Devs usually have little knowledge of how to implement feature flags, rollback strategies, A/B testing, parallel runs, etc. What they are missing is usually on the design side: proper rollback is not just a matter of the build pipeline; the code and the data should also support it. The same goes for canaries (how do you run two different versions of the same service in the same environment?), parallel runs (useless if you don't have proper observability, but what does that mean?), chaos engineering (does your service implement a proper health check? Do you handle timeouts? Do you have a circuit breaker?), etc.

It would be very interesting to visit those patterns on your channel. I think they are a major part of good system design, and too many teams relegate those issues to the "DevOps/SRE team" rather than building them for themselves, which is quite unfortunate given how much those things can improve developers' quality of life.
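
To make one of those patterns concrete, here is a minimal sketch of a circuit breaker in Python. It is only an illustration under stated assumptions: the `CircuitBreaker` class, the `CircuitOpenError` name, and the default thresholds are hypothetical and not taken from the article or from any library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    # Hypothetical helper for illustration; names and defaults are assumptions.
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast instead of hitting the dependency.
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open if the trial call failed.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is whatever function talks to the dependency, and treat `CircuitOpenError` as a signal to fall back or degrade. Passing the clock in as a parameter is a design choice that lets the open, half-open, and closed transitions be tested without sleeping.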

You are right! I fully support the idea of the Dev team owning this effort. Of course, they will need help, but the definition of DONE for me is working in production.

I'll do my best to cover some of the topics, although I think I already covered some of them in previous posts ;).

Curious—what’s your biggest challenge when getting teams to own these patterns rather than relying on SREs?

Thanks for the great comment!

Good article, thank you. Testing in production (or TIP) is an important part of the release cycle. Production is quite different from QA/test (different machines, user IDs, security, file systems, firewalls, access, etc.), so testing often needs to be repeated there. Where separate config files define these different system configurations, the first time these config files are tested is when the software is deployed. I have seen errors in config that are difficult to spot, and problems in release instructions meant to ensure that all the appropriate system channels and security have been set up. These issues are *only* identified after deployment. It is key to ensure that each of the elements in the config (routes, file systems, users, etc.) is confirmed post-deployment at a time that is least disruptive to the business. It is also key to review this plan prior to deployment to ensure that each potential issue is covered.
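
As a rough illustration of confirming config elements after deployment, the following Python sketch checks that configured directories exist and are writable and that configured routes accept a TCP connection. The config layout (`directories`, `routes`) and the function names are assumptions made for this example, not something described in the article.

```python
import os
import socket


def check_directory(path):
    """Return an error string if path is not a writable directory, else None."""
    if not os.path.isdir(path):
        return f"directory missing: {path}"
    if not os.access(path, os.W_OK):
        return f"directory not writable: {path}"
    return None


def check_route(host, port, timeout=3.0):
    """Return an error string if a TCP connection cannot be opened, else None."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:
        return f"route unreachable: {host}:{port} ({exc})"


def verify_config(config):
    """Run every check and return the list of problems found (empty if none)."""
    # Assumed layout: {"directories": [path, ...], "routes": [(host, port), ...]}
    results = []
    for path in config.get("directories", []):
        results.append(check_directory(path))
    for host, port in config.get("routes", []):
        results.append(check_route(host, port))
    return [r for r in results if r is not None]
```

Run right after deployment, a script like this returns every problem in one pass rather than stopping at the first, which suits a short, low-disruption verification window. Checks for users or permissions could follow the same shape.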

You're right, John; configuration mismatches are one of the biggest sources of production failures.

Thanks for adding this!

My greatest challenge? Making them see it's possible and that they can do it. Most of them haven't been exposed to that kind of code enough to have an idea of how it could happen. So it's an afterthought: rather than having code designed to be observable, they try to add observability on top of an opaque(?) code design. It's very similar to (and I would even say exactly the same as) why TDD requires starting with the test first. Adding tests to untestable code is very hard.

You nailed it with this: "Adding tests to untestable code is very hard."

Testing only reduces the probability of bugs; it's impossible to bring it down to zero.

So we should always test production.

Spot on, Neo.

Thanks!

Testing only in production is bad practice. But I agree that checking in production also makes sense.
