Introduction
Chaos engineering is a relatively new discipline in software development, and over the last few years major organizations have adopted it. In simple terms, as the name suggests, chaos engineering means deliberately engineering (or introducing) chaos into a system with the intent of improving the system's reliability and safety.
Where did it begin?
Netflix evangelized chaos engineering, and once the practice proved its business value, a community grew up to support it, build tooling for it, and spread the word. Netflix built an application called Chaos Monkey.
Chaos Monkey
Every day, this application would iterate through a list of clusters, randomly pick an instance from a cluster, and, without warning, turn it off during business hours. That sounds reckless, but the purpose was to expose services to one specific type of failure, vanishing instances (which affect service availability), and to test the application’s resilience to it. The shutdowns were deliberately performed during business hours so that engineers were on hand to address and fix any resulting issue immediately.
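The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Netflix's implementation: the function names, the 09:00 to 17:00 weekday window, the choice of one instance per cluster, and the `terminate` callback are all assumptions made for the example.

```python
import random
from datetime import datetime, time

# Assumed business-hours window; the real schedule is not given in the text.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def is_business_hours(now):
    """True on weekdays between the assumed start and end times."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def run_chaos_round(clusters, terminate, now, rng=None):
    """Terminate one randomly chosen instance per non-empty cluster.

    clusters  -- dict mapping a cluster name to a list of instance ids
    terminate -- hypothetical callback that shuts a single instance down
    now       -- datetime used for the business-hours check
    Returns the list of instance ids that were terminated.
    """
    if rng is None:
        rng = random.Random()
    if not is_business_hours(now):
        return []  # never inject failure when nobody is around to respond

    terminated = []
    for instances in clusters.values():
        if not instances:
            continue  # nothing to pick from in an empty cluster
        victim = rng.choice(instances)
        terminate(victim)
        terminated.append(victim)
    return terminated
```

In a real deployment `terminate` would call the cloud provider's API; here any callable works, for example `run_chaos_round({"api": ["api-1", "api-2"]}, print, datetime.now())`.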
The approach worked, and multiple teams adopted it. Chaos Monkey pushed every team to build services robust enough to handle vanishing instances, while leaving each team free to decide how to fix the problem.
Why is it necessary?
Chaos Monkey is a management principle wrapped up in a running piece of code. The concept is that “failures can happen at any point in time, and the application should have solutions to work around these failures”.
Chaos Monkey was deployed successfully, but it only addressed failure on a small scale. Netflix then formed a chaos engineering team, which produced the principles of chaos engineering.
The formal definition
“Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
As the formal definition states, chaos engineering is a form of experimentation that helps build confidence in how a system behaves under turbulent conditions.
The principles of chaos engineering list five advanced principles:
Build a hypothesis around steady-state behavior.
Vary real-world events.
Run experiments in production.
Automate experiments to run continuously.
Minimize blast radius.
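To make a few of these principles concrete, here is a minimal Python sketch of a single experiment. It checks a steady-state hypothesis (an error rate below a threshold), injects a fault into only a fraction of instances to limit the blast radius, and always rolls the fault back. Everything in it is an assumption made for illustration: the names `run_experiment`, `probe`, `inject`, and `rollback`, the error-rate metric, and the default thresholds. It does not cover running in production or continuous automation.

```python
import random


def error_rate(results):
    """Fraction of failed probes in a list of booleans (True means healthy)."""
    if not results:
        return 0.0
    return sum(1 for ok in results if not ok) / len(results)


def choose_targets(instances, blast_radius, rng):
    """Pick a random subset no larger than the blast radius (at least one)."""
    count = max(1, int(len(instances) * blast_radius))
    return rng.sample(instances, count)


def run_experiment(instances, probe, inject, rollback,
                   max_error_rate=0.01, blast_radius=0.1, rng=None):
    """Run one experiment and report whether the steady state held.

    probe(instance)    -- hypothetical health check, returns True if healthy
    inject(instance)   -- hypothetical fault injection
    rollback(instance) -- hypothetical undo of the injected fault
    """
    if rng is None:
        rng = random.Random()

    # Steady-state hypothesis: the error rate stays under max_error_rate.
    baseline = error_rate([probe(i) for i in instances])
    if baseline > max_error_rate:
        # The system is already unhealthy, so do not make it worse.
        return {"status": "aborted", "targets": [],
                "baseline": baseline, "observed": None}

    targets = choose_targets(instances, blast_radius, rng)
    injected = []
    try:
        for target in targets:
            inject(target)
            injected.append(target)
        observed = error_rate([probe(i) for i in instances])
    finally:
        # Undo only the faults that were actually injected.
        for target in injected:
            rollback(target)

    status = "passed" if observed <= max_error_rate else "failed"
    return {"status": status, "targets": targets,
            "baseline": baseline, "observed": observed}
```

A "failed" result here is useful information rather than a bug in the experiment: it means the hypothesis did not hold and the system needs hardening before the blast radius is widened.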
Conclusion
In a nutshell, chaos engineering is an important aspect of building and maintaining reliable and robust applications.