The Netflix Tech Blog has a number of recommendations for anyone moving to the cloud. Possibly the most indispensable piece of advice the blog gives is that “The best way to avoid failure is to fail constantly”.
“We’ve sometimes referred to the Netflix software architecture in AWS [Amazon Web Services] as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends. One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage”.
This advice might sound crazy. Why would you actively attack your own deployed services? Because, regardless of what you do, individual components of your solution will fail. Even the most robust equipment eventually fails, and if a system cannot recover from small failures and degrade gracefully under larger ones, then that system is only as strong as its weakest link.
Netflix compares this to getting a flat tyre on the motorway. The guy who’s been practising every weekend by popping and replacing his tyres on his drive is going to be much better prepared than the guy who hasn’t looked at his spare in six months and doesn’t even know if it’s inflated.
With this philosophy Netflix has built a Simian Army. First, to ensure that failing production instances did not affect customers, they created Chaos Monkey, a tool which randomly disables instances. Other simians followed: Latency Monkey was created to introduce artificial delays, and Chaos Gorilla to simulate the outage of an entire Amazon availability zone. The company began constantly testing its resilience to all kinds of failures and improving its ability to deal with the inevitable failures that would occur in production.
The ‘tyres burst’ when AWS suffered significant networking issues across the Americas on Christmas Eve 2012. The Netflix web site remained up, at a time when many popular websites crashed. While some streaming services for users in the Americas were degraded, nothing broke. Their commitment to recovery from failure had paid off.
So what can we learn from Netflix for the benefit of Solidsoft Reply’s customers?
The lessons are as much cultural as technical. Remember that the company was willing to take small risks with its production code. This was a good idea, but at the time it must have required a deep cultural commitment to producing quality software. Netflix had to be the sort of place that hires smart people and gives them the space to make smart decisions. As consultants, I think it’s our duty to promote a positive software development culture, but that deserves its own post.
On a much less contemplative note, I think Chaos Monkey is a really good idea. In the best tradition of software development, I decided I’d pinch it. My current project involves creating a document scanning security solution based on resilient cloud services. Something in the spirit of Chaos Monkey would be useful to test its robustness against things like environmental failures, crashing dependencies and over-enthusiastic anti-virus programs.
Our solution integrates with a native code library. In the event this code crashes, the application has to recover gracefully. We’ve allowed for this by isolating the library in a separate process. To be robust, our multi-process/service system must continue to be successful during the failure of any of its constituent parts. In order to test this, I chose to unleash my own brand of chaos upon Windows Processes.
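To make the isolation idea concrete, here is a minimal Python sketch of it. It is not the code from our project: the worker is a hypothetical stand-in for the native library (a child Python process that deliberately crashes on the input "bad"), and the names scan_in_child and scan_all are invented for illustration. The point is only that a crash in the child shows up as a non-zero exit code in the parent, which can record the failure and carry on.

```python
import subprocess
import sys

# Hypothetical stand-in for the fragile native library: a child Python
# process that crashes when handed the input "bad".
WORKER_SOURCE = """
import os, sys
data = sys.stdin.read()
if data == "bad":
    os._exit(3)  # simulate a hard crash with no clean-up
print(data.upper())
"""


def scan_in_child(document, timeout=10.0):
    """Run one unit of work in its own process; return None if it dies."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE],
            input=document,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # a hung worker is treated like a crashed one
    if result.returncode != 0:
        return None  # the child crashed; the parent is unaffected
    return result.stdout.strip()


def scan_all(documents):
    """Process every document, recording failures instead of stopping."""
    return {doc: scan_in_child(doc) for doc in documents}
```

Calling scan_all(["hello", "bad", "world"]) should leave the "bad" entry as None while the other two documents are still processed. A real system would probably keep a long-lived worker and restart it on failure rather than pay for a new process per document; that is a design choice this sketch leaves out.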
I called this the Process Gremlin: a configurable, strategy-based Windows process eviscerator. The code is available at [My Github]. I’ll accept a gremlin isn’t a simian, but it’s still a good analogy.
There had been a distinct lack of Shatner on this blog until this point. Rocket Man aside, giving our testers a configurable, extensible tool for sanctioned mischief lets them do interesting things. Below is a screenshot of my Process Gremlin at work. I’ve given it the parameters to randomly pick off busy notepad processes and kill them. I’ve coded a few other tactics, both for finding processes and determining what to do with them, but this is my favourite as it allows me to clobber instances just as they start picking up useful work.
As you can see from the beautifully coloured console window pictured, notepad instances are being selectively picked off based on their processor usage. Two things: firstly, feel free to contact me for any UI consulting you may need; secondly, I encourage you to pinch either my ideas or Netflix’s. My Process Gremlin is part of an evolving test strategy we’re developing here at Solidsoft Reply, but is available for anyone to use or extend.
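For anyone who wants the shape of the idea without reading the repository, the split between "finding processes" and "deciding what to do with them" can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Process Gremlin itself: the real tool targets Windows processes and can select on processor usage, whereas this sketch only picks at random among child processes it launched itself, and every name in it (Gremlin, find_running, kill, spawn_victims) is hypothetical.

```python
import random
import subprocess
import sys


def spawn_victims(count):
    """Start `count` idle child processes to act as targets."""
    code = "import time; time.sleep(60)"
    return [subprocess.Popen([sys.executable, "-c", code]) for _ in range(count)]


def find_running(processes):
    """Finder strategy: every process that has not exited yet."""
    return [p for p in processes if p.poll() is None]


def kill(process):
    """Action strategy: kill the process and wait for it to exit."""
    process.kill()
    process.wait()


class Gremlin:
    """Combines one finder strategy with one action strategy."""

    def __init__(self, finder, action, rng=None):
        self.finder = finder
        self.action = action
        self.rng = rng or random.Random()

    def strike(self, processes):
        """Act on one random candidate; return it, or None if none remain."""
        candidates = self.finder(processes)
        if not candidates:
            return None
        victim = self.rng.choice(candidates)
        self.action(victim)
        return victim
```

A test run would build Gremlin(find_running, kill), call strike on the list returned by spawn_victims(3) in a loop, and check that the system under test keeps working after each hit. Swapping in a different finder (say, one that filters on CPU usage) or a different action (suspend instead of kill) changes the mischief without touching the Gremlin class.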
Testing for resilience is a hard, time-consuming and traditionally manual task, but it has to be done to produce software that won’t fail in the field. The classic developer maxim “it works on my machine” is no good here as far as I’m concerned. Simian Army and the Process Gremlin both automate away much of the difficulty. The question of resilience, like anything that is covered by good test automation, is now something that can be tested frequently and with little effort. This provides the basis needed to produce high quality enterprise applications.