Engineering Chaos

Author: Nathan Cooper


The Netflix Tech Blog has a number of recommendations for anyone moving to the cloud. Possibly the most indispensable piece of advice the blog gives is that “The best way to avoid failure is to fail constantly”.

“We’ve sometimes referred to the Netflix software architecture in AWS [Amazon Web Services] as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends. One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.”

This advice might sound crazy: why would you actively attack your own deployed services? Because, regardless of what you do, individual components of your solution will fail. Even the most robust equipment eventually fails, and if a system cannot recover from small failures and degrade gracefully under larger ones, then that system is only as strong as its weakest link.

Netflix compares this to getting a flat tyre on the motorway. The guy who’s been practising every weekend by popping and replacing the tyres on his drive is going to be much better prepared than the guy who hasn’t looked at his spare in six months and doesn’t even know if it’s inflated.

With this philosophy Netflix has built a Simian Army. First, to ensure that failing production instances did not affect customers, they created Chaos Monkey, a tool which randomly disables instances. Other simians followed: Latency Monkey was created to introduce artificial delays, and Chaos Gorilla to simulate Amazon availability zone outages. The company began constantly testing its resilience to all kinds of failures and improving its ability to deal with the inevitable failures that would occur in production.
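To make the idea concrete, here is a minimal sketch in Python. It is not Netflix's code: the `Instance` class, the `chaos_monkey` function and the `resilient_call` helper are all hypothetical names invented for illustration, and the "instances" are plain in-memory objects rather than real servers. It only shows the shape of the technique: something disables an instance at random, and the caller is expected to carry on regardless.

```python
import random


class Instance:
    """A stand-in for one server instance in a pool (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def chaos_monkey(pool, rng):
    """Disable one randomly chosen live instance and return it (or None)."""
    live = [i for i in pool if i.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


def resilient_call(pool, request):
    """Try each instance in turn so one dead instance does not fail the request."""
    for instance in pool:
        try:
            return instance.handle(request)
        except ConnectionError:
            continue  # fall through to the next instance
    raise RuntimeError("no healthy instances left")


if __name__ == "__main__":
    rng = random.Random()
    pool = [Instance(f"web-{n}") for n in range(3)]
    for step in range(2):
        victim = chaos_monkey(pool, rng)
        print("killed", victim.name, "->", resilient_call(pool, f"req-{step}"))
```

With three instances in the pool, requests keep being served after two of them have been disabled. Only when the last one goes does `resilient_call` give up, which is the point at which a real system would need to degrade gracefully.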

The ‘tyres burst’ when AWS suffered significant networking issues across the Americas on Christmas Eve 2012. The Netflix website remained up at a time when many popular websites crashed. While some streaming services for users in the Americas were degraded, nothing broke. Their commitment to recovery from failure had paid off.

So what can we learn from Netflix for the benefit of Solidsoft Reply’s customers?

The lessons are as much cultural as technical. Remember that the company was willing to take small risks with its production code. This was a good idea, but at the time it must have required a deep cultural commitment to producing quality software. Netflix had to be the sort of place that hires smart people and gives them the space to make smart decisions. As consultants, I think it’s our duty to promote a positive software development culture, but that deserves its own post.

On a much less contemplative note, I think Chaos Monkey is a really good idea. In the best tradition of software development, I decided I’d pinch it. My current project involves creating a document scanning security solution based on resilient cloud services. Something in the spirit of Chaos Monkey would be useful to test its robustness against things like environmental failures, crashing dependencies and over-enthusiastic anti-virus programs.

Our solution integrates with a native code library. In the event this code crashes, the application has to recover gracefully. We’ve allowed for this by isolating the library in a separate process. To be robust, our multi-process/service system must keep working when any of its constituent parts fails. In order to test this, I chose to unleash my own brand of chaos upon Windows processes.
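The isolation idea can be sketched in a few lines of Python. This is an illustration under assumptions, not our project's code: the "native library" is faked by a child Python process that exits abruptly when handed the input `"bad"`, and the names `WORKER_SOURCE`, `scan_in_child` and `scan_all` are made up for the example. What it shows is that a crash in the child surfaces in the parent as nothing worse than a non-zero exit code.

```python
import subprocess
import sys

# Hypothetical stand-in for the native library: a child Python process that
# exits abruptly (non-zero exit code) when given the input "bad".
WORKER_SOURCE = """
import os, sys
doc = sys.argv[1]
if doc == "bad":
    os._exit(3)          # simulate a crash inside native code
print("scanned:" + doc)
"""


def scan_in_child(doc, timeout=30):
    """Run one scan in a separate process; return its output or None on a crash."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE, doc],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # a hung worker is treated like a crashed one
    if result.returncode != 0:
        return None  # the child died; the parent is unaffected
    return result.stdout.strip()


def scan_all(docs):
    """Scan every document, recording failures instead of stopping at the first."""
    return {doc: scan_in_child(doc) or "FAILED" for doc in docs}


if __name__ == "__main__":
    print(scan_all(["invoice", "bad", "report"]))
```

Because each scan runs in its own process, the document that kills the worker is simply marked as failed and the remaining documents are still processed. A process killer pointed at those workers exercises exactly this recovery path.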

I called this the Process Gremlin: a configurable, strategy-based Windows process eviscerator. The code is available at [My Github]. I’ll accept a gremlin isn’t a simian, but it’s still a good analogy.

There had been a distinct lack of Shatner on this blog until this point. Rocket Man aside, providing our testers with configurable, extensible tools for sanctioned mischief lets them do interesting things. Below is a screenshot of my Process Gremlin at work. I’ve given it the parameters to randomly pick off busy notepad processes and kill them. I’ve coded a few other tactics, both for finding processes and determining what to do with them, but this is my favourite as it allows me to clobber instances just as they start picking up useful work.

[Screenshot: Engineering Chaos.png, the Process Gremlin console window]

As you can see from the beautifully coloured console window pictured, notepad instances are being selectively picked off based on their processor usage. Two things: firstly, feel free to contact me for any UI consulting you may need; secondly, I encourage you to pinch either my ideas or Netflix’s. My Process Gremlin is part of an evolving test strategy we’re developing here at Solidsoft Reply, but is available for anyone to use or extend.
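For anyone who wants to pinch the idea, the split between finding processes and deciding what to do with them might look something like the Python sketch below. It is not the Process Gremlin's actual code: `ProcessInfo`, `busy_processes` and `gremlin_step` are hypothetical names, and the process list is a hand-written snapshot rather than a live query of Windows, so nothing is really killed.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ProcessInfo:
    """Snapshot of one process (hypothetical shape, not the real tool's type)."""
    pid: int
    name: str
    cpu_percent: float


# A finder narrows a snapshot to candidate victims; an action does something to one.
Finder = Callable[[List[ProcessInfo]], List[ProcessInfo]]
Action = Callable[[ProcessInfo], None]


def busy_processes(name: str, min_cpu: float) -> Finder:
    """Finder: processes with the given name using at least min_cpu percent CPU."""
    def find(snapshot):
        return [p for p in snapshot if p.name == name and p.cpu_percent >= min_cpu]
    return find


def gremlin_step(snapshot, finder, action, rng) -> Optional[ProcessInfo]:
    """Pick one candidate at random and apply the action; return the victim, if any."""
    candidates = finder(snapshot)
    if not candidates:
        return None
    victim = rng.choice(candidates)
    action(victim)
    return victim


if __name__ == "__main__":
    snapshot = [
        ProcessInfo(101, "notepad.exe", 0.0),
        ProcessInfo(102, "notepad.exe", 37.5),
        ProcessInfo(103, "explorer.exe", 80.0),
    ]
    victims = []
    # The action here only records the victim; a real one would terminate the process.
    gremlin_step(snapshot, busy_processes("notepad.exe", 10.0), victims.append, random.Random())
    print(victims)
```

Keeping the finder and the action as separate, swappable functions is what makes the approach extensible: a tester can pair "busy notepad processes" with "kill", or swap in a different finder or a gentler action, without touching the loop that ties them together.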

Testing for resilience is hard, time-consuming and traditionally manual work, but it needs to be done to produce software that won’t fail in the field. The classic developer maxim “it works on my machine” is no good here as far as I’m concerned. The Simian Army and the Process Gremlin both automate away the difficulty. The question of resilience, like anything that is covered by good test automation, is now something that can be tested frequently and with little effort. This provides the basis needed to produce high quality enterprise applications.

