Move over Apple and Google, I think I have a new favorite company.
I only occasionally watch movies so I don’t have a Netflix subscription, but the company caught my attention at the Agile2013 conference. Gareth Bowles gave a very interesting talk on Netflix’s “self-service build and deployment” infrastructure. What intrigued me was the level of empowerment and the trust model in place, centered on “freedom and responsibility”. Subsequently I’ve noticed more and more reports on Netflix that fill in the pieces for a more complete picture.
Unmatched levels of personal freedom and trust – although with corresponding levels of accountability, a conspicuous lack of pre-deployment verification of new features, and a company that goes out of its way to disable its own product in front of its customers to force itself to get better. It’s not crazy, it’s Netflix – and it seems to work.
Managing 700 engineers working on a product line which serves 44 million picky customers in 40 countries obviously requires a lot of strict governance, verification, quality checks, processes and oversight, or… perhaps something completely different?
I can only observe from the outside, but from where I stand, the Netflix approach boils down to this: assume the success path the majority of the time, and deal quickly with the rare failure cases when they happen. Invest in the necessary infrastructure so that you can achieve high quality at high speed with low overhead. And stick to it.
The result is real Agility, but it takes commitment and conviction. The Netflix approach is not for everyone, but it should provide inspiration for us all to think about novel and counter-intuitive solutions.
Agile Manifesto and Lean Values
“Individuals and interactions over processes and tools” “The most efficient and effective method of conveying information to and within a development team is face-to-face conversation”
Netflix houses all of their 700 engineers in Los Gatos, California – part of Silicon Valley. They only hire senior staff and pay “top-of-market” compensation. There is no outsourcing, and there are no low-cost development sites to balance the burn-rate. They must have the most expensive labor force in the most expensive labor market. If you can stomach the burn-rate, you can have a highly skilled co-located group in the U.S. It would be hard to imagine a more expensive setup, but Netflix have understood that it’s not about the labor rate, it’s about the ROI on your R&D dollars. To enable quality and agility, they are willing to pay a high premium.
“Build projects around motivated individuals. Give them the environment and support they need, and trust them to get the job done” “Our highest priority is to satisfy the customer through early and continuous delivery of valuable software”
The HR policies and resulting culture of Netflix create a high-trust, high-performance and high-accountability environment. There is a breed of engineers that tend to like this kind of environment and perform at their best when they know they are trusted to freely work on something that creates value.
For example, any engineer at Netflix can push a code change to the live network at any time. Within two hours the change has gone through automated testing and through to deployment, into the customer’s hands. That is truly a high trust model considering the huge customer base served and the potential business damage that could be done.
It’s not as wild-west as you might imagine. Rather than going through layers of verification before deployment, Netflix manages the introduction of the new code carefully. Netflix pushes the change to a small number of customers first, monitors the behavior and then either pulls the change back or widens the deployment based on continuous monitoring of a select set of metrics. I would imagine that engineers who repeatedly make mistakes don’t last long at Netflix, but the company is open about that, and if you can take the responsibility then you partake in the benefits. Not everyone’s cup of tea, for sure, but it enables fast and continuous delivery of new functionality.
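The widen-or-pull-back loop described above can be sketched in a few lines. This is only an illustration of the idea, not Netflix’s actual tooling: the function name `staged_rollout`, the stage fractions, the single error-rate metric and the 1% threshold are all assumptions made up for the example.

```python
from typing import Callable, Sequence, Tuple


def staged_rollout(
    stages: Sequence[float],
    error_rate_at: Callable[[float], float],
    max_error_rate: float = 0.01,  # hypothetical threshold, not a Netflix figure
) -> Tuple[str, float]:
    """Expose a change to growing fractions of customers.

    `stages` lists the fractions of customers to expose, smallest first.
    `error_rate_at(fraction)` stands in for the monitoring system: it returns
    the error rate observed while that fraction of customers has the change.
    Returns the outcome and the fraction that was exposed when we stopped.
    """
    exposed = 0.0
    for fraction in stages:
        exposed = fraction
        if error_rate_at(fraction) > max_error_rate:
            # Metrics look bad at this scope: pull the change back.
            return ("rolled_back", fraction)
        # Metrics look fine: continue to the next, wider stage.
    return ("fully_deployed", exposed)


if __name__ == "__main__":
    stages = [0.01, 0.10, 0.50, 1.00]
    # A healthy change is widened all the way.
    print(staged_rollout(stages, lambda f: 0.002))
    # A change that misbehaves under load is pulled back at the 50% stage.
    print(staged_rollout(stages, lambda f: 0.002 if f < 0.5 else 0.05))
```

The point of the sketch is the shape of the decision: the expensive verification happens on live traffic, but only a small slice of customers is ever exposed to a change that has not yet proven itself.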
Lean Goal: sustainably shortest lead time. Best quality and value to people and society.
With a 2-hour build and release cycle it is hard to imagine any quicker path to deployed software. I don’t have any quality data, but you couldn’t maintain a working system like this with poor quality. Netflix added 4 million subscribers in the last quarter of 2013, which couldn’t happen if their product had a poor reputation or didn’t perform as advertised.
Lean Pillar: respect for the individual
Netflix’s development and administrative policies are heavily weighted towards freedom in a trusted environment. There is corresponding responsibility, of course, but for those who accept and thrive in this situation, it translates into high trust and respect for the individual. If you don’t perform at the top of your game then your future at Netflix is less than certain. I would normally not say that this is a respectful approach, but Netflix are quite clear on their core values and the competitive nature of their environment. Anyone going to Netflix goes there with eyes wide open, and in such a case I think it actually is an ok, open and honest way to go about it. You may not agree with the core values, but there’s nothing disrespectful about it.
Lean Pillar: Product Development Flow
Netflix has managed to take a huge step forward in achieving overall flow by (1) not batching individual code changes together for verification but releasing in small increments, and (2) removing the customary big-bang integration/verification phase. Not all code changes will break something. In fact, Netflix has recognized that there are many more passed tests than failed tests in the average project. If 90% of the tests pass, then why burden the project with anything but the 10% that reveal failures? Since we don’t know where the 10% hides, the conventional answer is simply to test everything before release. If you already have high quality in place, then the Netflix approach of releasing first and then finding and resolving the 10% failure cases quickly is elegant. And probably less costly. Certainly it is faster for the 90% of features that deploy without problems.
Lean Foundation: Leadership
Kudos to the management team at Netflix. They have to really commit and have conviction that a counter-intuitive approach will work. Instead of putting more and more heavy layers of inspection and verification into their process, they are erring on the side of being too light. Instead of managing with a traditional heavy-handed approval culture, they focus on enabling smooth flow and high speed. You can see it working in a small startup… but a company with more than $4 billion annual revenue?
Companies tend to calcify and become bureaucratic as they grow, yet Netflix has some of the most relaxed business policies around. Rather than degrading into a mess, in the right environment this can enable high performance.
High quality is necessary to delight customers and achieve high development speeds. If you really want high quality, then you have to pay for it. In most companies this means heavy verification cycles before final release. At Netflix this means (among other things) the high labor overhead described above and Chaos Monkey.
What a great concept! Chaos Monkey is a software program that continuously runs and disables pieces of the Netflix application. When you unexpectedly terminate part of an application you get unexpected behavior and… chaos. A good application can deal with partial shutdown gracefully, but most don’t. It’s simply too hard to predict what combination of problems will eventually crash your system. So, Chaos Monkey runs continuously and does its mad thing until something crashes. Chaos Monkey keeps regular office hours, so Netflix can be prepared and deal with the problems during the workday instead of in the middle of the night when an emergency really strikes.
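To make the “office hours” idea concrete, here is a minimal sketch of the selection step only. It is not Chaos Monkey’s real code or API: the names `in_office_hours` and `pick_victim`, the instance IDs, and the Monday-to-Friday 9:00 to 17:00 window are assumptions invented for the example, and the actual termination call is deliberately left out.

```python
import random
from datetime import datetime
from typing import Optional, Sequence


def in_office_hours(now: datetime) -> bool:
    """Assumed schedule: Monday to Friday, 09:00 up to (not including) 17:00."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def pick_victim(
    instances: Sequence[str], now: datetime, rng: random.Random
) -> Optional[str]:
    """Pick one running instance to terminate, or None if nothing should die.

    Outside the assumed office hours nothing is selected, so that any
    resulting failure lands while engineers are at their desks.
    """
    if not instances or not in_office_hours(now):
        return None
    return rng.choice(list(instances))


if __name__ == "__main__":
    fleet = ["i-aaa", "i-bbb", "i-ccc"]  # hypothetical instance IDs
    rng = random.Random()
    # A weekday morning: one instance is chosen at random.
    print(pick_victim(fleet, datetime(2014, 1, 6, 10, 0), rng))
    # The middle of the night: the monkey stays quiet.
    print(pick_victim(fleet, datetime(2014, 1, 6, 3, 0), rng))
```

The randomness is the whole design: nobody chooses which piece fails, so the application has to tolerate the loss of any piece.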
Maybe you don’t think this is a fair test. It’s a corner case that will almost never happen. That is what most engineers I know would say and they are right. But the real world is not fair, and unexpected problems will eventually happen. If you choose to ignore the unfair cases, that is your choice – but you accept a more fragile solution and you need a bigger customer support operation.
Neat approach, but what really sold me on the idea is that Chaos Monkey is not for controlled lab environments – it runs on the live production network which serves customers. Netflix runs on Amazon Web Services (AWS), and the cloud environment can be unpredictable. Instead of relying on AWS for resilience, the vulnerabilities identified by Chaos Monkey are fixed and the Netflix application itself recovers from unexpected failures. In the first year of existence Chaos Monkey terminated 65,000 live virtual instances on AWS. As far as I can tell Netflix has a much higher availability record than other services on AWS.
It’s a gutsy, inventive solution by the R&D team, and it also reflects a fundamental commitment from the management team to continuously improving the resilience of Netflix. What other company do you know which intentionally breaks their own product while in the hands of the customer?
The bottom line
The Netflix approach is novel, counter-intuitive, quality-focused in a different way and it seems to pay off. Could you replicate the Netflix culture in your company? Probably not. But maybe you, like me, can take inspiration from the Netflix story to not go the easy and traditional route, but look for innovative solutions even (or especially!) if they break with conventional wisdom. It should give you pause to think about your current setup and what economic model you are following: are you simply chasing the lowest possible labor rate, or are you more concerned about the overall return on investment?
Lean/Agile Product Development requires a different investment mindset. If you invest with product development flow in mind (regardless of your outsourcing situation) then the benefits are not a 5% or 10% improvement, but 50% to 100% or more. The solution is not in the lowest possible labor rate (although that always helps) but in the highest possible ROI on the next R&D dollar.