Scaling Agile – it’s not the same problem

I was prompted by Al Shalloway’s (@alshalloway) brief tweet this week combined with a long train journey to write this blog post.

That is exactly one of the key ideas in Nassim Nicholas Taleb’s (@nntaleb) Antifragile. The kinds of problems you get at scale are not the same kinds of problems you deal with in small numbers. To quote Nassim Taleb:

“A city is not a large village, a corporation is not a larger small business”.

Likewise a single Agile team does not behave the same way that a project team made up of several Agile teams does. The method and style of communication change, as do the kinds of risks and issues we deal with. It’s a transformation into something new, not a simple multiplication.

Why scale breeds complexity

If interactions grew linearly with the system size, then scale wouldn’t be such a problem. Each time you add a new member to the system, the complexity (C) increases by one, since you only add one more possible relationship, as visualized below:

[Figure: linear growth]

Unfortunately, as real systems grow in scale, their complexity grows much faster than that, because each component influences or interacts (directly or indirectly) with many others. If every member can interact with every other, n members have n(n-1)/2 possible pairwise relationships, so each member you add brings more new relationships than the one before. Count indirect interactions and group effects as well and the growth is steeper still. It’s an old idea that is surprisingly often neglected by those who design systems. You can visualize how the number of possible interactions grows as we add one more circle to each diagram below:

[Figure: geometric growth]

Complexity is a property of the behavior of the system, not the structure of it. That is, it’s not the number of components that makes it complex, but how they interact – as visualized by the number of lines in the diagrams above.
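To put numbers on the line counts, here is a minimal sketch. It assumes the first diagram is a simple chain (each new circle connects to one existing circle) and the second connects every circle to every other one. Those are my assumptions about the pictures, and the function names are made up for illustration. It counts direct pairs only; indirect interactions would grow faster still.

```python
def chain_links(n: int) -> int:
    """Relationships when each new member connects to exactly one existing member."""
    return max(n - 1, 0)


def pairwise_links(n: int) -> int:
    """Relationships when every member can interact with every other member."""
    return n * (n - 1) // 2


# Team sizes chosen for illustration only.
for n in (2, 3, 4, 5, 9):
    print(f"{n} members: chain={chain_links(n)}, pairwise={pairwise_links(n)}")
```

Going from 5 to 9 members doubles the chain count (4 to 8) but more than triples the pairwise count (10 to 36).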

Conversely, we may have systems that have a complicated structure but yield simple behavior. For example, a mechanical pocket watch has a complicated structure but very simple and (thankfully) predictable behavior.

So it’s complexity resulting from interactions and behaviors we have to be concerned about. This is why we see different types of problems at different levels of scale.

Aside: scale up and down

In order to be considered “scalable” the transformation must be two-way. I have seen software systems, for example, that are designed to “scale” but can’t run on modest hardware. That is not scaling, that is just big and bulky. Somehow we forget that scaling is not the same as simple expansion. To me a good scaling implementation is a dynamic one, that can scale both up and down. If you can only go one way, it’s not “scaling”.

So that’s the problem: the behavior of a small group is different than that of a large group, which is again different from that of a group of groups… and the effect shows up sooner than you might think because of the number of new communication paths.

The oft-cited Dunbar’s number says that people can maintain some sort of social relationship (not necessarily a deep one) with at most around 150 people, but we see behavior change much sooner than that. The upper bound of a Scrum team (9 people) is based on experience and seems to be re-affirmed time and again. Beyond that people see themselves as part of a mass, not a team. In my organizations I never had a manager with more than 10-12 direct reports.

Sometimes trouble shows up at even smaller numbers. It is said that behind every successful man stands a woman; behind every failed man there are two. I think that joke works with Scrum teams and Product Owners too, except it’s not funny.

Scale the problem down

This is where Agile and Antifragile meet again: big problems are best solved when you can scale them down and distribute the difficulty. The secret to successfully executing big projects is not to scale the project team up, it’s to scale the problem domain down.

What I mean by that is that we shouldn’t simply look at the project requirements and then figure out how to scale a team to build the whole thing all at once. Approaches such as Minimum Viable Product (MVP) and applying Design Thinking to focus the problem space can save a lot of time, effort and money by applying resources only to what is critical and important.

What about decision-making? The problem of scaling decision-making is a tough one. Many times we don’t even consider that we can change the way decisions are made, and so we fall back on a central person or core team that is responsible for making all the individual decisions. It’s a fragile setup because these teams are disconnected from what happens on the ground.

Scaling decision-making

To me every scaling problem is a delegation and distribution problem. The most ineffective way of all is when someone decides to distribute workload but is unwilling to delegate authority. Micro-managers would object to the micro-manager label except they are too worn out from trying to apply their bottleneck everywhere at once… and they don’t read my blog anyway. When you’re too close to the tree you can’t see the forest.

So what can we do? First of all, recognize that you can’t possibly keep more than half a dozen balls in the air at one time. Divide and Conquer, distribute, subdivide… do what it takes to bring things into manageable chunks. This is after all what Agile methods do: break things down into manageable smaller pieces of work.

Distribute Distribute Distribute

Distributed systems, whether they are mechanical, software, political or social, have some compelling properties. They are individually self-sufficient, resilient and effective. Centrally organized systems, on the other hand, are attractive only on paper or at very small scale. They work fine as long as the “central” part is available and capable of managing the information flow, but that central part quickly becomes a bottleneck and breaks down in any but the simplest projects.

So work obviously needs to be decomposed and distributed amongst multiple teams. Nothing new here: whether you distribute according to traditional functional teams or cross-functional Agile teams, it has to be done.

It is not enough to just distribute the work; you also have to distribute authority and decision-making when you’re dealing with more than a couple of teams.

Central Mission command, local tactical decisions

One of the most powerful and elegant ideas in Don Reinertsen’s (@DReinertsen) well-equipped arsenal of golden nuggets is the idea of Mission Command. Instead of developing a detailed plan to be followed by all and centrally managed, we set higher-level objectives and let individuals and teams self-organize (and even improvise) and decide how to achieve the objectives.

Central mission command is very much different from central micro-management. We centrally decide on the overall (higher-level) objective to be achieved, then delegate responsibility, authority and decision-making for how to reach the goal to each team – while accountability still rests centrally.

Central command and local decision-making is not enough either. Even when teams self-organize around a central higher-level goal, their individual approaches may create trouble later down the road. So we can distribute decision-making and authority, but how do we ensure that everyone keeps the overall integrity of project and company in mind at any given time? And how do you retain central accountability?

Use decision-rules

As teams self-organize around delegated and common objectives, they don’t just need to meet their goals, they need to do their work and act in accordance with the company’s long-term interest and in concert with other teams. An orchestra only works if everyone plays in time with each other.

I recall with fondness my son’s first tuba recital in middle school band. There were 4 tubas on stage, and he finished first. Well what can I say, he’s competitive and has since grown into a splendid college athlete. He knows how to play in beautiful concert with his team on the field now but I still smile when thinking about that first recital.

Orchestration and alignment are needed both for the project goals and against the general backdrop of the company. For example, a team might decide to achieve their part of the project goal by using an open-source software package. Although that may achieve the project objectives, the team may have inadvertently put the company’s intellectual property at risk. Not all open-source licenses are the same. By simply using “freely” available software, you may be automatically entering into an agreement under which your company must make parts of its system software freely available to the rest of the world.

You can’t manage dozens of individuals by central decision-making, but on the other hand you can’t simply let everyone loose on their own either. One way to effectively sub-divide and delegate is to create a set of decision-rules that aligns everyone on how to make decisions. Think of it as a set of guide-rails that prevents you from falling off the path.

Identifying decision-rules is not the same as identifying who is responsible for making decisions and establishing an escalation path. That is not scalable either. It is all about agreeing on how teams will make local decisions on their own, what they can decide on their own and what must be decided centrally. Such rules can enable teams to make decisions consistent with the overall command objectives. These rules are set centrally, and that is how the “central command” retains accountability for the decisions made in the project.

One of the best examples of decision-rules I have heard about came out of the Boeing-777 development program. When you design airplanes you have to make tradeoffs between weight, cost and space. If you increase the weight of your subsystem then it becomes a big deal since every ounce translates into increased operations cost for the customer. Every subsystem and every engineer is faced with these kinds of tradeoffs on a daily basis. How do you manage the choices of several thousand engineers all at once? Obviously a central design review committee wouldn’t work effectively. So the project team set “budgets” for each subsystem development team as to how much weight, cost and space they were allocated. If a team needed to exceed their weight allocation, they could do that as long as they found another team that would be willing to trade some of their weight against, for example, some additional cost. That way the weight/cost/space constraints of the 777 airplane were always satisfied overall. The individual teams could trade allowances between themselves as long as they stayed within the overall budget. They didn’t have to get every tradeoff approved centrally – they used decision-rules.
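Here is a rough sketch of what such a trading rule could look like in code. To be clear, this is my own illustration and not how Boeing implemented it: the team names, units and numbers are hypothetical, and I only model weight and cost. The point is that the rule itself guarantees the overall budget, so no central approval is needed for any single trade.

```python
from dataclasses import dataclass


@dataclass
class Allowance:
    """Budget held by one subsystem team (hypothetical units and values)."""
    weight_kg: int
    cost_usd: int


def trade(budgets: dict[str, Allowance], giver: str, taker: str,
          weight_kg: int, cost_usd: int) -> None:
    """Move weight allowance giver -> taker and cost allowance taker -> giver.

    Decision-rule: two teams may agree on any trade without central approval,
    as long as nobody trades away allowance they do not have. Every transfer
    is matched by an equal deduction, so the overall totals never change.
    """
    if weight_kg < 0 or cost_usd < 0:
        raise ValueError("trade amounts must be non-negative")
    g, t = budgets[giver], budgets[taker]
    if g.weight_kg < weight_kg or t.cost_usd < cost_usd:
        raise ValueError("a team cannot trade allowance it does not have")
    g.weight_kg -= weight_kg
    t.weight_kg += weight_kg
    t.cost_usd -= cost_usd
    g.cost_usd += cost_usd


def totals(budgets: dict[str, Allowance]) -> tuple[int, int]:
    """Overall (weight, cost) budget across all teams."""
    return (sum(b.weight_kg for b in budgets.values()),
            sum(b.cost_usd for b in budgets.values()))


# Hypothetical teams and numbers, purely for illustration.
budgets = {
    "avionics": Allowance(weight_kg=500, cost_usd=2_000_000),
    "cabin": Allowance(weight_kg=900, cost_usd=1_500_000),
}
before = totals(budgets)
# Cabin needs 40 kg more; avionics gives it up in exchange for cost allowance.
trade(budgets, giver="avionics", taker="cabin", weight_kg=40, cost_usd=100_000)
after = totals(budgets)
print(before == after)  # the overall budget is untouched by the local trade
```

The central team only has to set the starting allowances and the rule. Everything after that happens between the teams themselves.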

So instead of creating an elaborate escalation path for decision-making, create the guide-rails within which decisions can be made at the right level in alignment with the overall mission of the company.

The Agile Manifesto and the principles behind it form another example of such a decision-making framework. For every choice or decision to be made, every engineer or team can ask themselves, for example, “does what I am about to do satisfy the principle of simplicity?”, or “if we take this approach will we be able to show progress in terms of working software?”. Similarly, your company and project will have their own set of decision-rules that guide teams and individuals in making local decisions.

So is that it?

No, that’s not all of it, not by a long mile. But it’s a start. If you can

  • understand that different levels of scale require a different approach, and
  • distribute both work and decision-making authority, and
  • create a good set of decision-rules that can be used to align everyone,

…then you’ve laid the foundations for a much less stressful environment at scale.

Lean Systems: Antifragile Applied

“Systems subjected to randomness—and unpredictability—build a mechanism beyond the robust to opportunistically reinvent themselves each generation”

– Nassim Nicholas Taleb

In a previous post I introduced the concept of Antifragility – systems that benefit from shocks, randomness and disorder. Classifying the world in the triad of Fragile – Robust – Antifragile helps us understand and manage the potential impact of the uncertainty surrounding us.

It’s initially hard to imagine that anything useful could benefit from disorder, so the first thing to realize is that although objects and things can be Fragile or Robust, they can’t be Antifragile. Systems, on the other hand (which of course includes Product Development Systems), are made up of multiple interacting components. Systems exhibit behavior as they respond to their surroundings, and can be Fragile, Robust or Antifragile. It is this ability to respond and interact that opens the door to antifragility. Antifragility can be seen as a type of evolutionary mechanism, continuously picking the best of the available options. So, when we look for examples of antifragility we need to look at systems, not objects.

Stressors: the fuel of Antifragility

A stressor is something that puts a strain on the system, pulls it away from its equilibrium. It’s the system’s response to stressors that classifies the system as either Fragile, Robust or Antifragile.

A system that gets weaker from the encounter with the stressor is Fragile. For example, a pyramid scheme collapses when exposed to the light of day. Not only dictators (the individual) but the foundation of the dictatorship (the system) crumbles when the forces of democratic thought are applied. The best-laid project plan with all its Gantt charts has a best-before date sometime before the first problem is discovered.

Robust systems get neither weaker nor stronger in the presence of a stressor. Most government bureaucracies seem to fall in this category – their inability to learn and evolve astounds me, as does their unequaled staying power. Many companies operate in this way too. New ideas get rejected and expelled by the corporate immune system, allowing the company structure to stay the same even in the face of certain bankruptcy. Remember Kodak? GM?

Antifragile systems on the other hand enjoy randomness and stressors, at least up to a point. Shocks and disruption make them stronger because they keep the system alert and in shape. Stressors exercise and improve the system the same way physical activity stresses and improves your body. Strength training, for instance, involves pushing your muscles just past their breaking point. Your body is able to repair this damage and even over-shoots in the repair effort. The result is that you are left with a little more muscle mass than you had before. This is how Schwarzenegger became Schwarzenegger and Ahnold was again a cool and acceptable name for your first-born. Without these stressors the system would stagnate, much like a couch-potato grows the wrong kind of body mass and ends up with clogged arteries.

Of course, there is a limit to how much stress is beneficial. Running at a reasonable effort level puts you in better shape; the first marathoner supposedly expired at the goal line, having historically over-exerted himself to deliver with his last gasp the one-word message to the king: “victory”.

(hang on – if they won the battle, then why the life-and-death rush? Good news would still be reasonably good the next morning, right?)

The next important thing to understand about Antifragile Systems is that they work in layers. It is not enough that individual members get stronger, the system as a whole needs to be able to survive and thrive. It needs to be able to learn and select.

It’s in the DNA of the System

Going back to our example of Mother Nature as the ultimate antifragile system, we can observe that the individual members of a species are inherently fragile. In fact, each member will eventually die off, no matter how strong it is. There is a natural turnover to make room for the newer and more fit members. By natural selection and replacement of individuals the system becomes more and more fit. There is a layering effect here. Individual members (at the lowest layer) compete with each other. The strong propagate their DNA and have (presumably) stronger offspring, the weaker gradually (or abruptly, as the case may be) exit the gene pool. The system as a whole (at a higher layer) grows stronger as a result. The system survives the demise of each of its members because the information that makes up the system is preserved in its DNA, surviving generation after generation of individuals.

By evolution such a system improves gradually even if there is no master plan and things happen at random. The system continuously Inspects and Adapts, and the current “best recipe” is carried forward in our DNA. As long as we recognize and seize opportunity, even a random walk will be beneficial. Antifragile systems love errors and variation for that reason.
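A toy sketch makes the “random walk plus selection” point concrete. Everything here is a simplification of my own: “fitness” is just a number, the changes are uniformly random, and the function names are made up. Both functions see exactly the same random changes; the only difference is that one keeps a change only when it is an improvement.

```python
import random


def evolve(start: float, generations: int, seed: int = 0) -> list[float]:
    """Random variation plus selection: keep a variant only if it is fitter."""
    rng = random.Random(seed)
    best = start
    history = [best]
    for _ in range(generations):
        variant = best + rng.uniform(-1.0, 1.0)  # undirected, random change
        if variant > best:  # inspect...
            best = variant  # ...and adapt: the "best recipe" is carried forward
        history.append(best)
    return history


def drift(start: float, generations: int, seed: int = 0) -> list[float]:
    """The same random changes with no selection: a plain random walk."""
    rng = random.Random(seed)
    position = start
    history = [position]
    for _ in range(generations):
        position += rng.uniform(-1.0, 1.0)  # every change is kept, good or bad
        history.append(position)
    return history


selected = evolve(0.0, generations=50)
unselected = drift(0.0, generations=50)
print(f"with selection: {selected[-1]:.2f}, without: {unselected[-1]:.2f}")
```

With selection the value can never go down, and given the same random changes it always ends at least as high as the plain walk, because it keeps only the improvements and discards the rest. No master plan required.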

Lean Systems: Fragile?

Lean systems are called Lean because they deliberately operate with very small error margins. For example, Lean Manufacturing systems are sometimes called “zero-inventory” systems because they have almost no buffer inventory to absorb variations and problems at individual stations. If there is a problem somewhere on the production line, the whole system could shut down. This is by design: in a tightly coupled system small problems are amplified to make them painfully obvious, and every problem becomes an urgent matter.
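As a back-of-the-envelope sketch of why buffers hide problems, consider two neighboring stations. The assumptions are mine: the downstream station consumes one unit per tick, the upstream station produces nothing while it is down, and the function name is hypothetical.

```python
from typing import Optional


def first_starved_tick(buffer_units: int, outage_ticks: int) -> Optional[int]:
    """Tick at which the downstream station first has nothing to work on.

    Returns None if the buffer absorbs the whole upstream outage, in which
    case nobody downstream ever notices that something went wrong.
    """
    stock = buffer_units
    for tick in range(outage_ticks):
        if stock == 0:
            return tick  # the line stops here: the problem is now visible
        stock -= 1       # the buffer quietly covers for the stalled station
    return None


print(first_starved_tick(buffer_units=0, outage_ticks=3))  # 0: visible at once
print(first_starved_tick(buffer_units=2, outage_ticks=3))  # 2: visible, but late
print(first_starved_tick(buffer_units=5, outage_ticks=3))  # None: never seen
```

With no buffer the stoppage is felt in the very first tick. With a comfortable buffer the same fault is absorbed silently and can recur unnoticed.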

In one sense Lean systems are therefore very fragile to disorder and error so one might be tempted to simply put Lean in the Fragile category. But it’s not that simple. The antifragility of Lean is in the DNA of the system.

Lean Systems: Antifragile

So we need to reconcile the apparent fragility of the small operating margins of a Lean system with the claim that Lean systems are antifragile.

I like Steven Spear’s (The High Velocity Edge) summary of a good Lean implementation:

  1. Build a system of “dynamic discovery” designed to reveal operational problems and weaknesses as they arise
  2. Attack and solve problems when and where they occur, converting weaknesses into strengths
  3. Disseminate knowledge gained from solving local problems throughout the company as a whole
  4. Lead by developing capabilities 1, 2 and 3

The ingenuity and beauty of Lean is that even small problems become intolerable at the system level. Lean Systems use this fragile tight coupling as a way to accelerate system-level learning. If a problem develops, it immediately becomes painfully obvious that something is wrong.

Rather than working around or ignoring these small problems, the team in charge is obligated to immediately seize the opportunity to improve the way the system works before the small problem becomes a big problem. A good lean team will swarm the problem to get it fixed, and put in place measures to ensure that similar problems don’t occur in the future. The result is that the particular process step which failed has now improved and is less likely to fail in the future.

Antifragile systems love errors, and so do Lean systems. The fragility of small error tolerances acts as a forcing function which brings problems to the surface, causing the old faulty processing step to evolve and be replaced with a new and more fit one. Each small failure alters the DNA of the Lean system just a little bit, evolving and improving. One more problem spot has been eliminated, and the probability of future defects is reduced.

So here is a perfect example of a system that is designed to evolve over time, to learn from mistakes and to grow more capable after each error. It needs no top-down direction other than living the Lean Principles. There is no master plan, yet Lean systems evolve on their own to become the most competitive and effective man-made systems we have on our planet.

Evolving. Learning. Antifragile. Lean. Wonderful.