On Operational Efficiency: The Phoenix Project Notes


After working in the software industry for 15 years, I realized that I knew surprisingly little about how companies deliver cloud software. To better understand what it takes to develop and deploy software in the cloud, I recently read The Phoenix Project, which a colleague recommended. I was surprised at how much I liked the book! I found its story engaging and its ideas to be immediately applicable outside of software operations. (Note, all book links in this post are my Amazon Associate links. I make a small commission if you buy the book using them.)

The Phoenix Project describes how to apply insights from operations research in manufacturing to improve software development operations (DevOps). It does so through a story, in the style of The Goal: A Process of Ongoing Improvement. The story follows the IT department of Parts Unlimited, a fictitious manufacturing company that also operates a retail division. The company has been losing market share to competitors who responded more quickly to market demands and offered customers a better buying experience. In response, Parts Unlimited formulated a strategy to leapfrog its competitors. This new strategy requires it to revamp its IT stack. The IT project is named “The Phoenix Project,” because it is meant to help the company rise from its ashes. But the project isn’t going well. Over the course of the book, Parts Unlimited’s IT team goes from a state of chaos to a high-functioning team by adopting manufacturing operations principles. With these principles, the IT department creates a DevOps methodology that improves the efficiency and quality of its software development and deployment. With these improved capabilities, the IT department is finally able to deliver the Phoenix Project.

A flaming bird is coming out of my computer screen! The capability of GPUs is getting out of hand. (Image generated by Leonardo.ai)

Below, I’ll share my notes from the book and notes on how I’ve applied the insights in my own life. I hope these give you ideas that you can also apply immediately. I would highly recommend the book, which really brought these concepts to life for me. I found the story so well-written that I felt stressed out reading about the IT department’s initial chaos!

DevOps is based on principles from operations research, which focuses on understanding “work” in a given system and optimizing its output.

The key insight that started DevOps is that IT work has similarities to a factory. Each piece of IT work is like a piece of factory work. And each IT person is like a manufacturing workstation. While the output of factory work is visible (a person using tools to make an assembled part), IT work is invisible (a person typing on a computer to change software code). Because of the invisibility of IT work, it’s critical to make that work visible. To do so, the work must first be identified and categorized. Then, the work must be mapped out and understood.

Who knows what the heck that person typing away at their computer is doing, but this guy is clearly working! (image generated using Leonardo.ai)

Step 1: Identify and categorize the work

Work – tasks, activities, or processes undertaken to accomplish an objective, which require resources and time to perform.

There are 4 types of work in DevOps:

Planned Work – Work that is known in advance and can be planned for. Types 1 to 3 are considered planned work.

  1. Business Projects – Projects that impact business processes and outcomes. Includes anything related to designing, building, delivering, and supporting products and services.
  2. Internal Projects – Projects related to improving and streamlining operations, such as paying down technical debt, upgrading network devices, and decommissioning software or hardware. These are often less closely managed than business projects.
  3. Changes – Smaller tasks related to the modification or transformation of processes or systems. This includes patching servers, updating software, and fixing bugs. There can be a huge number of changes, which, if not properly coordinated, can cause problems; each change can interact with other changes in unexpected ways.

4. Unplanned Work (aka Firefighting) – Work that is not known in advance. It often comes as a surprise and needs to be handled immediately. It’s the most destructive type of work, because it takes organizations away from their goals. Examples include fixing critical bugs discovered by users and recovering from system outages.

[Author note: Upon learning this, I immediately asked myself and my team about what sorts of unplanned work we encounter. And I asked how we could turn it into planned work to minimize the interruptions it caused us. The discussions were fruitful.]
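One way to start that conversation is to tag each item in a work log with one of the four types and see how much of it is unplanned. The sketch below is only an illustration: the `WorkType` enum, the `unplanned_share` helper, and the sample log entries are all hypothetical names and data, not something from the book.

```python
from collections import Counter
from enum import Enum


class WorkType(Enum):
    """The four types of work described above."""
    BUSINESS_PROJECT = "Business project"
    INTERNAL_PROJECT = "Internal project"
    CHANGE = "Change"
    UNPLANNED = "Unplanned work"


def unplanned_share(log):
    """Return the fraction of logged items that were unplanned work."""
    counts = Counter(work_type for _, work_type in log)
    total = sum(counts.values())
    return counts[WorkType.UNPLANNED] / total if total else 0.0


# Hypothetical work log: (description, type) pairs.
sample_log = [
    ("Build new checkout flow", WorkType.BUSINESS_PROJECT),
    ("Decommission old file server", WorkType.INTERNAL_PROJECT),
    ("Patch web servers", WorkType.CHANGE),
    ("Recover from database outage", WorkType.UNPLANNED),
    ("Hotfix critical bug reported by a user", WorkType.UNPLANNED),
]

if __name__ == "__main__":
    print(f"Unplanned share: {unplanned_share(sample_log):.0%}")
```

Counting items is a crude measure; weighting by hours spent would likely be more telling, but it needs time tracking that many teams don't have.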

When planning projects, don’t forget to include Nonfunctional Requirements in the design requirements. Oftentimes, IT projects focus on delivering features and functionality to customers, but forget about the nonfunctional deployment and maintenance work needed to maintain the IT stack. Ignoring nonfunctional requirements can result in the failure of an otherwise good project.

Step 2: Map out and understand the work

Making your work visible involves a few steps:

  • Mapping out your ongoing processes and projects
  • Understanding all work by making it visual on a kanban board
  • Identifying the interdependencies and constraints of your work tasks and processes

[Author note: I’ve been using a digital kanban board at work on the Workflowy app, which I also use to implement my Getting Things Done personal operating system. It forces me to focus on one task at a time. I found the book Personal Kanban useful in developing this practice.]
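To show how a board can force that one-task-at-a-time focus, here is a minimal sketch of a three-column kanban board that refuses to start a new task when the "Doing" column is full. The class name, column names, and the default limit of 1 are my own assumptions for illustration; this is not how Workflowy or any particular tool works.

```python
from dataclasses import dataclass, field


class WipLimitExceeded(Exception):
    """Raised when starting a task would exceed the "Doing" limit."""


@dataclass
class KanbanBoard:
    wip_limit: int = 1  # assumed default: one task in progress at a time
    columns: dict = field(
        default_factory=lambda: {"To Do": [], "Doing": [], "Done": []}
    )

    def add(self, task: str) -> None:
        """Put a new task in the "To Do" column."""
        self.columns["To Do"].append(task)

    def start(self, task: str) -> None:
        """Move a task from "To Do" to "Doing" if the limit allows it."""
        if len(self.columns["Doing"]) >= self.wip_limit:
            raise WipLimitExceeded(f"Finish something before starting {task!r}")
        self.columns["To Do"].remove(task)
        self.columns["Doing"].append(task)

    def finish(self, task: str) -> None:
        """Move a task from "Doing" to "Done"."""
        self.columns["Doing"].remove(task)
        self.columns["Done"].append(task)


if __name__ == "__main__":
    board = KanbanBoard()
    board.add("Write report")
    board.add("Review budget")
    board.start("Write report")
    try:
        board.start("Review budget")
    except WipLimitExceeded as err:
        print(err)
```

The point of raising an error rather than silently queueing the task is that the limit only helps if hitting it is visible.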

After climbing this hill, the map tells me that I need to walk off this cliff and into that swamp. What a helpful map! (image generated using Leonardo.ai)

Once the work is made visible and understood, you can identify ways to improve its flow. A few ideas to improve flow are:

  • Understand the time it takes to complete a piece of work (clock time). (page 234 in the book)
    • Clock Time = Touch Time + Wait Time
      • Touch Time – the time it takes all resources (factory workstations or IT people) to do the work once they’ve started it
      • Wait Time – the time the work spends waiting for each resource to pick it up once it’s handed off to them
  • Wait time is influenced by how busy your resources are. The higher their utilization, the longer the “wait time” each piece of work must endure before processing. So to reduce wait time and improve your flow rate, you should leave some slack in the system.
    • Wait Time =  [ % Busy ] / [ % Idle ] (p234)
    • As you make your resources busier, wait time grows nonlinearly: going from 50% to 90% busy raises it from 1 unit to 9 units, and at 99% busy it reaches 99 units.
  • It’s critical to identify bottlenecks in your system. A bottleneck is a resource that cannot process its work as quickly as the work is handed off to it. This constrains the output of the whole system, so it is where improvement effort must go.
    • “Any improvements made anywhere besides the bottleneck are an illusion.” (Theory of Constraints p90)
    • To relieve bottlenecks: (p162)
      1. Identify the constraint
      2. Exploit the constraint – the constraint isn’t allowed to waste any time
      3. Subordinate everything else to the constraint – release work only at the rate the constraint can consume it. Use a kanban board to release work and control Work-in-Progress (WIP).
  • Work-in-Progress (WIP) is the silent killer. The more WIP you have in your system, the more inefficiency it creates. It creates congestion at bottlenecks, thus increasing clock time. It causes distraction by making resources context switch. And it can cause you to focus on managing the volume of work, rather than focusing on quality and addressing root causes of problems. For a deeper discussion of WIP and its relationship to waste, see my notes on The Toyota Production System by Taiichi Ohno.
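The two formulas in the list above are easy to play with in code. The sketch below computes wait time as % busy / % idle and adds it to touch time to get clock time. The function names and the `hours_per_unit` parameter are my own assumptions: the formula gives wait time in relative units, so converting it to hours requires picking a scale.

```python
def wait_time(busy_fraction: float) -> float:
    """Relative wait time at one resource: % busy / % idle."""
    if not 0.0 <= busy_fraction < 1.0:
        raise ValueError("busy_fraction must be at least 0 and less than 1")
    return busy_fraction / (1.0 - busy_fraction)


def clock_time(touch_hours, busy_fractions, hours_per_unit=1.0):
    """Clock time = touch time + the wait time at each resource.

    hours_per_unit is an assumed scale for turning relative wait
    units into hours; it is not defined by the formula itself.
    """
    total_wait = sum(wait_time(b) * hours_per_unit for b in busy_fractions)
    return touch_hours + total_wait


if __name__ == "__main__":
    for busy in (0.50, 0.80, 0.90, 0.95, 0.99):
        print(f"{busy:.0%} busy -> wait time {wait_time(busy):.1f} units")
    # 50% busy -> wait time 1.0 units
    # 80% busy -> wait time 4.0 units
    # 90% busy -> wait time 9.0 units
    # 95% busy -> wait time 19.0 units
    # 99% busy -> wait time 99.0 units
```

Under these assumptions, a task with 2 hours of touch time that passes through one resource at 50% busy and another at 90% busy has a clock time of about 12 hours, almost all of it waiting.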

[Author note: Since reading this, I’ve been trying to map out my team’s processes and understand our work patterns and bottlenecks. I’ve been trying to relieve bottlenecks by changing the scheduling of work and potentially utilizing other resources to help. This involves creating standard operating procedures that would allow us to easily train others to do the work.]
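A toy model helped me see why improvements away from the bottleneck are an illusion. The sketch below treats a process as a chain of stations, each with a weekly capacity, and assumes the whole chain can only deliver as fast as its slowest station. The station names and numbers are made up for illustration.

```python
from typing import Mapping


def bottleneck(capacities: Mapping[str, float]) -> str:
    """Return the station with the lowest capacity."""
    return min(capacities, key=capacities.get)


def throughput(capacities: Mapping[str, float]) -> float:
    """In a serial chain, output is limited by the slowest station."""
    return min(capacities.values())


# Hypothetical stations and capacities (work items per week).
stations = {"development": 10, "qa": 6, "security_review": 3, "deployment": 8}

if __name__ == "__main__":
    print(bottleneck(stations), throughput(stations))
    # security_review 3

    # Doubling development capacity does nothing for overall output.
    faster_dev = {**stations, "development": 20}
    print(bottleneck(faster_dev), throughput(faster_dev))
    # security_review 3

    # Adding capacity at the bottleneck raises output until the
    # constraint moves to the next-slowest station.
    more_review = {**stations, "security_review": 7}
    print(bottleneck(more_review), throughput(more_review))
    # qa 6
```

Real processes have queues, rework, and variability that this model ignores, so treat it as a way to think about where to look, not as a forecast.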

Just looking at all of the work-in-progress on this giant kanban board is silently killing me! (image generated by Leonardo.ai)

To pull these principles together, the authors develop a Taoism-like framework that they call “The 3 Ways.” I’ll share the key ideas below in bullet points.

The 3 Ways (relates to both Agile Development and Lean Manufacturing):

  1. Principle of Flow – Systems thinking. Increase your flow of work to deliver value to customers more quickly
    1. Make work visible – Kanban board, planning techniques
    2. Limit WIP
    3. Reduce batch sizes
    4. Reduce number of handoffs
    5. Continually identify and elevate our constraints
    6. Eliminate hardships and waste in the value stream
  2. Principle of Feedback – Amplify feedback loops. Utilize feedback and metrics from the operations to diagnose and resolve problems. “Safe” systems may not be achievable, but we can make them safer.
    1. See problems as they occur
    2. Swarm and solve problems to build new knowledge
    3. Keep pushing quality closer to the source
    4. Enable optimizing for downstream work centers
  3. Principle of Continual Learning and Experimentation
    1. Enable organizational learning and a safety culture
    2. Institutionalize the improvement of daily work
    3. Transform local discoveries into global improvements
    4. Inject resilience patterns into our daily work
    5. Leaders reinforce a learning culture
      1. What was your last step and what happened?
      2. What did you learn?
      3. What is your condition now?
      4. What is your next target condition?
      5. What obstacle are you working on now?
      6. What is your next step?
      7. What is your expected outcome?
      8. When can we check?

Following the 3 Ways should, in theory, help an organization become a “generative” organization, as defined by the Westrum organizational typology model. (p416)

One response to “On Operational Efficiency: The Phoenix Project Notes”

  1. Allen

    With only 2 years experience in SaaS I have noticed many of these ideas play out myself!

    Being at a start-up the challenge for us was unknown unknowns, which (IMO) typically happen because of a task/project being new for the entire team of engineers OR* having too much tech debt. I think the latter could fit into these ideas but I am not sure what to do about unknown unknowns (other than try to reduce them during the research and project planning phase)