I am working with Tim Coonfield to develop a talk titled “10 Biggest Mistakes We Made on Our Agile Journey (and Why We are Glad We Made Them)” for the one-day Agile Shift conference scheduled for April 12, 2019 in Houston, TX. This is the sixth in a series of articles that Tim and I will use to explore some of those mistakes and what we learned from them. You can find other articles in this series here.
If you are interested in hearing this talk or some of the other awesome speakers and topics that will be covered at the event, you can learn more about the conference and purchase tickets here.
Everything I will share here happened at Global Custom Commerce (GCC), a Home Depot Company, as we developed and improved Autobahn, our second-generation e-commerce platform. Autobahn was designed to make it easy for customers to shop for and buy more complex configurable products and services like custom window blinds, custom windows, flooring and decks.
In 2011, when the Autobahn platform started development, GCC was already the #1 seller of custom window coverings online and owned several brands including blinds.com. We were a couple years away from acquisition by the Home Depot and had about 80 employees. The existing e-commerce platform had been in production for a number of years and was still actively being improved by a large team following a traditional project management philosophy using Gantt charts and reporting on percent complete.
The Autobahn project marked a number of firsts for GCC including our first use of cloud hosting at AWS and our first use of agile methodologies. This article highlights one of our bigger mistakes and how we were able to improve as a result.
From the start, the Autobahn development team included engineers with a diverse set of skills. Some team members had T-shaped expertise: deep in a couple of areas with some knowledge across the whole technology stack. Some were I-shaped: very deep in one critical technology with little expertise in the rest. Some of us specialized in testing and some of us in building features.
In planning, we worked as a team to estimate stories by breaking down each one into a set of tasks. Usually, the people who specialized in a certain kind of task ended up providing the estimate. Although this meant many planned tasks were only understood by a select few on the team, it didn’t appear to be a problem. Planning remained fast and seemed quite collaborative.
Once the sprint started, team members naturally focused on tasks within their expertise. This usually meant starting several stories because there was almost never exactly the right amount of specialized work needed to keep everyone busy on a single story.
We developed a comfortable rhythm. Within the first couple days of the sprint, we’d have four or more stories in process. Around the middle of the second week, we’d start finishing up stories and merging them into a release branch. Late in the sprint, we’d finish up testing the stories and call them done.
It was also very efficient. Because team members worked in their strongest areas, tasks got done quickly. Team members that ran out of work in the sprint would start on stories prioritized for the next sprint. It always seemed like we were ahead because we were all busy bees all the time.
But then we hit a snag. Late in the second week of a sprint we realized we could not finish regression testing the release branch by the end of the sprint. No problem. QA engineers could perform regression testing at the start of the next sprint. After all, they weren’t very busy the first few days of each sprint because the software engineers were busy starting multiple stories.
And then we hit another snag. One sprint, one of our specialists ran out of work and, as per our standard operating procedure, started working on what was prioritized as the first story for the next sprint. Even better, it seemed, he actually knocked off the front-end work for the first five stories planned for the next sprint. Unfortunately, most of that work was wasted because the last four of those stories got deprioritized by the business before the next sprint started and eventually fell off the backlog entirely.
That caught our attention. Clearly, we were doing something wrong. Despite keeping all our specialists busy doing what they did best, we were having trouble getting stories done by the end of the sprint.
An agile consultant suggested limiting ourselves to one story at a time. The term we heard at the time for this practice was “swarming”. The entire team would work together to finish one story before moving on to the next. If a specialist ran out of work, she would pair with others to work on tasks outside her specialty or would spend the idle time developing new skills, working on a pet project or helping the team out in other ways.
Swarming worked. The most important stories were guaranteed to complete in the sprint. Collaboration improved too. Since the PO saw completed stories earlier and provided feedback sooner, the team was more open to responding to feedback and making changes within the sprint. Overall throughput improved. That is, stories were getting completed at a steady rate throughout the sprint and delivered what the business needed.
In formal terms, we found that decreasing the amount of work in process, or WIP, increased the throughput of our process.
However, swarming didn’t feel efficient. Specialists often ran out of tasks they felt competent to complete on their own. Some were willing to pair and learned new skills along the way, but often felt they were not making the best use of their time. To make matters worse, the people that were experts on the tasks that remained felt like pairing with someone less skilled was distracting and slowed them down.
Increasing the WIP limit helped up to a point. Because it allowed specialists to go faster, more tasks got done. Often, it also meant that more stories got completed in the sprint. In formal terms, we learned that increasing the WIP limit increased our efficiency and our overall throughput up to some maximum value where throughput would peak.
But increasing the WIP limit came with risks. When WIP got too high we started to see some of the same old problems — stories wouldn’t come together at the end of the sprint and would remain incomplete. Sometimes that was OK, but sometimes that unfinished work became waste. We also found that our ability to respond to change within the sprint was compromised when WIP crept up because the PO was rarely able to provide feedback early enough in the sprint.
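The shape of that trade-off can be sketched with a toy simulation. This is only an illustration under invented assumptions, not a model of what actually happened at GCC: the team size, the story size, the cap on how many people can usefully work on one story, and the round-robin staffing rule are all hypothetical numbers and rules chosen to make the effect visible.

```python
def simulate_sprint(wip_limit, team_size=6, max_people_per_story=2,
                    story_size=12, sprint_days=10):
    """Toy model of one sprint. All parameters are hypothetical.

    Each story needs `story_size` person-days of work, and at most
    `max_people_per_story` people can work productively on one story
    (a crude stand-in for specialization).
    Returns (stories_completed, person_days_sunk_into_unfinished_stories).
    """
    in_progress = []  # remaining person-days per started story, oldest first
    completed = 0
    for _ in range(sprint_days):
        # Pull new stories until the WIP limit is reached.
        while len(in_progress) < wip_limit:
            in_progress.append(story_size)
        # Assign people round-robin across started stories; anyone who
        # would exceed the per-story cap is idle for the day.
        staffing = [0] * len(in_progress)
        for person in range(team_size):
            idx = person % len(in_progress)
            if staffing[idx] < max_people_per_story:
                staffing[idx] += 1
        # Each assigned person contributes one person-day of work.
        in_progress = [left - staff for left, staff in zip(in_progress, staffing)]
        completed += sum(1 for left in in_progress if left <= 0)
        in_progress = [left for left in in_progress if left > 0]
    unfinished_effort = sum(story_size - left for left in in_progress)
    return completed, unfinished_effort


if __name__ == "__main__":
    for wip in range(1, 7):
        done, unfinished = simulate_sprint(wip)
        print(f"WIP limit {wip}: {done} stories done, "
              f"{unfinished} person-days in unfinished stories")
```

With these made-up numbers, the count of completed stories climbs from 1 to 4 as the WIP limit goes from 1 to 4, then drops to 2 at a limit of 5 and to 0 at a limit of 6, where everyone is busy all sprint but nothing finishes. The exact peak depends entirely on the assumed parameters; the point is only that a peak exists.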
I’d like to be able to say that we worked through the challenges and found the right balance over the next few sprints. The truth is we didn’t.
Part of the problem was rooted in how we interpreted the Scrum process to require one release per sprint. After all, review happened at the end and we really couldn’t deploy until we had approval from the stakeholders. The focus on one release per sprint made it seem reasonable to merge code late in the sprint and focus on regression testing the entire sprint release in the last couple days of the sprint.
Part of the problem was that we misused velocity to some degree. Naturally enough, we focused on measuring velocity and used it to judge progress against our overall plan. When our specialists worked ahead on the next sprint instead of sitting idle or working slowly on tasks they weren’t good at, average velocity increased. Unfortunately, our measurements ignored some of the waste we were building up when testing spilled into the next sprint or stakeholder feedback couldn’t be incorporated into the current sprint.
As we grew from one team to twelve, we continued to control WIP with limited success; it kept popping back up as a problem from time to time as teams would slip back into chasing efficiency.
We finally fixed that by changing our culture to focus on deploying increments of functionality as soon as they were ready. This automatically focused teams on completing one or two stories at a time. The reduction in WIP has resulted in greater throughput, faster feedback and fewer bugs and production problems. Obviously, there’s more to that story, which you can read about in my earlier article on Bug Fix Thursday.
If you want to learn more about why WIP limits work to reduce waste and improve throughput, I highly recommend “The Goal: A Process of Ongoing Improvement” by Eliyahu M. Goldratt and Jeff Cox, which helped popularize lean manufacturing back in 1984. This skinny book, written in a fast-paced, thriller form, does a great job of explaining some of the core principles of what makes agile effective even though it was written almost two decades before the Agile Manifesto.