The Biggest Mistakes We Made on Our Agile Journey (and Why We are Glad We Made Them) — Bug Fix Thursday

I am working with Tim Coonfield to develop a talk for the one-day Agile Shift conference scheduled for April 12, 2019 in Houston, TX, titled “10 Biggest Mistakes We Made on Our Agile Journey (and Why We are Glad We Made Them)”. This is the second in a series of articles that Tim and I will use to explore some of those mistakes and what we learned from them. You can find other articles in this series here.

If you are interested in hearing this talk or some of the other awesome speakers and topics that will be covered at the event, you can learn more about the conference and purchase tickets here.

Everything I will share here happened at Global Custom Commerce (GCC), a Home Depot Company, as we developed and improved our 2nd generation e-commerce platform, called Autobahn, designed to make it easy for customers to shop for and buy more complex configurable products and services like custom window blinds, custom windows, flooring and decks.  

In 2011, when the Autobahn platform started development, GCC was already the #1 seller of custom window coverings online and owned several brands including blinds.com. We were a couple years away from acquisition by the Home Depot and had about 80 employees. The existing e-commerce platform had been in production for a number of years and was still actively being improved by a large team following a traditional project management philosophy using Gantt charts and reporting on percent complete.

The Autobahn project marked a number of firsts for GCC including our first use of cloud hosting at AWS and our first use of agile methodologies. This article highlights one of our bigger mistakes and how we were able to improve as a result.

In the early days of the effort to deliver our 2nd generation e-commerce platform, Autobahn, we adopted the Scrum methodology and followed all the practices “by the book,” including sprint planning, stand-ups, sprint reviews and sprint retrospectives. However, the company continued to use more traditional QA techniques and processes. As a result, QA engineers were assigned to the team but continued to work semi-independently with their own manager. This obvious mistake was rectified fairly quickly and is not really at the heart of what I want to tell you about here. It’s just important to note, since it could be tempting to attribute what came next to where we started, with QA engineers sitting half inside and half outside the team.

We also made one large decision around architecture that impacted this story somewhat. After attending Udi Dahan’s distributed architecture course, many on the team wanted to focus on building a microservices architecture with small, loosely coupled components generally connected by asynchronous messaging. Remember, this was 2011, when these ideas were just gaining widespread attention. After consulting with experts, including Udi, we were advised to build a monolithic web application using a more traditional layered architecture to provide separation of concerns. For the first year or so, that is exactly what we did. By the time we started to introduce microservices, there was a pretty significant monolith sitting at the center of our platform. In retrospect, that early decision was clearly a mistake. It certainly contributed to what follows, but it was not the sole or even the most notable cause.

When the Autobahn effort started, the business and the development team all agreed that quality was one of the most important things required to make the new platform successful. In fact, the new effort was chartered, in part, based on the promise that quality would be baked into the new platform from the beginning. After all, the company had suffered for years from quality issues on the existing platform and was tired of spending too many cycles fixing problems and not enough time truly innovating.

The team invested in quality from the beginning. Early in the first sprint, we had automated continuous integration including a comprehensive unit testing suite that ran on every commit and failed the build if any tests failed. We also implemented code coverage reporting and focused on achieving as close to 100% test coverage as we could get. The team cared deeply about quality and was fully committed to writing and maintaining unit tests to make sure things worked as designed and continued to work as the code base evolved.
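
For readers who want a concrete picture of what that kind of build gate looks like, here is a minimal, hypothetical sketch. It uses Python with pytest and coverage.py purely for illustration; our actual platform and build tooling were different, and the coverage target shown is just an example.

```python
# Hypothetical sketch of a CI gate: fail the build on any failing test,
# then fail it again if code coverage drops below a target threshold.
# Assumes pytest and coverage.py are installed; not our actual tooling.
import subprocess
import sys

COVERAGE_TARGET = 90  # illustrative; we aimed for as close to 100% as practical


def main() -> int:
    # Run the unit test suite under coverage; any failing test fails the build.
    tests = subprocess.run(["coverage", "run", "-m", "pytest", "-q"])
    if tests.returncode != 0:
        print("Unit tests failed -- failing the build.")
        return 1

    # Enforce a minimum coverage threshold; coverage exits non-zero if unmet.
    report = subprocess.run(["coverage", "report", f"--fail-under={COVERAGE_TARGET}"])
    if report.returncode != 0:
        print(f"Coverage below {COVERAGE_TARGET}% -- failing the build.")
        return 1

    print("Build passed: all tests green and coverage target met.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```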

Besides unit tests, our definition of done included an expectation of QA-driven system, integration and regression testing. The QA engineers on the team were responsible for writing test cases based on the stories. Once stories were ready for testing, the QA engineers took responsibility for executing the test cases and recording issues on pink stickies that were added to the physical scrum board maintained in the team area. Software engineers took responsibility for fixing all the bugs the business deemed important in the sprint. Once stories were completed and integrated into the main branch, QA engineers focused on testing for regression using a rapidly growing set of test cases stored in a test case management system.

Within a few sprints, a clear cadence and separation of duties naturally developed within the team. QA engineers would start the sprint trying to automate key use cases from the sprint before. They would also work with the PO to produce a set of test cases for the stories committed in the sprint. Meanwhile, software engineers would start several stories in parallel. By the early part of the second week, software engineers would start finishing up stories with passing unit tests and QA engineers would start UI and integration testing. By mid-week, the team would have all the stories in good shape and would start merging everything into a release branch. Thursday morning we would lock down the release branch so our QA engineers could focus on regression testing and work with the software engineers to make everything ready for review on Friday.

After a few sprints of this, the team started referring to the key testing day in the sprint as “Bug Fix Thursday”. It was a neat way to describe the code freeze that would happen each sprint after the merge completed and regression testing started. Up until Bug Fix Thursday, the team was able to focus on developing new features for the sprint. Starting in the morning on Bug Fix Thursday, the software engineers would generally work ahead on stories lined up for the next sprint if they weren’t busy fixing bugs identified by QA engineers on the team.

Sometimes we had trouble getting stories ready in time for Bug Fix Thursday. Most of the time we simply relaxed the code freeze rule to allow ourselves to add to the release branch later on Thursday, or, in extreme cases, Friday morning. This put a lot of pressure on the QA engineers to either rush through regression testing or to perform multiple rounds of regression testing. It also led to some unhealthy behaviors like allowing regression testing to leak into the beginning of the next sprint. Since the team was small and the platform was not in production yet, we were able to live with some of these problems for quite a while.

As Autobahn gained momentum and the team grew, Bug Fix Thursday got a little uncomfortable. As one agile team grew to two and then to three, we started feeling the pinch of Bug Fix Thursday more and more often as teams struggled to merge and test all the sprint’s stories in time for the demo on Friday afternoon. Although we introduced more cleanly separated microservices that could be deployed independently, most sprints included functionality that touched the monolithic customer or associate web sites and required extensive regression testing to ensure everything worked as expected. QA engineers felt the pressure the most as regression testing routinely leaked into the following sprint, even for stories the team was calling “done”.

Processes were improved to compensate. The change that seemed to help the most was focusing teams on getting more stories done and ready to deploy in the first week of the sprint. This forced teams to work on one or two stories at a time and to make sure they were merged and regression tested before moving on to another story. Although this did not eliminate Bug Fix Thursday, it gave the QA engineers enough confidence to time box regression testing by reducing the number of test cases checked on Bug Fix Thursday.

As we grew from three teams to six and started exploring new business opportunities, Bug Fix Thursday started to get very uncomfortable again. The team exploring new businesses started to release pilot components more frequently, mainly because these systems had very small impacts. However, when they touched critical system components, which was far too often due to the monolithic nature of the system core, their code had to be merged into what was becoming one very big and complex sprint release. The team was also surprised by how these “safe” releases managed to break things in unanticipated ways. We beefed up our unit testing. We added integration tests. We tried adding a QA engineer to float outside the teams and focus on writing more automated UI tests. We brought automated UI testing into the sprint. We challenged our software engineers to work more closely with the QA engineers on the team to finish regression testing at the end of the sprint. We even turned Bug Fix Thursday into Bug Fix Wednesday for a little while to allow more time for regression testing to complete. Some of these changes worked and stuck, some didn’t, but overall the various changes seemed to help us keep Bug Fix Thursday manageable. We got to the point where releases would happen the Tuesday after the sprint and the business was reasonably satisfied.

Behind the scenes, our QA engineers were barely holding things together. They worked long hours on Bug Fix Thursday often testing late into the night. They tested Fridays after sprint review to make sure the release was ready. Testing often continued through the weekend and into Monday. Occasionally, testing could not get done by Tuesday and releases would slip into Thursday and, in extreme cases, into the following Tuesday.

By the time we added our eighth development team, the unrelenting pressure had led us to make a number of quiet compromises on quality. The pressure to finish last sprint’s testing left QA engineers with little time to write and maintain automated UI tests. Because comprehensive regression testing was taking too long, manual regression testing focused on areas the team thought could be impacted by the changes in the sprint, and very little time was spent testing other areas. Because schedule pressure was almost always present, the team did not believe they had the time they needed to clean up the monolithic components, so technical debt was growing and it was getting harder to accurately identify the parts of the system that really needed regression testing.

Once we grew to 12 teams, the symptoms were clearly visible to the team and our business. One sprint’s release took so long to test that we decided to combine it with the subsequent sprint’s work into one gigantic release. “Hot fixes”, intra-sprint releases made to fix critical bugs that were impacting our customers, became common. In fact, we were starting to see cases where one hot fix would introduce yet another critical issue requiring a second hot fix to repair.

Finally, the pace of change completely overwhelmed our teams and processes. Release after release either failed and required rollback or resulted in a flurry of hot fixes. In one particularly bad week, the sprint release spawned a furious hydra: each time we fixed one problem, two more would show up to replace it. By that time, I was leading the IT organization and, after consulting with team members and leaders, I mandated strict rules around regression testing, hot fixes and releases to stop the bleeding.

Simultaneously, we launched a small team of three people dedicated to improving quality and our ability to release reliable software frequently. We named it Yoda. We claimed it was an acronym, but I can’t find anyone who remembers what the letters were supposed to mean. Its biggest concrete deliverable would be an improved automated regression testing suite. We also asked the Yoda team to find ways to simplify the release process and improve the overall engineering culture.

Over the next several months, the Yoda team made progress. As expected, automated tests improved. However, the biggest gains came from changes to the release management process and the culture.

The web sites were still pretty monolithic at this point, but they were surrounded by microservices that were independently deployable. The teams had also made progress on making aspects of the web sites independently deployable. The Yoda team spent some time documenting the various components and worked with development teams across the company to determine which were truly independent enough to release on their own and which required more system-wide regression testing. Yoda improved the continuous delivery process and added a chatbot to make it easier for development team members to reliably deploy. They also worked with the development teams to make releases easier to roll back.
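
To give a flavor of what chatbot-driven deploys can look like, here is a small, hypothetical sketch in Python. The component names, the command format and the trigger_pipeline helper are all illustrative; they are not the actual bot or pipeline the Yoda team built. The key design idea is that only components classified as independently deployable get the one-command path, while everything else still goes through the full sprint release.

```python
# Hypothetical ChatOps-style deploy command, in the spirit of the bot the
# Yoda team built. All names here are illustrative, not the real system.
from dataclasses import dataclass

# Components the teams classified as safe to release independently (example data).
INDEPENDENT_COMPONENTS = {"pricing-service", "sample-ordering", "review-widget"}


@dataclass
class DeployRequest:
    component: str
    version: str
    requested_by: str


def trigger_pipeline(request: DeployRequest) -> None:
    # Stub: a real bot would call the delivery pipeline's API here.
    print(f"Pipeline started: {request.component}@{request.version}")


def handle_chat_command(text: str, user: str) -> str:
    """Parse a chat message like 'deploy pricing-service 1.4.2' and act on it."""
    parts = text.split()
    if len(parts) != 3 or parts[0] != "deploy":
        return "Usage: deploy <component> <version>"

    request = DeployRequest(component=parts[1], version=parts[2], requested_by=user)

    if request.component not in INDEPENDENT_COMPONENTS:
        # Monolithic pieces still go through the full sprint release and
        # regression-testing process instead of a one-command deploy.
        return f"{request.component} is not independently deployable; use the sprint release."

    trigger_pipeline(request)
    return f"Deploying {request.component} {request.version} for {request.requested_by}..."


if __name__ == "__main__":
    print(handle_chat_command("deploy pricing-service 1.4.2", "alice"))
    print(handle_chat_command("deploy checkout-monolith 9.0.0", "bob"))
```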

Once the Yoda effort gained momentum and the development teams were ready, we relaxed the rules around regression testing and releases for the components that Yoda identified as reasonably separated and safe to release independently. Over the next couple of months, we went from 1 large release per 2-week sprint to over 50 per week. Because releases were smaller, they were easier to test and quality improved. Hot fixes became rare again. Rollbacks occurred from time to time, but, because teams planned for the possibility, did not create the kind of drama we observed in the past.

Process changes were also required. As the number of releases per sprint increased, we realized visible functionality was making it to production before business stakeholders had a chance to formally review and approve it. As a result, teams started to demo stories to stakeholders as soon as they were done and ready to deploy. For some teams, that made the traditional end-of-sprint review exercise far less useful. Therefore, some teams stopped performing the end-of-sprint review, though they continue to value and practice retrospectives based on the feedback received from the many stakeholder reviews and releases that happen during the sprint. As they work more story by story, teams are gradually starting to look at things like cumulative flow diagrams and cycle times and are starting to experiment with other agile methodologies, such as Kanban.

And so Bug Fix Thursday lived and mostly died within our agile process. At times, it served us well. At times, it reflected problems in our process or our code. At times, it created additional problems and raised stress levels. The solution, though obvious in retrospect, was terribly counter-intuitive, especially in a world where the codebase includes some critical monolithic components: create and nurture a culture that values releasing more and more frequently. Smaller, more frequent releases make testing easier and the risks smaller. Independently testable and deployable components are an important part of the story, but they don’t do much good without the commitment to release more frequently. Although we had always talked about it and even built much of the necessary infrastructure to support it, we never brought it into focus until we launched Yoda and truly changed our culture.

Unlike some of the other stories in this series, we’re still not quite done with Bug Fix Thursday. We just found a way to make it smaller and ensured that it can’t get any bigger by limiting its impact to the monolithic pieces of our system that are left over from the early days of the Autobahn platform. We’re also committed to shrinking it further over the coming months by focusing a small team, called Supercharge Autobahn, on breaking down the highly complex remaining pieces of the original monolith into truly independent components. We also continue to work on our engineering culture to make sure we don’t backslide.

The Biggest Mistakes We Made on Our Agile Journey (and Why We are Glad We Made Them) — Failure IS NOT an Option

I am working with Tim Coonfield to develop a talk for the one-day Agile Shift conference scheduled for April 12, 2019 in Houston, TX, titled “10 Biggest Mistakes We Made on Our Agile Journey (and Why We are Glad We Made Them)”. This is the first in a series of articles that Tim and I will use to explore some of those mistakes and what we learned from them. You can find other articles in this series here.

If you are interested in hearing this talk or some of the other awesome speakers and topics that will be covered at the event, you can learn more about the conference and purchase tickets here.

Everything I will share here happened at Global Custom Commerce (GCC), a Home Depot Company, as we developed and improved our 2nd generation e-commerce platform, called Autobahn, designed to make it easy for customers to shop for and buy more complex configurable products and services like custom window blinds, custom windows, flooring and decks.  

In 2011, when the Autobahn platform started development, GCC was already the #1 seller of custom window coverings online and owned several brands including blinds.com. We were a couple years away from acquisition by the Home Depot and had about 80 employees. The existing e-commerce platform had been in production for a number of years and was still actively being improved by a large team following a traditional project management philosophy using Gantt charts and reporting on percent complete.

The Autobahn project marked a number of firsts for GCC including our first use of cloud hosting at AWS and our first use of agile methodologies. This article highlights one of our bigger mistakes and how we were able to improve as a result.

In the early days of the effort to deliver our 2nd generation e-commerce platform, Autobahn, we adopted the Scrum methodology and followed all the practices “by the book,” including sprint planning, stand-ups, sprint reviews and sprint retrospectives. However, the company continued to use more traditional project management techniques and reporting processes. As a result, the team was required to work with the PMO to keep a Gantt chart updated with progress by mapping completed stories to estimates of percent complete (a mistake I’ll cover in another article). The team was also concerned that, even though the CEO was shepherding the project himself, many leaders in the company were not happy about the company investing resources to build a new e-commerce platform instead of investing more in the existing one. There was also the issue of racing to replace the existing platform, which remained under active development, since nobody in the company was interested in moving our e-commerce business to the new platform until it was more capable than the existing one. Despite these challenges, the team was optimistic that we would deliver on time and within budget.

For the first few months things appeared to be going well. Every sprint we delivered exactly as promised. I know the Scrum Guide talks about forecasts these days, but back then the book called for commitment — a very clear promise from the team to the business that they would deliver what they said they would deliver at planning. As the project started to gain momentum, we shifted focus from basic CRUD to the critical functionality required to sell any sort of configurable product or service.

As the required functionality got more complex, we started to have more difficulty delivering the way we intended. However, we could usually figure out a way to get enough done to demo by interpreting the story narrowly or, in some cases, by pushing the PO to pull things we had convinced ourselves were not truly critical out into new stories to be tackled in future sprints. At times, a couple of us also worked crazy hours over the weekend to make sure things got done as planned. The good news, we thought, was that velocity kept increasing, so there was no doubt that we could tackle all those new stories and still deliver according to the original plan. We certainly thought that the combination of increasing velocity and the on-time trend shown in the project plan would make us look good to the business and help us keep our project alive.

Then came THAT story. It’s not really important which story exactly or what it was expected to deliver. What does matter is that after the first week of our two week sprint we knew we were in trouble and we knew THAT story was the problem. As per usual, a couple of us decided to work over the weekend to break through the hard bits so the team could finish wrapping it up in the second week of the sprint. Sunday night we still had not broken through. In fact, we had started to recognize that we had more work ahead of us than we had believed back on Friday evening.

The following Monday, the whole team got together after daily stand-up to talk about THAT story yet again. The team members that worked over the weekend shared details about the issues they had solved and the new issues they had uncovered. Nobody seemed overly worried. After all, we were the team that always delivered and failure just wasn’t an option. We spent an hour or so breaking down the remaining work into a set of tasks and put them up on the board. We updated our burndown chart and were pleased to note that we were still going to deliver THAT story, and everything else we had committed to, within the sprint, even if it was going to require a little extra effort along the way.

Late in the day on Monday, I recognized I was in for a long night. As I stood up to stretch, I noticed other team members around me still hard at work and realized they were likely behind on their tasks too. A few hours later, I realized I was simply staring at the screen and was not really accomplishing much so I packed up to head home with a plan to start fresh early the next morning. The rest of the team appeared to be in about the same place as they headed out after a very, very long day.

It was pretty much the same story Tuesday and Wednesday. Despite lots of conversation, creative re-planning each morning and long days trying very hard to finish tasks, we just couldn’t seem to break through into the home stretch. Left unsaid was a rapidly growing sense of dread — THAT story wasn’t getting done.

Thursday morning we spent significant time after stand-up focusing on how to trim THAT story back, move bits out or otherwise bend the fabric of reality to allow us to call it done. Our newest team member then said what desperately needed saying: “THAT story is not getting done this sprint. If we keep wasting time on it, we won’t finish up testing the other stories in the sprint either.” He then advanced a heresy we had not even allowed ourselves to consider. “Perhaps,” he said, “we should fail with honor. We tried our best and we’ll get everything else done this sprint.”

We stood there for a moment and silently asked ourselves essentially the same question: how will the business react and how will we explain it? We started to discuss the idea and very quickly got past the fear. We would simply acknowledge the story wasn’t done and that it could go into the next sprint if the business still wanted it at the top of the priority list. We wouldn’t make excuses and we wouldn’t talk about percent complete. Our direct manager, who coded with the team, was very uncomfortable with the idea and did not want us to do it at first. He knew the political environment and worried about the backlash. In the end, he came around and the team decided to put THAT story aside and make sure the rest of the stories in the sprint were truly done before the demo.

I got the dubious honor of demoing the last completed story and discussing the failed one. Despite all the brave talk the morning before, I distinctly remember the little flutter in my gut as I wrapped up that final demo. I remember staring at the ugly unfinished story on the projector screen as I said, “and THAT story did not get done.”

I took a deep breath and just a second passed before one of the business leaders asked the obvious question, at least the obvious question for someone used to reviewing projects using Gantt charts clearly illustrating percent complete. She asked, “So how close is it to done?”

I delivered the answer, carefully vetted with the team and rehearsed in advance, four simple words designed to explain our commitment to agile principles and to clear communication and transparency: “It’s simply not done.”

Of course, there was more to say after that. It started with a discussion about why percent complete is an illusion and quickly moved into a group effort to figure out the best way to move forward. After a few minutes, I realized that everyone there, the developers and the business, was engaged in looking for solutions and not spending any time at all assigning blame.

None of us should have been surprised, though we were. After all, one of our four company values is “experiment without fear of failure”. In essence, that’s just another way of saying that sometimes you’ll stumble and that’s OK as long as you learn and improve based on the experience. We saw it and lived it that day for sure. It was the beginning of a true partnership between the team and the business to figure out how to really deliver the benefit of a new e-commerce platform to our customers.

The same idea is between the lines of the Agile Manifesto. All of the unintentional spin and irrational optimism inherent in those pretty Gantt charts showing tasks 80% done gets pushed aside by a perfectly clear idea: what’s done is done. Transparency from a development team is critical to actually making informed decisions about what to do next. “Responding to change”, one of the four values in the Agile Manifesto, is as much about what you learn on the technical side as it is about reacting to evolving business realities. The day our team and our business embraced that simple idea was one of the most important days in our agile journey.