I recently had the chance to speak at the wonderful Lean Agile Scotland conference. The conference had a very wide range of subjects being discussed on an amazingly high level: complexity theory, lean thinking, agile methods, and even technical practices!
I followed a great presentation by Steve Smith on how the popularity of feature branching strategies makes Continuous Integration difficult, if not impossible. I couldn’t have asked for a better lead-in for my own talk.
Which is about giving up and starting over. Kinda.
Why? Because, when you really get down to it, refactoring an old piece of junk, sorry, legacy code, is bloody difficult!
Sure, if you give me a few experienced XP guys, or ‘software craftsmen’, and let us at it, we’ll get it done. But I don’t usually have that luxury. And most organisations don’t.
When you have a team that is new to agile development practices, like TDD, refactoring, clean code, etc., learning that stuff in the context of a big ball of mud is really hard.
You see, when people start to learn about something like TDD, they do some exercises, read a book, maybe even attend a training course. They’ll see this kind of code:
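The original post showed a screenshot of code at this point. As an illustrative stand-in (this example is mine, not from the post), think of the kind of small, clean, test-driven code that exercises and books use:

```python
def fizzbuzz(n: int) -> str:
    """Classic TDD kata: translate a number to its FizzBuzz word."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


def test_fizzbuzz():
    # Each behaviour was driven out by a failing test first.
    assert fizzbuzz(3) == "Fizz"
    assert fizzbuzz(5) == "Buzz"
    assert fizzbuzz(15) == "FizzBuzz"
    assert fizzbuzz(7) == "7"
```

Small functions, obvious names, a test for every behaviour. Easy to learn from.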
Then they get back to work, and are on their own again, and they’re confronted with something like this:
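Again, the original showed a screenshot here. As a hedged stand-in (the names and logic are invented for illustration), picture something along these lines:

```python
# Illustrative legacy-style code: parsing, business rules, persistence,
# notification and error handling all tangled into one function.
def process(order, db, mailer, log):
    try:
        if order is not None and order.get("status") != "CANCELED":
            total = 0
            for line in order.get("lines", []):
                if line.get("type") == "A":
                    total += line["qty"] * line["price"] * 0.9  # magic discount
                elif line.get("type") == "B":
                    total += line["qty"] * line["price"]
                # unknown line types are silently ignored
            db.execute("UPDATE orders SET total=%s WHERE id=%s",
                       (total, order["id"]))
            if total > 1000:
                mailer.send(order["email"], "big order!")  # buried side effect
    except Exception:
        log.write("something went wrong")  # swallow every error
```

Where would you even attach a unit test? That gap between the training example and daily reality is the problem.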
And then, when they say that TDD doesn’t work, or that agile won’t work in their ‘real world’ situation, we say they didn’t try hard enough. In these circumstances it is very hard to succeed.
So how can we deal with situations like this? As I mentioned above, an influx of experienced developers that know how to get a legacy system under control is wonderful, but not very likely. Developers that haven’t done that sort of thing before really will need time to gain the necessary skills, and that needs to be done in a more controlled, or controllable, environment. Like a new codebase, started from scratch.
Easy now, I understand your reluctance! Throwing away everything you’ve built and starting over is pretty much the reverse of the advice we normally give.
Let me explain using an example.
A story from the Real World
This is the story of a team that I worked with recently. And this team had the full stack of problems that you encounter in different forms in many organisations.
- A history of takeovers and unrest, reflected in a big, messy codebase merging functionality from the various acquired companies
- A history of management problems, with overly controlling architects/tech-leads and indifferent line management
- A team that didn’t have much experience in our beloved technical practices, and had become very cautious after being pulled in different directions by a sequence of those controlling tech-leads
- Business stakeholders that hadn’t had new functionality delivered in 18 months, because of the migration/merge work coming out of those takeovers
The new CTO knew that he’d have to take a different direction. He brought in some people to help, and got me and the excellent Silvester van der Bijl from FourScouts to work with his teams.
But though this team was very much open to learning new things, and actually eager to start refactoring their system, it turned out that there was no realistic way they could do so in time to gain the speed they needed for the business to start trusting them again.
We did try. We did coding katas. Paired up on tougher changes. Started a stop-the-line process for any bugs found. Sent part of the team to Chet Hendrickson’s excellent Agile Development Skills training. Did proper ‘boy scout rule’ refactoring in all our work. Identified and performed larger, planned refactorings where needed.
And there was progress. Unit test coverage went up by 25%! From 1.7% to 2.1%. Unit test run time went down from two hours to 10 minutes. But though improvements in-the-small were starting to happen, the attempts to attack larger design problems in the system were slow and very error prone.
This all came to a head soon enough, due to the delayed delivery of some, theoretically, simple functionality. Higher management started contemplating letting the team go completely and moving to a so-called off-the-shelf product. This would have been bad for the team, but would also have meant the company letting go of considerable competitive advantage.
More drastic action was needed, and the talk turned to doing a rebuild.
Now, rebuilds seem to be the most popular type of software project on the planet! Pretty much anyone in software development has worked on a rebuild project. I worked for one company where we built the same product five times in four years!
Developers seem to like rebuilds, because it allows them to lay the blame with previous developers that allowed things to get so messy.
Management seems to like rebuilds because it allows them to lay the blame with previous management that allowed things to get so messy.
The business likes rebuilds because it allows them to lay the blame with the development team and managers for not improving business results.
Clean slates all over. Somebody else can be responsible.
Still, though, if we look at rebuilds, they are not often very successful. Or recommended. Experienced developers will pretty much all say the same thing: don’t do it!
We know all kinds of reasons why rebuilds fail:
- The original requirements are simply not known. They may have been written down somewhere, or not. They certainly weren’t sufficiently unambiguous to just take and re-implement. Besides, no one knows which version was actually built. Furthermore:
- The requirements ‘evolved’. Sometimes because we changed our minds. Oftentimes because situations in production simply were not covered. And those certainly were not all written down.
- We have new requirements! We’re doing a rebuild because we want to (be able to) change this system. There’s a reason we want that change: we don’t like what it does now.
- We no longer need it. We’ll be spending a lot of time rebuilding functionality that we don’t need anymore. That workaround for that one customer? He’s no longer a customer, or in the meantime has updated his own systems so the workaround is no longer necessary. That 20% of features that no one uses? Let’s double down on that wasted investment!
- Requirements freeze or split focus? If we’re busy building what we had before, which we don’t want, we won’t be getting the new things we want. We have to wait until the rebuild is done. Or we do try to change the old system, which is slow, and in doing so delay the rebuild and add new requirements to it.
- Did we learn? The new system won’t be any better than the old one, because we won’t have changed our process and culture, so we’ll end up with the same… limited quality.
Yes, there’s a lot that can go wrong. This team, though, could not go on the way they had been. It would simply take too long before enough ‘technical debt’ was paid off to get the system into a competitive state again. Besides, I’ve done a few rebuilds, and I figured I might know a way to make this work.
We decided we could avoid those pitfalls. We’d just have to:
- Deliver value from day one
- Don’t rebuild the same mess
- Don’t rebuild using the same process
- Don’t rebuild the same functionality
Well, possible. Maybe. It would take some doing.
The Agile way to rebuild
To achieve those goals, we combined tactics in three areas: architectural, business value and process. We knew that those tactics could reinforce each other, like the XP practices do.
- Architectural: The Strangler Pattern
- Business Value: Behaviour Driven Development (BDD)
- Process: Continuous Delivery
The Strangler Pattern
The architectural tactic we deployed was the strangler pattern. The name was coined by Martin Fowler (but then, what wasn’t named by him?) for an approach that lets you build a new system around an existing one, and then incrementally migrate functionality from the old system to the new one.
It’s easiest to explain with an example. Say we have an existing, legacy system. In our example (and in the example project) we’ll take the simplest case and discuss a website.
We insert a new system between the client and the legacy system, and let all requests go through this new system. Initially, the new system doesn’t actually have to do anything, just pass things through. A proxy.
Then, at some point, we decide to replace a part of the website’s functionality. In this case we simply check whether or not the request is for the page we want to replace. If it is, we handle it in our new system, which perhaps uses a new back-end service as well, so we don’t accidentally put business logic in a front-end.
And, as we’re going to use Continuous Delivery, we need the control supplied by Feature Toggles (see, another thing Martin Fowler named!), which let us dynamically decide whether to show the old or the new version of a page. This allows us to push the new page to production, but make it available only to internal users, who can then decide whether the rest of the world is ready for it. It also allows us to do A/B testing of the new page against the old one, to ensure we don’t have regressions in, for instance, conversion rates.
And that’s all there is to it. In fact, the very first instance of this set-up consisted of only about 10 lines of Apache configuration, including the basic toggle.
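To give an idea of the scale, a minimal sketch of such a set-up might look like the following (the host names, page path, and cookie-based toggle are all illustrative assumptions, not the project’s actual configuration):

```apache
# Default: reverse-proxy every request to the legacy site.
ProxyPass        / http://legacy.internal/
ProxyPassReverse / http://legacy.internal/

# Toggle: requests carrying the (illustrative) feature cookie get the
# one rebuilt page served by the new system instead. mod_rewrite rules
# are evaluated before the ProxyPass mapping, so this takes precedence.
RewriteEngine On
RewriteCond %{HTTP_COOKIE} new_product_page=on
RewriteRule ^/product$ http://new.internal/product [P,L]
```

Flipping the cookie on for internal users (or a percentage of traffic) is the whole toggle mechanism; everything else still flows to the legacy system untouched.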
This pattern makes it possible to deliver value from the start. And for the team to learn those fancy agile development skills in a context that would grow along with their experience: ‘green field’ components whose complexity starts out low but grows over time.
There’s also a more psychological component to this. If you’re the developer that lets the unit-test coverage slip from 2.1% to 2.0%, nobody is going to get very excited about that. If you’re the one that lets it slip from 100% to 99.9%…
Continuous Delivery (Deployment!)
We combined the strangler pattern with the agreement that we would do Continuous Delivery for all the new components. Now, different people have different ideas on what is meant by ‘Continuous Delivery’. For the purposes of this story, I will simplify it to a single statement that makes all the difference:
Every push goes to production
This one agreement keeps the focus of the team, and of each developer, on quality. There is no delaying testing. If what you write right now is going to be in production in a few hours, you can’t postpone testing until tomorrow. You can’t leave testing to someone else. And if you know that it is you, and only you (with your pair, of course), who is responsible for not breaking the website, you’ll not just keep that test coverage at 100%, you’ll make very sure that those tests are actually useful. And you’ll keep checking whether any types of tests are missing.
Suddenly, those agile development skills get to have a very direct and meaningful role in your work.
Behaviour Driven Development
Looking back at the reasons most rebuilds fail, we see that most are clustered around the requirements. It is crucial that we know what we’ll be building, that it is what our customers need right now, and that while we’re building it we don’t make the same mistakes and end up with another undocumented system.
BDD is particularly well suited to address those problems. The process part of it means that you arrange for close collaboration with the customer. In the case of a rewrite, it’s tempting for a Product Owner to say: “Just make sure it works the same as the old system”. It’s very important to make sure that does not happen.
So we gather the Three Amigos, bringing together developers, testers and our product owner, and ensure that any and all new functionality is discussed between them. Once they agree on how something should work (independently of whether it worked exactly like that before), they write down acceptance scenarios that unambiguously record the outcomes of those discussions.
Those acceptance scenarios are then automated by the development team, using a tool such as Cucumber (or Behat, or FitNesse). The reports are published as living documentation of how the system works, guaranteeing that it still works as expected.
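For illustration, a scenario of the kind that comes out of such a discussion might look like this in Cucumber’s Gherkin syntax (the feature and steps are invented, not from the actual project):

```gherkin
Feature: Checkout

  Scenario: Returning customer sees their saved delivery address
    Given a customer with a saved delivery address
    When they open the checkout page
    Then their saved address is pre-filled
    But they can still enter a different address
```

The scenario is readable by the product owner, automatable by the developers, and unambiguous enough to settle later arguments about what was agreed.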
We also agree that while specifying functionality we investigate, where necessary, whether that functionality is actually still needed, and whether there is a good business case for it.
In the example project, we actually investigated the impact that individual features on a webpage had on that page’s conversion rate, and left some of them out of the new system because that impact was simply too small.
Piece of cake!
Now, seeing those rules and tactics we used, it might still seem like a very major undertaking to build up such a new system.
But this team managed to set all that up, including the initially needed functionality, in two weeks’ time. There really is no excuse to delay adopting the techniques that will get you to continuous delivery.
Of course, it took another two weeks to get the page to a point where design and business were happy with it. By that time, the team had a very good start on using those new skills. And they have continued growing in skill level ever since at a very high rate.
And after the page had been served to 10% of users for a few weeks, they could demo to upper management not just the new functionality, but also the analytics numbers showing big improvements in conversion rates over the old version. They could then simply ask the CEO and marketing manager whether those numbers justified rolling out to more users, and the complete control they had allowed them to do exactly that, fearlessly, with a push to production during the demo!