Don’t Refactor. Rebuild. Kinda.

I recently had the chance to speak at the wonderful Lean Agile Scotland conference. The conference covered a very wide range of subjects at an amazingly high level: complexity theory, lean thinking, agile methods, and even technical practices!

I followed a great presentation by Steve Smith on how the popularity of feature branching strategies makes Continuous Integration difficult to impossible. I couldn’t have asked for a better lead-in for my own talk.

Which is about giving up and starting over. Kinda.

Learning environments

Why? Because, when you really get down to it, refactoring an old piece of junk, sorry, legacy code, is bloody difficult!

Sure, if you give me a few experienced XP guys, or ‘software craftsmen’, and let us at it, we’ll get it done. But I don’t usually have that luxury. And most organisations don’t.

When you have a team that is new to agile development practices like TDD, refactoring and clean code, learning that stuff in the context of a big ball of mud is really hard.

You see, when people start to learn about something like TDD, they do some exercises, read a book, maybe even attend a training course. They’ll see this kind of code:

/images/2015/10/Dont-Refactor.-Rebuild.-Kinda.-LASCOT.pngExample code from Kent Beck’s book “Test-Driven Development: By Example”

Then they get back to work, and are on their own again, and they’re confronted with something like this:

/images/2015/10/Dont-Refactor.-Rebuild.-Kinda.-LASCOT-1.pngCode sample from my post “Code Cleaning: A refactoring example in 50 easy steps”

And then, when they say that TDD doesn’t work, or that agile won’t work in their ‘real world’ situation, we say they didn’t try hard enough. But in these circumstances it is very hard to succeed.

So how can we deal with situations like this? As I mentioned above, an influx of experienced developers who know how to get a legacy system under control is wonderful, but not very likely. Developers who haven’t done that sort of thing before will really need time to gain the necessary skills, and that needs to happen in a more controlled, or controllable, environment. Like a new codebase, started from scratch.

Easy now, I understand your reluctance! Throwing away everything you’ve built and starting over is pretty much the reverse of the advice we normally give.

Let me explain using an example.

A story from the Real World

This is the story of a team that I worked with recently. And this team had the full stack of problems that you encounter in different forms in many organisations.

  • A history of takeovers and unrest, reflected in a big, messy codebase merging functionality from the different companies that had been bought
  • A history of management problems, with overly controlling architects/tech-leads and indifferent line management
  • A team that didn’t have much experience with our beloved technical practices, and had become very careful after being pulled in different directions by a sequence of those controlling tech-leads
  • Business stakeholders that hadn’t had new functionality delivered in 18 months, because of the migration/merge work coming out of those takeovers

The new CTO knew that he’d have to take a different direction. He brought in some people to help, and got me and the excellent Silvester van der Bijl from FourScouts to work with his teams.

But though this team was very much open to learning new things, and actually eager to start refactoring their system, it turned out that there was no realistic way they could do so quickly enough to gain the speed they needed for the business to start trusting them again.

We did try. We did coding katas. Paired up on tougher changes. Started a stop-the-line process for any bugs found. Sent part of the team to Chet Hendrickson’s excellent Agile Development Skills training. Did proper ‘boy scout rule’ refactoring in all our work. Identified and performed larger, planned refactorings where needed.

And there was progress. Unit test coverage went up by 25%! From 1.7% to 2.1%. Unit test run time went down from two hours to 10 minutes. But though improvements in the small were starting to happen, the attempts to attack the larger design problems in the system were slow and very error-prone.

This all came to a fairly drastic point soon enough, due to the delayed delivery of some theoretically simple functionality. Higher management started contemplating letting the team go completely and moving to a so-called off-the-shelf product. This would have been bad for the team, but it would also have meant the company letting go of considerable competitive advantage.

More drastic action was needed, and the talk turned to doing a rebuild.

Rebuilds

Now, rebuilds seem to be the most popular type of software project on the planet! Pretty much anyone in software development has worked on a rebuild project. I once worked for a company where we built the same product five times in four years!

Developers seem to like rebuilds, because it allows them to lay the blame with previous developers that allowed things to get so messy.

Management seems to like rebuilds because it allows them to lay the blame with previous management that allowed things to get so messy.

The business likes rebuilds because it allows them to lay the blame with the development team and managers for not improving business results.

Clean slates all over. Somebody else can be responsible. 

Still, though, if we look at rebuilds, they are not often very successful. Or recommended. Experienced developers will pretty much all say the same thing: don’t do it!

We know all kinds of reasons why rebuilds fail: no value gets delivered until the very end, and the team ends up rebuilding the same mess, with the same process, and with all of the same functionality.

Yes, there’s a lot that can go wrong. This team, though, could not go on the way they had been. It would simply take too long before enough ‘technical debt’ was paid off to get the system into a competitive state again. Besides, I’ve done a few rebuilds, and I figured I might know a way to make this one work.

We decided we could avoid those pitfalls. We’d just have to:

  • Deliver value from day one
  • Don’t rebuild the same mess
  • Don’t rebuild using the same process
  • Don’t rebuild the same functionality

Easy!

Well, possible. Maybe. It would take some doing.

The Agile way to rebuild

To achieve those goals, we combined tactics in three areas: architecture, business value and process. We knew that those tactics could reinforce each other, like the XP practices do.

The Strangler Pattern

The architectural tactic we deployed was the strangler pattern: a name coined by Martin Fowler (but then, what wasn’t named by him?) for an approach that allows you to build a new system around an existing one, and then incrementally migrate functionality from the old system to the new one.

It’s easiest to explain with an example. Say we have an existing, legacy system. In our example (and in the example project) we’ll take the simplest case and discuss a website.

/images/2015/10/Dont-Refactor.-Rebuild.-Kinda.-LASCOT-2.pngStrangler pattern 1 - legacy situation

We insert a new system between the client and the legacy system, and let all requests go through this new system. Initially, the new system doesn’t actually have to do anything, just pass things through. A proxy.

/images/2015/10/strangler-2-proxy.pngStrangler pattern 2 - introduce proxy

Then we decide we are going to replace a part of the functionality of the website. In this case we’ll simply check whether or not the request is for the page we want to replace. If it is, we handle it in our new system, which perhaps uses a new back-end service as well, so we don’t accidentally put business logic in a front-end.

/images/2015/10/strangler-3-add-functionality.pngStrangler pattern 3 - introduce new functionality in wrapping component

And, as we’re going to use Continuous Delivery, we need the control supplied by Feature Toggles (see, another thing Martin Fowler named!), which allow us to dynamically decide whether to show the old or the new version of that page. This allows us to push the new page to production, but make it available only to internal users, who can then decide whether the rest of the world is ready for it. It also allows us to do A/B testing of the new page against the old one, to ensure we don’t have a regression in, for instance, conversion rates.

/images/2015/10/strangler-3-feature-toggle.pngStrangler pattern 4 - add a feature toggle

And that is all there is to it. In fact, the very first instance of this set-up consisted of only about 10 lines of Apache configuration, including the basic toggle.
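
To make the idea concrete, here’s a rough sketch of that routing-plus-toggle logic as a tiny Node/TypeScript proxy. Keep in mind the real first version was Apache configuration, not this; the legacy host, the /search page and the opt-in cookie below are illustrative assumptions, not this project’s actual set-up:

    // A minimal strangler proxy sketch: everything goes to the legacy site,
    // except one page we have rebuilt, and even that page is only shown
    // when a basic feature toggle is on.
    import { createServer, type IncomingMessage } from "node:http";

    const LEGACY_URL = "http://legacy.internal:8080"; // hypothetical legacy host

    // Basic feature toggle: only serve the new page to visitors that opted in
    // (for instance internal users who set this cookie).
    function newSearchEnabled(req: IncomingMessage): boolean {
      return (req.headers.cookie ?? "").includes("new-search=on");
    }

    const server = createServer(async (req, res) => {
      if (req.url?.startsWith("/search") && newSearchEnabled(req)) {
        // Handled by the new system (which may call a new back-end service).
        res.writeHead(200, { "content-type": "text/html" });
        res.end("<h1>New search page</h1>");
        return;
      }

      // Everything else passes straight through to the legacy system.
      // (Sketch only: GET requests, no request headers or bodies forwarded.)
      const upstream = await fetch(LEGACY_URL + (req.url ?? "/"));
      res.writeHead(upstream.status, {
        "content-type": upstream.headers.get("content-type") ?? "text/html",
      });
      res.end(Buffer.from(await upstream.arrayBuffer()));
    });

    server.listen(8080);

Once a page has fully moved over, the toggle and the special-case route disappear again, and you repeat the trick for the next piece of functionality until the old system is strangled.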

This pattern makes it possible to deliver value from the start. It also allows the team to learn those fancy agile development skills in a context that grows along with their experience: ‘green field’ components whose complexity starts out low but grows over time.

There’s also a more psychological component to this. If you’re the developer that lets the unit-test coverage slip from 2.1% to 2.0%, nobody is going to get very excited about that. If you’re the one that lets it slip from 100% to 99.9%…

Continuous Delivery (Deployment!)

We combined the strangler pattern with the agreement that we would do Continuous Delivery for all the new components. Now, different people have different ideas about what is meant by ‘Continuous Delivery’. For the purposes of this story, I will simplify it to a single statement:

Every push goes to production

This single agreement makes all the difference. It helps keep the focus of the team, and of each developer, on quality. There is no delaying testing. If what you write right now is going to be in production in a few hours, you can’t postpone testing until tomorrow. You can’t leave testing to someone else. And if you know that it is you and only you (with your pair, of course) who is responsible for not breaking the website, you’ll not just keep that test coverage at 100%, you’ll make very sure that those tests are actually useful. And you’ll keep looking for any types of tests that are still missing.
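
One concrete example of a test type that this mindset tends to surface is a post-deploy smoke test: if every push goes to production, something should automatically check production right after each deploy. A minimal sketch, where the URL, the /search path and the content check are made-up placeholders rather than this project’s actual checks:

    // Minimal post-deploy smoke test: fail the pipeline if the site is down
    // or the page no longer renders its main content.
    // SMOKE_URL, the /search path and the heading check are illustrative.
    const BASE_URL = process.env.SMOKE_URL ?? "https://www.example.com";

    async function smokeTest(): Promise<void> {
      const res = await fetch(`${BASE_URL}/search`);
      if (res.status !== 200) {
        throw new Error(`Expected 200 from /search, got ${res.status}`);
      }
      const body = await res.text();
      if (!body.includes("<h1>")) {
        throw new Error("Search page rendered without its main heading");
      }
    }

    smokeTest()
      .then(() => console.log("Smoke test passed"))
      .catch((err) => {
        console.error("Smoke test failed:", err);
        process.exit(1); // a non-zero exit stops the delivery pipeline
      });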

Suddenly, those agile development skills get to have a very direct and meaningful role in your work.

Behaviour Driven Development

Looking back at the reasons most rebuilds fail, we see that most are clustered around the requirements. It is crucial that we know what we’ll be building, that it is what our customers need right now, and that while we’re building it we don’t make the same mistakes and end up with another undocumented system.

BDD is particularly well suited to address those problems. The process part of it means that you arrange for close collaboration with the customer. In the case of a rewrite, it’s tempting for a Product Owner to say: “Just make sure it works the same as the old system”. It’s very important to make sure that does not happen.

So we gather the Three Amigos, bringing together developers, testers and our product owner. We ensure that any and all new functionality is discussed between them. And once they agree on how something should work (regardless of whether it worked exactly like that before), they write down acceptance scenarios that unambiguously register the outcomes of those discussions.

/images/2015/10/three-amigos.png

Those acceptance scenarios are then automated by the development team, using a tool such as Cucumber (or Behat, or FitNesse). The reports are published as living documentation of how the system works, and guarantee that it still works as expected.
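
To make that concrete: an agreed scenario and its automation might look like the sketch below, using cucumber-js step definitions in TypeScript. The feature, the steps and the in-memory stand-ins are all made up for illustration; the team in this story worked on a PHP stack, where Behat plays the same role.

    // An illustrative acceptance scenario (Gherkin), as the three amigos might write it:
    //
    //   Feature: Product search
    //     Scenario: A visitor searches by product name
    //       Given the catalogue contains a product named "Red Kettle"
    //       When I search for "kettle"
    //       Then I should see "Red Kettle" in the search results
    //
    // Step definitions that automate it with cucumber-js:
    import { Given, When, Then, Before } from "@cucumber/cucumber";
    import { strict as assert } from "node:assert";

    // In-memory stand-ins for the real catalogue and search service.
    let catalogue: string[] = [];
    let results: string[] = [];

    Before(() => {
      catalogue = [];
      results = [];
    });

    Given("the catalogue contains a product named {string}", (name: string) => {
      catalogue.push(name);
    });

    When("I search for {string}", (term: string) => {
      // In the real project this step would call the new search service over HTTP.
      results = catalogue.filter((p) => p.toLowerCase().includes(term.toLowerCase()));
    });

    Then("I should see {string} in the search results", (name: string) => {
      assert.ok(results.includes(name), `Expected "${name}" in ${JSON.stringify(results)}`);
    });

The published report from running scenarios like these is that living documentation: it describes how the system behaves, and proves on every build that it still behaves that way.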

Then, we also agree that while specifying the functionality, we investigate, where necessary, whether that functionality is actually still needed, and whether there is still a good business case for it.

In the example project, we actually investigated the impact of a page’s features on the conversion rates of that page, and did not include some of those features in the new system because their impact was simply too small.

Piece of cake!

Now, seeing those three tactics we used, it might still seem like a very major undertaking to build up such a new system.

And it is. This team had to not just build new code; they had to build up a whole new, completely automated infrastructure to do Continuous Delivery well. That added things to the to-do list such as:

  • Controlling AWS using Ansible
  • Packaging and deploying Docker images for our new components
  • Automatically creating new delivery pipelines using Jenkins Job Builder
  • Using a proper build script for PHP projects (which had not been there)
  • Setting up unit, BDD and integration testing infrastructure for PHP and JavaScript code
  • Configuring SonarQube
  • Automatically creating smoke tests for new services
  • Actually having services, and figuring out how to do contract and integration testing for them

/images/2015/10/technologies.pngNew technologies applied

But this team managed to set all of that up, including the initial functionality needed, in two weeks’ time. There really is no excuse to put off adopting the techniques that will get you to Continuous Delivery.

Of course, it took another two weeks to get the page to a point where design and business were happy with it. By that time, the team had a very good start on using those new skills. And they have continued growing in skill level ever since at a very high rate.

And after the page had been served to 10% of users for a few weeks, they could demo to upper management not just the new functionality, but also the analytics numbers showing big improvements in conversion rates over the old version. They could then simply ask the CEO and marketing manager whether those numbers justified rolling out to more users, and the complete control they had allowed them to fearlessly do exactly that, with a push to production during the demo!

/images/2015/10/pipeline.pngShowing the release pipeline during the demo