Don’t Refactor. Rebuild. Kinda.

I recently had the chance to speak at the wonderful Lean Agile Scotland conference. The conference covered a very wide range of subjects, discussed at an amazingly high level: complexity theory, lean thinking, agile methods, and even technical practices!

I followed a great presentation by Steve Smith on how the popularity of feature branching strategies makes Continuous Integration difficult, if not impossible. I couldn’t have asked for a better lead-in for my own talk.

Which is about giving up and starting over. Kinda.

Learning environments

Why? Because, when you really get down to it, refactoring an old piece of junk, sorry, legacy code, is bloody difficult!

Sure, if you give me a few experienced XP guys, or ‘software craftsmen’, and let us at it, we’ll get it done. But I don’t usually have that luxury. And most organisations don’t.

When you have a team that is new to agile development practices such as TDD, refactoring and clean code, learning that stuff in the context of a big ball of mud is really hard.

You see, when people start to learn about something like TDD, they do some exercises, read a book, maybe even attend a training course. They’ll see this kind of code:

Example code from Kent Beck’s book: “Test Driven Development: By Example”

Then they get back to work, and are on their own again, and they’re confronted with something like this:

Code Sample from my post “Code Cleaning: A refactoring example in 50 easy steps”

And then, when they say that TDD doesn’t work, or that agile won’t work in their ‘real world’ situation, we say they didn’t try hard enough. But in these circumstances it is very hard to succeed.

So how can we deal with situations like this? As I mentioned above, an influx of experienced developers who know how to get a legacy system under control is wonderful, but not very likely. Developers who haven’t done that sort of thing before will really need time to gain the necessary skills, and that needs to happen in a more controlled, or controllable, environment. Like a new codebase, started from scratch.

Easy now, I understand your reluctance! Throwing away everything you’ve built and starting over is pretty much the reverse of the advice we normally give.

Let me explain using an example.

A story from the Real World

This is the story of a team that I worked with recently. And this team had the full stack of problems that you encounter in different forms in many organisations.

  • A history of takeovers and unrest, reflected in a big, messy codebase merging functionality from the different companies that had been acquired
  • A history of difficult management, with overly controlling architects/tech-leads and indifferent line management
  • A team that hadn’t much experience in our beloved technical practices, and had become very cautious after being pulled in different directions by a sequence of those controlling tech-leads
  • Business stakeholders that hadn’t had new functionality delivered in 18 months, because of the migration/merge work coming out of those takeovers

The new CTO knew that he’d have to take a different direction. He brought in some people to help, and got me and the excellent Silvester van der Bijl from FourScouts to work with his teams.

But though this team was very open to learning new things, and actually eager to start refactoring their system, it turned out that there was no realistic way for them to do so in time to reach the speed they needed for the business to start trusting them again.

We did try. We did coding katas. Paired up on tougher changes. Started a stop-the-line process for any bugs found. Sent part of the team to Chet Hendrickson’s excellent Agile Development Skills training. Did proper ‘boy scout rule’ refactoring in all our work. Identified and performed larger, planned refactorings where needed.

And there was progress. Unit test coverage went up by nearly 25%! From 1.7% to 2.1%. Unit test run time went down from two hours to 10 minutes. But though improvements in-the-small were starting to happen, the attempts to attack larger design problems in the system were slow and very error-prone.

This all came to a head soon enough, when the delivery of some theoretically simple functionality was delayed. Higher management started contemplating letting the team go completely and moving to a so-called off-the-shelf product. This would have been bad for the team, but would also have meant the company letting go of a considerable competitive advantage.

More drastic action was needed, and the talk turned to doing a rebuild.

Rebuilds

Now, rebuilds seem to be the most popular type of software project on the planet! Pretty much anyone in software development has worked on a rebuild project. I once worked for a company where we built the same product five times in four years!

Developers seem to like rebuilds, because it allows them to lay the blame with previous developers that allowed things to get so messy.

Management seems to like rebuilds because it allows them to lay the blame with previous management that allowed things to get so messy.

The business likes rebuilds because it allows them to lay the blame with the development team and managers for not improving business results.

Clean slates all over. Somebody else can be responsible. 

Still, though, if we look at rebuilds, they are not often very successful. Or recommended. Experienced developers will pretty much all say the same thing: don’t do it!

We know all kinds of reasons why rebuilds fail:

  • The original requirements are simply not known. They may have been written down somewhere, or not. They certainly weren’t unambiguous enough to just take and re-implement. Besides, no one knows which version was actually built. Furthermore:
  • The requirements ‘evolved’. Sometimes because we changed our minds. Oftentimes because situations in production simply were not covered. And those certainly were not all written down.
  • We have new requirements! We’re doing a rebuild because we want to (be able to) change this system. There’s a reason we want that change: we don’t like what it does now.
  • We no longer need it. We’ll be spending a lot of time rebuilding functionality that we don’t need anymore. That workaround for that one customer? He’s no longer a customer, or in the meantime has updated his own systems so the workaround is no longer necessary. That 20% of features that no one uses? Let’s double down on that wasted investment!
  • Requirements freeze or split focus? If we’re busy building what we had before, which we don’t want, we won’t be getting the new things we want. We have to wait until the rebuild is done. Or we do try to change the old system, which is slow, and in doing so delay the rebuild and add new requirements to it.
  • Did we learn? The new system won’t be any better than the old one, because we won’t have changed our process and culture, so we’ll end up with the same… limited quality.

Yes, there’s a lot that can go wrong. This team, though, could not go on the way they had been. It would simply take too long before enough ‘technical debt’ was paid off to get this system in a competitive state again. Besides, I’ve done a few rebuilds, and I figured I might know a way to make this work.

We decided we could avoid those pitfalls. We’d just have to stick to a few rules:

  • Deliver value from day one
  • Don’t rebuild the same mess
  • Don’t rebuild using the same process
  • Don’t rebuild the same functionality

Easy!

Well, possible. Maybe. It would take some doing.

The Agile way to rebuild

To achieve those goals, we combined tactics in three areas: architectural, business value and process. We knew that those tactics could reinforce each other, like the XP practices do.

  • Architectural: The Strangler Pattern
  • Business Value: Behaviour Driven Development (BDD)
  • Process: Continuous Delivery

The Strangler Pattern

The architectural tactic we deployed was the strangler pattern, a name coined by Martin Fowler (but then, what wasn’t named by him?) for an approach that allows one to build a new system around an existing one, and then incrementally migrate functionality from the old system to the new one.

It’s easiest to explain with an example. Say we have an existing, legacy system. In our example (and in the example project) we’ll take the simplest case and discuss a website.

Strangler pattern 1 – legacy situation

We insert a new system between the client and the legacy system, and let all requests go through this new system. Initially, the new system doesn’t actually have to do anything, just pass things through. A proxy.

Strangler pattern 2 – introduce proxy

Then we decide to replace a part of the website’s functionality. In this case we simply check whether the request is for the page we want to replace. If it is, we handle it in our new system, which perhaps uses a new back-end service as well, so we don’t accidentally put business logic in a front-end.

Strangler pattern 3 – introduce new functionality in wrapping component
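
To make the routing step concrete, here is a minimal sketch of the decision the proxy makes, written in Python purely for illustration (the team’s real set-up was Apache configuration, which isn’t reproduced here). The class name `StranglerRouter`, the backend labels and the exact-path matching are assumptions. With no migrated paths it behaves as the pure pass-through proxy of step 2; migrating a path sends that page to the new system, as in step 3.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit

LEGACY = "legacy"  # hypothetical label for the old website
NEW = "new"        # hypothetical label for the new component


@dataclass
class StranglerRouter:
    """Decides which system serves a request (a sketch, not a real proxy).

    With no migrated paths, every request passes straight through
    to the legacy system.
    """

    migrated_paths: set[str] = field(default_factory=set)

    def migrate(self, path: str) -> None:
        # Mark a page as now being handled by the new system.
        self.migrated_paths.add(path)

    def backend_for(self, url: str) -> str:
        # Ignore the query string, so "/search?q=shoes" matches "/search".
        path = urlsplit(url).path
        return NEW if path in self.migrated_paths else LEGACY


router = StranglerRouter()
print(router.backend_for("/search"))          # legacy: pure pass-through
router.migrate("/search")
print(router.backend_for("/search?q=shoes"))  # new: the migrated page
print(router.backend_for("/product/42"))      # legacy: everything else
```

Each page migrated this way shrinks what the legacy system still has to serve, which is the whole idea: the old system is strangled one route at a time rather than replaced in one go.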

And, as we’re going to use Continuous Delivery, we need the control supplied by Feature Toggles (see, another thing Martin Fowler named!) to dynamically decide whether to show the old or the new version of that page. This allows us to push the new page to production while making it available only to internal users, who can then decide whether the rest of the world is ready for it. It also allows us to A/B test the new page against the old one to ensure we don’t see a regression in, for instance, conversion rates.

Strangler pattern 4 – add a feature toggle
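
The toggle can be sketched in the same spirit. This is a hypothetical Python version, not the Apache rules the team used: internal users always get the new page, and once the toggle is opened up, a configurable percentage of other users is assigned to it for A/B testing. Hashing a user id (an assumption; a cookie or session id would work too) keeps each visitor in the same group across requests, which matters when you compare conversion rates.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FeatureToggle:
    """A minimal toggle with internal preview and percentage rollout."""

    name: str
    internal_only: bool = True  # stage 1: only internal users see the new page
    rollout_percent: int = 0    # stage 2: share of external users who see it

    def bucket(self, user_id: str) -> int:
        # Stable bucket 0-99: the same user always lands in the same bucket.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def is_enabled(self, user_id: str, is_internal: bool) -> bool:
        if is_internal:
            return True
        if self.internal_only:
            return False
        return self.bucket(user_id) < self.rollout_percent


def page_version(toggle: FeatureToggle, user_id: str, is_internal: bool) -> str:
    # The proxy would use this to pick the old or the new page.
    return "new" if toggle.is_enabled(user_id, is_internal) else "old"


toggle = FeatureToggle("new-search-page")  # hypothetical toggle name
print(page_version(toggle, "colleague-1", is_internal=True))  # new
print(page_version(toggle, "visitor-1", is_internal=False))   # old
toggle.internal_only = False
toggle.rollout_percent = 10  # open the new page to roughly 10% of visitors
```

Raising `rollout_percent` to 100, and then deleting the toggle together with the old page, completes the migration of that page.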

Which is all there is to this. In fact, the very first instance of this set-up consisted of only about 10 lines of Apache configuration, including the basic toggle.

This pattern makes it possible to deliver value from the start. And for the team to learn those fancy agile development skills in a context that would grow along with their experience: ‘green field’ components whose complexity starts out low but grows over time.

There’s also a more psychological component to this. If you’re the developer that lets the unit-test coverage slip from 2.1% to 2.0%, nobody is going to get very excited about that. If you’re the one that lets it slip from 100% to 99.9%…

Continuous Delivery (Deployment!)

We combined the strangler pattern with the agreement that we would do Continuous Delivery for all the new components. Now, different people have different ideas on what is meant by ‘Continuous Delivery’. For the purposes of this story, I will simplify it to a single statement that makes all the difference:

Every push goes to production

This agreement helps keep the focus of the team, and of each developer, on quality. There is no delaying testing. If what you write right now is going to be in production in a few hours, you can’t postpone testing until tomorrow. You can’t leave testing to someone else. And if you know that it is you and only you (with your pair, of course) who is responsible for not breaking the website, you won’t just keep that test coverage at 100%, you’ll make very sure that those tests are actually useful. And you’ll keep checking whether any type of test is missing.
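
Stripped to its essence, “every push goes to production” means the only thing standing between a push and a release is the automated checks. Here is a minimal Python sketch of that gate, assuming hypothetical stage names and a `deploy` callback (the team’s actual pipelines ran in Jenkins and aren’t reproduced here):

```python
from typing import Callable

Check = Callable[[], bool]


def run_pipeline(commit: str, stages: list[tuple[str, Check]],
                 deploy: Callable[[str], None]) -> bool:
    """Run every check for a push; if all pass, deploy it to production.

    There is no manual approval step: a green build is a release.
    The first failing stage stops the pipeline and nothing is deployed.
    """
    for name, check in stages:
        if not check():
            print(f"{commit}: stage '{name}' failed, production untouched")
            return False
    deploy(commit)
    return True


released: list[str] = []
stages = [
    ("unit tests", lambda: True),
    ("bdd scenarios", lambda: True),
    ("integration tests", lambda: True),
]
run_pipeline("a1b2c3", stages, released.append)
print(released)  # ['a1b2c3']
```

The interesting part is the branch that isn’t there: no path lets a passing build wait for someone to decide to release it, so testing can only happen before the push.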

Suddenly, those agile development skills get to have a very direct and meaningful role in your work.

Behaviour Driven Development

Looking back at the reasons most rebuilds fail, we see that most are clustered around the requirements. It is crucial that we know what we’ll be building, that it is what our customers need right now, and that while we’re building it we don’t make the same mistakes and end up with another undocumented system.

BDD is particularly well suited to address those problems. The process part of it means that you arrange for close collaboration with the customer. In the case of a rewrite, it’s tempting for a Product Owner to say: “Just make sure it works the same as the old system”. It’s very important to make sure that does not happen.

So we gather the Three Amigos, bringing together developers, testers and our product owner. We ensure that any and all new functionality is discussed between them. Once they agree on how something should work (independently of whether it worked exactly like that before), they write down acceptance scenarios that unambiguously record the outcomes of those discussions.

Those acceptance scenarios are then automated by the development team, using a tool such as Cucumber (or Behat, or FitNesse). The reports from those runs are published as living documentation of how the system works, and as a guarantee that it still works as expected.
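
Tools like Cucumber and Behat match each line of a plain-language scenario to a step definition written by the developers. The toy Python runner below only illustrates that mechanism; it is not how any of those tools is implemented, and the search feature, step wording and `search` function are hypothetical stand-ins for the real system under test.

```python
import re
from typing import Callable

StepFn = Callable[..., None]
STEPS: list[tuple[re.Pattern, StepFn]] = []


def step(pattern: str) -> Callable[[StepFn], StepFn]:
    """Register a step definition for scenario lines matching the regex."""
    def register(fn: StepFn) -> StepFn:
        STEPS.append((re.compile(pattern), fn))
        return fn
    return register


def run_scenario(text: str) -> None:
    """Execute Given/When/Then lines in order, sharing one context dict."""
    context: dict = {}
    for line in text.strip().splitlines():
        keyword, _, rest = line.strip().partition(" ")
        if keyword not in ("Given", "When", "Then", "And"):
            continue  # e.g. the "Scenario:" title line
        for pattern, fn in STEPS:
            match = pattern.fullmatch(rest)
            if match:
                fn(context, *match.groups())
                break
        else:
            raise LookupError(f"No step definition for: {line.strip()}")


def search(catalogue: list[str], term: str) -> list[str]:
    # Hypothetical stand-in for the real application behaviour.
    return [item for item in catalogue if term in item]


@step(r'the catalogue contains "(.+)"')
def given_item(ctx: dict, name: str) -> None:
    ctx.setdefault("catalogue", []).append(name)


@step(r'a visitor searches for "(.+)"')
def when_search(ctx: dict, term: str) -> None:
    ctx["results"] = search(ctx.get("catalogue", []), term)


@step(r'they see "(.+)" in the results')
def then_sees(ctx: dict, name: str) -> None:
    assert name in ctx["results"], f"{name!r} not in {ctx['results']}"


SCENARIO = """
Scenario: Searching on part of a product name
  Given the catalogue contains "red shoes"
  And the catalogue contains "blue hat"
  When a visitor searches for "shoes"
  Then they see "red shoes" in the results
"""

run_scenario(SCENARIO)  # raises if the behaviour no longer matches
```

The scenario text stays readable for the product owner and testers, while the step definitions are where developers connect it to the code. In a real suite those steps would drive the actual website or service rather than an in-memory list.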

We also agree that, while specifying the functionality, we investigate where necessary whether that functionality is actually still needed, or whether there is a good business case for it.

In the example project, we actually investigated the impact of individual features of a web page on that page’s conversion rate, and left some of them out of the new system because their impact was simply too small.
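
The post doesn’t say how that impact was measured. One common approach, shown here as an illustrative Python sketch, is a two-proportion z-test comparing the conversion rate of the page with and without a feature (or, equally, the old page against the new one). The visitor counts below are made up, and where “too small” begins would be a business decision, not a statistical one.

```python
import math
from dataclasses import dataclass


@dataclass
class Variant:
    visitors: int
    conversions: int

    @property
    def rate(self) -> float:
        return self.conversions / self.visitors


def compare_conversion(control: Variant, treatment: Variant) -> tuple[float, float]:
    """Return (absolute lift, two-sided p-value) from a two-proportion z-test."""
    lift = treatment.rate - control.rate
    pooled = (control.conversions + treatment.conversions) / (
        control.visitors + treatment.visitors
    )
    se = math.sqrt(pooled * (1 - pooled) * (1 / control.visitors + 1 / treatment.visitors))
    if se == 0:
        return lift, 1.0  # no variation at all: nothing to conclude
    z = lift / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return lift, p_value


# Made-up numbers: the page without a feature vs. the page with it.
without_feature = Variant(visitors=10_000, conversions=300)
with_feature = Variant(visitors=10_000, conversions=304)
lift, p = compare_conversion(without_feature, with_feature)
print(f"lift={lift:.4f}, p={p:.2f}")
```

With these invented numbers the feature adds only about 0.04 percentage points, with a p-value well above 0.05: exactly the kind of feature that might not be worth rebuilding.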

Piece of cake!

Looking at those three tactics, it might still seem like a very major undertaking to build up such a new system.

And it is. This team didn’t just have to build new code. They had to build up a whole new infrastructure, and to do continuous delivery well, it had to be completely automated. That added things to the to-do list such as controlling AWS using Ansible, packaging and deploying Docker images for our new components, automatically creating new delivery pipelines using Jenkins Job Builder, introducing a proper build script for PHP projects (there hadn’t been one), setting up unit, BDD and integration testing infrastructure for PHP and JavaScript code, configuring SonarQube, automatically creating smoke tests for new services, actually having services, and figuring out how to do contract and integration testing for them.

New technologies applied
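
“Automatically creating smoke tests for new services” can be as simple as generating one HTTP check per service. Here is a Python sketch using only the standard library; the `/health` path, the service names and the URLs are hypothetical, and the post doesn’t say what the team’s generated smoke tests actually checked.

```python
import urllib.request
from typing import Callable


def make_smoke_test(base_url: str, path: str = "/health",
                    timeout: float = 5.0) -> Callable[[], bool]:
    """Build a smoke test that passes if base_url + path answers HTTP 200."""
    url = base_url.rstrip("/") + path

    def smoke_test() -> bool:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.status == 200
        except OSError:
            # Connection refused, timeouts and HTTP errors (4xx/5xx) all
            # surface as OSError subclasses: the service is not healthy.
            return False

    return smoke_test


# One generated smoke test per (hypothetical) service, e.g. run after each deploy.
SERVICES = {
    "search": "http://search.internal.example",
    "basket": "http://basket.internal.example",
}
SMOKE_TESTS = {name: make_smoke_test(url) for name, url in SERVICES.items()}
```

Because each check is generated from the service list, adding a service to that list is all it takes to get a smoke test for it.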

But this team managed to set all that up, including the initial functionality needed, in two weeks’ time. There really is no excuse for waiting to adopt the techniques that will get you to continuous delivery.

Of course, it took another two weeks to get the page to a point where design and business were happy with it. By that time, the team had made a very good start on using those new skills, and their skill level has kept growing at a very high rate ever since.

And after the page had been served to 10% of users for a few weeks, they could demo to upper management not just the new functionality, but also analytics numbers showing big improvements in conversion rate over the old version. Then they could simply ask the CEO and marketing manager whether those numbers justified rolling out to more users, and the complete control they had allowed them to do that fearlessly, with a push to production during the demo!

Showing the release pipeline during the demo

7 thoughts on “Don’t Refactor. Rebuild. Kinda.”

  1. Don’t use feature toggles. They completely negate all the advantages of having everything go to master quickly, because now you can change “configuration” and be running different code in production from what you tested. Instead you should improve your release process to the point where you don’t mind making a release to change an A/B test or similar piece of “configuration”.

    Rewriting works *if* you’re confident that your “three amigos” will do a better job of capturing the business requirements than the existing code. IME that’s rarely the case though. The existing code tends to have important edge cases that users don’t remember, or maybe don’t even realise are special cases. Maybe you’re lucky enough to have an unusually effective product owner? I find you first have to read the code and understand the cases, and *then* you can ask informed questions that will get the users to nail down the real requirements. And at that point you understand the existing code well enough that refactoring is cheaper than rewriting.

  2. It’s indeed very important that your tests cover both the toggle on and off states. If you implement your acceptance scenarios on a low enough level this often happens automatically, but any integration tests need to test with different toggle states. It’s in the nature of these things to increase the code and test size a little before decreasing again when the toggle is removed.

    Apart from A/B type situations, though, feature toggles allow you to decouple the release of code from the release of features to your users. This is a crucial capability for any team that wants to do continuous delivery.

    And I agree that understanding the code is important in any rewrite. That’s why the Three Amigos include developers! There’s a good chance you won’t catch all cases, though. And a good chance that even if you do, you won’t know why they’re there. Or if they’re still needed. That requires work, and business level decisions.

    I would recommend refactoring above a rewrite in most cases, btw. But once you’ve refactored the code, you’ll still have cases in there that you may-or-may-not need. Simplifying beyond the code then means talking to business stakeholders, POs, etc. And even going into usage statistics. That’s a barrier that most teams/developers don’t readily cross, and crossing it can often save a lot of work. The BDD process is simply a way to raise those questions as early as possible.

  3. > feature toggles allow you to decouple the release of code from the release of features to your users. This is a crucial capability for any team that wants to do continuous delivery.

    I don’t see how. If you have the ability to do continuous releasing then you can do continuous delivery without feature toggles.

    > I agree that understanding the code is important in any rewrite. That’s why the Three Amigos include developers! There’s a good chance you won’t catch all cases, though. And a good chance that even if you do, you won’t know why they’re there. Or if they’re still needed. That requires work, and business level decisions.

    Agreed but I don’t think the Three Amigos approach lends itself to understanding the code. Understanding the code is a continuous process, and a single discussion doesn’t help all that much. What you really need is a line of communication where you can ask some questions of a business representative, spend ten minutes reading the code in light of the answers, and then come back with more questions. Involving test representatives in this process largely just makes it less convenient all round. (I’m happy to believe that the company in question misunderstood the concept, because honestly I never saw much value from it).

    > I would recommend refactoring above a rewrite in most cases, btw.

    Then, uh, why the post title? It sounds like you’re saying the opposite.

    > But once you’ve refactored the code, you’ll still have cases in there that you may-or-may-not need. Simplifying beyond the code then means talking to business stakeholders, POs, etc. And even going into usage statistics.

    Sure, there’s certainly value in pruning functionality that’s not pulling its weight. But I think that fits into the normal refactoring workflow fine. I certainly wouldn’t describe it as rebuilding.

    • > I don’t see how. If you have the ability to do continuous releasing then you can do continuous delivery without feature toggles.

      Most often, the work is done in smaller steps than our business people want it released, so you’ll have features that are not supposed to be in production yet. But we don’t want to keep long-living feature branches alive, with the merge/integration problems that causes.
      A good example would be the retail company that needed changes in behaviour to coincide with changes in content, perhaps at the time the Christmas sale starts. That feature toggle was switched by a publication from the content management system.

      > On the Three Amigos

      I think we’re pretty much in agreement here. It’s not a single meeting (even if often interpreted that way), it’s the idea that those roles with their respective knowledge cooperate in surfacing the requirements. Testers, good ones, very often know things about the behaviour of the system that programmers didn’t see even after studying the code.

      >> I would recommend refactoring above a rewrite in most cases, btw.

      >Then, uh, why the post title? It sounds like you’re saying the opposite.

      I try to sketch the context, but I think that you should go for a rewrite (or a partial one to start with) if you have a lack of time or skills to tackle a refactoring. But as you say, both gain much from pruning functionality, not just code.
      Rewrite has a bad name, though, so I thought I’d amplify the option a bit in the title.
