Adventures in Rework

I came across this post by Martin Fowler, on the Strangler Application pattern, and its accompanying paper. This brought back memories of some of my own adventures in rework, some fond, others not so much. In all cases, though, I think they were very valuable lessons on what to do and what not to do when reworking existing systems. No reason not to share those lessons, especially as some of them were rather painful and expensive to learn. The paper linked above is a success story; mine are more the kind you tell your children about to keep them away from dysfunctional projects. This is not because these projects were done by horrible or incompetent people, or completely ineffective organisations. They weren’t. But sometimes a few bad decisions can stack up, and result in the kind of war stories developers share after a few beers.

I’ll wait here until you’re back from the fridge.

When talking about rework, note that I’m not calling the systems being replaced ‘legacy systems’, as you’ll see that in a few of these cases the ‘legacy’ system wasn’t even finished before the rework began.

‘Rework’ is simply a name for rebuilding existing functionality. It is distinct from Refactoring in that it is usually not changing existing code, but replacing it. Another distinction is one of scope. Refactoring is characterised by small steps of improvement, while Rework is about replacing a large part of (or an entire) system. Or in simpler terms:

Rework = bad, Refactoring = good

One problem is that very often people talk about refactoring when they mean rework, giving the good practice of refactoring a bad name. When a developer or architect says ‘We need to do some refactoring on the system, we think it will take 4 weeks of work for the team’, what they are talking about is rework. Not surprisingly, many managers now treat refactoring as a dirty word…

When do you do rework?

Rework is not always bad. There can be good reasons to invest in re-implementation. Usually, maintainability and extensibility are part of those reasons, at least on the technical side. This is the type of rework that is similar to refactoring, in that there is no functional change. This means that this is work that does not give any direct business value. From the point of view of the customer, or the company, this means that these kinds of changes are ‘cost’ only.

Rework can also be triggered by changes in requirements. These might be functional requirements, where new functionality can’t easily be fitted in the current system. Usually, though, these are non-functionals. Such as: we need better scalability, but the platform we’re using doesn’t support that. Or: We need to provide the same functionality, but now as a desktop application instead of a web-application (or vice versa).

Rework is also sometimes triggered by policy, such as “we’re moving all our development over to…”, followed by Java, Scala, Ruby, ‘The Cloud’, or whatever you’re feeling enthusiastic about at the moment. This is not a tremendously good reason, but can be a valid one if you see it in the context of, for example: “We’re moving all our development over to Java, since the current COBOL systems are getting to be difficult to maintain, simply because we’re running out of COBOL programmers.”

Adventure number one

This was not the first piece of rework I was involved with, but it is a good example of the importance of always continuing to deliver value, and keeping up trust between different parties in an organisation. No names, to protect the innocent. And even though I certainly have opinions on which choices were good and which were not, this is not about assigning blame. The whole thing is, as always, the result of the complete system in which it happens. The only way to avoid such outcomes is complete transparency, and the trust that hopefully results from it.

A project I worked on was an authorisation service for digital content distribution. It could register access rights based on single-sale or subscription periods. This service in the end was completely reworked twice, with another go in the planning. Let’s see what happened, and what we can learn from that.

The service had originally been written in PHP, but was re-created from scratch in Java. I don’t know all the specifics on why this was done, but expected better performance was at least part of it, and there was also a company-wide goal of moving to Java for all server-side work. In other words: the non-functional requirements and policy reasons described above.

This project was a completely redone system. Everything, including database structures, was created new from scratch. There was a big data-migration, and extensive testing to ensure that the customers wouldn’t suddenly find themselves with missing contents, or subscriptions cut short by years, months or minutes.

Don’t try to change everything at once

A change like that is very difficult to pull off. It’s even more difficult if the original system has a large amount of very specific business-logic in code to handle a myriad of special cases. Moreover, since the reasons for doing rework were completely internally directed, the business side of the company didn’t have much reason to be involved in the project, or much understanding of the level of resources that were being expended on it. It did turn out, though, that many of the specific cases were unknown to the business. Quickly growing companies, and all that…

Anyway, the project was late. Very late. It was already about 9 months late when I first got involved with it. At that point, it was technically a sound system, and was being extensively tested. Too extensively, in a way. You see, the way the original system had so many special cases hard-coded was in direct conflict with the requirement for the new system to be consistent and data-driven. There was no way to make the new implementation data-driven and still get 100% the same results as the old one.

Now, this should not be a problem, as long as the business-impact is clear, and the business side of the organisation is closely enough involved to make clear decisions early on about which deviations from the old system are acceptable and which are not. A large part of the delay was simply due to that discussion not taking place until very late in the process.

As with all software development, rework needs the customer closely involved

In the end, we did stop trying to work to 100% compliance, and got to sensible agreements about edge-cases. Most of these cases were simply that a certain subset of customers would have a subscription end a few days or weeks later, with negligible business impact. They still caused big delays in the project delivery!

What problems to fix is a business decision

Unfortunately, though the system went live eventually, this was with a year’s delay. It was also too late. On the sales and marketing side, a need had grown to not only register subscriptions for a certain time-period, but also to be able to bill them periodically (monthly payments, for instance). Because the old system hadn’t been able to do this, neither could the new one. And because the new system had been designed to work very similarly to the old one, this was not a very straightforward piece of functionality to add.

If you take a long time to make a copy of an existing system, by the time you’re done they’ll want a completely different system

Of course, it was also not a completely impossible thing to add, but we estimated at the time that it would take about three months of work. And that would be *after* the first release of the system, which hadn’t taken place yet. That would bring us to somewhere around October of that year, while the business realities dictated that having this type of new product would have the most impact if released to the public early September.

So what happens to the trust between the development team and the customer by a late release of something that doesn’t give any new functionality to the customer? And if the customer, after not getting any new functionality for a full year, then has a need and hears that he’ll have to wait another 6 months before he can get it? He tells the development team: “You know what? I’ll get someone else to do it!”

Frustrate your customer to your peril!

So the marketing department gets involved in development project management. And they put out a tender to get some offers from different parties. And they pick the cheapest option. And it’s going to be implemented by an external party. With added outsourcing to India! Such a complex project set-up that it *must* work. Meanwhile, the internal development organisation is still trying to complete the original project, and is staying well out of this follow-up thing.

Falling out within the organisation means falling over of the organisation

Now this new team is working on this new service which is going to be about authorisation and subscriptions. They talk to the business, and start designing a system based on that (this was an old-school waterfall project). Their requirements focus a lot on billing, of course, since that is the major new functionality relative to the existing situation. But they also need to have something to bill, and that means that the system also supports subscriptions without a hard end-date, which are renewed with every new payment. The existing system doesn’t support that, which is a large part of that three-month estimate we were talking about.

Now a discussion starts. Some are inclined to add this functionality to the old system, and make the new project about billing and payment tracking. But that would mean waiting for changes in the existing system. So others are pushing to make the new system track the subscription periods. But then we’d have two separate systems, and we’d need to check in both to see if someone is allowed access to a specific product. Worse, since you’d have to be able to switch from pre-paid (existing) to scheduled payments, there would be logic overlapping those two.

Architecture is not politics. Quick! Someone tell Conway!

All valid discussions on the architecture of this change. Somehow, there was an intermediate stage where both existing and new system would keep track of everything, and all data would magically be kept in sync between those two systems, even though they had wildly different domain models about subscriptions. That would have made maintenance… difficult. So the decision was to move everything into the new system, and have the old system only there as a stable interface toward the outside world (i.e. a façade talking to the new system through web-services, instead of to its own database).
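To make the façade idea concrete, here is a minimal sketch of what such an arrangement could look like. It is only an illustration under my own assumptions: the class and method names (`SubscriptionBackend`, `active_until`, `LegacyAuthorisationFacade`, `has_access`) are hypothetical and not taken from the actual systems, and an in-memory stand-in replaces the web-service call so that the example runs on its own.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional, Protocol


class SubscriptionBackend(Protocol):
    """Hypothetical interface of the new system (a web-service in the story)."""

    def active_until(self, customer_id: str, product_id: str) -> Optional[date]: ...


@dataclass
class InMemoryBackend:
    """Stand-in for the new system, so the sketch runs without a network."""

    periods: dict[tuple[str, str], date]

    def active_until(self, customer_id: str, product_id: str) -> Optional[date]:
        return self.periods.get((customer_id, product_id))


class LegacyAuthorisationFacade:
    """Keeps the old public call, but answers it from the new system."""

    def __init__(self, backend: SubscriptionBackend) -> None:
        self._backend = backend

    def has_access(self, customer_id: str, product_id: str, today: date) -> bool:
        # No database of its own: the facade only translates the old question
        # ("may this customer see this product?") into the new domain model.
        end = self._backend.active_until(customer_id, product_id)
        return end is not None and today <= end
```

The point of the design is that existing clients keep calling `has_access` as before, while the data and the subscription logic live in one place only.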

So here’s a nice example of where *technically* there isn’t much need for rework. There are changes needed, but those are fairly easy to incorporate into an existing architecture. We’re in a hurry, but the company won’t fall over if we are late (even if we do have to delay introducing new payment methods and product types). But the eroded trust levels within the company tipped the preference toward starting from scratch, instead of continuing from a working state.

Trust is everything

Now for the observant among you: Yes, some discussion was had about how we had just proven that such a rework project was very complex, and last time took a lot of time to get right. But the estimates of the external party indicated that the project was feasible. One of the reasons they thought this was that they’d talked mostly to the sales side of the organisation. This is fine, but since they didn’t talk much to the development side, they really had no way of knowing about the existing product base, and its complications and special cases. Rework *should* be easier, but only if you are in a position to learn from the initial work!

If you do rework, try to do it with the involvement of the people who did the original work

It won’t come as a big surprise that this project did not deliver by early September as was originally intended. In fact, it hadn’t delivered by September of the following year. In that time the original external party had been extended and/or replaced (I never quite got that clear) by a whole boatload of other outsourcing companies and consultants. The cost of the project skyrocketed. Data migration was, of course, again very painful (but this time the edge-case decisions were made much earlier!).

A whole new section of problems came from a poorly understood domain, and no access during development to all of the (internal) clients that were using the service in different ways. This meant that when integration testing started, a number of very low-level decisions on the domain of the new application had to be reconsidered. Some of those were changed, others resulted in work-arounds in all the different clients, since the issues were making a late project later.

Testing should be the *first* thing you start working on in any rework project. Or any project at all.

Meanwhile, my team was still working on the existing (now happily released) system: maintenance, new features, and the new version that ran against the new system’s web-services. And they were getting worried. Because they could see an army of consultants packing up and leaving them with the job of keeping the new system running. And when it became clear that the intention was to do a big-bang release, without any way to do a roll-back, we did intervene. One of the developers created a solution to pass all incoming requests to both the old and the new systems, and do extensive logging on the results, including some automated comparisons. Very neat, as it allowed us to keep both systems in sync for a period of time, and see if we ran into any problems.
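For those who haven’t seen this trick before, the core of it fits in a few lines. What follows is not the code that developer wrote, just a minimal sketch of the idea under my own assumptions: requests are plain dictionaries, both systems are represented by ordinary functions, and the names (`shadow_call`, `mismatches`) are made up for the illustration.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("shadow")

Handler = Callable[[dict], Any]


def shadow_call(request: dict, primary: Handler, shadow: Handler,
                mismatches: list) -> Any:
    """Answer from `primary`; run `shadow` on the same request and compare."""
    expected = primary(request)
    try:
        actual = shadow(request)
    except Exception as exc:  # the shadow system must never break live traffic
        logger.warning("shadow failed for %r: %s", request, exc)
        mismatches.append((request, expected, repr(exc)))
        return expected
    if actual != expected:
        # Automated comparison: record every deviation for later analysis.
        logger.warning("mismatch for %r: old=%r new=%r", request, expected, actual)
        mismatches.append((request, expected, actual))
    return expected


# Usage with two toy implementations that disagree on one edge case.
old_system = lambda r: r["days_left"] > 0
new_system = lambda r: r["days_left"] >= 0
found: list = []
shadow_call({"days_left": 0}, old_system, new_system, found)
```

The caller always gets the old system’s answer, so customers never notice the experiment, while the list of mismatches tells you exactly which edge cases still need a decision.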

Always make a roll-back as painless as possible

This made it possible to have the new system in shadow-mode for a while, fix any remaining issues (which meant doing the data-migration another couple of times), and then do the switch painlessly by changing a config setting.
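The switch itself can be as simple as it sounds. A minimal sketch, assuming a dictionary-like configuration with a hypothetical `authoritative` key (our real setting was named differently, and lived in a config file rather than in memory):

```python
from typing import Any, Callable

Handler = Callable[[dict], Any]


def make_router(config: dict, old_system: Handler, new_system: Handler) -> Handler:
    """Return a handler that picks the authoritative system per request."""

    def route(request: dict) -> Any:
        # The setting is read on every call, so flipping it takes effect
        # immediately, and flipping it back is the roll-back.
        if config.get("authoritative", "old") == "new":
            return new_system(request)
        return old_system(request)

    return route


# Usage: start on the old system, then switch by changing one setting.
config = {"authoritative": "old"}
route = make_router(config, lambda r: "old answer", lambda r: "new answer")
before = route({})
config["authoritative"] = "new"
after = route({})
```

Because nothing else changes at the moment of the switch, going back to the old system is just as cheap as going forward.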

Make roll-back unnecessary by using shadow-running and roll-forward

So in the end we had a successful release. In fact, this whole project was considered by the company to be a great success. In the sense of any landing you can walk away from, this is of course true. For me, it was a valuable lesson, teaching among other things:

Haste makes waste (also known as limit WIP)
Don’t expect an external supplier to understand your domain, if you don’t really understand it yourself
Testing is the difference between a successful project and a failed one
When replacing existing code, test against live data
Trust is everything

I hope this description was somewhat useful, or at least entertaining in a schadenfreude kind-of way, for someone. It is always preferable to learn from someone else’s mistakes, if you can… I do have other stories of rework, which I’ll make a point of sharing in the future, if anyone is interested.