DevOps and Continuous Delivery

If you want to go fast and have high quality, communication has to be instant, and you need to automate everything. Structure the organisation to make this possible, learn to use the tools to do the automation.

There’s a lot going on about DevOps and Continuous Delivery. Great buzzwords, and actually great concepts. But not altogether new. But for many organisations they’re an introduction to agile concepts, and sometimes that means some of the background that people have when arriving at these things in the natural way, through Agile process improvement, is missing. So what are we talking about?

DevOps: The combination of software developers and infrastructure engineers in the same team with shared responsibility for the delivered software

Continuous Delivery: The practice of being able to deliver software to (production) environments in a completely automated way. With VM technology this includes the roll-out of the environments.

Both of these are simply logical extensions of Agile and Lean software development practices. DevOps is one particular instance of the Agile multi-functional team. Continuous Delivery is the result of Agile’s practice of automating any repeating process, and in particular enabled by automated tests and continuous integration. And both of those underlying practices are the result of optimizing your process to take any delays out of it, a common Lean practice.

In Practice

DevOps is an organisational construct. The responsibility for deployment is integrated in the multi-functional agile team in the same way that requirement analysis, testing and coding were already part of that. This means an extension to the necessary skills in the teams. System Administrator skills, but also a fairly new set of skills for controlling the infrastructure as if it were code with versioning, testing, and continuous integration.

Continuous Delivery is a term for the whole of the process that a DevOps team performs. A Continuous Delivery (CD) process consists of developing software, automating testing, automating deployment, automating infrastructure deployment, and linking those elements so that a pipeline is created that automatically moves developed software through the normal DTAP stages.

So both of these concepts have practices and tools attached, which we’ll discuss in short.

Practices and Tools

DevOps

Let’s start with DevOps. There are many standard practices aimed at integrating skills and improving communication in a team. Agile development teams have been doing this for a while now, using:

  • Co-located team
  • Whole team (all necessary skills are available in the team)
  • Pairing
  • Working in short iterations
  • Shared (code, but also product) ownership
  • (Acceptance) Test Driven Development

DevOps teams need to do the same, including the operations skill set into the team.

One question that often comes up is: “Does the entire team need to suddenly have this skill?”. The answer to that is, of course, “No”. But in the same way that Agile teams have made testing a whole team effort, so operations becomes a whole team effort. The people in the team with deep skills in this area will work together with some of the other team members in the execution of tasks. Those other will learn something about this work, and become able to handle at least the simpler items independently. The ops person can learn how to better structure his scripts, enabling re-use, from developers. Or how to test and monitor the product better from testers.

An important thing to notice is that these tools we use to work well together as a team are cross-enforcing. They enforce each-other’s effectiveness. That means that it’s much harder to learn to be effective as a team if you only adopt one or two of these.

Continuous Delivery

Continuous Delivery is all about decreasing the feedback cycle of software development. And feedback comes from different places. Mostly testing and user feedback. Testing happens at different levels (unit, service, integration, acceptance, …) and on different environments (dev, test, acceptance, production). The main focus for CD is to get the feedback for each of those to come as fast as possible.

To do that, we need to have our tests run at every code-change, on every environment, as reliable and quickly as possible. And to do that, we need to be able to completely control deployment of and to those environments, automatically, and for the full software stack.

And to be able to to that, there are a number of tools available. Some have been around for a long time, while others are relatively new. Most especially the tools that are able to control full (virtualised) environments are still relatively fresh. Some of the testing tooling is not exactly new, but seems still fairly unknown in the industry.

What do we use that for?

You’re already familiar with Continuous Integration, so you know about checking in code to version control, about unit tests, about branching strategies (basically: try not to), about CI servers.

If you have a well constructed CI solution, it will include building the code, running unit tests, creating a deployment package, and deploying to a test environment. The deployment package will be usable on different environments, with configuration provided separately. You might use tools such the cargo plugin for deployment to test (and further?), and keep a versioned history of all your deployment artefacts in a repository.

So what is added to that when we talk about Continuous Delivery? First of all, there’s the process of automated promotion of code to subsequent environments: the deployment pipeline.

pipeline

This involves deciding which tests to run at what stage (based on dependency on environment, and runtime) to optimize a short feedback loop with as detailed a detection of errors as possible. It also requires decisions on which part of the pipeline to run fully automatic, and where to still assume human intervention is necessary.

Another thing that we are newly interested in for the DevOps/CD situation is infrastructure as code. This has been enabled by the emergence of virtualisation, and has become manageable with tools such as Puppet and Chef. These tools make the definition of an environment into code, including hardware specs, OS, installed software, networking, and deployment of our own artefacts. That means that a test environment can be a completely controlled systems, whether it is run on a developer’s laptop, or on a hosted server environment. And that kind of control removes many common error situations from the software delivery equation.

The ‘Just Do It’ Approach To Change Management

Last Friday I gave a talk at the Dare 2013 conference in Antwerp. The talk was about the experiences I and my colleague Ciarán ÓNeíll have had in a recent project, in which we found that sometimes a very directive, Just Do It approach will actually be the best way to get people in an agile mindset.

Update: The full video of this talk as given on ‘Agile on the Beach’ is available on youtube.

This was surprising to us, to say the least, and so we’ve tried to find some theory supporting our experiences. And though theory is not the focus of this story, it helps if we set the scene by referencing two bits of theory that we think fits our experience.

Just Do It

A long time ago, in a country far away, there was this psychologist called William James, who wrote:

“If you want a quality, act as if you already have it.” – William James (1842-1910)

We often say that if you want to change your behaviour, you need to change your mind, be disciplined, etc. But this principle tells us that it works the other way around as well: if you change your behaviour this can change your thinking. Or mindset, perhaps?

For more about the ‘As If’ Principle, see the book by Richard Wiseman

Another piece of theory that is related is complexity thinking as embodied by the Cynefin framework. Cynefin talks about taking different actions when managing situations that are in different domains: simple, complicated, complex or chaos.

Cynefin Framework

The project

And in chaos, our story begins.

This particular project was a development project for a large insurance company. The project had already been active for over half a year when we joined. It was a bad case of waterfall, with unclear requirements, lots of silo’s, lots of finger pointing and no progress.

The customer got tired of this, and got in a high-powered project manager who was given far reaching mandate to get the project going. (ie. no guarantees, just get *something* done) This guy decided that he’d heard good things about this ‘Agile’ thing, and that it might be appropriate here as a risk-management tool. Which was where we came in.

And this wasn’t the usual agile transition, with its mix of proponents and reluctants, where you coach and teach, but also have to sell the process to large extend.

Here, everyone was external (to the customer), no-one wanted Agile, or had much experience with it, but the customer was demanding it! And taking full responsibility for delivery, switching the project to a time-and-material basis for the external parties.

A whole new ballgame.

Initial actions

We started out by getting everyone involved local. Up to then, people from four different vendors been in different locations, in different countries even. Roughly 60 people in all, we all worked from the office in Amsterdam. Most of these people had never met or even spoken!

We started with implementing a fairly standard Scrum process.

Step one was requiring multi-functional teams, mixing the vendors. This was tolerated. Mostly, I think, because people thought they could ignore it. Then we explained the other requirements. One week sprints, small stories (<2 / 3 days), grooming, planning, demo, retro. These things were all, in turn, declared completely impossible and certainly in our circumstances unworkable. But the customer demanded it, so they tried. And at the end of the first week, we had our first (weak) demo.

So, we started with basic Scrum. The difference was in the way this was sold to the teams. Or wasn’t.

That is not to say that we didn’t explain the reasons behind the way of working, or had discussions about its merit. It’s just that in the end, there was no option of not doing it.

And… It worked!

The big surprise to us was how well this worked. People adjusted quickly, got to work, and started delivering working software almost immediately. Every new practice we introduced, starting with testing within the sprint, met with some resistance, and within 4 to 6 weeks was considered normal.

After a while we noticed that our retrospectives changed from simply complaining about the process to open discussion about impediments and valuable input for improvements generated by our teams.

And that’s what we do all this for, right? The continuous improvement mindset? Scrum, after all, is supposed to surface the real problems.

Well. It sure did.

Automated testing

One of those problems was one which you will be familiar with. If you’ve been delivering software weekly for a while, testing manually won’t keep up. And so we got more and more quality issues.

We had been expecting this, and we had our answer ready. And since we’d had great success so far in our top-down approach, we didn’t hesitate much, and we started asking for automated testing.

Adoption

Resistance here was very high. Much more so than for other changes. Impossible! But we’d heard all those arguments before, and why would this situation be any different? We set down the rules: every story is tested, tests are automated, all this happens within the sprint.

the-princess-bride-inconceivable

And sure enough, after a couple of sprints, we started seeing automated tests in the sprint, and a hit in velocity recovered to almost the level we had had before.

See. It’s Simple! Just F-ing Do It!

Limitations

Then after another 3-4 sprints, it all fell apart.

Tests were failing frequently, were only built against the UI, had lots of technical shortcomings. And tests were built within the team, but still in isolation: a ‘test automation’ person built them, and even those were decidedly unconvinced they were doing the right thing.

In the end, it took us another 6 months to dig our way out of this hole. This took much coaching, getting extra expertise in, pairing, teaching. Only then did we arrive at the stop-the-line mindset about our tests that we needed.

Even with all of that going on, though we were actually delivering working software.

And we were doing that, much quicker than expected. After the initial delays in the project, the customer hadn’t expected to start using the system until… well, about now, I think. But instead we had a (very) minimal, but viable product in time for calculating the 2012 year-end figures. And while we were at it, since we could roll-out new environments at a whim (well… almost:-) due to our efforts in the area of Continuous Delivery, we could also do a re-calculation of the 2011 figures.

These new calculations enabled the company to free a lot of money, so business wise there’s no doubt this was the right thing to do.

But it also meant that, suddenly, we were in production, and we weren’t really prepared to deliver support for that. Well, we really weren’t prepared!

Kanban

And that brings us to one of the most invasive changes we did during the project. After about 5 months, we moved away from Scrum and switched to Kanban.

Just Do It

At that time I was the scrum master of one of the teams, the one doing all the operations work. And our changes in priority were coming very fast, with many requests for support of production. In our retros, the team were stating that they were at the same time feeling that nothing was getting done (our velocity was 0), and they felt stressed (overtime was happening). Not a good combination. This went on for a few sprints, and then we declared Kanban.

That’s not the way one usually introduces Kanban. Which is carefully, evolutionary, keeping everyone involved, not changing the process but just visualising it. You guys know how that’s supposed to be done right?

This was more along the lines: “Hey, if you can’t keep priorities stable for a week, we can’t plan. So we won’t.”

Of course, we did a little more than that. We carefully looked at the type of issues we had, and the people available to work on them. We based some initial WIP limits on that, as well as a number of classes of service. And we put in some very basic explicit policies. No interruptions, except in case of expedite items. If we start something, we finish it. No breaking of WIP limits. And no days longer than 8 hours.

Adoption

That brought a lot of rest to the team. And immediately showed better production. It also made the work being done much more transparent for the PO.

It worked well enough, that another team that was also experiencing issues with the planning horizon also opted to ‘go Kanban’. Later the rest of the teams followed, including the PO team.

Limitations

That is not to say there was no resistance to this change. The Product Owners in particular felt uncomfortable with it for quite some time. The teams also raised issues. All that generated many of those nice incremental, evolutionary changes. And still does. The mindset of changing your process to improve things has really taken root.

The most remarkable thing, though, about all that initial resistance was the direction. It was all about moving back to the familiar safety of… Scrum!

Wrap-up

I’d like to tell you more but this post is getting long enough already. I don’t have time to talk about our adventures with going from many POs to one, introducing Specification by Example, moving to feature teams, or our kanban ready board.

I do feel I need to leave you with some comforting words, though. Because parts of this story go against the normal grain of Agile values.

Directive leadership, instead of Servant Leadership? Top-Down change, instead of bottom-up support? Certainly more of a dose of Theory X than I can normally stomach!

And to see all of that work, and work quite well, is a little disconcerting. Yes, Cynefin says that decisive action is appropriate in some domains, but not quite in the same way.

And overcoming the familiar ‘That won’t work in our situation’ resistance by making people try it is certainly satisfying, but we’ve also seen that fail quite disastrously where deep skills are required. That needs guidance: Still no silver bullets.

Enlightened Despotism is a perhaps dangerous tool. But what if it is the tool that instills the habits of Agile thinking? The tool that forcibly shakes people out of their old habits? That makes the despot obsolete?

Practice can lead to mindset. The trick is in where to guide closely, and when to let go.

Scaling Agile?

There’s a lot of discussion in the Agile community on the matter of scaling agile. Should we all adopt Dean Leffingwell’s Scaled Agile Framework? Do the Spotify tribe/squad thing? Or just roll our own? Or is Ron Jeffries’ intuition right, and do the terms scaling and agile simply not mix?

Ron’s stance seems to be that many of Agile’s principles simply don’t apply at scale. Or apply in the same way, so why act differently at scale? That might be true, but might also be a little too abstract to be of much use to most people running into questions when they start working with more than one team on a codebase.

Time and relative dimension in space

When Ron and Chet came around to our office last week, Chet mentioned that he was playing around with the analogy of coordination in time (as opposed to cross-team) when thinking about scaling. This immediately brought things into a new perspective for me, and I thought I’d share that here.

If we have a single team that will be working on a product/project for five years, how are they going to ensure that the team working on it now communicates what is important to the team that is working on it three, four or five years from now?

Now that is a question we can easily understand. We know what it takes to write software that is maintainable, changeable, self-documenting. We know how to write requirements that become executable, living documentation. We know how to write tests that run through continuous integration. We even know how to write deployment manifests that control the whole production environment to give us continuous deployment.

So why would this be any different when instead of one team working five years on the same product, we have five teams working for one year?

This break in this post is intentionally left blank to allow you to think that over.

Simple Design

scrum_tardis

Scrum really is bigger on the inside!

This way of looking at the problem simplifies the matter considerably, doesn’t it? I have found repeatedly that there are more technical problems in scaling (and agile adoption in general) than organizational ones. Of course, very often the technical problems are caused by the organizational ones, but putting them central to the question of scaling might actually help re-frame the discussions on a management level in a very positive way.

But getting back to the question: what would be the difference?

Let’s imagine a well constructed Agile project. We have an inception where the purpose of the product is clearly communicated by the customer/PO. We sketch a rough idea of architecture and features together. We make sure we understand enough of the most important features to split off a minimum viable version of it, perhaps using a story map. We start the first sprint with a walking skeleton of the product. We build up the product by starting with the minimal versions of a couple of features. We continue working on the different features later, extending them to more luxurious versions based on customer preference.

As long as the product is still fairly well contained, this would be exactly the same when we are with a few teams. We’d have come to a general agreement on design early on, and would talk when a larger change comes up. Continuous integration will take care of much of the lower level coordination, with our customer tests and unit testing providing context.

One area does become more explicit: dependencies. Where the single team would automatically handle dependencies in time by influencing prioritization, the multiple teams would need to have a commonly agreed (and preferably commonly built) interface in existence before they could be working on some features in parallel. This isn’t really different from the single-team version above, where the walking skeleton/minimal viable feature version would also happen before further work. But it would be felt as something needing some special attention, and cooperation between teams.

If we put these technical considerations central, that resolves a number of issues in scaling.  It could also allow for a much better risk / profit trade-offs by integrating this approach with a set-based approach to projects. But I’ll leave that for a future post.

Management Innovation, ca. 1972

SilosYesterday, after my brother’s 47th birthday, I was talking with my father. My father is 79, and he has had an interesting professional life. He started out as a catholic priest but, as you could guess from the fact of my existence, at some point figured out that this was not a sustainable career path for him.

While talking, my father touched upon the subject of work, and the importance of working for the right reasons. He discussed people that had found a work/life balance that worked for them, by not focusing on financial rewards, but on the contents of the work, and the need to spend time on their families. Of course he sees that many people, including himself at one point, have manoeuvred themselves in positions where making those choices has become very difficult, if not impossible. Certainly many people now are in positions where two full-time salaries are needed to be able to keep paying the mortgage. But focusing on intrinsic motivation for work is very important for your enjoyment of work, and life.

We also talked about management a bit. In 1972, also the year of my own vintage, my father became director of the social services department (‘Sociale Zaken’) of the city of Haarlem. This organisation of around 250 people was then organised in strict silos, where social workers, legal people, administrative work and other departments worked separately, and with little understanding of each other’s specific challenges. Clients (people who were waiting to hear whether they would be getting social support, usually) regularly had to wait until each of those departments had had their say, often adding up to a 4 month wait before their application was either approved or denied. Meanwhile social workers were reporting their conclusions to those client back in language that  was very hard to understand. All of this resulted in very unhappy clients. And thus for the people that had the direct contact with those client not always being treated in the kindest ways.

My father apparently discussed the use of understandable language extensively with the social workers, at one point demonstrating the problem of jargon by slipping back into some latin phrases

Quidquid recipitur ad modum recipientis recipitur – “Whatever is received is received according to the mode of the receiver.”

I have, having failed spectacularly at the one year of Latin I had in school, no idea if this was the actual phrase he used. But it does seem apt.

He also instigated a reorganisation of the service. The reorganisation moved from the siloed system to one where different geographical parts of the city were going to be services by multi-functional teams, consisting of social workers, legal experts, administrative workers and directly client-facing people. The goal was to reduce time wasted with hand-overs, breed understanding for people with another area of expertise, and create more understanding for the clients predicaments.

This move was not welcomed by the employees. So much so that unions were involved, strikes threatened, and silo walls reinforced. In the end, it happened anyway. The arguments were solid, and even with unions involved, there was a rather clear power structure in place. And eventually, when the dust had settled down, the results were very positive indeed. Client satisfaction was up and response times were down (I think it went from roughly 4 months to two weeks). But just as important, internal strife was removed to a large extend, employees were picking up skills (and work!) outside of their area of expertise, and employee satisfaction was considerably better.

Now for me, this is not surprising. It’s not surprising because I know and have lived the results of similar ways of working in Agile teams.

Or maybe I’ve been doing that because, in the end, some values are instilled at an early age, and I had a great example.