Timothy Fitz

Continuous Deployment

Posted in Uncategorized by timothyfitz on February 8, 2009
Alex has just written a refactoring of some website backend code. Since it was a small task, it’s committed and Alex moves on to the next feature.
 
When the code is deployed in production two weeks later it causes the entire site to go down. A one-character typo that was missed by automated tests caused a failure cascade reminiscent of the bad old days at Twitter. It takes eight hours of downtime to isolate the problem, produce a one-character fix, deploy it and bring production back up.
 
Alex curses luck, blames human infallibility and the inevitable cost of software engineering, and moves on to the next task.
 
This story is the day-to-day of most startups I know. It sucks. Alex has a problem and she doesn’t even know it. Her development practices are unsustainable. “Stupid mistakes” like the one she made increase as the product grows more complex and as the team gets larger. Alex needs to switch to a scalable solution.
 
Before I get to the solution, let me tell you about some common non-solutions. While these are solutions to real problems, they aren’t the solution to Alex’s problem.
 
1. More manual testing.
This obviously doesn’t scale with complexity. This also literally can’t catch every problem, because your test sandboxes or test clusters will never be exactly like the production system.
 
2. More up-front planning
Up-front planning is like spices in a cooking recipe. I can’t tell you how much is too little and I can’t tell you how much is too much. But I will tell you not to have too little or too much, because those definitely ruin the food or product. The natural tendency of over-planning is to concentrate on issues that aren’t real. Now you’ll be making more stupid mistakes, but they’ll be for requirements that won’t ever matter.
 
3. More automated testing.
Automated testing is great. More automated testing is even better. No amount of automated testing ensures that a feature given to real humans will survive, because no automated tests are as brutal, random, malicious, ignorant or aggressive as the sum of all your users will be.
 
4. Code reviews and pairing
Great practices. They’ll increase code quality, prevent defects and educate your developers. While they can go a long way toward mitigating defects, ultimately they’re limited by the fact that while two humans are better than one, they’re still both human. These techniques only catch the failures your organization as a whole was already capable of discovering.
 
5. Ship more infrequently
While this may decrease downtime (things break and you roll back), the cost on development time from work and rework will be large, and mistakes will continue to slip through. The natural tendency will be to ship even more infrequently, until you aren’t shipping at all. Then you’ve gone and forced yourself into a total rewrite. Which will also be doomed.
 
So what should Alex do? Continuously deploy. Every commit should be instantly deployed to production. Let’s walk through her story again, assuming she had such an ideal implementation of Continuous Deployment.
Alex commits. Minutes later warnings go off that the cluster is no longer healthy. The failure is easily correlated to Alex’s change and her change is reverted. Alex spends minimal time debugging, finding the now obvious typo with ease. Her changes still caused a failure cascade, but the downtime was minimal. 
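
As a rough illustration of that loop, here is a minimal Python sketch. Everything in it is an assumption made for the example: `deploy_with_rollback`, the `live` revision history and the `is_healthy` callback are hypothetical names standing in for a real deploy script and real cluster monitoring, not a description of any particular system.

```python
from typing import Callable, List


def deploy_with_rollback(
    commit: str,
    live: List[str],
    is_healthy: Callable[[str], bool],
) -> bool:
    """Deploy `commit` and revert it if the health check fails.

    `live` is a hypothetical history of revisions that have been in
    production, newest last. `is_healthy` stands in for whatever
    monitoring decides that the cluster is no longer healthy.
    Returns True if the commit stayed in production.
    """
    live.append(commit)      # ship the commit right away
    if is_healthy(commit):   # cluster still healthy: keep it
        return True
    live.pop()               # failure correlates to this commit: revert it
    return False
```

Because each deploy carries exactly one commit, a failed health check points at exactly one change. For example, calling `deploy_with_rollback("b2", live, check)` with `live = ["a1"]` and a check that rejects `"b2"` returns `False` and leaves `live` as `["a1"]`.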
 
This is a software release process implementation of the classic Fail Fast pattern. The closer a failure is to the point where it was introduced, the more data you have to correct for that failure. In code, Fail Fast means raising an exception on invalid input, instead of waiting for it to break somewhere later. In a software release process, Fail Fast means releasing undeployed code as fast as possible, instead of waiting for a weekly release to break.
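
To make the code half of that analogy concrete, here is a small Python sketch. The `parse_port` function and its valid range are made up for illustration; the point is only that bad input is rejected where it enters, not wherever it would eventually break.

```python
def parse_port(value: str) -> int:
    """Fail fast: reject an invalid port as soon as it is read."""
    port = int(value)  # raises ValueError immediately on non-numeric input
    if not 1 <= port <= 65535:
        # Raise here, next to the bad input, instead of letting a
        # nonsensical port number fail later when a socket is opened.
        raise ValueError(f"port out of range: {port}")
    return port
```

With this check in place, `parse_port("8080")` returns `8080`, while `parse_port("70000")` raises a `ValueError` at the call site, where the bad value is still easy to trace.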
 
Continuous Deployment is simple: just ship your code to customers as often as possible. Maybe today that’s weekly instead of monthly, but over time you’ll approach the ideal and you’ll see the incremental benefits along the way.

31 Responses


  1. Erik A. Brandstadmoen said, on February 8, 2009 at 6:05 pm

    Sorry, this might be good enough for play-like services, but it is not good enough for critical systems such as financial systems. A simple non-availability failure is always your best case scenario. Suppose Alex made an error that had large financial consequences, such as transferring an erroneous amount, or transferring it to the wrong recipient.

    I definitely think extensive manual and automatic testing are necessary, of which automatic testing may reduce the number of “program domain” errors, but manual testing will definitely catch more of the “business domain” errors.

    In real life, I don’t think your approach is good enough. Of course, more frequent releases could make the developer more aware of what actually caused the error, but suppose it wasn’t just a one-character error, and took some hours to nail down and fix, what is your backup plan? And what consequences are you able to bear, due to insufficient testing?

    Interesting theory, but I don’t agree :)

  2. drewp said, on February 9, 2009 at 4:20 am

    Erik: you aren’t totally clear about it, but it sounds like you think “non-solution” #1 is a good plan. So do you also think it scales suitably? Do some companies just need to have QA depts five times the size of their development groups? IMHO, manual testing has only two advantages: it’s the easiest thing to [try to] do; and it has a lovely accountability chain. You can always blame the developer, and non-technical people will easily accept that this is the “inevitable cost of software engineering”.

    As to your last question, I think the idea is that your operations people should be able to roll backwards as fast as they roll forwards, which they can do in the cases that a bugfix is taking too long.

    I especially liked the discussion of pattern #5 (release less often), since it appears to work perfectly *according to the consumers of the code*. They wanted fewer buggy release events, and they get that. If the bug-fixing effort was actually fixed at, say, 8-man-hours-per-release, this would actually be a winner. But the point is that it’s not; the faster you release, the easier each bug is to fix. The difficulty of fixing bugs also goes way up in a large system with many interdependencies. Maybe this means that as your system gets more interdependencies, you should crank up the deployment speed?

  3. Eric Florenzano said, on February 9, 2009 at 5:22 am

    I like the idea, but it would certainly make it more difficult to implement larger features slowly. Instead of encouraging many atomic commits, you would be encouraging the developer to produce one monolithic commit with the complete new feature.

    P.S. Who is Alex? :)

  4. Ben McGraw said, on February 8, 2009 at 11:24 pm

    I am guessing “Alex” is “Some random token guy with a name of someone who Tim doesn’t personally know.”

  5. Ben McGraw said, on February 8, 2009 at 11:25 pm

    …How odd. I just posted six hours prior to the post I was responding to.

  6. timothyfitz said, on February 9, 2009 at 12:04 am

    Erik: drewp beat me to answering your questions, and his answers are exactly what I would have written.

    Ben: The timing issue is a hiccup from when I switched the blog from UTC to PST. Also Alex is a she.

    Eric: This system and continuous integration go hand in hand. With Continuous Integration you commit as often as you can, and keep a green build (tests passing) on trunk. With Continuous Deployment you deploy each of those commits. Now you don’t just think you have a shippable build, you have a shipped build. You know it’s good.

  7. CanonicalChris said, on February 9, 2009 at 6:20 am

    How is this different from conceptualizing your customers as the unpaid extended QA department you either cannot afford or cannot properly simulate? And how will that make your customers feel once they figure that out?

    Or put differently: The question that DrewP’s reply does not answer is how do you balance the customer satisfaction against the ease of debugging.

  8. Erik A. Brandstadmoen said, on February 9, 2009 at 9:15 am

    drewp and Timothy:

    Maybe I focused too much on solution #1 because I totally see the value of manual testing. Of course, I absolutely agree that automatic testing and automatic quality gates are necessary, and make the quality of the product you ship to the testers much, much higher.

    However, regarding rollback, it’s not a case of just rolling back to the former version if something was buggy. Suppose you have a few hundred thousand users in a financial transactional application. Who is going to clean up all the faulty transactions created during the uptime of the faulty version? This is my point. A rollback is often harder when it involves already committed, unwanted transactions than when it involves simply non-functional software.

    I do agree that it is a good strategy to deploy continuously, but, and mark my but, to a test environment, with test data. That way you don’t risk messing up your (more correctly your customers’) real-life data.

    As you state in your original article, no automated tests are capable of capturing all possible scenarios. This is why I would prefer a lot of testers testing the solution before letting it loose on real-life data.

    What are your thoughts on the possibility of data corruption in your deploy-continuously scenario?

  9. Kevin Gadd said, on February 9, 2009 at 3:57 pm

    Erik: What is the strategy you suggest that has less risk of data corruption than deploy-continuously w/automated tests and rollback? The only way to completely avoid data corruption on your production servers would be to only run perfect code on your production servers, so we’ve got to be talking about some sort of quality tradeoff where you can decrease the potential of data corruption in production by paying some other cost (slower commit-test-deploy cycles, larger engineering team, etc.)

    It seems to me, though, that even if you can reduce the potential for data corruption by deploying less often or using more stringent code review strategies, you still have to account for data corruption. No matter what, eventually you’re going to get some faulty transactions in your financial system’s tables. When that happens, you need to have a rock-solid strategy for verifying the data you’ve got and getting rid of/repairing the bad data.

  10. Erik A. Brandstadmoen said, on February 10, 2009 at 12:32 am

    Kevin,

    I appreciate that all the strategies mentioned in the original post are valuable pieces in a best-effort to try to reduce the risks of getting faulty code on your production servers. I have experiences from my consulting, where quality control is based almost only on a huge, manual testing period at the time between code freeze and release to customers. And I am painfully aware that in most cases, this is not sufficient for discovering all possible scenarios and faults.

    You are right. The only way to completely avoid data corruption is to run perfect code on production servers :) The subject of the discussion is how we end up as close to this aim as possible, in the best way. I think removing all “quality gates” between checkin and releasing to customers is not the best way. And I think the downtime and/or faulty behaviour for customers would increase drastically if we use them for first-level testing.

    I think one strategy for reducing risk is:

    1) Nightly builds and/or continuous integration builds – discovers build errors
    2) Fine-grained unit tests – discovers (some) programming logic errors
    2) Nightly and/or continuous run of cleverly-crafted, replayable, self-contained functionality-tests, discovers basic functional flaws.
    3) Testing of nightly builds on a daily basis – to discover more subtle logic flaws (this is expensive, do as much as you can afford)
    4) Frequent or more infrequent “release candidate” builds (depending on your needs/wishes), which gets released to internal test teams for thorough testing including regression testing – hopefully rules out most errors.
    5) Internal test on production data, if 4) is carried out successfully and approved by a test manager – rules out more difficult-to-find, data-dependent errors
    6) Release to production

    Number 3) is probably the most difficult item to do properly. Especially if you depend on external systems’ test systems etc. Mocks can help a part of the way, but it gets very complicated and time-consuming to write these kinds of tests, if you depend on synchronized information in more than one external system, data being updated by nightly batch jobs, etc.

    Depending on your budget and local needs, you need to adjust the time used on each phase, of course. I generally prefer spending effort on higher quality earlier in the process described above, as I think errors are easier to rule out, if you can isolate them on a fine-grained level, than if you have to track them down and isolate them manually after a manual test has discovered the potential error.

    This is just a suggestion, and is not meant to be a recipe for everyone. Please feel free to give me your thoughts on it.

  11. [...] Posted in Uncategorized by timothyfitz on February 10th, 2009  I recently wrote a post on Continuous Deployment: deploying code changes to production as rapidly as possible. The response on news.ycombinator was, [...]

  12. stockst said, on February 10, 2009 at 4:54 am

    @Erik: “In real life, I don’t think you[r] approach is good enough.”

    And where, if not in “real life”, is the OP living?

    I ask because your extensive comments, mostly containing references to financial transactions, miss several important subtleties the OP writes about.

    For one, all systems do not have the transactional requirements that you imply. Some systems (and userbases) will accept transient errors. If Amazon.com occasionally says “Internal Error” I hit the refresh button and problem solved.

  13. Erik A. Brandstadmoen said, on February 10, 2009 at 5:21 am

    stockst: I don’t see what subtleties you think I am missing. “Fail fast” is fine. But how about “Fail not so fast, but definitely not as serious”? My point is that it all depends on the nature of your applications. An “Internal error” is fine. But how about if John-newly-hired-freshman coder introduced an error that caused all orders placed on Amazon for about 5 minutes to be charged to the wrong credit cards?

    In a perfect world, all of us would be good programmers. Not all of us are. I say this from experience with different types of programmers I have met through my programming experience. And, given that all programmers make errors, some worse than others, I don’t like the idea of cutting 3-4 steps in the QA process before letting software loose onto real customers.

  14. Daniel Einspanjer said, on February 10, 2009 at 11:59 am

    Erik: I’ve enjoyed your clear and rational discussion in this thread regarding the seriousness of data corruption in financial systems, and how to determine the best approach to prevent them.

    When I read your first response, my initial thought was: If you want to make sure that corruption of financial transactions can’t occur in production, then you need to make sure you pay the appropriate amount of money in up front development to have a test suite that prevents it.

    Many companies might amortize this cost by paying a manual QA team to perform manual tests for a long time with lengthy delays between releases, but I imagine the most successful teams will have the most comprehensive documentation of problems found and tests that ensure a problem doesn’t exist, and from there, that those tests and documentation are transcribed into a testing infrastructure that ensures all of those tests are executed constantly.

    In reading Timothy’s subsequent post regarding Continuous Deployment, it really does sound like that is exactly what IMVU has done. On paper, they have a test suite that many developers would drool over. I’m very interested to hear your take on whether you feel that even a financial company could eventually refine and expand the comprehensiveness of their testing infrastructure to the point that this kind of deployment scheme would result in just as few data corruption issues as that one sleeper bug that sat in two year-long release cycles is likely to cause.

  15. Danno said, on February 10, 2009 at 8:47 pm

    What Erik is talking about is “Oh shit, launched the missiles” errors. Things that have no easy way of being undone, and the failure of which has dire consequences that cannot be corrected easily or at all.

    Thankfully, most code doesn’t launch the missiles and not too many errors, even in a financial application, are going to be much worse than a fat-finger. Unfortunately, I suppose you’ve got to accept the cost in the situation that you could possibly launch the missiles. In fact, in that case, you probably want the manual testing, the automated testing, the test environment somehow running off of production data with fake missiles to fire… and a couple of elves hanging around just in case.

  16. Briefly Noted for February 25, 2009 said, on February 25, 2009 at 5:24 am

    [...] Along with “the perfect is the enemy of the good,” “release early and often” is something of a mantra around CHNM, especially when it comes to software and web application development. For a variety of reasons, not least the invaluable testing and feedback projects get when they actually make it into the wild, CHNM has always been keen to get stuff into users hands. Two good statements of likeminded philosophy: Eric Ries’ Lessons Learned: Continuous deployment and continuous learning and Timothy Fitz’s Continuous Deployment. [...]

  17. Michael Dubakov said, on February 26, 2009 at 9:27 am

    Continuous Deployment *may* work only for online applications (or single-instance applications). It can’t work for downloadable packages like a Web browser, an IDE, etc. People download it, install it and throw errors at you. It may be thousands of people all over the world during a short period of time (imagine a new major Safari release, how many downloads do they already have in 2 days?). Even if you fix the problem and deploy a new version in several hours, it will be a SERIOUS hit to the company’s reputation: “HEY! Apple released buggy Safari! It crashes when I open Google instantly!”

  18. Notional Slurry » links for 2009-02-27 said, on February 27, 2009 at 10:15 pm

    [...] Continuous Deployment « Timothy Fitz "So what should Alex do? Continuously deploy. Every commit should be instantly deployed to production. Let’s walk through her story again, assuming she had such an ideal implementation of Continuous Deployment. Alex commits. Minutes later warnings go off that the cluster is no longer healthy. The failure is easily correlated to Alex’s change and her change is reverted. Alex spends minimal time debugging, finding the now obvious typo with ease. Her changes still caused a failure cascade, but the downtime was minimal. " (tags: continuous-integration continuous-deployment testing agility FUD amusing-comments) [...]

  19. noname said, on February 28, 2009 at 4:43 pm

    infallibility != fallibility Alex fails again.

  20. [...] what they do in iterative Customer Development and their work on Continuous Deployment IMVU is well worth paying attention to. Possibly related posts: (automatically generated)Active [...]

  21. cs said, on March 22, 2009 at 2:41 pm

    for small projects with small fairly localized teams, this process has merit. for larger, highly scaled projects that use distributed resources this process can become untenable. the situation i’m thinking about is when a system’s core functionality is being altered. while there are plenty of atomic pieces that can be tested individually with automated unit tests, and focussed manual qa, oftentimes not many of these pieces can be deployed independent of other pieces. because while from a code perspective they are neatly abstracted and modular, from a ‘how this system works’ perspective they present changes that the legacy system might not be able to support.

    bottom line, good qa is good qa whether it happens within 5 minutes of the developer leaving for a beer or 1 week after they leave for a beer. qa will always miss things. that’s human nature. and embracing that while letting you relieve some stress doesn’t justify introducing buggy code to users any sooner. it’s a better experience for your users if you don’t rely on them to do the qa for you.

  22. [...] Day of all days. However, clicking through to the article he linked to led me to a couple of blog entries written back in February, and a Google blog search turned up several dated hits indicating [...]

  23. [...] being tossed around the Valley right now with names like Pirate Metrics, Customer Development and Continuous Deployment. There are some awesome people who are sharing great templates for startup creation. If [...]

  24. [...] typo with ease. Her changes still caused a failure cascade, but the downtime was minimal. (found at T. Fitz Blog) When he posted this, what happend, people do not believed him, so his next blog posting is even [...]

  25. [...] I Just read Timothy Fitz’s post on Continuous Deployment: [...]

  26. [...] Fitz wrote an interesting entry about Continuous Deployment, basically deploying to production after each commit. While I [...]

  27. [...] folks at IMVU also seem to be fans of the continuous deployment methodology as well from the post by Timothy Fitz. Eric suggest a 5 step approach for moving to a continuous deployment environment. The topic of [...]

  28. Das said, on August 9, 2009 at 8:37 am

    I don’t see how continuous deployment is scalable either (compared to the other “non-solutions”) … So, what if Alex is not the only programmer and there are 25 other programmers who are checking in code on a daily basis? Now, it is no longer Alex who may have caused the bug – it could be any one of the others. For larger teams, trying to troubleshoot & find problems on live production is much more costly (both financially & otherwise) than what continuous deployment seems to save us.

    Secondly – what if the bug lurks for a while before it is visible? How do you know whether it is Alex who caused the bug because of what was continuously deployed a week ago or because of what was deployed 5 minutes ago?

    I think the “use case” that is being touted is too simplistic to be used as a real case study showing why continuous deployment is better than the other “non-solutions”. I think the type & size of project team, the application, the users of the application and their tolerance for bugs all would play a role into whether continuous deployment makes sense – for me this is no panacea …
    –Das

  29. DPQ said, on August 21, 2009 at 3:46 pm

    Das hit the nail. This is soooo reminiscent of the academic programming proof of correctness discussions from the 1980s. Go ahead and start from scratch, absolutely require TDD and as much automated test coverage as is humanly possible, a strict SaaS model deploying to a small set of sites completely under your control (for rollback), and a customer base for whom failure is merely a nuisance, and you might get CD to work. Academic.

    Have some sparsely tested legacy code (even in a modern programming methodology), or a download distribution model (noted earlier), or customers paying for reliable service, and Alex will not cut it. Not to mention the morale hit to the programming team when Alex’s bug is found 50 commits later and they ALL have to be rolled back because, of course, the non-buggy work that Alex did has become integral to subsequent coding.

  30. [...] the topic of my post, but it is what has triggered me to write this post. Timothy’s two posts explain it very well, including this part which I shall quote: Continuous Deployment is [...]

