Continuous Deployment at IMVU: Doing the impossible fifty times a day.
I recently wrote a post on Continuous Deployment: deploying code changes to production as rapidly as possible. The response on news.ycombinator was, well…
“Maybe this is just viable for a single developer … your site will be down. A lot.” – akronim
“It seems like the author either has no customers or very understanding customers … I somehow doubt the author really believes what he’s writing there.” – moe
…not exactly what I was expecting. Quite the contrast to the reactions of my coworkers who read the post and thought “yeah? what’s the big deal?” Surprising how quickly you can forget the problems of yesterday, even if you invested most of yourself into solving them.
Continuous Deployment isn’t just an abstract theory. At IMVU it’s a core part of our culture to ship. It’s also not a new technique here, we’ve been practicing continuous deployment for years; far longer than I’ve been a member of this startup.
It’s important to note that the system I’m about to explain evolved organically in response to new demands on the system and in response to post-mortems of failures. Nobody gets here overnight, but every step along the way has made us better developers.
At a high level, our process is dead simple: Continuously integrate (commit early and often). On commit, automatically run all tests. If the tests pass, deploy to the cluster. If the deploy succeeds, repeat.
Our test suite takes nine minutes to run (distributed across 30-40 machines). Our code pushes take another six minutes. Since these two steps are pipelined, that means at peak we’re pushing a new revision of the code to the website every nine minutes. That’s more than six deploys an hour. Even at that pace we’re often batching multiple commits into a single test/push cycle. On average we deploy new code fifty times a day.
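As a rough illustration of that arithmetic, here is a back-of-the-envelope sketch. It is not part of our tooling: the function names are made up for the example, and it assumes the test and push stages overlap perfectly, so the slower stage sets the pace.

```python
from fractions import Fraction

TEST_MINUTES = 9   # test suite duration, from the post
PUSH_MINUTES = 6   # code push duration, from the post


def pipeline_interval(stage_minutes):
    """In a full pipeline, a revision completes every max(stage) minutes."""
    return max(stage_minutes)


def commit_latency(stage_minutes):
    """A single commit still has to pass through every stage in turn."""
    return sum(stage_minutes)


stages = [TEST_MINUTES, PUSH_MINUTES]
interval = pipeline_interval(stages)
peak_per_hour = Fraction(60, interval)

print(f"one revision every {interval} minutes at peak")
print(f"peak deploys per hour: {float(peak_per_hour):.1f}")
print(f"commit-to-live latency: {commit_latency(stages)} minutes")
```

The point of pipelining is that the two numbers differ: a commit takes the sum of the stages to go live, but the cluster can accept a new revision as often as the slowest stage allows.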
So what magic happens in our test suite that allows us to skip having a manual Quality Assurance step in our deploy process? The magic is in the scope, scale and thoroughness. It’s a thousand test files and counting. 4.4 machine hours of automated tests to be exact. Over an hour of these tests are instances of Internet Explorer automatically clicking through use cases and asserting on behaviour, thanks to Selenium. The rest of the time is spent running unit tests that poke at classes and functions and running functional tests that make web requests and assert on results.

Buildbot running our tests sharded across 36 machines.
Great test coverage is not enough. Continuous Deployment requires much more than that. Continuous Deployment means running all your tests, all the time. That means tests must be reliable. We’ve made a science out of debugging and fixing intermittently failing tests. When I say reliable, I don’t mean “they can fail once in a thousand test runs.” I mean “they must not fail more often than once in a million test runs.” We have around 15k test cases, and they’re run around 70 times a day. That’s a million test cases a day. Even with a literally one in a million chance of an intermittent failure per test case we would still expect to see an intermittent test failure every day. It may be hard to imagine writing rock solid one-in-a-million-or-better tests that drive Internet Explorer to click ajax frontend buttons executing backend apache, php, memcache, mysql, java and solr. I am writing this blog post to tell you that not only is it possible, it’s just one part of my day job.
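To make the one-in-a-million claim concrete, here is a small sketch of the arithmetic. The counts are the round figures quoted above; treating every test execution as an independent event is a simplifying assumption of the sketch, and the function names are illustrative.

```python
import math

TEST_CASES = 15_000    # "around 15k test cases"
RUNS_PER_DAY = 70      # "run around 70 times a day"
FAILURE_RATE = 1e-6    # one spurious failure per million executions


def executions_per_day(cases, runs):
    """Total individual test case executions in a day."""
    return cases * runs


def expected_failures(executions, rate):
    """Mean number of spurious failures per day."""
    return executions * rate


def prob_at_least_one(executions, rate):
    """Chance of one or more spurious failures, assuming independence."""
    # log1p/expm1 keep precision when the rate is tiny.
    return -math.expm1(executions * math.log1p(-rate))


n = executions_per_day(TEST_CASES, RUNS_PER_DAY)
print(n)
print(round(expected_failures(n, FAILURE_RATE), 2))
print(round(prob_at_least_one(n, FAILURE_RATE), 2))
```

With these inputs the expected count comes out at about one spurious failure a day. Swap in a one-in-a-thousand rate and the same formula gives roughly a thousand a day, which is why "mostly reliable" tests are not good enough here.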
Back to the deploy process: nine minutes have elapsed and a commit has been greenlit for the website. The programmer runs the imvu_push script. The code is rsync’d out to the hundreds of machines in our cluster. Load average, cpu usage, php errors and dies and more are sampled by the push script, as a baseline. A symlink is switched on a small subset of the machines throwing the code live to its first few customers. A minute later the push script again samples data across the cluster and if there has been a statistically significant regression then the revision is automatically rolled back. If not, then it gets pushed to 100% of the cluster and monitored in the same way for another five minutes. The code is now live and fully pushed. This whole process is simple enough that it’s implemented by a handful of shell scripts.
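The rollback decision can be sketched in a few lines. This is not the imvu_push script, which is shell and whose exact metrics and statistics are not described here. It is an illustration that assumes a single metric (errors per request) and a one-sided two-proportion z-test; the function names and the threshold of 3.0 are hypothetical.

```python
import math


def regression_z_score(base_errors, base_requests, new_errors, new_requests):
    """Two-proportion z-score: positive when the new error rate is higher."""
    p_base = base_errors / base_requests
    p_new = new_errors / new_requests
    pooled = (base_errors + new_errors) / (base_requests + new_requests)
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / base_requests + 1 / new_requests)
    )
    if std_err == 0:
        # No variation at all (for example, zero errors in both samples).
        return 0.0
    return (p_new - p_base) / std_err


def decide(baseline, sample, z_threshold=3.0):
    """Return 'rollback' if the post-push sample is significantly worse."""
    z = regression_z_score(*baseline, *sample)
    return "rollback" if z > z_threshold else "continue"


# (errors, requests) before the push and one minute after it.
baseline = (20, 100_000)
print(decide(baseline, (22, 100_000)))   # small wobble
print(decide(baseline, (80, 100_000)))   # error rate quadrupled
```

A real push script would run a check like this per metric (load average, CPU usage, PHP errors) and roll back if any of them regresses; a high threshold trades a little sensitivity for fewer false rollbacks.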
The point is that Continuous Deployment is real. It works and it scales up to large clusters, large development teams and extremely agile environments.
And if you’re still wondering if we are a company that “has no customers”, I’d like to refer you to our million dollar a month revenue mohawks.



Your deployment script pushing to a subset of the cluster, monitoring for regressions, then doing a rollout/rollback is brilliant.
This post is fascinating to the point I wonder if you’ve given away too many secrets? Then again, the level of discipline and sophistication you are describing is so far beyond your typical web-developer type, it probably doesn’t matter.
I agree, the deployment and commit system you employ is ingenious.
ohhh Shit… Skynet!
drooolllll
brilliant – this is lean software development at its best. If you look at code that is not deployed as inventory, then you have gotten rid of a lot of it. At that pace, any problem in the code should surface quite fast (ie within the minutes after commit) and there is some urgency created to solve it.
My guess would be that you also have an ‘Andon’ system in place, i.e. a culture where a developer shouts for help as soon as there is a problem that is difficult for him to solve. I only wonder how you review the code that you push – are there any code reviews scheduled before pushing something? Because not only defects (ie failing tests) slow you down, but also unclean code and all the things that you stumble upon when you have to go back to it.
Nice work!
Do you handle schema changes separately, out of band in the model? An achilles heel of partial cluster deployment.
A detailed explanation of how to do this, one that other teams could use, might make a nice short book. Call it “Extreme Testing”.
[...] Continuous Integration just isn’t hardcore enough. What an amazing and fascinating place that must be to work, an environment where discipline to [...]
[...] post info By smspencer Categories: Future and Tech Tags: Awesome, Continuous Deployment, Hacker News, Sharding Just a quick update to comment on my finding of the Holy Grail of automated deployment. [...]
Awesome! I don’t buy that many technical books these days, but if you ever end up writing this up in 200-page form, please please let me know.
how do you make the changes atomically, using rsync? it operates only on one file at a time, afaik
stockst: Honestly, if the whole world caught up overnight and developed services to solve these problems I’d be ecstatic. We’d love to leverage someone else’s hard work for some of these problems.
clemens: There are code reviews, but they happen less often than I would like. Thankfully about half of our code is developed while pairing so there’s often a second set of eyes before the commit is ever made.
Jeremy Wohl: Yep, schema changes are done out of band. Just deploying them can be a huge pain. Doing an expensive alter on the master requires one-by-one applying it to our dozen read slaves (pulling them in and out of production traffic as you go), then applying it to the master’s standby and failing over. It’s a two day affair, not something you roll back from lightly. In the end we have relatively standard practices for schemas (a pseudo DBA who reviews all schema changes extensively) and sometimes that’s a bottleneck to agility. If I started this process today, I’d probably invest some time in testing the limits of distributed key value stores which in theory don’t have any expensive manual processes.
poifadfpoaidf: We have a fixed queue of 5 copies of the website on each frontend. We rsync with the “next” one and then when every frontend is rsync’d we go back through them all and flip a symlink over.
i believe imvu uses tdd so code reviews are a bit outdated
Clever! Especially the automatic load comparison and rollback.
pifadfpoaidf: They rsync the code into a separate folder, then flip a symlink to make it live. e.g. the document root is /var/www/current, which is a symlink to /var/www/1. The next deployment rsyncs to /var/www/2 and flips the /var/www/current symlink to point to that when ready. And so on.
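A minimal sketch of the symlink flip described in the comments above, assuming a POSIX filesystem where rename(2) atomically replaces an existing link. The function names and the temporary-link naming are illustrative and not taken from IMVU's scripts; the demo uses a throwaway directory in place of /var/www.

```python
import os
import tempfile


def flip_symlink(link_path, new_target):
    """Point link_path at new_target with no moment where the link is missing."""
    link_dir = os.path.dirname(os.path.abspath(link_path))
    tmp_link = os.path.join(
        link_dir, f".{os.path.basename(link_path)}.tmp.{os.getpid()}"
    )
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    # Build the new link under a temporary name first...
    os.symlink(new_target, tmp_link)
    # ...then rename it over the old one; the rename is the atomic step.
    os.replace(tmp_link, link_path)


def demo():
    """Create releases 1 and 2, go live on 1, then flip to 2."""
    with tempfile.TemporaryDirectory() as root:
        for release in ("1", "2"):
            os.mkdir(os.path.join(root, release))
        current = os.path.join(root, "current")
        flip_symlink(current, "1")
        flip_symlink(current, "2")
        return os.readlink(current)


print(demo())
```

Removing the old link and then creating a new one would leave a short window with no document root at all, which is why the sketch creates the new link under a temporary name and renames it into place.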
This is very cool, thanks for sharing. @joran, I don’t see how code reviews are outdated in a TDD environment. Even if your tests are perfect it’s still useful to have reviews to check that the tests are comprehensive, are testing the right things, and implementation is good.
joran: Code reviews can be important for the things that TDD won’t always tell you, like ‘this query you just wrote is a table scan’ or ‘this algorithm won’t scale to 1 million users’. Those kinds of issues tend to be the ones that won’t set off an automatic rollback because you don’t feel the pain until hours, days or weeks after you push them out to your cluster. Right now we use monitoring to pick up the slack, but it costs our operations team a lot of time.
That’s something I’ve been dreaming about for a long time, but it takes lots of work and discipline to achieve it and you did it. Congratulations!
Can we hope to have more details or at least a more streamlined methodology?
IMVU seems to have sprung out of There.com somehow. I remember when There was in development. Don Carson (formerly of Disney and Sierra Online), a friend and former coworker of mine, was an artist there. It had some very funny bugs early on when people were learning how to abuse it.
Nice to know it’s still operational. Always wondered what happened to it.
Great post, by the way.
One of your earlier criticisms was that you must not have any customers, but to me it seems that the more customers you have the better this process works and the more robust it is.
If you push your new code out to 1% of your users for a minute then the more users you have, the better chance you have of catching any problems and rolling back before a full roll out. However if you only have a handful of users online at any given time, the chances that something isn’t caught in that first minute seem much higher; you’d probably either need to increase that time or increase the exposure percentage, neither of which seems advantageous.
You want to keep the chance of a random user getting this code, and the time needed to adequately ensure the quality of it, to a minimum. It seems to me that the more users you have the lower these numbers become. Would you agree?
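The intuition in this comment can be put into numbers with a simple model. This is an illustration only: it assumes a bug that breaks a fixed fraction of requests, independently per request, and the function names are made up.

```python
import math


def detection_probability(requests_in_window, bug_hit_rate):
    """Chance that at least one request in the window trips the bug."""
    return 1 - (1 - bug_hit_rate) ** requests_in_window


def requests_needed(bug_hit_rate, confidence=0.95):
    """Smallest request count that reaches the given detection chance."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - bug_hit_rate))


# A bug that breaks 1% of requests to the affected page.
print(round(detection_probability(10, 0.01), 3))   # a handful of users
print(requests_needed(0.01))                       # traffic for 95% detection
```

Under this model the required number of requests depends only on how rare the bug is, so a busier site reaches it in a shorter window or with a smaller slice of the cluster, which is the trade-off the comment describes.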
Wow, that’s awesome stuff. Wanna share some of your scripts? Or sell them to me for beer money?
@poifadfpoaidf
To make atomic operations with rsync, you rsync to a copy of the destination directory, then switch a symlink over to point at your new directory.
How nice so what the real drawback.
No wait I see it you wrote it down.
“A symlink is switched on a small subset of the machines throwing the code live to its first few customers.”
Darn thats why my system slows up.
sounds like a neat process. What do the graphs of your release statistics (in ref. to statistically significant regression) look like?
Also, as a quibble
“Continuous Deployment isn’t just an abstract theory. At IMVU it’s a core part of our culture to ship. It’s also not a new technique here, we’ve been practicing continuous deployment for years; far longer than I’ve been a member of this startup.”
and then you link to
http://startuplessonslearned.blogspot.com/2008/09/lean-startup.html
which appears to be a random post about Lean development from less than one year ago. Doesn’t really back up your claim about walking uphill both ways through snow for millennia…
@ Nosredna : Don Carson still works with us ( http://avatars.imvu.com/Don ).
It’s so weird seeing buildbot off of the actual buildbot page. Tim, you should post one with a failed buildbot upgrade showing so it’s all red and yellow and purple!
Alexander Fairly: Yeah, that link was a little bit just thrown in there. That is the blog of our cofounder and former CTO. Lean Startup is a term he uses (and maybe coined?) to describe end-to-end agility in a startup. It’s the theory behind Lean Startups (and Lean Manufacturing before it) that led us to invest in Continuous Deployment.
Ben McGraw: Shhh, our tests _never_ fail because our code is perfect.
Mike Rooney:
Yeah, it’s true that more customers help in testing faster. We measure registrations in a rate per second, which means a minute of testing is guaranteed to exercise the registration flow quite a few times.
They also help in testing more subtle issues; they’ll smoke out all the edge cases you never thought about. That also means you can run more subtle A/B tests, which has been invaluable to our business.
[...] Continuous Deployment at IMVU: Doing the impossible fifty times a day. « Timothy Fitz (tags: development agile) [...]
[...] I’ve thought about continuous integration. This incredibly inspiring post about continuous deployment has me really thinking hard about how my changes actually get pushed up out of dev to production. [...]
Thanks for this post. It’s definitely inspiring.
However, what do you do about schema changes? The need to upgrade data in production (and the difficulty of both effectively testing downgraders and preserving information that earlier schemas did not account for) has always been my biggest difficulty in pushing new changes this quickly.
Glyph: Unfortunately, we’re not much more agile than anybody else in that respect. As long as the data sets are small we have tools that make applying schema changes painless, but as soon as the alters get expensive it gets awkward. (Explained in more detail http://news.ycombinator.com/item?id=474862)
My guess is that we could build a decent automated system by having a replicated stand by machine just for schema updates. You’d apply new schemas to the stand by and then automatically fail over to it. If you have to roll back, you just swap back the original machine, after which you have to rebuild the stand by from backups or some other expensive recovery option. You’d still need yet another safety net, in case of mysql crashes or other shenanigans, so you’d end up with a 2nd stand by machine (3 hosts per database role).
We just don’t have to do that many expensive schema updates to warrant this type of system… yet
Very interesting, and food for thought.
I am wondering how this would work for complex business process applications – what we do. It seems that to use your approach requires that test plans and scripts have to be well thought-out before hand, and that test development is given as much value as code development.
It requires quite a culture change in a development organisation.
Fantastic! I have been striving to do something similar for a long time, but have never come as close as you guys. I’m very impressed.
One of my biggest problems with this has been, like you mention, tests that break too often or for the wrong reason. Are you using any special tricks to elevate the quality of the tests, or is it just hard work (it usually is, but one can hope)?
Continuous Deployment is something I’ve been advocating for a few years now
http://www.hpl.hp.com/techreports/2002/HPL-2002-274.html
http://wiki.smartfrog.org/wiki/display/sf/Pattern+-+Continuous+Deployment
For it to work you need a good staging/test system, and the ability to rollback fast.
Where it creates problems is that it can create unrealistic expectations of how rapidly fixes can be pushed out, hence how rapidly problems can be fixed. Saying a site update is overnight (fully automated) takes the stress off the developers to find a quick and dirty solution. Otherwise within 15 minutes of a strategic partner reporting a bug, they are on the phone asking if the fix is live.
[...] the reddit comments on Timothy’s second continuous integration article, many of the posters bemoan the fact that he works at IMVU (yes, I work at IMVU too): “This [...]
If people want some idea of what’s possible check out Autopilot:
http://blogs.zdnet.com/microsoft/?p=1160
http://research.microsoft.com/apps/pubs/default.aspx?id=64604
This beast is in some part behind the Live Services stuff MS are running…..
Nice linkbaiting!
>Every commit should be instantly deployed to production.
from your original post implies commits go straight to production, sans testing, and you got the response the idea deserves.
> deploying code changes to production as rapidly as possible.
As rapidly as possible is very different to instantly, and leaves scope for any amount of testing. But still, I guess it got you to the front page of news.YC so congrats.
“Nice linkbaiting!
>Every commit should be instantly deployed to production.
from your original post implies commits go straight to production, sans testing, and you got the response the idea deserves.
> deploying code changes to production as rapidly as possible.
As rapidly as possible is very different to instantly, and leaves scope for any amount of testing. But still, I guess it got you to the front page of news.YC so congrats.”
Implies is not the same as explicit statement. Making assumptions based on implication and following up with debate therefore makes little sense.
Seemingly people made the same mistake with the interpretation of “instantly”.
Your comment implies some things about you (love the fact you don’t put your name up): Should I assume the implications are true?
[...] Deployment at IMVU and a tale from Pirum Jump to Comments I’ve just finished reading an excellent article on Continuous Deployment. This is the way of the future. Now that Continuous Integration has become (almost) mainstream and [...]
[...] Continuous Deployment at IMVU: Doing the impossible fifty times a day – “The high level of our process is dead simple: Continuously integrate (commit early and often). On commit automatically run all tests. If the tests pass deploy to the cluster. If the deploy succeeds, repeat.” [...]
Wow, impressive strategy you’ve got here. I can imagine the difficulty is in the details, but getting this level of near real-time deployment is worth it. I’d like to see a detailed write-up on this process and the challenges you faced in implementing it reliably. Good stuff.
[...] This is awesome, if a little insane. Continuous Deployment at IMVU: Doing the impossible fifty times a day: [...]
Excellent Blog. I would love to get to this point for our site. I was wondering what everyone uses to test web applications, httpUnit, is that the standard?
If the site’s look and feel changes significantly, does this have a huge impact on the test suite? Any thoughts would be greatly appreciated.
OMFG! My o panties are not SFW after reading this!
I can see where continuous deployment would be possible (if a lot of work) for a web site where all the bits live on your servers and there are no persistent connections from the outside world. Do you follow the same practices for new versions of the client and the servers that it connects to? How do you do that without knocking everyone offline multiple times a day and annoying your customers?
Where in your process do you catch usability problems and non-functional problems like grammar mistakes?
Paul, there is no reason why you couldn’t still have QA people. They could just be using the production environment instead of a staging environment, and wouldn’t be required as a step in between developing and pushing.
…………
…………
…………
…………
after reading this
[...] this week Eric Ries and Timothy Fitz posted on a technique called Continuous Deployment that they use at IMVU. Timothy describes it [...]
[...] Continuous Deployment at IMVU: Doing the impossible fifty times a day. « Timothy Fitz [...]
[...] the most dramatic examples I have relate to our test and deploy infrastructure, I’ll skip rehashing those [...]
How do you deal with the risk of losing or corrupting data due to a bug that wasn’t caught? I don’t know your transaction rates, but running bad code even on a subset of machines for 1 minute could mean loss of thousands of items (whatever it is in your app).
Nosredna: There.com still exists. It’s a different company. What it has in common with IMVU is we were founded by the same guy, Will Harvey. He brought the lessons he learned at There to IMVU, and also brought a lot of the best employees, Don included.
[...] Want to be in a functional business environment yet, as a team of engineers, ship code fifty times a day to a live and heavily used service? [...]
What software do you recommend for building a test suite?
[...] Lots of software companies do continuous integration, but how many do continuous deployment? The folks IMVU have an incredibly inspiring deployment process. [...]
[...] Fritz has a very interesting blog post on Continuous Deployment at IMVU (subtitled “Doing the impossible fifty times a day”), detailing how all committed code [...]
[...] software for pretty much as long as I can remember. This made me all the more interested to read this post telling how IMVU, a ‘avatar based chat site’ go programatically straight from commit to [...]
[...] linearly, until it’s just too valuable to pass up. You can push the problem off somewhat by investing in hardware, and you can save quite a lot of engineer-muscle by doing this, but there’s eventually going [...]
[...] Continuous Deployment at IMVU The high level of our process is dead simple: Continuously integrate (commit early and often). On commit automatically run all tests. If the tests pass deploy to the cluster. If the deploy succeeds, repeat. [...]
We have been doing it for Parabuild (using Parabuild itself) for quite some time. All builds are automatically built, tested and deployed for public access to the production website.
[...] team is practicing test-driven development. At the beginning, it’s going great. You’re more agile than you ever imagined. Everyone writes tests, tests pass, and everyone’s confidence level in the codebase is high. [...]
[...] really enjoyed Timothy Fitz’s new blog, he’s sold me on Continuous Deployment, named the benefits automation I never put my finger on, and [...]
[...] Continuous Deployment at IMVU: Doing the impossible fifty times a day is another fascinating article. It describes a project that has 4.4 hours of automated tests, including an hour of tests that involve running Internet Explorer instances an simulating a user clicking and typing. By running them across dozens of servers, they can run all the tests in 9 minutes. What’s more, they have so much faith in these tests that their code is automatically deployed to live when the tests pass. This happens 50 times a day on average. Brilliant! [...]
“The code is now live and fully pushed. This whole process is simple enough that it’s implemented by a handful of shell scripts.”
I hope you have at least as much test coverage for that handful of scripts! Otherwise, the rest of the work is pretty much moot.
This is awesome. Are you guys hiring?
This may work for you, but I would advise strong caution to others. Code tests are great, we also use lots of tests and the commit early and often credo. Nevertheless, our QA department was VERY good at picking out edge cases and the occasional other thing that developers had not anticipated. (“What happens if I rapidly double-click this submit button?” )
If you are going to try this you had better have excellent review of your code to cover edge cases, and you had better have lots of integration and view tests. Then, MAYBE this will work for you (and even then you will miss some edge cases). Otherwise, for most houses, this would be a major waste of time and money.
[...] Continuous Deployment – insane [...]
Thanks for the very interesting post. I recommend Yehuda Katz’s talks on testing: “Writing code that doesn’t suck” (RubyConf 2008) and “Testing Merb applications” (MerbCamp 2008). He presents a good argument for regression testing, as well as specific advice such as “test what you care about” and “concentrate on public APIs”.
I wonder if, in your test suite, you have any tests that simply check for general characteristics (e.g., broken HTML or links) of selected “real pages”. This would seem to fit well into your continuous testing approach.
[...] IMVU, we deploy code fifty times a day. The code you just wrote goes out to the production cluster without waiting for QE people to sign [...]
[...] and quality By dionad Michael over at Developsense has an interesting reply to a recent blog entry regarding continuous [...]
sadly this style of development sucks if you actually want a decent service from a user perspective as every code change has multiple unexpected side effects and the normal state of the IM engine is glitchy (and that’s being very charitable). for each bug “fixed” or “improvement made” at least 2 – 3 other bugs are introduced. it’s very frustrating and makes imvu look more than a little incompetent after over 4 years as a “beta” (LOL).
[...] Continuous Deployment at IMVU: Doing the impossible fifty times a day. « Timothy Fitz The Machine | Tags:agile, development Continuous Deployment at IMVU: Doing the impossible fifty times a day. « Timothy Fitz. [...]
[...] there was a lot of noise about troop movement in Iraq. Lately, however, things have become much more [...]
[...] Continuous Deployment at IMVU: Doing the Impossible Fifty Times a Day [...]
[...] is reported to do so every nine minutes. The report has created a fair amount of controversy. Whether you are or are not in favor of such [...]
This looks really cool and quite extreme. State of the art stuff.
[...] Continuous Integration does not cause Low Quality Since Timothy first wrote about it, IMVU’s bleeding edge continuous integration strategy got a lot of attention. Some of it has [...]
[...] It has enabled companies I’ve worked with to deploy new code to production as often as fifty times every day. Continuous deployment is controversial. Most people, when they first hear about continuous [...]
[...] Day of all days. However, clicking through to the article he linked to led me to a couple of blog entries written back in February, and a Google blog search turned up several dated hits indicating that [...]
[...] run across multiple machines and run them in parallel. We’ve been impressed by the likes of IMVU and the guys at weplay have spiked a mechanism for doing this with Cucumber test suites. [...]
[...] it, running tests against it, packaging it, deploying it (I was particularly inspired when I read IMVU’s doing the impossible 50 times a day…!) – all of that good stuff that, without computers, would fail miserably because humans just [...]
[...] It’s important to note that system I’m about to explain evolved organically in response to new demands on the system and in response to post-mortems of failures. Nobody gets here overnight, but every step along the way has made us better developers. (Fitz) [...]
[...] Blog de Timothy Fitz: Continuous Deployment at IMVU: Doing the impossible fifty times a day [...]
[...] 5 whys, split testing and other topics. I would have liked to spend a bit more time discussing continuous deployment, but I did get some more insight into how they got started with CD at [...]
[...] attending Eric Ries‘ talk on the Lean Startup, I started thinking about how to work towards continuous deployment within Sophos. Note that I say “work towards” and not “achieve” – for my [...]
Cool… I liked it.
Request you to write a descriptive article on writing the test cases with quality you described. It will help a lot of people including me.
[...] recently received a link to a compelling blog article on continuous deployment at IMVU. Continuous deployment methodologies are really capable of disrupting the traditional enterprise [...]
[...] Total startup costs are plummeting — it costs less than $10,000 to launch a new, web-based product. Using the latest technology, a lean startup can create product prototypes in weeks and months, not years, and use customer feedback to evolve them in near-real time. Releases are measured in minutes and hours, not days and weeks -– in some cases, lean startups are releasing new code to production 50 times a day. [...]
Cool site, love the info.
[...] awesome MochiWeb web server. My goals for this project include using test driven development and continuous deployment and fully embracing the mantra of “release early, release [...]
If the software fails less than once in a million test case runs, then it is indeed top-quality software. But…if someone has to deploy software 50 times a day, you know how bad the quality already is – I mean, well-designed, well-engineered software doesn’t deserve to be re-deployed so many times a day. So, full marks to the engineering services team for an excellent CI and automated test system, and no cookie for the engineering team. Haven’t the customers called the bluff yet?
This seems to be a bad case of hack-test-hack-test…endless cycle. Deploying something 50 times a day doesn’t condone bad engineering. And AFAIK, it is definitely LEANer but not Lean
[...] while ago I read an excellent blog post from Timothy Fitz called Continuous Deployment at IMVU: Doing the impossible fifty times a day, that got me started on trying to improve the way we deploy our websites at UKFast but just as [...]