Or how to achieve bi-weekly releases at SaaS scale
Almost exactly a year ago I joined RiverMeadow, a self-described “cloud mobility” SaaS platform, with the modest task of improving a product I was told was “ready for market,” supported by a team that only needed “some improvement, and a better adoption of Engineering best practices.” In slightly over six months, however, we had to completely redesign the architecture, rewrite the entire code base and, most critically, move away from a failing release cycle to achieve my stated goal of “one release every Sprint.”
In this series of blog posts I would like to share what I learned, what the team achieved and what tools we used to accomplish this remarkable turnaround, with a special focus on releases, testing and reliability.
The starting point
To understand the distance we’ve come, it’s important to understand where we started and what the challenges were. From my (admittedly limited) startup experience, this seems to be a very common pattern, and I’m hoping that by sharing my experiences here (almost identical to what I had to endure at SnapLogic, but hopefully with a more rewarding outcome) others will be able to benefit and, who knows, avoid these mistakes in the first place.
Badly burned by the SnapLogic experience, before joining RiverMeadow I had even asked for (and obtained) access to the code base to review it; to this day, I’m still amazed at how badly I missed the obvious signs of impending doom. The only reasonable explanation I can imagine is that I wanted to leave SnapLogic so desperately that my brain refused to recognize what was otherwise plain to see.
The original so-called “SaaS platform” was in fact a single Java Tomcat app (a WAR) calling out to a bewildering array of intricate Bash shell scripts for the low-level operations; a smattering of C code whose purpose and raison d’être, to this day, still escape us; and a persistence layer built on HSQLDB (I’m not kidding).
Unit testing was virtually non-existent and, for the Bash scripts, impossible to implement. To make matters worse, the scripts had accumulated over the years: some were no longer used; for others we weren’t quite sure; and others still were used, but we couldn’t tell whether what they did served any purpose.
There was no multi-tenancy (at one point, one of the founders decided that all we had to do to implement multi-tenancy was to create sub-directories for the server images; again, I’m not making this up), and security was equally absent (the team had been using Samba for file transfers, and was only now slowly moving to SSH).
It should then come as no surprise that release dates had been missed, by months, and that even planning releases was a challenge: product planning, features and roadmap were all up in the air, and it was virtually impossible to plan ahead. The culprits were the intricacy of the code; the inter-dependencies; and, most critically, the absence of unit, functional and automated integration testing (there was a minimal, and, I’d soon discover, massively inadequate, amount of manual testing). We often found ourselves in the crazy situation where fixing a bug, or adding a feature in one part of the product, would cause unexpected breakage in other, unrelated areas, and we were never sure whether those were undiscovered bugs suddenly exposed or new ones we had introduced.
Finally, we were only able to support a handful of OS platforms (4 Linux distros and 4-5 Windows 2k3/8 variants), migrating to a single target cloud (vCloud).
So, dear reader, how do you go from there to a bi-weekly release cycle, with full, secure multi-tenancy, Enterprise-grade security, a fully scalable architecture, a highly-available (HA) persistence layer using Cassandra and MongoDB (for replication and multi-DC support, both for redundancy and geo-diversity), and support for 20 QA-certified OS variants (and counting: we should have 24 or more in the next couple of weeks), migrating to four cloud targets: vCloud, vSphere, Amazon AWS and OpenStack (and soon CloudStack too)?
Not only that, but we continue to add features and functionality: we just enabled the ability to include/exclude directories from a migration; we’ll soon have post-migration file-sync capabilities; the ability for users to define pre- and post-migration custom operations; metrics, monitoring and reporting; and so on.
How we did it
The theme for this series will be that it takes a combination of tools, people and processes to accomplish change at “Internet speed,” and that changing only one (or two) of those ingredients will not be sufficient: they all have to change.
Tools have to be “modern” and appropriate to delivering distributed products at scale; people must embrace a culture of rapid development, code integrity and comprehensive automated testing; and, finally, (lightweight) processes (supported by the tools and embraced by the people) must be put in place, so that we can iterate quickly, fix mistakes as they happen and move the product forward at the speed the market dictates.
It has not been easy and it has taken a lot of effort from everyone on the team, but the end result is that we now have unprecedented opportunities with some of the largest cloud providers and OEMs, and we are well on our way to disrupting the market for server migrations to the Cloud.
In the next entries, I will share what we did and how we did it.