From null to a hundred in six months (4)

Or how to achieve bi-weekly releases at SaaS scale

Part IV – Processes

This is Part IV of a series, see Other Parts in the Series

Code Reviews

When you join Google, you cannot commit code to the main codebase – no matter how senior or experienced you are (I suspect they did not allow exemptions for James Gosling or Josh Block either).

In order to do so, one needs to get past a Code Readability qualification (a sort of internal certification [1]): this essentially means that (1) you understand, internalize and practice strict adherence to the Google guidelines and (2) that a (much) more senior developer has been reviewing your code and finds it acceptable; at any rate, not even after attaining status as a “readability holder” is one allowed to commit code to main trunk without first a rigorous code review by other project members, at least one of which must approve of the changes (aka, a LGTM).

Having seen the power of such a strict (but, overall, fair and straightforward) policy in maintaining order across such a vast codebase, composed of all programming languages known to mankind, and some possibly now even forgotten, across a globally disperse and culturally diverse developers population, little wonder that I’ve become a fervent believer.

In other words, in my team, we enforce strict adherence to generally-accepted coding guidelines (obviously, we favor the Google guidelines) and no code gets committed without code review and peer approval.

The process is simple and straightforward, leveraging the combined power (and integration) of github and RBCommons:

  • new features and bug fixes are branched off develop, with the branch named in a way that allows easy reference back to the original bug or story ID (eg, for bug ME-123 which is about enabling user logout, one would git checkout -b ME-123_user_logout);
  • once the code is complete, unit tests are written (possibly functional tests are added / modified) the code is submitted for review to RB (rbt post) and some individuals (who are deemed to be familiar with that portion of the system) are chosen are named reviewers, whilst the entire Eng group is (sort of) cc’d into the review;
  • the reviewers (as well as other “interested parties”) can comment on the code, style, design, approach, test and documentation – once objections are addressed and the code is deemed to be in great shape, one (and most typically, two) reviewers indicate their agreement by providing a Ship It and then the code can be merged into develop

There are a few strict guidelines (non-negotiables):

  • no trailing/leading whitespace (and, by God, no tabs!);
  • all public methods should have Javadoc/epydoc (unless obvious and/or getters/setters);
  • adherence to Sun Java style / PEP8 Python guidelines;
  • no lines to exceed 100 chars;

We have just implemented automatic checking of PEP8 compliance, and violations will prevent developers to commit code (see also the note about ratcheting).

The above may seem trivial, even overkill and alarmingly heavy-handed: let me assure it is not; once you get used to follow a certain coding style and conventions, violating those will stick out like a sore thumb, while compliance will come natural; most importantly most modern IDEs (and, dare I say this, even vim) have not automated checking and the ability to automatically correct for violations.

However, this is invaluable once your code base grows over a few thousand lines of codes [2] and one must rely on automated tools to find (and correct) issues, refactoring, etc. (the pervasive use of grep and Regular Expressions being a main driver of choice).


Ask anyone building large distributed systems, made up of many loosely coupled components, developing in an Agile fashion against a set of rapidly evolving requirements and fast moving markets, and they’ll tell you there are really only three important things to look out for:

  • Testing;
  • Testing; and
  • Testing.

And I should add, when I talk about “testing,” what I really mean is Automated Testing; manual testing is a sort of oxymoron in my world and a failing strategy: for one, it does not scale; for another it’s time consuming and non-reproducible; but, worst of all, it relies on the lowest form or intelligence at our disposal as programmers: that of carbon-based lifeforms.

If you recall from Part 3 , I mentioned our QA team working hard to automated our full integration tests suite: we’ll soon be able to run nightly integration tests (with full automated validation of results) across a vast array of OS platforms (>20, as of last count – and growing) being migrated to all major clouds (vCloud, Openstack, AWS and soon Cloudstack): obviously, the resulting compliance matrix grows combinatorially and it would be impossible to run (let alone analyze results) manually.

This is my approach to create a team and a system that allows for frequent releases (we are currently on a bi-weekly cycle, we may decide to reduce this even further):

  • all the code must be unit tested: measure coverage, choose a target that feels “right” for your team/product and keep ratcheting [3] – developers are responsible for unit tests, they must be run often and they must run fast (less than a minute or two);

  • all the systems must expose a REST API, and the endpoints must be fully tested (functional tests) with other participating systems either mocked out, or running “stubs” – these must be fast too, fully executed in less than 5-10 minutes (but really, if it takes more than a handful of minutes, start asking hard questions of your tests – or your implementation, for that matter). These are maintained by the QA team, with heavy involvement of the Dev team;

    By and large, these are black box tests, and only assume knowledge of (and adherence to) the API contract; however, when testing specific aspects or functionality, or trying to isolate specific test cases, one is allowed to “cheat” and use knowledge of the inner workings of the DUT.

  • integration tests must be comprehensive, realistic and, most importantly, black box: in other words, they must be developed without any assumed knowledge of the system and, most critically, without any shared libraries with the product under test: assuming that your system exposes REST APIs (and, if it doesn’t, please stop reading my blog, you’re soiling my pages) those should be exercised according to the API contract, and the response matched against what the API promises.

    Integration tests should be run in a handful of hours (much less, if possible at all) and ideally after every commits to the master branch (aka develop, or main trunk to you, SVN desaparicidos). In practice, as commits to the main branch may happen with greater frequency that the duration of the integration tests’ execution allow, you should have both a Sanity set (run after every commit and testing a subset of all possible cases) and a Nightly set run (you guessed it) every night: the whole point is to catch code changes that cause integration breakages as soon as possible and as close as possible to the “culprit” commit, so that reverting (or fixing) them is easiest (not to mention, figuring out what caused the breakage in the first place).

    QA owns integration tests: they develop them, they build the infrastructure to keep them running and they fix and expand them as needs change and grow.

Be that as it may, all three kind of tests must rely on automatic verification and flagging of errors, with a suitable notification policy and clear resolution process.

Build Cop

As good practice, one (or two) members of the development team (not QA) ought to be on the hook to be notified upon build breakages (see also below): the duty is typically on a rotation basis and does not necessarily need to be 24/7 (extended hours is usually sufficient).

The duty of the Build Cop is not of fixing test failures (although, trivial ones, or those for which she knows how to – she should take action) but rather to be on the hook to receive the alert, identify the root cause (or as close as she can get) and then assign the resolution to the team member who is most likely to be able to fix the breakage (or investigate further).

One of the indirect benefits is that all the members of the team are forced to learn about all the components of the systems, at least to the level of being able to investigating test failures and build breakages; this is particularly important in systems such as RiverMeadow‘s, which employ several programming languages; across a distributed system; with diverse build/packaging technologies.

To implement a Build Cop rotation, I recommend PagerDuty .

The Ultimate Goal: Continuos Integration

Once individuals with the right mindset have formed what is hopefully an awesome team, then it’s possible to introduce real CI (as opposed to imaginary one, which essentially consists of having a random guy run some Jenkins job in the hope to catch a bug somewhere).

Real CI involves much more than setting up a build server (we use TeamCity) and watching it run random builds: certainly, creating and maintaining a comprehensive and sane set of build jobs that gets automatically executed (typically triggered by commits to certain branches, eg develop or release-*), drive specific suites of tests (as mentioned, for example, one may run a Sanity test every time there is a git commit to the origin/develop branch) and notification to be sent when failure occurs.

However, what really matters here is people’s attitude: when a failure occurs and a notification is sent out, this is not “QA’s job” or “it worked on my machine” – it is treated as a serious issue, usually with several folks jumping in (needless to say, the “blamed” –in the git blame sense of whoever committed last– developer being the target of benevolent hilarity, hence having a pressing motive to find the root cause and fix it… quickly!).

We also have a Build Cop weekly rotation whose main job is to monitor the build failure notifications, conduct a preliminary analysis and possibly a quick fix or, if that is not possible, figure out who is the best person who can look into it and, ideally, fix the breakage.


By implementing all the above, and following the practices outlined in Part 2 and Part 3 we have overcome the nightmare described in Part 1, developed a fully distributed, scalable system, driven by REST APIs, deployed across several data centres globally and capable of serving potentially tens of thousand of Enterprise server migrations across all the major Cloud platforms, for more than 20 OS editions (and counting!) while keeping our sanity and having fun.

This is what happens when you have great people, the right tools and sane processes.

Other Parts in the Series


[1] Yours truly proudly holds both Java and C++ Google Readabilities. Yay!
[2] Interestingly enough, I don’t know how many LOCs compose our current codebase: for one, the code is spread across several repositories, for the other, I don’t much give weight to LOCs as an interesting metric. But maybe I’ll find out and update this blog entry.

ratcheting is a term I heard the first time from my good friend and esteemed colleague, Chris Wagner, then-CTO at SnapLogic and currently a Juniper Networks VP, Fellow: it means essentially that you continually keep raising the bar on a given metric (eg, test coverage) and never allow it to slip back: this is particularly useful when dealing with legacy or poorly-maintained code bases, where imposing an arbitrary high bar outright may discourage folks from even trying (or cause an unacceptable loss of productivity) but you are unwilling to compromise and have a goal to eventually arrive at the “high bar.”

Maybe there’s an entire blog post I should write up about this…



4 thoughts on “From null to a hundred in six months (4)

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s