A Python notebook to experiment with the Apache Mesos HTTP API – Part 1 of 3

This is the first of a series of three articles showing how to set up a Vagrant-based Apache Mesos test/development environment on your laptop; then how to run a Python notebook against the HTTP API; and finally, how to launch Docker containers on the running Agent VM.

It is pretty jam-packed and requires a certain amount of familiarity with concepts around containers, VMs and Mesos, but I am taking the time to show all the intermediate steps (hence the three parts), and it should be easy to follow even if you have never used Vagrant, Mesos or Jupyter notebooks before.

A certain degree of familiarity with Python, requests and handling HTTP responses is certainly going to be helpful, as we will not be going into too much detail there.

All the code is available on my zk-mesos git repository:

git clone git@github.com:massenz/zk-mesos.git

and you can also see the README there.

This series is an extended and updated version of the talk I gave at MesosCon Europe 2015, revised for Apache Mesos 1.0.0, which has just been released (August 2016) – you can also find the slides there.

Getting Started

In order to follow along, you will need to clone the repository (as shown above) and install Virtualbox and Vagrant: they are both super-easy to get going, please follow the instructions on their respective sites and you’ll be up and running (literally) in no time.

I also recommend quickly scanning the Vagrant documentation: knowledge of Vagrant beyond vagrant up is not really required to get the most out of this series, but it may help if you get stuck (or would like to experiment and improve on our Vagrantfile).

If you are not familiar with Apache Mesos, I recommend having a look at the project’s site: there are also a couple of good books out there, Mesos in Action being the one I would recommend (also having been one of the manuscript’s reviewers).

We will not be building it from source here, but will instead use the Mesosphere packages: you don’t need to download them; the Vagrantfile will automatically download and install them on the VMs.

To run the Python notebook, we will take advantage of the Jupyter packages, and use a virtualenv to run all our code: the latter is not strictly necessary, but it will prevent you from messing up your system Python.

The steps are pretty simple, and YMMV, but if you have never used virtualenv before:

$ sudo pip install virtualenv

and then create and run a virtualenv:

$ cd zk-mesos
$ virtualenv mesos-demo
$ source mesos-demo/bin/activate
$ pip install -r requirements.txt

Finally, verify that you can run and load the Jupyter notebook:

$ jupyter notebook

this should automatically open your browser and point it to http://localhost:8888, from where you can select the notebooks/Demo-API.ipynb — don’t run it just yet, but if it shows up, it will confirm that your Python setup is just fine.

Building and installing Apache Mesos

Here is where the beauty of Vagrant shows in all its glory: installing the Apache Mesos Master and Agent is not trivial, but in our case it’s simply a matter of:

$ cd vagrant
$ vagrant up

(make sure to be in the same directory as the Vagrantfile when issuing any of the Vagrant commands, or it will complain about it).

It is worth noting that we are building two Vagrant boxes, so any command will operate on both unless you specify the name of the VM after the command; for example, to SSH onto the Agent:

$ vagrant ssh Agent

should log you in on that box, from where you can explore, experiment and diagnose any issues.

The vagrant up command will take some time to execute, but it should eventually lead your Virtualbox to have two VMs, named respectively mesos-master and mesos-agent – incidentally, you should never need to use VBox to manage them (all the tasks can be undertaken via Vagrant commands), but you can do that too, if necessary or desired.

Once the VMs are built, ensure you can access the Mesos Web UI on the Master VM (port 5050);
you should also see one Agent running, accessible either via the Master UI, or directly on the Agent VM (port 5051).

NOTE

the Agent runs at a different IP (obviously) than the Master, but also on a different port (5051 instead of 5050): look into vagrant/run-agent.sh to see a few of the command line flags that we use to run the Agent (and in run-master.sh for the Master).
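If you prefer to check from the command line, a quick sanity test can also be run from the virtualenv using requests against the /state endpoint that Mesos 1.0 exposes on both Master and Agent; the IP addresses below are placeholders – replace them with whatever addresses your Vagrantfile assigns to the two VMs:

import requests

# Placeholder addresses: substitute the IPs assigned to the VMs by the Vagrantfile.
MASTER = "http://192.168.33.10:5050"
AGENT = "http://192.168.33.11:5051"

for name, url in [("Master", MASTER), ("Agent", AGENT)]:
    # Both Master and Agent expose their state as JSON at /state.
    resp = requests.get("{}/state".format(url))
    resp.raise_for_status()
    print("{} is up, running Mesos {}".format(name, resp.json().get("version")))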

Zookeeper

It’s worth noting that we are also running an instance of Zookeeper (for Leader election and Master/Agent coordination) on the mesos-master VM, inside a Docker container: partly because we can, but also to show how easy it is to do so using containers.

This one line (in run-master.sh) will give you a perfectly good ZK instance (albeit a catastrophically unreliable one for a production environment, where you want to run at least 3-5 nodes, on physically separate machines/racks):

docker run -d --name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 jplock/zookeeper:3.4.8
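If you want to convince yourself that ZooKeeper is really up and reachable from the host, a quick check can be done with the kazoo client (install it with pip if it is not already in requirements.txt); the address below is a placeholder for whatever IP your Vagrantfile assigns to the mesos-master VM:

from kazoo.client import KazooClient

# Placeholder address: use the IP of the mesos-master VM, where the ZK container runs.
zk = KazooClient(hosts="192.168.33.10:2181")
zk.start(timeout=5)
print("Znodes at the root: {}".format(zk.get_children("/")))
zk.stop()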

Wrap up

That’s pretty much it: you are now the proud owner of a two-node (Master/Agent) Apache Mesos deployment: welcome to the same league as Twitter and Airbnb production wizards.

In Part 2, we will run our Python notebook against the Master API and will accept the Agent’s offers to launch a Docker container.

Python Magic Methods

Intro

In his excellent Fluent Python book, Luciano Ramalho talks about Python’s “data model” and gives some excellent examples of how the language’s internal consistency is achieved via the judicious use of a well-defined API and, in particular, how Python’s “magic methods” enable the construction of elegant solutions that are concise and highly readable.

And while you can find countless examples online of how to implement the iterative magic methods (__iter__() and friends), here I wanted to present an example of how to use two of the lesser known magic methods: __del__() and __call__().

For those familiar with C++, these implement two very familiar patterns: the destructor and the function object (aka, operator()).

Implement a self-destructing key

Note

The full code is available in the filecrypt GitHub repository, and it is more fully explained in this blog entry.

Let’s say that we want to design an encryption key which will be in turn encrypted with a master key and whose “plaintext” value will only be used “in flight” to encrypt and decrypt our data, but will otherwise only be stored encrypted.

There are many reasons why one may want to do this, but the most common is when the data to be encrypted is very large and time-consuming to encrypt: should the master key be compromised, we could revoke it and re-encrypt the (possibly many) one-time encryption keys with a new master key, without incurring the time penalty of having to decrypt and re-encrypt possibly several TBs of data.

In fact, re-encrypting the encryption keys may be so inexpensive (computationally or time-wise) that this could be done on a regular basis, rotating the master key at frequent intervals (e.g., weekly).

If we use OpenSSL command-line tools to do all the encryption and decryption tasks, we need to temporarily store the encryption key as “plaintext” in a file, which we will securely destroy (using the shred Linux tool).

Note

We use the term “plaintext” to signify that the contents are decrypted, not to mean plain text format: the key is still binary data but, if obtained at that stage by an attacker, it would not be protected with encryption.

However, just implementing the call to the shredding utility as the last step in our encryption algorithm would not be sufficient to ensure that it is executed under all possible code paths: there may be errors, exceptions raised, the user may terminate the program gracefully (Ctrl-C) or abruptly (SIGKILL), and so on.

Guarding against all possibilities is not only tiresome, but also error-prone: how about instead having the Python interpreter do the hard work for us, and ensure that certain actions are always undertaken when the object is garbage collected?

Note

The technique shown here will not work for the SIGKILL case (aka kill -9) for which a more advanced technique (signal handlers) needs to be employed.

The idea is to create a class which implements the __del__() magic method, which is guaranteed to be invoked when there are no further references to the object and it is garbage-collected (the exact timing of that happening is implementation dependent, but if you try this in common Python interpreters, it seems to be almost instantaneous).

This is what happens on a macOS laptop, running El Capitan and Python 2.7:

$ python
Python 2.7.10 (default, Oct 23 2015, 19:19:21) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
>>> class Foo():
...     def __del__(self):
...         print("I'm gone, goodbye!")
... 
>>> foo = Foo()
>>> bar = foo
>>> foo = None
>>> bar = 99
I'm gone, goodbye!
>>> another = Foo()
>>> ^D
I'm gone, goodbye!
$

As you can see, the “destructor” method is invoked either when there are no longer any references to the object (once both foo and bar stop referring to the first instance) or when the interpreter exits (for the instance still referenced by another).

The following code fragment shows how we ended up implementing our “self-encrypting” key (I called it SelfDestructKey because the real feature is that it destroys the plaintext version of the encryption key upon exit):

This is a much simplified version of the code, focusing only on the __del__() method; please refer to the full version in the repository for the complete code.

# Note: `mkstemp` comes from the `tempfile` module, while `openssl`, `shred` and
# `ErrorReturnCode` are provided by the `sh` package (see the full code in the repository).
import os
from tempfile import mkstemp
from sh import openssl, shred, ErrorReturnCode


class SelfDestructKey(object):
    """A self-destructing key: it will shred its contents when it gets deleted.

       This key also encrypts itself with the given key before writing itself out to a file.
    """

    def __init__(self, encrypted_key, keypair):
        """Creates an encryption key, using the given keypair to encrypt/decrypt it.

        The plaintext version of this key is kept in a temporary file that will be securely
        destroyed upon this object becoming garbage collected.

        :param encrypted_key the encrypted version of this key is kept in this file: if it
            does not exist, it will be created when this key is saved
        :param keypair a tuple containing the (private, public) key pair that will be used to
            decrypt and encrypt (respectively) this key.
        :type keypair collections.namedtuple (Keypair)
        """
        self._plaintext = mkstemp()[1]
        self.encrypted = encrypted_key
        self.key_pair = keypair
        if not os.path.exists(encrypted_key):
            openssl('rand', '32', '-out', self._plaintext)
        else:
            with open(self._plaintext, 'w') as self_decrypted:
                openssl('rsautl', '-decrypt', '-inkey', keypair.private, _in=encrypted_key,
                        _out=self_decrypted)

    def __str__(self):
        return self._plaintext

    def __del__(self):
        try:
            if not os.path.exists(self.encrypted):
                self._save()
            shred(self._plaintext)
        except ErrorReturnCode as rcode:
            raise RuntimeError(
                "Either we could not save encrypted or not shred the plaintext passphrase "
                "in file {plain} to file {enc}.  You will have to securely delete the plaintext "
                "version using something like `shred -uz {plain}".format(
                    plain=self._plaintext, enc=self.encrypted))

    def _save(self):
        """ Encrypts the contents of the key and writes it out to disk.

        :param dest: the full path of the file that will hold the encrypted contents of this key.
        :param key: the name of the file that holds an encryption key (the PUBLIC part of a key pair).
        :return: None
        """
        if not os.path.exists(self.key_pair.public):
            raise RuntimeError("Encryption key file '%s' not found" % self.key_pair.public)
        with open(self._plaintext, 'rb') as selfkey:
            openssl('rsautl', '-encrypt', '-pubin', '-inkey', self.key_pair.public,
                    _in=selfkey, _out=self.encrypted)

Also, note how I have implemented the __str__() method, so that I can get the name of the file containing the plaintext key by just invoking:

passphrase = SelfDestructKey(secret_file, keypair=keys)
encryptor = FileEncryptor(
    secret_keyfile=str(passphrase), 
    plain_file=plaintext,
    dest_dir=enc_cfg.out)

Obviously, we could have just as easily implemented the __str__() method to return the actual contents of the encryption key.

Be that as it may, if you look at the code that uses the encryption key, at no point do we need to invoke the _save() method or directly invoke the shred utility; this will all be taken care of by the interpreter when either passphrase goes out of scope, or the script terminates (normally or abnormally).

Implement the Command Pattern with a Callable

Python has the concept of a callable, which is essentially “something that can be invoked as if it were a function” (this follows the Duck Typing approach: “if it looks like a function, and can be called like a function, then it is a function”).

To make instances of a class behave as callables, all we need to do is define a __call__() method and then implement it as any other “ordinary” class method.
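As a minimal illustration (a toy example, unrelated to the code that follows), any instance of the class below can be invoked exactly as if it were a function, while still carrying its own state around:

class Adder(object):
    """Accumulates all the values it is called with."""

    def __init__(self, start=0):
        self.total = start

    def __call__(self, value):
        self.total += value
        return self.total

add = Adder()
add(3)         # invoked like a function...
print(add(4))  # ...prints 7: the instance keeps its state between calls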

Say that we want to implement a “command runner” script that (similarly to, for example, git) can take a set of sub-commands and execute them: one approach could be to use the Command Pattern in our CommandRunner class:

class CommandRunner(object):
    """Implements the Command pattern, with the help of the __call__() magic method."""

    def __init__(self, config):
        """Initiailize the Runner with the configuration from parsing the command line.

           :param config the command-line arguments, as parsed by ``argparse``
           :type config Namespace
        """
        self._config = config

    def __call__(self):
        method = self._config.cmd
        if hasattr(self, method):
            callable_meth = self.__getattribute__(method)
            if callable_meth:
                callable_meth()
        else:
            raise RuntimeError('Unexpected command "{}"; not found'.format(method))

    def run(self):
        # Do something with the files
        pass

    def build(self):
        # Call an external method that takes a list of files
        build(self._config.files)

    def diff(self):
        """Will compute the diff between the two files passed in"""
        if self._config.files and len(self._config.files) == 2:
            file_a, file_b = tuple(self._config.files)
            diff_files(file_a, file_b)
        else:
            raise RuntimeError("Not enough arguments for diff: 2 expected, {} found".format(
                len(self._config.files) if self._config.files else 'none'))

    def diff_all(self):
        # This will take a variable number of files and will diff them all
        pass

The config initialization argument is a Namespace object as returned by the argparse library:

def parse_command():
    """ Parse command line arguments and returns a configuration object

    :return: the configured options, or `None` if just printing help.
    :rtype: Namespace or None
    """
    parser = argparse.ArgumentParser()

    # Removed the `help` argument for better readability; make sure you
    # always include that to help your user, when they invoke your script
    # with the `--help` flag.
    parser.add_argument('--host', default='localhost')
    parser.add_argument('-p', '--port', type=int, default=8080)
    parser.add_argument('--workdir', default=default_wkdir)

    parser.add_argument('cmd', default='run', choices=['run', 'build', 'diff', 'diff_all'])
    parser.add_argument('files', nargs=argparse.REMAINDER)
    return parser.parse_args()

To invoke this script we would use something like:

$ ./main.py run my_file.py

or:

$ ./main.py diff file_1.md another_file.md

It is also worth pointing out how we protect against errors, using the built-in hasattr() function together with the __getattribute__() magic method:

if hasattr(self, method):
    callable_meth = self.__getattribute__(method)

Note that we could have used the __getattr__() magic method to define the behavior of the class when attempting to access non-existing attributes, but in this case it was probably easier to do that at the point of call.
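For the curious, here is a sketch of what that alternative could look like; this is a hypothetical variation (not how CommandRunner is actually implemented), where __getattr__() supplies a fallback for any command name that does not resolve to a real method:

class LenientRunner(object):
    """Like CommandRunner, but unknown commands are handled via __getattr__()."""

    def __init__(self, config):
        self._config = config

    def __call__(self):
        # No hasattr() guard needed: __getattr__() below kicks in for missing names.
        getattr(self, self._config.cmd)()

    def run(self):
        print("Running with {}".format(self._config.files))

    def __getattr__(self, name):
        # Only called when `name` is not found via the normal attribute lookup.
        def _unknown():
            raise RuntimeError('Unexpected command "{}"; not found'.format(name))
        return _unknown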

Given that we are telling argparse to limit the possible values to the choices when parsing the cmd argument, we are guaranteed that we will never get an “unknown” command; however, the CommandRunner class does not need to know this, and it can be used in other instances where we do not have such a guarantee (not to mention that we are only one typo away from some very puzzling bug, had we not done our homework in __call__()).

To make all this work, we then only need to implement a trivial __main__ snippet:

if __name__ == '__main__':
    cfg = parse_command()

    try:
        runner = CommandRunner(cfg)
        runner()  # Looks like a function, let's use it like one.
    except Exception as ex:
        logging.error("Could not execute command `{}`: {}".format(cfg.cmd, ex))
        exit(1)

Note how we invoke the runner as if it were a method: this will in turn execute the __call__() method and run the desired command.

We truly hope everyone agrees this is much more pleasant code to look at than monstrosities such as:

# DON'T DO THIS AT HOME
# Please avoid castle-of-ifs, they are just plain ugly.
if cfg.cmd == "build":
    pass  # do something to build
elif cfg.cmd == "run":
    pass  # do something to run
elif cfg.cmd == "diff":
    pass  # do something to diff
elif cfg.cmd == "diff_all":
    pass  # do something to diff_all
else:
    print("Unknown command", cfg.cmd)

Conclusion

Learning about Python’s “magic methods” will make your code not only easier to read and re-use in different situations, but also more “pythonic” and immediately recognizable to other fellow pythonistas, thus making your intent clearer to understand and reason about.

filecrypt – OpenSSL file encryption

overview

Uses the OpenSSL library to encrypt a file using a private/public key pair and a one-time secret.

A full description of the process can be found here.

configuration

This uses a YAML file to describe the configuration; by default it assumes it is in /etc/filecrypt/conf.yml but its location can be specified using the -f flag.

The structure of the conf.yml file is as follows:

keys:
    private: /home/bob/.ssh/secret.pem
    public: /home/bob/.ssh/secret.pub
    secrets: /opt/store/

store: /home/bob/encrypt/stores.csv

# Where to store the encrypted file; the folder MUST already exist and the user
# have write permissions.
out: /data/store/enc

# Whether to securely delete the original plain-text file (optional, default true).
shred: false

The private/public keys are a key-pair generated using the openssl genrsa command; the encryption key used to actually encrypt the file will be created in the secrets folder, and afterward encrypted using the public key and stored in the location provided.

The name will be pass-key-nnn.enc, where nnn is a random value between 000 and 999 that has not already been used for a file in that folder.
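A minimal sketch of how such a name could be picked (illustrative only; the actual implementation in the repository may differ):

import os
import random

def new_keyfile_name(secrets_dir):
    """Returns a path of the form <secrets_dir>/pass-key-nnn.enc not already in use."""
    while True:
        candidate = os.path.join(
            secrets_dir, "pass-key-{:03d}.enc".format(random.randint(0, 999)))
        if not os.path.exists(candidate):
            return candidate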

The name of the secret passphrase can also be defined by the user, using the --secret option (specify the full path, it will be left unmodified):

  • if it does not exist, a random secure one will be created, used for encryption, then encrypted and saved at the given path, while the temporary plain-text version is securely destroyed; OR

  • if it is the name of an already existing file, it will be decrypted, used to encrypt the file,
    then left unchanged on disk.

NOTE we recommend NOT re-using encryption passphrases; always generate a new secret.

NOTE it is currently not possible to specify a plain-text passphrase: we always assume that
the given file has been encrypted using the private key.

The store file is a CSV list of:

"Original archive","Encryption key","Encrypted archive"
201511_data.tar.gz,/opt/store/pass-key-001.enc,201511_data.tar.gz.enc

a new line will be appended at the end; any comments will be left unchanged.
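Appending to the store file is nothing more exotic than adding one CSV row; a sketch of the idea (illustrative only, the actual implementation may differ):

def record_encryption(store_path, original, keyfile, encrypted):
    # Appending preserves whatever is already in the file, comments included.
    with open(store_path, "a") as store:
        store.write("{},{},{}\n".format(original, keyfile, encrypted))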

usage

Always use the --help option to see the most up-to-date options available; anyway, the basic
usage is (assuming the example configuration shown above is saved in /opt/enc/conf.yml):

filecrypt.py -f /opt/enc/conf.yml /data/store/201511_data.tar.gz

will create an encrypted copy of the file, stored as /data/store/201511_data.tar.gz.enc;
the original file will not be securely destroyed (shred is set to false in the example configuration), and the new encryption key will be stored, encrypted, in /opt/store/pass-key-778.enc.

A new line will be appended to /home/bob/encrypt/stores.csv:

/data/store/201511_data.tar,pass-key-778.enc,/data/store/201511_data.tar.gz.enc

IMPORTANT

We recommend testing your configuration and command-line options on test files: shred erases files in a way that is not recoverable: if you mess up, you will lose data.

You have been warned.

code

The code has been uploaded to github.

See the requirements.txt to install required Python libraries:

pip install -r requirements.txt

(the use of a virtualenv is recommended here).

To install OpenSSL in Ubuntu see this page, but it boils down essentially to:

 sudo apt-get install -y openssl

references

This is the post on GitHub; it describes a few useful and worthwhile features that have been added.

To install on Ubuntu 14.04 (and newer – possibly older too, but I haven’t tested it):

sudo add-apt-repository ppa:git-core/ppa
sudo apt-get update && sudo apt-get install git

then you should see the newest version:

$ git --version
git version 2.9.0

Enjoy!

Implementing A Merkle Tree in C++

Update: the code is now available on GitHub: https://github.com/massenz/distlib.

Introduction

A Merkle Tree is a tree structure where the individual nodes’ hashes are computed and stored in the parent node, so that the hash stored in the root node is the only value that is required to later validate the contents of the tree (stored in the leaves).

Merkle trees are used in the Blockchain and are described in greater detail in Mastering bitcoin (Chapter 7, “Merkle Trees”):

Merkle trees are used in bitcoin to summarize all the transactions in a block, producing
an overall digital fingerprint of the entire set of transactions, providing a very efficient process to verify if a transaction is included in a block.

A merkle tree is constructed by recursively hashing pairs of nodes until there is only one hash, called the root, or merkle root.

The cryptographic hash algorithm used in bitcoin’s merkle trees is SHA256 applied twice, also known as double-SHA256.

In the following, I will use a Merkle tree as an example of how to write a template class that takes the hashing function and the resulting hash length as template arguments, along with T, the type of the leaf node’s value.

Building a Merkle Tree in C++

The class signature is as follows:

template <typename T, char* (hash_func)(const T&), size_t hash_len>
class MerkleNode {
    ...
};

where hash_func is a function that takes an element of type T and generates a hash made of hash_len chars.

As an example of a hash function that uses MD5 (implemented by the OpenSSL library), we have:

char *hash_str_func(const std::string& value) {
  unsigned char* buf;
  size_t len = basic_hash(value.c_str(), value.length(), &buf);
  assert(len == MD5_DIGEST_LENGTH);
  return (char *)buf;
}

and:

size_t basic_hash(const char* value, size_t len, unsigned char** hash_value) {
  *hash_value = new unsigned char[MD5_DIGEST_LENGTH];
  MD5((unsigned char *)value, len, *hash_value);

  return MD5_DIGEST_LENGTH;
}

with those functions in place, you can then define a MD5StringMerkleNode implementation of a Merkle Tree using the MD5 hashing function:

class MD5StringMerkleNode : public MerkleNode<std::string, hash_str_func, MD5_DIGEST_LENGTH> { ... };

MerkleNode class declaration

The base (abstract) class is declared in a way that leaves to the concrete implementation the responsibility of implementing the actual hashing functionality (including computing the hash from the children’s hashes – typically, but not exclusively, by hashing their concatenation) and just implements the common functionality which is independent from the type of T or the hash function:

// I have shortened some doxygen comments and omitted some of the various accessor
// methods for simplicity's sake.
template <typename T, char *(hash_func)(const T &), size_t hash_len>
class MerkleNode {
protected:

  std::unique_ptr<const MerkleNode> left_, right_;
  const char *hash_;
  const std::shared_ptr<T> value_;

  /**
   * Computes the hash of the children nodes' respective hashes.
   * In other words, given a node N, with children (N1, N2), whose hash values are,
   * respectively, H1 and H2, computes:
   *
   *     H = hash(H1 || H2)
   *
   * where `||` represents the concatenation operation.
   * If the `right_` descendant is missing, it will simply duplicate the `left_` node's hash.
   *
   * For a "leaf" node (both descendants missing), it will use the `hash_func()` to compute the
   * hash of the stored `value_`.
   */
  virtual const char *computeHash() const = 0;

public:

  /**
   * Builds a "leaf" node, with no descendants and a `value` that will be copied and stored.
   * We also compute the hash (`hash_func()`) of the value and store it in this node.
   *
   * We assume ownership of the pointer returned by the `hash_func()` which we assume has been
   * freshly allocated, and will be released upon destruction of this node.
   */
  MerkleNode(const T &value) : value_(new T(value)), left_(nullptr), right_(nullptr) {
    hash_ = hash_func(value);
  }

  /**
   * Creates an intermediate node, storing the descendants as well as computing the compound hash.
   */
  MerkleNode(const MerkleNode *left,
             const MerkleNode *right) :
      left_(left), right_(right), value_(nullptr) {
  }

  /**
   * Deletes the memory pointed to by the `hash_` pointer: if your `hash_func()` and/or the
   * `computeHash()` method implementation do not allocate this memory (or you do not wish to
   * free the allocated memory) remember to override this destructor's behavior too.
   */
  virtual ~MerkleNode() {
    if (hash_) delete[](hash_);
  }

  /**
   * Recursively validate the subtree rooted in this node: if executed at the root of the tree,
   * it will validate the entire tree.
   */
  virtual bool validate() const;

  /**
   * The length of the hash, also part of the template instantiation (`hash_len`).
   *
   * @see hash_len
   */
  size_t len() const { return hash_len; }

  /**
   * Returns the buffer containing the hash value, of `len()` bytes length.
   *
   * @see len()
   */
  const char *hash() const { return hash_; }

  bool hasChildren() const {
    return left_ || right_;
  }

  const MerkleNode * left() const { return left_.get(); }
  const MerkleNode * right() const { return right_.get(); }
};

A couple of interesting points can be noted:

  • there are two fundamentally different constructors: one for the leaf nodes, the other for the intermediate nodes;

  • the method to actually compute the hashes of the children nodes is defined pure virtual: this must be implemented by the concrete derived class(es), and invoked in the constructor, where noted;

  • finally, we can only use this class with T types that allow copy construction (this is, in practice, similar to the requirement on STL containers’ elements).

Validating the tree

The validate() method will tell us whether one or more of the leaf nodes has been modified (this will cause all the hashes in the path from the root to the leaf to be invalid):

template <typename T, char *(hash_func)(const T &), size_t hash_len>
bool MerkleNode<T, hash_func, hash_len>::validate() const {
    // If either child is not valid, the entire subtree is invalid too.
    if (left_ && !left_->validate()) {
      return false;
    }
    if (right_ && !right_->validate()) {
      return false;
    }

    std::unique_ptr<const char> computedHash(hasChildren() ? computeHash() : hash_func(*value_));
    return memcmp(hash_, computedHash.get(), len()) == 0;
}

Building a Merkle Tree

Assuming that we have a list of T objects, we can then build a tree from them:

// Recursive implementation of the build algorithm.
template<typename NodeType>
const NodeType *build_(NodeType *nodes[], size_t len) {

  if (len == 1) return new NodeType(nodes[0], nullptr);
  if (len == 2) return new NodeType(nodes[0], nodes[1]);

  size_t half = len % 2 == 0 ? len / 2 : len / 2 + 1;
  return new NodeType(build_(nodes, half), build_(nodes + half, len - half));
}

template<typename T, typename NodeType>
const NodeType *build(const std::list<T> &values) {

  NodeType *leaves[values.size()];
  int i = 0;
  for (auto value : values) {
    leaves[i++] = new NodeType(value);
  }

  return build_(leaves, values.size());
};

Testing our Merkle Tree

There are many tests that can be run against a tree; the ones presented here are just a small fraction, and are also meant to show how a Merkle tree would be used in code:

TEST(MerkleNodeTests, CannotForgeTree) {
  std::string sl[]{"spring", "winter", "summer", "fall"};
  MD5StringMerkleNode *leaves[4];

  for (int i = 0; i < 4; ++i) {
    leaves[i] = new MD5StringMerkleNode(sl[i]);
  }

  MD5StringMerkleNode *a = new MD5StringMerkleNode(leaves[0], leaves[1]);
  MD5StringMerkleNode *b = new MD5StringMerkleNode(leaves[2], leaves[3]);
  MD5StringMerkleNode root(a, b);

  EXPECT_TRUE(root.hasChildren());
  EXPECT_FALSE(b->left() == nullptr);
  EXPECT_TRUE(root.validate());

  MD5StringMerkleNode *fake = new MD5StringMerkleNode("autumn");
  b->setNode(fake, MD5StringMerkleNode::RIGHT);
  EXPECT_FALSE(root.validate());
}

TEST(MerkleNodeTests, hasChildren) {
  MD5StringMerkleNode *a = new MD5StringMerkleNode("test");
  EXPECT_FALSE(a->hasChildren());

  MD5StringMerkleNode *b = new MD5StringMerkleNode("another node");
  MD5StringMerkleNode parent(a, b);
  EXPECT_TRUE(parent.hasChildren());
}

TEST(MerkleNodeTests, Equals) {
  MD5StringMerkleNode a("test");
  MD5StringMerkleNode b("test");

  // First, must be equal to itself.
  EXPECT_EQ(a, a);

  // Then equal to an identically constructed other.
  EXPECT_EQ(a, b);

  // Then, non-equal to a different one.
  MD5StringMerkleNode c("not a test");
  EXPECT_NE(a, c);
}


TEST(MerkleNodeTests, Build) {
  std::string sl[]{"one", "two", "three", "four"};

  std::list<std::string> nodes;
  for (int i = 0; i < 4; ++i) {
    nodes.push_back(sl[i]);
  }

  const MD5StringMerkleNode *root = build<std::string, MD5StringMerkleNode>(nodes);
  ASSERT_NE(nullptr, root);
  EXPECT_TRUE(root->validate());
  delete (root);
}

Note As we use std::unique_ptr pointers when building the tree, it is important that
we only pass new-allocated pointers (and not pointers to objects allocated on the stack) or
we will get an invalid pointer exception when the std::unique_ptr destructor calls free()
(ultimately).

The tests above use the Google Test framework, which I presented in a previous article, along with some issues faced when using CDT in Eclipse.

Checking for memory leaks

I have used valgrind to test for memory leaks: unfortunately, running
it against the tests causes the results to be polluted by the (many) leaks the Google Test
framework introduces.

So, I wrote a simple program that builds a couple of Merkle trees, and then ran it under
valgrind:

int main(int argc, char *argv[]) {

  list<string> nodes;
  const MD5StringMerkleNode* root;

  // Initialize Google Logs and send logs to stderr too (in addition
  // to the default file location; default: /tmp).
  google::InitGoogleLogging(argv[0]);
  FLAGS_logtostderr = 1;

  if (argc == 1) {
    usage(argv[0]);
  }

  for (int i = 0; i < 33; ++i) {
    nodes.push_back("node #" + std::to_string(i));
  }

  root = build<std::string, MD5StringMerkleNode>(nodes);

  bool valid = root->validate();
  LOG(INFO) << "The tree is " << (valid ? "" : "not ") << "valid" << endl;
  delete(root);

  return valid ? EXIT_SUCCESS : EXIT_FAILURE;
}

And this was the result:

$ valgrind --leak-check=yes ./brick "so, maybe..."
==11921== Memcheck, a memory error detector
==11921== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==11921== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==11921==
I0626 01:18:23.546921 11921 main.cpp:52] The tree is valid
==11921==
==11921== HEAP SUMMARY:
==11921==     in use at exit: 0 bytes in 0 blocks
==11921==   total heap usage: 380 allocs, 380 frees, 103,280 bytes allocated
==11921==
==11921== All heap blocks were freed -- no leaks are possible
==11921==
==11921== For counts of detected and suppressed errors, rerun with: -v
==11921== ERROR SUMMARY: 66 errors from 1 contexts (suppressed: 0 from 0)

Conclusion

When I embarked on this journey, I thought this should have been rather straightforward to implement, but it turned out to be a rather more ambitious task, mostly due to the complexity of C++ templating syntax, and the quirks of implementing methods where the hashing function was abstracted away in a template argument (as opposed to the simpler case of just being a functor).

I am hoping that this post will help others looking to implement something similar.

Using Java 8 Optional and Functional interfaces to reduce boilerplate

A few years back I got curious about Scala, and fiddled with it a bit – becoming, in the process, very excited about the functional programming style.
(This is available in Python too – but it’s not as well developed IMO, and anyway Python is best suited to short-scope scripting – even though I make heavy use of its OOP abilities.)

Java 8 is a good attempt at bringing the functional style to the JVM too, and I’m getting more and more used to it – still not quite as “native feel” as in Scala, but not too shoddy either.

I’ve recently been much vexed by some code I was writing that was not very DRY and required a lot of boilerplate: repetitive (and thus error-prone), boring code; to wit, something like this:

         VersionedStore vval = store.get(userId);

         if (vval != null) {
            PBData myData = PBData.newBuilder(vval.getValue())
                .addCard(ccard)
                .setLastUpdatedMillis(System.currentTimeMillis())
                .build();
            store.putVersioned(userId, vval.newValue(myData));
        } else {
            PBData myData = PBData.newBuilder()
                .setUserId(userId)
                .addCard(ccard)
                .setLastUpdatedMillis(System.currentTimeMillis())
                .build();
            store.putVersioned(userId, store.createInitialVersioned(myData));
        }

over and over again, very similar code repeated in various methods in the same class.

This typically occurs when dealing with Google Protocol Buffers and “versioned data stores” (such as LinkedIn’s Voldemort) that both deal generally with immutable objects, make heavy use of the Builder Pattern and require every store action to be effected via a “versioned value” (essentially, wrapping a Vector Clock object).

The (double) challenge here is that we might, or might not, have a value in the store, and depending on that we need to (a) use slightly different newBuilder() signatures and (b) slightly different, but very similar, store.put() methods.

I totally hated it – so I tried to figure out a way to factor it out into its own method:

private void transformAndUpdate(Long userId,
        Function<PBData.Builder, PBData.Builder> transform) {

    // Fetch the current versioned value (may be null, if no record exists yet).
    VersionedStore vval = store.get(userId);

    PBData.Builder builder = Optional.ofNullable(vval)
        .map(versioned -> {
            // This builder factory method takes a "prototype" and
            // initializes all the known fields with the prototype's values.
            return PBData.newBuilder(versioned.getValue());
        })
        // This one, instead, creates an empty protobuf, with default values.
        .orElse(PBData.newBuilder().setUserId(userId));

    // Because this object is used in two places (that cannot be "collapsed")
    // we extract it to a temporary variable, to make the code slightly less
    // verbose.
    PBData myData = transform.apply(builder)
          .setLastUpdatedMillis(System.currentTimeMillis())
          .build();

    store.putVersioned(userId,
        Optional.ofNullable(vval)
            .map(versioned -> { return versioned.newValue(myData); })
            .orElse(store.createInitialVersioned(myData)));
}

The transform functional takes the builder, applies whatever mutations are defined by the caller and returns the same (generally, but not necessarily) builder for further processing.
We simply note the update time and then build the protobuf.

Finally, we put back in the store the original (modified) “versioned” value we
had retrieved earlier, or a brand new one, if none was found.

Note the use of ofNullable() which will not throw even if vval
is null.

When you remove the comment lines and other formatting sugar, this is slightly less than 10 LOC, in ONE place, easy to update, maintain and improve, if need be.
(the code above can be further streamlined, but there’s a fine line between writing beautifully lean code and showing utter disregard for “the next guy”…)

With that, all we need now in every method that wishes to insert or update a record is essentially a one-liner that implements the functional interface:

@Override
public void addUserCard(Long userId, PBCreditCard ccard) {
    try {
        transformAndUpdate(userId, (builder -> {return builder.addCard(ccard);}));
    } catch (BDObsoleteVersionException e) {
        LOG.error("Could not add '{}' to [{}] payment methods: {}", ccard.getCode(), userId, e.getMessage());
    }
}

The code above makes judicious use of Optionals, Functionals and the new Streaming interfaces.
It’s all very awesome, and makes me wonder: Java, what took you so long?

To be honest, I still quite dislike the fact that (unlike Python, C++ or Scala) one still needs to use the apply() method on a functional interface, instead of simply adding the operator ( ); however, if that’s the price to pay for preventing people from defining their own custom operators, well, that’s a damn small price to pay!