Implementing A Merkle Tree in C++

Updated The code is now available on github: https://github.com/massenz/distlib.

Introduction

A Merkle Tree is a tree structure where the individual nodes’ hashes are computed and stored in the parent node, so that the hash stored in the root node is the only value that is required to later validate the contents of the tree (stored in the leaves).

Merkle trees are used in the Blockchain and are described in greater detail in Mastering bitcoin (Chapter 7, “Merkle Trees”):

Merkle trees are used in bitcoin to summarize all the transactions in a block, producing
an overall digital fingerprint of the entire set of transactions, providing a very efficient process to verify if a transaction is included in a block.

A merkle tree is constructed by recursively hashing pairs of nodes until there is only one hash, called the root, or merkle root.

The cryptographic hash algorithm used in bitcoin’s merkle trees is SHA256 applied twice, also known as double-SHA256.

In the following I will use a Merkle Tree as an example of how to write a template class that takes the hashing function and the resulting hash length as template arguments, along with T the type of the leaf node’s value.

Building a Merkle Tree in C++

The class signature is as follows:

template <typename T, char* (hash_func)(const T&), size_t hash_len>
class MerkleNode {
    ...
}

where hash_func is a function that takes an element of type T and generates a hash made of hash_len chars.

As an example of a hash function that uses MD5 (implemented by the OpenSSL library), we have:

char *hash_str_func(const std::string& value) {
  unsigned char* buf;
  size_t len = basic_hash(value.c_str(), value.length(), &buf);
  assert(len == MD5_DIGEST_LENGTH);
  return (char *)buf;
}

and:

size_t basic_hash(const char* value, size_t len, unsigned char** hash_value) {
  *hash_value = new unsigned char[MD5_DIGEST_LENGTH];
  MD5((unsigned char *)value, len, *hash_value);

  return MD5_DIGEST_LENGTH;
}

with those functions in place, you can then define a MD5StringMerkleNode implementation of a Merkle Tree using the MD5 hashing function:

class MD5StringMerkleNode : public MerkleNode<std::string, hash_str_func, MD5_DIGEST_LENGTH>;

MerkleNode class declaration

The base (abstract) class is declared in a way that leaves to the concrete implementation the responsibility of implementing the actual hashing functionality (including computing the hash from the children’s hashes – typically, but not exclusively, by hashing their concatenation) and just implements the common functionality which is independent from the type of T or the hash function:

// I have shortened some doxygen comments and omitted some of the various accessor
// methods for simplicity's sake.
template
class MerkleNode {
protected:

  std::unique_ptr<const MerkleNode> left_, right_;
  const char *hash_;
  const std::shared_ptr value_;

  /**
   * Computes the hash of the children nodes' respective hashes.
   * In other words, given a node N, with children (N1, N2), whose hash values are,
   * respectively, H1 and H2, computes:
   *
   *     H = hash(H1 || H2)
   *
   * where `||` represents the concatenation operation.
   * If the `right_` descendant is missing, it will simply duplicate the `left_` node's hash.
   *
   * For a "leaf" node (both descendants missing), it will use the `hash_func()` to compute the
   * hash of the stored `value_`.
   */
  virtual const char *computeHash() const = 0;

public:

  /**
   * Builds a "leaf" node, with no descendants and a `value` that will be copied and stored.
   * We also compute the hash (`hash_func()`) of the value and store it in this node.
   *
   * We assume ownership of the pointer returned by the `hash_func()` which we assume has been
   * freshly allocated, and will be released upon destruction of this node.
   */
  MerkleNode(const T &value) : value_(new T(value)), left_(nullptr), right_(nullptr) {
    hash_ = hash_func(value);
  }

  /**
   * Creates an intermediate node, storing the descendants as well as computing the compound hash.
   */
  MerkleNode(const MerkleNode *left,
             const MerkleNode *right) :
      left_(left), right_(right), value_(nullptr) {
  }

  /**
   * Deletes the memory pointed to by the `hash_` pointer: if your `hash_func()` and/or the
   * `computeHash()` method implementation do not allocate this memory (or you do not wish to
   * free the allocated memory) remember to override this destructor's behavior too.
   */
  virtual ~MerkleNode() {
    if (hash_) delete[](hash_);
  }

  /**
   * Recursively validate the subtree rooted in this node: if executed at the root of the tree,
   * it will validate the entire tree.
   */
  virtual bool validate() const;

  /**
   * The length of the hash, also part of the template instantiation (`hash_len`).
   *
   * @see hash_len
   */
  size_t len() const { return hash_len; }

  /**
   * Returns the buffer containing the hash value, of `len()` bytes length.
   *
   * @see len()
   */
  const char *hash() const { return hash_; }

  bool hasChildren() const {
    return left_ || right_;
  }

  const MerkleNode * left() const { return left_.get(); }
  const MerkleNode * right() const { return right_.get(); }
};

A couple of interesting points can be noted:

  • there are two fundamentally different constructors: one for the leaf nodes, the other for the intermediate nodes;

  • the method to actually compute the hashes of the children nodes is defined pure virtual: this must be implemented by the concrete derived class(es), and invoked in the constructor, where noted;

  • finally, we can only use this class with T types that allow copy construction (this is, in practice, similar to the requirement on STL containers’ elements).

Validating the tree

The validate() method will tell us whether one or more of the leaf nodes has been modified (this will cause all the hashes in the path from the root to the leaf to be invalid):

bool MerkleNode<T, hash_func, hash_len>::validate() const {
    // If either child is not valid, the entire subtree is invalid too.
    if (left_ && !left_->validate()) {
      return false;
    }
    if (right_ && !right_->validate()) {
      return false;
    }

    std::unique_ptr<const char> computedHash(hasChildren() ? computeHash() : hash_func(*value_));
    return memcmp(hash_, computedHash.get(), len()) == 0;
}

Building a Merkle Tree

Assuming that we have a list of T objects, we can then build a tree from them:

// Recursive implementation of the build algorithm.
template<typename NodeType>
const NodeType *build_(NodeType *nodes[], size_t len) {

  if (len == 1) return new NodeType(nodes[0], nullptr);
  if (len == 2) return new NodeType(nodes[0], nodes[1]);

  size_t half = len % 2 == 0 ? len / 2 : len / 2 + 1;
  return new NodeType(build_(nodes, half), build_(nodes + half, len - half));
}

template<typename T, typename NodeType>
const NodeType *build(const std::list<T> &values) {

  NodeType *leaves[values.size()];
  int i = 0;
  for (auto value : values) {
    leaves[i++] = new NodeType(value);
  }

  return build_(leaves, values.size());
};

Testing our Merkle Tree

There are many tests that can be done against a tree; the ones presented here are just a small fraction and are only meant to show also how a MerkleTree would be used in code:

TEST(MerkleNodeTests, CannotForgeTree) {
  std::string sl[]{"spring", "winter", "summer", "fall"};
  MD5StringMerkleNode *leaves[4];

  for (int i = 0; i < 4; ++i) {
    leaves[i] = new MD5StringMerkleNode(sl[i]);
  }

  MD5StringMerkleNode *a = new MD5StringMerkleNode(leaves[0], leaves[1]);
  MD5StringMerkleNode *b = new MD5StringMerkleNode(leaves[2], leaves[3]);
  MD5StringMerkleNode root(a, b);

  EXPECT_TRUE(root.hasChildren());
  EXPECT_FALSE(b->left() == nullptr);
  EXPECT_TRUE(root.validate());

  MD5StringMerkleNode *fake = new MD5StringMerkleNode("autumn");
  b->setNode(fake, MD5StringMerkleNode::RIGHT);
  EXPECT_FALSE(root.validate());
}

TEST(MerkleNodeTests, hasChildren) {
  MD5StringMerkleNode *a = new MD5StringMerkleNode("test");
  EXPECT_FALSE(a->hasChildren());

  MD5StringMerkleNode *b = new MD5StringMerkleNode("another node");
  MD5StringMerkleNode parent(a, b);
  EXPECT_TRUE(parent.hasChildren());
}

TEST(MerkleNodeTests, Equals) {
  MD5StringMerkleNode a("test");
  MD5StringMerkleNode b("test");

  // First, must be equal to itself.
  EXPECT_EQ(a, a);

  // Then equal to an identically constructed other.
  EXPECT_EQ(a, b);

  // Then, non-equal to a different one.
  MD5StringMerkleNode c("not a test");
  EXPECT_NE(a, c);
}


TEST(MerkleNodeTests, Build) {
  std::string sl[]{"one", "two", "three", "four"};

  std::list<std::string> nodes;
  for (int i = 0; i < 4; ++i) {
    nodes.push_back(sl[i]);
  }

  const MD5StringMerkleNode *root = build<std::string, MD5StringMerkleNode>(nodes);
  ASSERT_NE(nullptr, root);
  EXPECT_TRUE(root->validate());
  delete (root);
}

Note As we use std::unique_ptr pointers when building the tree, it is important that
we only pass new-allocated pointers (and not pointers to objects allocated on the stack) or
we will get an invalid pointer exception when the std::unique_ptr destructor calls free()
(ultimately).

The tests above use the Google Test framework, which I have been presenting in a previous article along with some issues faced when using CDT in Eclipse.

Checking for memory leaks

I have used valgrind to test for memory leaks: unfortunately, running
it against the tests causes the results to be polluted by the (many) leaks the Google Test
framework introduces.

So, I wrote a simple programs that builds a couple of Merkle Trees, and then I’ve run it under
valgrind:

int main(int argc, char *argv[]) {

  list<string> nodes;
  const MD5StringMerkleNode* root;

  // Initialize Google Logs and send logs to stderr too (in addition
  // to the default file location; default: /tmp).
  google::InitGoogleLogging(argv[0]);
  FLAGS_logtostderr = 1;

  if (argc == 1) {
    usage(argv[0]);
  }

  for (int i = 0; i < 33; ++i) {
    nodes.push_back("node #" + std::to_string(i));
  }

  root = build<std::string, MD5StringMerkleNode>(nodes);

  bool valid = root->validate();
  LOG(INFO) << "The tree is " << (valid ? "" : "not ") << "valid" << endl;
  delete(root);

  return valid ? EXIT_SUCCESS : EXIT_FAILURE;
}

And this was the result:

$ valgrind --leak-check=yes ./brick "so, maybe..."
==11921== Memcheck, a memory error detector
==11921== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==11921== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==11921==
I0626 01:18:23.546921 11921 main.cpp:52] The tree is valid
==11921==
==11921== HEAP SUMMARY:
==11921==     in use at exit: 0 bytes in 0 blocks
==11921==   total heap usage: 380 allocs, 380 frees, 103,280 bytes allocated
==11921==
==11921== All heap blocks were freed -- no leaks are possible
==11921==
==11921== For counts of detected and suppressed errors, rerun with: -v
==11921== ERROR SUMMARY: 66 errors from 1 contexts (suppressed: 0 from 0)

Conclusion

When I embarked on this journey, I thought this should have been rather straightforward to implement, but it turned out to be a rather more ambitious task, mostly due to the complexity of C++ templating syntax, and the quirks of implementing methods where the hashing function was abstracted away in a template argument (as opposed to the simpler case of just being a functor).

I am hoping that this post will help other looking to implement something similar.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s