Gossip-based Failure Detectors – 0.8.0 released

We have just released version 0.8.0 of our distlib C++ library that implements a number of popular distributed systems algorithms.

IT13-DLR-020216-Shutterstock
Server, Servers… Servers Everywhere!

This concludes a first iteration of the “SWIM Gossip Protocol” for failure detection: the implementation is now fully thread-safe and can scale up to many servers with minimal process overhead.

The gossip_example.cpp toy server shows how a client application can use the library to implement an embedded (or external) failure detector for a large fleet of servers:

auto detector = std::make_shared(
    port,
    kDefaultUpdateIntervalSec,
    kDefaultGracePeriodSec,
    ping_timeout_msec,
    ping_interval_sec
);

std::for_each(seedNames.begin(), seedNames.end(),
    [&](const std::string &name) {
         ...
         detector->AddNeighbor(server);
    }
);
...
detector->InitAllBackgroundThreads();

 

and this is pretty much it.

The detector is now running in background and both reporting the server’s health as well as recording (and gossiping about) the other “neighbors.”

A few of them can be started like this:

$ ./bin/gossip_detector_example --port=8085 \
--seeds=gondor:8086,gondor:8087

and after a little while, all will be reporting about the others’ health:

Screenshot from 2017-09-07 23-26-24

if any of these becomes unresponsive, it will be reported as such:

Screenshot from 2017-09-07 23-28-03

the other servers will eventually become aware of the failure via gossip, without necessarily having to directly ping this one server; conversely, if the unresponsive server recovers within a configurable “grace period” (or comes back online after a reboot) it will be re-admitted in the “healthy” list.

Thanks to the use of a Gossip-based protocol, the number of server-to-server communications (and request/response overhead) is O(1) instead of O(n²) as with traditional heartbeat protocols and it does not depend on a central master controller (a major Single Point of Failure itself).

However, the time bound on the spread of the information is upper bound to be O(log k), where k is the size of the “gossip ring.”

See the paper for more details and a rigorous analysis of the protocol.

The distlib library is build and installed as a SHARED (.so) library and also contains a number of other utilities, including a Consistent Hash; Merkle Trees; and various utilities.

We are looking for folks willing to contribute to the code (the library is released under the Apache 2 license).

More details can be found on the Github Project Page and I plan to start a series of blog posts describing its implementation.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s