A quasardb cluster is a peer-to-peer distributed hash table based on Chord. It has the following features:
- Distributed load: the load is fairly and automatically distributed amongst the nodes of the cluster
- Optimized replica usage: for reads, the nearest replica is used
- Automatic configuration: nodes organize themselves and exchange data as needed
- Integrated replication: data can be replicated on several nodes for increased resilience
- Fault tolerance: the failure of one or several nodes does not compromise the cluster
- Transparent topology: a client queries the cluster from any node without any concern for performance
To be properly operated, a ring needs to be stable (see Fault tolerance).
A ring is stable when each node is connected to the proper successor and predecessor, that is, when all nodes are ordered by their respective ids. Each node requires a unique id that may either be automatically generated or given by the user (see quasardb daemon).
If a node detects that its id is already in use, it will leave the ring.
Each node periodically checks the validity of its successor and predecessor and will adjust them if necessary. This process is called stabilization.
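This stabilization round is close in spirit to the routine described in the original Chord paper. The following sketch illustrates the idea in Python; the names and structures are hypothetical and do not reflect the actual quasardb implementation:

    # Sketch of a Chord-style stabilization round (illustrative only).
    def in_range_exclusive(x, a, b):
        # True if x lies strictly between a and b on the identifier circle.
        if a < b:
            return a < x < b
        return x > a or x < b  # the range wraps around zero

    def stabilize(node):
        # Ask the current successor who it believes its predecessor is.
        candidate = node.successor.predecessor
        # If that node sits between us and our successor, it is a better successor.
        if candidate is not None and in_range_exclusive(candidate.id, node.id, node.successor.id):
            node.successor = candidate
        # Tell the successor about this node so it can adjust its predecessor if needed.
        node.successor.notify(node)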
Each entry is assigned a unique ID. This unique ID is the SHA-3 hash of the alias.
The entry is then placed on the node whose ID is the successor of the entry’s ID. If replication is in place, the entry will also be placed on the successor’s successor.
When a client queries the cluster, it locates the node that is the successor of the entry and queries that node.
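The placement rule can be pictured with a flat, sorted list of node ids standing in for the ring. This is only a sketch: the SHA-3 variant and the helper names are assumptions, and the real lookup goes through the distributed hash table rather than a local list:

    import bisect
    import hashlib

    def entry_id(alias: str) -> int:
        # Entry id: a SHA-3 hash of the alias (SHA3-256 assumed here), read as an integer.
        return int.from_bytes(hashlib.sha3_256(alias.encode()).digest(), "big")

    def responsible_node(node_ids, alias):
        # The entry is placed on the node whose id is the successor of the entry's id.
        ring = sorted(node_ids)
        i = bisect.bisect_left(ring, entry_id(alias))
        return ring[i % len(ring)]  # wrap around past the highest id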
As of quasardb 1.2.0, if the cluster uses Data replication, read queries are automatically load-balanced. Nodes containing replicated entries may respond instead of the original node to provide faster lookup times.
Each node periodically “stabilizes” itself.
Stabilizing means a node will exchange information with its neighbors in order to:
- Make sure the neighbors (the successor and the predecessor) are still up and running
- Check that a new node isn’t a better successor than the existing one
In a sane, stable cluster, the time required to stabilize is extremely short and does not result in any modification. However, if one or several nodes fail or if new nodes join the cluster, stabilization will migrate data and change the neighbors (see Data migration).
Thus the stabilization duration depends on the amount of data to migrate, if any. Migrating data is done as fast as the underlying architecture permits.
The interval length between each stabilization can be anywhere between 1 (one) second and 2 (two) minutes.
When a node determines that its neighbors in the cluster are stable, it will increase the duration between stabilization checks. Conversely, when its neighbors are deemed unstable, the duration between stabilization checks will be reduced.
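The adaptive interval can be pictured as a simple bounded back-off; only the one second and two minute bounds come from the text above, the growth factor below is purely illustrative:

    MIN_INTERVAL = 1.0    # seconds, lower bound stated above
    MAX_INTERVAL = 120.0  # seconds, upper bound stated above

    def next_interval(current, neighbors_stable):
        # Lengthen the pause between checks while the neighborhood looks stable,
        # shorten it as soon as instability is detected. The factor 2 is arbitrary.
        if neighbors_stable:
            return min(current * 2.0, MAX_INTERVAL)
        return max(current / 2.0, MIN_INTERVAL)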
Tip
Stabilization happens when bootstrapping a cluster, in case of failure or when adding nodes. It is transparent and does not require any intervention.
Data migration only occurs when a new node joins the ring. If the new node is the successor of keys already bound to another node, data migration will take place. Data migration occurs regardless of data replication, as it makes sure entries are always bound to the correct node.
Note
Data migration is always enabled.
Nodes may join a ring in two cases:
- When a failed node rejoins the ring upon recovery
- When the administrator expands the cluster by adding new nodes
Removing nodes does not cause data migration. Removing nodes results in inaccessible entries, unless data replication is in place (see Data replication).
At the end of each stabilization cycle, a node will ask its successor and its predecessor for the entries within its range.
More precisely:
1. N joins the ring by looking for its successor S
2. N stabilizes itself, informing its successor and predecessor of its existence
3. When N has both a predecessor P and a successor S, N asks both of them for the [P; N] range of keys
4. P and S send the requested keys, if any, one by one.
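The handoff can be sketched as follows; the method names are hypothetical and the real protocol streams entries over the network one by one:

    def migrate_to_new_node(new_node, predecessor, successor):
        # Once N knows its predecessor P and successor S, it asks both of them
        # for the keys it is now responsible for, i.e. the [P; N] range.
        for peer in (predecessor, successor):
            for key, value in peer.entries_in_range(predecessor.id, new_node.id):
                new_node.store(key, value)
                # The peer only removes its copy once the transfer is acknowledged,
                # so an interrupted migration never loses data.
                peer.acknowledge_and_remove(key)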
Note
Migration speed depends on the available network bandwidth. Therefore, migrating a large amount of data (several gigabytes) may negatively impact performance.
During migration, nodes remain available and will answer requests. However, since migration occurs after the node is registered, there is a time interval during which entries being migrated may be temporarily unavailable (between steps #3 and #4).
Failure scenario:
- A new node N joins the ring, its predecessor is P and its successor is S
- A client looks for the entry e, which is currently bound to S but ought to be on N
- As N has joined the ring, the client correctly requests N for e
- N answers “not found” as S has not migrated e yet
Entry e will only be unavailable for the duration of the migration and does not result in a data loss. A node will not remove an entry until the peer has fully acknowledged the migration.
Tip
Add nodes when the traffic is at its lowest point.
Data replication greatly reduces the odds of functional failures at the cost of increased memory usage and reduced performance when adding or updating entries.
Note
Replication is optional and disabled by default (see quasardb daemon).
Data is replicated on a node’s successors. For example, with a factor two replication, an entry will be maintained by a node and by its successor. With a factor three replication, an entry will be maintained by a node and by its two successors. Thus, replication linearly increases memory usage.
Note
The replication factor is identical for all nodes of a cluster and is configurable (see quasardb daemon). By default it is set to one (replication disabled).
The limit to this rule concerns clusters with fewer nodes than the replication factor. For example, a two-node cluster cannot have a factor three replication.
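Continuing the placement sketch above, replica placement and the node-count constraint can be expressed as follows (hypothetical names, illustrative only):

    import bisect

    def replica_nodes(node_ids, entry_id, factor):
        # An entry is held by its successor node and by the next (factor - 1)
        # successors on the ring; memory usage therefore grows linearly with the factor.
        if factor > len(node_ids):
            raise ValueError("replication factor cannot exceed the number of nodes")
        ring = sorted(node_ids)
        i = bisect.bisect_left(ring, entry_id) % len(ring)
        return [ring[(i + k) % len(ring)] for k in range(factor)]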
Replication is done synchronously as data is added or updated. The call will not successfully return until the data has been stored and fully replicated.
When a node fails and leaves the ring, data will be replicated on the new successor after stabilization completes. This means that simultaneous failures between two stabilizations may result in inaccessible entries (see Impact on reliability).
Note
Since the location of the replicas depends on the order of nodes, the physical location of data can be controlled through the choice of node ids.
Replication’s main benefits lie in reliability and resilience:
- When adding a new node, data remains accessible during migration. The client will look up replicas should it fail to access the original entry (see Data migration)
- When a node becomes unreachable, replicas will take over and service requests
When a new node joins a ring, data is migrated (see Data migration). When replication is in place, the migration phase also includes a replication phase that consists in copying all the entries to the successor. Thus, replication increases the migration duration.
Because of the way replication works, an original and a replica entry cannot be simultaneously edited. The client will always access the version considered the original entry and replicas are always overwritten in favor of the original.
A version is original if it belongs to the node’s range; if not, it is a replica. A replica becomes original when the range of the node changes.
In other words, the client accesses the replica after ring stabilization. It does not attempt to directly read the entry of the successor. Therefore, replication is totally transparent to the client.
This comes at the cost of some unavailability while the ring is unstable and entries are being replicated.
Formally put, this means that quasardb may choose to sacrifice Availability for Consistency and Partition tolerance during short periods of time.
For an entry to become unavailable, all of its replicas must simultaneously fail.
More formally, given a failure rate \(\lambda(N)\) for a node N, the mean time \(\tau\) between failures of any given entry can be expressed as a function of the replication factor \(x\).
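A sketch of this relation, treating \(\lambda\) as a constant per-node rate:

\[
\tau(x) \propto \frac{1}{\lambda^{x}}
\]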
This formula assumes that failures are unrelated, which is never completely the case. For example, the failure rates of blades in the same enclosure are correlated. However, the formula is a good enough approximation to exhibit the exponential relation between replication and reliability.
Tip
A replication factor of two is a good compromise between reliability and memory usage as it gives a quadratic increase in reliability while increasing memory usage by a factor two.
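For instance, using the sketch above with a hypothetical per-node failure rate of \(\lambda = 10^{-3}\) failures per hour, a factor two replication drives the entry failure rate towards \(\lambda^{2} = 10^{-6}\) per hour: roughly a thousandfold gain in reliability for twice the memory.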
All add and update (“write”) operations are roughly \(x\) times slower when the replication factor is \(x\). Read-only operations are not impacted.
Replication also increases the time needed to add a new node to the ring by a factor of at most \(x\).
Tip
Clusters that mostly perform read operations greatly benefit from replication without any noticeable performance penalty.
To build a cluster, nodes are added to each other. A node only needs to know one other node within the ring (see Your first quasardb cluster). It is paramount to make sure that the cluster does not form disjoint rings, that is, that all nodes will eventually join the same large ring.
The simplest way to ensure this is to make all nodes initially join the same node. This will not create a single point of failure as once the ring is stabilized the nodes will properly reference each other.
If, following a major network failure, a ring splits into two disjoint rings, the two rings will be able to unite again once the underlying failure is resolved. This is because each node “remembers” past topologies.
A client may connect to any node within the cluster. It will automatically discover the nodes as needed.
When a node recovers from failure, it needs to reference a peer within the existing ring to properly rejoin. The first node in a ring generally does not reference any other, thus, if the first node of the ring fails, it needs to be restarted with a reference to a peer within the existing ring.
quasardb is designed to be extremely resilient. All failures are temporary, assuming the underlying cause of failure can be fixed (power failure, hardware fault, driver bug, operating system fault, etc.).
However, there is one case where data may be lost:
- A node fails and
- Data is not replicated and
- The data was not persisted to disk or storage failed
The persistence layer is able to recover from write failures, which means that one write error will not compromise everything. It is also possible to make sure writes are synced to disks (see quasardb daemon) to increase reliability further.
Data persistence enables a node to fully recover from a failure and should be considered for production environments. Its impact on performance is negligible for clusters that mostly perform read operations.
When a node fails, a segment of the ring will become unstable. When a ring’s segment is unstable, requests might fail. This happens when:
- The requested node’s predecessor or successor is unavailable and
- The requested node is currently looking for a valid predecessor or successor
In this context the node chooses to answer the client with an “unstable” error status. The client will then look for another node on the ring able to answer its query. If it fails to do so, the client will return an error to the user.
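The client-side behaviour can be sketched as follows; the error type and method names are hypothetical, and the actual client libraries handle this internally:

    class UnstableError(Exception):
        # Hypothetical error raised by a node whose ring segment is stabilizing.
        pass

    def resilient_get(nodes, alias):
        # Query each known node in turn; a node whose segment is unstable answers
        # with an "unstable" status, in which case another node is tried.
        last_error = None
        for node in nodes:
            try:
                return node.get(alias)
            except UnstableError as err:
                last_error = err  # segment is stabilizing, try another node
        # No node on the ring could answer: surface the error to the user.
        raise last_error if last_error else UnstableError("no known nodes")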
When a node joins a ring, it is in an unstable state until the join is complete.
That means that although a ring’s segment may be unable to serve requests for a short period of time, the rest of the ring remains unaffected.
In a production environment, cluster segments may become unstable for a short period of time after a node fails. This temporary instability does not require human intervention to be resolved.
Tip
When a cluster’s segment is unstable, requests might temporarily fail. The probability of failure is exponentially correlated with the number of simultaneous failures.
A cluster can successfully operate with a single node; however, a single node may not be able to handle all the load of the ring by itself. Additionally, managing node failures implies extra work for the nodes. Frequent failures will severely impact performance.
Tip
A cluster operates best when more than 90% of the nodes are fully functional. Anticipate traffic growth and add nodes before the cluster is saturated.