Sunday, 24 December 2017

Data Replication



In the previous post, we covered how data is distributed across the nodes. Now let's look at the strategy behind data replication across nodes and also across datacenters.

We have seen in the previous post that, to provide fault tolerance, Cassandra stores replicas of data on multiple nodes like any other database, but in its own unique strategical format :)  The replication strategy is responsible for determining which nodes hold the replicas, and the replication factor determines how many replicas exist in the cluster. If the replication factor is 2, then there will be exactly two copies of the data in the cluster, on different nodes.

Let's look at the replication strategies available in Cassandra:

1) SimpleStrategy: This strategy applies to a single-datacenter setup with N nodes in it. We have seen the token logic used for choosing the first node in the cluster (i.e. using ./nodetool ring we can identify the primary data holder - previous post); once the first node is chosen, the remaining replicas are placed on the next nodes in the clockwise direction around the node ring/cluster.

In the 4-node setup below, assume the replication factor is set to '3' (NOTE: it is configurable at the database/keyspace level; we will see that later in this blog). If node 2 is chosen by the partitioner based on the partitioning/hash value of the key, then its replicas will be stored on node 3 and node 4 in the cluster.




Single DataCenter setup
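The clockwise placement described above can be sketched in Python. This is a minimal illustrative sketch, not Cassandra's actual implementation: the real partitioner assigns token ranges (e.g. via Murmur3 hashing), and the node names here are made up for the example.

```python
# Minimal sketch of SimpleStrategy replica placement (illustrative only):
# starting from the node that owns the partition's token, walk the ring
# clockwise until replication_factor distinct nodes have been collected.

def simple_strategy_replicas(ring, primary, replication_factor):
    """ring: nodes in clockwise token order; primary: node owning the token."""
    start = ring.index(primary)
    return [ring[(start + i) % len(ring)] for i in range(replication_factor)]

ring = ["node1", "node2", "node3", "node4"]  # 4-node single-DC cluster
print(simple_strategy_replicas(ring, "node2", 3))
# -> ['node2', 'node3', 'node4']
```

With replication factor 3 and node 2 as the primary data holder, the walk picks node 2, node 3, and node 4, matching the diagram; the modulo handles wrap-around (e.g. a primary of node 4 would replicate to node 1).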
                                                                                                                                                                 






2) NetworkTopologyStrategy: This strategy applies to setups with two or more datacenters, each with N nodes. The interesting part of this strategy is that replicas are placed for each datacenter separately. Let's say we have two datacenters with 4 nodes in each DC, and the replication factor is 3 at the keyspace level for both; then there will be 3 copies of the data in DC1 and 3 copies in DC2. This is because failures sometimes happen at the DC level, and this is how DC-level failures are handled in Cassandra.

The two-DC, 8-node setup below demonstrates the replication across DCs.

Multi DataCenter setup



This is how DC-level replication is maintained. To reiterate: if the replication factor is 3 in a two-DC setup, then there will be 3 copies of the data in each DC.
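The per-DC placement can be sketched the same way. This is a simplified sketch under the assumption that each DC's ring is walked independently; real NetworkTopologyStrategy also tries to spread replicas across racks within a DC, which is ignored here, and all node/DC names are illustrative.

```python
# Simplified sketch of NetworkTopologyStrategy: the replication factor is
# applied per datacenter, so each DC independently gets its own clockwise
# walk starting from that DC's first replica node.

def network_topology_replicas(dc_rings, primaries, rf_per_dc):
    """dc_rings: {dc: nodes in ring order}; primaries: {dc: first replica node};
    rf_per_dc: {dc: replication factor for that DC}."""
    replicas = {}
    for dc, ring in dc_rings.items():
        start = ring.index(primaries[dc])
        replicas[dc] = [ring[(start + i) % len(ring)] for i in range(rf_per_dc[dc])]
    return replicas

dc_rings = {
    "DC1": ["dc1-n1", "dc1-n2", "dc1-n3", "dc1-n4"],
    "DC2": ["dc2-n1", "dc2-n2", "dc2-n3", "dc2-n4"],
}
# RF 3 in each DC -> 3 copies in DC1 and 3 copies in DC2 (6 in total)
result = network_topology_replicas(
    dc_rings, {"DC1": "dc1-n2", "DC2": "dc2-n1"}, {"DC1": 3, "DC2": 3}
)
print(result)
```

In CQL, this per-DC configuration is what you express when creating a keyspace, e.g. WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3}, which is the keyspace-level setting mentioned earlier.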


Hope that, based on these two posts, data distribution and replication in Cassandra are clear. In the upcoming post we will look at the consistency levels in Cassandra.

