There are a few new terms we will come across in this post. To give a heads-up, I am listing them here.
Commit Log :
It is an append-only write-ahead log on disk. Every write is recorded here first, so the data can be recovered if the node crashes before the in-memory data is flushed.
Mem Table :
It is an in-memory structure that holds recently written data in sorted order. Once it grows past a configured threshold, it is flushed to disk as an SS Table.
SS Table :
These Sorted String Tables are stored sequentially on disk and maintained for each Cassandra table. Each table has its own set of SS Tables, whose number depends on compaction. These SS Tables are immutable. Generally, they are stored in the data folder, grouped by table name (Example : /home/hari/Cassandra/apache-cassandra-3.11.1/data/data/*).
Compaction :
It is the background process that merges multiple SS Tables of a table into fewer ones, keeping only the latest version of each row and discarding deleted data.
Coordinator node :
It is any node in the cluster that receives the client request and determines which node/replica should perform the read or write for that particular request.
It is chosen based on the load-balancing policy used in the client. Say we have 4 nodes N1, N2, N3 and N4. If the load-balancing policy is set to 'Round-Robin', then for the first read/write operation node N1 will act as the coordinator node, for the next request it will be N2, then N3, then N4, and then N1 again.
We have another load-balancing policy called 'token-aware', which chooses the coordinator node based on the hash value of the key, picking a node that is responsible for that key's token range.
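As a rough sketch of the two policies, consider the following toy simulation. The node names and the modulo-based "hash ring" are made up for illustration; a real driver hashes the partition key with the cluster's partitioner and consults the actual token ranges.

```python
from itertools import cycle

nodes = ["N1", "N2", "N3", "N4"]

# Round-robin: each request simply goes to the next node in turn.
round_robin = cycle(nodes)

def token_aware_coordinator(key):
    # Toy stand-in for token-aware routing: map the key's hash to a node.
    # hash() is stable within one Python run, so the same key always
    # lands on the same node, which is the property that matters here.
    return nodes[hash(key) % len(nodes)]

# First five requests under round-robin wrap back around to N1.
first_five = [next(round_robin) for _ in range(5)]
print(first_five)  # ['N1', 'N2', 'N3', 'N4', 'N1']

# Token-aware always routes a given key to the same coordinator.
same_node = token_aware_coordinator("user:42") == token_aware_coordinator("user:42")
print(same_node)  # True
```

The practical difference: round-robin spreads load evenly but often picks a coordinator that must forward the request, while token-aware tends to pick a node that already owns the data, saving a network hop.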
As per the above example:
- Node 3 is chosen as the coordinator node for the client request.
- Based on the hash value of the key used for the read/write, node 3 determines that node 5 holds / is responsible for that data, so it sends the request to node 5.
- After the read/write operation, node 5 sends the response back to the coordinator node.
- The coordinator node then sends the response back to the client.
Sometimes the coordinator node itself holds the data, so it reads/writes on the same node. The coordinator thus does the work of a master in other databases, but unlike those databases there is no single coordinator here: one is chosen per client request based on the load-balancing policy specified by the client. By default, the 'token-aware' policy is used to choose the coordinator node.
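The four-step flow above can be sketched as a toy simulation. The ring mapping, node names, and stored value below are invented for illustration:

```python
# Toy cluster state: which node owns a key, and what each node stores.
ring = {"tweet:7": "node5"}                    # node 5 is responsible for this key
storage = {"node5": {"tweet:7": "hello"}}

def handle_request(coordinator, key):
    # 1. The coordinator determines the owning replica from the key.
    replica = ring[key]
    # 2. If the coordinator itself owns the data, it serves it locally;
    #    otherwise it forwards the request to the replica.
    target = coordinator if coordinator == replica else replica
    # 3. The replica performs the read and returns the value...
    value = storage[target][key]
    # 4. ...and the coordinator relays the response to the client.
    return value

print(handle_request("node3", "tweet:7"))  # hello
```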
Now, let's see the write operation in Cassandra.
Write Operation :
The coordinator sends the write request to the replicas. Assuming all the replica nodes are up, they will all receive the write request regardless of the write consistency level.
What are write consistency levels in Cassandra?
To put it simply, the write consistency level determines how many replica nodes must respond with a success acknowledgement before the write is considered successful.
(i.e) Let's say
1) We have 4 nodes N1, N2, N3 and N4 with the Replication Factor (RF) as '3' and Write Consistency (WC) as '3'
Once the client chooses node N1 as the coordinator node and node N2 is the primary partition for this key, the coordinator waits for all three replicas (N2, N3 and N4) to acknowledge the write before returning success to the client.
2) But what if the Replication Factor (RF) is '3' and Write Consistency (WC) is '2'?
Similar to point 1, once the client chooses node N1 as the coordinator node and node N2 is the primary partition for this key (based on ./nodetool ring), the write is sent to the replicas. Since the write consistency is 2, as soon as the write succeeds on nodes N2 and N3 the coordinator returns the response to the client, and the data is written to node N4 asynchronously. So the write operation is reported successful without checking whether the data was successfully written to the third node.
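The acknowledgement logic can be sketched as follows. This is a simplification under the assumption that replicas respond in order; the replica names are illustrative:

```python
def coordinator_write(replicas, write_consistency):
    """Split the replica set into those the coordinator waits on
    synchronously and those written in the background."""
    acked = replicas[:write_consistency]         # wait for these acks
    deferred = replicas[write_consistency:]      # written asynchronously
    return acked, deferred

# RF = 3: the key is replicated to N2, N3 and N4; WC = 2.
acked, deferred = coordinator_write(["N2", "N3", "N4"], write_consistency=2)
print(acked)     # ['N2', 'N3'] -> the client sees success after these
print(deferred)  # ['N4']       -> completed in the background
```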
Consider an application like Twitter, which needs high write throughput while reads are not as critical. In this case, the WC is set low compared to the RF, since writes need to be fast. This is why tweets sometimes show up late for a few users, depending on which node is chosen to read the data.
Now let's see how the write works internally in Cassandra.
[Figure: Write workflow in Cassandra]
As you can see, when Cassandra processes a write operation, the data is first written to the commit log / WAL (hard disk) and then to the Mem Table (in memory). Once the write has reached both the commit log and the in-memory table (Mem Table), the write is considered successful. NOTE: while storing the data in memory, it is kept in sorted order (we will see this while we explore the data structures of Cassandra in the next post).
Then, periodically, based on the configured threshold values, the Mem Table is flushed to an SS Table, which means the sorted data is written out to disk.
What happens if, before the threshold is reached, the power goes down or Cassandra is forcefully shut down?
In this case, when the Cassandra node is restarted, it first replays the commit log and verifies that all of its data has been written to disk. If it finds data in the commit log that was not flushed to disk, it writes that data to disk at that time.
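The write path and the replay-on-restart behaviour can be sketched together. The file format and class below are made up for illustration; Cassandra's real commit log is a binary segment format, not JSON lines:

```python
import json
import os
import tempfile

class ToyNode:
    """Sketch of a node: append to a commit log on disk first, then
    apply to an in-memory Memtable; replay the log after a restart."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.memtable = {}                       # key -> value

    def write(self, key, value):
        with open(self.log_path, "a") as log:    # 1. commit log first (durable)
            log.write(json.dumps([key, value]) + "\n")
        self.memtable[key] = value               # 2. then the Memtable
        # The write is now considered successful.

    def replay(self):
        # On restart: re-apply every commit-log entry, recovering
        # writes that never made it into an SS Table.
        with open(self.log_path) as log:
            for line in log:
                key, value = json.loads(line)
                self.memtable[key] = value

node = ToyNode(os.path.join(tempfile.mkdtemp(), "commitlog"))
node.write("k1", "v1")
node.write("k2", "v2")

# Simulate a crash before any flush: a fresh node with an empty Memtable.
restarted = ToyNode(node.log_path)
restarted.replay()
print(sorted(restarted.memtable.items()))  # [('k1', 'v1'), ('k2', 'v2')]
```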
Read Operation :
Once the coordinator node is chosen, the coordinator sends the read request to a number of replicas, based on the read consistency level.
What are read consistency levels in Cassandra?
To put it simply, the read consistency level determines how many replicas must successfully respond to a read request.
(i.e) Let's say
1) We have 4 nodes N1, N2, N3 and N4 with the Replication Factor (RF) as '3' and Read Consistency (RC) as '1'
Once the client chooses node N1 as the coordinator node and node N2 is the primary partition for this key (based on ./nodetool ring), then since the read consistency is 1, the coordinator reads from node N2 alone and sends the response back to the client.
2) But what if the Replication Factor (RF) is '3' and Read Consistency (RC) is '2'?
Similar to point 1, once the client chooses node N1 as the coordinator node and node N2 is the primary partition for this key (based on ./nodetool ring), then since the read consistency is 2, the coordinator reads from node N2 and also from another replica, say node N3, confirms that the returned data is up to date, and sends the response back to the client.
If any node returns an out-of-date value (based on its timestamp; Cassandra internally maintains a timestamp for every operation, which we will see in the data-structure section), a background read-repair request updates that data. This process is called the read repair mechanism.
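The RC = 2 read plus the repair step can be sketched as follows. The replica contents and timestamps are invented for illustration:

```python
# Toy replica state: each replica holds a value with a write timestamp.
replicas = {
    "N2": {"value": "new tweet", "ts": 200},
    "N3": {"value": "old tweet", "ts": 100},   # out of date
}

def read_with_repair(names):
    """Read from the given replicas, return the newest value, and
    push that value to any replica that returned a stale one."""
    responses = {name: replicas[name] for name in names}
    winner = max(responses.values(), key=lambda r: r["ts"])
    for name, resp in responses.items():
        if resp["ts"] < winner["ts"]:
            replicas[name] = dict(winner)      # background read repair
    return winner["value"]

result = read_with_repair(["N2", "N3"])
print(result)                    # new tweet
print(replicas["N3"]["value"])   # new tweet (N3 has been repaired)
```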
[Figure: Read consistency in Cassandra]
We know that the written data can be in the Mem Table or, if it has been flushed, in an SS Table. So during a read operation, Cassandra must read from the Mem Table as well as the SS Tables and combine the results. Reading from the Mem Table creates no performance hiccup since it is in memory, but reading from an SS Table incurs hard-disk I/O.
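The merge described above can be sketched like this, with the newest timestamp winning. The keys, values, and timestamps are made up; in a real node each SS Table probe is a disk read:

```python
# key -> (value, write timestamp)
memtable = {"k2": ("v2-new", 300)}
sstables = [
    {"k1": ("v1", 100), "k2": ("v2-old", 150)},   # older, flushed data
    {"k3": ("v3", 200)},
]

def read(key):
    """Collect candidates from the Memtable and every SS Table,
    then return the value with the newest timestamp."""
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])          # cheap: in memory
    for table in sstables:
        if key in table:
            candidates.append(table[key])         # costly: disk I/O
    if not candidates:
        return None
    return max(candidates, key=lambda vt: vt[1])[0]

print(read("k2"))  # v2-new (the Memtable shadows the flushed value)
print(read("k1"))  # v1     (present only on disk)
```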
So how is this handled? Cassandra keeps per-SS-Table structures such as Bloom filters and key caches, which let it skip SS Tables that cannot contain the requested partition.
I believe this clearly explains the write and read operations in Cassandra with their respective consistency levels.
Let's understand the data structure of Cassandra in the next post.