Sunday, 24 December 2017

Write and Read Operations with Consistency Levels



There are a few new terms we will come across in this post. To give a heads-up, I am listing them here.

Commit Log :

Like most other databases, Cassandra has a WAL (Write-Ahead Log); here it is called the Commit Log, and it is used for crash recovery. Since it is a WAL, every write first goes to the Commit Log for durability. After the write succeeds in this append-only log, the data is copied to the in-memory table called the 'Mem Table', which makes the write operation successful. We will see this topic clearly in the write operation below. Generally, these commit logs live in the data folder location. (Example : apache-cassandra-3.11.1/data/commitlog/*).

Mem Table :

After the data is written to the Commit Log, it is written to the Mem-Table. The data lives in the Mem-Table only temporarily, since it is an in-memory structure.

SS Table :

Once the Mem-Table reaches a threshold, the data is flushed to the hard disk as an SS Table (Sorted String Table).
So what are the thresholds that make the Mem-Table flush its data to an SS Table?
1) When the Mem-Table is full in memory, the data is automatically flushed to an SS Table.
2) We can set a time interval after which the Mem-Table is flushed to an SS Table.
3) We can forcefully flush the Mem-Table to an SS Table using the nodetool utility in the bin folder (./nodetool flush)
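
To make points 2 and 3 concrete, here is a minimal sketch using the DataStax Python driver; the node address and the 'demo' keyspace / 'users' table are hypothetical names assumed for the example.

    # A minimal sketch, assuming a node on localhost and an existing
    # keyspace 'demo' with a table 'users' (both hypothetical names).
    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('demo')

    # Point 2: flush this table's Mem-Table roughly every hour;
    # memtable_flush_period_in_ms is a per-table option (0 disables it).
    session.execute(
        "ALTER TABLE users WITH memtable_flush_period_in_ms = 3600000"
    )

    # Point 3 is a shell command rather than CQL:
    #   ./nodetool flush demo users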

These Sorted String Tables are stored sequentially on disk and maintained for each Cassandra table. Each table will have its own set of SS Tables, the number of which depends on compaction. Also, these SS Tables are immutable. Generally, they are stored in the data folder, grouped by table name (Example : /home/hari/Cassandra/apache-cassandra-3.11.1/data/data/*).

Compaction :

Generally, there will be N number of SS Tables for a table. Compaction runs periodically and rewrites these SS Tables, combining them into a single file for better read performance. It also removes row entries that are marked as deleted.
We can forcefully trigger a compaction using the nodetool utility (./nodetool compact)
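
Compaction behaviour is also configurable per table. A small hedged sketch, reusing the hypothetical 'demo.users' table from above:

    # A sketch of choosing a compaction strategy per table;
    # SizeTieredCompactionStrategy is the default in Cassandra 3.x.
    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('demo')
    session.execute(
        "ALTER TABLE users WITH compaction = "
        "{'class': 'SizeTieredCompactionStrategy'}"
    )

    # Forcing a major compaction is again a shell command:
    #   ./nodetool compact demo users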

Coordinator node :

It is just any node in the cluster; it determines which node/replica should perform the read or write for a particular request.

It is chosen based on the load-balancing policy used in the client. Say we have 4 nodes N1, N2, N3 and N4. If the load-balancing policy is set to 'Round-Robin', then for the first read/write operation node N1 will act as the coordinator node, for the next request it will be N2, then N3, then N4, and then N1 again.

We have another load-balancing policy called 'token-aware', which chooses as the coordinator a node that is responsible for the hash value (token) of the key being read or written.
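
To make this concrete, here is a hedged sketch of how a client picks these policies with the DataStax Python driver (the contact point is an assumption; the class names are the driver's own):

    # A minimal sketch, assuming the DataStax Python driver
    # (pip install cassandra-driver) and a node on localhost.
    from cassandra.cluster import Cluster
    from cassandra.policies import (RoundRobinPolicy,
                                    TokenAwarePolicy,
                                    DCAwareRoundRobinPolicy)

    # Plain round-robin: coordinators rotate N1, N2, N3, N4, N1, ...
    rr_cluster = Cluster(['127.0.0.1'],
                         load_balancing_policy=RoundRobinPolicy())

    # Token-aware: prefer a replica that owns the key's token as the
    # coordinator, falling back to round-robin within the datacenter.
    ta_cluster = Cluster(['127.0.0.1'],
                         load_balancing_policy=TokenAwarePolicy(
                             DCAwareRoundRobinPolicy()))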

[Figure: a client request arrives at node 3 (the coordinator), which forwards it to node 5, the replica responsible for the key]
As per the above example,

  • Node 3 is chosen as the coordinator node for the client request.
  • Node 3 picks node 5 (based on the hash value of the key used for the read/write), since node 5 holds / is responsible for that data, and sends the request to node 5.
  • After the read/write operation, node 5 sends the response back to the coordinator node.
  • The coordinator node then sends the response back to the client.


Sometimes the coordinator node itself holds the data, in which case it reads/writes locally. So the coordinator does the work of a master in other databases, but unlike those databases there is no single coordinator here: one is chosen per client request based on the load-balancing policy specified by the client. By default, most drivers use a token-aware policy to choose the coordinator node.

Now, let's see the write operation in Cassandra.

Write Operation :


The coordinator sends the write request to the replicas. Assuming all the replica nodes are up, all of them will receive the write request regardless of the write consistency level.

What are write consistency levels in Cassandra?

To say it in simple words, the write consistency level determines how many replica nodes must respond with a success acknowledgement before the write is reported as successful to the client.
For example, let's say

1) We have 4 nodes N1, N2, N3 and N4 with the Replication Factor (RF) as '3' and Write Consistency (WC) as '3'

Once the client chooses node N1 as the coordinator node, consider node N2 to be the primary partition owner for this key (based on ./nodetool ring). Since RF = 3 and WC = 3, the write is acknowledged as successful only after it succeeds on node N2, node N3 and node N4.

2) But what if the Replication Factor (RF) is '3' and the Write Consistency (WC) is '2'?

Similar to point 1, once the client chooses node N1 as the coordinator node and node N2 is the primary partition owner for this key (based on ./nodetool ring), the write first succeeds on node N2 and node N3. It still needs to write the data to the third replica, but since the write consistency is 2, the coordinator returns the response to the client right away and the data is written to node N4 asynchronously. So the write operation is reported as successful without checking whether the data was successfully written to the third node.
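
Both cases can be sketched with the Python driver; the 'demo' keyspace and 'users' table are hypothetical names assumed for the example:

    # A minimal sketch, assuming the DataStax Python driver and a
    # cluster that can satisfy RF = 3.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect()

    # The Replication Factor (RF) is set per keyspace.
    session.execute(
        "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
        "{'class': 'SimpleStrategy', 'replication_factor': 3}")
    session.set_keyspace('demo')
    session.execute(
        "CREATE TABLE IF NOT EXISTS users (id int PRIMARY KEY, name text)")

    # Point 1: RF = 3, WC = 3 -- all three replicas must acknowledge.
    write_all = SimpleStatement(
        "INSERT INTO users (id, name) VALUES (1, 'hari')",
        consistency_level=ConsistencyLevel.THREE)
    session.execute(write_all)

    # Point 2: RF = 3, WC = 2 -- success after two acknowledgements;
    # the third replica receives the data asynchronously.
    write_two = SimpleStatement(
        "INSERT INTO users (id, name) VALUES (2, 'cassandra')",
        consistency_level=ConsistencyLevel.TWO)
    session.execute(write_two)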

Consider an application like Twitter, which needs fast writes while perfectly up-to-date reads are not critical. In such a case the WC will be set low compared to the RF, since writes need to be fast and strongly consistent reads are not the primary concern. This is why tweets sometimes show up late for a few users, depending on which replica happens to serve their read.

Now let's see how the write works in Cassandra internally.

Write workflow in Cassandra

As you can see, when Cassandra processes a write operation, the data is written to the commit log / WAL (on the hard disk) and then to the Mem-Table (in memory). Once the write has succeeded in the commit log and then in the in-memory table (Mem-Table), the write is considered successful. NOTE: While stored in memory, the data is kept in sorted order (we will see this while we explore Cassandra's data structures in the next post).
Then, periodically, based on the threshold values mentioned in the SS Table section above, the Mem-Table is flushed to an SS Table, which means the sorted data is written out to disk.
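
To make the ordering concrete, here is a toy Python sketch of the idea; it illustrates the commit-log-then-Mem-Table sequence and is in no way Cassandra's actual code:

    # A toy write path: append to a durable log first, then update the
    # in-memory table. Illustrative only, not Cassandra's implementation.
    import json
    import os

    COMMIT_LOG = 'commitlog.txt'
    memtable = {}  # key -> value (Cassandra keeps this sorted)

    def write(key, value):
        # 1) Append to the commit log on disk for durability.
        with open(COMMIT_LOG, 'a') as log:
            log.write(json.dumps({'k': key, 'v': value}) + '\n')
        # 2) Only then update the Mem-Table; the write now counts as
        #    successful, even though no SS Table has been written yet.
        memtable[key] = value

    def replay():
        # On restart, rebuild the Mem-Table from the commit log, which
        # recovers writes that were never flushed to an SS Table.
        if not os.path.exists(COMMIT_LOG):
            return
        with open(COMMIT_LOG) as log:
            for line in log:
                entry = json.loads(line)
                memtable[entry['k']] = entry['v']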

What happens if the power goes down or Cassandra is forcefully shut down before the threshold is reached?

In this case, when the Cassandra node is restarted, it first replays the commit log and verifies that all of its data has been written to disk. If it finds data in the commit log that was not flushed to disk, it writes that data to disk at that time.


Read Operation :

Once the coordinator node is chosen, the coordinator sends the read request to as many replicas as the read consistency level requires.

What are read consistency levels in Cassandra?

To say it in simple words, the read consistency level determines how many replicas must successfully serve a read request before the result is returned to the client.

For example, let's say
1) We have 4 nodes N1, N2, N3 and N4 with the Replication Factor (RF) as '3' and Read Consistency (RC) as '1'
Once the client chooses node N1 as the coordinator node and node N2 is the primary partition owner for this key (based on ./nodetool ring), then since the read consistency is 1, the coordinator successfully reads from node N2 alone and sends the response back to the client.

2) But what if the Replication Factor (RF) is '3' and the Read Consistency (RC) is '2'?
Similar to point 1, once the client chooses node N1 as the coordinator node and node N2 is the primary partition owner for this key (based on ./nodetool ring), then since the read consistency is 2, the coordinator reads from node N2 and also from another replica, say node N3, confirms that the returned data is the most recent version, and sends the response back to the client.
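
The two read cases, sketched against the same hypothetical 'demo.users' table used in the write example:

    # A minimal sketch, assuming the DataStax Python driver and the
    # hypothetical 'demo.users' table from the write example.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('demo')

    # Point 1: RC = 1 -- an answer from a single replica is enough.
    fast_read = SimpleStatement(
        "SELECT name FROM users WHERE id = 1",
        consistency_level=ConsistencyLevel.ONE)
    print(session.execute(fast_read).one())

    # Point 2: RC = 2 -- two replicas must respond, and the coordinator
    # returns the most recent version based on write timestamps.
    safer_read = SimpleStatement(
        "SELECT name FROM users WHERE id = 1",
        consistency_level=ConsistencyLevel.TWO)
    print(session.execute(safer_read).one())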

What if the response received from any of the nodes is not up to date?
If any node returns an out-of-date value (based on timestamps; Cassandra internally maintains a timestamp for every operation, which we will see in the data structure section), a background read repair request updates that replica with the latest data. This process is called the read repair mechanism.
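
In Cassandra 3.x, background read repair can also be triggered probabilistically through a per-table option. A hedged sketch on the hypothetical 'demo.users' table:

    # read_repair_chance is a Cassandra 3.x table option: repair the
    # other replicas in the background on a fraction of reads.
    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('demo')
    session.execute("ALTER TABLE users WITH read_repair_chance = 0.1")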

Read Consistency in Cassandra

We know that written data can be in the Mem-Table or, if it has been flushed, in an SS Table. So during a read operation, Cassandra must read from the Mem-Table as well as from the SS Tables and combine the results based on the data available. Reading part of the data from the Mem-Table causes no performance hiccups since it is in memory, but reading from an SS Table involves hard-disk IO, which does.

So how is this handled?

Cassandra handles it with a bloom filter: an in-memory data structure associated with each SS Table that checks whether any data for the requested row might exist in that SS Table. Before doing IO on an SS Table, Cassandra checks its bloom filter and reads the file only if the filter reports a possible match. This is also why we need compaction in Cassandra: after compaction, we only have to read from a minimal number of SS Tables.
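
To show the idea, here is a toy bloom filter in Python; it illustrates the concept only and differs from Cassandra's real implementation:

    # A toy bloom filter: no false negatives ("definitely not in this
    # SS Table"), a tunable rate of false positives.
    import hashlib

    class BloomFilter:
        def __init__(self, size=1024, num_hashes=3):
            self.size = size
            self.num_hashes = num_hashes
            self.bits = [False] * size

        def _positions(self, key):
            # Derive several bit positions from the key.
            for i in range(self.num_hashes):
                digest = hashlib.md5(f'{i}:{key}'.encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos] = True

        def might_contain(self, key):
            # False: the row is definitely absent, so skip the disk IO.
            # True: the row may be present, so read the SS Table.
            return all(self.bits[pos] for pos in self._positions(key))

    sstable_filter = BloomFilter()
    sstable_filter.add('row-42')
    print(sstable_filter.might_contain('row-42'))   # True
    print(sstable_filter.might_contain('row-999'))  # almost surely False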
I believe this explains the write and read operations in Cassandra along with their respective consistency levels.

Let's understand the data structure of Cassandra in the next post.

