Monday, 25 December 2017

Introduction to Cassandra



Before stepping in, a short introduction: I am Hari Prasanth, currently working as a Senior Software Engineer at MSys Technologies. I have working experience in big data storage and analytics and in NoSQL databases like MongoDB, HBase, and Cassandra, and I have also worked through a few Java architectural frameworks.

This blog guides engineers to understand what Cassandra is, how Cassandra works, why we need Cassandra in our applications, and how to use the features and capabilities of Apache Cassandra.

Understanding the Classics

 

There is a very famous theorem in the database world, the CAP theorem, which states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

Consistency - the data is the same on all the nodes in the cluster. Whichever node the user reads from, the user gets the most recently written data.

Availability - the database is accessible for reads and writes at any point in time, with no downtime.

Partition Tolerance - the cluster continues to function even if communication breaks down between nodes. In this case the nodes are up but cannot talk to each other, and the system should still work as expected.

To be frank, this theorem says any database can give you CA, CP, or AP, but never all three at the same time.

Guys, if you manage to create a new database system that supports all of C, A, and P, you will no doubt be treated as a god in the database world. I mean to say, you will become a billionaire :)

Where does Cassandra fit in CAP?

 

FYI, Cassandra is a database :) and it is classified as AP in CAP. So it is a database that gives priority to Availability and Partition Tolerance.

But believe me, the beautiful feature of this database is that we can tune it to meet Consistency as well. I can sense your curiosity about this; we will see how shortly in this blog.
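
To make the "tune it for Consistency" idea a little more concrete, here is a minimal Python sketch of the usual rule of thumb for replicated stores: a read is guaranteed to overlap the latest write when the replicas acknowledging the write plus the replicas consulted on the read add up to more than the replication factor. The function and parameter names are my own for illustration; they are not part of any Cassandra API.

```python
def quorum(replication_factor):
    """Smallest majority of replicas, e.g. 2 out of 3."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    rf = 3  # assumed replication factor, for illustration only
    # Writing to one replica and reading from one favours availability.
    print(reads_see_latest_write(rf, write_replicas=1, read_replicas=1))
    # Quorum writes plus quorum reads trade some availability for consistency.
    print(reads_see_latest_write(rf, quorum(rf), quorum(rf)))
```

So the same cluster can lean towards availability or towards consistency depending on how many replicas each request waits for.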

What is NoSQL?


Please bear with me. This 'What is NoSQL?' section is for people who need a quick understanding of what it is. If you are already a master of NoSQL, skip it and move to the next topic.

Guys, I have working experience with quite a number of NoSQL databases, and I am telling you NoSQL is still a 'buzzword' in the database world.
For easy understanding, I would like to list what people say about NoSQL and my opinion on each:

         1) NoSQL is horizontally scalable - Agreed

         2) NoSQL violates the ACID principles - Not all NoSQL databases do. I would say it depends on the database, since most of them partially support ACID. For example, MongoDB, HBase, and sometimes Cassandra support durability, and MongoDB and HBase support row-level locking. But, of course, most NoSQL databases do not offer full transactions the way relational databases do.

         3) NoSQL is a key-value store architecture - Perfect. It is the core concept of NoSQL, and it helps make writes and reads faster.

         4) NoSQL is for big data - Agreed

Yes, all of these characterize NoSQL in the database world. It relaxes ACID, and its key-value structure departs from the core principles and concepts of relational databases, which is why it is called NoSQL (commonly expanded as 'Not Only SQL'). I would say that NoSQL sacrifices these principles and concepts to provide performance and data scalability.

NoSQL says: take care of ACID in your client code and, as a compromise, I will give you the performance :)

What is Cassandra?


Cassandra is a NoSQL database, and it is not a master-slave database, which means all the nodes in a Cassandra cluster are the same. It is a peer-to-peer distributed database with a masterless architecture. (FYI, throughout this blog, 'node' means a Cassandra node.)

In master-slave databases like MongoDB or HBase, there can be downtime if the master goes down, because we need to wait for the next master to come up. That is not the case in Cassandra. It has no special nodes, i.e. the cluster has no masters, no slaves, and no elected leaders. This enables Cassandra to be highly available with no single point of failure, and this is the reason it supports 'A' in CAP.

As mentioned, it is a distributed database system, which means a single logical database is spread across a cluster of nodes, so the data has to be spread evenly among all participating nodes. Cassandra divides the data evenly across its cluster, and each node is responsible for part of the data. Because each node can keep serving the data it holds even when it cannot reach some of its peers, Cassandra is able to support 'P' in CAP. So it is a database that supports AP.

This answers the question of why we need Cassandra in our application: an application that demands zero downtime needs a masterless architecture, and we have Cassandra for that :) In simple words, writes and reads can happen on any node in the cluster at any point in time. The example below shows a sample cluster formation in Cassandra for a 5-node setup.
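
The idea of "each node is responsible for part of the data" can be sketched as a token ring in a few lines of Python. This is only an illustration under my own assumptions: the class name `TokenRing`, the node names, and the use of MD5 as the hash are stand-ins chosen for simplicity, not Cassandra's actual partitioner or API.

```python
import bisect
import hashlib


class TokenRing:
    """Toy token ring: each node owns several tokens on a hash ring."""

    def __init__(self, nodes, tokens_per_node=64):
        # Every node gets several pseudo-random positions (tokens) on the ring.
        self._ring = sorted(
            (self._token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(tokens_per_node)
        )
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(key):
        # Stand-in hash for illustration: first 8 bytes of MD5 as an integer.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def owner(self, partition_key):
        # The owner is the node holding the first token at or after the key's
        # token, wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._tokens, self._token(partition_key))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = TokenRing(["node1", "node2", "node3", "node4", "node5"])
    counts = {}
    for i in range(10_000):
        node = ring.owner(f"user-{i}")
        counts[node] = counts.get(node, 0) + 1
    print(counts)  # each node should end up with a roughly similar share
```

Because the owner is computed from the key alone, any node can work out where a row lives without asking a master.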



Another highlight of Cassandra is that it delivers better workload performance as the number of nodes increases. The example below demonstrates this.

Two Nodes



Four Nodes

As per Cassandra, if two nodes can handle 100K transactions per second, four nodes can handle 200K transactions per second, and so on. In other words, throughput grows roughly linearly with the number of nodes.
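
The arithmetic behind this claim can be written as a tiny Python sketch. It assumes perfectly linear scaling from the two-node, 100K figure quoted above, which is an idealization; the function name and default values are mine and only for illustration.

```python
def estimated_throughput(nodes, baseline_nodes=2, baseline_tps=100_000):
    """Estimate transactions per second assuming ideal linear scaling."""
    if nodes <= 0 or baseline_nodes <= 0:
        raise ValueError("node counts must be positive")
    # Each node is assumed to contribute an equal share of the throughput.
    per_node_tps = baseline_tps / baseline_nodes
    return per_node_tps * nodes


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, "nodes:", estimated_throughput(n), "transactions per second")
```

Real clusters will not scale this cleanly, so treat the numbers as an upper bound rather than a promise.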