Sunday, 24 December 2017

Analytics in Cassandra



One of the most frequently asked questions about any database is whether it supports analytics.
It all comes down to the question: how am I going to consolidate or aggregate my data?

To see how Cassandra responds to this, I would like to demonstrate and explain one analytical use case.

Use Case :

I would say this is one of the most common use cases.

Let's take a scenario in which there are N nodes, each of which collects its own statistics and updates them in Cassandra periodically — say, every one minute.
As the use case, let's aggregate one of the collected statistics, both at the node level and at the cluster level.



Data Model for this use case :

I designed the data model for this use case with two tables:

  • nodedata - to collect the statistics from every node.
  • clusterdata - to collect the consolidated statistics from all the nodes in the cluster.



To make it understandable, this is how it looks at the database level:


Table 1 : nodedata
Table Description : This collects the statistics of each node
Table Schema :

CREATE TABLE nodedata (
    nodeip text,
    timestamp timestamp,
    flashmode text,
    physicalusage int,
    readbw int,
    readiops int,
    totalcapacity int,
    writebw int,
    writeiops int,
    writelatency int,
    PRIMARY KEY (nodeip, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
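Because the clustering order is timestamp DESC, the rows for a node are stored newest-first, so fetching the latest one-minute sample is cheap. A hypothetical query for one of the test nodes could look like this:

```sql
-- Latest statistics row for a single node; rows are stored newest-first,
-- so LIMIT 1 returns the most recent one-minute sample.
SELECT timestamp, readiops, writeiops, totalcapacity
FROM nodedata
WHERE nodeip = '172.30.56.60'
LIMIT 1;
```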


Table 2 : clusterdata
Table Description : This collects the total (aggregated) statistics of all the nodes
Table Schema :

CREATE TABLE clusterdata (
    clustername text,
    timestamp timestamp,
    flashmode text,
    physicalusage int,
    readbw int,
    readiops int,
    totalcapacity int,
    writebw int,
    writeiops int,
    writelatency int,
    PRIMARY KEY (clustername, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
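To make the write path concrete, here is a hypothetical pair of one-minute inserts — one per-node sample and the matching consolidated cluster row. The values and the cluster name are illustrative only, not taken from the test run:

```sql
-- One per-minute sample from a node (values are made up for illustration).
INSERT INTO nodedata (nodeip, timestamp, flashmode, physicalusage, readbw,
                      readiops, totalcapacity, writebw, writeiops, writelatency)
VALUES ('172.30.56.60', '2017-12-24 10:00:00', 'on', 40, 120, 1500, 1000, 80, 900, 5);

-- The aggregated row for the same minute, written after summing all nodes.
INSERT INTO clusterdata (clustername, timestamp, flashmode, physicalusage, readbw,
                         readiops, totalcapacity, writebw, writeiops, writelatency)
VALUES ('cluster1', '2017-12-24 10:00:00', 'on', 120, 360, 4500, 3000, 240, 2700, 5);
```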


Test Run Results :

I tried different data counts and observed the results with different aggregation logics:
  • Max - finds the maximum value of a statistic.
  • Sum - computes the summation of a statistic.
  • Average - computes the sum of a statistic and divides it by the total number of entries.
I performed this test run with a 3-node setup:
Node 1 - 172.30.56.60, Node 2 - 172.30.56.61 and Node 3 - 172.30.56.62

1) For collecting the node-level statistics, the following queries were used on one of the statistics collected for a node:

  • SELECT max(readiops) FROM nodedata WHERE nodeip = '172.30.56.60';
  • SELECT sum(readiops) FROM nodedata WHERE nodeip = '172.30.56.60';
  • SELECT avg(readiops) FROM nodedata WHERE nodeip = '172.30.56.60';

Node level statistics:

Data Count           Read Timeout (ms)   max(readiops) (ms)   sum(readiops) (ms)   avg(readiops) (ms)
1440 (1 day)         5000 (default)      149                  299                  365
10080 (1 week)       5000 (default)      636                  737                  878
44640 (1 month)      5000 (default)      944                  1011                 1211
100000 (70 days)     5000 (default)      1462                 1471                 1563
200000 (140 days)    5000 (default)      2512                 2701                 3003
300000 (210 days)    10000               6202                 6119                 6255
400000 (280 days)    10000               7212                 8715                 8985
500000 (348 days)    10000               8414                 8915                 9279

2) For collecting the cluster-level statistics, the following query is run once per node, on one of the collected statistics:

SELECT * FROM nodedata WHERE nodeip = '<nodeip>' AND timestamp >= '<fromTime>' AND timestamp <= '<toTime>'

        where nodeip iterates over '172.30.56.60', '172.30.56.61' and '172.30.56.62'

              and [fromTime, toTime] covers the range, one minute at a time
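The per-node rows are then consolidated on the client side before being written into clusterdata. A minimal sketch of that consolidation step in Python, using plain dictionaries as stand-ins for driver result rows (only the node IPs and column names come from this post; the values are illustrative):

```python
# Client-side consolidation of per-node rows into one cluster-level row.
# In the real setup each inner list would come from the per-node SELECT above;
# here the rows are hard-coded stand-ins for one minute of samples.
node_rows = {
    "172.30.56.60": [{"readiops": 1500, "writeiops": 900}],
    "172.30.56.61": [{"readiops": 1200, "writeiops": 800}],
    "172.30.56.62": [{"readiops": 1800, "writeiops": 700}],
}

def consolidate(rows_by_node, fields):
    """Sum each statistic across all nodes for the same minute."""
    totals = {field: 0 for field in fields}
    for rows in rows_by_node.values():
        for row in rows:
            for field in fields:
                totals[field] += row[field]
    return totals

cluster_row = consolidate(node_rows, ["readiops", "writeiops"])
print(cluster_row)  # {'readiops': 4500, 'writeiops': 2400}
```

The resulting totals would then be inserted into clusterdata for that minute's timestamp.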






Number of Data Points   Execution Time (ms)
1 (1 min)               260
1440 (1 day)            8906
10080 (1 week)          46404
44640 (1 month)         203514
100000 (70 days)        464646
200000 (140 days)       789316
300000 (210 days)       1111819
500000 (348 days)       1852100



Test Run Observations :

Points to highlight :

i) This test (performance of the statistical tables) was performed with a minimal system configuration: 1 CPU and 4 GB RAM, of which 1 GB was allocated to Cassandra.

ii) This test was performed under the maximum stress of collecting statistics every minute on all 3 nodes.

Observation :

Since it supports 5 months of data even with this minimal system configuration, this data model fits the use case. If we plan to support one year of statistics for aggregation, we need to raise the read timeout to 10000 ms, and we can see that it still scales.

 

User Defined Functions :

You won't believe me, but Cassandra provides a way to write our own aggregation logic. Max, Sum and Average are the built-in aggregate functions in Cassandra, but if we want to write our own, that is also possible.
I would like to explain a few that I have written.

1) To enable user defined functions, we need to turn this feature on in the cassandra.yaml file.
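In Cassandra 3.x this is a single flag in cassandra.yaml (it defaults to false):

```yaml
# cassandra.yaml - allow CREATE FUNCTION / CREATE AGGREGATE
enable_user_defined_functions: true
```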


2) Then write the function and the aggregate that uses it.

I have written user defined functions for the 'Max' and 'Sum' operations in Cassandra.
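The original screenshots are not reproduced here, but a user defined 'Sum' along the same lines could look like this sketch — the function and aggregate names (fsum, mysum) are my own; the Java body is the standard state-function pattern:

```sql
-- State function: adds each row's value to the running state.
CREATE OR REPLACE FUNCTION fsum (state int, val int)
    RETURNS NULL ON NULL INPUT
    RETURNS int
    LANGUAGE java
    AS 'return state + val;';

-- Aggregate: starts at 0 and applies fsum row by row.
CREATE OR REPLACE AGGREGATE mysum (int)
    SFUNC fsum
    STYPE int
    INITCOND 0;

-- Usage against the nodedata table from above:
SELECT mysum(totalcapacity) FROM nodedata WHERE nodeip = '172.30.56.60';
```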



We can see that I have written a function for summation and wrapped it in an aggregate; when I perform the summation on a particular column, the 'totalcapacity' statistic, it computes the sum and produces the result.

I would say it is one of the awesometic features in Cassandra: it lets us write our own logic for performing aggregation.


I believe it has been a good journey, exploring as much of Cassandra as possible.
In my working experience with NoSQL databases, Cassandra is one of the most trending databases, designed with a unique architecture.
I would say this uniqueness and intelligence made me explore it and write about it.

Thank you for travelling with me through my posts. Meet you in the next one :)
