Sunday, 24 December 2017

Analytics in Cassandra



One of the most frequently asked questions about any database is whether it supports analytics.
It all comes down to the question: how am I going to consolidate or aggregate my data?

To see how Cassandra responds to this, I would like to demonstrate and explain one analytical use case.

Use Case :

I would say this is one of the most common use cases.

Let's take a scenario in which there are N nodes, each of which collects its own statistics and updates them in Cassandra periodically — say, every one minute.
As the use case, let's aggregate one of the collected statistics, both at the node level and at the cluster level.



Data Model for this use case :

I designed the data model for this use case with two tables:

  • nodedata - to collect the statistics from every node.
  • clusterdata - to collect the consolidated statistics from all the nodes in the cluster.



To make it understandable, this is how it looks at the database level:


Table 1 : nodedata
Table Description : This collects the statistics of each node
Table Schema :

CREATE TABLE nodedata (
    nodeip text,
    timestamp timestamp,
    flashmode text,
    physicalusage int,
    readbw int,
    readiops int,
    totalcapacity int,
    writebw int,
    writeiops int,
    writelatency int,
    PRIMARY KEY (nodeip, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
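Because the clustering order is timestamp DESC, the rows for a node are stored newest-first, so fetching the latest one-minute sample is cheap. A hypothetical query for one of the test nodes could look like this:

```sql
-- Latest statistics row for a single node; rows are stored newest-first,
-- so LIMIT 1 returns the most recent one-minute sample.
SELECT timestamp, readiops, writeiops, totalcapacity
FROM nodedata
WHERE nodeip = '172.30.56.60'
LIMIT 1;
```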


Table 2 : clusterdata
Table Description : This collects the total (aggregated) statistics of all the nodes
Table Schema :

CREATE TABLE clusterdata (
    clustername text,
    timestamp timestamp,
    flashmode text,
    physicalusage int,
    readbw int,
    readiops int,
    totalcapacity int,
    writebw int,
    writeiops int,
    writelatency int,
    PRIMARY KEY (clustername, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
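To make the write path concrete, here is a hypothetical pair of one-minute inserts — one per-node sample and the matching consolidated cluster row. The values and the cluster name are illustrative only, not taken from the test run:

```sql
-- One per-minute sample from a node (values are made up for illustration).
INSERT INTO nodedata (nodeip, timestamp, flashmode, physicalusage, readbw,
                      readiops, totalcapacity, writebw, writeiops, writelatency)
VALUES ('172.30.56.60', '2017-12-24 10:00:00', 'on', 40, 120, 1500, 1000, 80, 900, 5);

-- The aggregated row for the same minute, written after summing all nodes.
INSERT INTO clusterdata (clustername, timestamp, flashmode, physicalusage, readbw,
                         readiops, totalcapacity, writebw, writeiops, writelatency)
VALUES ('cluster1', '2017-12-24 10:00:00', 'on', 120, 360, 4500, 3000, 240, 2700, 5);
```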


Test Run Results :

I tried different data counts and observed the results with different aggregation logics:
  • Max - finds the maximum value of a statistic.
  • Sum - computes the summation of a statistic.
  • Average - computes the sum of a statistic and divides it by the total number of entries.
I performed this test run with a 3-node setup:
Node 1 - 172.30.56.60, Node 2 - 172.30.56.61 and Node 3 - 172.30.56.62

1) For collecting the node-level statistics, the following queries were used on one of the statistics collected for a node:

  • SELECT max(readiops) FROM nodedata WHERE nodeip = '172.30.56.60';
  • SELECT sum(readiops) FROM nodedata WHERE nodeip = '172.30.56.60';
  • SELECT avg(readiops) FROM nodedata WHERE nodeip = '172.30.56.60';

Node level statistics:

Data Count           Read Timeout (ms)   max(readiops) (ms)   sum(readiops) (ms)   avg(readiops) (ms)
1440 (1 day)         5000 (default)      149                  299                  365
10080 (1 week)       5000 (default)      636                  737                  878
44640 (1 month)      5000 (default)      944                  1011                 1211
100000 (70 days)     5000 (default)      1462                 1471                 1563
200000 (140 days)    5000 (default)      2512                 2701                 3003
300000 (210 days)    10000               6202                 6119                 6255
400000 (280 days)    10000               7212                 8715                 8985
500000 (348 days)    10000               8414                 8915                 9279

2) For collecting the cluster-level statistics, the following query is run once per node, on one of the collected statistics:

SELECT * FROM nodedata WHERE nodeip = '<nodeip>' AND timestamp >= '<fromTime>' AND timestamp <= '<toTime>'

        where nodeip iterates over '172.30.56.60', '172.30.56.61' and '172.30.56.62'

              and [fromTime, toTime] covers the range, one minute at a time
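The per-node rows are then consolidated on the client side before being written into clusterdata. A minimal sketch of that consolidation step in Python, using plain dictionaries as stand-ins for driver result rows (only the node IPs and column names come from this post; the values are illustrative):

```python
# Client-side consolidation of per-node rows into one cluster-level row.
# In the real setup each inner list would come from the per-node SELECT above;
# here the rows are hard-coded stand-ins for one minute of samples.
node_rows = {
    "172.30.56.60": [{"readiops": 1500, "writeiops": 900}],
    "172.30.56.61": [{"readiops": 1200, "writeiops": 800}],
    "172.30.56.62": [{"readiops": 1800, "writeiops": 700}],
}

def consolidate(rows_by_node, fields):
    """Sum each statistic across all nodes for the same minute."""
    totals = {field: 0 for field in fields}
    for rows in rows_by_node.values():
        for row in rows:
            for field in fields:
                totals[field] += row[field]
    return totals

cluster_row = consolidate(node_rows, ["readiops", "writeiops"])
print(cluster_row)  # {'readiops': 4500, 'writeiops': 2400}
```

The resulting totals would then be inserted into clusterdata for that minute's timestamp.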






Number of Data Points   Execution Time (ms)
1 (1 min)               260
1440 (1 day)            8906
10080 (1 week)          46404
44640 (1 month)         203514
100000 (70 days)        464646
200000 (140 days)       789316
300000 (210 days)       1111819
500000 (348 days)       1852100



Test Run Observations :

Points to highlight :

i) This test (performance of the statistical tables) was performed with a minimal system configuration: 1 CPU and 4 GB RAM, of which 1 GB was allocated to Cassandra.

ii) This test was performed under the maximum stress of collecting statistics every minute on all 3 nodes.

Observation :

Since it supports 5 months of data even with this minimal system configuration, this data model fits the use case. If we plan to support one year of statistics for aggregation, we need to raise the read timeout to 10000 ms, and we can see that it still scales.

 

User Defined Functions :

You won't believe me, but Cassandra provides a way to write our own aggregation logic. Max, Sum and Average are the built-in aggregate functions in Cassandra, but if we want to write our own, that is also possible.
I would like to explain a few that I have written.

1) To enable user defined functions, we need to turn this feature on in the cassandra.yaml file.
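In Cassandra 3.x this is a single flag in cassandra.yaml (it defaults to false):

```yaml
# cassandra.yaml - allow CREATE FUNCTION / CREATE AGGREGATE
enable_user_defined_functions: true
```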


2) Then write the function and the aggregate that uses it.

I have written user defined functions for the 'Max' and 'Sum' operations in Cassandra.
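The original screenshots are not reproduced here, but a user defined 'Sum' along the same lines could look like this sketch — the function and aggregate names (fsum, mysum) are my own; the Java body is the standard state-function pattern:

```sql
-- State function: adds each row's value to the running state.
CREATE OR REPLACE FUNCTION fsum (state int, val int)
    RETURNS NULL ON NULL INPUT
    RETURNS int
    LANGUAGE java
    AS 'return state + val;';

-- Aggregate: starts at 0 and applies fsum row by row.
CREATE OR REPLACE AGGREGATE mysum (int)
    SFUNC fsum
    STYPE int
    INITCOND 0;

-- Usage against the nodedata table from above:
SELECT mysum(totalcapacity) FROM nodedata WHERE nodeip = '172.30.56.60';
```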



We can see that I have written a function for summation and wrapped it in an aggregate; when I perform the summation on a particular column, the 'totalcapacity' statistic, it computes the sum and produces the result.

I would say it is one of the awesometic features in Cassandra: it lets us write our own logic for performing aggregation.


I believe it has been a good journey, exploring as much of Cassandra as possible.
In my working experience with NoSQL databases, Cassandra is one of the most trending databases, designed with a unique architecture.
I would say this uniqueness and intelligence made me explore it and write about it.

Thank you for travelling with me through my posts. Meet you in the next one :)
