One of the most common questions asked of any database is how well it supports analytics. It usually boils down to: how am I going to consolidate or aggregate my data?
To see how Cassandra answers this, I would like to demonstrate and explain one analytical use case.
Use Case :
I would say this is one of the most common use cases.
Let's take a scenario in which there are N nodes, each of which collects its own statistics and updates them periodically in Cassandra, say once every minute.
As the use case, let's aggregate one of the collected statistics per node, and also collectively at the cluster level.
Data Model for this use case :
I designed two tables for this use case:
nodedata - collects the statistics from every node.
clusterdata - collects the consolidated statistics from all the nodes in the cluster.
To make it clear, this is how it looks at the database level.
Table1 : nodedata
Table Description : This collects the statistics of each node
Table Schema :
CREATE TABLE nodedata (
    nodeip text,
    timestamp timestamp,
    flashmode text,
    physicalusage int,
    readbw int,
    readiops int,
    totalcapacity int,
    writebw int,
    writeiops int,
    writelatency int,
    PRIMARY KEY (nodeip, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
Table2 : clusterdata
Table Description : This collects the total (aggregated) statistics of all the nodes
Table Schema :
CREATE TABLE clusterdata (
    clustername text,
    timestamp timestamp,
    flashmode text,
    physicalusage int,
    readbw int,
    readiops int,
    totalcapacity int,
    writebw int,
    writeiops int,
    writelatency int,
    PRIMARY KEY (clustername, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
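For reference, each node's once-a-minute update is a single write into nodedata. A sketch of such a write looks like this (the timestamp and statistic values shown here are illustrative, not from the test run):

```sql
INSERT INTO nodedata (nodeip, timestamp, flashmode, physicalusage,
                      readbw, readiops, totalcapacity, writebw,
                      writeiops, writelatency)
VALUES ('172.30.56.60', '2018-05-01 10:01:00', 'on', 40,
        120, 1500, 1000, 80, 900, 5);
```

Because nodeip is the partition key and timestamp is the clustering key (in descending order), each node's samples sit in one partition, newest first.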
Test Run Results :
I tried different data counts and observed the results with different aggregation functions:
- Max - finds the maximum value of a statistic.
- Sum - adds up all the values of a statistic.
- Average - divides the sum of all the values by their count.
Node 1 - 172.30.56.60, Node 2 - 172.30.56.61 and Node 3 - 172.30.56.62
1) For collecting the node-level statistics, the following queries are used on one of the statistics collected for a node:
- SELECT max(readiops) FROM nodedata WHERE nodeip = '172.30.56.60';
- SELECT sum(readiops) FROM nodedata WHERE nodeip = '172.30.56.60';
- SELECT avg(readiops) FROM nodedata WHERE nodeip = '172.30.56.60';
Node level statistics:

| Data Count (rows)  | Read Timeout (ms) | max(readiops) in ms | sum(readiops) in ms | avg(readiops) in ms |
|--------------------|-------------------|---------------------|---------------------|---------------------|
| 1440 (1 day)       | 5000 (default)    | 149                 | 299                 | 365                 |
| 10080 (1 week)     | 5000 (default)    | 636                 | 737                 | 878                 |
| 44640 (1 month)    | 5000 (default)    | 944                 | 1011                | 1211                |
| 100000 (70 days)   | 5000 (default)    | 1462                | 1471                | 1563                |
| 200000 (140 days)  | 5000 (default)    | 2512                | 2701                | 3003                |
| 300000 (210 days)  | 10000             | 6202                | 6119                | 6255                |
| 400000 (280 days)  | 10000             | 7212                | 8715                | 8985                |
| 500000 (348 days)  | 10000             | 8414                | 8915                | 9279                |
2) For collecting the cluster-level statistics, the following query is run once per node on one of the collected statistics:
SELECT * FROM nodedata WHERE nodeip = '<nodeip>' AND timestamp >= <fromTime> AND timestamp <= <toTime>;
where <nodeip> is each of 172.30.56.60, 172.30.56.61 and 172.30.56.62, and the timestamp range covers the collection window in one-minute intervals.
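Concretely, the cluster-level flow is: fetch the window from each node's partition, sum the per-node values in the client, and write the consolidated row into clusterdata. A sketch (the cluster name 'cluster1', the window, and the summed values are illustrative):

```sql
-- Fetch one window per node; repeat for each of the three node IPs
SELECT * FROM nodedata
WHERE nodeip = '172.30.56.60'
  AND timestamp >= '2018-05-01 00:00:00'
  AND timestamp <= '2018-05-01 00:01:00';

-- After summing the three per-node results in the client,
-- write the consolidated row into the cluster-level table
INSERT INTO clusterdata (clustername, timestamp, flashmode, physicalusage,
                         readbw, readiops, totalcapacity, writebw,
                         writeiops, writelatency)
VALUES ('cluster1', '2018-05-01 00:01:00', 'on', 120,
        360, 4500, 3000, 240, 2700, 5);
```

The summation happens client-side because each SELECT targets a different partition, and CQL aggregates cannot span partitions in a single query here.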
| Number of Rows     | Execution Time (ms) |
|--------------------|---------------------|
| 1 (1 min)          | 260                 |
| 1440 (1 day)       | 8906                |
| 10080 (1 week)     | 46404               |
| 44640 (1 month)    | 203514              |
| 100000 (70 days)   | 464646              |
| 200000 (140 days)  | 789316              |
| 300000 (210 days)  | 1111819             |
| 500000 (348 days)  | 1852100             |
Test Run Observations :
Points to highlight :
i) This test (performance of the statistical table) was performed with a minimal system configuration: 1 CPU and 4 GB RAM, of which 1 GB is allocated to Cassandra.
ii) This test was performed under maximum stress, collecting statistics every minute on all 3 nodes.
Observation :
Since this model supports about 5 months of data within the default read timeout on that minimal configuration, it fits the use case. If we plan to support a full year of statistics for aggregation, we need to raise the read timeout to 10000 ms, and we can see that it still scales.
User Defined Functions :
You may not believe me, but Cassandra provides a way to write our own aggregation logic. Max, Sum and Average are the built-in functions, but if we want to write our own, Cassandra makes it possible.
I would like to explain a few that I wrote.
1) To enable user defined functions, we need to turn the feature on in the cassandra.yaml file, like below:
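In cassandra.yaml, the relevant flag is disabled by default (UDFs run user-supplied code inside the Cassandra daemon), so it must be switched on and the node restarted:

```yaml
# cassandra.yaml
enable_user_defined_functions: true
```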
2) Then write the function, and the aggregate that calls it, as shown below.
I have written user defined functions for the 'Max' and 'Sum' operations in Cassandra.
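A minimal sketch of such definitions, assuming int statistics; the function and aggregate names (sumstate, mysum, maxstate, mymax) are mine, chosen for illustration:

```sql
-- State function for summation: called once per row, skips nulls
CREATE OR REPLACE FUNCTION sumstate(state int, val int)
    CALLED ON NULL INPUT
    RETURNS int
    LANGUAGE java
    AS 'if (val != null) state = state + val; return state;';

-- Aggregate that drives the state function across the selected rows
CREATE OR REPLACE AGGREGATE mysum(int)
    SFUNC sumstate
    STYPE int
    INITCOND 0;

-- State function for a custom max: keeps the larger of state and val
CREATE OR REPLACE FUNCTION maxstate(state int, val int)
    CALLED ON NULL INPUT
    RETURNS int
    LANGUAGE java
    AS 'if (val != null && (state == null || val > state)) state = val; return state;';

CREATE OR REPLACE AGGREGATE mymax(int)
    SFUNC maxstate
    STYPE int;
```

Once created, they can be used like the built-ins, for example: SELECT mysum(totalcapacity) FROM nodedata WHERE nodeip = '172.30.56.60';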
We can see that I have written a summation function, with an aggregate that calls it, and when I run the summation over the 'totalcapacity' column it computes the sum and produces the result.
I would say this is one of the most awesome features in Cassandra: it lets us write our own aggregation logic.
I believe it has been a good journey, exploring as much of Cassandra as possible.
In my working experience with NoSQL databases, Cassandra is one of the most popular, and it is designed with a unique architecture. That uniqueness and intelligence are what made me explore this database and write about it.