This is the continuation of Building Minimized Big Data Infrastructure. This time the focus is more on data modelling for Cassandra. Cassandra is a column-based data store. What the h*** does that mean? Hold on, guys. For those of you with a strong background in RDBMS or conventional databases, this is very different. Make no mistake, not all NoSQL databases are column-based. Some, like MongoDB, are document-based instead. Another example of a column-based NoSQL database is Apache HBase.
Most RDBMS like MySQL, SQL Server, etc., are row-based databases. This means that every new record must have a unique key and will be stored as a new row in the table. A new record whose key is not unique will be rejected.
This is not the case for a column-based database. Data still needs a key for indexing, but we can add new data under a non-unique key. The new data will be stored in the same row, as a new column. So instead of adding a new row like a row-based database, a column-based database adds a new column. This way we can have rows within the same table that have different numbers of columns, whereas in a row-based database the number of columns is fixed and the same across all rows in a table. Column-based databases use this strategy to make reading and writing data faster.
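To make that concrete, here is a minimal sketch in plain Python (no Cassandra involved) that models a column-based table as a map of row keys to columns. All the names and values are made up for illustration:

```python
# A column-based table modeled as: row key -> {column name: value}.
# Each row can hold a different number of columns.
table = {}

def insert(row_key, column_name, value):
    # A write under an existing row key is not rejected; it lands in
    # the same row as a new column.
    table.setdefault(row_key, {})[column_name] = value

insert("sensor-1", "2016-01-01T00:00:00", 21.5)
insert("sensor-1", "2016-01-01T00:00:01", 21.6)  # same row, new column
insert("sensor-2", "2016-01-01T00:00:00", 19.0)

print(len(table["sensor-1"]))  # 2 columns
print(len(table["sensor-2"]))  # 1 column -> rows differ in width
```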
Cassandra has a limit on how big one row can be. If there are too many columns and too much data in one row, queries that read it will become very painful. Based on several experiences, the optimal size of a row should not exceed 10 MB. Beyond that, queries become very slow, and past 100 MB they will eventually crash.
Imagine an Internet of Things system that takes data from thermostats every second. If we only put the thermostat ID as the key and the timestamp of the data as the column, a new column will be created for every single reading, and the number of columns will grow like crazy. To avoid that, we need a strategy for storing time-series data.
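Here is a sketch of what that naive model might look like in CQL, issued through the DataStax Python driver. The keyspace, table, and column names are hypothetical, and a local Cassandra node is assumed:

```python
from cassandra.cluster import Cluster

# Assumes a Cassandra node on localhost and an existing keyspace "iot";
# all names here are made up for illustration.
session = Cluster(["127.0.0.1"]).connect("iot")

# Naive model: the thermostat ID alone is the partition key, so every
# reading a device ever sends piles into one ever-growing row.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings_naive (
        device_id   text,
        ts          timestamp,
        temperature double,
        PRIMARY KEY (device_id, ts)
    )
""")
```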
A common strategy is to use a composite key. In our Internet of Things case above, we make the key a composite of thermostat ID + date, and this composite key becomes part of the primary key. So the new primary key would be the composite key (device ID + date) plus the timestamp of the data. This caps a row at 86,400 columns (one reading per second for a day). Every day a new row is created for each device, and that row only grows with the data inserted on that day.
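A sketch of the bucketed version of the same table, with the date folded into the partition key (same assumptions and hypothetical names as before):

```python
from datetime import datetime
from cassandra.cluster import Cluster

# Same assumptions as the previous sketch: local node, keyspace "iot".
session = Cluster(["127.0.0.1"]).connect("iot")

# Bucketed model: the partition key is the composite (device_id, date),
# so each device gets a fresh row every day. The timestamp is the
# clustering key that orders the columns inside that row.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings_by_day (
        device_id   text,
        date        text,       -- the daily bucket, e.g. '2016-01-01'
        ts          timestamp,
        temperature double,
        PRIMARY KEY ((device_id, date), ts)
    )
""")

# One reading per second means at most 86,400 columns in one daily row.
session.execute(
    "INSERT INTO readings_by_day (device_id, date, ts, temperature) "
    "VALUES (%s, %s, %s, %s)",
    ("thermostat-42", "2016-01-01", datetime(2016, 1, 1, 0, 0, 1), 21.5),
)
```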
Although 86,400 columns seems like a lot, our limit is not the number of columns but the size of the row. The goal is to keep each row under 10 MB. If the size is still above 10 MB, the composite key needs to be revised: lower the time granularity from date to hour, so every row holds data from a thermostat device for one hour instead of one day. A simple calculation:
size of each reading × maximum number of columns in a row = size of a row
If the result is bigger than 10 MB, change the composite key to a smaller time granularity (for example, hour or minute).
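To put numbers on the formula, here is a quick back-of-the-envelope check. The per-reading sizes are assumptions purely for illustration; measure your own payload:

```python
# Assumed size of one stored reading, in bytes -- measure yours.
reading_size_bytes = 100

def row_size_mb(columns_per_row):
    return reading_size_bytes * columns_per_row / (1024 * 1024)

# Daily bucket: one reading per second = 86,400 columns per row.
print(round(row_size_mb(86400), 2))  # ~8.24 MB -> fits under 10 MB

# If each reading were 200 bytes instead, the daily row would be too
# big, so we would drop to hourly buckets (3,600 columns per row).
reading_size_bytes = 200
print(round(row_size_mb(86400), 2))  # ~16.48 MB -> too big
print(round(row_size_mb(3600), 2))   # ~0.69 MB  -> fine
```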
Hope this helps 🙂