Time Series Data Modelling in Cassandra (Building Minimized Big Data Infrastructure Part-2)

This is the continuation of Building Minimized Big Data Infrastructure. This time the focus is on data modelling for Cassandra. Cassandra is a column-based data store. What the h*** does that mean? Hold on, guys. For those of you with a strong background in RDBMS or conventional databases, this is very different. Make no mistake, not every NoSQL database is column-based. Some databases, like MongoDB, are not column-based. Another example of a column-based NoSQL database is Apache HBase.

[Image: cassandra]

Most RDBMSs, like MySQL and SQL Server, are row-based databases. This means every new record must have a unique key, and it is stored as a new row in the table. New records that don't have a unique key are rejected.

This is not the case for a column-based database. Data still needs a key for indexing, but we can add new data under a non-unique key. The new data is stored in the same row, in a new column. So instead of adding a new row like a row-based database, a column-based database adds a new column. This way we can have rows within the same table that have different numbers of columns, whereas in a row-based database the number of columns is fixed and the same across all rows of a table. Column-based databases use this strategy to make reading and writing data faster.

Cassandra has a practical limit on how big one row can be. If there are too many columns and too much data in one row, then it becomes very painful to query that data. Based on several experiences, the optimum size for a row should not exceed 10 MB. Beyond that, queries become very slow, and above 100 MB they will eventually crash.

Imagine an Internet of Things system that takes data from thermostats every second. If we only use the thermostat ID as the key and the timestamp of the data as the column, then every new reading coming from a thermostat creates a new column in that thermostat's row. The number of columns will increase like crazy. To avoid this, a proper strategy for storing time series data needs to be implemented (a sketch of this naive model follows the figure below).

[Figure: single_key]
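
To make the problem concrete, here is a minimal sketch (not from the original post) of that naive single-key table, created with the DataStax Java driver. The keyspace `iot`, the table and column names, and the contact point are all made up for illustration.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class NaiveThermostatSchema {
    public static void main(String[] args) {
        // Connect to a local Cassandra node (the address is an assumption).
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS iot WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 1}");

        // Naive model: the device id alone is the partition key, so every reading
        // a thermostat ever sends ends up as a new column in the same row.
        session.execute("CREATE TABLE IF NOT EXISTS iot.thermostat_readings ("
                + "  device_id   text,"
                + "  event_time  timestamp,"
                + "  temperature double,"
                + "  PRIMARY KEY (device_id, event_time))");

        cluster.close();
    }
}
```

With this model, one row per device grows forever, which is exactly the unbounded-row problem described above.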

A common strategy is to use a composite key. In our Internet of Things example above, we build a composite partition key consisting of the thermostat ID plus the date, and this composite key is part of the primary key. So the new primary key is the composite key (device ID + date) plus the timestamp of the data. With one reading per second, this strategy caps the number of columns at 86,400 per row: every day a new row is created for each device, and columns are only added for the data inserted on that day (see the sketch after the figure below).

[Figure: composite_key]
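
Here is the same table reworked around the composite key, again only a sketch with made-up names; the `day` column is simply a day-bucket string such as '2015-01-20' that the application fills in on every insert.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class BucketedThermostatSchema {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("iot"); // assumes the keyspace from the previous sketch

        // Composite partition key (device_id, day): one row per device per day,
        // with event_time as the clustering column inside that row.
        session.execute("CREATE TABLE IF NOT EXISTS thermostat_readings_by_day ("
                + "  device_id   text,"
                + "  day         text,"       // day bucket, e.g. '2015-01-20'
                + "  event_time  timestamp,"
                + "  temperature double,"
                + "  PRIMARY KEY ((device_id, day), event_time))");

        // The application fills in the day bucket itself on every insert.
        session.execute("INSERT INTO thermostat_readings_by_day "
                + "(device_id, day, event_time, temperature) "
                + "VALUES ('thermostat-42', '2015-01-20', '2015-01-20 09:15:00', 21.5)");

        cluster.close();
    }
}
```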

Although 86,400 columns seems like a lot, our limit is not the number of columns but the size of the row. The goal is to keep each row under 10 MB. If the size is still above 10 MB, then the composite key needs to be revised: we lower the time granularity from a day to an hour, so every row holds data from one thermostat device for one hour instead of one day. The simple calculation is:

size of each data point × maximum number of columns in a row = size of a row

If the result is bigger than 10 MB, change the composite key and use a smaller time granularity (for example an hour or a minute).
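
As a worked example with a made-up payload size: if each reading takes roughly 200 bytes on disk, then a daily bucket at one reading per second holds 86,400 columns, or about 86,400 × 200 bytes ≈ 17 MB, which is over the 10 MB target. Switching the bucket from a day to an hour gives at most 3,600 columns per row, or roughly 0.7 MB, comfortably under the limit.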

Hope this helps 🙂

Building Minimized Big Data Infrastructure (Part 1)

This post is about my experience planning a big data infrastructure that is as minimal and as efficient as possible. The entire process of building the infrastructure will span several posts. This is the first one.

After extensive and thoughtful consideration, I started by laying down the core components as the groundwork of this infrastructure. The core components are the data storage and the data processing.

Knowing the type of data that will be flowing into the infrastructure, I came up with Cassandra. The data is structured, so HDFS from Hadoop is not required, at least not at the moment, since the requirements clearly state that everything should be as efficient as possible. After considering several data stores I decided to use Cassandra, chosen for its ease of use, scalability, and reliability.

[Image: cassandra]

The data processing component is the part that processes the data from the data storage to produce meaningful insights or information. For this part I chose Apache Spark. Indeed, there are many other options; Apache Storm is one of them. Apache Spark was chosen because it is the most complete package: it has a component for stream processing and also a machine learning component. Spark can also be easily integrated with Hadoop in case such a scenario is required in the future.

[Image: storm]

Once the core components are chosen, the next logical step is to connect them. Without interoperability they are useless. Luckily, there is a library to do just that: the spark-cassandra connector. It is quite convenient and easy to use, and it can pull data from Cassandra directly as a Spark RDD. A small Spark application was created to test this connection; a rough sketch of that kind of test app is shown below. For this step I followed this guy's steps.
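
For reference, here is a rough sketch of such a test application using the connector's Java API. It is not the original code: the keyspace and table are the hypothetical ones from the Cassandra post above, and the Cassandra host is an assumption.

```java
import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

public class SparkCassandraTest {
    public static void main(String[] args) {
        // Tell the connector where Cassandra lives (host is an assumption).
        SparkConf conf = new SparkConf()
                .setAppName("spark-cassandra-test")
                .set("spark.cassandra.connection.host", "127.0.0.1");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Pull a Cassandra table straight into Spark as an RDD of rows.
        JavaRDD<CassandraRow> readings =
                javaFunctions(sc).cassandraTable("iot", "thermostat_readings_by_day");

        System.out.println("readings stored in Cassandra: " + readings.count());
        sc.stop();
    }
}
```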

OK, sorry if this is too short. The next components will be covered in the next post: stream data processing and the messaging component. Stay tuned 🙂

Big Data in Sport Industry (Moneyball)

[Image: Moneyball movie poster]

This post is not about the technical side of Big Data. I really needed a break from technical stuff and a chance to pull back and see the big picture once more. It can be really motivating to see Big Data technology and its applications in real life. The industry I am going to talk about is the sports industry. It is a multi-billion-dollar industry, quite comparable with the entertainment industry in terms of value.

Moneyball. This is about the Brad Pitt movie. It is not that old a movie, and it is based on true events that happened in 2002. In short, this movie is about winning under unfair conditions. Billy Beane was the general manager of the Oakland Athletics, an underdog baseball team in the US. They had very little money to hire the top baseball players in the country, and the stakeholders refused to give the team more money. It is so unfair when big teams with big bucks easily rule the entire competition because they have the resources to hire the best and most expensive players in the league. Winning seemed impossible for the Oakland Athletics.

However, Billy Beane decided to see the game from a different angle than the conventional one. He did not judge a baseball team's capability by how many top-notch players it had. He met Peter Brand, an economics graduate from Yale, who had a method for assembling a high-quality team from undervalued players: he analyzed the statistics in every undervalued player's track record. It was a big bet, but Billy Beane had no other choice. If he had to hire undervalued players because of a very limited budget, he would rather hire the undervalued players that could form a winning team. The result was that the Oakland Athletics won 20 games in a row. It stunned many people in the sports industry. They called this technique Moneyball.

What Peter Brand did for the Oakland Athletics was Big Data analytics, plain and simple. Although many Big Data technologies had not yet surfaced at that time, I personally consider it Big Data analytics. Peter Brand managed to see what others didn't through statistics and historical data. They achieved what other teams, even the big ones, could not: assembling a winning team with very limited money. He found the holy grail through analytics.

Now, with many Big Data technologies established and mature, Moneyball is on steroids. Huge amounts of data are analyzed to get the maximum result with as little effort as possible. It is used not only in baseball but also in basketball, American football, and football (what Americans call soccer). I even suspect that Big Data analytics played a big role in Germany's victory in the last World Cup in Brazil. There were no superstar players in the German team like Messi, Ronaldo, or Neymar, yet they played great as a whole team. Too bad it is just a suspicion, since I have no access to their strategy.

In the Moneyball movie, what Peter Brand analyzed was limited to player data. With current technology, that data has expanded beyond imagination. It is not just player data but also data on coaches, managers, and entire historical teams. The data also includes the weather, city, time, date, and season, even the supporters and the population of the city where a game is held. All of this data is analyzed to find the perfect strategy, not just to assemble the best team but also to work out how to win each game.

A little out of scope, but the Moneyball strategy is now also being used in the human resources field, where HR people use it to find the best candidates for their companies. I hope this post helps 🙂

The Perfect (Apache) Storm

[Image: storm]

Happy New Year to all readers of this blog, whether you read my posts regularly or landed here unintentionally. I welcome you all. This is my first real post of the new year, 2015. There have been lots of storms and heavy rain here in Indonesia, which makes me think of another storm that is widely used in the Big Data world: Apache Storm.

My encounter with Apache Storm happened a few months ago when I was designing a GPS data ingestion pipeline. It seemed to be a perfect solution for my problem. Some big companies have been using it for years, e.g. Twitter. It claims to be able to handle thousands of messages per second, and Twitter uses it to process the thousands, if not millions, of tweets it receives every second. So this has got to be the best, no? Well, yes and no. From my experience, it really depends on your situation and conditions.

Storm is part of the Hadoop stack in some Hadoop distributions, such as Hortonworks, so managing it is very easy. It also integrates with many message queue technologies like AMQP brokers, Apache Kafka, and Kestrel. However, Storm's analytics and machine learning capabilities are not as good as Apache Spark's, and Spark also has a component that does what Storm does, called Spark Streaming.

Apache Storm's architecture can be looked at in two ways. The first is the physical architecture: the actual Storm components and other required software that we install on the server nodes. The first piece is software that Storm absolutely needs in order to work: Apache ZooKeeper. So make sure you have it before installing Storm. ZooKeeper can be installed on the same node as Storm. However, in cases where stateful Storm is needed, it is highly recommended to put ZooKeeper and Storm on different nodes, because ZooKeeper will be used to store the state.

The second component in the physical architecture is Nimbus. Nimbus acts as the master of the Storm cluster; the node where Nimbus is installed becomes Storm's master node, so to speak. Nimbus distributes code and work to all of Storm's worker nodes. The third component is the supervisor, which is installed on each worker node. The supervisor receives work from Nimbus and orders a worker process to do it. The fourth component is the worker. It resides on the same node as its supervisor and does the actual processing of the data. Workers execute the topology that was submitted to the Storm cluster before any data is processed.

Now, the logical architecture of Storm. This is basically the topology, where we define how the messages we receive are processed. It consists of three components. First is the stream of messages, or tuples: the messages we receive. Second is the spout, which is where the stream of tuples comes from. A spout normally interfaces with an external message source such as Apache Kafka or Kestrel. A spout does not do any heavy analytical work; in fact, it would be inappropriate to put analytical work in a spout. The spout then pushes the messages to bolts, the third component. A bolt's job is to process and analyze the incoming messages; this is where the real analytics happen. A bolt can push a message on to another bolt or save it to a data store. The arrangement of spouts and bolts is called a topology (a minimal wiring sketch follows the diagram below).

[Figure: topology]
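
To make the spout-and-bolt wiring concrete, here is a minimal topology sketch, not taken from the post itself. It uses Storm's built-in TestWordSpout as a stand-in for a real Kafka or Kestrel spout and a made-up bolt; the `backtype.storm` packages match the pre-1.0 Storm releases of that era (they became `org.apache.storm` later).

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class HashtagTopology {

    // A toy bolt: turns each incoming word into a "hashtag". Real analytics
    // (counting, enrichment, writing to a data store) would live in bolts like this.
    public static class HashtagBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            collector.emit(new Values("#" + tuple.getString(0)));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("hashtag"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Spout: where tuples enter the topology (here a test spout instead of Kafka/Kestrel).
        builder.setSpout("words", new TestWordSpout(), 2);
        // Bolt: processes the tuples emitted by the spout.
        builder.setBolt("hashtags", new HashtagBolt(), 4).shuffleGrouping("words");

        // Run in-process for a quick local test.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("hashtag-topology", new Config(), builder.createTopology());
        Thread.sleep(10000);
        cluster.shutdown();
    }
}
```

In a real deployment the topology would be submitted with StormSubmitter instead of LocalCluster, and Nimbus would distribute it to the supervisors and their workers.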

Basic Storm is stateless: we do not store any state from running the topology and consuming the tuples. That makes it unsuitable for processing that requires state, for example counting the hashtags in all the tweets we receive. The count for each hashtag changes every time we receive a tweet containing it, and somehow we need to store that number.

At first I thought Storm could not handle this kind of stateful processing. I was wrong. Storm now has a library called Trident to do exactly this. Trident can handle stateful data processing; it mainly stores the state in ZooKeeper, which is why, for stateful Storm, I highly recommend installing ZooKeeper and Storm on different nodes. There are several libraries that extend Trident to store the state not only in ZooKeeper but also in data stores such as Cassandra. Other libraries extend Trident and bring machine learning capabilities to the table. A sketch of a stateful Trident topology is shown below.
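
Here is a rough sketch of what a stateful hashtag count could look like with Trident (again, not from the post): the canned FixedBatchSpout and the in-memory MemoryMapState are stand-ins for a real tweet source and a real state backend such as a Cassandra-backed state factory.

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentState;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.FixedBatchSpout;
import storm.trident.testing.MemoryMapState;

public class HashtagCountTopology {
    public static void main(String[] args) throws Exception {
        // A canned spout that replays a few fake hashtags over and over,
        // standing in for a real Kafka/Kestrel tweet source.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("hashtag"), 3,
                new Values("#bigdata"), new Values("#storm"), new Values("#bigdata"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();
        // persistentAggregate keeps a running count per hashtag; MemoryMapState is the
        // simplest in-memory backend, and external state factories can replace it.
        TridentState counts = topology.newStream("hashtags", spout)
                .groupBy(new Fields("hashtag"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
        // In a real topology this state could be queried, e.g. via Trident DRPC.

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("hashtag-counts", new Config(), topology.build());
        Thread.sleep(10000);
        cluster.shutdown();
    }
}
```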

So, with all these capabilities, is Apache Storm really inferior to Apache Spark in every aspect? Personally, I would say no. Apache Storm with Trident and all of those libraries is preferable to Apache Spark in some cases, and vice versa. Those are my findings on the perfect (Apache) Storm. Feel free to share your findings about Storm 🙂