Making Big Data Learning Easier with Docker


First of all, I would like to thank all my beloved readers for reading this blog. Because of you, I received another stats-boom notification from WordPress. Once the traffic reaches a high enough number, I will migrate this blog to a dedicated website with its own domain name.

[Screenshot: WordPress stats notification]

Also, some news for my Indonesian readers: my definition of Big Data in Bahasa Indonesia has been selected as Google's featured definition of Big Data in Bahasa Indonesia. So, terima kasih banyak, thank you, merci, grazie, obrigado, xie xie, kamsahamnida, arigatou, danke, gracias, धन्यवाद. Those are "thank you" in some of the languages my visitors come from. I would say it in all of them, but I still cannot speak Swahili 🙂

OK, now back to the post. In the past two weeks I have been learning about an emerging technology in software development. It is not really new, actually; it has been around for more than three years. And yes, of course, its name is Docker.

For those who don't know, Docker is a container technology in which we can install various software. It is a kind of lightweight virtual machine, and it has gained a lot of popularity in the past few years.

One of the things I like about Docker is Docker Hub. Docker Hub is a kind of repository that contains many Linux images pre-installed with various software. For example, you can find a Linux image with the MySQL database pre-installed. With this image you can quickly set up a Docker container without having to install all the required dependencies. We can also create our own images and put them on Docker Hub.
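Just to give a rough idea, here is a minimal sketch using the Docker SDK for Python (the docker package). The image tag, root password, and port mapping are my own assumptions for the example:

```python
# Minimal sketch using the Docker SDK for Python (pip install docker).
# The image tag, root password, and host port are assumptions.
import docker

client = docker.from_env()  # talk to the local Docker daemon

# Pull the pre-built MySQL image from Docker Hub and start a container;
# no manual installation of MySQL or its dependencies is needed.
container = client.containers.run(
    "mysql:5.7",
    detach=True,
    environment={"MYSQL_ROOT_PASSWORD": "secret"},  # hypothetical password
    ports={"3306/tcp": 3306},                       # expose MySQL to the host
    name="mysql-demo",
)
print(container.name, container.status)
```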

Docker can make it easier for us to learn, develop, and test big data applications. How? Exactly by using pre-built images with all the required dependencies.

I looked on Docker Hub and saw that quite a lot of big data software is available there, for example Elasticsearch, Cassandra, Spark, MongoDB, and Couchbase. All of them are official images, not something someone created and just put on Docker Hub.

Using this method, there is no need to set up a NoSQL database, a data analytics framework, and so on by hand. It is much faster to start creating a big data application and test it right on our PC/laptop. We can even simulate a cluster there.
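For example, here is a hedged sketch of simulating a tiny two-node Cassandra cluster on one machine with the official image; the container names, network name, and image tag are assumptions:

```python
# Hedged sketch: a two-node Cassandra "cluster" on one laptop using the
# official image. Container names, network name, and tag are assumptions.
import docker

client = docker.from_env()
client.networks.create("cass-net", driver="bridge")

# The first node acts as the seed of the cluster.
client.containers.run("cassandra:3.11", name="cass1",
                      network="cass-net", detach=True)

# The second node joins by pointing CASSANDRA_SEEDS (an environment
# variable supported by the official image) at the first node.
client.containers.run("cassandra:3.11", name="cass2",
                      network="cass-net", detach=True,
                      environment={"CASSANDRA_SEEDS": "cass1"})
```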

One thing I could not find is an official Hadoop image on Docker Hub. Perhaps because, even if Hadoop is dockerized, most laptops/PCs cannot cope with it.

However, there is no need to fear: the good thing about this community is that if something is not there, someone will put it there soon. And it's true. There is Ferry, which helps us create big data clusters, not only Hadoop. There is Pachyderm, an analytics tool to analyze data in containers. And there is Coho, to help run big data applications as microservices.

Personally, I would not use those three tools in production right now, not because they are bad, but because they are new and I have not heard of many companies using them, so they are not really combat-proven enough to entrust a big data infrastructure to. Another reason is that some of them are not free and open source, but that is a matter of personal preference. Nevertheless, Docker and all of those tools are good for helping to learn big data. Hence in the title I put "Learning" and not "Operating in Production" or "Deploying in Production".

Hope this helps 🙂

Lambda Architecture à la Cassandra

First of all, I want to announce some good news. I received a stats-boom notification from WordPress. I have not noticed any changes in the past few days, but, indeed, in the past few months the traffic has increased significantly. It is not as huge as popular sites, but it makes me happy knowing my writings are enjoyed by many people. I guess WordPress runs the traffic check in batches, so the notification arrived a bit late. Here is what the notification looks like.

[Screenshot: the WordPress stats notification]

OK, back to the post. In one of my earliest posts, I wrote about the general architecture of Big Data, which is actually a description of the lambda architecture.

The common implementation of Big Data is to have the big data analytics attached to the current, conventional system (the one using an RDBMS). All we need is some kind of connector that will either stream data or export it in bulk to the big data analytics system. The analytics can then be handled in a realtime or batch manner (the lambda architecture).

[Figure: lambda architecture attached to a conventional RDBMS-based system]
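The realtime and batch sides meet at query time. As a toy illustration (all metric names and numbers are made up), the serving layer just merges the precomputed batch view with the fresh increments from the speed layer:

```python
# Toy illustration of the lambda architecture's query-time merge.
# All metric names and numbers are made up.
batch_view = {"page_views:2016-05": 120000}  # complete but stale (batch layer)
speed_delta = {"page_views:2016-05": 342}    # partial but fresh (speed layer)

def query(metric: str) -> int:
    # Serving layer: batch result + realtime increment since the last batch run.
    return batch_view.get(metric, 0) + speed_delta.get(metric, 0)

print(query("page_views:2016-05"))  # 120342
```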

Cassandra, on the other hand, offers another form of lambda architecture. Here Cassandra (at least the guys at DataStax) wants to be the one-stop service providing the whole solution. While it might work for some cases, it may not be suitable for others. Here is what the Cassandra solution looks like.

[Figure: the Cassandra-only lambda architecture, with transaction, analytics, and search data centers]

In the picture above we do not show the backend explicitly, but I think you get the point. No RDBMS: all transactions for the frontend can be served by the transaction data center. And by setting the replication across all the other data centers (analytics and search), the data is automatically replicated. This removes the need for a data ingestion tool to transfer the data.
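In Cassandra terms, that replication is just a keyspace setting. Here is a hedged sketch with the DataStax Python driver; the keyspace name, data center names, and replication factors are assumptions matching the picture:

```python
# Hedged sketch using the DataStax Python driver (pip install cassandra-driver).
# Keyspace name, DC names, and replication factors are assumptions.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# NetworkTopologyStrategy places replicas per data center, so every write
# to the transaction DC is replicated to the analytics and search DCs
# automatically, with no separate ingestion tool in between.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'Transactions': 3,
        'Analytics': 2,
        'Search': 2
    }
""")
```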

Later, for analytics, we can run Spark in realtime and/or batch mode without disturbing the transaction DC that is serving clients in realtime. The results of the analytics can then be served to managers or customers. The search data center will need to use Lucene (via Solr or Elasticsearch) to provide good search capability. This is what is missing in open source Cassandra; it is only available in DSE (DataStax Enterprise).
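As a sketch of the analytics side, a Spark job can read straight from the analytics DC with the spark-cassandra-connector. The keyspace, table, and contact point below are assumptions, and the connector package must be on the Spark classpath:

```python
# Hedged sketch: a Spark batch job over the analytics DC using the
# spark-cassandra-connector. Names and the contact point are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("analytics-dc-job")
         .config("spark.cassandra.connection.host", "analytics-dc-node")
         .getOrCreate())

orders = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="shop", table="orders")
          .load())

# Aggregations that Cassandra cannot serve efficiently are fine in Spark,
# and they never touch the transaction DC.
orders.groupBy("product_id").count().show()
```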

Hope this helps 🙂

Analytics on Cassandra


Those of you who follow my posts about Cassandra, and perhaps have tried it on your own, will have realized that Apache Cassandra is mostly suitable for OLTP; that is, Cassandra is very good for logging transactions. Its write speed is amazing: given a big enough cluster, it can log a million writes per second.

However, Cassandra is not so good when it comes to analytics. The main reason is that its query engine is designed to serve queries that have the primary key in their parameters, unlike an RDBMS, where queries can filter on non-primary-key columns alone.
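To make that concrete, here is a hedged illustration with the DataStax Python driver; the keyspace, table, and columns are made up. Assume a table created as CREATE TABLE logs (device_id int, ts timestamp, level text, PRIMARY KEY (device_id, ts)):

```python
# Hedged illustration; keyspace, table, and columns are made up.
from cassandra import InvalidRequest
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo")

# Fine: the partition key (device_id) is in the WHERE clause,
# so Cassandra knows exactly which nodes hold the data.
session.execute("SELECT * FROM logs WHERE device_id = 42")

# Not fine: filtering on a non-key column alone is rejected unless we
# add ALLOW FILTERING, which scans the whole cluster. This is exactly
# the wall that analytics-style queries hit.
try:
    session.execute("SELECT * FROM logs WHERE level = 'ERROR'")
except InvalidRequest as e:
    print(e)  # Cassandra suggests ALLOW FILTERING here
```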

Because of this limitation, some have turned their backs and gone to other NoSQL databases. After all, what good is having transactions logged if we cannot analyze them? By analyze I mean OLAP stuff (like slice and dice) and full-text search.

The Apache Cassandra community and DataStax realize this and have been including a bunch of features to address it. Those features are:

  1. Integration with Apache Spark. Apache Spark can use data stored in Cassandra and perform analytics, and even machine learning, on it.
  2. DataStax Enterprise. DataStax releases DSE, which is Cassandra with a bunch of other software integrated with it, such as Apache Spark, Apache Solr, etc.
  3. Integration with Elasticsearch. This is an effort to provide full-text search capability on Cassandra: a trigger is created so that every new insert goes to Elasticsearch too. You can find it here.
  4. User-defined functions and aggregates. This is a new feature introduced in Cassandra 2.2. The idea is that we can define our own aggregate functions, like average, and even emulate GROUP BY. These functions are not required to use the primary key, but queries that do not use the primary key will take longer to complete; see the sketch after this list.
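For point 4, here is a hedged sketch of the classic "average" aggregate from the Cassandra 2.2 documentation, executed through the DataStax Python driver. The keyspace and table are made up, and enable_user_defined_functions must be turned on in cassandra.yaml:

```python
# Hedged sketch: the classic "average" user-defined aggregate (UDA),
# a Cassandra 2.2+ feature. Keyspace and table names are made up;
# requires enable_user_defined_functions: true in cassandra.yaml.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo")

# State function: accumulates (count, sum) for each row.
session.execute("""
CREATE OR REPLACE FUNCTION avg_state(state tuple<int, bigint>, val int)
CALLED ON NULL INPUT RETURNS tuple<int, bigint> LANGUAGE java AS '
  if (val != null) {
    state.setInt(0, state.getInt(0) + 1);
    state.setLong(1, state.getLong(1) + val.intValue());
  }
  return state;'
""")

# Final function: turns (count, sum) into the average.
session.execute("""
CREATE OR REPLACE FUNCTION avg_final(state tuple<int, bigint>)
CALLED ON NULL INPUT RETURNS double LANGUAGE java AS '
  if (state.getInt(0) == 0) return null;
  return Double.valueOf((double) state.getLong(1) / state.getInt(0));'
""")

session.execute("""
CREATE OR REPLACE AGGREGATE average(int)
SFUNC avg_state STYPE tuple<int, bigint>
FINALFUNC avg_final INITCOND (0, 0)
""")

# No primary key in the WHERE clause, so this scans every partition:
# it works, but it is the slow path mentioned above.
print(session.execute("SELECT average(amount) FROM purchases").one())
```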

Those are the efforts to achieve analytics capability on Cassandra. If you find one that I missed, feel free to contact me and I will add it to the list.

Hope this helps 🙂

MemSQL: Pas la même chose


For those who do not know, the title means "MemSQL: Not the same thing". I wrote the title in French since it rhymes with MemSQL. Yes, indeed, MemSQL is our topic today.

When I first heard the name MemSQL, I thought it was another in-memory database. I was not totally wrong; it is an in-memory database. My mistake was that I thought it was the same as Memcached or Redis.

Boy, was I shocked. It mainly stores data in memory first, but it also saves the data to disk for persistent storage. It uses memory to process the data fast, and by process I do not mean just pushing data into a table, like MongoDB does; the processing I mean is transaction processing.

If there is one weakness of NoSQL databases that I hear about most, especially from people who have just started using NoSQL, it is the lack of transaction capabilities. Yes, some NoSQL databases claim to have this feature to some extent, but it is still not as good as how an RDBMS handles transactions.

People who know NoSQL databases deeply would say that most, if not all, of them do not try to address consistency as fully as an RDBMS does. This is understandable, since NoSQL is intended to store huge amounts of data, and maintaining consistency across that much distributed data is hard.

I am not saying that consistency is completely ignored by NoSQL databases. Data will be consistent in a NoSQL database given enough time for it to be fully distributed (eventual consistency).

To tackle this problem, the normal approach is to use more than one type of database: one NoSQL database and one RDBMS. This is considered one of the best practices in Big Data usage.

Here is where MemSQL comes to the rescue. MemSQL disrupts this status quo: it aims to replace the practice of using two databases.

MemSQL can handle hundreds of thousands of insert operations per second, like other NoSQL databases, and is thus suitable for OLTP. But while most, if not all, NoSQL databases have difficulty supporting transactions, MemSQL does not, since it uses memory as the first storage tier, where all processing is faster than on disk. Because of this, the hardware requirement for memory is quite high: the minimum RAM requirement is 8 GB.

To incorporate transaction processing, MemSQL applies the famous write-ahead log (WAL) strategy: a transaction's operations are written to a log on disk before they are actually applied. This way, when a disaster happens, for example the node goes down, the transaction can be resumed or rolled back to keep the data consistent.
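To illustrate the strategy itself (this is a toy, not MemSQL's actual implementation), the rule is: make the log entry durable first, apply the change second, and replay the log on restart:

```python
# Toy illustration of write-ahead logging; not MemSQL's implementation.
import json
import os

log = open("wal.log", "a")
state = {}  # the in-memory "table"

def put(key, value):
    # 1. Write the intended operation to the log and force it to disk.
    log.write(json.dumps({"op": "put", "key": key, "value": value}) + "\n")
    log.flush()
    os.fsync(log.fileno())
    # 2. Only then apply it in memory; a crash between the two steps
    #    is recoverable because the log already holds the record.
    state[key] = value

def recover():
    # After a crash, replay the log to rebuild every acknowledged write.
    with open("wal.log") as f:
        for line in f:
            record = json.loads(line)
            if record["op"] == "put":
                state[record["key"]] = record["value"]

put("order:1", {"amount": 250})
```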

MemSQL is also built with scalability in mind, so we can expand the cluster according to our needs. Another principal feature, one that I think most NoSQL databases must have, is its capability to integrate with major cloud computing services like AWS. Also, one thing I like is that we can use a common MySQL client to access data in MemSQL; I think their goal is to make adoption easier, since MySQL is still quite popular.
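Here is a hedged sketch of that last point: since MemSQL speaks the MySQL wire protocol, a plain MySQL client library just works. The host, credentials, and schema below are assumptions:

```python
# Hedged sketch: talking to MemSQL with an ordinary MySQL client library
# (pip install pymysql). Host, credentials, and schema are assumptions.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=3306,
                       user="root", password="", database="demo")
try:
    with conn.cursor() as cur:
        cur.execute("""CREATE TABLE IF NOT EXISTS events (
                           id BIGINT AUTO_INCREMENT PRIMARY KEY,
                           kind VARCHAR(32),
                           amount INT)""")
        # A plain SQL transaction, the part most NoSQL stores struggle with.
        cur.execute("INSERT INTO events (kind, amount) VALUES (%s, %s)",
                    ("purchase", 250))
    conn.commit()
finally:
    conn.close()
```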

Some companies, like Pinterest, use MemSQL to help boost their realtime analytics, since some anomalies can be detected very fast, while they are happening. In conventional NoSQL databases, anomalies are normally detected only when batch analytics are run after some period of time. Thanks to computing frameworks like Apache Spark this delay can be reduced, but it is not as easy as using MemSQL with just simple SQL knowledge.

Hope this helps 🙂