Global Big Data Boot Camp


Sorry for not writing a post last week. I was very busy setting up the office and interviewing for Yolcu360, and I personally haven’t done any research yet for this week’s post.

So I will just use this post to announce a big data event coming to Dallas this week. I am doing this as a favor to a friend who initiated the event. It’s called Global Big Data Boot Camp, and this is the second time it is being held.

So for those of you who are interested or happen to be near the Dallas area, feel free to check it out. I helped prepare some of the demos for this event and worked with several experts in the Big Data area, like Jonathan Shook from DataStax.

There you will learn not only about Big Data technology but also about sample use cases where Big Data is applied, from experts in Hadoop, Cassandra, HBase and many more.

So check it out!!!

HBase Data Modeling


After some posts about Cassandra and a post about MongoDB, I will make a post about HBase. HBase is part of the Hadoop technology family. Its role in the Hadoop ecosystem is low-latency data storage: it provides faster data access than MapReduce. Although HBase is part of the Hadoop environment, its development progress has been quite slow. It is more than 4 years old, but the stable release has not reached maturity yet; the major version is not yet 1.

HBase is a column-oriented database, similar to Cassandra. Being column-oriented means that incoming data is not automatically stored in a new row as in an RDBMS; if the row key already exists in the table, the new data is stored in the existing row. A lot like Cassandra, isn’t it?
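To make the "same row key, same row" behaviour concrete, here is a minimal sketch in plain Python. It is only a toy model of the idea, not an HBase client; the `table` dictionary and the `put` helper are names I made up for illustration.

```python
from collections import defaultdict

# Toy stand-in for a table: row key -> {column name -> value}.
# This is plain Python, not an HBase client.
table = defaultdict(dict)

def put(row_key, column, value):
    """Store a value; an existing row key gains a column instead of a new row."""
    table[row_key][column] = value

put("authorid1234", "firstname", "my first name")
put("authorid1234", "lastname", "my last name")

print(len(table))  # 1: both puts landed in the same row
print(table["authorid1234"])
```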

However, there are differences between Cassandra’s and HBase’s implementations of a column-oriented database. Details on how Cassandra implements its version can be seen in my previous post.

In HBase there is no equivalent of Cassandra’s column key, which every incoming piece of data must carry in addition to the row key (the partition key in Cassandra). And unlike Cassandra, where the columns are fixed when we define the table, in HBase we can add columns as we insert data.

So what do we declare when we create a table in HBase? We only declare the table name and one or more column families. The row key is not declared up front; it is supplied with each insert. What is a column family? A column family is a logical grouping of columns. The columns themselves are not defined when we create a table, especially when we create it using the HBase shell. We can create columns on the fly when we insert the data. How cool is that?

Let’s use an illustration to clarify what I am talking about. Let’s say we are going to create a table of authors and their books. We have the authors’ data, which normally contains information like first name and last name. Let’s keep it simple here. We also have the books’ data, which contains the ISBN (International Standard Book Number) and the book title. Keep it simple, I say.

We will use the HBase shell to create the table and run some insert statements. To open the HBase shell, use the following command in a terminal.

hbase shell

After we enter the HBase shell, we issue the command to create the table. It is remarkably simple:

create 'AuthorsAndBooks', 'author', 'book'

The command above creates a table named AuthorsAndBooks. It contains two column families. The first column family is author, which holds the authors’ personal information, hence the name. The second column family holds the books’ information, hence the name book. As you can see, no columns are defined when we create the table. We will define the columns, like first name, as we insert the data.
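The split between "families fixed at creation" and "columns created on insert" can be sketched in plain Python. This is a hypothetical in-memory model to show the shape of the data, not the HBase API; the `ToyTable` class and its `put` method are my own invention, and rejecting an undeclared family is modelled here with a simple `KeyError`.

```python
class ToyTable:
    """Hypothetical in-memory model of an HBase table (not a real client API)."""

    def __init__(self, name, *families):
        self.name = name
        # Column families are fixed when the table is created.
        self.families = set(families)
        # Layout: row key -> family -> qualifier -> value
        self.rows = {}

    def put(self, row_key, column, value):
        # A column is addressed as 'family:qualifier'.
        family, qualifier = column.split(":", 1)
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # The qualifier (the actual column) is created on the fly.
        row = self.rows.setdefault(row_key, {})
        row.setdefault(family, {})[qualifier] = value


table = ToyTable("AuthorsAndBooks", "author", "book")
table.put("authorid1234", "author:firstname", "my first name")
table.put("authorid1234", "author:lastname", "my last name")

print(sorted(table.rows["authorid1234"]["author"]))  # ['firstname', 'lastname']
```

Note that nothing in the constructor mentions firstname or lastname; only the two families are declared.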

To insert the data using the HBase shell, use the following commands:

hbase> put 'AuthorsAndBooks', 'authorid1234', 'author:firstname', 'my first name'
hbase> put 'AuthorsAndBooks', 'authorid1234', 'author:lastname', 'my last name'
hbase> put 'AuthorsAndBooks', 'authorid1234', 'book:book_id_12345', '<isbn>12345</isbn><title>mary had a little lamb</title>'
hbase> put 'AuthorsAndBooks', 'authorid1234', 'book:book_id_67890', '<isbn>67890</isbn><title>the importance of being earnest</title>'

As you can see, we define the first name and last name columns when we insert the data. We are also creating a nested entity in the table for each book: the book information, ISBN and title, is stored in a single cell. To query it, we need to read the book column family and parse each cell’s value.
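As a rough sketch of that parsing step, here is how the book cells could be unpacked in Python once they have been read from the table. The `book_cells` dictionary is a hand-written stand-in for whatever your client returns, and `parse_book_cell` is a hypothetical helper, so treat this as one possible approach rather than the way to do it.

```python
import xml.etree.ElementTree as ET

def parse_book_cell(value):
    """Parse a cell value like '<isbn>..</isbn><title>..</title>' into a dict."""
    # The cell holds two sibling elements with no single root, so wrap them
    # in a throwaway <book> element before parsing.
    root = ET.fromstring(f"<book>{value}</book>")
    return {child.tag: child.text for child in root}

# Hand-written stand-in for the 'book' family of row 'authorid1234'.
book_cells = {
    "book_id_12345": "<isbn>12345</isbn><title>mary had a little lamb</title>",
    "book_id_67890": "<isbn>67890</isbn><title>the importance of being earnest</title>",
}

books = {qualifier: parse_book_cell(v) for qualifier, v in book_cells.items()}
print(books["book_id_12345"]["title"])  # mary had a little lamb
```

The wrapping step matters because the stored value is an XML fragment, not a complete document, and the parser expects a single root element.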

Hope this helps 🙂