Sample Hive-HBase Integration on LastFM Data (That Actually Works)

Last week's post led to several requests for an example of a MapReduce program using Apache Hive that actually stores its results in Apache HBase. Just for your information, Hive can do a lot more than store data in HBase tables; it can also store data in other NoSQL tables, Cassandra for example. But as requested, in this post I will only show how to do it with HBase.

The dataset I am going to use is provided by LastFM. LastFM is a music website that provides music as a service to its users; think of it as something like Spotify. LastFM provides sample data containing user information, such as gender, country, and sign-up date, as well as the list of tracks played by its users. The dataset can be downloaded here. For this post we will use the smaller dataset, the 1K one; there is also a bigger 360K dataset.

The dataset consists of two files. The first is called userid-profile.tsv, a tab-separated file that contains the user information: userid, gender, age, country, and sign-up date.

The second file is called userid-timestamp-artid-artname-traid-traname.tsv. It contains the listening history of LastFM's users: who played a track, when it was played, the artist id and name, and the track id and name. This file is very big, so if you experience crashes or the MapReduce job runs too slowly, it may be worth breaking it into several smaller files.

The MapReduce application we will build in Hive combines those two files to find the popular tracks played in each country, and stores the result in HBase. From the results we can see, for example, the most popular songs played by users in the United States, Japan, or Mexico.

Let's get started. First we need to create a table in Hive that will be the placeholder for the userid-profile.tsv file. We need this placeholder so we can refer to it later in the Hive query that performs the MapReduce. After the table is created we load the data from userid-profile.tsv into it. To create the table and load the data, use the following commands in Hive:

CREATE TABLE rawusers(userid string, gender string, age int, country string, signup string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/root/userid-profile.tsv' 
INTO TABLE rawusers;

Next, create another table as the placeholder for userid-timestamp-artid-artname-traid-traname.tsv and load the data from that file into it. To do that, use the following commands:

CREATE TABLE rawplays(userid string, timestamp string, artid string, artname string, traid string, traname string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/root/userid-plays.tsv' 
INTO TABLE rawplays;

Now we have finished setting up the placeholder tables. The next step is to create the table in HBase that will store the results of the MapReduce process. The table can be created from Hive: even though it is an HBase table, we can still access it in Hive. That is why we call it Hive-HBase integration. To create such a table, run the following command, still in Hive:

CREATE TABLE playspercountry(key struct<country:string,trackname:string>, NumberOfPlays int) 
ROW FORMAT DELIMITED 
COLLECTION ITEMS TERMINATED BY '|' 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:NumberOfPlays") 
TBLPROPERTIES ("hbase.table.name" = "playspercountry");

After creating the table, you can check in Hive that it exists using the command:

show tables;

You can also check that the same table exists in HBase by running the following command in the HBase shell:

list

Now for the main show. We will write a single Hive command that performs the MapReduce: it inserts, per country, the songs played along with how many times each was played. To do that, run the following command in Hive:

INSERT OVERWRITE TABLE playspercountry 
SELECT named_struct("country",u.country, "trackname",p.traname) AS key, 
       count(*) AS NumberOfPlays 
FROM rawplays p 
LEFT OUTER JOIN rawusers u ON (p.userid = u.userid) 
WHERE u.country <> '' 
GROUP BY p.traname, u.country 
ORDER BY key.country ASC, NumberOfPlays DESC;

When you execute the command above, Hive shows how many mappers and reducers are created to complete it.

Once the command completes, the playspercountry table is filled with the songs per country along with their play counts. You can check it in Hive using the following command:

select * from playspercountry;

You can also check the table by running the following command in the HBase shell:

scan 'playspercountry'

There you go: a complete MapReduce process that stores its results in HBase. The next logical step would be to create an HBase client application that reads the table and displays the results, as sketched below.
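
As a starting point, here is a minimal sketch of such a client using the classic HBase Java client API (HTable and Scan, as available in HBase 0.98). It simply scans the playspercountry table and prints every row; treat the ZooKeeper quorum address as an assumption you should adjust to your own cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PlaysPerCountryReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Assumption: adjust the ZooKeeper quorum to your own cluster.
        conf.set("hbase.zookeeper.quorum", "127.0.0.1");

        HTable table = new HTable(conf, "playspercountry");
        try {
            ResultScanner scanner = table.getScanner(new Scan());
            for (Result row : scanner) {
                // The row key is country|trackname (the '|' comes from the collection
                // delimiter in the Hive DDL); the play count is stored as a string.
                String key = Bytes.toString(row.getRow());
                String plays = Bytes.toString(
                        row.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("NumberOfPlays")));
                System.out.println(key + " -> " + plays);
            }
            scanner.close();
        } finally {
            table.close();
        }
    }
}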

Hope this helps 🙂

Hive-HBase Integration on Hortonworks That Actually Works

Sorry for the missing post last week; I was very busy. For this week's post I will cover something that, based on my observation, does not exist yet on the internet, not even on the Hortonworks website. At least not as complete as this post.

I got this idea while working on a request for a Big Data conference in Dallas. The request was to write a simple Hive script that performs some analytics and stores the results in HBase. The analytics could be done using Java, Hive, or Pig, as long as it runs as MapReduce on Hortonworks. After some consideration, I chose Hive because of its simplicity, and because it can use Apache Tez as its execution engine. At first I thought it would be a piece of cake, something I could finish in a jiffy; many guides are available on the internet, including from Hortonworks and Hive's official site.

After trying to run the procedures according to the guides I found, it turned out that none of them is complete enough to actually get it working. That is why I wanted to write one post that provides the complete guideline for Hive-HBase integration.

A brief description of Hive-HBase integration: by default, Hive stores its tables as files on HDFS. HBase also stores its tables on HDFS, but in a different format. Hive's files on HDFS are normally plain text files, while HBase tables use HBase's own storage format.

Hive-HBase integration means that a Hive table is stored as an HBase table. This way, the data in the table can be accessed by both Hive and HBase.

So what is the benefit of Hive-HBase integration compared to just using Hive or just using HBase?

To answer that question, one must know the purposes of Hive and HBase; they serve different ones. Hive is basically a data analytics framework that does its processing using MapReduce. Although it can also run queries to retrieve the results of that analytics, this takes much longer than queries on a database like HBase.

HBase, on the other hand, is a NoSQL database, not a data analytics framework. It can serve other client applications that want to access its data, and HBase queries are much faster to execute than Hive queries.

From the descriptions of Hive and HBase above, it is clear what Hive-HBase integration can bring: the raw data can reside as simple text files on HDFS, Hive analyzes those text files, and the results are stored in HBase for faster queries. There you go: the whole stack working together to provide a better service.

I tested the following steps on the Hortonworks sandbox 2.2.4. Here are the steps of Hive-HBase integration on Hortonworks that actually work:

  1. Make sure Hive (HiveServer, its metastore, etc.) and HBase (ZooKeeper, the HBase Master, and the HBase RegionServers) are installed and working normally.
  2. Modify hive-site.xml in Hive's configuration directory and add the paths to the jars that Hive will use to write data into HBase:

<property>
<name>hive.aux.jars.path</name>
<value>
file:///usr/hdp/2.2.4.2-2/hive/lib/zookeeper-3.4.6.2.2.4.2-2.jar,
file:///usr/hdp/2.2.4.2-2/hive/lib/hive-hbase-handler-0.14.0.2.2.4.2-2.jar,
file:///usr/hdp/2.2.4.2-2/hive/lib/guava-11.0.2.jar,
file:///usr/hdp/2.2.4.2-2/hbase/lib/hbase-client-0.98.4.2.2.4.2-2-hadoop2.jar,
file:///usr/hdp/2.2.4.2-2/hbase/lib/hbase-common-0.98.4.2.2.4.2-2-hadoop2.jar,
file:///usr/hdp/2.2.4.2-2/hbase/lib/hbase-hadoop2-compat-0.98.4.2.2.4.2-2-hadoop2.jar,
file:///usr/hdp/2.2.4.2-2/hbase/lib/hbase-it-0.98.4.2.2.4.2-2-hadoop2.jar,
file:///usr/hdp/2.2.4.2-2/hbase/lib/hbase-prefix-tree-0.98.4.2.2.4.2-2-hadoop2.jar,
file:///usr/hdp/2.2.4.2-2/hbase/lib/hbase-protocol-0.98.4.2.2.4.2-2-hadoop2.jar,
file:///usr/hdp/2.2.4.2-2/hbase/lib/hbase-server-0.98.4.2.2.4.2-2-hadoop2.jar,
file:///usr/hdp/2.2.4.2-2/hbase/lib/hbase-shell-0.98.4.2.2.4.2-2-hadoop2.jar,
file:///usr/hdp/2.2.4.2-2/hbase/lib/hbase-thrift-0.98.4.2.2.4.2-2-hadoop2.jar
</value>
</property>

Adjust the file paths and names to the correct ones on your machine or virtual machine.

  3. Add the property that points to ZooKeeper, still in hive-site.xml:

<property>
<name>hbase.zookeeper.quorum</name>
<value>127.0.0.1</value>
</property>

Adjust the ZooKeeper address according to your ZooKeeper cluster.

  4. Add the properties that point to the HBase master and nameserver addresses:

<property>
<name>hbase.master</name>
<value>127.0.0.1:60000</value>
</property>

<property>
<name>hbase.nameserver.address</name>
<value>127.0.0.1</value>
</property>

Adjust the property values according to your HBase cluster.

That’s it.

The piece of information that is usually missing is the hbase.nameserver.address property. It appears neither on the Hortonworks website nor on Hive's official website for Hive-HBase integration, yet without this property you cannot use Hive to query data in an HBase table.

Hope this helps 🙂

Setting Up a Hadoop Sandbox on a Standalone Node

This will be a relatively short post, just to record some of my experiences setting up the Hortonworks Sandbox and the Cloudera QuickStart VM on Linux Mint. I know that for most people a sandbox is the first way to learn and do a proof of concept on Hadoop before setting up a real Hadoop cluster. However, my previous laptop did not have enough resources to do so; only now do I have the chance to try it out on my favorite operating system, Linux.

Both serve the same purpose, just under different names, and both run as a virtual machine. My hypervisor of choice is VirtualBox; you can also use VMware or Hyper-V.

Caution: you need at least an Intel i5, at least 6 GB of RAM, and around 10 GB of free storage to be safe. Also enable virtualization in your BIOS, because it is normally disabled. These are the standard requirements for running a virtual machine.

The one thing that really pushed me to write this post is that I was very annoyed with a download manager called FlareGet. It was the top result when I googled for a download manager, so I thought it was good. I was wrong.

The Hortonworks sandbox is 5.8 GB and the Cloudera QuickStart VM is 4 GB, so it took me hours to download them; make sure you set aside some time for this. However, both downloads came out badly: the Hortonworks sandbox ended up at 8.5 GB and the Cloudera QuickStart at 7 GB, and FlareGet reported no errors whatsoever.

Since the downloads had taken quite some time, I decided to try running both the sandbox and the QuickStart VM anyway (not at the same time, of course). Neither would run on VirtualBox because of missing files. This made no sense: how can files be missing when the images had grown an extra 3 GB beyond their normal size?

At first I did not suspect FlareGet was the culprit; I thought it was Hortonworks and Cloudera, because the sandbox documentation is only available for Mac and Windows. So I figured that on Linux I would just have to set up a Hortonworks or Cloudera cluster without the sandbox or QuickStart VM. But after googling for a while I could not find similar issues in their Q&As or forums.

Knowing this, I downloaded them again, this time using the old and faithful wget command on Linux. Again I had to spend quite some time because of the file sizes.

The result: I got the right file sizes, and most importantly, both ran perfectly in VirtualBox. It is not the first time FlareGet has disappointed me; it made the same mistake when I downloaded Eclipse. So there you go, the main point of this post: I will use the faithful wget for now and try to find a good download manager other than FlareGet. Just DO NOT use FlareGet.

I hope this post can save you some effort and time.

The Era of Cloudera

Hello again. To catch up on some of the 'no-post' weeks, I will write another post. This is the first post written on my new laptop, so consider it a test drive, or a test write to be exact. This post is about one of the Hadoop distros, Cloudera. Cloudera is the first and the most widely used Hadoop distribution in the world; you may say it is the market leader.

Cloudera's people are also very active contributors to the Hadoop project. Cloudera even has its own query engine, called Impala, which is different from MapReduce and queries data on HDFS faster than conventional MapReduce.

Cloudera, unlike Hortonworks, comes in two flavours. The first is the flavour most people like, the free version; the second is of course the paid version. With the paid version you get support from the Cloudera team as well as some other software developed specially by Cloudera, like Cloudera Navigator.

Like Hortonworks, Cloudera also has a sandbox version that we can play around with by installing it on a single PC or laptop; it is more or less the same. I personally never play with the sandbox, either Hortonworks' or Cloudera's. It is kids' play. If I want to find out what a Hadoop distro looks like, I fire up a small cluster (3-4 nodes) on a cloud computing service like Rackspace or DigitalOcean and set up the real cluster there. It is more realistic that way. A good test of some of its features only costs around 10 USD, so it is quite cheap. I guess I have developed an instinct for getting the most out of something at the lowest possible cost. The art of being frugal.

But what I really want to talk about is not my instinct or Cloudera's similarities with Hortonworks, but something that makes it different.

Cloudera is damn easy to set up. Hortonworks uses Apache Ambari to help set up, monitor, and provision a Hadoop cluster, but even with Ambari some things must be done manually, like installing Java, setting up NTP, and so on. At least that is what I found when I used HDP 2.0 and 2.2.

Such tedious tasks are automated by the Cloudera installer. Everything. Just download the installer, click through it (pay attention to what you click), and the Cloudera cluster is ready in less than an hour, in some cases even less than 30 minutes.

For this reason I choose Cloudera over Hortonworks when I have to test a Big Data application. The speed of setting up a cluster is very impressive, and it brings down the total cost of the cloud computing I have to use.

For those of you who have had the same experience, or a different one, please feel free to share. Hope this helps.

The Rise of MongoDB

Forgive me for neglecting this blog for some time; I had too much work on my hands. Many things happened in the last few weeks: some family business, and my faithful laptop died after more than four years of excellent service. I am still searching for a new one, so let me know if any of you want to suggest a good laptop.

For this post, I will share some knowledge about a NoSQL database with you, my beloved readers: MongoDB. Those who read my posts will recognize that the NoSQL database I discuss most is Cassandra. No wonder: Cassandra is easy to use, easy to scale, easy to monitor, has great write speed, and has no single point of failure, unlike the other NoSQL databases I know of.

But do all those advantages make everyone switch to Cassandra? The answer is a definite NO. Why? Because although Cassandra has all of those features, it also has some drawbacks that, for certain requirements, mean it just won't work.

One of the drawbacks is the inability to run flexible queries. Cassandra queries must specify the partition key and, optionally, the clustering (column) key; columns that are not part of these two categories should not be used as search criteria. It is possible to filter on other columns using ALLOW FILTERING, but it does not work in every Cassandra version and it is highly NOT RECOMMENDED, because it can bring Cassandra to its knees.
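
To make the restriction concrete, here is a minimal sketch using the DataStax Java driver. The keyspace, table, and column names (music, plays_by_country, country, trackname, plays) are assumptions for illustration only: restricting the partition key works, while restricting an ordinary column is rejected unless you resort to ALLOW FILTERING.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraQueryRestriction {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("music"); // hypothetical keyspace

        // OK: the partition key (country) is restricted, so Cassandra knows
        // exactly which partition to read.
        ResultSet byCountry = session.execute(
                "SELECT trackname, plays FROM plays_by_country WHERE country = 'Japan'");
        for (Row row : byCountry) {
            System.out.println(row.getString("trackname") + " -> " + row.getLong("plays"));
        }

        // Not OK: plays is neither a partition key nor a clustering column, so this
        // statement is rejected (InvalidQueryException) unless ALLOW FILTERING is
        // appended, which forces a cluster-wide scan and is strongly discouraged.
        // session.execute("SELECT trackname FROM plays_by_country WHERE plays > 100");

        cluster.close();
    }
}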

For those who want an alternative, MongoDB comes to the rescue. MongoDB still does not have the no-single-point-of-failure design that Cassandra has, but with its replica-set architecture and automatic failover it can reach 99.999% or even higher availability.

That, however, is not what makes MongoDB different. The difference was made around the end of last year (late 2014), when MongoDB acquired a storage engine startup called WiredTiger.

For those of you who are not familiar with WiredTiger, it is quite young and far less well known than MongoDB, Cassandra, or HBase. However, WiredTiger is very good in terms of performance; in some respects it can even match Cassandra, with write speeds that can go up to 35,000 writes per second. This impressive performance is what attracted MongoDB to integrate WiredTiger.

With WiredTiger integrated, MongoDB performs quite nicely while still providing the flexible queries that Cassandra lacks. To accommodate WiredTiger, some changes were made in MongoDB version 3.0: the architecture is more modular, especially around the storage engine. MongoDB still uses its existing storage engine, MMAPv1, by default, so those who wish to use WiredTiger must change some configuration. More details here.
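
The storage engine is selected when mongod starts (for example via the storage.engine setting), not from client code, but it is handy to verify which engine a running instance actually uses. Here is a minimal sketch assuming the MongoDB Java driver 3.x and a mongod listening on localhost:27017; it reads the storageEngine section of serverStatus.

import com.mongodb.MongoClient;
import org.bson.Document;

public class StorageEngineCheck {
    public static void main(String[] args) {
        // Assumption: a mongod instance is listening on localhost:27017.
        MongoClient client = new MongoClient("localhost", 27017);
        try {
            // On MongoDB 3.0+, serverStatus reports the active engine under
            // storageEngine.name ("mmapv1" by default, "wiredTiger" after switching).
            Document status = client.getDatabase("admin")
                    .runCommand(new Document("serverStatus", 1));
            Document engine = (Document) status.get("storageEngine");
            System.out.println("Active storage engine: " + engine.getString("name"));
        } finally {
            client.close();
        }
    }
}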