Last week’s post lead to a several request on example of a Map Reduce program using Apache Hive that actually store the results on Apache HBase. Just for your information, Hive can do a lot more than just storing data into HBase table. It can also store data on other NoSQL table, like Cassandra for example. But as requested, for this post I will only show how to do that on HBase only.
The dataset I am going to use is provided by LastFM. LastFM is a music web site that provides music as a service to its users. You may think about it as like Spotify. LastFM provide sample of data that contain list of user information such as their genders, countries and signed up date. It also contains list of music played by their users. The data set can be downloaded here. For this post we will use the smallest dataset which the size is 1K. Bigger dataset has 360K in size.
In the dataset there are two files. The first file is called userid-profile.tsv. It is a Tab Separated file that contains all user information such as userid, gender, age, country and date of signed up.
The second file is called userid-timestamp-artid-artname-traid-traname.tsv. It is the file that contains the list of music played by LastFM’s users. It contains information like who play the music, when it is played, artist name, track id, and the track name. The second file is very big. So in case you are experiencing crash or the map reduce process is too slow for you. It might worth to consider breaking this file into several small files.
The Map Reduce application we will do in Hive will be to combine those two files in order to get what are the popular musics played for each country. The result would be stored in HBase. From the results we can find what are the most popular song played by users in the United States or Japan or Mexico.
Let’s get started. First we need to create table in Hive that will be the placeholder for the userid-profile.tsv file. We need this place holder so we can refer it later for the Hive Query that will perform the Map Reduce. After the table is created we will load the data from userid-profile.tsv file into the table. To create such table and load the data in it, use the following command on Hive.
CREATE TABLE rawusers(userid string, gender string, age int, country string, signup string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; LOAD DATA LOCAL INPATH '/root/userid-profile.tsv' INTO TABLE rawusers;
The next is to create another table that will be used as the placeholder for userid-timestamp-artid-artname-traid-traname.tsv. Afterwards, we will load the data from that file into the table. To do that use the following command:
CREATE TABLE rawplays(userid string, timestamp string, artid string, artname string, traid string, traname string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; LOAD DATA LOCAL INPATH '/root/userid-plays.tsv' INTO TABLE rawplays;
Now we have finished the setup for the placeholder tables. The next step is to create table in HBase that will store the results of the Map Reduce process. The table creation can be done in Hive. Even it is an HBase table, we still can access it in Hive. That is why we call it Hive-HBase integration. To create such a table run the following command, still in Hive:
CREATE TABLE playspercountry(key struct<country:string,trackname:string>, NumberOfPlays int) ROW FORMAT DELIMITED COLLECTION ITEMS TERMINATED BY '|' STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:NumberOfPlays") TBLPROPERTIES ("hbase.table.name" = "playspercountry");
After creating such table you can check in Hive if that table exist using the command
You can also check if the same table exist in Hbase using the following command run on Hbase shell:
Now the main show. We will create a Hive command that will do the Map Reduce process of inserting the songs per country along with the frequencies of those songs had been played. We will co that in one command. To do that run the following command on Hive:
INSERT OVERWRITE TABLE playspercountry SELECT named_struct("country",u.country, "trackname",p.traname) AS key, count(*) AS NumberOfPlays FROM rawplays p LEFT OUTER JOIN rawusers u ON (p.userid = u.userid) WHERE u.country <> '' GROUP BY p.traname, u.country ORDER BY key.country ASC, NumberOfPlays DESC;
In Hive when you execute the above command, you can see how many mappers and how many reducers being created to complete the command.
Once the command is complete the playspercountry table will be filled with the data of songs per country along with their frequencies. You can check in Hive using the following command:
select * from playspercountry;
In Hive you can also check the table by running the following command on HBase shell:
There you go you have a complete Map Reduce process that store the results on HBase. The next logical step would be to create an HBase client app that will read HBase and display the result.
Hope this helps 🙂