Perso: A Recommendation Engine for WooCommerce

Hello, after two posts in Indonesian I am back to English. This post is different from my other posts. Those are stories of my experiences using big data; this time I want to share my experience in creating a product.

One of the fields I mentioned where big data can play a big role is recommendation systems. A recommendation system recommends items to the users or visitors of a website, and it is widely used in e-commerce, news media, and similar sites. A recommendation system, or recommendation engine as some call it because it is not an integral part of the website itself, must suggest relevant products or articles that attract users. This increases user engagement on our website.

So this is my story of creating Perso: a recommendation engine for WooCommerce.

Continue reading “Perso: A Recommendation Engine for WooCommerce”

Shifting to Redshift

[Image: Amazon Redshift logo]

This post is about one of Amazon's tools for big data: Amazon Web Services Redshift. It is dedicated especially to data warehousing. I mentioned Redshift in my other post about Amazon's infrastructure for Big Data. Redshift is built on top of PostgreSQL. However, Amazon put some of their magic on it, so it is not 100% the same as the PostgreSQL we know.
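As a side note, here is a minimal sketch of what "built on top of PostgreSQL" means in practice: Redshift listens on port 5439 and speaks the PostgreSQL wire protocol, so the stock PostgreSQL JDBC driver is enough to run a query from Scala. The cluster endpoint, credentials, and the sales table below are hypothetical.

```scala
import java.sql.DriverManager

object RedshiftQuery {
  def main(args: Array[String]): Unit = {
    // Hypothetical cluster endpoint, database and credentials; Redshift's
    // default port is 5439 and it accepts the standard PostgreSQL JDBC driver.
    val url = "jdbc:postgresql://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/dev"
    val conn = DriverManager.getConnection(url, "awsuser", "secret")
    try {
      // 'sales' is a made-up table used only for illustration.
      val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM sales")
      while (rs.next()) println(s"rows in sales: ${rs.getLong(1)}")
    } finally {
      conn.close()
    }
  }
}
```

The differences show up in Redshift-specific features such as distribution and sort keys, which plain PostgreSQL does not have.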

Continue reading “Shifting to Redshift”

Review of Latent Dirichlet Allocation (A Clustering Algorithm)

[Image: k-means clustering illustration]

Again I would like to thank all my readers, wherever you are. I took a quick look at this month's stats since today is the last day of October, and the result is quite pleasing. This month has the highest views and visitors ever, and the trend keeps going up from month to month. It wouldn't be possible without you guys, so I hope you enjoy my writings on this blog.

This week's post is about a clustering algorithm called Latent Dirichlet Allocation (LDA). When I first read about this algorithm I thought it was just another clustering algorithm like K-Means, which I wrote about several posts ago. Boy, was I wrong to underestimate this algorithm.

Continue reading “Review of Latent Dirichlet Allocation (A Clustering Algorithm)”

Data Cleansing on Big Data using Classification Algorithm

[Image: classification illustration]

First of all, thanks to all of you, my beloved readers. This month I saw a significant increase in the number of views and visitors compared to previous months, so keep them coming.

This post is about data cleansing, or data scrubbing. Data cleansing is a process that must be carried out before the data are analyzed. Based on my experience, many systems, whether big data systems or conventional data systems, don't conduct this process properly or skip it altogether.

Neglecting the data cleansing process before doing analytics can bring dire consequences. The analytics results can be heavily biased, and how far they are biased depends on how much noise we have in our data. In many industries that rely on data to make decisions, this noise is responsible for wrong decisions and, consequently, revenue losses of millions, if not billions, of dollars.

One of the golden rules for data scientists is that the results of simple analytics on cleansed data are much better than the results of complex, sophisticated analytics on uncleansed, noisy data.

In a conventional data system, data cleansing is much easier because the amount of data is small and structured. In this kind of system, people normally just identify the criteria for noise and create a filter to separate noise from real data. There are many ways this filter can be built, from a simple script to a complex diagram in a Business Process Management tool. We will not discuss data cleansing in conventional systems any further because, oh yeah, this is a big data blog. Sorry, folks. Those who are interested in the filter can contact me directly.

In big data, the data cleansing process is more difficult because of the huge amount of data and because the data can be unstructured. This requires a different approach from a conventional data system.

The approach I want to discuss here is the use of classification algorithms for data cleansing. There are other methods for data cleansing in big data, such as creating a model, training it, and then deploying it, but we will not focus on those in this post.

A brief definition: classification is a method of assigning existing data to several classes based on pre-defined criteria. Because the algorithm learns from examples we label, and those labels can be revised along the way, it is called supervised learning.

OK, here is the idea of using a classification algorithm for data cleansing. First of all, we need to specify samples of incorrect data; everything that is not classified as incorrect is then treated as correct data. From time to time, we need to review random data in both classes, correct and incorrect, to check that no data are misplaced. By doing so, the system keeps learning what incorrect and correct data look like.

The correct data can then be used for analytics to provide meaningful information for decision makers. The incorrect data, on the other hand, can also be analyzed further, but rather for system evaluation: by evaluating the system we can spot what caused the noise and take action to tweak the system in order to reduce it.

Some popular classification algorithms are naive Bayes, logistic regression, decision trees, and random forests. These algorithms are available in popular machine learning libraries such as Apache Spark's MLlib.
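As a rough illustration of the workflow above, here is a minimal sketch using Spark MLlib's logistic regression (the RDD-based API). Everything specific in it is an assumption for the example: the HDFS paths, the CSV layout, and the convention that label 1.0 marks an incorrect (noisy) record and 0.0 a correct one.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object NoiseFilter {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("noise-filter"))

    // Hypothetical labelled samples: first column is the label
    // (1.0 = incorrect/noisy record, 0.0 = correct record),
    // the rest are numeric features describing the record.
    val labelled = sc.textFile("hdfs:///cleansing/labelled.csv").map { line =>
      val cols = line.split(",").map(_.toDouble)
      LabeledPoint(cols.head, Vectors.dense(cols.tail))
    }

    // Train a binary classifier on the labelled samples.
    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(labelled)

    // Score the unlabelled data and keep only records predicted as correct.
    val raw = sc.textFile("hdfs:///cleansing/raw.csv").map { line =>
      (line, Vectors.dense(line.split(",").map(_.toDouble)))
    }
    val clean = raw.filter { case (_, features) => model.predict(features) == 0.0 }
    clean.map(_._1).saveAsTextFile("hdfs:///cleansing/clean")

    sc.stop()
  }
}
```

The periodic review described above then simply becomes relabelling a random sample of both outputs and retraining the model.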

Hope this helps 🙂

Finding The Influencers on Twitter

[Image: Spark GraphX and Apache Giraph logos]

After many, many posts about technical Big Data, now I am going to look at one of big data's advantages in the social marketing world: finding Twitter influencers. This post only covers the general steps to find influencers on Twitter. Technical stuff like full code is not included, mainly because it is not finished and properly tested yet. Basically, this post is based on my observations of how Spark GraphX works.

So here are the steps I have in mind:

  1. Define the universe of your data. There are a lot of Twitter users, so defining which tweets you want to analyze helps a lot. The data can be scoped by Twitter users, by location, by certain hashtags, etc.
  2. Build a Twitter ingestion engine. This is mainly an app that uses the Twitter API and sends the tweets to the queuing component.
  3. Select a queuing component. This is basically a message queue. Since there are a lot of tweets, having a queuing component helps to store them temporarily before they get processed. The popular choice is Apache Kafka, but feel free to use whatever suits your situation.
  4. Process the tweets using a stream processing component. This can be Storm or Spark Streaming. Here we apply the rules that select the tweets matching the dataset criteria we defined in the first step. Afterwards, the tweets that matter are stored in data storage: NoSQL, HDFS, or S3.
  5. Then create an analytics component that decides which users are influencers. We can define influencers as people whose tweets are retweeted by many others. For the threshold, we can pick an arbitrary number or run another analysis to get the average number of retweets and consider everyone above that average an influencer. In the end we store the Twitter IDs of these influencers in a fast-retrieval data store. For this step we can use Spark GraphX, Giraph, GraphLab, or other graph analytics tools (see the sketch after this list).
  6. Finally, a visualization tool or an API can retrieve the influencers' Twitter IDs from this storage.
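Since I said above that finished, tested code is out of scope, take this only as a rough sketch of step 5 with Spark GraphX: build one edge per retweet, use each user's in-degree as their retweet count, and treat everyone above the average as an influencer. The input path and the retweeterId,authorId CSV layout are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

object TwitterInfluencers {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("influencers"))

    // Hypothetical input produced by step 4: one "retweeterId,authorId" pair
    // per line, already filtered to the universe defined in step 1.
    val retweets: RDD[(Long, Long)] = sc.textFile("hdfs:///tweets/retweets.csv")
      .map(_.split(","))
      .map(cols => (cols(0).toLong, cols(1).toLong))

    // One directed edge per retweet: retweeter -> original author.
    val edges: RDD[Edge[Int]] = retweets.map { case (src, dst) => Edge(src, dst, 1) }
    val graph = Graph.fromEdges(edges, 0)

    // In-degree = how many times a user's tweets were retweeted.
    val retweetCounts = graph.inDegrees

    // Keep users whose retweet count is above the overall average.
    val avg = retweetCounts.map(_._2.toDouble).mean()
    val influencers = retweetCounts.filter { case (_, count) => count > avg }

    // Store the influencer IDs for step 6 (here simply as text files).
    influencers.map { case (id, count) => s"$id,$count" }
      .saveAsTextFile("hdfs:///tweets/influencers")

    sc.stop()
  }
}
```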

Knowing the influencers can be very beneficial for focusing a marketing campaign so that it only addresses the people who can influence many others. This way the cost of marketing can be reduced significantly.

Hope this helps 🙂

Lambda Architecture ala Cassandra

First of all, I want to announce some good news. I received a notification from WordPress that my stats are booming. I haven't noticed any change in the past few days, but in the past few months the traffic has indeed increased significantly. It's not as huge as popular sites, but it makes me happy knowing my writings are enjoyed by many people. I guess WordPress runs the traffic check in batches, so the notification arrived a bit late. Here is what the notification looks like.

[Image: WordPress stats notification]

OK, back to the post. In one of my earliest posts, I wrote about the general architecture of Big Data, which is actually a description of the lambda architecture.

The general implementation of Big Data is to have the big data analytics attached to the current, conventional system (the one using an RDBMS). All we need is some kind of connector that will either stream the data or export it in bulk to the big data analytics system. The analytics in big data can then be handled in real-time or batch fashion (the lambda architecture).

[Diagram: lambda architecture attached to a conventional system]

Cassandra, on the other hand, offers another form of lambda architecture. Here Cassandra (at least the guys at DataStax) wants to be the one-stop service providing the whole solution. While it might work for some cases, it may not be suitable for others. Here is what the Cassandra solution looks like.

[Diagram: lambda architecture using Cassandra data centers]

In the picture above we don't show the backend explicitly, but I think you get the point: no RDBMS. All transactions for the frontend can be served by the transaction data center, and by setting the replication across all other data centers (analytics and search), the data is automatically replicated. This removes the need for a data ingestion tool to transfer data.
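As a sketch of the replication idea (not an official DataStax recipe), the keyspace below is created once through the DataStax Java driver and replicated into three hypothetical data centers named Transactions, Analytics, and Search; the contact point, keyspace name, and data center names would have to match your own cluster and snitch configuration.

```scala
import com.datastax.driver.core.Cluster

object CreateKeyspace {
  def main(args: Array[String]): Unit = {
    // Hypothetical contact point (DataStax Java driver 3.x style).
    val cluster = Cluster.builder().addContactPoint("10.0.0.1").build()
    val session = cluster.connect()

    // Replicate the keyspace into the transaction, analytics and search
    // data centers so each workload reads a local copy of the same data,
    // with no separate ingestion tool in between.
    session.execute(
      """CREATE KEYSPACE IF NOT EXISTS shop
        |WITH replication = {
        |  'class': 'NetworkTopologyStrategy',
        |  'Transactions': 3,
        |  'Analytics': 2,
        |  'Search': 2
        |}""".stripMargin)

    session.close()
    cluster.close()
  }
}
```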

Later, for analytics, we can run Spark in real-time and/or batch mode without disturbing the transaction DC that is serving clients in real time. The results of the analytics can then be served to managers or customers. The search data center will need Lucene (Solr or Elasticsearch) to provide good search capability; this is what is missing in open-source Cassandra and is only available in DSE.
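And here is a minimal sketch of the batch analytics side, assuming the spark-cassandra-connector and a hypothetical shop.orders table with an order_date column; the Spark job is pointed at a node in the analytics data center so the transaction DC keeps serving clients undisturbed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object OrdersPerDay {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("orders-per-day")
      // Hypothetical node in the analytics data center.
      .set("spark.cassandra.connection.host", "10.0.1.1")
    val sc = new SparkContext(conf)

    // Count orders per day from the replica living in the analytics DC.
    val perDay = sc.cassandraTable("shop", "orders")
      .map(row => (row.getString("order_date"), 1L))
      .reduceByKey(_ + _)

    perDay.collect().foreach { case (day, n) => println(s"$day: $n") }
    sc.stop()
  }
}
```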

Hope this helps 🙂

Using Big Data to Tackle Taxation Problems in Indonesia

[Image: tax illustration]

This time the post is in Indonesian. Sorry, from time to time I like to write in Indonesian since Indonesian readers are my blog's biggest audience. Besides, this post is about a problem specific to Indonesia.

This post highlights taxation problems in Indonesia. The problem I mean is how the amount of tax that must be paid is determined.

As we know, since Jokowi was inaugurated as President he has set a tax revenue target that is quite high compared to previous years: Rp 1,294.258 trillion based on the revised 2015 state budget (APBN-P 2015). This is based on the fact that the number of people who actually pay taxes is still very small compared to what it should be.

At first I thought it was reasonable, since the tagline is that taxes build the nation. But after talking with my trader friends, I found that many of them complain about this tax policy because it weighs very heavily on them.

I used to assume that taxes are always a burden in the eyes of business owners. But after spending a long time around traders, I learned that in this case many of them really are hurt, especially traders of everyday commodity goods. How come?

Here is the thing: it is an open secret in the trading world that margins on commodity goods are very small. Commodity goods are goods that are sold everywhere, and the difference in quality between one trader and another selling the same commodity is very small or nonexistent.

Under these conditions, what happens is clearly a price war, where the trader who offers the lowest price wins. As a result, their margins or profits are minimal.

The problem is that the tax regulations do not look at profits and costs (variable and fixed costs), but at the traders' turnover/revenue. Many traders complain that if they honestly paid taxes according to the regulations, their businesses would close. And this has indeed happened; some business owners have been forced to fold or to lie low for a while.

If all these commodity traders really priced the tax into their goods, the result would be massive inflation, because the prices of the commodities that everybody needs would jump by more than 10% as traders refuse to take losses in their businesses.

In my view, the very high tax revenue target pushes tax officers to squeeze out as much tax as possible. Meanwhile, the Indonesian economy is currently not doing well: the rupiah keeps sliding against the dollar and business is sluggish. Together, these conditions squeeze the traders.

I try to be fair and look at this problem not only from the traders' side but also from the tax officers' side. I talked with a few acquaintances who work at the tax office. They said that the high revenue target indeed forces them to act to collect as much tax as possible, especially since tax revenue in Indonesia is still very small to this day. Very few people have a taxpayer identification number (NPWP), and of those who do, not all actively pay taxes. Even then, many pay less tax than they should.

Under these conditions, all they can do is strictly enforce the regulations, because if the tax revenue target is not met, they will certainly get a bad report card.

Both the tax authorities and the traders are right, and both are wrong at the same time. The tax authorities are right because they are enforcing the rules to collect taxes for the state. It is just that the rules they follow are wrong, because they do not consider the traders' condition: for some, if the rules were fully applied, their businesses would close, which would add to unemployment.

The traders are also right, because they are the main factor keeping the economy moving, and under current conditions it is hard for them to fully comply with the tax rules. But there are also traders who could actually afford to pay taxes yet still do not pay according to their ability.

So, given the current conditions, what can be done using Big Data?

I am thinking of having some kind of transition period. In the past, and probably up to now, a lot of tax has not been paid in full. The purpose of this transition period is to evaluate why taxes are not paid in full and what the solution should be.

Cashless transactions through banks are already very common and their volume will certainly keep growing in the future. The government could examine each of these transactions. But even while examining them, make sure no sanctions are imposed on business owners during this transition period, because if they are, these people will all run for cover.

From all these transactions, purchases, sales, salaries and so on, we can analyze, per type of business, how much tax should actually be levied. This evaluation process continues even after the transition period, because business types can evolve and the business world changes very quickly, so the amount of tax can also change when the type of business changes.

The only thing that distinguishes the transition period is the absence of sanctions for business owners during that time. The outcome of the transition period should be a taxation system adjusted to the type of business, one that also takes the business owners' input into account.

After the transition period, monitoring continues. If violations are later found, for example running a business that does not match what is declared in the tax report, the sanctions must be very firm.

Transaction analysis also continues, both to keep adapting to an ever-evolving business world and to detect irregularities.

Driving Big Data with Machine Learning

First of all, I would like to apologize to all my readers: last week passed without a single new post. I will try to make it up this week. This post starts with machine learning, the subject that occupied my time last week.

After reviewing all of the Big Data technologies, it is very obvious that machine learning is at the top of the food chain. No matter how big your data is or how sophisticated the technology you use, without analytics it is just raw rock with little or no value. To shape that rock into diamonds, analytics must take place.

For those of you still confused about analytics: it is summarizing data. Aaron Kimball from WibiData mentioned that 80% of analytics is sums and averages. It's true. In most cases, just summing all the data or finding the average, sometimes accompanied by the minimum and maximum values, is more than enough to give insight about the data. This kind of insight is normally for decision makers.

Wait, I thought this post was about machine learning? Yes, it is. So why on earth are we discussing analytics? Because machine learning is part of analytics; it is the elite 20% of analytics. Normally, those who know and can apply machine learning on Big Data can also apply the sums and averages, which is why this elite 20% sits at the top of the food chain.

Basically, machine learning analyzes the data looking for patterns. The patterns can be used for many purposes, for example fraud detection, customer segmentation, recommendation, etc. For me, what makes machine learning special is its ability to adapt and learn from the data. In other words, the results of machine learning get better as the data grows and the longer it runs, so in the real world it can detect a new kind of fraud. The most popular application that almost everyone uses is email classification, which uses machine learning to identify whether each incoming email is spam, ordinary mail, or priority mail.

Is that all I learned last week? No. I learned several machine learning algorithms, some formulas with Greek letters, etc. I will not discuss the formulas here; it would be too much. In machine learning, there are usually several possible algorithms for a particular problem. For example, to build a recommendation system like the one Amazon has, there are several available algorithms such as ALS, co-occurrence analysis, etc. (a small ALS sketch follows the list below). Choosing the best algorithm for the problem depends on several factors:

  1. Accuracy. How accurate the results of the algorithm are, as judged by the users of the system.
  2. Scalability. In other words, can it still run in production with the existing, limited resources? This factor is the one that made Netflix not use the algorithm from the winner of the Netflix Prize recommendation competition.
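For the recommendation example above, here is a minimal ALS sketch with Spark MLlib. The ratings file path, its userId,productId,rating layout, and the chosen rank, iteration count, and lambda are all assumptions; in practice those knobs are exactly where the accuracy and scalability trade-off shows up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object ProductRecommender {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("als-recommender"))

    // Hypothetical input file: one "userId,productId,rating" per line.
    val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(",")
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    // Train a matrix factorization model:
    // rank = 10 latent factors, 10 iterations, regularization lambda = 0.01.
    val model = ALS.train(ratings, 10, 10, 0.01)

    // Top five product recommendations for one (hypothetical) user.
    model.recommendProducts(42, 5).foreach(println)

    sc.stop()
  }
}
```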

[Image: Apache Spark logo]

In Big Data technology, the two main machine learning packages are Apache Mahout and Apache Spark. Apache Spark's machine learning component is called the Machine Learning Library (MLlib). In my opinion its set of algorithms is not as complete as Mahout's.

Last year Apache Mahout announced that it would use Apache Spark's computational infrastructure, while the machine learning algorithms remain Mahout's own. Mahout 1.0 starts using Spark. However, my experiment on Mahout 1.0 last week showed that it does not work with the latest Spark 1.2, which is quite a disappointment.

I must say Mahout's decision to adopt Spark is a good one, because Spark's computational capabilities are way above MapReduce, Mahout's previous processing platform; it is faster and more resilient. It's just that the fact that Mahout 1.0 cannot yet cope with the latest Spark disappoints me.

Another alternative is to implement our own algorithm. To do this we need to know the basic idea of the algorithm and all the mathematical formulas behind it. There are two people I like for explaining the algorithms: Sean Owen from Cloudera and Ted Dunning from MapR. They are quite helpful in explaining machine learning, and their presentations and work help machine learning practitioners a lot.

Hope this gives a brief view of what machine learning in big data is 🙂

The Data Ingestion (Building Minimized Big Data Infrastructure part-4)

Data ingestion (Indonesian: pengisian data) is the process of getting data from outside and putting it into our Big Data system for processing. That is the loose definition I use. Now that most of our storage software is set up, such as Cassandra and Elasticsearch, we need to set up the data ingestion component.

The first component of data ingestion is an adapter that takes data or information directly from outside. This part needs to communicate over the network, so TCP-based software is needed, preferably socket-based or HTTP-based because those are the most common protocols. Many libraries are available for them, for example a REST API for HTTP or Netty for sockets in Java.

[Image: Apache Kafka logo]

The second component is the message queue. Since data typically flows very rapidly and in huge volumes, a message queue solution is needed. The commonly used solutions for this part are Apache Storm and Apache Kafka: after retrieving the data, the first component routes it to Kafka, then Storm picks up the data from Kafka, analyzes it, and puts it in the data storage, in this case Cassandra.
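As a small sketch of how the first component might hand tweets (or any raw events) over to Kafka, here is a minimal Scala producer; the broker address and the raw-tweets topic name are assumptions.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TweetForwarder {
  // Hypothetical broker address; keys and values are sent as plain strings.
  private val props = new Properties()
  props.put("bootstrap.servers", "kafka1:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  private val producer = new KafkaProducer[String, String](props)

  // Called by the ingestion adapter for every raw event (JSON) it receives.
  def forward(eventJson: String): Unit =
    producer.send(new ProducerRecord[String, String]("raw-tweets", eventJson))

  def close(): Unit = producer.close()
}
```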

[Image: Apache Storm logo]

Among the software in both components, Apache Storm is the most crucial and the most difficult to manage, because Storm is the 'brain' of the pipeline. The other software basically just reformats and forwards the data, while Storm must analyze it. Some of us have perhaps heard about analytics but never really understood what it is.

In a plain and simple definition, analytics is summarizing the data. 80% of analytics is sums and averages: for example, the number of visitors per day of a site, the most-used hashtag on Twitter, the average visit time per visitor on a site, etc. The rest is things like finding minimum or maximum values, and so on.

So Storm should not just reformat and forward the data into the database; it must summarize the huge amount of incoming data and extract important information from it. That is analytics for Storm. Therefore, GPS data processing that just takes data from GPS devices, translates it, and finds the nearest point of interest to locate the device is not included in the analytics definition, because the amount of data coming in and going out is the same and there is no new information; the data has merely been given a new form to make it easier to read.

So what kind of analytics should we do on Storm? It depends on what question you are trying to answer. It can be the most-used hashtag, the most-shared URL, etc. A small sketch of such a bolt follows.
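This is only a rough sketch against the Storm 1.x API (the org.apache.storm packages). It assumes an upstream bolt already extracts and emits a hashtag field, and it keeps its counts in memory per task, which a real topology would persist or window.

```scala
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer}
import org.apache.storm.tuple.{Fields, Tuple, Values}
import scala.collection.mutable

// Keeps a running count per hashtag and emits the updated count downstream,
// where another bolt could write it to Cassandra.
class HashtagCountBolt extends BaseBasicBolt {
  private val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)

  override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
    val tag = input.getStringByField("hashtag") // assumes an upstream bolt emits "hashtag"
    counts(tag) += 1
    collector.emit(new Values(tag, Long.box(counts(tag))))
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("hashtag", "count"))
}
```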

I hope this helps when designing data ingestion infrastructure 🙂


Big Data in Sport Industry (Moneyball)

[Image: Moneyball movie poster]

This post is not about the technical side of Big Data. I really need a break from technical stuff and want to pull back to see the big picture once more. It can be really motivating to see Big Data technology and its application in real life. The industry I am going to talk about is the sports industry: a billion-dollar industry quite comparable to the entertainment industry in terms of value.

Moneyball. This is about the Brad Pitt movie of that name, which is not too old and was based on true events that happened in 2002. In short, the movie is about winning under unfair conditions. Billy Beane was the general manager of the Oakland Athletics, an underdog baseball team in the US. They had too little money to hire the top baseball players in the country, and the stakeholders refused to give the team more money. It is so unfair that big teams with big bucks seem to rule the entire competition easily, since they have the resources to hire the best and most expensive players in the league. Winning seemed impossible for the Oakland Athletics.

However, Billy Beane decided to see the game from a different angle than the conventional one. He judged a baseball team's capability not by how many top players it had. He met Peter Brand, an economics graduate from Yale, who had a method for assembling a high-quality team from undervalued players by analyzing the statistics in each undervalued player's track record. It was a big bet, but Billy Beane had no other choice: if he had to hire undervalued players because of a very limited budget, he would prefer undervalued players who could form a winning team. The result was that the Oakland Athletics won 20 games in a row. It stunned many people in the sports industry, and the technique came to be called Moneyball.

What Peter Brand did for the Oakland Athletics was Big Data analytics, plain and simple. Although many Big Data technologies had not yet surfaced at that time, I personally consider it Big Data analytics. Peter Brand managed to see what others didn't through statistics and historical data, and they achieved what other teams, even big teams, could not: assembling a winning team with very limited money. He found the holy grail through analytics.

Now, with many big data technologies having emerged and become well established, Moneyball is on steroids. Huge amounts of data are analyzed to give maximum results with as little effort as possible. It is used not only in baseball but also in basketball, American football, and football (which the Americans call soccer). I even suspect that Big Data analytics played a big role in Germany's victory in the last World Cup in Brazil. There were no superstar players in the German team like Messi, Ronaldo, or Neymar, but they played great as a whole team. Too bad it is just a suspicion, since I have no access to their strategy.

In the Moneyball movie, what Peter Brand analyzed was limited to player data. With current technology, the data has expanded beyond imagination: not just player data but also coaches' data, managers' data, and entire historical team data. The data also includes weather, city, time, date, and season, even the supporters and the population of the city where the games are held. All of this data is analyzed to find the perfect strategy, not just to assemble the best team but also to work out how to win a game.

A little bit outside the scope of this post: the Moneyball strategy is now also being used in the human resources field, where HR people use it to find the best candidates for their companies. I hope this post helps 🙂