Realtime Data Enrichment

This is quite obvious stuff to do, and perhaps many of you already know what realtime data enrichment is and how to do it. But for those who don't, I will dedicate this post to exactly that.

Realtime data streaming enrichment means transforming, updating, or adding information to incoming data in a realtime or near-realtime manner. This process is normally intended to make the incoming information more meaningful, so that later it is much easier to retrieve useful information (with fewer JOIN operations).

Why don't we just send the meaningful data in the first place, so that no data enrichment process is required later? In many cases this option is not possible. For example, with Internet of Things devices, most of them normally don't have the capability to 'enrich' the data before sending it. These devices must also handle intermittent internet connections, so the data they send must be as small as possible.

A real example that we faced is the GPS device. A GPS device can only send its location (lat & long), speed, fuel, alarms, and other information it can tap easily, without having to make complex data lookups to enrich the data.

Once the data arrives on the server, the server enriches it, for example by adding the street name, district, city, and other information, by matching the data from the GPS device against a lookup table. Afterwards, the enriched data can be put into data storage. This makes it easy for users to query vehicles by street name, etc., rather than by latitude and longitude.
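
To make this concrete, here is a minimal, purely illustrative Python sketch of that kind of lookup-based enrichment. The message fields and the bounding-box table are assumptions for the example, not an actual production schema.

```python
# Minimal sketch of enriching a raw GPS message with a lookup table.
# The message fields and the DISTRICTS list are hypothetical examples.

DISTRICTS = [
    # (name, min_lat, max_lat, min_lon, max_lon) -- simplified bounding boxes
    ("Kebayoran Baru", -6.26, -6.22, 106.78, 106.82),
    ("Menteng",        -6.21, -6.18, 106.82, 106.85),
]

def lookup_district(lat, lon):
    """Return the first district whose bounding box contains the point."""
    for name, min_lat, max_lat, min_lon, max_lon in DISTRICTS:
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return name
    return "unknown"

def enrich(message):
    """Add the district (and similar fields) to the raw GPS message."""
    message["district"] = lookup_district(message["lat"], message["lon"])
    return message

raw = {"vehicle_id": "B1234XYZ", "lat": -6.24, "lon": 106.80, "speed": 42}
print(enrich(raw))  # {'vehicle_id': ..., 'district': 'Kebayoran Baru', ...}
```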

When we talk about realtime data enrichment, realtime data processing is the key. Two popular open source realtime data processing frameworks are Storm and Spark Streaming. From my findings, both of them can do realtime data enrichment.

For Apache Storm, one library that helps with data enrichment using a lookup table from the database can be found here. For Spark Streaming, you can use the existing Spark libraries to connect to the database and enrich the data.
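
As an illustration, here is a rough sketch of what such enrichment could look like with Spark Streaming's Python API (DStreams). The socket source, record format, and lookup contents are assumptions made up for the example; in practice the lookup table would typically be loaded from a database.

```python
# A minimal sketch of stream enrichment with Spark Streaming (PySpark DStream API).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="gps-enrichment")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Static lookup table broadcast to the workers (could be loaded from a database).
district_by_cell = sc.broadcast({"cell-001": "Kebayoran Baru", "cell-002": "Menteng"})

def enrich(record):
    # Each record is assumed to be "vehicle_id,cell_id,speed".
    vehicle_id, cell_id, speed = record.split(",")
    district = district_by_cell.value.get(cell_id, "unknown")
    return (vehicle_id, district, speed)

lines = ssc.socketTextStream("localhost", 9999)  # raw GPS messages, one per line
lines.map(enrich).pprint()                        # enriched stream; write to storage in practice

ssc.start()
ssc.awaitTermination()
```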

Hope this helps 🙂

Using Big Data to Address Tax Problems in Indonesia

This time the post was written in Indonesian. Sorry, from time to time I would like to write in Indonesian, since Indonesian readers are the biggest audience of my blog. Besides that, this post is about a problem specific to Indonesia.

This post highlights a tax problem in Indonesia. The tax problem I mean is about determining how much tax has to be paid.

As we know, since Jokowi was inaugurated as President, he has set a tax revenue target that is quite high compared to previous years: Rp 1,294.258 trillion based on the revised 2015 state budget (APBN-P 2015). This is based on the fact that the number of people who actually pay tax is still very small compared to what it should be.

At first I thought this was reasonable, since the tagline is indeed that taxes build the nation. But after I talked with my friends who are traders, many of them complained that this tax policy is a heavy burden on them.

I used to assume that, in the eyes of business people, taxes are always a burden. But after spending a lot of time with traders, I now know that in this case many traders really are hurt, especially those selling everyday commodity goods. How come?

Here is the thing: it is an open secret in the trading world that margins on commodity goods are very small. Commodity goods are goods sold everywhere, where the difference in quality between one trader and another selling the same commodity is very small or non-existent.

Under these conditions, what happens is a price war, where the trader offering the lowest price wins. As a result, their margins or profits are minimal.

Now, the tax regulation does not look at profit or at costs (variable and fixed costs), but at the trader's turnover/revenue. Many traders complain that if they honestly paid tax according to the regulation, their business would close. And this has in fact already happened: several business owners have been forced to fold or lie low for a while.

If all these commodity traders really passed this tax on to the prices of their goods, the result would be massive inflation, because the prices of commodity goods that everyone needs would jump by more than 10%, since traders do not want to lose money in their business.

In my opinion, the very high tax revenue target makes tax officers look for the largest possible sources of tax. Meanwhile, the Indonesian economy is currently not in good shape. The rupiah keeps sliding against the dollar. Business is sluggish. These conditions combine to squeeze the traders.

I try to be fair and look at this problem not only from the traders' side but also from the tax officers' side. I also talked to some acquaintances who work at the tax office. They said that the high tax revenue target does force them to collect as much tax as possible, especially since tax revenue in Indonesia is still very small to this day. The number of people who have a taxpayer identification number (NPWP) is very small, and of those who do have an NPWP, not all actively pay tax. Even then, many pay less tax than they should.

Under these conditions, all they can do is strictly enforce the regulations, because if the tax revenue target is not met, they will certainly get a bad report card.

Both the tax authority and the traders are right and wrong at the same time. The tax authority is right because they enforce the regulations to collect tax for the state. It is just that the regulation they follow is flawed, because it does not consider the condition of the traders, some of whom would have to close their businesses if it were applied strictly, which in turn would increase unemployment.

The traders are also right, because they are the main factor that keeps the economy moving. Under the current conditions it is hard for them to fully comply with the tax rules. But there are also traders who could actually afford to pay tax yet still do not pay according to their ability.

So, under the current conditions, what can be done using Big Data?

I am thinking of having some kind of transition period. In the past, and perhaps up to now, a lot of tax has not been paid in full. The function of this transition period is to evaluate why tax is not paid in full and what the solution should be.

Cashless transactions through banks are already very common and their volume will certainly keep growing in the future. The government could examine each of these transactions. But even while examining them, no sanctions should be imposed on business owners during this transition period, because if they are, all of them will automatically run away.

From all these transactions, purchases, sales, salaries, and so on, one could analyze, per type of business, how much tax should actually be paid. This evaluation process should continue even after the transition period, because types of business can evolve and the business world changes very quickly, so the amount of tax can also change if the type of business changes.
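
Just to illustrate the idea, a toy sketch of such an analysis in Python/pandas could look like the following. The column names, business types, and effective rates are entirely made up for illustration, not real tax figures.

```python
# Hypothetical sketch: aggregate cashless transactions per business type and
# compare reported tax against an assumed effective rate per business type.
import pandas as pd

transactions = pd.DataFrame({
    "business_id":   ["A", "A", "B", "B", "C"],
    "business_type": ["commodity", "commodity", "services", "services", "commodity"],
    "revenue":       [100_000_000, 150_000_000, 80_000_000, 60_000_000, 200_000_000],
    "reported_tax":  [500_000, 700_000, 3_000_000, 2_500_000, 900_000],
})

# Assumed effective rates per business type, lower for thin-margin commodity trade.
effective_rate = pd.Series({"commodity": 0.005, "services": 0.04})

summary = transactions.groupby("business_type").agg(
    total_revenue=("revenue", "sum"),
    total_reported_tax=("reported_tax", "sum"),
)
summary["expected_tax"] = summary["total_revenue"] * effective_rate
summary["gap"] = summary["expected_tax"] - summary["total_reported_tax"]
print(summary)
```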

The only thing that distinguishes this transition period is the absence of sanctions for business owners during this time. The result of this transition period would be a tax system adjusted to the type of business, one that also takes into account input from the business owners themselves.

Supervision should continue after this transition period. If a violation is later found, for example running a business that does not match what is declared in the tax report, then the sanctions must be very firm.

Transaction analysis should also continue, both to keep adapting to a business world that keeps evolving and to detect irregularities.

Stepping into Python

Many things have happened in the last month. I will cover one of them. This new thing is called Python. I know it's not something new and many people have used it, but it is new to me. Yes, I taught myself Python and stepped out beyond the still-powerful Java.

I started with the basics of the Python programming language, such as syntax, control flow, etc., and then stepped into its frameworks like Django, Flask, etc. It's the normal path when learning a new language.

What I want to talk about is not the Python language or its frameworks, but some of its libraries for machine learning. There are four main libraries that are useful for machine learning work in Python: NumPy, SciPy, pandas, and scikit-learn.

Surprisingly, these libraries have been used in production by many startups, including TellApart, which was bought by Twitter for 500 million dollars.

My short research showed that these libraries can be used with Hadoop technology stacks. A simple example is incorporating them in a MapReduce program. As we all know, we can write a MapReduce job in Python thanks to Hadoop Streaming.
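
For example, a Hadoop Streaming reducer written in Python can import NumPy just like any other script. The sketch below assumes the mapper already emits tab-separated, key-sorted "key\tvalue" lines; the field layout and jar/paths in the comment are placeholders, not a specific cluster setup.

```python
# Minimal Hadoop Streaming reducer using NumPy to compute per-key mean and std.
# Run with something like (paths/jar are placeholders):
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
#       -input /data/in -output /data/out
import sys
import numpy as np

current_key = None
values = []

def emit(key, vals):
    arr = np.array(vals, dtype=float)
    print("%s\t%f\t%f" % (key, arr.mean(), arr.std()))

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if current_key is not None and key != current_key:
        emit(current_key, values)
        values = []
    current_key = key
    values.append(float(value))

if current_key is not None:
    emit(current_key, values)
```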

But as with any other machine learning implementation on Big Data, it is not as obvious as the sun in midsummer. We need to clean the data, select the proper algorithm, adjust the parameters, analyze and interpret the results, and finally display them using a visualization library. I am still working on that for a personal project; I will keep you informed about the progress.

Feel free to work with these libraries in Python and let me know the results.

Enter The World of Graph

So sorry for neglecting this blog for more than two weeks. I had a new thing to take care of. Now it seems to have been sorted out, so I can be active again. I hope that in the future I will always have time to write a post for this blog at least once a week, like I used to.

In this post I will talk about the world of graphs. A quick and dirty definition of a graph, from Wikipedia, is a representation of a set of objects where some pairs of objects are connected by links. Everything in this world can be represented using a graph. In the data domain, the relationships among entities clearly form a graph.

Only in the last decade have graph databases and graph analysis emerged as a field that can be used to solve many problems, such as collaborative filtering, fraud detection, social networks, identity and access management, and many more.

One might ask: if we already have relationships in relational databases that can represent graphs, why would we need a graph database or graph analysis tools? Wouldn't it be enough to store the relationships and query them using SQL?

Well, the answer to that question is Yes and No. I like giving that answer; it means there are no absolutes in this world 🙂

Yes, you can just use an existing relational database for storing and analyzing graphs. But the Yes only holds if you have very few, very simple entities related to one another (say fewer than 5 entities and fewer than 5 relationships). There is one well-known model for representing a graph in an RDBMS, called the adjacency list model, that you can check out.
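
As a quick illustration, here is a tiny adjacency-list sketch using SQLite from Python. The schema and data are made up for the example, but it shows how every extra hop through the graph becomes another JOIN.

```python
# Minimal sketch of the adjacency list model in a relational database,
# using SQLite from Python's standard library. Names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    -- each row is one edge: "src knows dst"
    CREATE TABLE knows  (src INTEGER, dst INTEGER);
""")
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob"), (3, "Carol")])
conn.executemany("INSERT INTO knows VALUES (?, ?)", [(1, 2), (2, 3)])

# Friends-of-friends of Alice: two hops means two JOINs, and it only gets
# worse as the traversal gets deeper or the tables get bigger.
rows = conn.execute("""
    SELECT p.name
    FROM knows k1
    JOIN knows k2 ON k2.src = k1.dst
    JOIN person p ON p.id = k2.dst
    WHERE k1.src = 1
""").fetchall()
print(rows)  # [('Carol',)]
```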

It is definitely No if the relations are quite complex and the data is very large, for example more than 20 entities with more than 20 relationships among them; then it is easy to make the RDBMS's performance deteriorate quite significantly. This is because in an RDBMS it is common to use JOINs whenever a query involves more than one entity. If we need to JOIN all those entities, and especially if the data is very big (larger than 100 GB), then you can kiss the RDBMS goodbye.

So if an RDBMS is not good enough, why not use NoSQL? After all, it is designed for huge data, isn't it? Yes, it is. But it still doesn't satisfy the analysis needs of a graph. Why? Because the nature of NoSQL is to put the data in one place, either in one document (as in MongoDB) or in one row (as in HBase or Cassandra). There are no relationships. That is the characteristic that lets NoSQL handle huge data and serve queries over it efficiently: there are no JOINs. Knowing this, it is almost impossible to get even a graph representation. It is still possible, but the burden falls on the application level rather than the database level, and then it depends on how good the technology you use at the application level is and how good your software engineers are.

At the center of graph processing and analysis is the graph database. Currently there aren't many graph databases out there compared to relational databases or NoSQL databases. I will mention several of them.

First is Neo4j. I would say it is the most mature graph database of all; it is almost a decade since its first launch. It is a quite simple, easy-to-use, and yet open source graph database, and many companies are using it. For beginners, Neo4j is a good starting point to learn about graph processing and analysis, and it fits nicely on a single machine. However, from the experience of several people who have used Neo4j, it gets more difficult when the data gets bigger and scalability is needed.
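
For a feel of what working with Neo4j looks like, here is a small sketch using the official neo4j Python driver and Cypher. The connection URI, credentials, and the friends-of-friends data model are assumptions for the example, not a recommended setup.

```python
# Small sketch of creating and querying a tiny graph in Neo4j from Python.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create a tiny graph: Alice knows Bob, Bob knows Carol.
    session.run(
        "MERGE (a:Person {name: $a}) MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:KNOWS]->(b)", a="Alice", b="Bob")
    session.run(
        "MERGE (a:Person {name: $a}) MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:KNOWS]->(b)", a="Bob", b="Carol")

    # Friends-of-friends without any JOINs: just follow the relationships.
    result = session.run(
        "MATCH (:Person {name: 'Alice'})-[:KNOWS*2]->(fof) "
        "RETURN fof.name AS name")
    print([record["name"] for record in result])

driver.close()
```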

Second is Apache Giraph. This is actually not a real graph database. "Apache Giraph is an iterative graph processing system built for high scalability": that is what its website says. So the graph data (edges and vertices) are actually stored somewhere else, usually in Hadoop, and Giraph runs as MapReduce jobs to process and analyze that data. Apache Giraph is just a tool for processing and analysis. It is currently used by Facebook to analyze the graph of their users, so you can imagine the power of Giraph by imagining how big Facebook's graph is.

Third is Titan by Aurelius. This is, I would say, more or less like Giraph: it is not a complete graph database but rather a tool to do CRUD on, process, and analyze graph data. For data storage, Titan needs a backing database; there are currently three that can be used as Titan's storage backend: Cassandra, Apache HBase, and Berkeley DB by Oracle. The difference between Titan and Giraph is that Titan is very handy for OLTP, so it can serve insert and query operations reasonably fast, while Giraph is more for OLAP, where it doesn't have to deal with data ingestion/insertion at all: Giraph just reads what is already stored in its data storage.

Fourth is Cayley by Google. This is a relatively new graph database that was released as open source by Google around mid-2014. It is written in Go, a language created by Google. So far I haven't found any prominent users of Cayley, so you could say it is still at an experimental stage.

Besides those graph databases, there is also one interesting tool that we can use in conjunction with them: TinkerPop. TinkerPop is a graph computing framework. Some databases have their own computing framework, but TinkerPop provides a universal way to work with them. It is still at the incubator stage as an Apache project, but I think it will graduate soon and become a full-fledged Apache project. TinkerPop consists of several tools, like Gremlin for graph queries and Rexster for a REST API on top of a graph database.
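
As a taste of Gremlin, here is a small sketch issuing a traversal from Python. It assumes the gremlinpython client that ships with later TinkerPop releases and a Gremlin Server listening locally; the endpoint and sample data are purely illustrative.

```python
# Small sketch of a Gremlin traversal from Python via gremlinpython.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Add two vertices and an edge between them.
alice = g.addV("person").property("name", "Alice").next()
bob = g.addV("person").property("name", "Bob").next()
g.V(alice.id).addE("knows").to(__.V(bob.id)).iterate()

# Who does Alice know?
print(g.V().has("person", "name", "Alice").out("knows").values("name").toList())

conn.close()
```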