Building Minimized Big Data with HDP


Hello again. This time I will tell the story of my experience building a minimized big data platform. Minimized Big Data is what I called Little Big Data in my previous post; I decided to rename it because the new name better captures what I meant there.

Just a recap: minimized Big Data is a platform to ingest, manage and analyze huge amounts of data, either batch or streaming. The idea comes from the fact that most of the data we handle is either structured or semi-structured data that can easily be structured. Based on this, some components of a conventional big data platform can be left out at the beginning of an implementation and added gradually later, based on actual needs. This strategy ensures a smooth adoption which, in the end, minimizes cost.

The key to implementing minimized big data is not to use the big data components that exist to handle unstructured data, in other words data that is difficult to structure and therefore difficult to analyze. Based on this, the Hadoop core is not used. What? Yes, the Hadoop core is not used, so HDFS, MapReduce and YARN are not included in this platform. It also means that components built on top of the Hadoop core, such as HBase, Pig and Hive, are not used either.

As audacious as it sounds, this minimized big data platform uses only selected software that meets the following criteria:

  1. Scalable
  2. Easy to monitor
  3. Good and easy connectivity with each other

From my experience, I selected several pieces of software that can constitute a minimized Big Data platform:

  1. Apache Kafka
  2. Apache Storm
  3. Apache Cassandra
  4. Apache Spark (optional)
  5. ElasticSearch (optional)
  6. Apache Solr (optional)
  7. Apache Sqoop (optional)


Apache Kafka is used as the gateway through which all data enters, real time or not, although real-time streaming is preferable since that is what Kafka was built for. Apache Storm ingests the data from Kafka; it can also do simple analysis and reformat the data from semi-structured to structured form for further analysis. The results are stored in Apache Cassandra. Cassandra is a great NoSQL database because it is scalable, reliable and, thanks to the OpsCenter tool from DataStax, easy to manage. Cassandra even has the Cassandra File System (CFS), which is intended to replace HDFS. Apache Spark is used for deeper analysis of the data stored in Cassandra that Storm cannot do on the fly; its results can be stored back in Cassandra, in a different keyspace.
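
As a rough, hedged sketch of that wiring, the small Storm topology below reads string messages from Kafka with the storm-kafka spout. The ZooKeeper address, topic name and the CleanAndStoreBolt are hypothetical; the bolt only prints each message where a real topology would restructure it and write it to Cassandra.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class MinimizedIngestTopology {

    // Placeholder bolt: a real implementation would reshape each message into
    // columns and write it to Cassandra (e.g. with the DataStax Java driver).
    public static class CleanAndStoreBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String message = tuple.getString(0);
            System.out.println("Received: " + message);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // This bolt is a sink, so it declares no output stream.
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper address and Kafka topic name.
        ZkHosts zkHosts = new ZkHosts("zk1:2181");
        SpoutConfig spoutConfig = new SpoutConfig(zkHosts, "events", "/kafka-events", "ingest-spout");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 2);
        builder.setBolt("clean-and-store", new CleanAndStoreBolt(), 4)
               .shuffleGrouping("kafka-spout");

        // In-process cluster for local testing; a production topology would use StormSubmitter.
        new LocalCluster().submitTopology("minimized-ingest", new Config(), builder.createTopology());
    }
}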


Elasticsearch and Apache Solr are search platforms that are optional, in case a search feature is required. Another case where they are needed is handling geospatial queries, because Cassandra has no built-in geospatial features. Apache Sqoop is used to transfer data or files between RDBMSs, filesystems and NoSQL stores such as Cassandra.


My experience setting up this platform was with HDP 2.2, DataStax OpsCenter 5.0 and Elasticsearch 1.4.2. HDP 2.2 is used to install Kafka and Storm; besides these two, it also installs Apache ZooKeeper, Nagios, Ganglia and Apache Ambari. ZooKeeper is required by Kafka and Storm for distributed coordination, while the other three are just for monitoring, so the cluster is easy to manage. DataStax OpsCenter 5.0 is used for Cassandra 2.0.9. Unfortunately, Elasticsearch on its own has no monitoring tool like OpsCenter or Ambari; it has Kibana, but I have never used it. Elasticsearch by itself is easy enough to handle and also scalable. It could also be replaced by Apache Solr, which comes with a dashboard for monitoring.

Hope this is entertaining enough 🙂

Howdy Apache Spark?


My first interaction with Apache Spark happened a few months ago, in the beginning of 2014. It was not so impressive at the time, for several reasons: Spark introduces new terms like RDD, which was too much to learn back then, and it also requires its own cluster, or at least a machine, to run on. I stumbled upon Spark again after I learned about Apache Mahout in depth. Mahout is in some sense winding down: its site mentions that it will somehow merge with Spark. That fact made me reconsider Spark, so this post is mainly about Apache Spark.

Spark, in the closest and most basic definition I can think of, is parallel computing software, more or less like MapReduce in Hadoop. Spark does not have its own HDFS; it can use HDFS and YARN from a Hadoop cluster to run its processes. One of the main differentiators between Spark and Hadoop MapReduce, or other YARN applications, is its ability to process data in memory: it loads the data to be analyzed into memory, which gives faster results. If the memory is insufficient, Spark spills the data over to disk. Although that is similar to what a MapReduce or other YARN application does, Spark's developers, such as Databricks, claim Spark is still faster than Hadoop even when there is a spillover.
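
A minimal sketch of that behavior in Spark's Java API, assuming local mode and a hypothetical input path: persisting an RDD with the MEMORY_AND_DISK storage level keeps partitions in memory and lets the ones that do not fit spill to disk.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class SpillSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("SpillSketch").setMaster("local[*]"));

        // Load a (hypothetical) large text file into an RDD.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/big-input.txt");

        // Keep the RDD in memory, but let partitions that do not fit spill to disk.
        lines.persist(StorageLevel.MEMORY_AND_DISK());

        System.out.println("Line count: " + lines.count());
        sc.stop();
    }
}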

Beyond parallel computing, Spark also provides a set of libraries for data processing, such as MLlib for machine learning, Spark Streaming for processing data streams, and GraphX for graph processing. Using these libraries we can build data-driven applications. From Mahout's web site it seems that Spark performs much better than Hadoop MapReduce, which is why they decided to build on Spark instead.
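
As a small, hedged illustration of MLlib (a toy in-memory dataset, local mode, and an arbitrary choice of two clusters), k-means clustering could look like this:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class KMeansSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("KMeansSketch").setMaster("local[*]"));

        // A tiny made-up dataset of 2D points, just for illustration.
        JavaRDD<Vector> points = sc.parallelize(Arrays.asList(
                Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
                Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)));

        // Cluster the points into 2 groups using at most 20 iterations.
        KMeansModel model = KMeans.train(points.rdd(), 2, 20);
        for (Vector center : model.clusterCenters()) {
            System.out.println("Cluster center: " + center);
        }
        sc.stop();
    }
}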

Together with other software such as Mesos, Kafka, Storm and Cassandra, I consider that Spark can make up a minimized yet powerful Big Data platform. This is another idea that popped out of my mind after exploring Big Data technologies. Hadoop is great, but its complexity for handling many kinds of data (structured, semi-structured and unstructured) is, I believe, not required by many companies, especially those just starting to dive into the Big Data ocean. Most of the data they have and want to analyze is structured, and even unstructured data can be transformed into structured data to ease the analysis. Spark's role in the minimized big data architecture would be like that of MapReduce: it performs the data processing and analytics.

Spark processes data using a data structure called the Resilient Distributed Dataset (RDD). An RDD holds the data in memory so it can be transformed and analyzed. Spark can handle data coming from files in HDFS, files outside HDFS, and NoSQL databases like Cassandra. The general flow of a Spark application consists of three steps, illustrated by the sketch after the list below.

  1. The first step is loading the data from the data source into an RDD.
  2. The second step is transforming the RDD. The purpose of the transformations is to extract the information we want from the RDD; this step can consist of as many transformations as needed to get the desired information.
  3. The last step is the action. The action is what actually starts the whole process and writes the results. Results can be printed on screen, stored in files, or written to a NoSQL database like Cassandra.
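
Here is a minimal sketch of those three steps in Spark's Java API, assuming local mode and a hypothetical HDFS path:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddSteps {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddSteps").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Step 1: load data from a source into an RDD (hypothetical HDFS path).
        JavaRDD<String> lines = sc.textFile("hdfs:///data/events.log");

        // Step 2: transform the RDD; transformations are lazy and only describe the work.
        JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));

        // Step 3: an action triggers the whole pipeline and produces a result.
        long errorCount = errors.count();
        System.out.println("Number of error lines: " + errorCount);

        // The result could also be written out, e.g. errors.saveAsTextFile("hdfs:///data/errors");
        sc.stop();
    }
}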

To experiment with a Spark program and all of its steps (and sub-steps), Spark provides an interactive shell, called the REPL, which is available for Scala and Python.

Another component supporting Spark is Spark SQL, formerly known as Shark. Spark SQL is a component for querying data, and its role is similar to Hive in Hadoop. Spark SQL also uses RDDs under the hood, just like Hive queries are translated into MapReduce jobs in Hadoop.
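
A hedged sketch of how that looks in the Java API, assuming Spark 1.4 or later (where Spark SQL exposes DataFrames) and a hypothetical JSON input file:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class SparkSqlSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("SparkSqlSketch").setMaster("local[*]"));
        SQLContext sqlContext = new SQLContext(sc.sc());

        // Load a (hypothetical) JSON file where each line holds one record.
        DataFrame people = sqlContext.read().json("hdfs:///data/people.json");

        // Register the data as a temporary table so it can be queried with SQL.
        people.registerTempTable("people");

        // The SQL query is translated into RDD operations behind the scenes.
        DataFrame adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18");
        adults.show();

        sc.stop();
    }
}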

I hope this post is quite entertaining 🙂

 

Story of Cassandra


For this post I am using English, to accommodate my audience from outside Indonesia. It is true they could just be Indonesians who happen to be in other countries and happen to read my blog, so I am writing in English to test whether they are really Indonesians or not. Sorry for those who cannot follow, but I am pretty sure my audience, whether from Indonesia or abroad, are quite English-savvy individuals 🙂

This post is all about Apache Cassandra, which I will refer to simply as “Cassandra” throughout. Cassandra was first developed by Facebook to handle data that was growing exponentially fast. After a while, Facebook donated Cassandra to the Apache Foundation, so Cassandra is now known as an Apache project. Facebook originally used Cassandra to store the messaging data behind its inbox search.

Cassandra is a NoSQL database. For those familiar with the CAP theorem (Consistency, Availability, Partition tolerance), Cassandra is one of the NoSQL databases oriented toward AP (Availability and Partition tolerance). This means the data we read from different clients is not always consistent. Despite this drawback, Cassandra is great for fast writes, which makes it suitable for storing data that arrives fast and in large volumes. It is also quite scalable, meaning we can easily add new nodes when storage is maxed out.

Cassandra is available as a manual download from the Apache website, or you can install it as a package, by which I mean a Linux package: a file in a repository that we can install with a command such as apt-get on Ubuntu or yum on Red Hat. A company called DataStax provides a special package of Cassandra. DataStax does quite extensive development on Cassandra: they not only provide the same Cassandra we can obtain from the Apache website, but also build tools and enhancements that make Cassandra easier to use.

One of those enhancements is a web-based application for managing Cassandra clusters called OpsCenter. Having OpsCenter makes it super easy to create and manage Cassandra clusters (yes, plural: you can manage multiple clusters with it). They also provide DevCenter, a kind of workbench (like MySQL Workbench) where you can run queries against Cassandra. Another, rather ambitious, enhancement is the Cassandra File System (CFS), which aims to replace HDFS in Hadoop; with it, Cassandra alone could replace both of Hadoop's storage layers (HDFS and the NoSQL database).

[Screenshot: the DataStax OpsCenter dashboard]

 

The products of DataStax come in two flavours: DataStax Enterprise (DSE) and DataStax Community (DSC). DSE gives you capabilities such as CFS and Spark integration, while DSC is just Cassandra. You can use OpsCenter and DevCenter for free with either, though.

One of Cassandra's nice features is that there is no single point of failure: every node in a Cassandra cluster is equal, and the failure of any single node will not bring the whole cluster down. This is different from the first generation of Hadoop, where the NameNode was a single point of failure; if the NameNode went down, the cluster became useless. Hadoop 2 improves on this by introducing active/passive NameNodes for failover, but if both NameNodes go down the Hadoop cluster is still useless. This does not happen with Cassandra.

Another nice feature is the Cassandra Query Language, or CQL, the language we use to write queries. CQL is very close to SQL; I would say 95% of CQL comes from SQL, which makes it much easier for SQL-savvy people to use Cassandra.
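
As a small illustration of how SQL-like CQL is, here is a hedged sketch using the DataStax Java driver; the contact point, keyspace and table names are hypothetical.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CqlSketch {
    public static void main(String[] args) {
        // Connect to a single-node cluster on localhost (hypothetical address).
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // CQL looks almost identical to SQL; the main differences are in data
        // modeling (keyspaces, partition keys) rather than in the query syntax.
        session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
        session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)");
        session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'alice')");

        ResultSet rs = session.execute("SELECT id, name FROM demo.users");
        for (Row row : rs) {
            System.out.println(row.getInt("id") + " -> " + row.getString("name"));
        }
        cluster.close();
    }
}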

When it comes to Cassandra, I personally prefer to use the DataStax distribution, at least DSC, and then add other software for analytics, e.g. Apache Spark. By doing so, I believe a lightweight yet powerful Big Data infrastructure can be created. In fact, some people have said this will be the next Big Data architecture.

Building a MapReduce Program with Data from Excel Files


This post walks through a MapReduce program that does a simple analysis of data from an Excel file. After plain text files, a lot of data is stored as Excel files, so it seems reasonable for Hadoop MapReduce to accept Excel files as input. Apologies in advance if the code below is poorly formatted; the code plugin has not been installed yet 😀

First, create a Java project with the IDE of your choice (Eclipse, NetBeans or anything else), then add the libraries needed to write MapReduce code and to access Excel files. The MapReduce libraries come from Hadoop; you can download them from Apache Hadoop. The library to access Excel files is Apache POI.

Second, build an InputFormat for Excel that uses Apache POI to read each row of the Excel file. Three classes are needed to build this Excel InputFormat.

  1. A parser class that parses the Excel file using the Apache POI library. This class defines how the columns of each row of the Excel file are joined into one line of text, typically separated by tabs.
  2. A RecordReader class that takes the parser's output and turns it into text records; each row of the Excel file becomes one line of text.
  3. An InputFormat class derived from FileInputFormat. This class simply returns the RecordReader that will later be used by the Mapper.

The Excel parser class:

import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;

public class ParserExcel {

    private static final Log LOG = LogFactory.getLog(ParserExcel.class);
    private StringBuilder currentString = null;
    private long bytesRead = 0;

    /**
     * Reads the first sheet of an .xls workbook and returns its contents as
     * text: one line per row, with cells separated by tabs.
     */
    public String parseExcelData(InputStream is) {
        try {
            HSSFWorkbook workbook = new HSSFWorkbook(is);
            HSSFSheet sheet = workbook.getSheetAt(0);

            Iterator<Row> rowIterator = sheet.iterator();
            currentString = new StringBuilder();
            while (rowIterator.hasNext()) {
                Row row = rowIterator.next();
                Iterator<Cell> cellIterator = row.cellIterator();

                while (cellIterator.hasNext()) {
                    Cell cell = cellIterator.next();

                    switch (cell.getCellType()) {
                    case Cell.CELL_TYPE_BOOLEAN:
                        bytesRead++;
                        currentString.append(cell.getBooleanCellValue() + "\t");
                        break;

                    case Cell.CELL_TYPE_NUMERIC:
                        bytesRead++;
                        currentString.append(cell.getNumericCellValue() + "\t");
                        break;

                    case Cell.CELL_TYPE_STRING:
                        bytesRead++;
                        currentString.append(cell.getStringCellValue() + "\t");
                        break;
                    }
                }
                currentString.append("\n");
            }
            is.close();
        } catch (IOException e) {
            LOG.error("IO Exception : File not found " + e);
        }
        return currentString.toString();
    }

    public long getBytesRead() {
        return bytesRead;
    }
}

 

 

The RecordReader class:

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ExcelRecordReader extends RecordReader<LongWritable, Text> {

    private LongWritable key;
    private Text value;
    private InputStream is;
    private String[] strArrayofLines;

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException, InterruptedException {

        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        final Path file = split.getPath();

        // Open the Excel file from HDFS and parse the whole sheet into lines of text.
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(split.getPath());

        is = fileIn;
        String line = new ParserExcel().parseExcelData(is);
        this.strArrayofLines = line.split("\n");
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        // The key is the row index, the value is the tab-separated row content.
        if (key == null) {
            key = new LongWritable(0);
            value = new Text(strArrayofLines[0]);
        } else {
            if (key.get() < (this.strArrayofLines.length - 1)) {
                long pos = key.get();
                key.set(pos + 1);
                value.set(this.strArrayofLines[(int) (pos + 1)]);
            } else {
                return false;
            }
        }
        return key != null && value != null;
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return 0;
    }

    @Override
    public void close() throws IOException {
        if (is != null) {
            is.close();
        }
    }
}

 

The ExcelInputFormat class:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ExcelInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        return new ExcelRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // An Excel workbook is a binary file and must be read as a whole.
        return false;
    }
}

 

The third step is to create the Mapper. The Mapper is not affected by whether the input came from Excel or not: by the time the data reaches the Mapper, it looks the same as if it came from a text file. The most important part of the Mapper is choosing a suitable mapping for the data coming from Excel. The Mapper class is shown below.


import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ExcelMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static Logger LOG = LoggerFactory.getLogger(ExcelMapper.class);

    /**
     * Each Excel row arrives as a tab-separated line of text.
     * The first column is emitted as the key and the second column as the value.
     */
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws InterruptedException, IOException {
        // Split the row into its columns (the parser separates them with tabs).
        String[] words = value.toString().split("\t");

        // The first column is the key, the second column holds a numeric value.
        context.write(new Text(words[0]), new IntWritable((int) Double.parseDouble(words[1])));
    }
}

 

The fourth step is to create the Reducer. The Reducer receives the Mapper's output and processes it; in the Reducer below, the processing is simply summing the values of every identical key.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ExcelReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context output)
            throws IOException, InterruptedException {
        // Sum all values that share the same key.
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        output.write(key, new IntWritable(count));
    }
}

The fifth step is to create the Job class that wires the Mapper and Reducer together so they can be run. My Job class looks more or less like this:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ExcelDriver {

    private static Logger logger = LoggerFactory.getLogger(ExcelDriver.class);

    /**
     * Main entry point for the example.
     *
     * @param args arguments
     * @throws Exception when something goes wrong
     */
    public static void main(String[] args) throws Exception {
        logger.info("Driver started");

        Job job = Job.getInstance();
        job.setJarByClass(ExcelDriver.class);
        job.setJobName("Excel Record Reader");

        job.setMapperClass(ExcelMapper.class);
        job.setReducerClass(ExcelReducer.class);

        // Both the map output and the final output are (Text, IntWritable) pairs.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setInputFormatClass(ExcelInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}

After that, the sixth step is simply to export the Java project above as a jar, deploy it to the Hadoop cluster and run it. The command to run it is:

$> hadoop jar ExcelMapReduce.jar input/ output/

ExcelMapReduce.jar is the jar exported from the Java project, input/ is the HDFS directory that holds all the Excel files to be analyzed, and output/ is the directory where the MapReduce results are written.

Sorry again that the code is not displayed nicely, since the code plugin is not supported on free WordPress 😀 If anything is unclear or you have questions, please write them in the comments; I will answer as best I can.

 

Steps to Build a MapReduce Program


This week's post briefly covers the technical steps in building a MapReduce program. As a quick recap, a MapReduce program is a program that runs on Hadoop to extract and analyze data from Hadoop/HDFS. MapReduce programs can be created with Apache Pig, Apache Hive, or by writing your own program in Java. MapReduce can also be written in languages other than Java, for example Python, but since Java is the common language used in Hadoop and has the most complete libraries there, it is easier to use Java.

Although Pig or Hive make it easier to pull data out of HDFS, they have limitations when the analysis requires applying machine learning algorithms. Machine learning libraries such as Apache Mahout and Apache Spark can also be used from Java.

To build a MapReduce program there are several steps to follow, with a minimal skeleton shown after the list.

  1. Copy the MapReduce libraries you need from Hadoop.
  2. Know the format of the files whose data you want to read and analyze, because the file format determines how the data is read. For text files (csv, tsv, txt, and so on) the Hadoop MapReduce library already provides a way to read the data from HDFS. For non-text formats, for example Excel, a class must be written to read and extract the data from those files; that class is derived from InputFormat.
  3. Implement the Mapper, which reads data from the InputFormat and maps each record to a key and a value. We have to decide which key the records will later be reduced by in the Reducer. It is therefore normal, even expected, for duplicate keys to appear among the records read from the InputFormat, because all records sharing a key will be analyzed and collapsed into a single record.
  4. Implement the Reducer, which receives all the key-value data from the Mapper and processes it. The analysis is applied to the records that share the same key, for example summing the values of all records with the same key, or averaging them. The Reducer's output is again key-value data, and it contains no duplicate keys.
  5. Create the Job program that wires the Mapper and Reducer together. The Job is an ordinary executable Java program (it has a public static void main(String[] args) method).
  6. Export the MapReduce program as a jar file and ship it to your Hadoop cluster. Once it has been copied to the cluster and the data is stored in HDFS, the MapReduce program can be run. To schedule MapReduce jobs we can use Apache Oozie.
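
As an illustration of steps 3 to 5, here is a minimal word-count style skeleton using Hadoop's default text input format; the class names and the word-count logic are just an example, not tied to any particular dataset.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSkeleton {

    // Step 3: the Mapper emits (word, 1) for every word in every input line.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String word : value.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), ONE);
                }
            }
        }
    }

    // Step 4: the Reducer sums the counts of every identical word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Step 5: the Job wires everything together and is submitted to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(WordCountSkeleton.class);
        job.setJobName("Word Count Skeleton");

        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}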

Those are the general steps in building a MapReduce program. The next post will walk through an example MapReduce program.

The Role of R in Big Data

R is one of the many less popular programming languages. It belongs to the same family as MATLAB and is used heavily by scientists, especially statisticians. R has actually been around for a long time and is even older than Java: it was introduced in 1993. R is open source, and there is a commonly used IDE for it called RStudio.

Because of this age, R was already in use long before Big Data technologies such as Hadoop and friends. R was previously used to support the statistical process often called statistical inference, which is the process of testing a hypothesis. An example hypothesis is the usable lifetime of a smartphone: knowing that lifetime lets the manufacturer decide how long a warranty to give and how long to keep producing a model before a new series is made. R can also be used to find trends in data and make predictions about the future, so that anticipatory steps can be taken.

With the arrival of Big Data technology such as Hadoop and friends, these strengths of R are now widely used to analyze that very large data. To facilitate this, there is an R tool specifically for Big Data analysis called RHadoop. Using RHadoop, the process of building predictive models, which is one part of Machine Learning, can be carried out.

For those who have read my earlier posts about Machine Learning with Apache Mahout or Spark, the question may arise: how is this different from Mahout and Spark? It is true that both of those Machine Learning tools have libraries for modeling trends and, to some extent, making predictions.

R's advantage is its flexibility in analyzing data to build your own model or prediction algorithm. The algorithms in Mahout and Spark are quite general; even though they can be trained on fairly specific data, they are not as flexible as R, where the code that builds the predictive model can be written by hand and tailored directly to the data.

For example, building a predictive model from supermarket data is certainly different from building a predictive model for the weather. Data scientists with a statistics background find it easier to use R than other programming languages.

I hope this post is useful.