Hello folks. Sorry, it was a busy week for me, so this week's post got delayed for a while. This week's post is about Amazon, or Amazon Web Services (AWS) to be exact.
Since Amazon created AWS around a decade ago, the number of services it offers has kept growing. From simple hosting and web services for Amazon's e-commerce business, it now hosts a variety of services, including ones for Big Data. This post goes through some of the Big Data services that AWS has to offer.
I will cover five AWS components that can serve as your Big Data infrastructure in the cloud. They can be an alternative to setting up Big Data infrastructure yourself.
Amazon Simple Storage Service (S3). S3 is basically network storage. Strictly speaking it is an object store rather than a file system, but it plays a role similar to Hadoop's HDFS. The most striking difference is that S3 does not need you to install Hadoop, define a namenode, datanodes, etc. It is much simpler. In fact, it is dead simple. Just specify a name for your storage location, which they call a bucket, and that is it; there is no size to declare up front. S3 provides an API for storing, modifying and reading files in an S3 bucket. Since Hadoop can read from and write to S3 much as it does with HDFS, MapReduce programs can use the data in it. As I mentioned in my previous post about Netflix, Netflix uses S3 instead of HDFS for its storage.
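To give a feel for the storing and reading part of that API, here is a minimal sketch in Python. It assumes the boto3 SDK, whose S3 client has put_object and get_object calls. The helper names save_text and load_text, as well as the bucket and key names in the comments, are made up for illustration.

```python
def save_text(s3_client, bucket, key, text):
    """Store a string as an object in the given bucket."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=text.encode("utf-8"))


def load_text(s3_client, bucket, key):
    """Read an object back and decode it as UTF-8 text."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # get_object returns the content as a stream under "Body".
    return response["Body"].read().decode("utf-8")


# Usage against real S3 (needs boto3 installed and AWS credentials set up;
# "my-example-bucket" is a hypothetical bucket that must already exist):
#
#   import boto3
#   s3 = boto3.client("s3")
#   save_text(s3, "my-example-bucket", "notes/hello.txt", "hello from S3")
#   print(load_text(s3, "my-example-bucket", "notes/hello.txt"))
```

The client is passed in as an argument so that the same two helpers can be tried out with a stand-in object before touching a real bucket.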
Amazon Elastic MapReduce (EMR). EMR, as the name implies, is Amazon's hosted Hadoop MapReduce service. EMR can read data from S3 and write data to S3, or from/to other storage that Amazon provides, for example DynamoDB. What makes EMR cost-effective is that it runs on top of Amazon Elastic Compute Cloud (EC2). EC2, in brief, provides the virtual machine instances that act as the nodes of the cluster. We can fire up an EMR cluster in a short period of time and terminate the EC2 nodes once the job is done. This saves a lot of money, since in many cases there is no need to keep a cluster running all the time. The New York Times took this rent-and-terminate approach, running Hadoop on EC2, to convert its archive of news articles from TIFF to PDF. It took less than 24 hours and cost around 240 USD.
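The fire-it-up-then-terminate idea can be sketched as the request you would hand to boto3's EMR run_job_flow call. This is only a sketch under assumptions: the release label, instance types, step name, script names and S3 locations are placeholders I picked, and the two roles are the default EMR role names, which must already exist in your account. The important bit is KeepJobFlowAliveWhenNoSteps set to False, which tells EMR to shut the cluster down once the steps are finished.

```python
def transient_cluster_request(name, input_uri, output_uri, scripts_uri,
                              node_count=3):
    """Build keyword arguments for an EMR cluster that terminates itself."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release, adjust as needed
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": node_count,
            # Terminate the EC2 nodes when there are no steps left to run.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "streaming-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    # mapper.py and reducer.py are hypothetical scripts.
                    "-files", f"{scripts_uri}/mapper.py,{scripts_uri}/reducer.py",
                    "-mapper", "mapper.py",
                    "-reducer", "reducer.py",
                    "-input", input_uri,
                    "-output", output_uri,
                ],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Usage against real AWS (needs boto3 and credentials):
#
#   import boto3
#   emr = boto3.client("emr")
#   emr.run_job_flow(**transient_cluster_request(
#       "nightly-job", "s3://my-bucket/in", "s3://my-bucket/out",
#       "s3://my-bucket/code"))
```

You only pay for the EC2 nodes while the job is running, which is where the saving comes from.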
Amazon DynamoDB. DynamoDB is a close relative of Cassandra. In fact, Cassandra's design borrows heavily from a paper by Amazon about Dynamo, the internal system that preceded DynamoDB. DynamoDB is a key-value NoSQL database that Amazon offers as part of its database-as-a-service lineup. Make no mistake: although Amazon offers DynamoDB, we can still install Cassandra on EC2 and use it. There is an AMI for Cassandra provided by DataStax. So what is the difference? Again, simplicity. With Cassandra on AWS you need to launch the AMI on EC2 and look after the cluster yourself. DynamoDB, however, is a service. We just store our data there without having to set up a database cluster or know how many nodes hold our data. Those who want such a service with less complexity than Cassandra can use DynamoDB.
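Here is a small sketch of what "just store our data there" looks like from Python. It assumes boto3's low-level DynamoDB client and its put_item and get_item calls, where every attribute value is tagged with its type ("S" for string). The table name, the user_id key and the helper names are hypothetical, and the table is assumed to exist already with user_id as its partition key.

```python
def save_user(dynamodb_client, table, user_id, name):
    """Write one item; no cluster or node details are involved."""
    dynamodb_client.put_item(
        TableName=table,
        Item={"user_id": {"S": user_id}, "name": {"S": name}},
    )


def load_user_name(dynamodb_client, table, user_id):
    """Look an item up by its key and return the name, or None if absent."""
    response = dynamodb_client.get_item(
        TableName=table,
        Key={"user_id": {"S": user_id}},
    )
    # get_item leaves "Item" out of the response when nothing matches.
    item = response.get("Item")
    return item["name"]["S"] if item else None


# Usage against real DynamoDB (needs boto3 and credentials; "users" is a
# hypothetical table):
#
#   import boto3
#   db = boto3.client("dynamodb")
#   save_user(db, "users", "u1", "Alice")
#   print(load_user_name(db, "users", "u1"))
```

Notice that nothing in the code mentions nodes or replication. That is all handled by the service.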
Amazon Kinesis is another service offered by Amazon. It is a service for handling streaming, unbounded data. The open source Big Data technologies most similar to Kinesis are Apache Storm and Spark Streaming. With Kinesis we can run 'map reduce' style computations on streaming data, such as counting hashtags in tweets coming from Twitter. It is relatively new and also very simple: there is no need to set up ZooKeeper or a Storm cluster. You work with it through the Kinesis API.
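The hashtag counting mentioned above could look roughly like this in Python. It is a sketch, not the official way to do it: I assume records shaped like the ones boto3's Kinesis get_records call returns (a dict with the payload as bytes under "Data"), and the function name and the stream and shard names in the comments are made up. Because the stream is unbounded, the counter is carried from one batch to the next instead of being recomputed.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def count_hashtags(records, counts=None):
    """Add the hashtags found in one batch of records to a running count."""
    counts = Counter() if counts is None else counts
    for record in records:
        text = record["Data"].decode("utf-8")
        # Lowercase so that #AWS and #aws are counted together.
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


# Usage against a real stream (needs boto3 and credentials; "tweets" and the
# shard id are hypothetical):
#
#   import boto3
#   kinesis = boto3.client("kinesis")
#   it = kinesis.get_shard_iterator(
#       StreamName="tweets", ShardId="shardId-000000000000",
#       ShardIteratorType="LATEST")["ShardIterator"]
#   totals = Counter()
#   while it:
#       batch = kinesis.get_records(ShardIterator=it)
#       count_hashtags(batch["Records"], totals)
#       it = batch.get("NextShardIterator")
```

A real consumer would also pause between polls and handle more than one shard, which I left out to keep the idea visible.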
Amazon Redshift is Amazon's component for data warehousing. The comparable open source Big Data components would be Apache Hive, Apache Pig, Apache Spark and MapReduce programs. However, Redshift works slightly differently: it is a SQL data warehouse, and it integrates with third-party data warehouse and business intelligence tools like Pentaho.
AWS provides good building blocks for creating Big Data infrastructure in the cloud. Many big companies, like Netflix, trust AWS with their Big Data. However, many people still don't use the AWS Big Data infrastructure. The most common reasons are that not many people know how to operate the AWS Big Data tools, and that these tools are quite specific to AWS, so the same tools are not available on other cloud computing services like Rackspace or DigitalOcean. That's why they prefer open source Big Data software like Hadoop, Cassandra, etc.