I am very late on two things: I missed last week’s post, and the technology I am going to discuss here has already been popular for the last few years. Today’s post is about PrestoDB, or Presto for short. For those of you who are where I was a few months ago: Presto is NOT a database. I know the name ends with DB, but it is not a database. It is a query engine for querying databases.
A query engine is a tool that acts as middleware between the client and the real data storage or database. But if it is just an extra layer, wouldn’t it simply slow queries down? We’ll find out.
Presto is a query engine that was created by Facebook. It was quickly adopted by many companies, such as Airbnb and Netflix.
Logically, Presto should make queries slower, because it introduces a new layer between the client and the data. It also introduces extra complexity, since we need to set up Presto. Not to mention the extra cost, because we need nodes to host Presto.
OK, so now we know what extra effort we have to put in if we want to adopt Presto. But what are the advantages of adopting it? Presto enables us to query across different data storages. In the past we could only join tables within the same database. With Presto we can join tables that live in different databases or data storages. For example, one table is in MySQL and the other is in HDFS or Hive. Pretty cool, huh?
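To make this concrete, here is a minimal sketch of what such a cross-source join could look like. Presto addresses tables as catalog.schema.table, so a single query can name tables from two different catalogs. The catalog names (mysql, hive), the schemas, the tables, and the columns below are all made up for illustration. The Python function only builds the SQL text; it does not talk to a Presto server.

```python
def federated_join_sql(left_table: str, right_table: str, key: str) -> str:
    """Build a Presto SQL join across two fully qualified tables.

    Each table name is expected in Presto's catalog.schema.table form.
    """
    for name in (left_table, right_table):
        if len(name.split(".")) != 3:
            raise ValueError(f"expected catalog.schema.table, got {name!r}")
    return (
        f"SELECT u.{key}, count(*) AS page_views\n"
        f"FROM {left_table} AS u\n"
        f"JOIN {right_table} AS v ON u.{key} = v.{key}\n"
        f"GROUP BY u.{key}"
    )


# Hypothetical tables: users live in MySQL, page views live in Hive.
query = federated_join_sql("mysql.shop.users", "hive.weblogs.page_views", "user_id")
print(query)
```

The point is that the join itself is ordinary SQL. The only thing that changes is the catalog prefix in front of each table name.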
Currently the data storages that are firmly supported by Presto are MySQL, PostgreSQL, Cassandra, MongoDB, Hive, and Redis. Other data sources, such as Apache Kafka and JMX, can be supported as well. There is also the option to include more data sources by developing our own Presto connector.
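Each data source is exposed to Presto as a catalog, configured by a small properties file under etc/catalog. The file name becomes the catalog name used in queries. The sketch below renders two such files from Python dictionaries. The host names, ports, and credentials are placeholders, and the property keys are the commonly documented ones for the MySQL and Hive connectors, so please check them against the documentation for your Presto version.

```python
def render_catalog(properties: dict) -> str:
    """Render connector settings as the key=value lines of a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in properties.items()) + "\n"


# Placeholder hosts and credentials; replace them with your own.
catalogs = {
    "mysql.properties": {
        "connector.name": "mysql",
        "connection-url": "jdbc:mysql://mysql.example.net:3306",
        "connection-user": "presto",
        "connection-password": "secret",
    },
    "hive.properties": {
        "connector.name": "hive-hadoop2",
        "hive.metastore.uri": "thrift://metastore.example.net:9083",
    },
}

for filename, properties in catalogs.items():
    print(f"# etc/catalog/{filename}")
    print(render_catalog(properties))
```

With files like these in place, the two sources would show up in queries as the mysql and hive catalogs.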
Now let’s evaluate the speed. Benchmark testing done by Brandon Harris showed that, with correct tuning of Presto’s JVM, Presto can beat native Hive in terms of query time. So although Presto introduces an extra layer, it will not necessarily slow things down.
In conclusion, adopting PrestoDB does introduce extra effort, but with the correct tuning and expertise it will bring a lot more benefits than querying each data source one by one and aggregating the results at the application level. Otherwise Facebook, Airbnb, and Netflix wouldn’t have adopted it, would they?
Hope this helps 🙂