Comparison of Engines Used in Hadoop: Tez vs Impala vs Drill vs Spark vs Flink


Impala, Tez and Drill were all developed for Hadoop. Both Tez and Impala claim to improve on Hive/MapReduce speed by 10 to 100 times, setting aside biases in the benchmarks such as the 384 GB memory machine used for Impala.

Spark and Flink were brought to Hadoop not only as more powerful data processing engines, but also for their capabilities in real-time data processing, complex query processing and machine learning. Below is a quick comparison of these engines.

Impala: Shipped by Cloudera, MapR, Oracle and Amazon since 2013, Impala is an open source tool developed by Cloudera to combat the slowness of Hive/MapReduce and to serve as an interactive SQL-on-Hadoop solution. Impala includes a processing engine that is inspired by Google's Dremel and does not build on MapReduce.

Impala processes data in memory and is faster than Hive/MapReduce. It initially lacked Hive's breadth of capabilities, but has added many features over time, such as UDFs, COMPUTE STATS and window functions for aggregation. Impala does not support mid-query fault tolerance: if a node fails while a query is running, the query fails and has to be rerun. It supports data stored in HDFS, Apache HBase and Amazon S3.

Impala is best used with Parquet. Opinions differ on how it compares with Hive on Tez: depending on whom you ask, either one may be considered the better choice.
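To make the features mentioned above a little more concrete, here is a minimal sketch of the kind of statements involved: a Parquet-backed table, a COMPUTE STATS call, and a window function used for aggregation. The table name `sales_parquet` and its columns are hypothetical, chosen only for illustration, and the helper simply builds the SQL text rather than connecting to a cluster.

```python
from typing import List


def impala_statements(table: str = "sales_parquet") -> List[str]:
    """Build example Impala SQL statements as plain strings.

    The table name and columns (region, sale_day, amount) are
    hypothetical; nothing here talks to a real Impala daemon.
    """
    return [
        # Store the table as Parquet, the format Impala is best used with.
        f"CREATE TABLE {table} "
        "(region STRING, sale_day TIMESTAMP, amount DOUBLE) "
        "STORED AS PARQUET",
        # Gather table and column statistics for the query planner.
        f"COMPUTE STATS {table}",
        # Window function: running total of amount within each region.
        "SELECT region, sale_day, amount, "
        "SUM(amount) OVER (PARTITION BY region ORDER BY sale_day) "
        f"AS running_total FROM {table}",
    ]


if __name__ == "__main__":
    for stmt in impala_statements():
        print(stmt + ";")
```

The printed statements could be pasted into impala-shell or sent through whatever client you already use; which client to use is left open here.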

Tez: It originated from a Microsoft research paper and was implemented mainly by Hortonworks. In July 2014, Tez became a top-level Apache project. Its main goal is to improve the performance of Hive and Pig jobs that would otherwise run on MapReduce.
