Spark vs MapReduce

MapReduce is batch-oriented by nature, so frameworks built on top of it, such as Hive and Pig, are batch-oriented as well. For iterative processing, as in machine learning, and for interactive analysis, Hadoop MapReduce does not meet the requirement. Cloudera’s article “Why Spark” summarizes the case well.

This is not the end of MapReduce. As of this writing, Hadoop is much more mature than Spark, and many vendors support it. That will change over time: Cloudera has started including Spark in CDH, and more vendors are likely to include it in their Big Data distributions and provide commercial support for it. MapReduce and Spark will run side by side for the foreseeable future.

Also, with Hadoop 2 and its YARN resource manager, MapReduce and other processing models (including Spark) can run on a single cluster. So Hadoop is not going anywhere.

Step aside, MapReduce. You have had a good run, but today’s big data developers are hungry for speed and simplicity. So, when it comes to picking a processing framework for new workloads to run on their Hadoop environments, they are increasingly favouring a nimble young rival called Spark.

So what’s so great about Spark, anyway? The main advantage it offers developers is speed. Spark applications can be an order of magnitude faster than those based on MapReduce, and as much as 100-fold faster according to creator Matei Zaharia, now CTO at Databricks, a company that offers Spark in the cloud, running not on Hadoop, but on the Cassandra database.

It is important to note that Spark can run on a variety of file systems and databases, among them the Hadoop Distributed File System (HDFS).

What gives Spark the edge over MapReduce is that it handles most of its operations in memory, copying data sets from distributed physical storage into far faster RAM. By contrast, MapReduce writes to and reads from hard drives. Reading 1MB of data from disk takes on the order of milliseconds, while reading it from memory takes well under a millisecond. In other words, Spark can give organisations a major time-to-insight advantage.
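To make the in-memory point concrete, here is a toy sketch in plain Python. It is not Spark or Hadoop code, and every name in it (`write_dataset`, `load`, `iterate_from_disk`, `iterate_in_memory`) is made up for illustration. It only models one aspect of the difference: an iterative job that reloads its input from disk on every pass versus one that loads it once and keeps it in memory.

```python
import os
import tempfile


def write_dataset(path, n):
    """Write the integers 0..n-1 to a text file, one per line."""
    with open(path, "w") as f:
        for i in range(n):
            f.write(f"{i}\n")


def load(path, counter):
    """Read the whole file and record that a disk read happened."""
    counter["disk_reads"] += 1
    with open(path) as f:
        return [int(line) for line in f]


def iterate_from_disk(path, passes, counter):
    """Disk-bound style: every pass reloads the input from storage."""
    total = 0
    for _ in range(passes):
        data = load(path, counter)
        total += sum(data)
    return total


def iterate_in_memory(path, passes, counter):
    """In-memory style: load once, then reuse the cached list."""
    data = load(path, counter)
    total = 0
    for _ in range(passes):
        total += sum(data)
    return total


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "numbers.txt")
    write_dataset(path, 1000)

    disk_counter = {"disk_reads": 0}
    mem_counter = {"disk_reads": 0}

    disk_total = iterate_from_disk(path, 10, disk_counter)
    mem_total = iterate_in_memory(path, 10, mem_counter)

print(disk_counter["disk_reads"], mem_counter["disk_reads"])
```

Both functions compute the same result, but the first touches the disk once per pass and the second only once overall. On a real cluster the gap is larger than this sketch suggests, since intermediate results also have to be written out and read back between steps.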

Gartner analyst Nick Heudecker says: “One client I recently spoke to, with a very large Hadoop cluster, did a Spark pilot in which it was able to take a job from four hours [using MapReduce] to 90 seconds [using Spark].”

MapReduce’s greatest strength is processing lots of large text files. Hadoop’s implementation is built around string processing, and it’s very I/O heavy.
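The classic text-processing job in this model is a word count. The sketch below imitates the map, shuffle, and reduce phases in plain single-process Python; it does not use the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: collapse the list of counts for one word into a total."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


counts = word_count([
    "spark is fast",
    "MapReduce is batch",
    "Spark is in memory",
])
print(counts)
```

In a real Hadoop job the map and reduce steps run on different machines and the shuffle moves data through disk and network, which is where the heavy I/O comes from.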


