Apache Storm vs. Apache Spark
- Storm is a distributed real-time computation system.
- Apache Storm is a task-parallel, continuous computation engine.
- Storm defines its workflows as Directed Acyclic Graphs (DAGs) called “topologies”, which run until they are shut down by the user or encounter an unrecoverable failure.
- Storm does not natively run on top of typical Hadoop clusters. Instead, it uses Apache ZooKeeper and its own master/minion worker processes to coordinate topologies, master and worker state, and the message guarantee semantics.
- Both Yahoo! and Hortonworks are working on providing libraries for running Storm topologies on top of Hadoop 2.x YARN clusters.
- Storm can also run on top of the Mesos scheduler, both natively and with help from the Marathon framework.
- Regardless of where it runs, Storm can still consume files from HDFS and/or write files to HDFS.
- Apache Spark is a fast, general-purpose engine for large-scale data processing.
- Apache Spark is a data-parallel, general-purpose batch processing engine.
- Spark workflows are defined in a style reminiscent of MapReduce; however, Spark is much more capable than traditional Hadoop MapReduce.
- Apache Spark also has a Streaming API project that allows near-continuous processing via short-interval batches (micro-batches).
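
The difference between the two processing models can be illustrated with a small, library-free Python sketch. It does not use the Storm or Spark APIs; the function names (process_per_record, process_micro_batches) and the sample events are hypothetical, and a fixed record count stands in for the time interval that a real micro-batch system would use.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical sample input: each string is one incoming event.
events: List[str] = ["a b", "c", "d e f", "g", "h i"]


def count_words(record: str) -> int:
    """Count the words in a single event."""
    return len(record.split())


def count_words_in_batch(batch: List[str]) -> int:
    """Count the words across all events in one batch."""
    return sum(count_words(record) for record in batch)


def process_per_record(
    stream: Iterable[str], handle: Callable[[str], int]
) -> Iterator[int]:
    """Continuous style: handle each event as soon as it arrives."""
    for record in stream:
        yield handle(record)


def process_micro_batches(
    stream: Iterable[str],
    batch_size: int,
    handle_batch: Callable[[List[str]], int],
) -> Iterator[int]:
    """Micro-batch style: collect events into small batches and
    process each batch as a unit. Assumption: batch_size (a record
    count) replaces the time interval used by a real system."""
    batch: List[str] = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield handle_batch(batch)
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield handle_batch(batch)


if __name__ == "__main__":
    print(list(process_per_record(events, count_words)))
    print(list(process_micro_batches(events, 2, count_words_in_batch)))
```

In the per-record version one result is produced per event, so latency is bounded by the handling time of a single event. In the micro-batch version a result appears only once a batch is complete, which trades some latency for the ability to reuse batch-oriented processing logic.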