
Spark is fast
Why do people use Spark? It has a lot in common with MapReduce; it solves the same kinds of problems. So why use Spark instead of MapReduce, when MapReduce has been around a lot longer and the ecosystem and tooling around it are more mature? Well, one of the main reasons is that Spark is really fast. The Apache website claims that Spark can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Now, to be honest, that's a bit of hyperbole; that figure comes from a pretty contrived benchmark. In my own experiments, comparing the same tasks run on Spark and on MapReduce, it's not 100 times faster. It's definitely faster, but more like two to three times faster. Still, Spark definitely has that going for it.
The way Spark achieves that performance is with its directed acyclic graph (DAG) engine. The clever thing about Spark is that it doesn't actually do anything until you ask it to deliver results. At that point, it builds a graph of all the steps it needs to string together to produce the results you want, and it does that in an optimal manner. Because it can wait until it knows exactly what you're asking for, it can figure out the optimal path to answering your question.
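To make that concrete, here is a minimal PySpark sketch of that lazy, DAG-driven behavior. The input file path and column names are hypothetical, chosen just for illustration; this isn't code from the course.

```python
# Minimal sketch of Spark's lazy evaluation: the transformations below
# only record steps in the execution plan; nothing runs until the action
# at the end. The path and column names ("events.csv", "status", "host")
# are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

# Each of these lines just adds a node to the plan -- no data is read
# or processed yet.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
filtered = df.filter(F.col("status") == "error")
counts = filtered.groupBy("host").count()

# You can inspect the plan Spark has built so far without running it.
counts.explain()

# Only this action forces Spark to assemble the DAG, optimize it, and
# actually execute the job.
counts.show()

spark.stop()
```

The key point is that `show()` is the only line that triggers real work; everything before it just describes what you want, which is what gives Spark's optimizer room to pick an efficient execution path.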