Spark FAQ

Q: How does Spark achieve high performance?

A: At a high level, the answer is parallelism. Spark partitions (splits up, or shards) the data over many worker machines, and executes "narrow" transformations in parallel on those machines.

Q: What is the purpose of Spark's RDD abstraction?

A: Applications describe to Spark the sequence of transformations they want Spark to apply to the original input in order to produce the final output. It's convenient in these descriptions for the application to name some of the stages, much as one names variables in ordinary programming, and RDDs are a way to do this naming.

Q: What does the paper mean when it says transformations are lazy operations?

A: Suppose your program calls a transformation function like filter() or map() on an RDD. What the executing program actually does at that point is add to a lineage graph that it's building that describes how to process your data. A call like filter() or map() does *not* actually filter or map the data when you call it. When your program calls an action function like count() or collect(), at that point Spark analyzes the lineage graph your program has built, and applies the computation it describes to your data (reads it from HDFS on a bunch of worker machines, actually does the filtering and mapping &c). The fact that nothing much happens when you call a transformation, and work only happens later, is why the paper says "lazy". (There's a short code sketch of this after this group of questions.) The laziness helps with performance because Spark gets to know about all the transformations, the whole lineage graph, before it starts doing any work. It can analyze the graph to decide things like how to partition the data over the workers. And, if a worker fails, Spark uses the graph to help it efficiently reconstruct the data that was lost with the worker.

Q: Is Spark currently in use in any major applications?

A: Yes: https://databricks.com/customers

Q: How common is it for PhD students to create something on the scale of Spark?

A: Unusual! It is quite an accomplishment. Matei Zaharia won the ACM doctoral thesis award for this work.

Q: Should we view Spark as being similar to MapReduce?

A: You can think of Spark as MapReduce and more. Spark can express computations that are difficult to express in a high-performance way in MapReduce, for example iterative algorithms. There are systems that are better than Spark at incorporating new data that is streaming in, instead of doing batch processing; for example, Naiad (https://dl.acm.org/citation.cfm?id=2522738). Spark also has streaming support, although it implements it in terms of computations on "micro-batches" (see Spark Streaming, SOSP 2013). There are also systems that allow more fine-grained sharing between different machines (e.g., DSM systems), as well as a system such as Piccolo (http://piccolo.news.cs.nyu.edu/), which targets applications similar to Spark's.

Q: Why are RDDs called immutable if they allow for transformations?

A: A transformation produces a new RDD as output; it doesn't modify the input RDD.

Q: Do distributed systems designers worry about energy efficiency?

A: Not directly. They care about efficient use of compute, memory, and storage resources; minimizing use (and waste) of these resources tends also to reduce use of energy. Sometimes the concern is direct; for example, see http://www.cs.cmu.edu/~fawnproj/. Big distributed systems often run in carefully designed datacenters, and the people who design datacenters care a lot about minimizing energy costs, both to run the computers and to cool them. Hardware designers pay a lot of attention to power use. For example, if some area of a chip isn't being intensively used, the hardware will reduce its clock rate or shut it down altogether.
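To make the laziness answer above concrete, here is a minimal Scala sketch of a driver program (the HDFS path, application name, and variable names are made up). It also illustrates the immutability answer: filter() and map() return new RDDs rather than modifying lines or errors.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("faq-lazy-example").setMaster("local[*]"))

    val lines  = sc.textFile("hdfs://host:9000/example/input.txt") // hypothetical path; nothing is read yet
    val errors = lines.filter(_.contains("ERROR"))                 // only extends the lineage graph; no filtering happens
    val pairs  = errors.map(l => (l.split(" ")(0), 1))             // still just lineage; lines and errors are unchanged
    val n      = errors.count()                                    // action: Spark now reads the input on the workers,
                                                                   // runs the filter, and returns the count to the driver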
Q: How do applications figure out the location of an RDD?

A: Application code isn't directly aware of where RDDs live. Instead, the Spark library decides on which servers to place the partitions of each RDD. The RDD objects that applications manipulate contain private fields in which Spark records location information (or, for RDDs not yet computed, the recipe for how to compute them from their input RDDs).

Q: How does Spark achieve fault tolerance?

A: Spark remembers the sequence of transformations (the lineage) that led to each RDD. If a worker fails, typically some partitions of a bunch of RDDs will be lost along with the worker. Spark re-computes the missing partitions on other workers, using the lineage information to tell it what transformations to re-run with what input RDDs. The failure may have caused input RDDs to be lost as well, so Spark may have to start as far back as the original input files. If the programmer knows that re-computing an RDD would be too expensive (e.g. if it has "wide" inputs or is the result of many iterations), the programmer can ask Spark to replicate each partition of the RDD's data on multiple workers (there's a short sketch of this after this group of questions).

Q: Why is Spark developed using Scala? What's special about the language?

A: In part because Scala was fashionable when the project started. One good reason is that Scala provides the ability to serialize and ship user-defined code ("closures"), as discussed in §5.2. This is fairly straightforward in JVM-based languages (such as Java and Scala), but tricky to do in C, C++, or Go, partly because of shared memory (pointers, mutexes, etc.) and partly because the closure needs to capture all variables referred to inside it (which is difficult unless a language runtime can help with it). (A tiny closure example appears after this group of questions.)

Q: Does anybody still use MapReduce rather than Spark, since Spark seems to be strictly superior? If so, why do people still use MR?

A: If the computation one needs fits the MapReduce paradigm well, there is no advantage to using Spark over using MapReduce. For example, if the computation does a single scan of a large data set (map phase) followed by an aggregation (reduce phase), the computation will be dominated by I/O, and Spark's in-memory RDD caching will offer no benefit since no RDD is ever re-used. Spark does very well on iterative computations with a lot of internal reuse, but has no architectural edge over MapReduce or Hadoop for simple jobs that scan a large data set and aggregate (i.e., just map() and reduceByKey() transformations in Spark speak; see the sketch after this group of questions). On the other hand, there's also no reason that Spark would be slower for these computations, so you could really use either here.

Q: Is the RDD concept implemented in any systems other than Spark?

A: Spark and the specific RDD interface are pretty intimately tied to each other. However, two key ideas behind RDDs -- deterministic, lineage-based re-execution and the collections-oriented API -- pre-date Spark and have been widely used. For example, DryadLINQ, FlumeJava, and Cloud Dataflow offer similar collection-oriented APIs; and the Dryad and Ciel systems referenced by the paper also keep track of how pieces of data are computed, and re-execute that computation on failure, similar to lineage-based fault tolerance. Spark has recently moved away from RDDs towards "DataFrames", which implement a more column-oriented representation while maintaining the good ideas from RDDs: https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes
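As a hedged illustration of the fault-tolerance answer above (ranks is a made-up RDD name): a program can ask Spark to keep each partition of an expensive-to-recompute RDD on two workers, so that a single worker failure doesn't force a long chain of re-computation from the lineage.

    import org.apache.spark.storage.StorageLevel

    // ranks is assumed to be an existing RDD that is the result of many PageRank-style
    // iterations, so re-computing it from its lineage would be expensive.
    ranks.persist(StorageLevel.MEMORY_ONLY_2)  // the "_2" storage level keeps each partition on two workers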
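For the Scala question above, a tiny sketch (names made up) of what closure shipping looks like from the application's point of view:

    // nums is assumed to be an existing RDD[Int].
    val threshold = 10                         // ordinary local variable in the driver program
    val big = nums.filter(x => x > threshold)  // the closure captures threshold; Spark serializes both and
                                               // ships them to the workers that run the filter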
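For the MapReduce-versus-Spark question above, this is roughly what a simple scan-and-aggregate job (word count) looks like in Spark (paths are hypothetical): a single pass over the input followed by one aggregation, with no RDD ever re-used, so in-memory caching buys nothing and MapReduce would do just as well.

    val words  = sc.textFile("hdfs://host:9000/docs")        // hypothetical input path
                   .flatMap(_.split(" "))                    // "map" phase: one scan over the data
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // "reduce" phase: aggregate counts per word
    counts.saveAsTextFile("hdfs://host:9000/wordcount-out")  // action; the job is dominated by I/O either way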
Q: What is hash partitioning?

A: Hash partitioning is an idea from databases for implementing joins efficiently. Splitting an RDD into partitions using a hash means hashing the primary key of each record and storing all the records with the same hash in the same partition. If you have two RDDs with the same primary keys, and you hash-partition them, then the partitions of the two RDDs with the same keys will end up on the same machine. This is nice if you need to compute a join() of these two RDDs: the join() will not require communication with other machines, because the partitions with the same keys are already on the same machine. This shows up, for example, in the PageRank example with the links and ranks RDDs. Their keys are URLs, and hash partitioning ensures that the entries for "mit.edu" in both RDDs end up on the same machine. So, when joining the two RDDs on "mit.edu", no communication with another machine is necessary.
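A minimal sketch of the hash-partitioning answer above (the input file, partition count, and variable names are made up): because links and ranks end up with the same HashPartitioner, records with the same URL key live on the same worker, so the join() below needs no cross-machine shuffle.

    import org.apache.spark.HashPartitioner

    val part  = new HashPartitioner(8)                       // hypothetical number of partitions
    val links = sc.textFile("hdfs://host:9000/links")        // hypothetical file of "source target" lines
                  .map { line => val a = line.split(" "); (a(0), a(1)) }
                  .partitionBy(part)
                  .persist()                                 // keep the partitioned RDD in memory across iterations
    val ranks    = links.mapValues(_ => 1.0)                 // mapValues preserves links' partitioner
    val contribs = links.join(ranks)                         // co-partitioned, so the join needs no network shuffle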