This post is by Marc Limotte, on work done during his ‘Sprintbatical’. The Climate Corporation has “sprintbaticals,” two-week breaks to work on something a little different….
After a week of investigating Spark, I wanted to share my initial thoughts. As a disclaimer, this is not an exhaustive product evaluation. In particular, I have not done performance testing or reliability testing, but my investigation continues.
What is the Spark project?
“Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.
To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly much quicker than with disk-based systems like Hadoop MapReduce.
To make programming faster, Spark provides clean, concise APIs in both Scala and Java. You can also use Spark interactively from the Scala console to rapidly query big datasets.”
As a side note, two projects make a start at a Clojure API for Spark. One is my own (https://github.com/TheClimateCorporation/clj-spark), and the other is by Mark Hamstra (https://github.com/markhamstra/spark/tree/master/cljspark). Note that Climate Corporation is a heavy user of Clojure.
Efficiencies & How Spark Compares to Hadoop
Unlike Hadoop, Spark is not strictly a Map/Reduce framework; it is not limited to alternating Map and Reduce operations with an implicit group-by in between. For example, in Spark you can group on one key for one aggregation and then group by another key for a subsequent aggregation, without requiring a Map step in between. You can also do things like sampling or mapPartitions (i.e. apply a function to each split).
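To make the idea of chained aggregations concrete, here is a minimal sketch in plain Python. It is not the Spark API; it only models the shape of the computation on a small in-memory list. The records, field names, and helper functions (`group_sum`, `map_partitions`) are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical records: (region, station, rainfall_mm). Invented for illustration.
records = [
    ("west", "s1", 10.0),
    ("west", "s1", 14.0),
    ("west", "s2", 6.0),
    ("east", "s3", 8.0),
    ("east", "s3", 12.0),
]


def group_sum(pairs):
    """Group (key, value) pairs by key and sum the values for each key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def map_partitions(partitions, fn):
    """Apply fn once per split (partition) rather than once per record."""
    return [fn(part) for part in partitions]


# First aggregation: total rainfall per (region, station).
per_station = group_sum(
    ((region, station), mm) for region, station, mm in records
)

# Second aggregation: regroup the in-memory result by region alone.
# The intermediate result is never written out and re-read between the two steps.
per_region = group_sum(
    (region, total) for (region, _station), total in per_station.items()
)

# A per-split operation: count the records in each of two hypothetical splits.
partition_counts = map_partitions([records[:3], records[3:]], len)
```

In a strict Map/Reduce setting, the second grouping would typically be a separate job with its own map phase and a round trip through disk; here the first result is simply handed to the next grouping.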
All sorts of algorithms can be written in a Map/Reduce style, so this isn’t an advantage in functionality; it is a performance advantage. The extra Map step, for example, would require serialization, deserialization, disk IO, etc. By keeping data in memory, you avoid another round of serialization and deserialization, the IO, and the overhead of spinning up the task instances. These distinctions allow a more efficient use of resources and are an important step forward for a certain class of Big Data problems. To me, the obvious fit is analytics, OLAP-type queries, and late-stage post-processing. Another use case is sampling the data for basic statistics or to prototype a larger algorithm. The documentation suggests Machine Learning as another focus, but that’s not my interest at the moment.
Naturally, you’re limited to working with datasets that mostly fit in memory (I say mostly, because there are some options to spill to disk; I haven’t tested this feature yet). Even with that, it’s reasonable to have hundreds of GBs (or even TBs) across your cluster, so this still allows for some interesting Big Data work. Also note that it doesn’t replicate data in memory (so you don’t need 3X storage as with HDFS); instead it saves lineage for your splits, which can be used to reconstruct lost data partitions. I imagine that we would continue to use Hadoop for a lot of the heavy lifting (and more static workflows), and would initially consider Spark for ad-hoc queries and post-processing on results.
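The lineage idea can be illustrated with a toy model. The sketch below is plain Python and does not reflect how Spark is actually implemented; the `Dataset` class and its `recover` method are hypothetical names. Each derived dataset remembers its parent and the function that produced it, so a lost partition can be recomputed instead of being restored from a replica.

```python
class Dataset:
    """A toy partitioned dataset that remembers how it was derived."""

    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions  # list of lists; None marks a lost partition
        self.parent = parent          # the dataset this one was derived from
        self.fn = fn                  # the per-element function that derived it

    def map(self, fn):
        """Derive a new dataset and record the lineage (parent + function)."""
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return Dataset(new_parts, parent=self, fn=fn)

    def recover(self, index):
        """Return partition `index`, replaying lineage if it has been lost."""
        if self.partitions[index] is None:
            if self.parent is None:
                raise ValueError("source partition lost; nothing to replay")
            source = self.parent.recover(index)
            self.partitions[index] = [self.fn(x) for x in source]
        return self.partitions[index]


base = Dataset([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)

# Simulate losing one partition of the derived dataset.
doubled.partitions[1] = None

# Only the lost partition is recomputed, from the matching parent partition.
recovered = doubled.recover(1)
```

The trade-off this models is recomputation time in exchange for memory: nothing is stored twice, but a failure costs a replay of the recorded steps for the affected split.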
The Spark API as a Query Language
I will first admit that I do not believe that query language investigation is a goal of the Spark project. They provide a usable API, but their focus (as I see it) is on improving performance to enable real-time queries by pinning data in memory and enabling more efficient workflows.
The API allows for an imperative style… do this to the dataset, then do this, then this. Each step is an operation, often taking some anonymous function to describe the details of what should be done. This is much like Pig or Cascading, but not like Hive (SQL) which is a declarative style, nor like Cascalog which is a form of logic programming.
Writing programs using the Spark API, while not terrible, is a bit frustrating. You first have to get used to the different operation types; reduce vs reduceByKey, groupBy, map, flatMap, mapPartitions, glom, filter, sample, union, foreach, fold, collect, pipe, aggregate and more. Many of these offer interesting new options vs just map and reduce, but the breadth of options can be overwhelming at first, and some of these options seem redundant.
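As a rough guide to how some of these operations differ, here is a plain-Python sketch of the semantics as I understand them. It is not Spark code; `reduce_by_key` is a hypothetical helper and the data is invented.

```python
from functools import reduce
from operator import add

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# reduce: collapse the whole dataset into a single value.
total = reduce(add, (value for _key, value in pairs))


def reduce_by_key(fn, pairs):
    """Collapse the values separately for each key, yielding one result per key."""
    out = {}
    for key, value in pairs:
        out[key] = fn(out[key], value) if key in out else value
    return out


by_key = reduce_by_key(add, pairs)

# map vs flatMap: map yields exactly one output per input element,
# while flatMap lets each input contribute zero or more outputs.
lines = ["a b", "c"]
mapped = [line.split() for line in lines]
flat_mapped = [word for line in lines for word in line.split()]
```

Seen this way, some of the apparent redundancy is a matter of scope: the same combining function applied to the whole dataset, to each key, or to each split.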
However, the hardest part in my experience is all the book-keeping. Similar to the Hadoop Java API, each operation only deals with one (a value) or two (a key and value) objects. You don’t have the benefit of a higher-order API or language (a la Cascading, Cascalog or Pig) to keep track of field names and types for you. Therefore you need to do a lot of book-keeping to assemble and disassemble your data elements from these opaque objects. This is probably a solvable problem, but I’m reporting on the out-of-the-box experience.
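The book-keeping burden is easier to see in a small example. This is a plain-Python sketch, not Spark code; the record layout, the `Record` type, and the `to_record` and `to_pair` helpers are all hypothetical. It contrasts unpacking an opaque key/value pair by position with a thin named wrapper of the sort a higher-level layer might maintain for you.

```python
from collections import namedtuple

# An opaque (key, value) pair: nothing says what each position means.
raw = ("field-17", (2012, 41.5, "corn"))

# Manual disassembly, which has to be repeated (and kept in sync) at every step.
key, value = raw
year, yield_value, crop = value

# One possible mitigation: name the fields once and convert at the edges.
Record = namedtuple("Record", ["field_id", "year", "yield_value", "crop"])


def to_record(pair):
    """Turn a positional (key, value) pair into a named record."""
    field_id, (year, yield_value, crop) = pair
    return Record(field_id, year, yield_value, crop)


def to_pair(record, key_field):
    """Reassemble a (key, value) pair keyed on the chosen field name."""
    return (getattr(record, key_field), record)


rec = to_record(raw)

# Re-keying by a different field no longer depends on remembering positions.
rekeyed = to_pair(rec, "crop")
```

This only moves the problem to one place rather than removing it, which is roughly what the higher-level tools do on your behalf.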
I should also point out that, because Spark is an API, it is possible to build more usable query languages or APIs on top of it. I suspect this is even the intention of the authors. Shark is a prime example: it is a Hive SQL interface built on top of Spark. I believe the same could be done for Cascalog/Cascading. A big question is whether these higher-order languages, which were originally designed for the Hadoop Map/Reduce framework, can take advantage of the extra operation types (beyond map and reduce functions) for more efficient workflows. I believe that with an intelligent planner/optimizer, they can.
I expect to experiment more with Shark. I would also love to see a Cascading planner for Shark.
In the end, I’m very excited about this direction for big data. There are several other projects working in this area, including Apache Drill (similar to Google BigQuery) and Cloudera’s Impala. Collectively, this is a movement for Big Data toward interactive and real-time querying (aka low-latency analysis). I spent a fair amount of time on BigQuery. There were definitely some quirks and bugs, but my two main concerns were 1) the difficulty of moving our data from S3 to Google Storage, and 2) the lack of extensibility. Drill likely addresses those two points, but I have not tried it. Likewise, I have not tried Impala.
Who knows which project will win; perhaps they will each have their own niche. In any case, I believe Spark is a competent player and I look forward to monitoring its progress.