Modern Data-Intensive Scalable Computing (DISC) systems are designed to process data through batch jobs that execute programs (e.g., queries) compiled from a high-level language. These programs are often developed interactively by posing ad-hoc queries over the base data until a desired result is generated. We observe that there can be significant overlap in the structure of the queries used to derive the final program. Yet, each successive execution of a slightly modified query is performed anew, which can significantly lengthen the development cycle. Vega is an Apache Spark framework that we have implemented for optimizing a series of similar Spark programs, likely originating from a development or exploratory data analysis session. Spark developers (e.g., data scientists) can leverage Vega to significantly reduce the amount of time it takes to re-execute a modified Spark program, reducing the overall time to market for their Big Data applications.
Categories and Subject Descriptors: H.2.4 [Information Systems]: Database Management—query processing, parallel databases
General Terms: Languages, Performance, Theory
Keywords: Query Rewriting, Incremental Evaluation, Spark, Interactive Development, Big Data

Data scientists report spending the majority of their time writing code to ingest data from several sources, transform it into a common format, clean erroneous entries, and perform exploratory data analysis to understand its structure [22]. These tasks span the development cycle of a Big Data application. For instance, data cleaning and exploratory data analysis are typically performed in an iterative process: programmers start with an initial query and iteratively refine it until the output is in the desired form.
Despite this common pattern, existing Data-Intensive Scalable Computing (DISC) systems, such as Apache Hadoop [3] and Apache Spark [4], do not optimize for these scenarios. Rather, they run each query refinement anew, ignoring the work done in previous executions. Due to the immense scale of today’s datasets and the aforementioned steps involved, developing Big Data applications is very time consuming; it is not uncommon for data scientists to wait hours, only to find that they should have filtered some unforeseen outliers. A common fallback approach is to develop against a sample of the data. However, this approach is incomplete in that it does not guard against outliers not in the sample.
Our goal is to support interactive development of data-intensive applications by optimizing the execution of query refinements. There have been several prior works on optimizing data-intensive applications, but they do not meet this need. Some works can provide large speedups in the face of changes to the input data, for example via a form of incremental computation [25, 26, 29] or targeted optimizations for recurring workloads [23, 28] and iterative (recursive) queries [9, 26, 31]. These approaches all assume that the query itself is unchanged. Other systems provide the ability to cache and reuse the results of sub-computations [15, 24]. In an interactive development setting, these systems would allow unchanged portions of a query to reuse old results. However, any parts of a query that are downstream of a code change must still be executed from scratch, thereby limiting the ability to obtain interactive speeds.
In this paper we introduce Vega: an Apache Spark framework that automatically optimizes a series of similar Spark programs, likely originating from a development or exploratory data analysis session. Vega automatically reuses materialized intermediate results from the previous run when executing the new version of a program. As a starting point, we can reuse the results from the latest materialization point before any code modification, as prior cache-based systems would do [15, 24]. Vega significantly improves upon this baseline by automatically rewriting the dataflow to push the code modifications as late as possible, thereby allowing the execution to start from a later materialization point. This optimization is driven by an analysis that determines when and how two successive operations can be reordered without changing program semantics.
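The intuition behind the rewriting can be illustrated with a minimal, self-contained sketch in plain Python (not Vega's actual API; all names here are illustrative). A filter added early in a refined pipeline is pushed past an unchanged aggregation, because its predicate reads only the grouping key, so the cached output of the previous run can be reused instead of re-reading the base data:

```python
# Hypothetical sketch of Vega-style dataflow rewriting. A filter added
# upstream in a refined query is pushed below an unchanged aggregation,
# so the materialized output of the previous run can be reused.

# Base data: (page, target_link) pairs.
records = [
    ("a.com", "wikipedia.org"),
    ("b.com", "example.com"),
    ("c.com", "wikipedia.org"),
]

# Previous run: count links per target domain; output was materialized.
def count_links(pairs):
    counts = {}
    for _, target in pairs:
        counts[target] = counts.get(target, 0) + 1
    return counts

cached_counts = count_links(records)  # materialized in the previous run

# Refinement: keep only Wikipedia links. Applied naively, the new filter
# sits upstream of the aggregation and forces a full re-execution:
naive = count_links([r for r in records if "wikipedia" in r[1]])

# Rewritten: the predicate only reads the grouping key, so it commutes
# with the aggregation and can run directly over the cached output.
rewritten = {d: c for d, c in cached_counts.items() if "wikipedia" in d}

assert naive == rewritten  # same result, without touching the base data
```

The key step is the commutativity check: the reordering is only sound because filtering groups after aggregation is equivalent to filtering records before it when the predicate depends solely on the key.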
In addition to the rewriting optimization, Vega employs a complementary technique that adapts the prior work on incremental computation mentioned above (i.e., [25, 26, 29]) to our setting. Specifically, Vega can perform an incremental computation rather than ordinary re-execution of the operations downstream of the modified portion of the program, thereby computing only data changes (deltas) relative to the previous execution. We detail how Vega determines when such incremental computation is more profitable than ordinary computation.
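The delta-based idea can likewise be sketched in plain Python (again illustrative, not Vega's API): when a filter predicate is refined, only the records whose filter outcome changed are propagated to update a downstream aggregate, rather than recomputing it from scratch:

```python
# Hypothetical sketch of incremental re-execution via deltas. Only the
# records that enter or leave the modified filter's output are used to
# update the downstream count; the full dataset is not re-aggregated.

records = [("a", 3), ("b", 7), ("c", 5), ("d", 9)]

old_pred = lambda kv: kv[1] > 4   # filter in the previous version
new_pred = lambda kv: kv[1] > 6   # refined filter in the new version

# Downstream aggregate from the previous run (count of passing records).
old_count = sum(1 for r in records if old_pred(r))

# Deltas: records whose filter outcome changed between the two versions.
removed = [r for r in records if old_pred(r) and not new_pred(r)]
added = [r for r in records if new_pred(r) and not old_pred(r)]

# Incremental update touches only the deltas.
new_count = old_count + len(added) - len(removed)

assert new_count == sum(1 for r in records if new_pred(r))
```

Whether such an update is cheaper than ordinary re-execution depends on the size of the deltas relative to the data; this is exactly the profitability decision the paper describes.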
We have implemented Vega both at the Spark SQL level (referred to as Vega SQL) as well as at the RDD transformation level (Vega RDD), in order to optimize programs written directly in Spark. Thanks to the high-level semantics of Spark SQL, query rewriting performed by Vega SQL is completely transparent to the user, i.e., no additional information is required from the programmer. Vega RDD instead trades off transparency for additional optimizations: Vega RDD comes with a specifically tailored API enabling rewrites that leverage the lower-level physical plan information provided by the RDD representation. Vega RDD also supports the complementary incremental computation optimization.
Experimental evaluations show that Vega is able to achieve up to three orders-of-magnitude performance gains by reusing work from prior Spark program executions in several real-world scenarios. Figure 1 previews the performance of Vega SQL compared to normal Spark SQL for re-executing a modified query that measures how many links in the Common Crawl dataset [2] point to a certain domain; the modification refines the query by returning only links that point to Wikipedia pages. In the case of Spark SQL, the modified program is executed from scratch, whereas Vega SQL is able to rewrite the modification to operate over the output of the previous execution. As a result, the response time of the re-executed query is significantly lower than with native Spark SQL. A more complete description of this experiment is given in Section 5.