
Distributed CBO: Cost Models in Apache Spark and BigQuery Architecture

By Siobhán O'Malley Dec 13, 2025
All rights reserved to analyzequery.com

Relational Query Optimization Mechanics represent the intersection of algebraic theory and distributed systems engineering. This discipline focuses on the identification and selection of the most efficient execution plan for a given SQL query, a task that becomes exponentially more complex as data scales across distributed nodes. The introduction of the Cost-Based Optimizer (CBO) in Apache Spark 2.2 and the architectural paradigms of Google BigQuery illustrate two distinct approaches to solving the challenge of high-latency data retrieval in large-scale environments.

While traditional relational database management systems (RDBMS) primarily optimized for local disk I/O and CPU cycles, distributed systems must focus on the mitigation of network overhead. Modern optimization engines analyze query graphs to determine join ordering, evaluate predicate pushdown opportunities, and use statistical metadata to estimate the cardinality of intermediate result sets. These mechanics are governed by cost models that assign numerical values to various execution strategies, ultimately selecting the path with the lowest predicted resource consumption.
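The idea of assigning numerical values to execution strategies can be sketched as a toy cost model. Everything below is invented for illustration (the resource weights, the candidate plans, and their usage numbers); the only point it demonstrates is the shape of the decision: distributed optimizers price network transfer far above local CPU and disk work, then pick the plan with the lowest total.

```python
# Hypothetical per-unit weights: network transfer dominates in a distributed
# setting, which is why shuffle-heavy plans lose even with lower CPU usage.
COST_WEIGHTS = {"cpu": 1.0, "disk_io": 4.0, "network": 50.0}

def plan_cost(plan):
    """Total cost = sum over resources of (units used * per-unit weight)."""
    return sum(COST_WEIGHTS[res] * units for res, units in plan["usage"].items())

def choose_plan(plans):
    """Select the candidate with the lowest predicted resource consumption."""
    return min(plans, key=plan_cost)

# Two made-up candidate strategies for the same logical join:
candidates = [
    {"name": "shuffle_join",   "usage": {"cpu": 100, "disk_io": 20, "network": 80}},
    {"name": "broadcast_join", "usage": {"cpu": 120, "disk_io": 20, "network": 5}},
]
best = choose_plan(candidates)
# The broadcast plan wins despite higher CPU, because it moves far less data.
```

Real optimizers compute these usage numbers from statistics rather than receiving them as inputs, but the final step is the same minimization.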

What changed

  • Introduction of Statistics-Driven Optimization: With the release of Apache Spark 2.2, the framework transitioned from a purely rule-based optimizer (RBO) to a cost-based model. This allowed for more intelligent join selection based on column-level statistics such as cardinality, null counts, and histograms.
  • Shift to Shuffle-Centric Cost Models: Cloud-native systems like Google BigQuery began prioritizing the minimization of data shuffling across the network over traditional local storage bottlenecks, reflecting the architecture of decoupled compute and storage.
  • Automation of Broadcast Thresholds: Spark's CBO enabled the dynamic adjustment of broadcast join thresholds, allowing the engine to automatically decide whether to replicate a small table across all worker nodes or perform a more expensive shuffle join.
  • Evolution of the Iterator Model: The 'Volcano' iterator model, originally designed for single-node RDBMS, was adapted to handle distributed streams, necessitating new approaches to pipelining and batching to maintain performance across networked clusters.

Background

The foundations of modern query optimization were established by Patricia Selinger and the IBM System R team in the late 1970s. Selinger’s work introduced the concept of the cost-based optimizer, which utilizes a dynamic programming approach to explore the space of possible join orders and access paths. This model relies on three primary components: an exhaustive set of equivalent algebraic transformations, a cost model to estimate the resource requirements of each transformation, and a set of statistics describing the data distribution.
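The dynamic programming idea can be sketched in miniature. This is a heavy simplification of the Selinger approach (real System R also tracked access paths and interesting orders): here the cost of a plan is just the sum of its intermediate result sizes, and the base cardinalities and pairwise join selectivities are made-up numbers.

```python
from itertools import combinations

# Hypothetical statistics: base table row counts and pairwise join selectivities.
CARD = {"A": 1000, "B": 100, "C": 10}
SEL = {frozenset("AB"): 0.01, frozenset("BC"): 0.1, frozenset("AC"): 1.0}

def join_size(size_l, size_r, tables_l, tables_r):
    """Estimate output rows: product of inputs times every cross-pair selectivity."""
    sel = 1.0
    for l in tables_l:
        for r in tables_r:
            sel *= SEL.get(frozenset((l, r)), 1.0)
    return size_l * size_r * sel

def best_plan(tables):
    """Selinger-style DP: best[S] = (cost, output_size, plan) per table subset S."""
    best = {frozenset([t]): (0.0, CARD[t], t) for t in tables}
    for n in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, n)):
            for k in range(1, n):
                for left in map(frozenset, combinations(subset, k)):
                    right = subset - left
                    cl, sl, pl = best[left]
                    cr, sr, pr = best[right]
                    out = join_size(sl, sr, left, right)
                    cost = cl + cr + out   # pay for every intermediate result
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, out, f"({pl} JOIN {pr})")
    return best[frozenset(tables)]

cost, out_rows, plan = best_plan(["A", "B", "C"])
# The DP discovers that joining B and C first yields the cheapest total.
```

The key property, as in System R, is that the best plan for a subset of tables can be built from the best plans for its sub-subsets, which keeps the search tractable.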

In the decades following Selinger's research, the 'Volcano' optimizer generator, proposed by Goetz Graefe, provided a framework for implementing these ideas. The Volcano model uses an iterator-based approach where each operator in a query plan (e.g., Scan, Filter, Join) implements a Next() method. This pulls tuples through the plan tree in a demand-driven fashion. However, as data moved from single-server disks to distributed clusters, the overhead of individual tuple processing and the high cost of network latency required a fundamental re-evaluation of how these models were applied.
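The pull-based pipeline can be sketched with a few toy operators. This is a minimal illustration of the iterator idea, not any engine's actual implementation: each operator exposes a next() method, the parent pulls one tuple at a time, and None signals end-of-stream, so tuples flow through the tree without materializing intermediate results.

```python
class Scan:
    """Leaf operator: yields base rows one at a time."""
    def __init__(self, rows):
        self.it = iter(rows)
    def next(self):
        return next(self.it, None)          # None signals end-of-stream

class Filter:
    """Pulls from its child until a row satisfies the predicate."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def next(self):
        while (row := self.child.next()) is not None:
            if self.pred(row):
                return row
        return None

class Project:
    """Applies a transformation to each row pulled from its child."""
    def __init__(self, child, fn):
        self.child, self.fn = child, fn
    def next(self):
        row = self.child.next()
        return None if row is None else self.fn(row)

# Plan for: SELECT x * 2 FROM rows WHERE x > 2, driven from the root:
plan = Project(Filter(Scan([1, 2, 3, 4]), lambda x: x > 2), lambda x: x * 2)
out = []
while (row := plan.next()) is not None:
    out.append(row)
# out == [6, 8]
```

The per-tuple method call on every operator is exactly the overhead that later motivated vectorized execution and whole-stage code generation.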

The Apache Spark 2.2 Cost-Based Optimizer

Prior to version 2.2, Apache Spark’s Catalyst optimizer relied almost exclusively on heuristics. These rules, while effective for basic transformations like constant folding and predicate pushdown, could not accurately determine the most efficient join order for complex multi-way join queries. The introduction of the CBO provided the engine with a mechanism to collect and use table and column statistics.

In Spark 2.2, the ANALYZE TABLE command became the primary tool for gathering metadata, including the number of rows, the total size of the table, and detailed column statistics such as min/max values and number of distinct values (NDV). These statistics allow the optimizer to estimate the selectivity of filters and the size of join outputs. A critical application of this data is the optimization of join algorithms. For example, if the CBO determines that one side of a join is significantly smaller than the spark.sql.autoBroadcastJoinThreshold setting, it will favor a Broadcast Hash Join. This avoids the heavy network shuffle required by a Sort Merge Join by sending the small table to every executor, thereby performing the join locally.
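The decision described above can be sketched as a simple rule over table-size statistics. This is an illustrative model, not Spark's code: the 10 MB constant mirrors the default value of spark.sql.autoBroadcastJoinThreshold, while the table sizes in the usage lines are made up.

```python
AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024   # 10 MB, the Spark default

def choose_join(left_bytes, right_bytes, threshold=AUTO_BROADCAST_THRESHOLD):
    """Broadcast the smaller side if its estimated size fits under the
    threshold; otherwise fall back to a shuffle-based sort merge join."""
    smaller_side = min(left_bytes, right_bytes)
    if smaller_side <= threshold:
        return "broadcast_hash_join"
    return "sort_merge_join"

choose_join(50 * 1024**3, 2 * 1024**2)   # 50 GB vs 2 MB: broadcast the 2 MB side
choose_join(50 * 1024**3, 8 * 1024**3)   # 50 GB vs 8 GB: both too big, shuffle
```

The quality of this decision depends entirely on the size estimates, which is why stale or missing ANALYZE TABLE statistics can push the engine into the wrong strategy.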

Google BigQuery and Cloud-Native Cost Models

Google BigQuery represents a departure from the traditional cluster-based approach seen in Spark. Built on the Dremel execution engine, BigQuery utilizes a serverless architecture where compute resources (slots) are separated from the storage layer (Colossus). In this environment, the cost model is uniquely focused on the "shuffle" phase—the process of redistributing data across the network during join or aggregation operations.

BigQuery's optimizer is designed to minimize the volume of data that traverses the network. Because the underlying storage is highly parallel and remote, local disk I/O is less of a factor than the efficiency of the network interconnect. The BigQuery optimizer employs dynamic query execution, which allows it to modify the execution plan at runtime based on the actual size of intermediate results. This is particularly useful in distributed environments where data skew can cause initial cardinality estimations to be wildly inaccurate. By re-partitioning data on the fly and adjusting join strategies during execution, BigQuery compensates for the limitations of static cost models.
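The difference between static and dynamic planning can be sketched as follows. This is a deliberately simplified model, not BigQuery internals: the engine first picks a strategy from a cardinality estimate, then re-checks that choice once a stage has run and the true intermediate size is known. The 100,000-row cutoff and all row counts are illustrative.

```python
SMALL_LIMIT = 100_000   # rows under which replicating one side beats repartitioning

def pick_strategy(rows):
    return "replicate_small_side" if rows <= SMALL_LIMIT else "repartition_both"

def execute_join(estimated_rows, run_stage):
    """Plan statically from the estimate, then re-plan from the actual size."""
    planned = pick_strategy(estimated_rows)
    actual_rows = run_stage()            # the upstream stage finishes; true size known
    final = pick_strategy(actual_rows)   # adapt if the estimate was badly off
    return planned, final

# The estimate said 50k rows, but skewed data actually produced 5 million:
planned, final = execute_join(50_000, run_stage=lambda: 5_000_000)
# planned == "replicate_small_side", final == "repartition_both"
```

A static optimizer would have committed to the first choice and suffered the consequences; the adaptive one pays a small re-planning cost to avoid a catastrophic strategy.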

The Volcano Model in Distributed Execution

The Volcano iterator model remains a cornerstone of database engine design, but its application in distributed systems like Spark and BigQuery requires modification. In a distributed context, the simple Next() call of a Volcano operator may result in high CPU overhead due to virtual function calls and poor cache locality. To address this, Spark implemented "Whole-Stage Code Generation," which collapses the Volcano-style operator tree into a single, optimized Java function at runtime. This maintains the logical structure of the Volcano model while achieving the performance of hand-written code.
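The fusion idea can be sketched with a toy code generator. This is only an analogy (Spark emits Java bytecode through Janino, not Python): instead of pulling tuples through separate Filter and Project operators, the planner emits the source of one tight loop with both operators inlined, then compiles it at runtime.

```python
def generate_fused_stage(pred_src, proj_src):
    """Emit and compile a single function fusing a filter and a projection.
    pred_src and proj_src are expression strings over the row variable x."""
    src = (
        "def fused(rows):\n"
        "    out = []\n"
        "    for x in rows:\n"
        f"        if {pred_src}:\n"            # the Filter operator, inlined
        f"            out.append({proj_src})\n"  # the Project operator, inlined
        "    return out\n"
    )
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["fused"]

# One tight loop, no per-tuple virtual calls between operators:
fused = generate_fused_stage("x > 2", "x * 2")
fused([1, 2, 3, 4])   # -> [6, 8]
```

The generated function computes the same result as a chain of Volcano iterators, but the operator boundaries have disappeared, which is what restores cache locality and lets the runtime optimize the loop as a whole.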

In distributed execution, the Volcano model must also account for data serialization and network buffers. Each operator no longer just processes tuples; it must manage the transition between local processing and remote data exchange. This necessitates the use of exchange operators (shuffles) that act as synchronization points within the iterator tree, effectively breaking the query into stages that can be executed in parallel across the cluster.

Join Algorithms and Cardinality Estimation

At the heart of Relational Query Optimization Mechanics is the selection of join algorithms: nested loop, merge, and hash joins. The optimizer's choice depends heavily on cardinality estimation—the prediction of how many rows will satisfy a specific condition. In distributed systems, the cost of a wrong estimation is amplified by network latency. If an optimizer underestimates the size of a result set and chooses a broadcast join for a table that is actually several gigabytes in size, the resulting memory pressure can lead to Out-of-Memory (OOM) errors and job failure.
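A common textbook estimate for equi-join output size, assuming uniformly distributed join keys, is |R ⋈ S| ≈ |R| · |S| / max(NDV_R, NDV_S). The sketch below applies it with hypothetical row counts and NDVs; real optimizers refine this with histograms precisely because the uniformity assumption fails on skewed data.

```python
def estimate_join_rows(rows_r, ndv_r, rows_s, ndv_s):
    """Classic equi-join cardinality estimate under the uniformity assumption:
    each distinct key on the denser side matches rows/ndv rows on the other."""
    return rows_r * rows_s // max(ndv_r, ndv_s)

# Hypothetical statistics for two tables joined on a shared key column:
est = estimate_join_rows(rows_r=1_000_000, ndv_r=50_000,
                         rows_s=200_000, ndv_s=40_000)
# est == 4_000_000 estimated output rows
```

If the true NDV is much lower than the statistics claim, this formula underestimates the output, which is exactly the scenario that turns a planned broadcast join into an OOM failure.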

Advanced cost models now incorporate histogram data and sketch-based algorithms (like HyperLogLog) to improve the accuracy of NDV estimations. These tools allow the optimizer to detect data skew—where certain keys are over-represented in the dataset—and choose specialized join strategies, such as skew joins, that distribute the workload more evenly across the worker nodes. This granular level of analysis is what distinguishes modern distributed CBOs from their predecessors.
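Skew detection can be sketched with a simple rule: flag a join key as skewed when it holds more than some multiple of its fair share of the rows. This is illustrative only; the 5x factor is arbitrary, and production engines consult histograms or sketches rather than counting keys exactly as the Counter does here.

```python
from collections import Counter

def skewed_keys(keys, skew_factor=5.0):
    """Return keys holding more than skew_factor times the uniform share."""
    counts = Counter(keys)
    fair_share = len(keys) / len(counts)    # rows per key if perfectly uniform
    return {k for k, c in counts.items() if c > skew_factor * fair_share}

# One key dominates the dataset; a skew join would split just that key's
# rows across multiple workers instead of hashing them all to one node:
data = ["hot"] * 90 + ["a", "b", "c", "d", "e"] * 2
skewed_keys(data)   # -> {"hot"}
```

Once the hot keys are known, the engine can route only those keys through a specialized replicated or salted join path while the uniform remainder uses the ordinary hash partitioning.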

What sources disagree on

There is ongoing debate within the database engineering community regarding the trade-offs between static cost-based optimization and dynamic, or adaptive, query execution. Some practitioners argue that as datasets grow more volatile, the utility of static statistics decreases, making the case for systems that focus exclusively on runtime adaptation. Others contend that without a strong static cost model to provide an initial plan, the overhead of re-optimizing a query during execution can negate any performance gains.

Furthermore, the value of the Volcano iterator model in the era of vectorization and SIMD (Single Instruction, Multiple Data) processing is frequently questioned. Proponents of vectorized execution engines argue that the row-by-row approach of the Volcano model is inherently incompatible with modern CPU architectures, suggesting that batch-oriented or columnar processing should entirely replace the traditional iterator-based framework in high-performance distributed systems.

Tags: Relational Query Optimization, Apache Spark CBO, Google BigQuery Architecture, Volcano Iterator Model, Cost-Based Optimizer, Distributed Join Algorithms
Siobhán O'Malley

A Senior Writer who dissects the latent logic of predicate pushdown and the complexities of view merging. She is passionate about helping readers visualize the cascading application of rules within execution plans to optimize intermediate result sets.
