The Cascades Framework: How Goetz Graefe Revolutionized Query Transformation

By Julian Krell Apr 8, 2026
All rights reserved to analyzequery.com

In 1995, computer scientist Goetz Graefe published "The Cascades Framework for Query Optimization," a seminal paper that redefined how relational database management systems (RDBMS) translate high-level SQL queries into efficient physical execution plans. This framework served as the successor to Graefe’s earlier Volcano model, addressing critical limitations in search space management and rule-based extensibility. By introducing a top-down, task-oriented approach to optimization, Cascades provided a rigorous mathematical and structural foundation for modern database engines.

The Cascades framework is characterized by its strict separation of logical expressions (what data is requested) from physical expressions (how data is retrieved). This distinction allows database optimizers to explore a vast array of algebraic transformations and physical implementations without conflating the two. The framework’s adoption by major commercial systems, most notably Microsoft SQL Server, established it as a standard in the field of relational query optimization mechanics, enabling the processing of increasingly complex join graphs and nested subqueries.

What Changed

The transition from the Volcano optimizer model to the Cascades framework represented a significant shift in the architecture of query search engines. While both models use rule-based transformations, Cascades introduced several structural innovations that optimized the search process itself.

  • Integration of Search and Pruning: In the Volcano model, transformation rules and cost-based pruning were often handled in distinct phases. Cascades integrated these into a single, unified search process, allowing the engine to discard inefficient paths earlier in the cycle.
  • Task-Based Execution: Cascades moved away from a simple recursive call structure toward a task-based stack. This allowed the optimizer to focus on specific optimization tasks, such as exploring certain join orders before others, based on potential cost savings.
  • The Memo Structure: Cascades refined the "Memo" data structure—a form of dynamic programming—to store equivalent expressions efficiently. This prevented the redundant optimization of the same sub-expression across different parts of the query plan.
  • On-Demand Transformation: Unlike earlier systems that might apply all possible rules to an expression, Cascades applies rules only when necessary to satisfy a specific optimization goal, significantly reducing the memory footprint during the optimization of complex queries.
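The task-based execution model above can be sketched in a few lines of Python. This is an illustrative sketch with hypothetical names (Task, OptimizeGroup, ExploreExpr), not the actual Cascades interfaces: the key idea is that tasks are pushed onto and popped from an explicit stack, so the optimizer controls exploration order instead of relying on implicit recursion.

```python
# Sketch of a Cascades-style task stack (hypothetical names, for
# illustration). The explicit stack replaces recursive calls, letting the
# optimizer reorder or defer work.

class Task:
    def run(self, stack):
        raise NotImplementedError

class OptimizeGroup(Task):
    def __init__(self, group):
        self.group = group

    def run(self, stack):
        # Schedule exploration of each expression in the group; pushing in
        # reverse order makes the first expression run first (LIFO stack).
        for expr in reversed(self.group):
            stack.append(ExploreExpr(expr))

class ExploreExpr(Task):
    def __init__(self, expr):
        self.expr = expr

    def run(self, stack):
        # A real optimizer would fire transformation rules here, possibly
        # pushing further tasks onto the stack.
        pass

def optimize(root_group):
    stack = [OptimizeGroup(root_group)]
    order = []
    while stack:
        task = stack.pop()              # LIFO: depth-first, goal-driven
        order.append(type(task).__name__)
        task.run(stack)
    return order
```

Because each unit of work is a first-class object, the optimizer can prioritize promising tasks or abandon branches early, which is harder to express with plain recursion.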

Background

The origins of query optimization date back to the 1970s with the IBM System R project and Patricia Selinger’s pioneering work on cost-based optimization. Selinger’s model introduced the concept of using statistics to estimate the cost of different join orders, primarily using a bottom-up dynamic programming approach. While effective for small queries, this approach struggled with the exponential growth of search spaces as queries grew in complexity.

In the late 1980s and early 1990s, Goetz Graefe developed the Exodus and Volcano optimizer generators. These were designed to be extensible, allowing developers to define their own operators and rules. However, the Volcano model faced challenges with "re-optimization" and difficulty in handling complex physical properties like interesting orders (e.g., data already sorted by a specific column). The 1995 Cascades paper sought to resolve these issues by providing a more flexible, object-oriented framework that could handle the demands of increasingly sophisticated relational and object-relational databases.

The Architecture of the Memo

The core of the Cascades framework is the Memo. This data structure acts as a forest of trees, where each node represents a Group. A group contains all equivalent expressions that produce the same result set, regardless of their physical implementation. For example, a group representing the join of tables A and B would contain the logical expression Join(A, B) as well as physical expressions like HashJoin(A, B) and NestedLoopJoin(A, B).

By organizing expressions into groups, Cascades ensures that if the optimizer finds the best way to join table A and table B, it can reuse that information whenever that specific join is required in a larger query. This memoization is critical for managing the combinatorial explosion of plans in queries involving dozens of tables. Within the Memo, logical rules transform one logical expression into another (e.g., A join B to B join A), while implementation rules map logical expressions to physical ones.
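A minimal sketch of the Memo idea, using a hypothetical structure for illustration: each group stores every known-equivalent expression exactly once, so a sub-plan such as the join of A and B is explored a single time and reused wherever it appears.

```python
# Toy Memo: groups of equivalent expressions, with deduplication.
# (Hypothetical layout; a real Memo also tracks best costs, required
# properties, and exploration state per group.)

class Memo:
    def __init__(self):
        self.groups = {}          # group id -> list of equivalent exprs

    def add(self, group_id, expr):
        exprs = self.groups.setdefault(group_id, [])
        if expr not in exprs:     # deduplication is the point of the Memo
            exprs.append(expr)

memo = Memo()
memo.add("AB", "Join(A, B)")            # logical expression
memo.add("AB", "HashJoin(A, B)")        # physical implementation
memo.add("AB", "NestedLoopJoin(A, B)")  # another physical alternative
memo.add("AB", "Join(A, B)")            # duplicate: silently ignored
```

The deduplication check is what turns an exponential plan space into a manageable one: equivalent sub-expressions collapse into a single group no matter how many rewrites produce them.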

Separation of Logic and Physics

One of the most durable contributions of Cascades is the formal decoupling of the logical query space from the physical implementation space. Logical operators define the relational algebra—projections, selections, joins, and unions—without regard for how data is stored on disk or processed in memory. Physical operators, conversely, represent specific algorithms like index scans, sort-merge joins, or bitmap filters.

This separation allows database engineers to add new physical algorithms (such as a new type of spatial index) without modifying the logical transformation rules. It also enables the optimizer to handle "enforcers." An enforcer is a physical operator, such as Sort, that does not change the logical content of the data but modifies its physical properties to satisfy the requirements of a subsequent operator (like a Merge Join that requires sorted input).
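The enforcer mechanism can be sketched as a small property check (hypothetical plan representation and names): if a child plan does not deliver the sort order a parent operator requires, the optimizer wraps it in a Sort enforcer rather than rejecting the plan outright.

```python
# Sketch of the enforcer idea. Plans are tuples for illustration; the
# names and layout here are hypothetical, not a real engine's API.

def satisfy_order(plan, delivered_order, required_order):
    """Return a plan that delivers `required_order`, inserting a Sort
    enforcer only when the child's order does not already satisfy it."""
    if required_order is None or delivered_order == required_order:
        return plan                        # property already satisfied
    return ("Sort", required_order, plan)  # enforcer: changes physical
                                           # properties, not logical content

scan = ("IndexScan", "orders", "order_id")

# A MergeJoin that needs input sorted on customer_id forces an enforcer:
plan = satisfy_order(scan, delivered_order="order_id",
                     required_order="customer_id")
```

Note the asymmetry: when the index scan already delivers the required order, no Sort is inserted and the "interesting order" comes for free, which is exactly the case bottom-up optimizers like System R had to handle with special bookkeeping.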

Algebraic Identities and Search Space Pruning

The Cascades framework relies on a set of algebraic identities to handle the search space. These identities are the rules that allow the optimizer to "rewrite" the query into different but mathematically equivalent forms. Common identities utilized within the framework include:

  • Commutativity: A ⨝ B ≡ B ⨝ A. This allows the optimizer to swap the inner and outer tables in a join.
  • Associativity: (A ⨝ B) ⨝ C ≡ A ⨝ (B ⨝ C). This is fundamental for reordering join sequences to minimize intermediate result sizes.
  • Predicate Pushdown: σ_p(A ⨝ B) ≡ σ_p(A) ⨝ B (if the predicate p only involves columns from A). Applying filters as early as possible reduces the volume of data passed to subsequent join operators.
  • Projection Pruning: Removing columns that are not required by higher-level operators as early as possible to reduce I/O and memory usage.
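Two of these identities can be written as small rewrite rules over a toy expression tree. The representation below is hypothetical and purely illustrative: joins are `("join", left, right)` and filters are `("select", pred, pred_cols, child)`.

```python
# Toy rewrite rules for join commutativity and predicate pushdown.
# (Illustrative tuple encoding; real optimizers use rule objects that
# pattern-match against Memo expressions.)

def commute_join(expr):
    """A join B  ==  B join A"""
    op, left, right = expr
    assert op == "join"
    return ("join", right, left)

def push_down_predicate(expr, columns_of):
    """sigma_p(A join B)  ==  sigma_p(A) join B,
    valid when the predicate p touches only A's columns."""
    op, pred, pred_cols, child = expr
    assert op == "select" and child[0] == "join"
    _, left, right = child
    if pred_cols <= columns_of[left]:      # p references only A's columns
        return ("join", ("select", pred, pred_cols, left), right)
    return expr                            # rule not applicable: unchanged

cols = {"A": {"x", "y"}, "B": {"z"}}
pushed = push_down_predicate(
    ("select", "x > 5", {"x"}, ("join", "A", "B")), cols)
```

The guard on `pred_cols` mirrors the side condition in the identity above: without it, pushing a predicate that references B's columns below the join would change the query's meaning.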

Cascades uses these identities to populate the Memo. However, exploring every possible transformation is computationally expensive. To manage this, the framework employs cost-based pruning. During the search, if a partial plan is already more expensive than the best complete plan found so far (the "upper bound"), the optimizer can stop exploring that branch. This branch-and-bound technique is essential for keeping optimization times within acceptable limits for interactive applications.
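The branch-and-bound idea reduces to a few lines (the cost numbers below are made up for illustration): once a complete plan establishes an upper bound, any partial plan whose accumulated cost already meets or exceeds that bound is abandoned without further expansion.

```python
# Branch-and-bound sketch with hypothetical per-step costs. Each
# alternative is (name, list_of_step_costs); the running best cost serves
# as the upper bound for pruning later alternatives.

def cheapest_plan(alternatives, upper_bound=float("inf")):
    """Return (cost, name) of the cheapest alternative, pruning any
    alternative whose partial cost reaches the current bound."""
    best = (upper_bound, None)
    for name, step_costs in alternatives:
        cost = 0
        for step in step_costs:
            cost += step
            if cost >= best[0]:       # partial plan already too expensive:
                break                 # prune this branch, skip its tail
        else:
            best = (cost, name)       # completed under the bound
    return best

plans = [("hash_join", [40, 25]),     # total 65 -> first upper bound
         ("nested_loop", [50, 30]),   # hits 80 mid-way: pruned at 65
         ("merge_join", [20, 30])]    # total 50 -> new best
```

In a real optimizer the "steps" are costed physical operators and the bound is propagated down into subgroups, but the control flow is the same: the sooner a cheap complete plan is found, the more of the search space can be skipped.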

Implementation in Microsoft SQL Server

The most prominent real-world application of the Cascades framework occurred during the complete rewrite of the Microsoft SQL Server query optimizer for version 7.0 (internally codenamed "Sphinx"). Prior versions of SQL Server used a simpler, less flexible optimizer. By adopting Cascades, SQL Server gained a top-tier engine capable of handling highly complex decision-support queries and large-scale data warehousing workloads.

In the SQL Server implementation, the optimizer is referred to as the "Query Optimizer" or "QO." It utilizes the Cascades task-based architecture to move through various stages of optimization, from initial heuristic simplification to full cost-based search. The engine uses statistics—histograms and density vectors—to estimate the cardinality (number of rows) of intermediate results. These estimations are fed into the Cascades cost model to determine which algebraic transformations are most likely to yield the fastest execution plan.
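A toy version of histogram-based cardinality estimation illustrates the statistics step. The bucket layout below is hypothetical and much simpler than SQL Server's actual histograms (which also track distinct counts and boundary-value frequencies); the point is only how bucket overlap yields a row estimate for the cost model.

```python
# Toy range-predicate cardinality estimate from an equi-width histogram.
# (Hypothetical format: list of (bucket_low, bucket_high, row_count).)

def estimate_rows(histogram, low, high):
    """Estimate rows with values in [low, high], assuming values are
    uniformly distributed inside each bucket."""
    total = 0.0
    for b_low, b_high, rows in histogram:
        overlap = min(high, b_high) - max(low, b_low)
        width = b_high - b_low
        if overlap > 0 and width > 0:
            total += rows * (overlap / width)  # fraction of bucket covered
    return total

hist = [(0, 100, 1000), (100, 200, 3000)]

# Predicate: value BETWEEN 50 AND 150 covers half of each bucket:
est = estimate_rows(hist, 50, 150)
```

The uniformity assumption inside each bucket is exactly where real estimators go wrong on skewed data, which is why cardinality estimation remains an active research target (see the closing section below).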

Impact on Modern Database Research

Beyond its commercial success in SQL Server, Cascades has become the blueprint for several other modern query optimizers. The Apache Calcite framework, used by many big data systems like Apache Hive and Apache Flink, employs a transformation engine heavily influenced by Cascades. Similarly, the Orca optimizer used in Greenplum and HAWQ is a standalone Cascades-based engine designed for massively parallel processing (MPP) environments.

The longevity of Graefe’s 1995 paper is attributed to its flexibility. By defining optimization as a search problem over a set of rules and a memoized state space, Cascades provided a framework that could adapt to new hardware realities, such as multi-core CPUs and fast NVMe storage, simply by updating the cost models and physical operators while leaving the core search engine intact.

"The Cascades framework's primary advantage is its ability to handle complex optimization tasks by breaking them down into small, manageable rules that can be applied selectively."

As relational databases continue to evolve toward autonomous and self-tuning systems, the mechanics of Cascades remain relevant. Machine learning models are now being integrated into Cascades-based optimizers to improve cardinality estimation and cost modeling, further extending the utility of the framework Goetz Graefe introduced nearly three decades ago.

Tags: Cascades Framework, Goetz Graefe, query optimization, SQL Server, relational databases, Volcano model, search space pruning, logical transformation
Julian Krell

Julian contributes deep dives into the mechanics of join algorithms, comparing the efficacy of nested loops against merge and hash joins. His writing emphasizes minimizing I/O operations and CPU cycles through precise cardinality estimation.

