
Evolution of Selinger’s Legacy: From System R to Modern Cost-Based Optimization

By Siobhán O'Malley Oct 31, 2025
All rights reserved to analyzequery.com

In the late 1970s, the IBM San Jose Research Laboratory produced a foundational framework for modern database systems. Relational query optimization emerged as a critical discipline during this period, addressing the challenge of efficiently executing declarative queries on large datasets. Optimization involves the automated selection of the most efficient execution plan from a vast search space of possible algebraic transformations, a task that was previously managed manually by programmers using procedural languages.

The shift toward automated cost-based optimization (CBO) was codified in the 1979 paper, "Access Path Selection in a Relational Database Management System," authored by Patricia Selinger and several colleagues. This work established the principles for evaluating query graphs, determining join ordering, and utilizing statistical estimations to minimize resource consumption. Today, the legacies of these early experiments are visible in the sophisticated optimization engines of commercial relational database management systems (RDBMS) like IBM DB2 and Oracle, which continue to refine the trade-off between CPU cycles and input/output (I/O) operations.

What changed

Prior to the advancements of the late 1970s, data retrieval was primarily a procedural exercise. Application developers were required to know the physical layout of the data on disk and write specific code to traverse paths between records. The introduction of the relational model by E.F. Codd changed this to a declarative paradigm, where users described what data they wanted rather than how to get it. However, this abstraction introduced a performance risk: without an intelligent intermediary to decide on the access path, relational systems were often slower than their hierarchical counterparts.

  • Shift from Rule-Based to Cost-Based: Early systems relied on fixed heuristics (rule-based optimization), such as always using an index if one existed. Selinger’s work introduced the cost-based model, which uses statistics to predict the actual resource cost of different paths.
  • Automated Join Ordering: Instead of executing joins in the order they were written in a query, the optimizer began calculating the permutations of join sequences to find the one with the smallest intermediate results.
  • Statistical Dependency: Database engines began collecting metadata, such as cardinality, selectivity, and data distribution, to inform the optimization process, making the engine responsive to the actual content of the tables.
  • Dynamic Programming in Search: To manage the exponential growth of possible query plans, the optimizer adopted dynamic programming techniques to prune suboptimal sub-plans early in the decision-making process.
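The rule-based versus cost-based distinction can be made concrete with a toy access-path choice. The cost constants, statistics, and functions below are illustrative assumptions, not System R's actual values:

```python
# Toy cost-based access-path selection: compare a full table scan against
# an index scan using collected statistics. All numbers are illustrative.

def seq_scan_cost(pages: int) -> float:
    """A sequential scan reads every page of the table once."""
    return float(pages)

def index_scan_cost(rows: int, selectivity: float, pages_per_row: float = 1.0) -> float:
    """An unclustered index lookup touches roughly one page per matching row."""
    return rows * selectivity * pages_per_row

def choose_access_path(pages: int, rows: int, selectivity: float):
    seq = seq_scan_cost(pages)
    idx = index_scan_cost(rows, selectivity)
    return ("index scan", idx) if idx < seq else ("sequential scan", seq)

# A highly selective predicate: the index wins.
print(choose_access_path(pages=1000, rows=50000, selectivity=0.001))
# An unselective predicate: a rule-based "always use the index" heuristic
# would lose badly here, but the cost model picks the sequential scan.
print(choose_access_path(pages=1000, rows=50000, selectivity=0.5))
```

The second call is exactly the case where fixed heuristics fail: the index exists, yet touching half the table through it costs far more than one sequential pass.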

Background

The development of query optimization was intrinsically linked to the System R project at IBM. System R was an experimental database system designed to prove that the relational model was not only theoretically sound but also practically performant. At the heart of System R was the requirement to handle the newly developed Structured English Query Language (SEQUEL), later shortened to SQL. Because SQL did not specify access paths, the burden of performance fell entirely on the database engine's optimizer.

Before Selinger's paper, the computational cost of evaluating every possible way to execute a complex join was considered prohibitive. A join involving five tables could be executed in 120 different orders, and for each order, multiple access methods (such as index scans or sequential table scans) existed. The System R team needed a way to handle this search space without exhausting system memory or taking longer to optimize the query than to execute it. This led to the creation of the first practical cost model, which factored in both I/O operations and CPU overhead, weighted by the estimated number of records processed at each step.

The 1979 Framework: Access Path Selection

The Selinger algorithm fundamentally changed the architecture of database engines by introducing two major concepts: the cost function and the dynamic programming search. The cost function was expressed as a combination of page fetches and a weighted factor of the number of instructions executed. This formula allowed the system to compare disparate operations, such as a nested loop join versus a merge scan, on a common mathematical footing.
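A minimal sketch of that common mathematical footing follows. The System R paper expressed cost as page fetches plus a weighting factor times CPU work (measured there in RSI calls); here tuple comparisons stand in for CPU work, and the weight and table sizes are invented for illustration:

```python
# Sketch of a System R-style cost formula: cost = page I/Os + W * CPU work.
# W and all table statistics below are illustrative assumptions.

W = 0.01  # CPU-to-I/O weighting factor (not a real system constant)

def cost(page_fetches: int, cpu_ops: int, w: float = W) -> float:
    return page_fetches + w * cpu_ops

# Comparing a nested loop join against a merge scan on common footing:
outer_pages, inner_pages = 100, 500
outer_rows, inner_rows = 10_000, 50_000

# Nested loop: rescan the inner table once per outer row (no index assumed).
nl = cost(outer_pages + outer_rows * inner_pages, outer_rows * inner_rows)
# Merge scan: one pass over each input, assuming both are already sorted.
ms = cost(outer_pages + inner_pages, outer_rows + inner_rows)
print(f"nested loop: {nl:.0f}, merge scan: {ms:.0f}")
```

Because both estimates land in the same unit, the optimizer can rank structurally different strategies with a single comparison.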

The search strategy employed a bottom-up approach. The optimizer first determined the best way to access each individual table. Then, it looked at all possible pairs of tables, then sets of three, and so on. At each stage, the optimizer would keep only the cheapest path for a specific set of tables, unless a more expensive path produced a result in a specific sorted order that might benefit a later operation. This concept, known as "interesting orders," ensured that the optimizer did not discard a plan that might eventually lead to a cheaper overall cost due to the avoidance of a subsequent sort operation.
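The bottom-up search described above can be sketched in a few lines of dynamic programming. This toy version enumerates left-deep plans only and omits interesting orders; the tables, cardinalities, and nested-loop cost model are all assumptions:

```python
# Selinger-style bottom-up join enumeration (left-deep plans, no
# "interesting orders"). Table sizes and cost model are illustrative.
from itertools import combinations

table_rows = {"A": 1000, "B": 100, "C": 10}

def join_cost(left_rows: int, right_rows: int) -> int:
    return left_rows * right_rows  # toy nested-loop cost

# best[subset of tables] = (cheapest cost, result rows, plan string)
best = {frozenset([t]): (0, r, t) for t, r in table_rows.items()}

tables = list(table_rows)
for size in range(2, len(tables) + 1):
    for subset in combinations(tables, size):
        s = frozenset(subset)
        for t in subset:  # extend the best plan for s - {t} by joining t
            c, rows, plan = best[s - {t}]
            cand = (c + join_cost(rows, table_rows[t]),
                    rows * table_rows[t],  # toy cardinality estimate
                    f"({plan} JOIN {t})")
            if s not in best or cand[0] < best[s][0]:
                best[s] = cand  # prune: keep one cheapest plan per subset

print(best[frozenset(tables)])
```

Keeping only the cheapest plan per subset is the pruning step that tames the factorial search space; here the winner joins the two small tables first, deferring the large table to the end.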

Relational Query Optimization Mechanics

Modern optimization mechanics involve a multi-layered analysis of a query. When a SQL statement is submitted, it is first parsed and transformed into a query graph. This graph represents the tables as nodes and the join predicates as edges. The optimizer then performs a series of algebraic transformations. These include predicate pushdown, where filter conditions are applied as early as possible to reduce the size of intermediate data sets, and view merging, where complex subqueries are flattened into the main query to allow for a broader search of join permutations.
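Predicate pushdown can be demonstrated on in-memory relations. The tables, columns, and filter below are invented for illustration; the point is that the transformed plan produces identical results while the join processes fewer rows:

```python
# Predicate pushdown sketch over lists of dicts standing in for relations.
orders = [{"id": 1, "cust": 1, "amt": 50}, {"id": 2, "cust": 2, "amt": 900}]
customers = [{"cust": 1, "region": "EU"}, {"cust": 2, "region": "US"}]

def join(left, right, key):
    """Naive nested-loop equi-join on `key`."""
    return [{**a, **b} for a in left for b in right if a[key] == b[key]]

# Naive plan: join first, then filter the larger intermediate result.
naive = [row for row in join(orders, customers, "cust") if row["amt"] > 100]

# Pushed-down plan: filter `orders` before the join, so the join sees
# fewer rows. The output is identical.
pushed = join([o for o in orders if o["amt"] > 100], customers, "cust")

assert naive == pushed
print(pushed)
```

Real optimizers perform this rewrite on the query graph before any cost estimation, since a smaller intermediate result makes every downstream plan cheaper.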

Join ordering remains one of the most critical aspects of this discipline. The optimizer must identify dependencies and evaluate the efficacy of various join algorithms. Nested loop joins are often preferred for small inner tables or when a highly selective index is available. Merge joins are utilized when data is already sorted or can be sorted efficiently. Hash joins, which became more prevalent after the original System R work, are used for large, unsorted datasets by creating an in-memory hash table of the smaller relation. The selection of these algorithms is driven by cardinality estimations: the predicted number of rows that will satisfy a given predicate.
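The build-and-probe structure of a hash join is compact enough to sketch directly. The data below is invented; the essential shape is hashing the smaller relation once, then making a single pass over the larger one:

```python
# Minimal hash join: build an in-memory hash table on the smaller
# relation, then probe it with the larger one. Data is illustrative.
from collections import defaultdict

small = [("EU", "Europe"), ("US", "United States")]   # (code, name)
large = [(1, "EU"), (2, "US"), (3, "EU"), (4, "JP")]  # (order_id, code)

# Build phase: hash the smaller relation on the join key.
table = defaultdict(list)
for code, name in small:
    table[code].append(name)

# Probe phase: one pass over the larger relation; unmatched probe rows
# (order 4, "JP") simply produce no output.
result = [(oid, code, name) for oid, code in large for name in table[code]]
print(result)
```

Unlike nested loops, total work here is roughly linear in the sum of the two input sizes, which is why hash joins dominate for large, unsorted inputs.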

The Role of Indexing and Statistics

The efficacy of an execution plan is heavily dependent on the available indexing structures and the accuracy of the statistical estimator. Practitioners evaluate the utility of B-trees for range queries and hash indexes for point lookups. In modern environments, bitmap indexes may also be considered for columns with low cardinality. The statistical estimator relies on histograms that store information about data distribution. If statistics are stale or inaccurate, the optimizer may choose a plan that is orders of magnitude slower than the optimal one, a phenomenon known as "plan regression."
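A histogram-based selectivity estimate can be sketched as follows. The equi-width bucket layout and counts are assumptions for illustration; production systems often use equi-depth histograms instead, but the interpolation idea is the same:

```python
# Equi-width histogram sketch for estimating the selectivity of a
# predicate `col < v`. Bucket boundaries and counts are illustrative.
import bisect

bounds = [0, 25, 50, 75, 100]   # bucket boundaries over the value range
counts = [400, 300, 200, 100]   # rows per bucket: 1000 total, skewed low
total = sum(counts)

def est_selectivity(v: float) -> float:
    """Estimate the fraction of rows with col < v, interpolating
    linearly inside the bucket containing v."""
    i = bisect.bisect_right(bounds, v) - 1
    i = max(0, min(i, len(counts) - 1))
    full = sum(counts[:i])                              # buckets fully below v
    frac = (v - bounds[i]) / (bounds[i + 1] - bounds[i])
    return (full + frac * counts[i]) / total

print(est_selectivity(25))  # 0.4: the first bucket lies entirely below 25
print(est_selectivity(60))  # 0.78: two full buckets plus 40% of the third
```

When the real distribution drifts away from these stored counts, every downstream cost estimate inherits the error, which is exactly how stale statistics produce plan regressions.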

Evolution into Modern Commercial Engines

The principles established by Selinger were integrated into IBM DB2 and later influenced the development of the Oracle database optimizer. In the 1980s and 90s, these engines expanded upon the System R foundations by introducing support for parallel processing and distributed database environments. As queries became more complex—driven by business intelligence and data warehousing—optimizers had to account for star schema designs, where a large fact table is joined to multiple smaller dimension tables.

Oracle, for instance, introduced features such as the Cost-Based Optimizer as the default in its later versions, eventually phasing out the older Rule-Based Optimizer. It also introduced Adaptive Query Optimization, which allows the engine to change the execution plan mid-stream if it discovers that the actual cardinality of a result set differs significantly from the initial estimate. Similarly, IBM DB2 developed the Starburst optimizer, which utilized a Query Graph Model (QGM) to provide a more modular and extensible approach to query transformation and optimization rules.

Contemporary Advancements

While the core dynamic programming approach remains the standard, the scale of modern data has required further advancements. Modern optimizers now use cascading rules and transformation engines that can handle hundreds of tables in a single query. Machine learning is also being explored as a method to improve cardinality estimations where traditional histograms fail, such as when columns have complex correlations. Despite these advancements, the fundamental objective remains unchanged: minimizing I/O and CPU cycles by intelligently reducing the size of intermediate results through the application of cost-based mathematical models. The legacy of Selinger’s epochal work continues to be the bedrock upon which the entire relational database industry is built.

#Relational query optimization #System R #Pat Selinger #cost-based optimization #SQL execution plans #join algorithms #cardinality estimation #database indexing
Siobhán O'Malley

A Senior Writer who dissects the latent logic of predicate pushdown and the complexities of view merging. She is passionate about helping readers visualize the cascading application of rules within execution plans to optimize intermediate result sets.

