In May 1979, a research team at the IBM San Jose Research Laboratory published a seminal paper titled "Access Path Selection in a Relational Database Management System." Led by Patricia Selinger, the authors described the internal mechanics of the optimizer developed for System R, one of the earliest implementations of the relational model. This paper introduced the foundational framework for cost-based optimization (CBO), a methodology that uses statistical estimations to select the most efficient execution plan for a given SQL query.
Relational Query Optimization Mechanics, as established by the Selinger model, transformed database interaction from a manual, procedural exercise into a declarative process. Instead of programmers explicitly defining the physical path to data, the database engine assumed the responsibility of analyzing algebraic transformations and hardware costs. Today, the core principles outlined in the 1979 paper remain central to the operation of modern relational database management systems (RDBMS) such as PostgreSQL, Oracle, and Microsoft SQL Server.
At a glance
- Publication Date: May 1979 (ACM SIGMOD Conference).
- Principal Author: Patricia Selinger.
- Associated Project: IBM System R.
- Core Innovations: Dynamic programming for join ordering, "interesting orders," and cost-based selection of access paths.
- Hardware Focus: Optimization of I/O operations and CPU cycles relative to available system resources.
- Legacy: Direct influence on the PostgreSQL query planner and the design of commercial SQL optimizers.
Background
Before the widespread adoption of the relational model, database systems primarily followed the CODASYL (Conference on Data Systems Languages) or hierarchical models. These systems were "navigational," meaning that application developers had to write code that specifically navigated pointers and physical storage paths to retrieve data. If the underlying data structure changed, the application code frequently required extensive rewrites to maintain performance or basic functionality.
Edgar F. Codd proposed the relational model in 1970, advocating for a declarative approach where users specified *what* data they wanted without specifying *how* to retrieve it. However, early critics argued that a high-level language like SQL could never be as efficient as hand-tuned navigational code. The burden fell upon the database engine to prove that it could automatically generate execution plans that rivaled human-written logic. The System R project was IBM's attempt to build a production-grade relational system, and Patricia Selinger's optimizer was the engine designed to solve the problem of automated path selection.
The Selinger Optimizer Framework
The 1979 paper established three primary components of query optimization that define Relational Query Optimization Mechanics to this day: cost estimation, access path selection, and join ordering.
Cost Estimation Formulas
The original Selinger model proposed that the "cost" of a query should be a weighted combination of I/O and CPU usage. The 1979 formula was relatively simple, focusing on the number of disk pages fetched and the number of calls to the Relational Storage Interface (RSI). The cost was expressed as:
Cost = PAGE_FETCHES + W * (RSI_CALLS)
In this context, W represented a weighting factor between I/O and CPU. By estimating the number of pages to be read from disk and the number of tuples to be processed in memory, the optimizer could assign a numeric value to different potential execution plans. The plan with the lowest total cost was selected for execution.
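As an illustration, the weighted formula can be sketched in Python. The value of W and the per-plan page and RSI-call estimates below are invented for the example, not figures from the paper.

```python
# Sketch of the 1979 System R cost formula: cost = page_fetches + W * rsi_calls.
# W and the per-plan estimates are illustrative numbers, not values from the paper.

W = 0.05  # hypothetical weighting of CPU work (RSI calls) against I/O (page fetches)

def selinger_cost(page_fetches: int, rsi_calls: int, w: float = W) -> float:
    """Weighted combination of I/O and CPU work for one candidate plan."""
    return page_fetches + w * rsi_calls

# Two hypothetical plans for the same query:
seq_scan = selinger_cost(page_fetches=1000, rsi_calls=50000)  # read every page
index_scan = selinger_cost(page_fetches=120, rsi_calls=4000)  # follow index pointers

# The optimizer keeps the plan with the lowest estimated cost.
best = min(("seq scan", seq_scan), ("index scan", index_scan), key=lambda p: p[1])
print(best)  # -> ('index scan', 320.0)
```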
Access Path Selection
The optimizer evaluated two primary ways to access data in a single table: the sequential scan and the index scan. Selinger identified that while an index scan is often faster for retrieving a small subset of rows, a sequential scan is more efficient for large data sets due to the reduction in disk head movement. The decision was based on selectivity—the estimated fraction of rows that satisfy a given predicate. If the selectivity was low (meaning few rows returned), the optimizer preferred an index; if selectivity was high, a sequential scan was chosen.
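The logic can be sketched as follows. The 1/distinct-values estimate for an equality predicate follows the System R style of statistics, but the fixed crossover threshold is a hypothetical simplification; real optimizers derive the break-even point from the cost formulas rather than a constant.

```python
# Sketch of selectivity-driven access path choice. The crossover value is a
# hypothetical constant for illustration only.

def estimate_selectivity(distinct_values: int) -> float:
    """Classic estimate for an equality predicate: 1 / number of distinct values."""
    return 1.0 / distinct_values

def choose_access_path(selectivity: float, crossover: float = 0.10) -> str:
    # Few rows expected -> index scan; a large fraction of the table -> sequential scan.
    return "index scan" if selectivity < crossover else "sequential scan"

print(choose_access_path(estimate_selectivity(distinct_values=5000)))  # -> index scan
print(choose_access_path(estimate_selectivity(distinct_values=2)))     # -> sequential scan
```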
The Innovation of "Interesting Orders"
One of the paper's most durable contributions is the concept of "interesting orders." An execution plan might produce results that are sorted by a specific column. Even if a particular step in the plan (like an index scan) has a higher raw cost than a sequential scan, the fact that it produces sorted output might be highly beneficial for subsequent operations, such as a merge join or a GROUP BY clause. The Selinger optimizer was designed to retain these "sub-optimal" paths if they provided a sorted order that could reduce the total cost of the overall query.
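The pruning rule amounts to keeping the cheapest plan per output ordering rather than a single overall winner. A minimal sketch, with invented plan names and costs:

```python
# Sketch of pruning with "interesting orders": for each relation (or subset of
# relations), keep the cheapest plan for each distinct output ordering.

from typing import NamedTuple, Optional

class Plan(NamedTuple):
    name: str
    cost: float
    sort_order: Optional[str]  # column the output is sorted on, if any

candidates = [
    Plan("seq scan", cost=100.0, sort_order=None),
    Plan("index scan on a", cost=140.0, sort_order="a"),
    Plan("seq scan + explicit sort on a", cost=160.0, sort_order="a"),
]

best_per_order: dict = {}
for plan in candidates:
    kept = best_per_order.get(plan.sort_order)
    if kept is None or plan.cost < kept.cost:
        best_per_order[plan.sort_order] = plan

# Both the unsorted winner and the cheapest sorted plan survive pruning:
# the sorted one may win later if a merge join or GROUP BY needs order "a".
for plan in best_per_order.values():
    print(plan)
```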
Join Ordering and Dynamic Programming
Determining the order in which multiple tables are joined is the most computationally expensive part of query optimization. For a query involving N tables, there are N! (N-factorial) possible join permutations. A query with 10 tables would result in over 3.6 million possible orders, making an exhaustive search impractical for complex queries.
The Dynamic Programming Approach
To manage this complexity, Selinger applied a dynamic programming algorithm. The optimizer breaks the problem down into smaller sub-problems. It first calculates the best way to access each individual table. Then, it determines the best way to join pairs of tables, then sets of three, and so on. By discarding expensive paths at each step and only keeping the cheapest path for each subset of tables (while accounting for interesting orders), the optimizer reduces the search space from factorial to exponential complexity.
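The subset-by-subset build-up can be sketched as follows. The cardinality model (a fixed join selectivity) and all cost numbers are invented for illustration; a real optimizer costs each join method separately and tracks interesting orders per subset.

```python
# Sketch of Selinger-style dynamic programming over subsets of relations.
# Each subset keeps only its cheapest plan; larger subsets are built by
# joining a remembered sub-plan with one more base table (left-deep style).

from itertools import combinations

relations = {"A": 1000, "B": 200, "C": 50}       # relation -> estimated rows
access_cost = {"A": 100.0, "B": 20.0, "C": 5.0}  # cheapest single-table access path

SELECTIVITY = 0.01  # toy model: every join keeps 1% of the cross product

best = {}  # frozenset of relations -> (cost, output rows, plan description)
for r, rows in relations.items():
    best[frozenset([r])] = (access_cost[r], rows, r)

names = list(relations)
for size in range(2, len(names) + 1):
    for subset in combinations(names, size):
        key = frozenset(subset)
        for r in subset:  # try each relation as the last one joined in
            rest = key - {r}
            rest_cost, rest_rows, rest_plan = best[rest]
            out_rows = rest_rows * relations[r] * SELECTIVITY
            # Toy join cost: producing the output rows dominates.
            cost = rest_cost + access_cost[r] + out_rows
            if key not in best or cost < best[key][0]:
                best[key] = (cost, out_rows, f"({rest_plan} JOIN {r})")

print(best[frozenset(names)])  # cheapest plan for joining all three tables
```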
Left-Deep Trees
The System R optimizer primarily explored left-deep trees. In a left-deep tree, the result of a join is always used as the left input for the next join, while the right input is always a base table. This restriction further reduced the search space and was particularly effective for nested loop joins, which were common in early relational systems. While modern optimizers sometimes explore "bushy trees" (where the result of a join is joined with the result of another join), the left-deep approach remains the default for many engines due to its balance of plan quality and optimization effort.
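The size of the restriction can be quantified: a left-deep plan is fully determined by a permutation of the base tables, while a bushy plan also chooses one of the Catalan-number binary tree shapes. A short sketch of the two search-space sizes:

```python
# Count left-deep join orders versus all (bushy) binary join trees for n tables.
# Left-deep: one plan per permutation of the tables -> n!
# Bushy: permutations times tree shapes -> n! * Catalan(n - 1)

from math import comb, factorial

def catalan(n: int) -> int:
    """Number of binary tree shapes with n + 1 leaves."""
    return comb(2 * n, n) // (n + 1)

n = 6  # tables in the query
left_deep = factorial(n)
bushy = factorial(n) * catalan(n - 1)
print(left_deep, bushy)  # -> 720 30240
```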
Comparison with Modern PostgreSQL Costs
While the Selinger paper laid the groundwork, the specific values used in cost formulas have evolved to reflect the divergence in hardware capabilities. Modern systems like PostgreSQL use more granular cost constants to account for the massive gap between CPU speeds and disk latency.
| Cost Variable | Description | Modern Default Value (PostgreSQL) |
|---|---|---|
| seq_page_cost | Cost of a sequential disk page fetch. | 1.0 |
| random_page_cost | Cost of a non-sequential disk page fetch. | 4.0 |
| cpu_tuple_cost | Cost to process a single row. | 0.01 |
| cpu_index_tuple_cost | Cost to process an index entry. | 0.005 |
| cpu_operator_cost | Cost to execute an operator or function. | 0.0025 |
In the original 1979 model, the weighting between I/O and CPU was a single variable (W). In contrast, PostgreSQL allows administrators to tune these constants individually. For example, on systems using Solid State Drives (SSDs), random_page_cost is often lowered to 1.1 because the latency penalty for seek operations is significantly lower than on the traditional spinning platters available in 1979.
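How the constants combine can be sketched with the defaults from the table above. This is a deliberate simplification of PostgreSQL's actual costing (which also models caching, correlation, and startup cost); the page and row counts are invented inputs.

```python
# Sketch of per-operation costing with PostgreSQL-style constants (defaults
# from the table above). A simplification, not the real planner formulas.

seq_page_cost = 1.0
random_page_cost = 4.0
cpu_tuple_cost = 0.01
cpu_index_tuple_cost = 0.005

def seq_scan_cost(pages: int, rows: int) -> float:
    """Read every page sequentially, then process every row."""
    return pages * seq_page_cost + rows * cpu_tuple_cost

def index_scan_cost(index_pages: int, heap_pages: int, rows: int) -> float:
    """Pages fetched out of order pay the random-access penalty per page."""
    return ((index_pages + heap_pages) * random_page_cost
            + rows * (cpu_tuple_cost + cpu_index_tuple_cost))

# Retrieving 1% of a 10,000-page, 1,000,000-row table:
print(seq_scan_cost(pages=10_000, rows=1_000_000))                    # -> 20000.0
print(index_scan_cost(index_pages=30, heap_pages=100, rows=10_000))   # -> 670.0
```

Lowering random_page_cost toward seq_page_cost (as on SSDs) shrinks the index scan's estimate further, shifting the break-even point toward index use.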
Evolution of Join Algorithms
Selinger’s paper focused primarily on nested loop joins and sort-merge joins. In a nested loop join, the engine iterates through every row of the outer table and searches for matches in the inner table. In a sort-merge join, both tables are sorted by the join key and then scanned in parallel.
As memory capacity increased, hash joins became a third pillar of optimization. A hash join builds a hash table of the smaller join relation in memory, allowing for O(1) lookups for rows from the larger relation. While not detailed in the 1979 paper, hash joins use the same cost-based selection logic pioneered by Selinger, where the optimizer estimates the cardinality of the input sets to decide if the hash table will fit in the allocated work_mem.
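The build/probe structure of a hash join can be shown in a few lines. The table contents are illustrative:

```python
# Sketch of an in-memory hash join: build a hash table on the smaller
# relation, then probe it with each row of the larger one.

from collections import defaultdict

small = [(1, "widget"), (2, "gadget"), (3, "sprocket")]   # (id, name)
large = [(2, 9.99), (2, 4.50), (3, 7.25), (4, 1.00)]      # (id, price)

# Build phase: hash the smaller input on the join key.
hash_table = defaultdict(list)
for key, name in small:
    hash_table[key].append(name)

# Probe phase: each large-side row looks up its matches in O(1) on average.
result = [(key, name, price)
          for key, price in large
          for name in hash_table.get(key, [])]

print(result)  # -> [(2, 'gadget', 9.99), (2, 'gadget', 4.5), (3, 'sprocket', 7.25)]
```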
Challenges in Statistical Estimator Accuracy
The effectiveness of the Selinger model is entirely dependent on the accuracy of the statistics available to the database engine. If the optimizer assumes a uniform distribution of data but the data is heavily skewed, it may choose a disastrously inefficient plan.
To mitigate this, modern systems have expanded upon the basic row-count statistics used in System R. Contemporary Relational Query Optimization Mechanics involve:
- Histograms: Dividing data ranges into buckets to track frequency and distribution.
- Most Common Values (MCV): Explicitly tracking the frequency of the most repeated entries in a column.
- Extended Statistics: Tracking correlations between different columns (e.g., city and zip code) to avoid underestimating the selectivity of multi-column filters.
Without accurate cardinality estimations, the dynamic programming algorithm may prune the truly optimal plan early in the process, leading to what is known as "optimizer regression."
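A sketch of how MCV lists and histograms feed selectivity estimates, loosely modeled on the statistics PostgreSQL exposes in pg_stats. All frequencies, bucket bounds, and the distinct-value count are invented:

```python
# Sketch of selectivity estimation from column statistics. Numbers are invented.

mcv = {"CA": 0.30, "TX": 0.20, "NY": 0.15}  # most common values -> observed frequency
mcv_total = sum(mcv.values())
n_distinct = 50                              # estimated distinct values in the column
histogram_bounds = [0, 10, 25, 60, 200, 1000]  # equi-depth buckets for a numeric column

def eq_selectivity(value: str) -> float:
    """Equality predicate: use the MCV list, else spread the remainder uniformly."""
    if value in mcv:
        return mcv[value]
    return (1.0 - mcv_total) / (n_distinct - len(mcv))

def lt_selectivity(value: float) -> float:
    """Range predicate col < value: count full buckets, interpolate the partial one."""
    buckets = len(histogram_bounds) - 1
    for i in range(buckets):
        lo, hi = histogram_bounds[i], histogram_bounds[i + 1]
        if value < hi:
            return (i + (value - lo) / (hi - lo)) / buckets
    return 1.0

print(eq_selectivity("CA"))  # skew captured explicitly by the MCV list
print(eq_selectivity("WY"))  # uniform share of the remaining frequency mass
print(lt_selectivity(42.5))  # -> 0.5 (halfway through the third of five buckets)
```

Without the MCV list, a uniform assumption would assign "CA" the same 1/50 selectivity as every other value and badly underestimate its row count.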
The lasting effect of 1979
The principles of predicate pushdown—moving filters as close to the data source as possible—and view merging—flattening subqueries into the main query—were all either present or anticipated in the Selinger model. Even as the industry moves toward distributed databases and cloud-native architectures, the fundamental challenge of minimizing I/O and CPU cycles through algebraic analysis remains unchanged.
The Selinger paper did more than provide a set of algorithms; it established the legitimacy of the relational model by proving that an automated system could reliably find efficient access paths. This allowed developers to focus on data logic rather than hardware constraints, ushering in a new era of data management. The cascading application of rules derived from this epochal work continues to govern how data is retrieved across nearly every major SQL engine in use today.