Subquery Unnesting: Kim’s Algorithms and Modern Optimization

Won Kim’s 1982 paper, "On Optimizing an SQL-like Query," established the foundational framework for transforming nested SQL subqueries into equivalent join operations. This process, known as subquery unnesting or decorrelation, is a critical component of relational query optimization mechanics. It allows database engines to apply set-oriented join algorithms, such as hash joins or merge joins, instead of relying on the computationally expensive tuple-by-tuple evaluation associated with nested-loop execution.

In the decades since its publication, Kim's work has been both celebrated for its pioneering logic and scrutinized for specific algorithmic flaws, most notably the "Count Bug." Modern relational database management systems (RDBMS) like PostgreSQL, Microsoft SQL Server, and Oracle continue to build upon these principles, utilizing more sophisticated algebraic transformations and cost-based optimization models to ensure the efficient execution of complex, multi-level queries.

What changed

Shift from Nested Evaluation to Joins:Before 1982, many relational systems processed subqueries by executing the inner query repeatedly for every row in the outer query, a method that scaled poorly with large datasets.
Formal Classification of Subqueries:Kim introduced four distinct categories (Type A, N, J, and JA) to define how subqueries interact with their parent queries based on correlation and aggregation.
Discovery of the Count Bug:Subsequent researchers identified that Kim’s original transformations for correlated aggregate subqueries could produce incorrect results when the subquery returned an empty set, leading to the refinement of unnesting logic.
Introduction of Semi-joins and Anti-joins:Modern optimizers now frequently transformEXISTSAndNOT EXISTSClauses into semi-joins and anti-joins, specialized relational operators that Kim’s original work helped conceptualize.
Cost-Based Decorrelation:Contemporary engines do not always unnest subqueries; they use statistical estimators to decide if unnesting is cheaper than maintaining a correlated execution plan.

Background

The early development of the relational model, led by E.F. Codd, emphasized a declarative approach to data retrieval. Users specified what data they wanted without defining the physical path to retrieve it. However, early implementations of SQL, such as IBM’s System R, faced significant performance bottlenecks when queries contained nested blocks. A subquery nested within aWHEREOrHAVINGClause often forced the database engine into a procedural execution style, where the inner query acted as a subroutine called by the outer loop.

Relational query optimization mechanics emerged to bridge the gap between declarative SQL and efficient physical execution. The objective was to find the most cost-effective retrieval strategy by analyzing the query graph. One of the most effective ways to reduce I/O operations and CPU cycles is to minimize the number of times a table is scanned. Won Kim’s research was the first major attempt to systematically convert these procedural-style subqueries into declarative-style joins, which could then be manipulated by the optimizer’s existing join-ordering algorithms.

The Kim Taxonomy of Subqueries

Kim’s 1982 paper categorized subqueries into five types based on their structural relationship to the outer query. This taxonomy remains a standard reference for understanding how database engines approach decorrelation.Type ASubqueries involve no correlation and return a single value via an aggregate function. Because the result is a constant relative to the outer query, it can be calculated once and replaced in the outer predicate.Type NSubqueries are also non-correlated but return a set of values, typically used with theINOrNOT INOperators. These are the simplest to transform into joins.

The complexity increases withType JSubqueries, which are correlated. In these cases, the inner query references a column from the outer query, creating a dependency. Kim’s algorithm for Type J involved creating a temporary relation from the inner query and joining it with the outer table.Type JASubqueries are the most complex, involving both correlation and aggregation. These require the inner query to be grouped by the join key before being linked back to the outer table. Finally,Type D(Divisional) subqueries involve theALLOrNOT EXISTSOperators, requiring transformations that simulate relational division.

The Count Bug and Correctness Challenges

In 1984 and 1987, researchers including Werner Kiessling and the team of Richard Ganski and Harry Wong identified a significant flaw in Kim’s JA transformation, which became known in the industry as the "Count Bug." The issue arises when a subquery contains aCOUNTAggregate. In standard SQL, aCOUNTOn an empty set must return zero. However, Kim’s original transformation into a join would cause the entire row to be filtered out of the result set if no matching rows existed in the subquery. This resulted in a NULL result where a 0 was expected, leading to incorrect query outputs.

To resolve this, Ganski and Wong proposed the use of outer joins. By using a Left Outer Join between the outer query and the result of the aggregated subquery, the optimizer ensures that rows from the outer table are preserved even if the subquery returns no matches. In such cases, the engine can then use aCOALESCE-like function to turn the resulting NULL into a 0. This advancement was critical for the reliability of automated query rewrite systems.

Modern Iterations in Database Engines

Modern database engines have moved beyond simple heuristic-based rewrites to sophisticated cost-based transformations. These engines evaluate multiple potential execution plans and choose the one with the lowest estimated cost, factoring in data distribution statistics and indexing structures like B-trees or hash indexes.

PostgreSQL: SubLinks and SubPlans

PostgreSQL treats subqueries as "SubLinks" during the initial parsing phase. During the planning stage, the optimizer attempts to "flatten" these links. If a subquery is non-correlated or can be safely converted, PostgreSQL will pull it up into the join tree, effectively turning it into a sub-select within theFROMClause or a direct join. However, if the optimizer determines that decorrelation would result in a massive intermediate result set (a "Cartesian product" effect), it may choose to keep the subquery as a "SubPlan," which is executed using a nested loop with parameters.

Microsoft SQL Server: The Apply Operator

SQL Server uses a unique internal relational operator calledAPPLY(specificallyCROSS APPLYAndOUTER APPLY) to handle subqueries. TheAPPLYOperator acts as a bridge between a traditional join and a correlated subquery. It allows the engine to represent the dependency of the inner query on the outer row explicitly. The SQL Server optimizer then uses a set of rules to transform theseAPPLYOperations into standard joins where possible. This approach provides a flexible framework for handling complex decorrelation, including cases involving lateral references and table-valued functions.

Advanced Optimization Mechanics

Today’s optimizers also use predicate pushdown and view merging to further refine unnesting.Predicate pushdownInvolves moving filters as close to the data source as possible, often inside the unnested subquery, to reduce the size of intermediate result sets.View mergingAllows the optimizer to dissolve the boundaries between the outer query and the subquery, treating them as a single flattened query graph. This enables the engine to reorder joins more freely, potentially finding a join order that was not possible in the original nested structure.

The Role of Statistical Estimators

The success of subquery unnesting in modern systems depends heavily on the accuracy of statistical estimators. When an engine converts a subquery into a join, it must estimate the cardinality of the resulting join. If the estimation is wrong, the engine might choose a hash join when a nested loop would have been faster, or vice-versa. These estimations are based on histograms and frequent-value lists stored in the database metadata. As subqueries become more complex—involving multiple levels of nesting and diverse predicates—the margin for error in these statistics increases, which remains a primary challenge in relational query optimization mechanics.

What sources disagree on

While the technical mechanics of unnesting are well-documented, there is ongoing debate among database theorists regarding the universal benefit of decorrelation. Some argue that with the advent of high-speed NVMe storage and massive memory caches, the overhead of correlated subqueries is often negligible for small to medium datasets. These specialists suggest that the complexity added to the optimizer by aggressive unnesting rules can sometimes lead to "optimization regressions," where the optimizer spends more time analyzing the query than it saves during execution.

Furthermore, there is disagreement on the best way to handle "Always False" or "Always True" subqueries. Some modern research suggests that many unnesting algorithms are still too rigid, failing to recognize when a subquery will return a static result regardless of the outer row. This has led to the development of "Magic Sets" and other advanced decorrelation techniques that attempt to blend the benefits of both nested execution and join-based execution by passing filters between the two layers dynamically.