During the 1980s and 1990s, the development of IBM DB2 marked a significant era for relational database management systems (RDBMS). At the heart of this evolution was the Starburst project at IBM Almaden Research Center, which introduced sophisticated query rewriting mechanisms that remain foundational to modern data processing. These advancements focused on transitioning from rigid, rule-based execution to a flexible, cost-based optimization framework capable of handling complex SQL structures.
The engineering efforts led by Hamid Pirahesh and his colleagues introduced the Query Graph Model (QGM), a high-level representation of SQL queries that allowed for extensive algebraic transformations. This architectural shift enabled the DB2 optimizer to perform subquery unrolling and view merging, techniques designed to eliminate the performance bottlenecks associated with deeply nested or modular SQL code. By converting hierarchical query structures into flattened relational operations, the optimizer could use efficient join algorithms and reduce the computational overhead of iterative row processing.
Timeline
- 1974–1979: IBM develops System R, introducing the first implementation of SQL and the foundational cost-based optimizer principles established by Patricia Selinger.
- 1983: IBM releases DB2 for MVS, establishing a commercial platform for relational data at scale.
- 1980s (mid-to-late): The Starburst research project commences at IBM Almaden, aiming to create an extensible database engine.
- 1991–1992: Hamid Pirahesh and the Almaden team publish seminal papers on the Query Graph Model (QGM) and the semantics of complex query rewrite rules.
- 1990s (early): DB2 integrates advanced subquery decorrelation and view merging capabilities, allowing for the optimization of TPC-D and other complex analytical workloads.
- 1995: The introduction of the 'Starjoin' optimization technique significantly improves performance for data warehousing schemas involving large fact tables and multiple dimension tables.
Background
In the early days of relational databases, the performance of SQL was often criticized in comparison to hierarchical or network database models. The primary challenge resided in the declarative nature of SQL; users specified what data they wanted, but the database engine had to determine how to retrieve it. Early optimizers were frequently limited by their inability to look beyond the literal structure of a query. If a developer wrote a nested subquery, the engine often executed it as a nested loop—a process sometimes referred to as 'Row-By-Agonizing-Row' (RBAR) processing.
This iterative approach was particularly inefficient for large datasets. Every row in the outer query would trigger a full or partial scan of the inner query's tables. To address this, IBM researchers sought to decouple the logical intent of the query from its physical execution. The goal was to transform complex, nested expressions into equivalent, flattened join operations that could be optimized using the sophisticated join-ordering and indexing strategies developed during the System R project.
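The contrast can be sketched with a toy example in SQLite (standing in for DB2; the schema and names are invented for illustration): the nested form conceptually re-evaluates the inner query once per outer row, while the flattened form expresses the same result as a single join that standard join-ordering machinery can optimize.

```python
import sqlite3

# Toy schema for illustration: customers and their orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edsger');
INSERT INTO orders VALUES (10, 1, 99.0), (11, 1, 15.0), (12, 2, 42.0);
""")

# Nested form: conceptually re-runs the inner query per outer row (RBAR).
nested = conn.execute("""
    SELECT name FROM customers c
    WHERE (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) > 0
    ORDER BY name
""").fetchall()

# Flattened form: one join plus duplicate elimination, open to join
# reordering and hash/merge strategies.
flattened = conn.execute("""
    SELECT DISTINCT c.name
    FROM customers c JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name
""").fetchall()

assert nested == flattened  # semantically equivalent rewrite
print(nested)               # [('Ada',), ('Grace',)]
```

Modern SQLite happens to flatten many such queries itself, which underscores how standard the technique has become.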
The Query Graph Model (QGM)
The Query Graph Model served as the essential intermediate representation for DB2's query rewrite engine. Unlike traditional parse trees, which closely mirror the syntax of the SQL statement, QGM provides a more abstract, graph-based view of the data flow. Each node in a QGM graph represents a semantic operation, such as a selection, a join, or a grouping. The edges represent the flow of records between these operations.
The power of QGM lay in its ability to support 'n-way' transformations. Because the graph captured the underlying relational algebra rather than just the syntax, the optimizer could move predicates across nodes, merge redundant nodes, and reorder operations with high precision. Hamid Pirahesh's work emphasized that a strong optimizer must understand the 'null-producing' properties of joins and the distinction between existential and universal quantification to ensure that query transformations remained semantically equivalent to the original SQL.
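As a rough illustration of the idea (a toy model, not DB2's actual data structures), a QGM-style graph can be thought of as operation boxes over which rewrite rules are applied; here a single rule collapses stacked selection boxes so their predicates can later be placed freely:

```python
from dataclasses import dataclass, field

# Toy stand-in for QGM "boxes": each node is a semantic operation over
# child nodes, not a syntactic parse-tree fragment. Real QGM boxes carry
# quantifiers, head expressions, and more.
@dataclass
class Box:
    op: str                                  # 'BASE', 'SELECT', 'JOIN', ...
    predicates: list = field(default_factory=list)
    children: list = field(default_factory=list)

def merge_selects(box: Box) -> Box:
    """Rewrite rule: collapse a SELECT box whose only child is another
    SELECT box, concatenating their predicates. Legal because selections
    commute and compose in relational algebra."""
    box.children = [merge_selects(c) for c in box.children]
    if box.op == 'SELECT' and len(box.children) == 1 \
            and box.children[0].op == 'SELECT':
        child = box.children[0]
        box.predicates += child.predicates
        box.children = child.children
    return box

# SELECT(price > 10) over SELECT(region = 'EU') over BASE(sales)
q = Box('SELECT', ["price > 10"],
        [Box('SELECT', ["region = 'EU'"], [Box('BASE')])])
q = merge_selects(q)
print(q.op, q.predicates)   # SELECT ['price > 10', "region = 'EU'"]
```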
Subquery Unrolling and Decorrelation
Subquery unrolling is the process of transforming a nested subquery into a join operation. Correlated subqueries, which reference columns from the outer query, are particularly difficult to optimize. Without unrolling, the database might execute the inner query once for every single row in the outer table. In the 1990s, DB2 introduced advanced decorrelation algorithms to convert these correlations into semi-joins or anti-joins.
An existential subquery (e.g., one using the EXISTS keyword) can often be rewritten as a semi-join. In a semi-join, the database identifies which rows in the left table have at least one match in the right table but does not actually return data from the right table. This allows the optimizer to use efficient hash joins or merge joins rather than nested loops. Conversely, an anti-join is used for NOT EXISTS or EXCEPT clauses, identifying rows that have no match in the target set. By 'unrolling' these structures, DB2 could apply standard join-ordering logic, potentially choosing a join order that drastically reduced the size of intermediate results.
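Both rewrites can be checked on a toy schema in SQLite (names invented; SQLite stands in for DB2): the EXISTS form behaves as a semi-join, the NOT EXISTS form as an anti-join, and each has an equivalent join-based formulation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp  (id INTEGER, name TEXT, dept INTEGER);
CREATE TABLE sale (emp_id INTEGER, amount REAL);
INSERT INTO emp  VALUES (1, 'Ann', 10), (2, 'Bob', 10), (3, 'Cal', 20);
INSERT INTO sale VALUES (1, 5.0), (1, 7.0), (3, 2.0);
""")

# Correlated EXISTS: logically a semi-join -- probe for at least one
# match, return nothing from the inner table.
semi = conn.execute("""
    SELECT name FROM emp e
    WHERE EXISTS (SELECT 1 FROM sale s WHERE s.emp_id = e.id)
    ORDER BY name
""").fetchall()

# Decorrelated form: an ordinary join with DISTINCT, eligible for hash
# or merge join strategies and for join reordering.
unrolled = conn.execute("""
    SELECT DISTINCT e.name FROM emp e JOIN sale s ON s.emp_id = e.id
    ORDER BY e.name
""").fetchall()

# NOT EXISTS: logically an anti-join; the outer-join rewrite keeps only
# rows with no match (s.emp_id IS NULL).
anti = conn.execute("""
    SELECT e.name FROM emp e LEFT JOIN sale s ON s.emp_id = e.id
    WHERE s.emp_id IS NULL
""").fetchall()

assert semi == unrolled   # [('Ann',), ('Cal',)]
assert anti == [('Bob',)]
```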
View Merging and Rule-Based Rewrites
View merging is another critical optimization mechanic where the definition of a virtual table (a view) is 'folded' into the main query that references it. In early systems, views were often treated as 'materialized' or 'black-box' entities; the database would resolve the view first and then run the main query against the result. This prevented the optimizer from pushing predicates into the view definition.
Through the cascading application of rewrite rules, the DB2 optimizer could perform predicate pushdown. If a user queried a view but included a WHERE clause in the outer query, the optimizer would attempt to move that WHERE clause into the internal view query. This allowed the engine to filter data at the earliest possible stage, minimizing I/O and CPU usage. Advanced view merging also allowed for 'complex view merging,' where views containing DISTINCT or GROUP BY clauses could be merged under specific conditions, further simplifying the execution plan.
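A minimal sketch of the folding step, with queries modeled as plain dicts rather than DB2's internal structures (names and shapes invented): merging replaces the view reference with its base table and pushes the outer filter down next to the view's own predicates.

```python
# A "view" is just a named stored query in this toy model.
views = {
    "eu_sales": {"from": "sales", "where": ["region = 'EU'"]},
}

def merge_view(query, views):
    """Fold a referenced view's definition into the outer query and push
    the outer WHERE clauses down alongside the view's own predicates."""
    inner = views.get(query["from"])
    if inner is None:                    # base table: nothing to merge
        return query
    return {
        "from": inner["from"],
        # Predicate pushdown: all filters now apply at the earliest stage.
        "where": inner["where"] + query["where"],
    }

outer = {"from": "eu_sales", "where": ["amount > 100"]}
merged = merge_view(outer, views)
print(merged)
# {'from': 'sales', 'where': ["region = 'EU'", 'amount > 100']}
```

The merged form is what frees the optimizer: once the view boundary is gone, all predicates and join candidates live in one query block.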
The Starjoin Optimization
As data warehousing became more prevalent in the mid-1990s, the 'Starjoin' became a focal point of IBM's optimization research. A star schema typically consists of a massive 'fact' table surrounded by several smaller 'dimension' tables. Joining the fact table to multiple dimensions simultaneously can be computationally expensive if the optimizer tries to join them one by one in a linear fashion.
The DB2 Starjoin technique utilized bitmap indexing and specialized join algorithms to evaluate multiple dimension constraints at once. Instead of performing sequential nested loops, the engine could use the dimensions to filter the fact table's indexes before ever touching the data pages. This 'multi-way' join approach relied heavily on accurate cardinality estimations—the optimizer's ability to predict how many rows would remain after each filtering step. The work on Starjoin optimization in DB2 showcased the transition toward 'set-oriented' processing, moving away from the row-at-a-time limitations of earlier relational engines.
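The bitmap idea can be sketched in a few lines (a toy in-memory stand-in for real bitmap indexes; the star schema is invented): each dimension constraint yields a bitmap over fact-row positions, the bitmaps are ANDed, and only surviving rows are fetched.

```python
# Toy star schema: one fact table keyed into two dimension tables.
fact = [  # (date_id, prod_id, amount)
    (1, 1, 50), (1, 2, 30), (2, 1, 70), (2, 3, 10), (3, 2, 25),
]
dim_date = {1: "2024-Q1", 2: "2024-Q2", 3: "2024-Q3"}
dim_prod = {1: "widget", 2: "gadget", 3: "gizmo"}

def bitmap(pred, key_idx):
    """Bitmap of fact-row positions whose dimension key satisfies pred.
    Bit i is set iff fact row i qualifies (a stand-in for a bitmap index)."""
    bits = 0
    for i, row in enumerate(fact):
        if pred(row[key_idx]):
            bits |= 1 << i
    return bits

# Evaluate both dimension constraints before touching the fact "data
# pages": AND the bitmaps, then fetch only the surviving rows.
b = bitmap(lambda d: dim_date[d] == "2024-Q1", 0) \
  & bitmap(lambda p: dim_prod[p] in ("widget", "gadget"), 1)
rows = [r for i, r in enumerate(fact) if b >> i & 1]
print(rows)  # [(1, 1, 50), (1, 2, 30)]
```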
Algebraic Transformations and Cost Estimation
The success of subquery unrolling and view merging depended on two factors: the legality of the transformation and the cost of the result. The DB2 optimizer utilized a 'search space' of possible plans, applying algebraic rules to generate candidates. Each candidate was then evaluated by a cost model that considered disk I/O, CPU cycles, and memory usage.
"The challenge is not just finding a way to execute the query, but finding the most efficient way among millions of possibilities."
This necessitated deep research into statistical estimators. DB2's optimizer used histograms and frequency statistics to determine the distribution of data within columns. If the optimizer estimated that a subquery would return only a single row, it might choose a nested loop. If it estimated the subquery would return millions of rows, it would focus on a hash join or a sort-merge join. This marriage of high-level algebraic rewriting (QGM) and low-level physical cost estimation (Selinger's model) became the standard blueprint for relational database engineering.
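A toy version of that decision (the statistics and threshold are invented, not DB2's cost model): frequency statistics give an estimated row count for the outer filter, and the estimate selects the join strategy.

```python
# Invented frequency statistics for a column, as a cost model might keep
# them; counts are illustrative only.
freq = {"US": 900_000, "EU": 90_000, "APAC": 9_000, "OTHER": 1_000}

def choose_join(region):
    """Pick a join strategy from the estimated cardinality: a nested loop
    suits a handful of index probes; a hash join amortizes its build cost
    over many probes. The 10,000-row threshold is an illustrative knob."""
    est_outer = freq.get(region, 1)          # estimated qualifying rows
    if est_outer < 10_000:
        return "nested-loop (index probe per row)", est_outer
    return "hash join (build + probe)", est_outer

print(choose_join("APAC"))   # ('nested-loop (index probe per row)', 9000)
print(choose_join("US"))     # ('hash join (build + probe)', 900000)
```

A real cost model weighs I/O, CPU, and memory for each candidate plan rather than applying a single cutoff, but the shape of the decision is the same.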
Impact on Modern Systems
The techniques pioneered in DB2 documentation and research papers from the 1980s and 90s are now ubiquitous. The Query Graph Model influenced the design of optimizers in various other commercial and open-source systems, including PostgreSQL and Microsoft SQL Server. The transition from RBAR processing to set-based algebraic optimization allowed relational databases to scale to the massive datasets seen in contemporary big data environments. By treating SQL not as a list of instructions but as a set of logical requirements, Hamid Pirahesh and the IBM team enabled the database engine to act as a sophisticated reasoning agent, constantly reorganizing internal logic to achieve peak efficiency.