The Starburst Project: Extensible Query Optimization in the 1990s

The Starburst project, initiated at IBM Research - Almaden in the mid-1980s and reaching its peak influence during the 1990s, represented a fundamental shift in the architecture of relational database management systems (RDBMS). It was designed to address the limitations of early experimental systems, such as System R, by introducing a modular and extensible framework for query processing. The primary objective was to create a database engine capable of supporting complex data types and advanced optimization techniques without necessitating a complete rewrite of the system kernel.

The most significant contribution of Starburst was the formalization of the Query Rewrite (QR) phase as a distinct, rule-based stage of query optimization. By separating the high-level semantic transformation of SQL statements from the low-level physical plan selection, Starburst enabled database engines to perform sophisticated algebraic manipulations. This architecture allowed for the simplification of complex queries and the merging of views before a cost-based optimizer evaluated specific access paths, such as index scans or join algorithms.

What changed

The Starburst project introduced several architectural shifts that turned query optimization from a monolithic process into a structured pipeline. The following list details the core technical changes introduced by the project:

Two-Phase Optimization Architecture: Starburst decoupled query processing into the Query Rewrite (QR) phase and the Plan Optimization (PO) phase. The QR phase focused on semantic transformations, while the PO phase handled physical resource costing.
Query Graph Model (QGM): A high-level intermediate representation was developed to replace traditional parse trees. QGM nodes represented relational operators (e.g., SELECT, GROUP BY, UNION), allowing the optimizer to manipulate a query as a graph of operators rather than as a tree that mirrored the syntax of the SQL text.
Rule-Based Rewrite Engine: The system utilized a production-style rule engine where specific "Query Rewrite Rules" (QRRs) were applied iteratively to the QGM. This allowed for extensible optimization strategies such as predicate pushdown and subquery unnesting.
Extensible Storage Management: Beyond the optimizer, Starburst introduced the concept of "attachments" and "core extensions," allowing for the integration of new index types and storage methods without altering the core database logic.
Formalization of View Merging: The project established rigorous methods for collapsing nested views into a single query block, significantly reducing the overhead of intermediate result sets.

Background

Prior to the Starburst project, the industry standard for query optimization was largely defined by the Selinger optimizer, developed for IBM's System R. The Selinger model introduced cost-based optimization using dynamic programming to determine the most efficient join order and access paths. While highly effective for simple queries, the Selinger model was often overwhelmed by the increasing complexity of SQL applications in the late 1980s, particularly those involving multiple nested views, subqueries, and large-scale data distributions.

The need for Starburst arose from the realization that the search space of a physical optimizer grows exponentially with the number of tables and query blocks it has to consider. By the early 1990s, relational databases were no longer used solely for simple transaction processing; they were increasingly tasked with decision support and analytical workloads. These workloads relied heavily on views and complex joins that required algebraic simplification before any physical execution plan could be intelligently drafted. Researchers at IBM Almaden sought a way to reduce this complexity by transforming the query into a simpler, semantically equivalent form prior to the cost-estimation stage.

The Query Graph Model (QGM)

At the heart of the Starburst engine was the Query Graph Model (QGM). Unlike earlier systems that optimized directly from parse trees, QGM provided a semi-procedural internal representation of the query. Each node in the graph represented a table-level operation, and the edges represented the flow of data between operations. This abstraction was critical because it exposed the relationships between different parts of a query to the rewrite engine, which could then, for example, identify where a filter applied to a view could be safely pushed down to the underlying base tables.

The QGM nodes contained information about columns, predicates, and ordering requirements. By manipulating this graph, the Starburst engine could perform "view merging," where the definition of a view is expanded directly into the main query, eliminating the need to materialize temporary tables. This was particularly vital for SQL applications that used layers of abstraction, as it allowed the optimizer to see through the views to the raw data structures.
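The view merging described above can be illustrated with a small sketch. This is not Starburst's actual data structure: the Box class, its fields, and the merge_views function are hypothetical names chosen for illustration, and the merge ignores everything that makes real view merging hard (column renaming, DISTINCT, aggregation, outer joins). It only shows the core idea of folding a view's select node into the node of the query that references it.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """One graph node: a table-level operation over its inputs (hypothetical model)."""
    name: str
    kind: str                                        # "BASE" (stored table) or "SELECT"
    inputs: list = field(default_factory=list)       # child nodes; data flows child -> parent
    predicates: list = field(default_factory=list)   # predicates as plain strings


def merge_views(box: Box) -> Box:
    """Fold every SELECT child of a SELECT node into that node (simplified)."""
    merged_inputs = []
    merged_predicates = list(box.predicates)
    for child in box.inputs:
        child = merge_views(child)                   # merge bottom-up
        if box.kind == "SELECT" and child.kind == "SELECT":
            # The view disappears: its inputs and predicates move into the parent.
            merged_inputs.extend(child.inputs)
            merged_predicates.extend(child.predicates)
        else:
            merged_inputs.append(child)
    return Box(box.name, box.kind, merged_inputs, merged_predicates)


# A query over a view over a base table collapses to one node over the base table.
orders = Box("orders", "BASE")
big_orders_view = Box("big_orders", "SELECT", [orders], ["amount > 100"])
query = Box("q", "SELECT", [big_orders_view], ["region = 'EU'"])
merged = merge_views(query)
```

After merging, the query node reads directly from orders and carries both predicates, so a later planning stage sees one block instead of two.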

Mechanics of the Query Rewrite Engine

The Query Rewrite phase in Starburst functioned as a rule-based system. Each rule consisted of a condition and an action. The condition checked the QGM for specific patterns—such as a subquery inside a WHERE clause—and the action transformed that pattern into a more efficient structure, such as a join. These transformations were algebraic in nature, meaning they were guaranteed to produce the same result set while potentially reducing the computational cost.
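The condition/action structure can be sketched as a small fixpoint loop. The code below is an assumption-laden illustration, not the Starburst rule engine: the Rule class, the rewrite function, and the dictionary-based query shape are all hypothetical, and the two rules are deliberately trivial. It shows rules being tried repeatedly until none of their conditions match.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    condition: Callable[[dict], bool]   # does the pattern occur in the query?
    action: Callable[[dict], dict]      # return an equivalent, rewritten query


def rewrite(query: dict, rules: list[Rule], max_passes: int = 100) -> tuple[dict, list[str]]:
    """Apply the first matching rule, then start over, until no rule matches."""
    fired = []
    for _ in range(max_passes):
        for rule in rules:
            if rule.condition(query):
                query = rule.action(query)
                fired.append(rule.name)
                break                    # restart from the first rule
        else:
            return query, fired          # fixpoint: no rule matched
    raise RuntimeError("rewrite did not reach a fixpoint")


def exists_to_semijoin(q: dict) -> dict:
    # Move one EXISTS subquery into the list of semi-joins.
    first, *rest = q["exists"]
    return {**q, "exists": rest, "semijoins": q["semijoins"] + [first]}


def drop_duplicate_predicates(q: dict) -> dict:
    # dict.fromkeys keeps the first occurrence of each predicate, in order.
    return {**q, "where": list(dict.fromkeys(q["where"]))}


rules = [
    Rule("exists_to_semijoin", lambda q: bool(q["exists"]), exists_to_semijoin),
    Rule("drop_duplicate_predicates",
         lambda q: len(set(q["where"])) < len(q["where"]),
         drop_duplicate_predicates),
]

query = {
    "from": ["customers"],
    "where": ["region = 'EU'", "region = 'EU'"],
    "exists": [("orders", "orders.cust_id = customers.id")],
    "semijoins": [],
}
rewritten, fired = rewrite(query, rules)
```

Restarting from the first rule after every change is one simple control strategy; the max_passes guard is there because a badly written pair of rules could otherwise undo each other forever.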

One of the most notable transformations pioneered by Starburst was "subquery unnesting" or "decorrelation." In many early SQL engines, a correlated subquery was re-executed for every row of the outer query, an approach often called nested iteration. The Starburst rewrite engine could often transform these subqueries into joins, allowing the subsequent plan optimizer to use more efficient algorithms like hash joins or merge joins. This significantly improved performance for complex analytical queries that were previously considered computationally prohibitive.
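To make the equivalence concrete, the following sketch runs a correlated EXISTS query and a hand-unnested join form against the same data using Python's built-in sqlite3 module. The schema and rows are invented for illustration, and the rewrite is written by hand; the example says nothing about how Starburst or SQLite rewrites queries internally. It only demonstrates that the two forms return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, cust_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bo'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 50), (11, 1, 200), (12, 3, 300), (13, 3, 400);
""")

# Correlated form: conceptually, the subquery is evaluated once per customer row.
correlated = """
    SELECT c.id, c.name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o
                  WHERE o.cust_id = c.id AND o.amount > 100)
    ORDER BY c.id
"""

# Unnested form: a join plus DISTINCT. DISTINCT is needed because a customer
# with several qualifying orders would otherwise appear once per order; it is
# safe here because c.id is a key, so no genuine duplicates are lost.
unnested = """
    SELECT DISTINCT c.id, c.name FROM customers c
    JOIN orders o ON o.cust_id = c.id
    WHERE o.amount > 100
    ORDER BY c.id
"""

rows_correlated = conn.execute(correlated).fetchall()
rows_unnested = conn.execute(unnested).fetchall()
conn.close()
```

The DISTINCT in the join form is the kind of side condition a rewrite rule has to check: dropping it would change the result for customer 3, who has two orders above the threshold.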

Plan Optimization and Costing

Once the Query Rewrite phase produced a simplified and optimized QGM, the Plan Optimization (PO) phase began. This stage was a direct descendant of the Selinger model but enhanced to handle the richer structures provided by Starburst. The PO phase explored various physical execution strategies, such as choosing between different types of indexes (B-trees or hash indexes) and selecting the join order based on cardinality estimations.

The PO phase utilized a "bottom-up" approach, building the most efficient plan for sub-sections of the query and then combining them. Because the QR phase had already simplified the query structure, the PO phase could focus on the physical realities of the hardware, such as I/O costs, CPU cycles, and memory availability for sort buffers. This separation of concerns ensured that the system did not waste time calculating costs for inefficiently structured queries.
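The bottom-up strategy can be sketched as a dynamic program over sets of tables, in the spirit of the Selinger approach. Everything specific here is an assumption made for illustration: the function name, the restriction to left-deep plans, the cost model (the sum of intermediate result sizes), and the cardinalities and selectivities. A real optimizer also costs access paths, join methods, and sort orders.

```python
from fractions import Fraction
from itertools import combinations


def best_left_deep_plan(cards, selectivity):
    """Return (cost, output_rows, join_order) for the cheapest left-deep plan.

    cards:       {table_name: row_count}
    selectivity: {frozenset({a, b}): fraction of row pairs that join};
                 a missing pair means no join predicate (cross product).
    Cost is the total number of rows produced by all joins (a toy model).
    """
    tables = sorted(cards)
    # Best plan per table subset, starting with single tables at zero cost.
    best = {frozenset([t]): (0, cards[t], (t,)) for t in tables}
    for size in range(2, len(tables) + 1):
        for combo in combinations(tables, size):
            subset = frozenset(combo)
            for last in combo:                      # table joined in last
                rest = subset - {last}
                cost, rows, order = best[rest]      # already-solved subproblem
                out_rows = rows * cards[last]
                for t in rest:
                    out_rows *= selectivity.get(frozenset({t, last}), 1)
                candidate = (cost + out_rows, out_rows, order + (last,))
                if subset not in best or candidate[0] < best[subset][0]:
                    best[subset] = candidate
    return best[frozenset(tables)]


cards = {"customers": 1_000, "orders": 100_000, "items": 500_000}
selectivity = {
    frozenset({"customers", "orders"}): Fraction(1, 1_000),   # each order has one customer
    frozenset({"orders", "items"}): Fraction(1, 100_000),     # each item has one order
}
cost, rows, order = best_left_deep_plan(cards, selectivity)
```

With these numbers the cheapest plan joins customers and orders first and brings in items last, because starting with the customers-items pair would require a cross product. Exact fractions are used instead of floats so that cost comparisons are not affected by rounding.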

Legacy and Integration in DB2

The innovations of the Starburst project did not remain confined to research papers; they formed the architectural foundation for the IBM DB2 family of products, specifically the DB2 Common Server (now known as DB2 for Linux, UNIX, and Windows). The Starburst optimizer's modularity allowed IBM to maintain a competitive edge as SQL complexity increased through the late 1990s and 2000s.

Contemporary DB2 optimization still utilizes the core concepts of the Starburst QR and PO phases. The Query Graph Model evolved into what is now referred to as the query internal representation in modern IBM database products. Furthermore, the extensible nature of the Starburst design allowed for the later integration of specialized features such as Materialized Query Tables (MQTs) and Multi-Dimensional Clustering (MDC), as the rewrite engine could be updated with new rules to recognize and exploit these structures.

The influence of Starburst also extended beyond IBM. The principles of extensible, rule-based rewrite engines and sophisticated query graph representations influenced the development of other major database systems, including Microsoft SQL Server and various open-source projects. The project's emphasis on "orthogonality"—the idea that different features should be able to operate independently and be combined without conflict—remains a guiding principle in database theory today.

"The Starburst project proved that a database system could be both highly optimized and highly extensible, a balance that was previously thought to be unattainable in high-performance environments."

The Starburst project shaped the mechanics of modern relational query optimization. By applying algebraic transformations to SQL in a dedicated rewrite engine before cost-based planning, it helped databases make the transition from simple transactional systems to complex analytical engines. The query graph representation and rewrite rules developed during the 1990s at IBM Almaden continue to underpin the performance of modern enterprise-grade relational database systems.