SQL Server 2000 Materialized View Rewriting and Optimization

Microsoft SQL Server 2000 introduced a significant advancement in relational database technology through the implementation of automated query rewriting using indexed views. This feature, synonymous with materialized views in other database ecosystems, allowed the database engine to store the result set of a view physically on disk, complete with its own unique clustered index. The primary innovation was not merely the storage of these results, but the ability of the Query Optimizer to automatically recognize and substitute these pre-computed summaries for complex operations against base tables, even when the query did not explicitly reference the view.

This implementation relied heavily on foundational research in relational query optimization mechanics, particularly the work of Per-Åke Larson and Jonathan Goldstein. By integrating their 'General Algorithm for Query Rewriting' into the SQL Server Query Optimizer, Microsoft enabled the engine to perform sophisticated algebraic matching. This logic determined if a query’s selection, join, and aggregation requirements could be satisfied by an existing indexed view, thereby reducing the computational burden on the system during execution.

What changed

The introduction of indexed views in SQL Server 2000 marked a transition from manual query tuning to automated, cost-based optimization strategies. Prior to this release, developers often had to manually redirect queries to summary tables to achieve performance gains in large-scale online analytical processing (OLAP) environments. The new architecture changed the following aspects of database management:

Automated Transparency: The Query Optimizer became capable of detecting when an indexed view covered a portion of a query tree, allowing for transparent substitution without modifying the application's SQL code.
Consistency Management: Because indexed views were automatically updated by the database engine whenever the underlying base tables were modified, the risk of data drift, common in manually managed summary tables, was eliminated.
Query Plan Flexibility: The optimizer could now choose between scanning a large base table or performing a much smaller I/O operation on a pre-aggregated index, based strictly on the estimated cost of execution.
Algebraic Transformation: The engine moved beyond simple text-based matching to a deep structural analysis of query graphs, enabling the substitution of views that were not exact matches but contained enough data to satisfy the request through additional filtering (compensation).

Background

Relational query optimization is a discipline rooted in the early development of System R and the pioneering work of Patricia Selinger. The objective of any cost-based optimizer (CBO) is to transform a declarative SQL statement into the most efficient procedural execution plan. This involves searching a vast space of possible join orders, access paths, and algorithmic implementations for operators. By the time SQL Server 2000 was developed, the industry was grappling with increasingly complex data sets where traditional B-tree indexes on base tables were insufficient for maintaining performance in high-aggregation scenarios.

The concept of materialized view rewriting emerged as a solution to the 'aggregation bottleneck.' In this framework, the database treats a view not as a virtual macro, but as a persistent data structure. However, the challenge of materialized views lies in the 'rewriting problem': determining if a query Q can be computed using a view V. This requires a formal algebraic proof that the data in V is a superset of the data required by Q, and that Q can be derived from V through a set of valid relational transformations.

The Larson-Goldstein General Algorithm

The technical backbone of the SQL Server 2000 implementation was derived from the 'General Algorithm for Query Rewriting' formulated by Per-Åke Larson and Jonathan Goldstein. This algorithm provided a rigorous mathematical framework for the Query Optimizer to identify matches between query sub-expressions and indexed views. The algorithm functions by decomposing both the query and the view into their constituent relational components: the Project list, the Select predicates, and the Join conditions (often referred to as the SPJ block).

The algorithm evaluates whether the view contains all the necessary columns required by the query and whether the view’s join conditions are a subset or a match of the query's join conditions. Crucially, it handles 'compensation predicates.' If a view contains more rows than the query requires—for instance, if the view covers an entire year of data while the query only asks for a single month—the algorithm generates a compensation filter to be applied to the view's output, ensuring the result set remains accurate.
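The checks described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the optimizer's actual implementation: the `SPJBlock` class and `match_view` function are hypothetical names, predicates are compared as plain strings rather than by logical implication, and the query's column set is assumed to already include any columns used by its filters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SPJBlock:
    """Hypothetical model of a select-project-join block as three sets."""
    tables: frozenset      # base tables that are joined
    predicates: frozenset  # conjuncts of the join/filter condition, as strings
    columns: frozenset     # columns exposed (view) or required (query)


def match_view(query, view):
    """Return the compensation predicates if the view can answer the query,
    or None if it cannot. Matching here is purely syntactic."""
    if view.tables != query.tables:
        return None
    # Every view predicate must also appear in the query; otherwise the view
    # may have discarded rows that the query still needs.
    if not view.predicates <= query.predicates:
        return None
    # The view must expose every column the query needs.
    if not query.columns <= view.columns:
        return None
    # Whatever the query filters on beyond the view becomes compensation.
    return query.predicates - view.predicates


view = SPJBlock(
    tables=frozenset({"orders", "customers"}),
    predicates=frozenset({"o.cust_id = c.cust_id", "o.order_year = 2000"}),
    columns=frozenset({"c.region", "o.order_month", "o.amount"}),
)
month_query = SPJBlock(
    tables=view.tables,
    predicates=view.predicates | {"o.order_month = 3"},
    columns=frozenset({"c.region", "o.order_month", "o.amount"}),
)
all_years_query = SPJBlock(
    tables=view.tables,
    predicates=frozenset({"o.cust_id = c.cust_id"}),
    columns=frozenset({"c.region", "o.amount"}),
)

compensation = match_view(month_query, view)
rejected = match_view(all_years_query, view)
```

In this sketch the single-month query matches the year-wide view and yields one compensation predicate (`o.order_month = 3`), while the query over all years is rejected because the view has already filtered out rows it would need.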

Algebraic Matching Logic and Query Graphs

To perform this matching, SQL Server 2000 utilized internal representations known as query graphs. These graphs represent the tables as nodes and the join conditions as edges. The optimization process involves a series of algebraic transformations, such as predicate pushdown and view merging, to normalize the query and the view into a comparable state. The optimizer looks for isomorphism between sub-graphs of the query and the graph of the indexed view.
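As a rough illustration of the graph comparison, the Python sketch below builds a query graph from join conditions and tests whether a view's graph coincides with a sub-graph of the query's. The helper names are hypothetical, and the sketch assumes both sides use identical table names with no aliases, so the isomorphism test reduces to comparing labeled edges directly; a real optimizer has to do considerably more.

```python
def build_graph(joins):
    """Build (nodes, edges) from join conditions given as pairs of
    'table.column' strings. Each edge is an unordered pair, so the
    order in which a join condition is written does not matter."""
    nodes, edges = set(), set()
    for left, right in joins:
        nodes.add(left.split(".")[0])
        nodes.add(right.split(".")[0])
        edges.add(frozenset((left, right)))
    return nodes, edges


def view_covers_subgraph(query_joins, view_joins):
    """True if the view's graph equals the sub-graph of the query that is
    induced by the view's tables (assumes matching table names)."""
    q_nodes, q_edges = build_graph(query_joins)
    v_nodes, v_edges = build_graph(view_joins)
    if not v_nodes <= q_nodes:
        return False
    # Keep only the query edges whose endpoints are both tables of the view.
    induced = {
        edge for edge in q_edges
        if {col.split(".")[0] for col in edge} <= v_nodes
    }
    return induced == v_edges


query_joins = [
    ("orders.cust_id", "customers.cust_id"),
    ("orders.prod_id", "products.prod_id"),
]
# Same join as in the query, written in the opposite order.
good_view = [("customers.cust_id", "orders.cust_id")]
# Joins the same tables, but on a different column.
bad_view = [("orders.ship_to_id", "customers.cust_id")]

good_match = view_covers_subgraph(query_joins, good_view)
bad_match = view_covers_subgraph(query_joins, bad_view)
```

Here the first view matches the orders-customers portion of the query graph, leaving the join to `products` to be performed on top of the view, while the second is rejected because its edge carries a different join condition.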

The matching logic also considers grouping and aggregation. If a query requires a sum of sales by region, and an indexed view contains a sum of sales by city and region, the optimizer can 'roll up' the view's data. It recognizes that the finer-grained aggregation in the view can be further aggregated to satisfy the coarser-grained requirement of the query. This necessitates a deep understanding of functional dependencies and the properties of aggregate functions like SUM, COUNT, and MIN/MAX.
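The roll-up can be demonstrated with a small, self-contained Python example. It uses SQLite only because it ships with Python; SQLite has no indexed views, so an ordinary summary table (the hypothetical `sales_by_city`) stands in for one, and the table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, city TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('West', 'Seattle', 100), ('West', 'Seattle', 50),
        ('West', 'Portland', 70), ('East', 'Boston', 40);

    -- Stand-in for an indexed view grouped by (region, city).
    CREATE TABLE sales_by_city AS
        SELECT region, city, SUM(amount) AS total, COUNT(*) AS cnt
        FROM sales
        GROUP BY region, city;
""")

# The query as the application would write it, against the base table.
from_base = conn.execute(
    "SELECT region, SUM(amount), COUNT(*) FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()

# The rolled-up rewrite: SUM of the partial sums, and SUM of the partial
# counts (a count rolls up by adding counts, not by counting rows again).
from_view = conn.execute(
    "SELECT region, SUM(total), SUM(cnt) FROM sales_by_city "
    "GROUP BY region ORDER BY region"
).fetchall()

conn.close()
```

Both queries return the same rows, which is the property the optimizer must establish before it can substitute the finer-grained view for the base table.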

Relational Mechanics and Execution Costs

Once a potential rewrite is identified, it is not automatically used. SQL Server’s optimizer evaluates the rewritten plan against the original plan using a cost model. This model estimates the number of I/O operations and CPU cycles required for each path. The estimation relies on distribution statistics—histograms that track the cardinality and frequency of values within the data set. The optimizer prefers the indexed view only if the estimated cost of reading the view's clustered index and applying any necessary compensation is lower than the cost of joining the original base tables.

For example, in a three-way join between large tables, the optimizer might find an indexed view that already performs two of the joins. The choice then becomes whether to execute a nested loop or hash join between the third table and the pre-computed results of the first two. If the cardinality estimation suggests that the intermediate result set from the view is significantly smaller than the base tables, the view-based plan is selected.
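The decision itself amounts to comparing estimates. The Python sketch below is a toy cost model under stated assumptions: the `Plan` fields, the weights, and all of the page and row figures are invented for illustration and do not reflect SQL Server's actual cost formulas.

```python
from dataclasses import dataclass

# Hypothetical weights: one unit per page read, a small charge per row.
IO_WEIGHT = 1.0
CPU_WEIGHT = 0.001


@dataclass
class Plan:
    name: str
    io_pages: int  # estimated pages read
    cpu_rows: int  # estimated rows processed by operators


def estimated_cost(plan):
    """Combine the I/O and CPU estimates into a single comparable number."""
    return plan.io_pages * IO_WEIGHT + plan.cpu_rows * CPU_WEIGHT


def choose_plan(candidates):
    """Pick the candidate with the lowest estimated cost."""
    return min(candidates, key=estimated_cost)


base_plan = Plan("three-way join on base tables", io_pages=12_000, cpu_rows=2_000_000)
view_plan = Plan("indexed view joined to third table", io_pages=300, cpu_rows=40_000)
# A view that is nearly as large as the base tables offers no advantage.
bloated_view_plan = Plan("wide indexed view", io_pages=20_000, cpu_rows=2_500_000)

selective_choice = choose_plan([base_plan, view_plan])
bloated_choice = choose_plan([base_plan, bloated_view_plan])
```

With these made-up numbers the view-based plan wins when the view is small, and the base-table plan wins when the view is larger than the data it summarizes, which mirrors the point that a valid rewrite is only a candidate, never a guaranteed choice.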

Technical Constraints and Schemabinding

The implementation in SQL Server 2000 required strict adherence to certain constraints to ensure the integrity of the rewriting process. Most notably, views had to be created with SCHEMABINDING, a property that prevents the underlying base tables from being altered in a way that would invalidate the view. Furthermore, all functions used within the view had to be deterministic (meaning they return the same result for the same input every time) to prevent the pre-computed data from becoming obsolete or incorrect due to environmental factors like the current system time or locale settings.

The optimization mechanics also had to account for the 'Expression Matching' problem. This occurs when a query uses an expression that is not identical in text but is algebraically equivalent to an expression in an indexed view. The SQL Server 2000 optimizer was designed to recognize these equivalencies through a canonicalization process, where expressions are moved into a standard form before comparison. This ensured that small variations in SQL syntax did not prevent the engine from utilizing high-performance indexed views.
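One simple form of canonicalization can be sketched as follows. This is only an illustration of the idea, not SQL Server's actual normalization rules: expressions are modeled as nested tuples, the `canonical` function and the `COMMUTATIVE` set are hypothetical, and leaves are assumed to be case-insensitive column names.

```python
# Operators whose operands may be reordered without changing the result.
COMMUTATIVE = {"+", "*", "=", "AND", "OR"}


def canonical(expr):
    """Return a standard form of an expression tree.

    An expression is either a column name (a string) or a tuple of the form
    (operator, operand, operand, ...). Column names are lower-cased and the
    operands of commutative operators are sorted into a fixed order."""
    if isinstance(expr, str):
        return expr.lower()
    op, *args = expr
    args = [canonical(arg) for arg in args]
    if op in COMMUTATIVE:
        # repr gives a total order over a mix of strings and tuples.
        args.sort(key=repr)
    return (op, *args)


# quantity * (price + tax), as it might appear in the view definition
view_expr = ("*", "Quantity", ("+", "price", "tax"))
# (Tax + Price) * quantity, as it might appear in a query
query_expr = ("*", ("+", "Tax", "Price"), "quantity")

same = canonical(view_expr) == canonical(query_expr)
# Subtraction is not commutative, so operand order must be preserved.
different = canonical(("-", "a", "b")) == canonical(("-", "b", "a"))
```

The two multiplication expressions differ as text but share one canonical form, so a comparison after normalization treats them as equal, while the two subtractions correctly remain distinct.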

Legacy and Advancement

The SQL Server 2000 implementation of indexed views set the stage for subsequent advancements in automated tuning. It demonstrated that by applying the principles of relational query optimization mechanics—specifically the transformation rules of Larson and Goldstein—a database could bridge the gap between normalized storage and denormalized performance. This work proved that the cascading application of algebraic rules could effectively manage the complexity of modern data retrieval, laying the groundwork for the more advanced query rewrite features found in later versions of SQL Server and other enterprise relational database systems.