
The Evolution of Cardinality Estimation: Histograms and Statistics from 1985 to 2024

By Julian Krell Jan 23, 2026

Relational Query Optimization Mechanics is a specialized discipline within database engineering focused on the translation of declarative SQL into efficient execution plans. The origins of this field date back to the late 1970s and early 1980s, primarily driven by the IBM System R project and the seminal research of P.G. Selinger. Since the publication of Selinger's 1979 paper, database engines have moved from primitive Rule-Based Optimizers (RBO) to sophisticated Cost-Based Optimizers (CBO) that rely on complex mathematical models and statistical metadata to minimize computational overhead.

By 2024, the field has evolved to manage petabyte-scale datasets where the accuracy of cardinality estimation determines the success or failure of query execution. Modern systems use advanced structures such as hybrid histograms and top-frequency buckets to address the challenges of data skew. These mechanics function by meticulously evaluating join order dependencies, selecting between algorithms like nested loop or hash joins, and applying algebraic transformations such as predicate pushdown and view merging to ensure the lowest possible I/O and CPU utilization.

Timeline

  • 1979: P.G. Selinger and the IBM System R team publish the first detailed framework for cost-based optimization, introducing the concept of query cost as a function of I/O and CPU usage.
  • 1988: M. Muralikrishna and David DeWitt introduce multi-dimensional equi-depth histograms to better estimate the selectivity of complex predicates in relational systems.
  • 1990s: Commercial vendors including Oracle, IBM (DB2), and Microsoft (SQL Server) transition from rule-based models to cost-based models, making statistical collection a standard database maintenance task.
  • 2003: The refinement of equi-depth histograms becomes the industry standard for handling non-uniform data distributions in relational databases.
  • 2013: Oracle Database 12c introduces Hybrid and Top-Frequency histograms, significantly improving accuracy for data with high-frequency values and skewed distributions.
  • 2017–2024: The introduction of Adaptive Query Processing and machine-learning-based cardinality estimators allows database engines to adjust execution plans in real time based on observed versus estimated row counts.

Background

The core objective of Relational Query Optimization Mechanics is to handle the exponentially large search space of possible execution plans for a given SQL statement. When a database receives a query, it first parses the text into an Abstract Syntax Tree (AST), which is then converted into a logical query plan represented by relational algebra. The optimizer's role is to transform this logical plan into a physical execution plan. This transformation process involves rearranging join orders, pushing filters closer to the data source, and selecting the most efficient access paths, such as index seeks or full table scans.
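The logical-to-physical transformation described above can be illustrated with a toy predicate pushdown rewrite. This is a minimal sketch on a hypothetical plan tree; the node names (`Scan`, `Join`, `Filter`) and the single rewrite rule are illustrative and do not correspond to any particular engine's internals.

```python
from dataclasses import dataclass

@dataclass
class Scan:
    table: str

@dataclass
class Join:
    left: object
    right: object

@dataclass
class Filter:
    pred_table: str   # table the predicate refers to (simplified)
    child: object

def push_down(node):
    """Move a Filter below a Join when its predicate touches one side only."""
    if isinstance(node, Filter) and isinstance(node.child, Join):
        join = node.child
        if isinstance(join.left, Scan) and join.left.table == node.pred_table:
            return Join(Filter(node.pred_table, join.left), join.right)
        if isinstance(join.right, Scan) and join.right.table == node.pred_table:
            return Join(join.left, Filter(node.pred_table, join.right))
    return node

# Filter applied after the join...
plan = Filter("orders", Join(Scan("orders"), Scan("customers")))
# ...is rewritten to filter the 'orders' scan first, shrinking the join input.
optimized = push_down(plan)
```

A real optimizer applies dozens of such algebraic rules exhaustively or heuristically; the point here is only that the rewrite preserves the result while reducing intermediate row counts.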

Cardinality estimation is the most critical component of this process: predicting the number of rows that each operation in the plan will return. If the cardinality estimate is inaccurate, the optimizer may choose a join algorithm that is inefficient for the actual data volume. For instance, a Nested Loop join is highly effective for small inputs, but if the optimizer underestimates the row count, it may incorrectly select a Nested Loop over a Hash Join, leading to a massive increase in I/O operations and execution time. To prevent such regressions, modern database engines rely on detailed statistics, including null counts, distinct-value counts, and histograms.
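The nested-loop-versus-hash-join regression can be made concrete with a toy cost model. The cost constants below (three page reads per index probe, the classic 3×(pages) hash-join I/O approximation) are illustrative assumptions, not any vendor's actual formulas:

```python
def indexed_nl_cost(outer_rows, probe_pages=3):
    """Indexed nested loop: roughly one index probe (a few page reads) per outer row."""
    return outer_rows * probe_pages

def hash_join_cost(outer_pages, inner_pages):
    """Grace hash join I/O approximation: read, partition, and re-read both inputs."""
    return 3 * (outer_pages + inner_pages)

outer_pages, inner_pages = 500, 1_000

est_cost  = indexed_nl_cost(outer_rows=10)        # optimizer's estimate: 10 rows
true_cost = indexed_nl_cost(outer_rows=100_000)   # actual data volume: 100,000 rows
hj_cost   = hash_join_cost(outer_pages, inner_pages)

# On the estimate, the nested loop (30) beats the hash join (4,500),
# so the optimizer picks it; at runtime it costs 300,000 instead.
```

The underestimate does not just make the plan slower by its own factor; it flips a discrete choice, which is why small statistical errors can produce order-of-magnitude regressions.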

The Shift to Equi-Depth Histograms

In the early stages of relational database development, equi-width histograms were the primary tool for modeling data distribution. These histograms divided the range of values into intervals of equal size. However, this approach failed to provide accurate estimates for skewed data, where a small number of intervals might contain the vast majority of the rows. To resolve this, researchers developed equi-depth (or equi-height) histograms. In an equi-depth model, the bucket boundaries are adjusted so that each bucket contains approximately the same number of rows. This allows the optimizer to gain much higher resolution for those parts of the data range where the values are most densely packed.
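The equi-depth construction can be sketched in a few lines: sort the column, place bucket boundaries at equal row-count quantiles, and estimate range selectivity by counting full buckets plus a uniform interpolation inside the partially covered one. This is a simplified illustration of the technique, not any engine's implementation:

```python
def build_equi_depth(values, num_buckets):
    """Equi-depth histogram: boundaries chosen so each bucket holds ~equal rows."""
    s = sorted(values)
    n = len(s)
    bounds = [s[min(n - 1, (i * n) // num_buckets)] for i in range(num_buckets + 1)]
    bounds[-1] = s[-1]
    return bounds

def estimate_le(bounds, value, total_rows):
    """Estimate rows satisfying col <= value from the histogram."""
    num_buckets = len(bounds) - 1
    rows_per_bucket = total_rows / num_buckets
    est = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        if value >= hi:
            est += rows_per_bucket                              # bucket fully covered
        elif value > lo:
            est += rows_per_bucket * (value - lo) / (hi - lo)   # uniform within bucket
    return est

bounds = build_equi_depth(list(range(1000)), num_buckets=4)
rows = estimate_le(bounds, 500, total_rows=1000)   # ~half the table
```

Because the boundaries adapt to the data, densely populated value ranges get narrow buckets and therefore fine-grained estimates, which is exactly the property the equi-width scheme lacked.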

The efficacy of equi-depth histograms is particularly evident when dealing with columns that have a large number of distinct values but high frequency in specific ranges, such as postal codes or transaction dates. By maintaining a constant number of rows per bucket, the engine can estimate selectivity more accurately, even when the data does not follow a normal distribution. Throughout the 1990s and early 2000s, this became the foundational logic for Cardinality Estimators in almost all major Relational Database Management Systems (RDBMS).

Oracle Database 12c Case Study: Hybrid and Top-Frequency Models

The release of Oracle Database 12c represented a major milestone in the evolution of histogram technology. Recognizing that traditional equi-depth histograms still struggled with certain types of data skew, Oracle introduced two new histogram types: Top-Frequency and Hybrid. A Top-Frequency histogram is designed for columns where a small number of values (the popular values) represent a significant portion of the total rows. Instead of spreading these values across multiple buckets, the engine isolates them into their own buckets, providing near-perfect estimation for the most frequently queried data.
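The top-frequency idea can be sketched as follows: store exact counts for the most popular values and collapse everything else into a residual that is estimated uniformly. A minimal illustration, not Oracle's implementation:

```python
from collections import Counter

def build_top_frequency(values, num_buckets):
    """Exact counts for the num_buckets most popular values; one residual for the rest."""
    counts = Counter(values)
    top = dict(counts.most_common(num_buckets))
    residual = len(values) - sum(top.values())
    return top, residual

def estimate_eq(top, residual, total_distinct, value):
    """Equality selectivity: exact for popular values, uniform for unpopular ones."""
    if value in top:
        return top[value]
    remaining = total_distinct - len(top)
    return residual / remaining if remaining else 0

# Skewed column: 'A' dominates, 'B' is common, 50 values appear once each.
values = ["A"] * 900 + ["B"] * 50 + [f"v{i}" for i in range(50)]
top, residual = build_top_frequency(values, num_buckets=2)
popular_est   = estimate_eq(top, residual, total_distinct=52, value="A")   # exact
unpopular_est = estimate_eq(top, residual, total_distinct=52, value="v7")  # averaged
```

Popular values get perfect estimates at the cost of a coarse uniform assumption for the long tail, which is an excellent trade when queries overwhelmingly target the popular values.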

Hybrid histograms combine the benefits of the frequency and equi-depth models. In a hybrid histogram, bucket sizes are allowed to vary so that no single value spans more than one bucket, and the engine stores additional metadata, a repeat count, recording how many rows carry the value that ends each bucket. This ensures that the optimizer does not lose information about the popularity of values that fall on bucket boundaries. The innovation addressed the 'bucket boundary' problem that had plagued equi-depth models for decades, where the optimizer often had to fall back on a uniform distribution assumption for values only partially described by a bucket. These models produced significantly more stable execution plans for enterprise applications with complex, non-uniform datasets.
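A minimal sketch of the hybrid idea, under the assumption (consistent with Oracle's published description) that each distinct endpoint value is confined to a single bucket and carries an exact repeat count; bucket sizes are allowed to drift from the equi-depth target to satisfy that constraint:

```python
def build_hybrid(values, num_buckets):
    """Hybrid sketch: near-equi-depth buckets, each recording
    (endpoint_value, rows_in_bucket, endpoint_repeat_count)."""
    s = sorted(values)
    n = len(s)
    target = n / num_buckets
    buckets = []
    i = 0
    while i < n:
        j = min(n - 1, int(i + target))
        endpoint = s[j]
        # Grow the bucket so the endpoint value never spills into the next bucket.
        while j + 1 < n and s[j + 1] == endpoint:
            j += 1
        repeat = s[i:j + 1].count(endpoint)   # exact frequency of the boundary value
        buckets.append((endpoint, j + 1 - i, repeat))
        i = j + 1
    return buckets

# Value 1 is popular; values 2-4 are not. Two buckets, endpoints with repeat counts.
hist = build_hybrid([1] * 5 + [2] * 2 + [3] * 2 + [4], num_buckets=2)
```

The stored repeat count is what lets the optimizer give a boundary value an exact frequency estimate instead of falling back on the uniform assumption.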

The Impact of Stale Statistics on Execution Plans

Despite the sophistication of modern histogram models, the accuracy of the Cost-Based Optimizer is entirely dependent on the currency of the statistics it uses. Stale statistics occur when the data in a table changes significantly (through INSERT, UPDATE, or DELETE operations) without a corresponding update to the metadata stored in the data dictionary. Most major RDBMS vendors, including Microsoft and Oracle, provide automatic maintenance windows to gather statistics, typically triggered when a table's data changes by more than 10%.
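The ~10% staleness heuristic described above reduces to a simple ratio check of DML volume against the row count captured at the last statistics gathering. The function and parameter names here are illustrative, not any vendor's API:

```python
def is_stale(rows_at_last_analyze, dml_rows_since, threshold=0.10):
    """Flag statistics as stale once cumulative INSERT/UPDATE/DELETE activity
    exceeds the threshold fraction of the table size at the last analyze."""
    if rows_at_last_analyze == 0:
        return dml_rows_since > 0       # any data in a freshly analyzed empty table
    return dml_rows_since / rows_at_last_analyze > threshold

fresh = is_stale(1_000_000, 50_000)     # 5% churn: statistics still considered valid
stale = is_stale(1_000_000, 150_000)    # 15% churn: schedule a statistics refresh
```

Real implementations track DML counts per table (and often per partition) in the data dictionary and evaluate this check inside the automatic maintenance window.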

Documented maintenance guidance from major vendors highlights stale statistics as a primary cause of plan regression. When statistics are outdated, the optimizer may believe a table is small when it has actually grown to millions of rows, leading it to select inefficient access paths, such as an index scan where a parallel full table scan would be far cheaper. Stale statistics also magnify the 'correlation problem': by default the optimizer assumes filtered columns are independent, so when related columns are actually correlated, the errors in cardinality estimation multiply. Advanced systems now identify stale statistics in real time and use 'dynamic sampling' to read a small amount of data during query compilation to verify the accuracy of the existing histograms.
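The dynamic-sampling fallback amounts to estimating selectivity from a small compile-time sample and scaling it to the full table. A hedged sketch; `dynamic_sample_estimate`, its parameters, and the fixed seed are illustrative choices, not a real engine's interface:

```python
import random

def dynamic_sample_estimate(table_rows, predicate, sample_size=1_000, seed=42):
    """Estimate the number of rows matching `predicate` by evaluating it on a
    small random sample and scaling the hit rate up to the full table."""
    rng = random.Random(seed)                 # deterministic for this sketch
    sample = rng.sample(table_rows, min(sample_size, len(table_rows)))
    hits = sum(1 for row in sample if predicate(row))
    return hits / len(sample) * len(table_rows)

table = list(range(10_000))                   # toy table of 10,000 rows
est = dynamic_sample_estimate(table, lambda r: r < 2_500)   # true answer: 2,500
```

The trade-off discussed below follows directly: the sample read happens during compilation, so short-running queries pay a latency tax that long-running queries easily amortize.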

What sources disagree on

Within the field of Relational Query Optimization Mechanics, there is ongoing debate regarding the trade-offs between the cost of collecting statistics and the accuracy of the resulting execution plans. Some experts argue that the overhead of maintaining high-resolution histograms for every column in a large database is prohibitive and that engines should rely more heavily on sampling-based estimation at runtime. This approach, often called 'Dynamic Sampling' or 'Adaptive Statistics,' gathers data on the fly but can increase query latency for short-running queries.

Another point of contention is the future of machine learning in this space. While some researchers advocate for the total replacement of traditional histogram-based estimators with neural-network-based models, others maintain that the transparency and predictability of the Selinger-style cost models are more important for enterprise stability. Critics of machine-learning models point to the 'black box' nature of these estimators, noting that it becomes difficult for database administrators to tune or predict how the optimizer will react to changes in data distribution when the underlying logic is not based on traditional algebraic rules.

Julian Krell

Julian contributes deep dives into the mechanics of join algorithms, comparing the efficacy of nested loops against merge and hash joins. His writing emphasizes minimizing I/O operations and CPU cycles through precise cardinality estimation.

