
Comparative Analysis of Join Algorithms: Hash vs. Merge in Early Relational Systems

By Julian Krell Mar 13, 2026
All rights reserved to analyzequery.com

Relational query optimization mechanics emerged as a primary technical discipline following the publication of E.F. Codd’s relational model in 1970. In early relational database management systems (RDBMS), the efficiency of the join operation became the defining factor for system performance. The primary challenge resided in the translation of declarative SQL statements into procedural execution plans that minimized costly input/output (I/O) operations and central processing unit (CPU) cycles.

By the late 1970s and early 1980s, researchers at IBM’s San Jose Research Laboratory and those involved in the development of INGRES at UC Berkeley focused on the mathematical complexity of joining disparate data sets. These efforts culminated in the definition of three fundamental join strategies: the Nested Loop Join, the Sort-Merge Join, and the Hash Join. The selection between these operators depended on the physical storage of data, the availability of memory buffers, and the accuracy of statistical estimators maintained by the database engine.

At a glance

  • Nested Loop Join: The most basic join algorithm, with a time complexity of O(N * M), where N and M are the number of tuples in the outer and inner relations, respectively.
  • Sort-Merge Join: An algorithm that sorts both relations on the join key (O(N log N + M log M)) and then merges them in a single pass (O(N + M)).
  • Hash Join: A strategy that builds an in-memory hash table for the smaller relation (O(N)) and probes it using the larger relation (O(M)).
  • Cardinality Estimation: The process by which the optimizer predicts the number of rows resulting from a filter or join, which directly influences the choice of join operator.
  • Teradata Architecture: Noted for its early adoption of parallel hash joins, leveraging a shared-nothing architecture to handle large-scale data warehousing.
  • IBM Mainframe (DB2): Historically prioritized sort-merge and nested loop joins, optimized for the high-performance sorting capabilities of System/370 and its successors.

Background

The conceptual foundation for query optimization was established by P. Griffiths Selinger and her colleagues in the 1979 paper ‘Access Path Selection in a Relational Database Management System.’ This work introduced the Cost-Based Optimizer (CBO), which utilized statistics about data distribution to evaluate the total cost of various execution paths. In this era, hardware limitations, specifically the high cost of random-access memory (RAM) and the slow seek times of magnetic disk drives, dictated the evolution of join mechanics.

Early implementations of the relational model struggled with performance parity against established hierarchical and network databases. The flexibility of SQL allowed users to write complex multi-way joins, but without sophisticated optimization, these queries often resulted in Cartesian products or exhaustive table scans. The development of join algorithms was therefore a direct response to the need for scalable data retrieval. While Nested Loop joins were sufficient for small lookups, the growth of enterprise data necessitated the development of algorithms that could scale linearly or log-linearly with the size of the input relations.

Mathematical Complexity and Algorithmic Efficiency

The complexity of join algorithms is typically measured in terms of I/O operations, as disk access was the primary bottleneck in early systems. The Nested Loop Join (NLJ) requires reading the entire inner relation for every row found in the outer relation. If the inner relation is indexed, the complexity drops significantly, but for unindexed scans the cost is prohibitive for large datasets. Mathematically, the I/O cost is Br + (Nr * Bs), where Br and Bs are the number of blocks in the outer and inner relations, and Nr is the number of records in the outer relation.
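The cost formula above can be sketched as a small function (the function name and example figures are illustrative, not taken from any particular system):

```python
def nlj_io_cost(b_r: int, n_r: int, b_s: int) -> int:
    """I/O cost of a naive nested loop join: read every block of the
    outer relation once, plus the entire inner relation (b_s blocks)
    once per outer record."""
    return b_r + n_r * b_s

# Example: outer relation of 100 blocks / 10,000 records, inner of 400 blocks.
print(nlj_io_cost(100, 10_000, 400))  # 100 + 10,000 * 400 = 4,000,100 block reads
```

The dominance of the Nr * Bs term is exactly why the NLJ is reserved for small lookups or indexed inner relations.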

The Sort-Merge Join (SMJ) was long considered the superior choice for large-scale joins in systems with limited memory. The algorithm consists of two phases. First, both relations are sorted on the join attribute. In early VLDB proceedings, the use of external sort algorithms was emphasized, as datasets often exceeded available RAM. Once sorted, the merge phase requires only a single pass through each relation. The SMJ is particularly effective when the data is already physically ordered by the join key or when a clustered index is present. Its complexity is dominated by the sort phase: O(N log N + M log M).
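The two phases can be sketched in a few lines of Python. This is a minimal in-memory illustration, not an external sort; for brevity it also assumes join keys are unique within each relation, so the merge cursor never has to back up over duplicate runs:

```python
def sort_merge_join(r, s, key=lambda t: t[0]):
    """Sort both relations on the join key, then merge in one pass.
    Assumes keys are unique within each relation (no duplicate runs)."""
    r = sorted(r, key=key)          # sort phase: O(N log N)
    s = sorted(s, key=key)          # sort phase: O(M log M)
    i = j = 0
    out = []
    while i < len(r) and j < len(s):   # merge phase: O(N + M)
        kr, ks = key(r[i]), key(s[j])
        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            out.append(r[i] + s[j])
            i += 1
            j += 1
    return out

emp = [(2, "eve"), (1, "amy"), (3, "bob")]
dept = [(3, "sales"), (1, "hr")]
print(sort_merge_join(emp, dept))
# [(1, 'amy', 1, 'hr'), (3, 'bob', 3, 'sales')]
```

Note that the output arrives ordered by the join key, a side effect that later sections return to.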

The Hash Join (HJ), specifically the Grace Hash Join and the Hybrid Hash Join, offered a different approach. By partitioning relations into buckets based on a hash function, the system could perform joins on a per-bucket basis, ensuring that only small portions of the data needed to be in memory at any given time. The mathematical ideal for a hash join is O(N + M) I/O operations, assuming sufficient memory to hold the hash table of the smaller relation. However, if the hash table exceeds memory, the system must resort to multi-pass partitioning, which increases the I/O overhead.
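The classic in-memory build/probe structure (the simple case, before Grace-style partitioning is needed) can be sketched as follows; the function and variable names are illustrative:

```python
from collections import defaultdict

def hash_join(build, probe, key=lambda t: t[0]):
    """Classic in-memory hash join: build a hash table on the smaller
    relation, then stream the larger relation and probe it. O(N + M)
    provided the build side fits in memory; duplicate keys are handled."""
    table = defaultdict(list)
    for t in build:                       # build phase: O(N)
        table[key(t)].append(t)
    out = []
    for t in probe:                       # probe phase: O(M)
        for match in table.get(key(t), []):
            out.append(match + t)
    return out

dept = [(1, "hr"), (2, "sales")]                     # smaller: build side
emp = [(1, "amy"), (2, "bob"), (1, "carl")]          # larger: probe side
print(hash_join(dept, emp))
# [(1, 'hr', 1, 'amy'), (2, 'sales', 2, 'bob'), (1, 'hr', 1, 'carl')]
```

The Grace and Hybrid variants apply the same build/probe logic per partition after hashing both inputs to disk-resident buckets.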

Historical Adoption: Teradata vs. IBM

The adoption of these algorithms followed the diverging architectures of early database vendors. Teradata, founded in 1979, designed its system for parallel processing from the outset. Its shared-nothing architecture was uniquely suited for Hash Joins. By hashing rows across multiple Access Module Processors (AMPs), Teradata could perform joins in parallel, minimizing the need for the massive global sorts required by Sort-Merge joins. This architectural choice allowed Teradata to dominate the early data warehousing market, where large-scale aggregation and joins were common.

Conversely, IBM’s DB2 and its predecessors on the MVS and VM operating systems were designed for the mainframe environment. These systems featured highly optimized sorting utilities (such as DFSORT) and were often used for transactional workloads where data was frequently accessed via indexes. IBM developers initially favored Sort-Merge joins because they were more predictable in memory-constrained environments. While Hash joins were eventually integrated into DB2, the legacy of the SMJ remained strong due to the inherent stability of the algorithm when dealing with skewed data distributions that could cause hash buckets to overflow.

The Role of Statistical Estimators

The choice between a Hash Join and a Sort-Merge Join is rarely hard-coded; it is determined by the optimizer’s cost model. This model relies on statistical estimators to predict the cardinality of the input sets. If the optimizer underestimates the size of a relation, it might choose a Hash Join, only to have the hash table spill to disk, resulting in a ‘hash thrashing’ scenario that severely degrades performance. If it overestimates, it might choose a more conservative Sort-Merge join, incurring unnecessary sorting overhead.
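The spill risk can be made concrete with a toy decision rule; the function name, threshold, and figures are invented for illustration and do not come from any real optimizer:

```python
def choose_join(est_rows_r: int, est_rows_s: int, mem_rows: int) -> str:
    """Pick a join method from *estimated* cardinalities: use a hash join
    if the smaller (build) side is believed to fit in memory, otherwise
    fall back to a sort-merge join."""
    build_side = min(est_rows_r, est_rows_s)
    return "hash" if build_side <= mem_rows else "sort-merge"

# The optimizer believes the small side has 5,000 rows and picks a hash join...
print(choose_join(1_000_000, 5_000, mem_rows=10_000))    # 'hash'
# ...but if that side actually has 500,000 rows, the hash table spills to disk
# at runtime: the plan was chosen on the estimate, not the true cardinality.
```

The decision is only as good as the estimates fed into it, which is the subject of the next paragraph.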

Early VLDB research highlighted the ‘selectivity estimation’ problem. Optimizers used simple histograms or even uniform distribution assumptions to guess how many rows would satisfy a predicate. Advanced techniques like end-biased histograms and sampling were later introduced to improve accuracy. The efficacy of a join algorithm is therefore inseparable from the accuracy of the statistics provided to the optimizer. In modern systems, the cascade of rules derived from Selinger’s work has been augmented by adaptive query execution, where the engine can switch join strategies at runtime if it detects that the initial cardinality estimates were incorrect.
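The gap between a uniform-distribution assumption and an end-biased histogram can be sketched as follows. The bucketing scheme here is a simplified illustration (exact counts for the few most frequent values, a uniform assumption over the tail), not any system's actual catalog format:

```python
from collections import Counter

def build_histogram(values, n_top=3):
    """End-biased sketch: keep exact counts for the n_top most frequent
    values; model the remaining ('tail') values as uniformly distributed."""
    counts = Counter(values)
    top = dict(counts.most_common(n_top))
    tail_rows = len(values) - sum(top.values())
    tail_distinct = max(len(counts) - n_top, 1)
    return top, tail_rows / tail_distinct

def estimate_rows(hist, value):
    """Estimated rows satisfying `col = value` under the sketch above."""
    top, avg_tail = hist
    return top.get(value, avg_tail)

vals = ["a"] * 50 + ["b"] * 30 + ["c"] * 10 + ["d"] * 5 + ["e"] * 5
h = build_histogram(vals)
print(estimate_rows(h, "a"))   # 50 -- exact, because 'a' is a tracked frequent value
print(estimate_rows(h, "d"))   # 5.0 -- uniform average over the tail
```

A plain uniform assumption over all five values would estimate 20 rows for every predicate, badly underestimating 'a' and overestimating 'd'.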

What sources disagree on

While the mathematical benefits of Hash Joins are well-documented, early database literature contains conflicting views on their reliability compared to Sort-Merge Joins. Some researchers argued that Hash Joins were overly sensitive to data skew—situations where a single join key value appears in a disproportionately large number of rows. In such cases, the hash bucket for that value becomes a bottleneck, potentially exceeding memory limits and forcing the system into expensive recursive partitioning.

Proponents of Sort-Merge joins contended that sorting provided a more stable performance profile regardless of data distribution. They also noted that Sort-Merge joins produce sorted output, which is often beneficial for subsequent operations like ‘GROUP BY’ or ‘ORDER BY’ clauses in the SQL statement. This debate persisted throughout the 1980s, eventually leading to the development of ‘skew-resistant’ hash join variations that use dynamic re-partitioning to handle uneven data distributions.
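The skew problem described above is easy to demonstrate: hash partitioning balances work only when key frequencies are roughly even. A minimal sketch (partition count and data are illustrative):

```python
from collections import Counter

def partition_sizes(keys, n_partitions=4):
    """Hash-partition join keys and report tuples per partition. With heavy
    skew, the partition holding the hot key absorbs most of the input and
    may exceed memory, forcing expensive recursive partitioning."""
    sizes = Counter(hash(k) % n_partitions for k in keys)
    return [sizes.get(p, 0) for p in range(n_partitions)]

skewed = [42] * 9_000 + list(range(1_000))   # one hot join-key value dominates
print(max(partition_sizes(skewed)))          # one partition holds >= 9,000 rows
```

No choice of hash function helps here: every tuple with key 42 must land in the same bucket, which is why the later 'skew-resistant' variants resort to dynamically re-partitioning the hot values themselves.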

The Impact of Memory Availability

Another point of contention in early technical reports involved the trade-off between CPU cycles and I/O. Hash joins are generally more CPU-intensive due to the calculation of hash values and the management of in-memory buckets. In the early 1980s, when CPU cycles were at a premium, some engineers argued that the simpler logic of Sort-Merge was preferable. However, as the ratio of CPU speed to disk I/O speed increased, the efficiency of Hash joins in reducing disk reads became the more critical metric. This shift in hardware economics eventually led to the Hash Join becoming the default choice for large, unsorted equijoins in most modern relational database systems.

‘The selection of a join order and the choice of join method are the most critical decisions an optimizer makes. A wrong choice can lead to a query taking hours instead of seconds.’

This reality necessitated the use of query execution plan visualizations, where practitioners could inspect the cascading application of algebraic rules. By analyzing the query graph, database administrators could identify whether a join ordering dependency was causing a performance bottleneck and manually intervene using ‘hints’ to force a specific join algorithm or join order, though the goal of the relational model was always to make such manual tuning unnecessary.

#SQL optimization #Hash Join #Sort-Merge Join #Relational Query Optimization #Teradata #IBM Mainframe #Database Performance #Cost-Based Optimizer
Julian Krell

Julian contributes deep dives into the mechanics of join algorithms, comparing the efficacy of nested loops against merge and hash joins. His writing emphasizes minimizing I/O operations and CPU cycles through precise cardinality estimation.

