The Science of Database Guessing: Cardinality and Costs

Have you ever tried to cook a meal for a group of people without knowing how many were coming? If you prepare for five and fifty show up, you’re in trouble. If you prepare for fifty and only five show up, you’ve wasted a lot of food. This is exactly the problem database engines face every second. When you run a query, the engine has to guess how many rows of data it’s going to find before it even starts looking. This guess is called a cardinality estimation, and it is the heartbeat of Relational Query Optimization Mechanics. If the guess is wrong, the entire execution plan falls apart, and your computer starts chugging.

Relational query optimization is basically a high-stakes game of 'How many?' The engine needs to know if a specific filter—like searching for people named 'Zebulon'—will return two people or two thousand. If it’s only two, the engine might use a quick index search. If it’s two thousand, it might decide to just read the whole list from start to finish. The engine uses a cost-based model, which means it assigns a price tag to every action. The goal is to find the plan with the lowest total price. But since the price depends on the number of rows, a bad guess at the start leads to a very expensive mistake later on.

What happened

The foundation of this entire field was laid back in the late 1970s by a researcher named Patricia Selinger. Before her work, databases were much more rigid. She helped create a system where the computer could actually 'reason' about the best way to access data. This changed the world of software forever.

The Selinger Model:It introduced the idea of looking at the cost of disk access and CPU usage together.
Statistics Collection:Databases began keeping 'histograms'—little charts that show how data is spread out—so they could make better guesses.
Predicate Pushdown:This is a trick where the database applies filters as early as possible. If you want 'Blue' cars from 'California,' it filters for 'California' first to shrink the pile before looking for the color.
Dynamic Planning:Modern systems can sometimes change their plan in the middle of a search if they realize their original guess was wrong.

The Puzzle of Join Ordering

One of the hardest parts of this job is figuring out which tables to link first. Imagine you have three lists: Customers, Orders, and Products. You want to see which customers in Oregon ordered a specific brand of toaster. You could link Customers to Orders first, or Orders to Products first. If there are only ten customers in Oregon but ten million orders for toasters, starting with the customers is much faster. It shrinks the 'intermediate result set.' That’s just a way of saying it makes the pile of paper you're carrying smaller as you move to the next desk.

When the database has to join ten or twenty tables, the number of possible orders is astronomical. It’s more than the number of stars in the sky. The engine can't check every single one, so it uses 'heuristics'—educated shortcuts—to find a 'good enough' plan quickly. It’s like a chef who doesn't check every single recipe in the world before making dinner; they just use the three or four methods they know work best for the ingredients on hand. This is where the 'mechanics' part of optimization really shines. It's a balance of pure math and practical shortcuts.

The Battle Against Stale Stats

Why do databases suddenly get slow for no reason? Often, it's because the stats are 'stale.' Think of it like a grocery store. If the manager thinks they have 100 gallons of milk (the statistics), they’ll tell the staff to put them on sale. But if they actually only have 2 gallons (the reality), the plan to have a big sale is a disaster. In a database, if you add a million new records but don't tell the optimizer, it will keep using a plan designed for a small table. This is why 'rebuilding statistics' is a common chore for people who manage data. It’s like giving the database a fresh pair of glasses.

Sometimes, the smartest thing a database can do is admit it doesn't know the answer and try a different path.

We also have to think aboutView Merging. Sometimes, users create 'views,' which are like saved searches. When you run a query against a view, the optimizer has to 'unfold' that view and mix it in with your new request. It’s like taking a pre-made sauce and figuring out how to adjust the spices when you add it to a new dish. If the optimizer isn't careful, it can get confused by the layers of logic. Good optimization mechanics ensure that these layers don't slow things down, keeping the data flowing smoothly from the hard drive to your screen.

Query optimization is about efficiency. We live in a world where we generate more data than ever before. Without these clever algorithms and historical models, our apps would be unusable. It’s a quiet, invisible science, but it’s what keeps our digital lives moving at the speed of thought. Next time your weather app or bank balance pops up in a second, you can thank a cost-based optimizer for doing the heavy lifting for you.