Chapter 19: Query Optimization

Advanced Database Management Systems - Final Term Elite Preparation

1. Introduction & Query Representations

Query optimization is conducted by the DBMS Query Optimizer. Its goal is to select the best available strategy (least costly) for executing a query based on system information.

Timing of Optimization: Compiled vs. Interpreted Queries
  • Compiled Queries: The optimizer estimates and chooses the lowest cost strategy before runtime. Highly efficient for repeated queries.
  • Interpreted Queries: The entire optimization and estimation process occurs at runtime. Calculating the cost estimate dynamically may slow down actual response time.

Query Trees vs. Query Graphs

2. Heuristic Optimization of Query Trees

The parser's initial tree (canonical tree) is often highly inefficient (e.g., executing a massive Cartesian Product before filtering). The optimizer transforms it into an equivalent, faster tree using Heuristics.

🧠 The Golden Heuristic Rule

Apply operations that reduce the size of intermediate results first! Perform SELECT (σ) and PROJECT (π) as early as possible to reduce the number of tuples and attributes before expensive JOIN operations.

Key Transformation Rules

3. Choice of Query Execution Plans

Once the logical tree is optimized, the DBMS makes execution decisions at the physical level.

Materialized vs. Pipelined Evaluation

Physical Optimization Approaches

Cost-based physical optimization can be approached in two ways:

  • Top-down approach: Starts with the overall goal and breaks it into sub-goals.
  • Bottom-up approach: Computes the optimal solution for base relations first, moving up the tree. (Used by Dynamic Programming).

Note: Certain physical heuristics make cost calculations unnecessary (e.g., always use an index scan for selections whenever possible).

4. Subqueries, Views & Advanced Joins

Nested Subqueries & View Merging

Semi-Join and Anti-Join Cost Formulas

Unnesting specific queries leads to Semi-Joins or Anti-Joins. The optimizer calculates their Selectivity (js) and Cardinality (jc).

Semi-Join (For IN / EXISTS clauses) js = MIN(1, NDV(Y, T2) / NDV(X, T1)) jc = |T1| * js Anti-Join (For NOT IN / NOT EXISTS clauses) js = 1 - MIN(1, NDV(T2.y) / NDV(T1.x)) jc = |T1| * js

5. Cost-Based Optimization: SELECT Operations

Cost metric evaluates alternatives based on: Access cost to secondary storage (Disk), Computation cost, Memory usage, and Communication.

Catalog Information & Histograms

The DBMS catalog stores File size, Organization, Index levels, and Number of Distinct Values (NDV). For highly skewed data, the RDBMS stores Histograms to accurately calculate Attribute Selectivity and Selection Cardinality.

SELECT Cost Formulas (Disk Block Accesses)

6. Cost Functions: JOIN Operations & Ordering

Standard Join Selectivity & Cardinality js = 1 / max(NDV(A, R), NDV(B, S)) jc = |(R ⋈ S)| = js * |R| * |S|

Join Execution Strategies

Join Ordering Choices (Dynamic Programming)

When computing multi-relation queries, the optimizer uses Dynamic Programming (where optimal subproblems are solved bottom-up, only once) to evaluate tree shapes.

7. Advanced Issues & Data Warehouses

Advanced Optimization Issues
  • Size Estimation of Other Operations: Optimizers must also estimate sizes for Projections, Set operations, Aggregation, and Outer joins.
  • Plan Caching: A compiled plan is stored by the optimizer for later use by the identical query running with different parameters.
  • Top k-results optimization: If a user only wants LIMIT 10, this limits strategy generation to avoid fully sorting massive tables.

Query Optimization in Data Warehouses

Data warehouses heavily rely on Star Transformation Optimization.

8. Oracle Specifics & Semantic Optimization

Displaying Query Execution Plans

-- Oracle Syntax:
EXPLAIN PLAN FOR <SQL query>

-- IBM DB2 Syntax:
EXPLAIN PLAN SELECTION [additional options] FOR <SQL-query>

-- SQL Server Syntax:
SET SHOWPLAN_TEXT ON or SET SHOWPLAN_XML ON or SET SHOWPLAN_ALL ON

Overview of Query Optimization in Oracle

Semantic Query Optimization

Uses constraints specified on the database schema (Primary keys, Check constraints, Foreign Keys). Goal: modify one query into another that is drastically more efficient to execute, or instantly return an empty set if a query logically violates a CHECK constraint.

🔥 Core Theory Q&A Preparation

High-yield theoretical concepts tested frequently in finals.

Concept: Compiled vs Interpreted

Q: How does Cost-Based optimization differ between Compiled and Interpreted queries?

A: For compiled queries, the optimizer estimates and compares costs at compile time, choosing the best strategy before execution, which is highly efficient. For interpreted queries, this entire complex estimation process occurs dynamically at runtime. This overhead can actually slow down the immediate response time of the query.

Concept: Execution Trees

Q: Why does Dynamic Programming heavily prefer Left-Deep join trees over Bushy trees?

A: Left-deep trees are structurally designed so that the right-hand input of every join is a base table. This allows the DBMS to generate fully pipelined plans, streaming intermediate results directly in RAM without materializing (writing) temporary tables to disk. Bushy trees often force expensive disk materialization.

Concept: Oracle Plan Stability

Q: What are "Outlines" and "SQL Plan Management" used for in Oracle?

A: They are tools used to provide plan stability. Sometimes, an optimizer might mistakenly choose a worse execution plan after a system update or statistic change. DBAs use Outlines and SQL Plan Management to "lock in" or preserve a historically optimal execution plan, preventing the optimizer from altering it.

🏆 10-Mark Scenario Questions

These advanced scenarios require synthesis of math, logic, and physical architecture.

Scenario 1: Join Permutation Analysis (Math Solve)

A database has three tables: PROJECT, DEPARTMENT, and EMPLOYEE. The optimizer restricts its search space to Left-Deep trees.
Based on Dynamic Programming, evaluate the structural difference between generating PROJECT ⋈ DEPARTMENT ⋈ EMPLOYEE versus DEPARTMENT ⋈ EMPLOYEE ⋈ PROJECT. Why must the optimizer evaluate all permutations?

Elite Answer Formulation:

By restricting to Left-Deep trees, a 3-table join yields 3! = 6 permutations (See Slide 37, Table 19.1). The optimizer evaluates all permutations bottom-up because join selectivity radically alters the intermediate file size.

Permutation A: (PROJECT ⋈ DEPARTMENT) ⋈ EMPLOYEE

If PROJECT ⋈ DEPARTMENT results in 50 rows (high selectivity), the subsequent join with the massive EMPLOYEE table is very fast, requiring only 50 inner-loop iterations.

Permutation B: (DEPARTMENT ⋈ EMPLOYEE) ⋈ PROJECT

If DEPARTMENT ⋈ EMPLOYEE results in 10,000 rows (low selectivity), the system must pipeline a massive 10,000-row intermediate table into the final join with PROJECT, destroying memory buffers and CPU efficiency.

Conclusion: The optimizer evaluates all left-deep permutations to find the order that produces the smallest intermediate relations first, applying the core heuristic of reducing data volume early.

Scenario 2: Data Warehouse Star Transformation

A Data Warehouse contains a 10-Billion row SALES_FACT table, connected to small TIME, STORE, and PRODUCT dimension tables. A query requests total sales for "Store X" during "December" for "Product Y".
Explain why standard Cost-Based Optimization fails here, and how Bitmap Index Star Transformation solves it.

Elite Answer Formulation:

  • CBO Failure: Standard CBO might attempt to join the massive SALES_FACT table with STORE first. Even filtered, this intermediate table might contain 100 million rows, forcing a catastrophic full table scan and massive memory consumption.
  • Bitmap Star Transformation: The optimizer avoids touching the Fact table directly. Instead:
    1. It queries the small dimensions (Store X, Dec, Prod Y).
    2. It retrieves the Bitmap Indexes associated with those dimension keys on the Fact table.
    3. It performs a rapid, CPU-only logical AND intersection of these bitmaps in memory.
    4. Joining Back: It uses the resulting intersected bitmap to fetch only the exact matching rows directly from the SALES_FACT table, completely bypassing a full table scan.
Scenario 3: Advanced Optimization Parameters

A web application features a search bar that displays "Top 5 results". Users frequently search for the exact same term (e.g., "iPhone"). The backend query involves complex Subqueries and Anti-Joins.
Identify and explain three specific advanced optimization techniques the DBMS will use to ensure instant response times.

Elite Answer Formulation:

  1. Plan Caching: Because users repeatedly search the exact same query structure (just different parameters), the optimizer compiles the complex execution plan once, and stores it in the Plan Cache. Subsequent searches bypass the expensive compile-time cost analysis entirely.
  2. Top K-Results Optimization: Because the app only requests 5 results (LIMIT 5), the optimizer limits strategy generation. Instead of generating a plan that executes a full Sort-Merge join on millions of rows, it uses an unblocked pipeline strategy that stops execution the millisecond the 5th match is found.
  3. Anti-Join Unnesting: If the subqueries use NOT EXISTS, the optimizer unnesting logic converts them into Anti-Joins, allowing the DBMS to use highly efficient Hash Anti-Joins rather than evaluating the subquery row-by-row.
Scenario 4: Semantic vs Heuristic Rewrite

An enterprise schema enforces a Foreign Key constraint: every Order must belong to a valid Customer.
A developer writes: SELECT o.OrderID, c.CustomerID FROM Orders o JOIN Customers c ON o.CustomerID = c.CustomerID;
Explain how Semantic Optimization transforms this query compared to standard Heuristics.

Elite Answer Formulation:

  • Standard Heuristics: A heuristic optimizer sees a JOIN and projection. It will ensure projections are pushed down, but it will still physically execute the JOIN between Orders and Customers, consuming CPU and memory to match the keys.
  • Semantic Optimization: The semantic optimizer analyzes the database schema constraints. It realizes that because of the strict Foreign Key constraint, it is mathematically impossible for an Order to exist without a matching Customer. Furthermore, the query only asks for o.OrderID and c.CustomerID (which is identical to o.CustomerID).
  • The Rewrite: Semantic optimization completely eliminates the JOIN. It rewrites the query internally to: SELECT OrderID, CustomerID FROM Orders; saving massive amounts of disk I/O and processing time.