Chapter 19: Query Optimization

Advanced Database Management Systems - Final Term Elite Preparation

1. Introduction & Query Representations

Query optimization is conducted by the DBMS Query Optimizer. Its goal is to select the best available strategy (least costly) for executing a query based on system information.

Timing of Optimization: Compiled vs. Interpreted Queries

Compiled Queries: The optimizer estimates and chooses the lowest cost strategy before runtime. Highly efficient for repeated queries.
Interpreted Queries: The entire optimization and estimation process occurs at runtime. Calculating the cost estimate dynamically may slow down actual response time.

Query Trees vs. Query Graphs

Query Tree: Represents a Relational Algebra expression. It dictates a specific order of operations (executed bottom-up). This makes it the preferred representation for optimizers.
Query Graph: Represents a Relational Calculus expression. Relation nodes are displayed as single circles, constants as double circles, and join/selection conditions as edges. Attributes to retrieve are in square brackets.

2. Heuristic Optimization of Query Trees

The parser's initial tree (canonical tree) is often highly inefficient (e.g., executing a massive Cartesian Product before filtering). The optimizer transforms it into an equivalent, faster tree using Heuristics.

🧠 The Golden Heuristic Rule

Apply operations that reduce the size of intermediate results first! Perform SELECT (σ) and PROJECT (π) as early as possible to reduce the number of tuples and attributes before expensive JOIN operations.

Key Transformation Rules

Cascade of σ: A conjunctive selection condition (AND) can be broken up into a sequence (cascade) of individual σ operations.
Commutativity of σ: The order of selections does not matter. Always apply the most restrictive selection first!
Cascade of π: In a sequence of projections, all but the last (outermost) one can be ignored.
Replacing Cartesian Product: A Cartesian Product followed by a Selection condition should be converted into a standard JOIN.

3. Choice of Query Execution Plans

Once the logical tree is optimized, the DBMS makes execution decisions at the physical level.

Materialized vs. Pipelined Evaluation

Materialized Evaluation: The result of an operation is physically stored on disk as a temporary relation.
Pipelined Evaluation: Operation results are forwarded directly in memory to the next operation in the sequence, avoiding disk writes.

Physical Optimization Approaches

Cost-based physical optimization can be approached in two ways:

Top-down approach: Starts with the overall goal and breaks it into sub-goals.
Bottom-up approach: Computes the optimal solution for base relations first, moving up the tree. (Used by Dynamic Programming).

Note: Certain physical heuristics make cost calculations unnecessary (e.g., always use an index scan for selections whenever possible).

4. Subqueries, Views & Advanced Joins

Nested Subqueries & View Merging

Unnesting: Removing the nested query and converting inner/outer queries into one block. (Always possible for IN or ANY).
Inline Views (FROM clause subqueries): Can be subjected to View Merging, where tables in the view merge with outer block tables.
Group-By View-Merging: The optimizer decides via cost whether to Group early (reducing data for joins) or Group late (if joins have low selectivity).

Semi-Join and Anti-Join Cost Formulas

Unnesting specific queries leads to Semi-Joins or Anti-Joins. The optimizer calculates their Selectivity (js) and Cardinality (jc).

Semi-Join (For IN / EXISTS clauses) js = MIN(1, NDV(Y, T2) / NDV(X, T1)) jc = |T1| * js Anti-Join (For NOT IN / NOT EXISTS clauses) js = 1 - MIN(1, NDV(T2.y) / NDV(T1.x)) jc = |T1| * js

5. Cost-Based Optimization: SELECT Operations

Cost metric evaluates alternatives based on: Access cost to secondary storage (Disk), Computation cost, Memory usage, and Communication.

Catalog Information & Histograms

The DBMS catalog stores File size, Organization, Index levels, and Number of Distinct Values (NDV). For highly skewed data, the RDBMS stores Histograms to accurately calculate Attribute Selectivity and Selection Cardinality.

SELECT Cost Formulas (Disk Block Accesses)

S1: Linear Search: Cost = b. (If searching for a unique key with equality, average cost is b / 2).
S2: Binary Search: Cost = log₂(b) + ⌈ s / bfr ⌉ - 1.
S3a: Primary Index (Single Record): Cost = x + 1.
S3b: Hash Key: Cost = 1.
S4: Ordering Index (Multiple Records): Cost = x + (b / 2).
S5: Clustering Index: Cost = x + ⌈ s / bfr ⌉.
S6: Secondary B+ Tree Index: Cost = x + 1 + s (Worst Case).

6. Cost Functions: JOIN Operations & Ordering

Standard Join Selectivity & Cardinality js = 1 / max(NDV(A, R), NDV(B, S)) jc = |(R ⋈ S)| = js * |R| * |S|

Join Execution Strategies

J1: Nested-Loop Join: Costly. Formulas vary depending on available memory buffer blocks (nB).
J2: Index-Based Nested-Loop: Extremely fast if a secondary index exists on the inner table.
J3: Sort-Merge Join: Requires adding the cost of sorting if files aren't already sorted on join attributes.
J4: Partition-Hash Join: Cost formula: 3 * (bR + bS) + Output Cost.

Join Ordering Choices (Dynamic Programming)

When computing multi-relation queries, the optimizer uses Dynamic Programming (where optimal subproblems are solved bottom-up, only once) to evaluate tree shapes.

Left-Deep Join Trees: Generally preferred! They work perfectly with common join algorithms and are able to generate fully pipelined plans.
Right-Deep & Bushy Trees: Offer more permutations but are harder to pipeline. (E.g., 5 relations have 120 left-deep permutations, but 1,680 bushy permutations).

7. Advanced Issues & Data Warehouses

Advanced Optimization Issues

Size Estimation of Other Operations: Optimizers must also estimate sizes for Projections, Set operations, Aggregation, and Outer joins.
Plan Caching: A compiled plan is stored by the optimizer for later use by the identical query running with different parameters.
Top k-results optimization: If a user only wants LIMIT 10, this limits strategy generation to avoid fully sorting massive tables.

Query Optimization in Data Warehouses

Data warehouses heavily rely on Star Transformation Optimization.

Goal: Access a reduced set of data from the massive central "Fact" table, completely avoiding a full table scan.
Techniques: Classic star transformation, Bitmap index star transformation (using bitmaps for ultra-fast intersections), and Joining back.

8. Oracle Specifics & Semantic Optimization

Displaying Query Execution Plans

                -- Oracle Syntax:

                EXPLAIN PLAN FOR <SQL query>

                -- IBM DB2 Syntax:

                EXPLAIN PLAN SELECTION [additional options] FOR <SQL-query>

                -- SQL Server Syntax:

                SET SHOWPLAN_TEXT ON or SET SHOWPLAN_XML ON or SET SHOWPLAN_ALL ON

Overview of Query Optimization in Oracle

Global Query Optimizer: Integrates logical transformations and physical optimization phases into one engine.
Adaptive Optimization: Uses a feedback loop to improve on previous execution decisions.
Array Processing: Supported for fetching multiple rows efficiently.
Hints: Specified by application developers, embedded in SQL text (e.g., /*+ INDEX */) to force access paths or join methods.
Outlines & SQL Plan Management: Used by DBAs to strictly preserve known good execution plans, preventing the optimizer from changing them.

Semantic Query Optimization

Uses constraints specified on the database schema (Primary keys, Check constraints, Foreign Keys). Goal: modify one query into another that is drastically more efficient to execute, or instantly return an empty set if a query logically violates a CHECK constraint.

🔥 Core Theory Q&A Preparation

High-yield theoretical concepts tested frequently in finals.

Concept: Compiled vs Interpreted

Q: How does Cost-Based optimization differ between Compiled and Interpreted queries?

A: For compiled queries, the optimizer estimates and compares costs at compile time, choosing the best strategy before execution, which is highly efficient. For interpreted queries, this entire complex estimation process occurs dynamically at runtime. This overhead can actually slow down the immediate response time of the query.

Concept: Execution Trees

Q: Why does Dynamic Programming heavily prefer Left-Deep join trees over Bushy trees?

A: Left-deep trees are structurally designed so that the right-hand input of every join is a base table. This allows the DBMS to generate fully pipelined plans, streaming intermediate results directly in RAM without materializing (writing) temporary tables to disk. Bushy trees often force expensive disk materialization.

Concept: Oracle Plan Stability

Q: What are "Outlines" and "SQL Plan Management" used for in Oracle?

A: They are tools used to provide plan stability. Sometimes, an optimizer might mistakenly choose a worse execution plan after a system update or statistic change. DBAs use Outlines and SQL Plan Management to "lock in" or preserve a historically optimal execution plan, preventing the optimizer from altering it.

🏆 10-Mark Scenario Questions

These advanced scenarios require synthesis of math, logic, and physical architecture.

Scenario 1: Join Permutation Analysis (Math Solve)

A database has three tables: `PROJECT`, `DEPARTMENT`, and `EMPLOYEE`. The optimizer restricts its search space to Left-Deep trees.
Based on Dynamic Programming, evaluate the structural difference between generating `PROJECT ⋈ DEPARTMENT ⋈ EMPLOYEE` versus `DEPARTMENT ⋈ EMPLOYEE ⋈ PROJECT`. Why must the optimizer evaluate all permutations?

Elite Answer Formulation:

By restricting to Left-Deep trees, a 3-table join yields 3! = 6 permutations (See Slide 37, Table 19.1). The optimizer evaluates all permutations bottom-up because join selectivity radically alters the intermediate file size.

Permutation A: (PROJECT ⋈ DEPARTMENT) ⋈ EMPLOYEE

If PROJECT ⋈ DEPARTMENT results in 50 rows (high selectivity), the subsequent join with the massive EMPLOYEE table is very fast, requiring only 50 inner-loop iterations.

Permutation B: (DEPARTMENT ⋈ EMPLOYEE) ⋈ PROJECT

If DEPARTMENT ⋈ EMPLOYEE results in 10,000 rows (low selectivity), the system must pipeline a massive 10,000-row intermediate table into the final join with PROJECT, destroying memory buffers and CPU efficiency.

Conclusion: The optimizer evaluates all left-deep permutations to find the order that produces the smallest intermediate relations first, applying the core heuristic of reducing data volume early.

Scenario 2: Data Warehouse Star Transformation

A Data Warehouse contains a 10-Billion row `SALES_FACT` table, connected to small `TIME`, `STORE`, and `PRODUCT` dimension tables. A query requests total sales for "Store X" during "December" for "Product Y".
Explain why standard Cost-Based Optimization fails here, and how Bitmap Index Star Transformation solves it.

Elite Answer Formulation:

CBO Failure: Standard CBO might attempt to join the massive SALES_FACT table with STORE first. Even filtered, this intermediate table might contain 100 million rows, forcing a catastrophic full table scan and massive memory consumption.
Bitmap Star Transformation: The optimizer avoids touching the Fact table directly. Instead:
1. It queries the small dimensions (Store X, Dec, Prod Y).
2. It retrieves the Bitmap Indexes associated with those dimension keys on the Fact table.
3. It performs a rapid, CPU-only logical AND intersection of these bitmaps in memory.
4. Joining Back: It uses the resulting intersected bitmap to fetch only the exact matching rows directly from the SALES_FACT table, completely bypassing a full table scan.

Scenario 3: Advanced Optimization Parameters

A web application features a search bar that displays "Top 5 results". Users frequently search for the exact same term (e.g., "iPhone"). The backend query involves complex Subqueries and Anti-Joins.
Identify and explain three specific advanced optimization techniques the DBMS will use to ensure instant response times.

Elite Answer Formulation:

Plan Caching: Because users repeatedly search the exact same query structure (just different parameters), the optimizer compiles the complex execution plan once, and stores it in the Plan Cache. Subsequent searches bypass the expensive compile-time cost analysis entirely.
Top K-Results Optimization: Because the app only requests 5 results (LIMIT 5), the optimizer limits strategy generation. Instead of generating a plan that executes a full Sort-Merge join on millions of rows, it uses an unblocked pipeline strategy that stops execution the millisecond the 5th match is found.
Anti-Join Unnesting: If the subqueries use NOT EXISTS, the optimizer unnesting logic converts them into Anti-Joins, allowing the DBMS to use highly efficient Hash Anti-Joins rather than evaluating the subquery row-by-row.

Scenario 4: Semantic vs Heuristic Rewrite

An enterprise schema enforces a Foreign Key constraint: every `Order` must belong to a valid `Customer`.
A developer writes: `SELECT o.OrderID, c.CustomerID FROM Orders o JOIN Customers c ON o.CustomerID = c.CustomerID;`
Explain how Semantic Optimization transforms this query compared to standard Heuristics.

Elite Answer Formulation:

Standard Heuristics: A heuristic optimizer sees a JOIN and projection. It will ensure projections are pushed down, but it will still physically execute the JOIN between Orders and Customers, consuming CPU and memory to match the keys.
Semantic Optimization: The semantic optimizer analyzes the database schema constraints. It realizes that because of the strict Foreign Key constraint, it is mathematically impossible for an Order to exist without a matching Customer. Furthermore, the query only asks for o.OrderID and c.CustomerID (which is identical to o.CustomerID).
The Rewrite: Semantic optimization completely eliminates the JOIN. It rewrites the query internally to: SELECT OrderID, CustomerID FROM Orders; saving massive amounts of disk I/O and processing time.