Chapter 19: Query Optimization
Advanced Database Management Systems - Final Term Elite Preparation
1. Introduction & Query Representations
Query optimization is conducted by the DBMS Query Optimizer. Its goal is to select the best available strategy (least costly) for executing a query based on system information.
- Compiled Queries: The optimizer estimates and chooses the lowest cost strategy before runtime. Highly efficient for repeated queries.
- Interpreted Queries: The entire optimization and estimation process occurs at runtime. Calculating the cost estimate dynamically may slow down actual response time.
Query Trees vs. Query Graphs
- Query Tree: Represents a Relational Algebra expression. It dictates a specific order of operations (executed bottom-up). This makes it the preferred representation for optimizers.
- Query Graph: Represents a Relational Calculus expression. Relation nodes are displayed as single circles, constants as double circles, and join/selection conditions as edges. Attributes to retrieve are in square brackets.
2. Heuristic Optimization of Query Trees
The parser's initial tree (canonical tree) is often highly inefficient (e.g., executing a massive Cartesian Product before filtering). The optimizer transforms it into an equivalent, faster tree using Heuristics.
Apply operations that reduce the size of intermediate results first! Perform SELECT (σ) and PROJECT (π) as early as possible to reduce the number of tuples and attributes before expensive JOIN operations.
Key Transformation Rules
- Cascade of σ: A conjunctive selection condition (AND) can be broken up into a sequence (cascade) of individual σ operations.
- Commutativity of σ: The order of selections does not change the result, so the optimizer is free to apply the most restrictive selection first.
- Cascade of π: In a sequence of projections, all but the last (outermost) one can be ignored.
- Replacing Cartesian Product: A Cartesian Product followed by a Selection condition should be converted into a standard JOIN.
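To see why these rewrites are safe, here is a minimal Python sketch that models relations as lists of dicts. The table contents and the helper names (`select`, `cartesian`, `join`) are invented for this illustration and are not part of the chapter.

```python
from itertools import product

# Toy relations (hypothetical data, for illustration only).
EMPLOYEE = [
    {"ename": "Ann", "dno": 1, "salary": 900},
    {"ename": "Bob", "dno": 2, "salary": 400},
    {"ename": "Cy", "dno": 1, "salary": 300},
]
DEPARTMENT = [{"dnumber": 1, "dname": "Research"}, {"dnumber": 2, "dname": "Sales"}]


def select(rows, pred):
    """Sigma: keep the rows that satisfy pred."""
    return [r for r in rows if pred(r)]


def cartesian(left, right):
    """Cartesian product: every pairing of a left row with a right row."""
    return [{**l, **r} for l, r in product(left, right)]


def join(left, right, pred):
    """Theta join: keep only matching pairs, never storing the full product."""
    return [{**l, **r} for l in left for r in right if pred({**l, **r})]


in_dept_1 = lambda r: r["dno"] == 1
high_paid = lambda r: r["salary"] > 500
same_dept = lambda r: r["dno"] == r["dnumber"]

# Cascade of sigma: one conjunctive selection equals two chained selections.
conjunctive = select(EMPLOYEE, lambda r: in_dept_1(r) and high_paid(r))
cascade = select(select(EMPLOYEE, in_dept_1), high_paid)

# Commutativity of sigma: swapping the two selections gives the same rows.
swapped = select(select(EMPLOYEE, high_paid), in_dept_1)

# A Cartesian product followed by a selection equals a join.
full_product = cartesian(EMPLOYEE, DEPARTMENT)
product_then_select = select(full_product, same_dept)
joined = join(EMPLOYEE, DEPARTMENT, same_dept)
```

The product holds 6 rows while the join result holds 3, which is the intermediate-size saving the heuristic is after.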
3. Choice of Query Execution Plans
Once the logical tree is optimized, the DBMS makes execution decisions at the physical level.
Materialized vs. Pipelined Evaluation
- Materialized Evaluation: The result of an operation is physically stored on disk as a temporary relation.
- Pipelined Evaluation: Operation results are forwarded directly in memory to the next operation in the sequence, avoiding disk writes.
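The difference can be pictured with Python lists versus generators. This is only an analogy: the list `temp` stands in for a temporary relation written to disk, and the function names are made up for the sketch.

```python
def scan(rows):
    """Yield rows one at a time, like a table scan."""
    for row in rows:
        yield row


def select_pipelined(rows, pred):
    """Pipelined sigma: pass each qualifying row on as soon as it is seen."""
    for row in rows:
        if pred(row):
            yield row


def select_materialized(rows, pred):
    """Materialized sigma: build the complete temporary result first."""
    return [row for row in rows if pred(row)]


data = list(range(1, 11))
is_even = lambda n: n % 2 == 0
big = lambda n: n > 4

# Materialized: each step produces a complete temporary list.
temp = select_materialized(data, is_even)
materialized = select_materialized(temp, big)

# Pipelined: rows flow through both operators with no temporary list.
pipelined = list(select_pipelined(select_pipelined(scan(data), is_even), big))
```

Both strategies return the same rows; they differ only in whether the intermediate result ever exists as a whole.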
Cost-based physical optimization can be approached in two ways:
- Top-down approach: Starts with the overall goal and breaks it into sub-goals.
- Bottom-up approach: Computes the optimal solution for base relations first, moving up the tree. (Used by Dynamic Programming).
Note: Certain physical heuristics make cost calculations unnecessary (e.g., always use an index scan for selections whenever possible).
4. Subqueries, Views & Advanced Joins
Nested Subqueries & View Merging
- Unnesting: Removing the nested query and converting inner/outer queries into one block. (Always possible for IN or ANY).
- Inline Views (FROM clause subqueries): Can be subjected to View Merging, where tables in the view merge with outer block tables.
- Group-By View-Merging: The optimizer decides via cost whether to Group early (reducing data for joins) or Group late (if joins have low selectivity).
Semi-Join and Anti-Join Cost Formulas
Unnesting specific queries leads to Semi-Joins or Anti-Joins. The optimizer calculates their Selectivity (js) and Cardinality (jc).
Semi-Join (For IN / EXISTS clauses)
js = MIN(1, NDV(Y, T2) / NDV(X, T1))
jc = |T1| * js
Anti-Join (For NOT IN / NOT EXISTS clauses)
js = 1 - MIN(1, NDV(Y, T2) / NDV(X, T1))
jc = |T1| * js
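As a worked example, the formulas above can be evaluated directly. The statistics used here (|T1| = 10,000, 500 distinct values of the T1 join column, 125 distinct values of the T2 join column) are hypothetical, and the function names are chosen only for this sketch.

```python
def semi_join_selectivity(ndv_t2, ndv_t1):
    """js for a semi-join: MIN(1, NDV of T2 column / NDV of T1 column)."""
    return min(1.0, ndv_t2 / ndv_t1)


def anti_join_selectivity(ndv_t2, ndv_t1):
    """js for an anti-join: the complement of the semi-join selectivity."""
    return 1.0 - min(1.0, ndv_t2 / ndv_t1)


def join_cardinality(t1_rows, js):
    """jc = |T1| * js."""
    return t1_rows * js


T1_ROWS, NDV_T1, NDV_T2 = 10_000, 500, 125  # assumed catalog statistics

semi_js = semi_join_selectivity(NDV_T2, NDV_T1)   # 0.25
anti_js = anti_join_selectivity(NDV_T2, NDV_T1)   # 0.75
semi_jc = join_cardinality(T1_ROWS, semi_js)      # 2500.0
anti_jc = join_cardinality(T1_ROWS, anti_js)      # 7500.0
```

The two cardinalities add up to |T1|, since every T1 row either has a match in T2 or does not.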
5. Cost-Based Optimization: SELECT Operations
Cost metric evaluates alternatives based on: Access cost to secondary storage (Disk), Computation cost, Memory usage, and Communication.
The DBMS catalog stores File size, Organization, Index levels, and Number of Distinct Values (NDV). For highly skewed data, the RDBMS stores Histograms to accurately calculate Attribute Selectivity and Selection Cardinality.
SELECT Cost Formulas (Disk Block Accesses)
- S1: Linear Search: Cost = b. (If searching for a unique key with equality, average cost is b / 2).
- S2: Binary Search: Cost = log₂(b) + ⌈ s / bfr ⌉ - 1.
- S3a: Primary Index (Single Record): Cost = x + 1.
- S3b: Hash Key: Cost = 1.
- S4: Ordering Index (Multiple Records): Cost = x + (b / 2).
- S5: Clustering Index: Cost = x + ⌈ s / bfr ⌉.
- S6: Secondary B+ Tree Index: Cost = x + 1 + s (Worst Case).
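The formulas are easier to compare when evaluated side by side. The sketch below assumes the usual textbook meanings of the symbols, which these notes do not spell out: b is the number of blocks in the file, s the selection cardinality, bfr the blocking factor, and x the number of index levels. The sample values are hypothetical.

```python
import math


def linear_search(b, unique_key=False):
    """S1: scan all b blocks, or b / 2 on average for a unique-key equality."""
    return b / 2 if unique_key else b


def binary_search(b, s, bfr):
    """S2: log2(b) + ceil(s / bfr) - 1."""
    return math.log2(b) + math.ceil(s / bfr) - 1


def primary_index(x):
    """S3a: descend x index levels, then read one data block."""
    return x + 1


def hash_key():
    """S3b: one block access."""
    return 1


def ordering_index(x, b):
    """S4: x + b / 2."""
    return x + b / 2


def clustering_index(x, s, bfr):
    """S5: x + ceil(s / bfr)."""
    return x + math.ceil(s / bfr)


def secondary_btree_worst(x, s):
    """S6 (worst case): x + 1 + s."""
    return x + 1 + s


# Assumed statistics: 2048 blocks, 80 matching records, 5 records per block,
# and a 2-level index.
b, s, bfr, x = 2048, 80, 5, 2

costs = {
    "S1 linear": linear_search(b),
    "S1 linear (unique key)": linear_search(b, unique_key=True),
    "S2 binary": binary_search(b, s, bfr),
    "S3a primary index": primary_index(x),
    "S3b hash key": hash_key(),
    "S4 ordering index": ordering_index(x, b),
    "S5 clustering index": clustering_index(x, s, bfr),
    "S6 secondary B+ tree": secondary_btree_worst(x, s),
}
cheapest = min(costs, key=costs.get)
```

Note that the strategies are not interchangeable in practice: each one applies only when the matching file organization or index exists, so the optimizer compares only the applicable ones.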
6. Cost Functions: JOIN Operations & Ordering
Join Execution Strategies
- J1: Nested-Loop Join: Costly. Formulas vary depending on available memory buffer blocks (nB).
- J2: Index-Based Nested-Loop: Extremely fast if a secondary index exists on the inner table.
- J3: Sort-Merge Join: Requires adding the cost of sorting if files aren't already sorted on join attributes.
- J4: Partition-Hash Join: Cost formula: 3 * (bR + bS) + Output Cost.
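A small calculation shows how buffer space changes the picture for J1 compared with J4. The partition-hash formula is the one given above. The nested-loop formula used here, bR + ⌈bR / (nB - 2)⌉ * bS, is the commonly quoted textbook form and is an assumption on my part, since these notes only say that it depends on nB. Block counts are hypothetical and the output cost is left at zero.

```python
import math


def nested_loop_cost(b_outer, b_inner, n_buffers, output_blocks=0):
    """J1 (assumed textbook form): read the outer file once and scan the
    inner file once per chunk of (n_buffers - 2) outer blocks.
    Requires n_buffers >= 3."""
    chunks = math.ceil(b_outer / (n_buffers - 2))
    return b_outer + chunks * b_inner + output_blocks


def partition_hash_cost(b_r, b_s, output_blocks=0):
    """J4 as stated in the notes: 3 * (bR + bS) + output cost."""
    return 3 * (b_r + b_s) + output_blocks


B_DEPT, B_EMP = 13, 2000  # hypothetical file sizes in blocks

nl_tight = nested_loop_cost(B_DEPT, B_EMP, n_buffers=3)    # 13 + 13 * 2000
nl_roomy = nested_loop_cost(B_DEPT, B_EMP, n_buffers=15)   # 13 + 1 * 2000
ph = partition_hash_cost(B_DEPT, B_EMP)                    # 3 * 2013
```

With only 3 buffers the nested loop is by far the worst option, but once the smaller file fits in memory it needs a single pass over each file and undercuts the partition-hash join.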
Join Ordering Choices (Dynamic Programming)
When computing multi-relation queries, the optimizer uses Dynamic Programming (where optimal subproblems are solved bottom-up, only once) to evaluate tree shapes.
- Left-Deep Join Trees: Generally preferred! They work perfectly with common join algorithms and are able to generate fully pipelined plans.
- Right-Deep & Bushy Trees: Offer more permutations but are harder to pipeline. (E.g., 5 relations have 120 left-deep permutations, but 1,680 bushy permutations).
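The counts quoted above can be reproduced with a few lines of Python. The closed form used for bushy trees (the number of binary tree shapes with n leaves, multiplied by the n! ways to label the leaves) is standard combinatorics rather than something stated in these notes.

```python
import math


def left_deep_orders(n):
    """Left-deep trees are fixed in shape, so only the n! orderings vary."""
    return math.factorial(n)


def bushy_shapes(n):
    """Binary tree shapes with n leaves: the Catalan number C(n - 1)."""
    return math.comb(2 * (n - 1), n - 1) // n


def bushy_trees(n):
    """Every shape can have its n leaves labeled in n! ways."""
    return bushy_shapes(n) * math.factorial(n)


counts = {n: (left_deep_orders(n), bushy_trees(n)) for n in range(2, 8)}
```

For 5 relations this gives 120 left-deep orders and 1,680 bushy trees, and the gap widens quickly: 7 relations already allow 665,280 bushy trees.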
7. Advanced Issues & Data Warehouses
- Size Estimation of Other Operations: Optimizers must also estimate sizes for Projections, Set operations, Aggregation, and Outer joins.
- Plan Caching: A compiled plan is stored by the optimizer for later use by the identical query running with different parameters.
- Top k-results optimization: If a query asks for only the top k rows (e.g., LIMIT 10), the optimizer can restrict strategy generation to plans that avoid fully sorting massive tables.
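The idea behind top-k optimization can be illustrated outside a DBMS with Python's `heapq`, which keeps only k candidates in memory instead of sorting every row. This is an analogy for the principle, not a description of how any particular optimizer implements it, and the `rows` data is synthetic.

```python
import heapq

# Synthetic table: 1000 rows with distinct scores (1009 is prime).
rows = [{"id": i, "score": (i * 37) % 1009} for i in range(1000)]
K = 10

# Full sort: orders all 1000 rows, then throws away 990 of them.
top_sorted = sorted(rows, key=lambda r: r["score"], reverse=True)[:K]

# Bounded heap: tracks only the K best rows seen so far.
top_heap = heapq.nlargest(K, rows, key=lambda r: r["score"])
```

Both approaches return the same 10 rows, but the heap version does work proportional to n log k rather than n log n.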
Query Optimization in Data Warehouses
Data warehouses heavily rely on Star Transformation Optimization.
- Goal: Access a reduced set of data from the massive central "Fact" table, completely avoiding a full table scan.
- Techniques: Classic star transformation, Bitmap index star transformation (using bitmaps for ultra-fast intersections), and Joining back.
8. Oracle Specifics & Semantic Optimization
Displaying Query Execution Plans
-- Oracle Syntax:
EXPLAIN PLAN FOR <SQL query>
-- IBM DB2 Syntax:
EXPLAIN PLAN SELECTION [additional options] FOR <SQL-query>
-- SQL Server Syntax:
SET SHOWPLAN_TEXT ON or SET SHOWPLAN_XML ON or SET SHOWPLAN_ALL ON
Overview of Query Optimization in Oracle
- Global Query Optimizer: Integrates logical transformations and physical optimization phases into one engine.
- Adaptive Optimization: Uses a feedback loop to improve on previous execution decisions.
- Array Processing: Supported for fetching multiple rows efficiently.
- Hints: Specified by application developers, embedded in SQL text (e.g., /*+ INDEX */) to force access paths or join methods.
- Outlines & SQL Plan Management: Used by DBAs to strictly preserve known good execution plans, preventing the optimizer from changing them.
Semantic Query Optimization
Semantic query optimization uses constraints specified on the database schema (Primary keys, Check constraints, Foreign Keys). Goal: modify one query into another that is drastically more efficient to execute, or instantly return an empty set if the query's conditions contradict a CHECK constraint.
🔥 Core Theory Q&A Preparation
High-yield theoretical concepts tested frequently in finals.
Q: How does Cost-Based optimization differ between Compiled and Interpreted queries?
A: For compiled queries, the optimizer estimates and compares costs at compile time, choosing the best strategy before execution, which is highly efficient. For interpreted queries, this entire complex estimation process occurs dynamically at runtime. This overhead can actually slow down the immediate response time of the query.
Q: Why does Dynamic Programming heavily prefer Left-Deep join trees over Bushy trees?
A: Left-deep trees are structurally designed so that the right-hand input of every join is a base table. This allows the DBMS to generate fully pipelined plans, streaming intermediate results directly in RAM without materializing (writing) temporary tables to disk. Bushy trees often force expensive disk materialization.
Q: What are "Outlines" and "SQL Plan Management" used for in Oracle?
A: They are tools used to provide plan stability. Sometimes, an optimizer might mistakenly choose a worse execution plan after a system update or statistic change. DBAs use Outlines and SQL Plan Management to "lock in" or preserve a historically optimal execution plan, preventing the optimizer from altering it.
🏆 10-Mark Scenario Questions
These advanced scenarios require synthesis of math, logic, and physical architecture.
A database has three tables: PROJECT, DEPARTMENT, and EMPLOYEE. The optimizer restricts its search space to Left-Deep trees.
Based on Dynamic Programming, evaluate the structural difference between generating PROJECT ⋈ DEPARTMENT ⋈ EMPLOYEE versus DEPARTMENT ⋈ EMPLOYEE ⋈ PROJECT. Why must the optimizer evaluate all permutations?
Elite Answer Formulation:
By restricting to Left-Deep trees, a 3-table join yields 3! = 6 permutations (See Slide 37, Table 19.1). The optimizer evaluates all permutations bottom-up because join selectivity radically alters the intermediate file size.
Permutation A: (PROJECT ⋈ DEPARTMENT) ⋈ EMPLOYEE
If PROJECT ⋈ DEPARTMENT results in only 50 rows (a highly selective join), the subsequent join with the massive EMPLOYEE table is very fast, because only 50 intermediate rows have to be matched against EMPLOYEE.
Permutation B: (DEPARTMENT ⋈ EMPLOYEE) ⋈ PROJECT
If DEPARTMENT ⋈ EMPLOYEE results in 10,000 rows (a much less selective join), the system must pipeline a 10,000-row intermediate result into the final join with PROJECT, consuming far more buffer memory and CPU time.
Conclusion: The optimizer evaluates all left-deep permutations to find the order that produces the smallest intermediate relations first, applying the core heuristic of reducing data volume early.
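The comparison can be made concrete by enumerating all six left-deep orders and summing the rows of the intermediate results. Everything numeric here is hypothetical: the cardinalities and join selectivities are chosen so that the two first joins produce the 50 and 10,000 rows used in the answer, and PROJECT and EMPLOYEE are assumed to share no join predicate. The sketch uses brute-force enumeration for clarity; real dynamic programming would additionally reuse the cost of shared sub-plans.

```python
from fractions import Fraction
from itertools import combinations, permutations

# Assumed statistics (not from the chapter).
CARD = {"PROJECT": 50, "DEPARTMENT": 10, "EMPLOYEE": 10_000}
SEL = {
    frozenset({"PROJECT", "DEPARTMENT"}): Fraction(1, 10),
    frozenset({"DEPARTMENT", "EMPLOYEE"}): Fraction(1, 10),
}


def result_size(tables):
    """Estimated rows when joining the given tables: the product of their
    cardinalities times the selectivity of every applicable predicate.
    Pairs with no predicate behave like a Cartesian product (selectivity 1)."""
    size = Fraction(1)
    for t in tables:
        size *= CARD[t]
    for a, b in combinations(tables, 2):
        size *= SEL.get(frozenset({a, b}), Fraction(1))
    return size


def intermediate_rows(order):
    """Total rows of every intermediate result before the final join."""
    return sum(result_size(order[:k]) for k in range(2, len(order)))


plans = {order: intermediate_rows(order) for order in permutations(CARD)}
best = min(plans, key=plans.get)
```

Orders that start with PROJECT and EMPLOYEE fare worst (500,000 intermediate rows) because their first step is effectively a Cartesian product, which is exactly what the heuristic rules try to avoid.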
A Data Warehouse contains a 10-Billion row SALES_FACT table, connected to small TIME, STORE, and PRODUCT dimension tables. A query requests total sales for "Store X" during "December" for "Product Y".
Explain why standard Cost-Based Optimization fails here, and how Bitmap Index Star Transformation solves it.
Elite Answer Formulation:
- CBO Failure: Standard CBO might attempt to join the massive SALES_FACT table with STORE first. Even filtered, this intermediate table might contain 100 million rows, forcing a catastrophic full table scan and massive memory consumption.
- Bitmap Star Transformation: The optimizer avoids touching the Fact table directly. Instead:
1. It queries the small dimensions (Store X, Dec, Prod Y).
2. It retrieves the Bitmap Indexes associated with those dimension keys on the Fact table.
3. It performs a rapid, CPU-only logical AND intersection of these bitmaps in memory.
4. Joining Back: It uses the resulting intersected bitmap to fetch only the exact matching rows directly from the SALES_FACT table, completely bypassing a full table scan.
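The bitmap intersection in steps 2 to 4 can be mimicked with Python integers used as bit vectors. This is a toy model with two simplifications: the fact rows store dimension values directly instead of foreign keys, and the bitmaps are built on the fly by scanning the rows, whereas a real bitmap index would already exist. All data and function names are hypothetical.

```python
def bitmap(rows, column, value):
    """Return an int whose bit i is set when rows[i][column] == value."""
    bits = 0
    for i, row in enumerate(rows):
        if row[column] == value:
            bits |= 1 << i
    return bits


def matching_row_ids(bits):
    """List the positions of the set bits, lowest first."""
    ids, i = [], 0
    while bits:
        if bits & 1:
            ids.append(i)
        bits >>= 1
        i += 1
    return ids


SALES_FACT = [
    {"store": "X", "month": "Dec", "product": "Y", "amount": 10},
    {"store": "X", "month": "Nov", "product": "Y", "amount": 20},
    {"store": "Z", "month": "Dec", "product": "Y", "amount": 30},
    {"store": "X", "month": "Dec", "product": "Y", "amount": 40},
    {"store": "X", "month": "Dec", "product": "Q", "amount": 50},
]

# Step 2: one bitmap per dimension filter.
store_bits = bitmap(SALES_FACT, "store", "X")
month_bits = bitmap(SALES_FACT, "month", "Dec")
product_bits = bitmap(SALES_FACT, "product", "Y")

# Step 3: a single bitwise AND intersects all three filters.
combined = store_bits & month_bits & product_bits

# Step 4: fetch only the rows whose bit survived the intersection.
row_ids = matching_row_ids(combined)
total_sales = sum(SALES_FACT[i]["amount"] for i in row_ids)
```

Only rows 0 and 3 satisfy all three filters, so only those two fact rows are fetched to compute the total.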
A web application features a search bar that displays "Top 5 results". Users frequently search for the exact same term (e.g., "iPhone"). The backend query involves complex Subqueries and Anti-Joins.
Identify and explain three specific advanced optimization techniques the DBMS will use to ensure instant response times.
Elite Answer Formulation:
- Plan Caching: Because users repeatedly search the exact same query structure (just different parameters), the optimizer compiles the complex execution plan once, and stores it in the Plan Cache. Subsequent searches bypass the expensive compile-time cost analysis entirely.
- Top K-Results Optimization: Because the app only requests 5 results (LIMIT 5), the optimizer limits strategy generation. Instead of generating a plan that executes a full Sort-Merge join on millions of rows, it uses a non-blocking pipelined strategy that stops execution as soon as the 5th match is found.
- Anti-Join Unnesting: If the subqueries use NOT EXISTS, the optimizer's unnesting logic converts them into Anti-Joins, allowing the DBMS to use highly efficient Hash Anti-Joins rather than evaluating the subquery row-by-row.
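Of the three techniques, plan caching is the simplest to sketch. The toy class below keys the cache on the parameterized SQL text, so different search terms reuse one compiled plan. The class, the query string, and the tuple standing in for a plan are all hypothetical; a real DBMS would also track statistics changes and invalidate stale plans.

```python
class PlanCache:
    """Toy plan cache keyed by the parameterized SQL text."""

    def __init__(self):
        self.plans = {}
        self.compilations = 0

    def _compile(self, sql_text):
        # Stand-in for the expensive cost-based optimization step.
        self.compilations += 1
        return ("PLAN", sql_text)

    def get_plan(self, sql_text):
        """Return the cached plan, compiling it only on the first request."""
        if sql_text not in self.plans:
            self.plans[sql_text] = self._compile(sql_text)
        return self.plans[sql_text]


cache = PlanCache()
QUERY = "SELECT name FROM products WHERE name LIKE ? LIMIT 5"

# Three searches with different parameter values share one query text.
plans = [cache.get_plan(QUERY) for _term in ("iPhone", "iPhone", "Pixel")]
```

Three searches trigger a single compilation because the parameter value is not part of the cache key.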
An enterprise schema enforces a Foreign Key constraint: every Order must belong to a valid Customer.
A developer writes: SELECT o.OrderID, c.CustomerID FROM Orders o JOIN Customers c ON o.CustomerID = c.CustomerID;
Explain how Semantic Optimization transforms this query compared to standard Heuristics.
Elite Answer Formulation:
- Standard Heuristics: A heuristic optimizer sees a JOIN and projection. It will ensure projections are pushed down, but it will still physically execute the JOIN between Orders and Customers, consuming CPU and memory to match the keys.
- Semantic Optimization: The semantic optimizer analyzes the database schema constraints. It realizes that because of the strict Foreign Key constraint, it is mathematically impossible for an Order to exist without a matching Customer. Furthermore, the query only asks for o.OrderID and c.CustomerID (which is identical to o.CustomerID).
- The Rewrite: Semantic optimization completely eliminates the JOIN. It rewrites the query internally to: SELECT OrderID, CustomerID FROM Orders; saving massive amounts of disk I/O and processing time.
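The equivalence that justifies this rewrite can be checked with Python's built-in `sqlite3` module. The sketch only shows that the two queries return the same rows when the foreign key is enforced and the column is NOT NULL; it does not claim that SQLite itself performs this rewrite. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY);
CREATE TABLE Orders (
    OrderID    INTEGER PRIMARY KEY,
    CustomerID INTEGER NOT NULL REFERENCES Customers(CustomerID)
);
INSERT INTO Customers VALUES (1), (2), (3);
INSERT INTO Orders VALUES (10, 1), (11, 1), (12, 3);
""")

# The developer's original query.
with_join = conn.execute(
    "SELECT o.OrderID, c.CustomerID FROM Orders o "
    "JOIN Customers c ON o.CustomerID = c.CustomerID ORDER BY o.OrderID"
).fetchall()

# The semantically rewritten query without the join.
without_join = conn.execute(
    "SELECT OrderID, CustomerID FROM Orders ORDER BY OrderID"
).fetchall()

# The constraint is what makes the rewrite safe: an orphan order is rejected.
try:
    conn.execute("INSERT INTO Orders VALUES (13, 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

If the foreign key column allowed NULLs or the constraint were not enforced, the join could drop rows and the two queries would no longer be interchangeable.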