A Decision Theoretic Cost Model for Dynamic Plans
Abstract

Since the classic optimization work in System R, query optimization has completely preceded query evaluation. Unfortunately, errors in cost model parameters such as selectivity estimates compromise the optimality of query evaluation plans optimized at compile time. The only promising remedy is to interleave strategy selection and data access using run-time-dynamic plans. Based on the principles of decision theory, our cost model enables careful query analysis and prepares alternative query evaluation plans at compile time, delaying relatively few, selected decisions until run time. In our prototype optimizer, these run-time decisions are based only on those materialized intermediate results for which materialization costs are expected to be less than the benefits from the improved decision quality.

Introduction
Query processing can be divided into query optimization and query execution; this dichotomy has been followed almost exclusively ever since the classic System R optimizer paper [SAC+79]. For queries with 10, 20, or 100 operators, this clear-cut distinction is no longer applicable. If each operator's selectivity is consistently underestimated by a mere 10%, the error after 10 operators is a factor of 2.8; the error after 20 operators is a factor of 8.2; and after 100 operators, it is a factor of 37,649. Beyond a level of query complexity somewhere between 10 and 20 operators, selectivity and cost estimates are more guesses than predictions, and plan choices are more gambles than strategic decisions [CVL+84, MCS88, IK91, Loh89]. Note that queries involving views, queries generated by CASE tools, and data warehouse queries [KRRT98, Bri99] in particular often involve hundreds of operators. Thus, the problem addressed here is of great interest to database system vendors and implementors.
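As a quick check of these figures, the short snippet below (Python) computes the compounded error factor (1/0.9)^n resulting from a consistent 10% selectivity underestimate per operator:

```python
# Compounding effect of a consistent 10% selectivity underestimate:
# the true-to-estimated ratio per operator is 1/0.9, so after n operators
# the overall estimate is off by a factor of (1/0.9)**n.
for n in (10, 20, 100):
    error_factor = (1 / 0.9) ** n
    print(f"after {n:3d} operators: estimate off by a factor of {error_factor:,.1f}")
# roughly 2.87, 8.22, and 37,649 -- the factors quoted above.
```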
Not only estimates for intermediate result sizes, but also forecasts for available resources may err. The resource situation, e.g., available memory, processors, temporary disk storage, network bandwidth, internet site availability, and so on, may change between compilation and execution, even from the time when a complex query is started to the time when the query completes its last operation. Since cost calculations and decisions among alternative plans depend on input sizes and available resources, selected optimization decisions and resource allocation decisions must be shifted from compile time into run time.
Research into this problem has taken several paths: improving the optimizer's statistical profile of the database system [MCS88, MRL98]; building static plans using a least-expected-cost metric [CHS99]; building start-up-time-dynamic plans [CG94]; materializing all intermediate results, i.e., Ingres Decomposition [WY76]; using intra-algorithm, i.e., record-level, adaptation [DG94, HRR99, AH00]; decision theoretic techniques based on sampling [SBM93]; competitions between algorithms [Ant96]; heuristic approaches to materializing intermediate results [UFA98, BFMV00]; and partial plan re-optimization at run time [KD98, NWMN99, IFF+99]. Note that many of these techniques are compatible with each other.
In the industrial arena, only the Informix Red Brick Decision Server [Bri99] delays preselected optimization decisions until run time. Red Brick applies information available at run time specifically to optimization decisions for star-schema queries [KRRT98, Bri99]. The server materializes the intermediate results of preliminary plans, each one composed from a star-schema dimension table, i.e., a primary-key-defining table referenced by a foreign-key-defining fact table. (Note that fact tables may contain billions of records.) Based on the materialized intermediate results, cost-based, run-time decisions are made between compile-time-generated alternative plans. Run-time decisions are also made between fact table orders, alternative join indexes, and view maintenance algorithms, as well as the final determination of data partitioning functions and degrees of query parallelism.
In combination with join algorithms specialized for star-schema query processing, the run-time techniques employed by the Red Brick server enable efficient evaluation of complex data warehouse queries involving many fact and dimension tables. One successful application's schema design involves 33 fact tables and 160 dimension tables. In this context, queries routinely join as many as 32 tables. Unfortunately, because of the difficulty of quantifying the advantages of these run-time techniques, their application is limited to star-schema query plans and sub-plans. What we need is a generalized compile-time approach to quantifying the advantages of such run-time decisions.
In this short paper, we briefly describe an extension of our previous work on start-up-time-dynamic plans [CG94], in order to address the problems posed by complex queries and uncertain cost model parameters. Start-up-time-dynamic query evaluation plans are dynamic plans implementing decisions, typically made at compile time, that are delayed until the start of a query's evaluation. Run-time-dynamic query evaluation plans are plans that make decisions between alternative, potentially optimal, algorithms, operator orderings, and plan shapes based upon additional knowledge obtained while evaluating the plan. For example, in our prototype run-time-dynamic plan optimizer [Col99], this additional knowledge is obtained by materializing intermediate results, thus providing exact cardinalities and allowing accurate planning of subsequent query evaluation. Materializing intermediate results may also supply additional information about the distributions of attributes contained in the result, e.g., dynamically constructed histograms of join attribute values may facilitate improved costing of join algorithm alternatives [KD98].
The difficult aspect of this approach is determining which subsets of plan operators to materialize. Because interrupting the plan's data flow and materializing intermediate data has an associated cost, the compile-time optimizer must trade off this cost against the benefit of additional knowledge and subsequently improved query evaluation. The techniques we employ to solve this problem are provided by statistical decision theory, and it is the generalized application of decision theory to query optimization that is the key contribution of this work.

Cost Model
A cost model that optimizes run-time-dynamic plans must capture, first, the benefit of run-time decisions and, second, the cost of obtaining the run-time information needed to make these decisions. Because we restrict ourselves to complete materialization of data, rather than partial sampling, we simplify the cost calculation of obtaining run-time information: it is simply the cost of the materialization itself, which is a function of the expected number and size of records to be read from and written to a temporary result. In order to capture the benefit of run-time decisions we use the techniques of decision theory [RS68, Lin72].
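As a minimal sketch of this materialization cost, assume a simple I/O model in which the expected intermediate result is written to a temporary file once and read back once; the function name, page size, and cost weight below are hypothetical and not taken from the prototype:

```python
def materialization_cost(expected_rows, row_bytes, page_bytes=8192,
                         io_cost_per_page=1.0):
    """Hypothetical cost of obtaining run-time information by complete
    materialization: write the expected intermediate result once and
    read it back once."""
    pages = (expected_rows * row_bytes + page_bytes - 1) // page_bytes
    return 2 * max(1, pages) * io_cost_per_page  # one write pass + one read pass

# Example: 1,000,000 expected rows of 100 bytes -> about 12,208 pages, cost ~24,416.
print(materialization_cost(1_000_000, 100))
```

A realistic model would also charge CPU per record and account for buffering, but the essential point is that this cost grows with the expected result size and can therefore be traded off against the expected benefit computed below.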
Decision Theory

There are several elements involved in making a decision (largely from Raiffa and Schlaifer [RS68]): the set of possible actions, A, where a is an action chosen from the domain of actions A; the state of the world, θ, which cannot be predicted with certainty; a set of possible experiments, E, where an experiment e produces an observed outcome z; a utility function, u(e, z, a, θ), that captures the benefit and cost of performing an experiment, observing an outcome, and taking a particular action; and we have prior estimates of the probabilities of the various states of the world before and after an experiment.
How do these principles apply to our problem? To optimize run-time-dynamic plans we apply them as follows: the set of actions, A, is the set of alternative query evaluation algorithms and plans; the state of the world, θ, consists of known or estimated distributions for the parameters of the cost model, e.g., predicate selectivities or resources such as available memory; the set of possible experiments, E, consists of techniques for obtaining better estimates for distributions on θ that are only known approximately, e.g., sampling techniques or, as in our case, intermediate result materialization; the cost of an experiment, e, depends on the potential materialized outcome, z, since its cost is a function of the number of records materialized; in decision theory one typically maximizes a utility, u; however, rather than defining utility as negative cost, we minimize a cost, c, a perspective compatible with typical query optimizers; and we have prior probabilities for the uncertain parameters of the cost model, such as predicate selectivities or available resources, that capture their accuracy.

The Value of Information
Since the utility or cost of our experiments and actions is additive, we can think about the increase in the utility, or in our case the decrease in optimizer cost, of an action for a given experiment. In decision theory, this increase is called the value of information. For a given experimental outcome we can compute the conditional value of information, but since we do not know these outcomes a priori we must resort to computing the expected value of information. Before running an experiment, e.g., before materializing intermediate results, we typically have some idea about the expected value of the cost model parameter, in this case intermediate result cardinality, and the accuracy with which we know that expected value. That is, we have a probability distribution for the expected value.¹ Using this previously known distribution, referred to as the prior distribution, we compute the conditional cost with original information (CCWOI), conditional with respect to a and θ, and the expected cost with original information (ECWOI) as follows.

The product $c(a, \theta)\, p(\theta)$ is the computation of the CCWOI: a cost that is a function of the action taken, a (i.e., the algorithm used), and the cardinality of the algorithm's input, θ, weighted by the probability density function representing our prior knowledge of the distribution of that cardinality, $p(\theta)$. Integrating their product over the range of θ (the state of the world), and minimizing over a (the algorithm set), gives the expected cost of making a particular algorithm choice, i.e., the ECWOI is

$\mathrm{ECWOI} = \min_{a} \int_{\theta} c(a, \theta)\, p(\theta)\, d\theta.$
Even without further development, the expected cost with original information is already a powerful decision theoretic concept when applied to a cost-based query optimizer. By minimizing the ECWOI, a query optimizer can build a static plan that has increased robustness without requiring start-up-time or run-time decisions. (This concept is similar to that explored by Chu et al. [CHS99].)
If we materialize intermediate query results, then for decisions between algorithms that directly depend on these results we will have perfect information for making decisions at run time. We can compute the conditional cost with perfect information (CCWPI), conditional on θ, and the expected cost with perfect information (ECWPI) as

$\mathrm{ECWPI} = \int_{\theta} \min_{a} c(a, \theta)\, p(\theta)\, d\theta.$

Note the reversed order of the integration and minimization operations in these equations when compared to the ECWOI. From the ECWPI and ECWOI we may compute the expected value of perfect information, or EVPI, as ECWOI - ECWPI, providing an upper bound on the value of a perfect experiment.
¹In lieu of continuous probability distributions we could instead use discrete distributions, as provided by histograms, etc.
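The sketch below works through ECWOI, ECWPI, and EVPI with a two-point discrete prior (the histogram case mentioned in the footnote). The two algorithms, their cost functions, and the probabilities are invented solely for illustration:

```python
# Discrete illustration of ECWOI, ECWPI, and EVPI.
# theta: possible input cardinalities; prior[theta]: prior probability p(theta).
prior = {1_000: 0.5, 100_000: 0.5}

# Hypothetical per-algorithm cost functions of input cardinality.
costs = {
    "nested-loops": lambda n: n * n / 100_000,   # quadratic: cheap for small inputs
    "hash-join":    lambda n: 5_000 + 0.5 * n,   # linear: cheaper for large inputs
}

# ECWOI: commit to one algorithm up front, minimizing the prior-weighted cost.
ecwoi = min(sum(p * c(n) for n, p in prior.items()) for c in costs.values())

# ECWPI: with perfect knowledge of the cardinality, pick the best algorithm
# for each outcome, then take the expectation over the prior.
ecwpi = sum(p * min(c(n) for c in costs.values()) for n, p in prior.items())

evpi = ecwoi - ecwpi   # upper bound on what any experiment can be worth
print(ecwoi, ecwpi, evpi)   # 30250.0 27505.0 2745.0
```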
The Value of Materialized Information

Materializing data may be an effective technique to obtain information for decision making immediately after materialization, in which case we have perfect information about cardinality and perhaps other information as well, e.g., domain histograms. However, the effectiveness of materializing data is reduced as subsequent operators operate on the materialized result. Decisions between alternative algorithms situated more than a single operator above (in the graph of operators) become less than perfect, as intervening operators may introduce additional uncertainty, and the cost of materialization may ultimately outweigh its reduced benefit. The effect of this uncertainty is captured by the expected cost with materialized information (ECWMI).
To develop the ECWMI we use the extensive form of statistical analysis [Lin72]. Given the decision tree in Figure 1, where circles denote random variables and rectangles denote decisions, a formulation is developed to maximize the value of the experiment. (Causality is in the direction of the arrows, from left to right and bottom to top.) First an experiment, e, is chosen, which produces experimental data, z. Based upon this information we choose an action, a, resulting in outcome θ and utility u(e, z, a, θ). Working backwards, the utility is first averaged over θ, where the applicable distribution is $p(\theta \mid z, e)$; it is then maximized over the set of available actions, A, i.e., $\max_{a} \int_{\theta} u(e, z, a, \theta)\, p(\theta \mid z, e)\, d\theta$; finally, this is averaged over the values of z, where the applicable distribution is $p(z \mid e)$, giving

$\int_{z} \max_{a} \int_{\theta} u(e, z, a, \theta)\, p(\theta \mid z, e)\, d\theta\; p(z \mid e)\, dz.$
The optimization of run-time-dynamic plans is analogous to the decision tree of Figure 1. The experiment performed is the materialization of intermediate query results; the result of this experiment, z, is the size of the preliminary plan's intermediate result; the set of actions, A, is the set of alternative sub-plans available to Choose-Plan; and the final outcome, θ, is the actual result cardinality. Because we have chosen a priori to materialize intermediate results, the experiment, e, is fixed, i.e., we either materialize an intermediate result entirely or not at all. Therefore, we may dispense with the terms involving e. A further simplification can be made because we know the utility is not directly influenced by the experimental outcome: the cost of an algorithm in this example depends only on the cardinality of its immediate input, so the utility u(e, z, a, θ) reduces to a cost c(a, θ). Therefore the ECWMI is

$\mathrm{ECWMI} = \int_{z} \min_{a} \int_{\theta} c(a, \theta)\, p(\theta \mid z)\, d\theta\; p(z)\, dz,$

and the expected value of materialized information (EVMI) is ECWOI - ECWMI.
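The following sketch illustrates the ECWMI with discrete distributions, reusing the two hypothetical cost functions from the earlier sketch; the conditional distribution p(theta | z) models the uncertainty an intervening operator adds between the materialized cardinality z and the cardinality theta actually seen by the algorithms. All values are invented for illustration:

```python
# Discrete illustration of the ECWMI.
# z: materialized cardinality of the preliminary plan, with prior p(z).
p_z = {1_000: 0.5, 100_000: 0.5}

# p(theta | z): an intervening operator blurs the materialized cardinality,
# so the decision-relevant input size theta is only known approximately.
p_theta_given_z = {
    1_000:   {500: 0.6, 5_000: 0.4},
    100_000: {50_000: 0.3, 150_000: 0.7},
}

costs = {
    "nested-loops": lambda n: n * n / 100_000,
    "hash-join":    lambda n: 5_000 + 0.5 * n,
}

# ECWMI: for each materialized outcome z, pick the algorithm minimizing the
# expected cost under p(theta | z); then average over z under p(z).
ecwmi = sum(
    p_z[z] * min(
        sum(p * c(theta) for theta, p in p_theta_given_z[z].items())
        for c in costs.values()
    )
    for z in p_z
)
print(ecwmi)   # 32550.75; EVMI = ECWOI - ECWMI for the corresponding prior on theta
```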
An Extensive Form Example
Consider Figure 2, with static plans in diagrams (a) and (b), composed from generic algorithms Alg-A and Alg-B, both having inputs Alg-Y and Alg-X, and in diagram (c) the run-time-dynamic plan having two alternatives, one of which will be chosen at run time based upon the cardinality of preliminary plan Alg-X, materialized by the Choose-Plan operator as part of its decision procedure.

Figure 2: Run-Time-Dynamic Plan with Additional Uncertainty

If algorithm Alg-Y also introduces uncertainty in the input cardinality, and uncertainty in the costs of algorithms Alg-A and Alg-B, then its effect must be considered in the design of the cost model.² This additional uncertainty is exactly what the ECWMI captures. In this example, the experiment performed is the materialization of the intermediate result produced by Alg-X; therefore we replace z with the materialized cardinality, x, in order to associate it with Alg-X. Similarly, the final outcome is the intermediate result cardinality of Alg-Y, y, which replaces θ in our original definition of the ECWMI. With these modifications, the ECWMI of the example run-time-dynamic plan in Figure 2(c) is

$\int_{x} \min_{a} \int_{y} c(a, y)\, p(y \mid x)\, dy\; p(x)\, dx,$

and after substituting for the costs of the specific alternative algorithms we have

$\int_{x} \min\!\left( \int_{y} c_{A}(y)\, p(y \mid x)\, dy,\; \int_{y} c_{B}(y)\, p(y \mid x)\, dy \right) p(x)\, dx.$

² Alg-Y need not be a single operator; it is representative of an arbitrarily complex sub-plan. Our decision analysis applies whether Alg-Y is a single operator or a sub-plan composed of multiple operators.
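At run time, the Choose-Plan operator's decision in this example reduces to the inner minimization above: given the materialized cardinality x of Alg-X's result, pick the alternative whose expected cost under p(y | x) is smaller. A minimal sketch, with an invented conditional distribution and hypothetical cost functions:

```python
def choose_plan(x, p_y_given_x, cost_a, cost_b):
    """Run-time decision of the Choose-Plan operator: given the materialized
    cardinality x of Alg-X, pick the alternative with the smaller expected
    cost over the (still uncertain) output cardinality y of Alg-Y."""
    dist = p_y_given_x(x)
    expected_a = sum(p * cost_a(y) for y, p in dist.items())
    expected_b = sum(p * cost_b(y) for y, p in dist.items())
    return "Alg-A" if expected_a <= expected_b else "Alg-B"

# Hypothetical model: Alg-Y roughly halves the cardinality, with some spread.
p_y_given_x = lambda x: {x // 2: 0.6, x: 0.2, x // 4: 0.2}
cost_a = lambda y: y * y / 100_000   # quadratic alternative
cost_b = lambda y: 5_000 + 0.5 * y   # linear alternative
print(choose_plan(2_000,   p_y_given_x, cost_a, cost_b))   # Alg-A (small input)
print(choose_plan(200_000, p_y_given_x, cost_a, cost_b))   # Alg-B (large input)
```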
A Prototype Plan and Preliminary Results
Figure 3 presents the graph of a dynamic plan produced by a prototype optimizer that applies the decision theoretic concepts just described [Col99]. Each node in the graph represents a typical operator [Gra93], with its name abbreviated as follows: EX - Execute (performs plan initialization), CP - Choose-Plan (evaluate preliminary plan(s) and choose from alternative plans), HJ - Hash-Join, FL - Filter (apply a predicate), FS - File-Scan (sequentially scan a file), FJ - Functional-Join (row identifier based join), and BS - Btree-Scan (with range restricting predicate). Each edge in a plan graph has an associated numeric label indicating the absolute order of inputs for an operator. For example, label 0 indicates the first input of an operator, label 1 the second, and so on. Preliminary plans, i.e., materialized sub-plans, are identified by dashed edges proceeding directly from the root node of a preliminary plan to the Choose-Plan operator(s) benefiting from its materialization.
The plan is for a four-table join query having equijoin predicates and other, local predicates on all base tables. For one of these local predicates, the statistical profile of the database indicated much uncertainty in the predicate's selectivity. In this run-time-dynamic plan, there are many plan alternatives because of the decisions to be made between different base data algorithms, Functional-Join:Btree-Scan versus Filter:File-Scan, and different join orders for the Hash-Joins. Note that the operator that implements the uncertain predicate has been identified and its intermediate result is planned to be materialized. The materialized information is shown to be propagated to all Choose-Plan operators and determines all run-time choices. In the decision theoretic sense, the evaluation of this plan will be optimal.
Figure 3: Discovering Localized Uncertainty: A Prototype Dynamic Plan

Such qualitative results are encouraging, as are preliminary quantitative results. Large cost improvements have been realized for reasonable increases in compile-time optimization effort. For simple queries of a single table, experiments are easily constructed that produce relative cost improvements, comparing the cost of dynamic plans to their analogous static plans, of more than a factor of three for very little additional optimization effort. More complex queries can produce similar improvements for an increase in optimization time that is reasonable depending on the context. For example, a query having 20 tables, 39 predicates, and 60 operators (chosen from among many equivalent physical forms) was optimized in 20 seconds, compared to 1 second for the analogous static plan.
In absolute terms, 20 seconds is an acceptable optimization time, especially when the result of that optimization is amortized over many query evaluations via precompiled plans.

Conclusion
The application of decision theory to query optimization was first considered by Seppi et al. [SBM93]. The focus of Seppi's work was on the application of Bayesian decision theory to three selected database query optimization problems: making decisions between alternative algorithms for evaluating simple selection predicates, join predicates over distributed relations, and the transitive closure of binary relations. The emphasis was on preposterior analysis, used to compute the optimal amount of sampling information to obtain before making a decision. (A later application of decision theory to query optimization was proposed by Chu et al. [CHS99]; however, the resulting plans remain static and there appears to be no adaptation during run time.)
Rather than focusing on any one specific query algorithm, we take a general approach that is independent of the algorithms under consideration, one that is useful for modeling the effective cost of inter-operator uncertainty. We also avoid dependency on any one specific distribution or family of data distributions, e.g., the exponential family of distributions as in Seppi's work. Finally, our emphasis is on the terminal analysis, i.e., analysis of the expected cost of algorithm decisions, rather than on the preposterior analysis, i.e., analysis of the expected effect of different sampling plans.
The accumulation of even small errors in the estimation of a predicate's selectivity can result in static query evaluation plans that are poorly optimized. Run-time-dynamic query evaluation plans are a promising approach to solving this problem of uncertain cost model parameters. To our knowledge, there has been no other work that proposes the generalized application of decision theory to query optimization for run-time adaptation.

References
[AH00] R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In Proc. ACM SIGMOD (to appear), Dallas, TX, May 2000.
[Ant96] G. Antoshenkov. Query processing and optimization in Oracle Rdb. The VLDB Journal, 5(4):229-237, January 1996.
[BFMV00] L. Bouganim, F. Fabret, C. Mohan, and P. Valduriez. Dynamic query scheduling in data integration systems. In Proc. IEEE Int'l. Conf. on Data Eng., pages 425-434, San Diego, CA, February 2000.
[Bri99] Red Brick. Informix Red Brick Decision Server (Version 6.0): Administrator's Guide. Informix Software, Inc., Los Gatos, CA, 1999.
[CG94] R. L. Cole and G. Graefe. Optimization of dynamic query evaluation plans. In Proc. ACM SIGMOD, Minneapolis, MN, May 1994.
[CHS99] F. Chu, J. Y. Halpern, and P. Seshadri. Least expected cost query optimization: An exercise in utility. In Proc. ACM SIGACT-SIGMOD Symp. on Principles of Database Sys., Philadelphia, PA, May 1999.
[Col99] R. L. Cole. Optimizing dynamic query evaluation plans. Ph.D. thesis, University of Colorado at Boulder, 1999.
[CVL+84] S. Christodoulakis, J. Vanderbroek, J. Li, T. Li, S. Wan, Y. Wang, M. Papa, and E. Bertino. Development of a multimedia information system for an office environment. In Proc. Int'l. Conf. on Very Large Data Bases, page 261, Singapore, August 1984.
[DG94] D. L. Davison and G. Graefe. Memory-contention responsive hash joins. In Proc. Int'l. Conf. on Very Large Data Bases, page 379, Santiago de Chile, September 1994.
[Gra93] G. Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73-170, June 1993.
[HRR99] J. M. Hellerstein, V. Raman, and B. Raman. Online dynamic reordering for interactive data processing. In Proc. Int'l. Conf. on Very Large Data Bases, September 1999.
[IFF+99] Z. G. Ives, D. Florescu, M. Friedman, A. Levy, and D. S. Weld. An adaptive query execution system for data integration. In Proc. ACM SIGMOD, pages 299-310, Philadelphia, PA, May 1999.
[IK91] Y. E. Ioannidis and Y. C. Kang. Left-deep vs. bushy trees: An analysis of strategy spaces and its implications for query optimization. In Proc. ACM SIGMOD, page 168, Denver, CO, May 1991.
[KD98] N. Kabra and D. J. DeWitt. Efficient mid-query re-optimization of sub-optimal query execution plans. In Proc. ACM SIGMOD, pages 106-117, Seattle, WA, June 1998.
[KRRT98] R. Kimball, L. Reeves, M. Ross, and W. Thornthwaite. The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses. John Wiley & Sons, Inc., August 1998.
[Lin72] D. V. Lindley. Bayesian Statistics, A Review. SIAM, Philadelphia, PA, 1972.
[Loh89] G. M. Lohman. Is query optimization a 'solved' problem? In G. Graefe, editor, Proc. Workshop on Database Query Optimization, page 13, Beaverton, OR, May 1989. Oregon Graduate Center CS TR 89-005.
[MCS88] M. V. Mannino, P. Chu, and T. Sager. Statistical profile estimation in database systems. ACM Computing Surveys, 20(3):191, September 1988.
[MRL98] G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In Proc. ACM SIGMOD, pages 426-435, Seattle, WA, June 1998.
[NWMN99] K. W. Ng, Z. Wang, R. R. Muntz, and S. Nittel. Dynamic query re-optimization. In Proc. Conf. on Scientific and Statistical Database Management, July 1999.
[RS68] H. Raiffa and R. Schlaifer. Applied Statistical Decision Theory. MIT Press, Cambridge, MA, 1968.
[SAC+79] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proc. ACM SIGMOD, page 23, Boston, MA, May-June 1979. Reprinted in M. Stonebraker, Readings in Database Sys., Morgan Kaufmann, San Mateo, CA, 1988.
[SBM93] K. Seppi, J. Barnes, and C. Morris. A Bayesian approach to query optimization in large scale data bases. ORSA J. on Computing, Fall 1993.
[UFA98] T. Urhan, M. J. Franklin, and L. Amsaleg. Cost-based query scrambling for initial delays. In Proc. ACM SIGMOD, Seattle, WA, June 1998.
[WY76] E. Wong and K. Youssefi. Decomposition - A strategy for query processing. ACM Trans. on Database Sys., 1(3):223, September 1976.