Design and Modeling for Computer Experiments
From charlesreid1
Fang, Kai-Tai; Li, Runze; Sudjianto, Agus (2006). Design and Modeling for Computer Experiments. Chapman and Hall/CRC.
Chapter 1: Introduction
Concepts/Definitions
- Factor - controllable variable that is of interest in the experiment
- quantitative vs. qualitative
- quantitative - can be measured on numerical scale (e.g. T, P, ratio, rxn rate, etc.)
- qualitative - values are categories (e.g. operators, material type, etc.); also called categorical factor or indicator factor
- computer experiments: factor = input variable
- Experimental domain - hypercube of all possible factor values (also called input variable space)
- DC approach: this is the initial hypercube $ H = \prod_{i=1}^{s} \left[ \alpha_i, \beta_i \right] $
- Run/trial - implementation of level-combination in experimental environment
- Computer experiments: no random error, trials are deterministic
- Response - result of a run/trial based on purpose of experiment
- Can be a function: functional response
- Chapter 7: computer experiments with functional responses
- Responses also called outputs
- Factorial design - set of level-combinations with purpose of estimating main effects, interaction effects among factors
- symmetric - all factors have same number of levels
- asymmetric - factors have diff. numbers of levels
- full factorial design - all level combinations appear
- fractional factorial design - subset of all level combinations
- ANOVA models - factorial designs are based on statistical model
- e.g. 1 factor experiment, q levels, expressed as:
$ y_{ij} = \mu + \alpha_j + \epsilon_{ij} = \mu_{j} + \epsilon_{ij} , j = 1 \dots q, i = 1 \dots n_j $
- $ \mu $ - overall mean of y
- $ \mu_j $ - true value of response y at $ x_j $
- $ \epsilon_{ij} $ - random error in ith replication of jth level of x ($ x_j $)
- all errors $ \epsilon_{ij} $ assumed independently/identically distributed according to $ N(0,\sigma^2) $
- mean decomposed into $ \mu_j = \mu + \alpha_j $, where $ \alpha_j $ is main effect of y at x_j, and satisfies $ \sum_{j=1}^{q} \alpha_j = 0 $
- e.g. 2 factor experiment (factor A, factor B):
$ y_{ijk} = \mu + \alpha_i + \beta_j + ( \alpha \beta )_{ij} + \epsilon_{ijk}, \quad i=1,\dots,p, \; j=1,\dots,q, \; k=1,\dots,K $
- $ \mu $ = overall mean
- $ \alpha_i, \beta_j $ = main effect of factor A/factor B at level i/level j
- $ \epsilon_{ijk} $ = random error in the kth trial at level combination $ (i,j) $ of A and B
- $ ( \alpha \beta )_{ij} $ = interaction between A and B at level combination $ (i,j) $, under restrictions:
- $ \sum_{i=1}^{p} \alpha_i = 0 $
- $ \sum_{j=1}^{q} \beta_j = 0 $
- $ \sum_{i=1}^{p} ( \alpha \beta )_{ij} = \sum_{j=1}^{q} ( \alpha \beta )_{ij} = 0 $
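The decomposition above can be sketched numerically: for a full factorial with replicates, the effect estimates are just cell, row, and column means. A minimal sketch with made-up data (the factor levels, replicate count, and noise scale are illustrative):

```python
import numpy as np

# Hypothetical full factorial: p = 2 levels of A, q = 3 levels of B, K = 2 replicates.
# y[i, j, k] is the k-th observation at level combination (i, j).
rng = np.random.default_rng(0)
y = 10.0 + rng.normal(0.0, 0.1, size=(2, 3, 2))

cell_means = y.mean(axis=2)                    # estimates mu + alpha_i + beta_j + (alpha beta)_ij
mu_hat = cell_means.mean()                     # overall mean
alpha_hat = cell_means.mean(axis=1) - mu_hat   # main effects of A
beta_hat = cell_means.mean(axis=0) - mu_hat    # main effects of B
ab_hat = cell_means - mu_hat - alpha_hat[:, None] - beta_hat[None, :]

# The side conditions hold by construction:
assert np.isclose(alpha_hat.sum(), 0.0) and np.isclose(beta_hat.sum(), 0.0)
assert np.allclose(ab_hat.sum(axis=0), 0.0) and np.allclose(ab_hat.sum(axis=1), 0.0)
```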
- Factorial design cost - for an s-factor experiment with $ q_1,\dots,q_s $ levels, the number of main effects plus interactions is $ \prod_{j=1}^{s} q_j - 1 $, which grows exponentially as s increases
- Sparsity principle - number of relatively important effects and interactions in factorial design is small
- Hierarchical ordering principle - lower order effects are more likely to be important than higher order effects; main effects more likely to be important than interactions; effects of same order are equally likely to be important
- Optimal design - given an underlying relationship, different optimization approaches may be taken
- General regression model:
- $ y_{k} = \sum_{j=1}^{m} \beta_j g_j ( x_{k1}, \dots, x_{ks} ) + \epsilon_{k} = \sum_{j=1}^{m} \beta_j g_j (\boldsymbol{x}_k) + \epsilon_k, k=1,\dots,n $
- $ g_j(\cdot) $ are specified or known functions
- $ \epsilon $ is random error, $ E(\epsilon)=0 $ and $ Var(\epsilon)=\sigma^2 $
- This can be applied to several specific cases, e.g. linear model, quadratic model, etc., but g can also be nonlinear functions of x
- Rewriting in matrix form:
- $ \mathbf{y} = \mathbf{G} \boldsymbol{\beta} + \epsilon $
- matrix G: design matrix
- row 1 = $ [ g_1(\mathbf{x}_1) \dots g_m(\mathbf{x}_1) ] $
- row n = $ [ g_1(\mathbf{x}_n) \dots g_m(\mathbf{x}_n) ] $
- matrix M: information matrix
- $ \mathbf{M} = \frac{1}{n} \mathbf{G^{\prime} G} $
- covariance matrix of least squares estimator:
- $ Cov(\hat{\beta}) = \frac{ \sigma^2 }{ n } \mathbf{M}^{-1} $
- Optimization techniques:
- Want the covariance matrix to be as small as possible, which suggests maximizing $ \mathbf{M} $ with respect to the design $ D_n = \left\{ \mathbf{x}_k = (x_{k1},\dots,x_{ks}) \right\} $
- Many different proposed optimization criteria
- D-optimality: maximize determinant of $ \mathbf{M} $
- Equivalent to minimizing generalized variance (determinant of covariance matrix)
- A-optimality: minimize trace of $ \mathbf{M}^{-1} $
- equivalent to minimizing the sum of the variances $ \sum_{i=1}^{m} Var(\hat{\beta}_i) $
- E-optimality: minimize largest eigenvalue of $ \mathbf{M}^{-1} $
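As a rough sketch of how these criteria compare designs, the following computes $ \mathbf{M} $ for a simple linear model with basis $ g_1(x)=1, g_2(x)=x $ (an illustrative choice, not from the book) on two candidate designs and checks the D- and A-criteria:

```python
import numpy as np

def info_matrix(design, basis):
    """Information matrix M = G'G / n for given design points and basis g_j."""
    G = np.array([[g(x) for g in basis] for x in design])
    return G.T @ G / len(design)

# Illustrative basis for a simple linear model on [-1, 1]: g_1(x) = 1, g_2(x) = x.
basis = [lambda x: 1.0, lambda x: x]

M_end = info_matrix([-1.0, -1.0, 1.0, 1.0], basis)  # mass at the endpoints
M_mid = info_matrix([-0.5, -0.5, 0.5, 0.5], basis)  # points pulled inward

# D-optimality prefers the larger det(M); A-optimality the smaller tr(M^{-1}).
assert np.linalg.det(M_end) > np.linalg.det(M_mid)
assert np.trace(np.linalg.inv(M_end)) < np.trace(np.linalg.inv(M_mid))
```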
Motivation
- Computer model: $ y = f(\mathbf{x}) $
- physical experiments to understand relationship between response y and inputs $ x_j $ are too expensive or time consuming
- computer models important for investigating complicated physical phenomena
- one goal of computer experiments is to find an approximate model that is much simpler than the true but complicated model
- True model: $ y = f(\mathbf{x}) $
- Metamodel: $ \hat{y} = g( \mathbf{x} ) $
Comprehensive review papers:
- Sacks Welch Mitchell Wynn 1989
- Bates Buck Riccomagno Wynn 1996
- Koehler Owen 1996
Computer vs. physical experiments
- involve larger numbers of variables compared to typical physical experiments
- larger experiment domain or design space employed to explore nonlinear functions
- computer experiments are deterministic
Uses for metamodels:
- preliminary study and visualization
- prediction and optimization
- sensitivity analysis
- probabilistic analysis (effect of input uncertainty on variability of output variable; reliability and risk assessment applications)
- robust design and reliability-based design
Statistical approach for computer experiments:
- Design - find set of points in input space so that a model can be constructed; e.g. space filling designs
- Modeling - fitting highly adaptive models using various techniques; these models are more complex and do not admit straightforward interpretation, so sophisticated ANOVA-like global sensitivity analysis is needed to interpret the metamodel
General discussion of computer models and their use in industry (internal combustion engine application)
- Robust design and probabilistic design optimization approaches have been proposed:
- Wu Wang 1998
- Du Chen 2004
- Kalagnanam Diwekar 1997
- Du Sudjianto Chen 2004
- Hoffman Sudjianto Du Stout 2003
- Yang et al 2001
- Simpson Booker Ghosh Giunta Koch Yang 2000
- Du et al 2004
- Factorial design widely used in industrial designs
- Montgomery 2001
- Wu Hamada 2000
Space-filling designs
- trying to minimize deviation between expensive/full model and metamodel
- stochastic approaches - e.g. latin hypercube sampling (LHS)
- deterministic approaches - e.g. uniform design
Chapter 2/3: LHS and UD designs
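A bare-bones LHS construction (a sketch only; practical LHS designs are usually further optimized, as in Chapter 4) stratifies each dimension of $[0,1]^s$ into n equal intervals, places one point per stratum, and permutes the strata independently per dimension:

```python
import numpy as np

def latin_hypercube(n, s, seed=None):
    """Basic LHS: split [0, 1] into n strata per dimension, place one point
    uniformly within each stratum, and permute strata independently per dimension."""
    rng = np.random.default_rng(seed)
    samples = np.empty((n, s))
    for j in range(s):
        strata = rng.permutation(n)               # which stratum each row occupies
        samples[:, j] = (strata + rng.uniform(size=n)) / n
    return samples

X = latin_hypercube(10, 3, seed=42)
# Every column has exactly one point in each stratum [k/10, (k+1)/10):
for j in range(3):
    assert sorted(np.floor(X[:, j] * 10).astype(int)) == list(range(10))
```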
Koehler Owen 1996: different way to classify approaches to computer experiments
- "There are two main statistical approaches to computer experiments, one based on Bayesian statistics and a frequentist one based on sampling techniques."
- LHS, UD = frequentist experimental designs
- optimal LHS designs = Bayesian designs
Modeling Techniques
Metamodels: can be represented using linear combination of set of specific basis functions
Univariate Functions
Polynomial models:
- popular for computer experiments
- 2nd order polynomial models most popular
- $ g(\mathbf{x}) = \beta_0 + \sum_{i=1}^{s} \beta_i x_i + \sum_{1 \leq i \leq j \leq s} \beta_{ij} x_i x_j $
- "response surfaces" refers to 2nd order polynomial models
- Myers Montgomery 1995
- Morris Mitchell 1995
- problems:
- unstable computations... bypassed by centering variables, e.g. replace $ x_i $ with $ x_i - \overline{x_i} $
- collinearity problem with high-order polynomials... bypassed by using orthogonal polynomial models
- splines - variation of polynomial models designed to work in the high-collinearity/high-order case, better than polynomials alone
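A minimal sketch of fitting a second-order polynomial metamodel by least squares, with the inputs centered as described above to reduce numerical problems (the test function and design are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=(30, 2))      # 30 runs, s = 2 factors
y = 3.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] ** 2  # deterministic toy "computer model"

Xc = X - X.mean(axis=0)                        # centering stabilizes the computation
# Design matrix: intercept, linear terms, and the distinct second-order terms.
G = np.column_stack([np.ones(len(Xc)),
                     Xc[:, 0], Xc[:, 1],
                     Xc[:, 0] ** 2, Xc[:, 0] * Xc[:, 1], Xc[:, 1] ** 2])
beta, *_ = np.linalg.lstsq(G, y, rcond=None)

# The true model is itself quadratic, so the fit reproduces it to rounding error.
resid = y - G @ beta
assert np.max(np.abs(resid)) < 1e-8
```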
Fourier basis models:
- True model is approximated using Fourier regression, a set of periodic functions
- Number of terms increases exponentially with dimension
- In practice, a particular truncated Fourier metamodel is used
- Riccomagno, Schwabe, Wynn 1997
- Wavelets:
- used to improve Fourier basis
- work esp. well when function being approximated is not smooth
- Chui 1992
- Daubechies 1992
- Antoniadis Oppenheim 1995
Polynomials, splines, Fourier bases, and wavelets are powerful for univariate functions, but lose effectiveness and applicability for multivariate functions
Multivariate Functions
Kriging model
- assumes that $ y(\mathbf{x}) = \mu + z(\mathbf{x}) $
- $ \mu $ = overall mean of $ y(\mathbf{x}) $
- $ z(\mathbf{x}) $ = Gaussian process with mean 0 and covariance function $ Cov(z(\mathbf{x}_i),z(\mathbf{x}_j)) = \sigma^2 R(\mathbf{x}_i,\mathbf{x}_j) $
- $ \sigma^2 $ = unknown variance of $ z(\mathbf{x}) $
- R = correlation function with pre-specified functional form, some unknown parameters
- Typical correlation function: $ r(\mathbf{x}_1, \mathbf{x}_2) = \exp \left[ - \sum_{k=1}^{s} \theta_k (x_{1k} - x_{2k})^2 \right] $
- $ \theta_k $ are unknowns
- Resulting metamodel can be written $ g(\mathbf{x}) = \sum_{i=1}^{n} \beta_i r(\mathbf{x},\mathbf{x}_i) $
- which is of the general form of the linear combination of basis functions
- advantage of Kriging approach: constructs the basis directly using the correlation function
- under certain conditions, it can be shown that resulting metamodel from Kriging approach is the best linear unbiased predictor (see Ch. 5.4.1)
- Gaussian Kriging approach admits a Bayesian interpretation
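The interpolation step behind $ g(\mathbf{x}) = \sum_{i} \beta_i r(\mathbf{x}, \mathbf{x}_i) $ can be sketched as follows, using the Gaussian correlation with a fixed $ \theta $ (in a real Kriging fit, $ \theta $, $ \mu $, and $ \sigma^2 $ would be estimated, e.g. by maximum likelihood; the one-dimensional design and $ \theta $ here are illustrative):

```python
import numpy as np

def gauss_corr(X1, X2, theta):
    """r(x1, x2) = exp(-sum_k theta_k * (x1k - x2k)^2), for all pairs in X1 x X2."""
    d2 = (X1[:, None, :] - X2[None, :, :]) ** 2
    return np.exp(-(d2 * theta).sum(axis=-1))

# Illustrative one-dimensional design and a fixed theta.
X = np.linspace(0.0, 1.0, 8).reshape(-1, 1)
y = np.sin(2.0 * np.pi * X[:, 0])
theta = np.array([10.0])

R = gauss_corr(X, X, theta)       # correlations among the design points
beta = np.linalg.solve(R, y)      # coefficients in g(x) = sum_i beta_i r(x, x_i)

def g(x_new):
    return gauss_corr(np.atleast_2d(x_new), X, theta) @ beta

# Deterministic outputs, so the metamodel interpolates the design points exactly.
assert np.allclose(g(X), y)
```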
Bayesian interpolation
- Proposed by:
- Currin Mitchell Morris Ylvisaker 1991
- Morris Mitchell Ylvisaker 1993
- advantage: can easily incorporate auxiliary information
- e.g.: Bayesian Kriging method
- Morris et al 1993
Neural networks (multilayer perceptron network)
- mathematical definition...
- nonlinear optimization for a least squares objective function
- training algorithms
- etc.
Radial basis function methods
- used for neural network modeling, closely related to the Kriging approach
- generally, for design inputs $ \mathbf{x}_1, \dots, \mathbf{x}_n $ and associated outputs $ y_1, \dots, y_n $, the basis functions are:
- $ K \left( \left\Vert \mathbf{x} - \mathbf{x}_i \right\Vert / \theta \right), i=1,\dots,n $
- $ K(\cdot) $ = kernel function
- $ \theta $ = smoothing parameter
- Resulting metamodel:
- $ g(\mathbf{x}) = \sum_{i=1}^{n} \beta_i K \left( \left\Vert \mathbf{x} - \mathbf{x}_i \right\Vert / \theta \right) $
- if the kernel function is taken to be the Gaussian kernel function (density function of the normal distribution), the resulting metamodel has the same form as the Kriging metamodel
Local polynomial models
- Fan 1992
- Fan Gijbels 1996
- Concept: data point closer to $ \mathbf{x} $ carries more information about the value of $ f(\mathbf{x}) $ than one that is further away
- regression function estimator: running local average
- improved version of local average: locally weighted average, e.g.
- $ g(\mathbf{x}) = \sum_{i=1}^{n} w_i ( \mathbf{x} ) y_i $
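A minimal locally weighted average of the form above, using Gaussian weights as the weighting scheme (the kernel choice and bandwidth are illustrative assumptions, not the book's specific estimator):

```python
import numpy as np

def local_average(x_new, X, y, h):
    """Locally weighted average of y: weights decay with distance from x_new."""
    d2 = ((X - x_new) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2.0 * h ** 2))   # Gaussian weights (illustrative kernel)
    return (w / w.sum()) @ y           # normalized so the weights sum to 1

X = np.array([[0.0], [0.5], [1.0]])
y = np.array([0.0, 1.0, 0.0])
# With a small bandwidth the estimate near x = 0.5 is dominated by the middle point.
assert local_average(np.array([0.5]), X, y, h=0.05) > 0.99
```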
Chapter 5 - more explanations of these various modeling approaches, more modeling techniques, etc.
Book Map
Part II: design of computer experiments
Chapter 2:
- Latin hypercube sampling
- its modifications
- randomized orthogonal array
- symmetric Latin hypercube sampling
- optimal Latin hypercube designs
Chapter 3:
- uniform design
- measures of uniformity
- modified L2-discrepancies
- algebraic approaches for constructing several classes of uniform design
Chapter 4
- stochastic optimization techniques for constructing optimal space-filling designs
- heuristic optimization algorithms
- high-quality space filling designs under various optimality criteria
- popular algorithms
Chapter 5:
- introduction to various modeling techniques
- fundamental concepts
- logical progression from simple to increasingly complex models
- polynomial models
- splines
- kriging
- bayesian approaches
- neural networks
- local polynomials
- unified view of all models is provided
- Kriging is a central concept to this chapter
Chapter 6
- special techniques for model interpretation
- generalizations of the traditional ANOVA (analysis of variance) for linear models
- highly recommended that readers study this chapter, especially those interested in understanding the sensitivity of the output to the input variables
- beginning: traditional sum-of-squares decomposition (linear models)
- sequential sum of squares decomposition for general models
- Sobol functional decomposition (generalization of ANOVA decomposition)
- analytic functional decomposition (tensor product metamodel structures)
- computational technique: FAST (fourier amplitude sensitivity test)
Chapter 7
- computer experiments with functional responses
- response is in the form of a curve, with response data collected over a time interval, space interval, or operating interval (e.g. an experiment measuring engine noise over a range of speeds)
- analysis of functional response in context of design of experiments is new area
Chapter 2: Latin Hypercube Sampling and its Modifications
Chapter 3: Uniform Experimental Design
Chapter 4: Optimization in Construction of Designs for Computer Experiments
Chapter 5: Metamodeling
Review of Modeling Concepts
Mean square error and prediction error
Mean square error (MSE):
$ MSE(g) = \int_{T} \left[ f(\mathbf{x}) - g(\mathbf{x}) \right]^2 d\mathbf{x} $
No random error in computer experiments, so MSE = PE (prediction error)
Weighted mean square error (WMSE):
$ WMSE(g) = \int_{T} \left[ f(\mathbf{x}) - g(\mathbf{x}) \right]^2 w(\mathbf{x}) d\mathbf{x} $
$ w(\mathbf{x}) \geq 0 $ is a weight function, with $ \int_{T} w(\mathbf{x}) d\mathbf{x} = 1 $
The weighting function allows you to incorporate prior information about the distribution of $ \mathbf{x} $ over the domain
When no information about distribution of x over domain is available, it is assumed to be uniformly distributed
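When $ f $ is cheap enough to sample, the (unweighted) MSE integral can be approximated by Monte Carlo with $ \mathbf{x} $ uniform on $ T $; a sketch with a toy $ f $ and $ g $ for which the exact value is $ 1/30 $:

```python
import numpy as np

def mse_monte_carlo(f, g, low, high, n=100_000, seed=0):
    """Approximate MSE(g) = integral over T of (f - g)^2, using x ~ Uniform(T)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(low, high, size=(n, len(low)))
    vol = float(np.prod(np.asarray(high) - np.asarray(low)))
    return vol * np.mean((f(X) - g(X)) ** 2)

# Toy case on T = [0, 1]: f(x) = x^2, g(x) = x, exact MSE = 1/30.
est = mse_monte_carlo(lambda X: X[:, 0] ** 2, lambda X: X[:, 0], [0.0], [1.0])
assert abs(est - 1.0 / 30.0) < 1e-3
```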
In cases where computer experiments are expensive (order of hours), impractical to evaluate prediction error directly
General strategy: estimate prediction error of metamodel g by using cross-validation procedure
for $ i=1,\dots,n $ let $ g_{-i} $ denote the metamodel based on the sample excluding $ (\mathbf{x}_i, y_i) $.
Cross Validation for Prediction Error of Expensive Models
Cross validation score:
$ CV_{1} = \frac{1}{n} \sum_{i=1}^{n} \left[ f(\mathbf{x}_i) - g_{-i}(\mathbf{x}_i) \right]^2 $
This gives a good estimate for the prediction error of g
Procedure also called "leave-one-out cross validation"
If sample size n is large, and process of building nonlinear metamodel is time-consuming (Kriging and/or neural network models), using $ CV_1 $ scores becomes computationally too demanding (b/c need to build n metamodels)
To reduce computational burden even further, can modify procedure
For pre-specified K, divide sample into K groups (equal sample size)
Let $ g_{(-k)} $ be the metamodel built on the sample excluding observations in the kth group
Let $ \mathbf{y}_{(k)},\mathbf{g}_{(k)} $ be the vectors of observed values and predicted values for the kth group, where predictions use $ g_{(-k)} $
This yields a K-fold cross validation score:
$ CV_{K} = \frac{1}{n} \sum_{k=1}^{K} \left( \mathbf{y}_{(k)} - \mathbf{g}_{(k)} \right)^{\prime} \left( \mathbf{y}_{(k)} - \mathbf{g}_{(k)} \right) $
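A sketch of the K-fold score for a generic fit/predict pair (here an ordinary least squares toy metamodel on an exactly linear response, chosen only so the expected score is essentially zero):

```python
import numpy as np

def k_fold_cv(X, y, fit, predict, K, seed=0):
    """K-fold CV score: (1/n) * sum over folds of the squared held-out errors."""
    idx = np.random.default_rng(seed).permutation(len(y))
    score = 0.0
    for held_out in np.array_split(idx, K):
        train = np.setdiff1d(idx, held_out)
        model = fit(X[train], y[train])            # g_{(-k)}: built without fold k
        resid = y[held_out] - predict(model, X[held_out])
        score += resid @ resid
    return score / len(y)

# Toy metamodel: ordinary least squares on [1, x] (illustrative choice only).
design = lambda X: np.column_stack([np.ones(len(X)), X])
fit = lambda X, y: np.linalg.lstsq(design(X), y, rcond=None)[0]
predict = lambda b, X: design(X) @ b

rng = np.random.default_rng(3)
X = rng.uniform(size=(40, 1))
y = 1.0 + 2.0 * X[:, 0]                            # exactly linear, so the score ~ 0
assert k_fold_cv(X, y, fit, predict, K=5) < 1e-10
```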
Regularization Parameter
For most modeling procedures: metamodel g depends on a regularization parameter, say $ \lambda $
Cross validation score depends on regularization parameter, denoted by $ CV(\lambda) $ (either $ CV_1 $ or $ CV_K $)
Goal is to minimize cross validation score with respect to $ \lambda $
$ \hat{\lambda} = \arg \min_{\lambda} CV(\lambda) $
Minimization done by searching over grid values of $ \lambda $
Theoretical properties of cross validation:
- Li 1987
General Metamodel Form
Metamodels can be generally expressed in the form:
$ g(\mathbf{x}) = \sum_{j=0}^{L} B_j (\mathbf{x}) \beta_j $
where $ B_j $ are a set of basis functions defined over experimental domain (hypercube)
This may be a polynomial basis function, covariance function (Kriging), radial basis function (neural networks), etc.
Outputs of computer experiments are deterministic: therefore metamodel construction is interpolation problem
Matrix notation:
$ \begin{array}{rcl} \mathbf{y} &=& (y_1,\dots,y_n)^{\prime} \\ \boldsymbol{\beta} &=& (\beta_0,\dots,\beta_L)^{\prime} \\ \mathbf{b(x)} &=& ( B_0(\mathbf{x}),\dots,B_L(\mathbf{x}))^{\prime} \end{array} $
and
$ \mathbf{B} = \left( \begin{array}{ccc} B_0 (\mathbf{x}_1) & \dots & B_L (\mathbf{x}_1) \\ B_0 (\mathbf{x}_2) & \dots & B_L (\mathbf{x}_2) \\ \dots & \dots & \dots \\ B_0 (\mathbf{x}_n) & \dots & B_L (\mathbf{x}_n) \end{array} \right) $
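Given any choice of basis, the matrix form above turns metamodel construction into a linear solve; when the number of basis functions equals n, solving $ \mathbf{B} \boldsymbol{\beta} = \mathbf{y} $ interpolates the design points exactly. A sketch with a hypothetical monomial basis in one input variable:

```python
import numpy as np

def basis_matrix(xs, L):
    """n x (L+1) matrix with B_j(x_i) = x_i ** j (hypothetical monomial basis)."""
    return np.vander(xs, N=L + 1, increasing=True)

xs = np.array([0.0, 0.3, 0.7, 1.0])   # n = 4 design points
y = np.cos(xs)                         # deterministic outputs
B = basis_matrix(xs, L=3)              # L + 1 = n, so B is square

beta = np.linalg.solve(B, y)           # interpolation: B beta = y
g = lambda x: basis_matrix(np.atleast_1d(np.asarray(x, float)), L=3) @ beta
assert np.allclose(g(xs), y)
```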