Overview of Experimental Design and Surrogate Models

The Problem Statement

Purpose: create a cheap representation of an expensive computer model

We're picking some input parameters, and some output variables

Normally there is a map from one to the other: the real function $f$ ,

${\boldsymbol {y}}=f({\boldsymbol {x}})$

And we're creating a surrogate model $g$ ,

${\boldsymbol {y}}=g({\boldsymbol {x}})$

This is sometimes called a "metamodel", because it's a model of a model

Classes of Surrogate Models

There are several classes or forms for $g$

Space-filling designs
- Latin hypercube
Uniform sampling
Neural networks
Gaussian
Polynomials (response surface methodology)

I won't cover all, I will only cover latin hypercube, space-filling, and response surface methodologies

Surrogate Modeling

When constructing surrogate models, important to distinguish between computer surrogate modeling (metamodeling) and experimental surrogate modeling

Big difference: experiments have random errors

Basic Concepts for Experiments

Analysis of Variance tables

Basic Concepts for Metamodeling

Metamodeling: regression on data without random errors

Trying to predict true value $f({\boldsymbol {x}})$ using surrogate model $g({\boldsymbol {x}})$

Mean square error:

$MSE(g)=\int _{R}\left(f({\boldsymbol {x}})-g({\boldsymbol {x}})\right)^{2}d{\boldsymbol {x}}$

where R is the region in parameter space where the metamodel applies

Example

Example function:

Real function f:

function real = real_function()                                      
                                                                     
% Define the domain of the real function                             
x=0:(pi/32):2*pi;                                                    
                                                                     
real = 2*x.*cos(4*pi*x);

Surrogate function f:

function surrogate = surrogate_function()

% Define the region in which the function is valid
x = 0:(pi/32):2*pi;

surrogate = 0.9931 + 1.96*(x-0.5) - 76.8838*(x-0.5).^2 - 152.0006*(x-0.5).^3 ...
        + 943.8565*(x-0.5).^4 + 1857.1427*(x-0.5).^5 - 3983.9332*(x-0.5).^6 ...
        - 7780.7937*(x-0.5).^7 + 5756.3561*(x-0.5).^8 + 11147.1698*(x-0.5).^9;

Comparing the two functions:

And comparing their error:

Mean square error:

r=real_function;
s=surrogate_function;
MSE = sum( (r-s).^2 );

MSE = 1.6354

Latin Hypercube

Latin Hypercube is a way of sampling a space randomly, but in such a way that each dimension of the space is sampled.

For example, in the following figure, one sample falls into each bin of each of the x and y dimensions:

For a domain divided into $n$ bins, each bin has an equal marginal probability of $1/n$

Algorithm

Purpose: create an experimental design with $n$ runs (number of samples to be taken), and $s$ input variables

The result should be a Latin hypercube design that is an $n\times s$ matrix denoting the variable combinations at which to sample

Step 1: take $s$ independent permutations of $n$ integers $\pi _{j}(1)\dots \pi _{j}(n)$

(note that $j$ indexes the dimension of the Latin hypercube, $j=1\dots s$ , and $n$ is the number of runs or experiments)

Step 2: Take $ns$ random numbers $U_{k}^{j}$ and compute the locations of the Latin hypercube samples as:

$x_{k}^{j}={\frac {\pi _{j}(k)-U_{k}^{j}}{n}}$

where $k=1\dots n$ and $j=1\dots s$

Variation

One variation is centered Latin hypercube sampling

Each sample location is given by:

$x_{k}^{i}={\frac {\pi ^{j}(k)-0.5}{n}}$

where $k=1\dots n$ indexes which experiment (or run)

(this technique does not require random numbers)

LHS in Matlab

If you have the statistical toolbox (which CHPC @ University of Utah does), Matlab has an LHS function available to you: lhsdesign

Documentation is available here: http://www.mathworks.com/help/toolbox/stats/lhsdesign.html

Discussion Question

Why might the LHS sample (same as given above):

be better than the one given below?

Monte Carlo Sampling

Main article: Monte Carlo Experimental Design


Roulette illustrates the random nature of the function samples used in Monte Carlo methods. A good random number generator (for example, a roulette table) is an essential part of any Monte Carlo method.

Monte Carlo sampling is essentially a brute-force technique in which random samples are taken until confidence that the entire space has been sampled is satisfactory.

Random numbers are used to create sampling points in each direction.

Think of Monte Carlo ray-tracing: you send out a whole bunch of rays, each in random directions, and from the result you determine the radiative flux. Mathematically, you're performing an integration by randomly sampling the function you want to integrate, then adding up all of the random samples:

$\int f(x)dx\approx {\frac {1}{N}}\sum _{i}f(x_{i})$

Discussion Question

Why use Monte Carlo sampling vs. Latin Hypercube sampling?

Response Surfaces

See below for details of the response surface methodology.

Response Surface Methodology

The response surface methodology for surrogate models (and experimental design) uses polynomials to approximate the response surface.

For a function $f({\boldsymbol {x}})$ , the response surface is the parameter-space plot of the output function $f$ as a function of the input parameters ${\boldsymbol {x}}$

Last week, Alex showed us a preliminary response surface plot of the output, product yield $y_{P}$ , as a function of two input paramters, $L_{mix}$ and $k$ :

The response surface methodology prescribes ways to sample the output function $f({\boldsymbol {x}})$ in order to produce the best polynomial representation $g({\boldsymbol {x}})$ possible

Terminology

A factor is an input parameter $x$

A level is a discrete value that the parameter $x$ can take on

For example: if I want to investigate the effect of $L_{mix}$ on $y_{P}$ , I have to decide the discrete values of $L_{mix}$ that I want to try

If I choose, for example, 1 and 10, then the factor $L_{mix}$ has two levels ${1,10}$

Factorial Design

Factorial designs are experimental designs for linear functions of input parameters. That is, they assume a linear model (given a $D$ -dimensional input parameter vector) of the form:

$y({\boldsymbol {x}})=\alpha _{0}+\sum _{i=1}^{D}\alpha _{i}x_{i}+\sum _{j=1}^{D}\sum _{k=1}^{j}\alpha _{jk}x_{j}x_{k}+\dots$

One Factor At A Time (OFAAT)

OFAAT testing isolates the interaction effects of each variable, and assumes each variable is independent

Given: a set of $k$ factors $i=1\dots k$

The algorithm is as follows:

1. Fix the levels of all but one factor $i=2,3,\dots k$ , and determine the optimal level for factor $i=1$

2. Use the optimal level for factor $i=1$ . Fix levels for all but one factor $i=3,4,\dots k$ and determine the optimal level for factor $i=2$

3. Use the optimal level for factor $i=1,2$ . Fix levels for all but one factor $i=4,5,\dots k$ , etc.

etc.

Full Factorial Design

The following discussion assumes use of two factor levels. Using more than 2 factor levels can potentially make factorial experimental design very complex.

The full factorial algorithm is as follows:

1. Assign the value $+1$ or $-1$ to the upper and lower factor levels, respectively.

2. Create a table of factor-level combinations. For example:

Factor-level combination enumeration	Factor 1	Factor 2
1	+1	+1
2	+1	-1
3	-1	+1
4	-1	-1

3. Generate a sequence of $N$ random numbers, where $N$ is the number of factor-level combinations.

4. Run each sequence in the determined order

Fractional Factorial Design

Fractional factorial designs reduce the number of experimental runs required to construct the polynomial, but they throw out some information about interaction effects. Typically, information about the highest-order effects is thrown out.

The essential technique is to determine which experiments you do not want to run, and exclude those from Step 4 in the factorial design algorithm given above. And the essential technique for determining which experiments you do not want to run is to determine which combination of experiments will give information about higher-order interaction effects.

To do this, an expanded version of the above table is required: one that includes factor interactions. To determine, for example, the factor interaction between factors 1 and 2, $I_{12}$ , simply multiply the factor 1 and factor 2 values (e.g. $+1\times +1$ or $-1\times +1$ , etc.).

Factor 1 = $I_{1}$

Factor 2 = $I_{2}$

Factor 3 = $I_{3}$

Factor 1-Factor 2 = $I_{12}$

Factor 2-Factor 3 = $I_{23}$

Factor 1-Factor 3 = $I_{13}$

Factor 1-Factor 2-Factor 3 = $I_{123}$

Factor-level combination enumeration	$I_{1}$	$I_{2}$	$I_{3}$	$I_{12}$	$I_{23}$	$I_{13}$	$I_{123}$
1	+1	+1	+1	+1	+1	+1	+1
2	+1	+1	-1	+1	-1	-1	-1
3	+1	-1	+1	-1	-1	+1	-1
4	+1	-1	-1	-1	+1	-1	+1
5	-1	+1	+1	-1	+1	-1	-1
6	-1	+1	-1	-1	-1	+1	+1
7	-1	-1	+1	+1	-1	-1	+1
8	-1	-1	-1	+1	+1	+1	-1

Composite Design

Composite design builds on the full factorial design by adding three additional levels. Rather than just $-1$ and $+1$ , composite design also includes $-2,0,+2$ .

Thinking about a three-dimensional factorial design, the factorial design sample points looks like a cube in parameter space. Composite design makes a "star" design.

The additional sample points consist of the new lower $-2$ and upper $+2$ levels for each factor, while keeping all other factors constant at the $0$ level. It also includes a center point, at which all factors have a factor level of $0$ .

For a two-parameter experimental design, this looks like:

Experimental Design Step	Factor 1	Factor 2
Full factorial	+1	+1
Full factorial	+1	-1
Full factorial	-1	+1
Full factorial	-1	-1
Composite	0	0
Composite	+2	0
Composite	-2	0
Composite	0	+2
Composite	0	-2

Other Alternatives

Box-Behnkin

Box-Behnkin designs are intended to minimize the number of function evaluations required to construct a response surface (polynomial surrogate model). The design consists of three levels, rather than the five required by composite designs.

For more information see Box Behnken 1960 <ref name="BoxBehnken1960">Box and Behnken (1960). "Three Level Designs for the Study of Quantitative Variables". Technometrics 2 (4): 455-475. http://files.charlesmartinreid.com/BoxBehnken1960.pdf. </ref>

Plackett Burman

Experimental designs requiring a very minimal number of sample points.<ref name="PlackettBurman"> Plackett and Burman (1946). "Design of optimum multifactorial experiments". Biometrika 33: 305-325. </ref>

Sequential Assembly

References

Links to Papers

Box Behnken 1960: http://files.charlesmartinreid.com/BoxBehnken1960.pdf

Plackett Burman 1946: http://files.charlesmatinreid.com/PlackettBurman1946.pdf

Experimental Design Lecture

From charlesreid1

Contents

Overview of Experimental Design and Surrogate Models

The Problem Statement

Classes of Surrogate Models

Surrogate Modeling

Basic Concepts for Experiments

Basic Concepts for Metamodeling

Example

Latin Hypercube

Algorithm

Variation

LHS in Matlab

Discussion Question

Monte Carlo Sampling

Discussion Question

Response Surfaces

Response Surface Methodology

Terminology

Factorial Design

One Factor At A Time (OFAAT)

Full Factorial Design

Fractional Factorial Design

Composite Design

Other Alternatives

Box-Behnkin

Plackett Burman

Sequential Assembly

References

Links to Papers