### Update part 2 of lecture notes, 2A, 2B, 2C

Charles Reid, 2 months ago
commit 4432f5e937
3 changed files with 976 additions and 435 deletions
1. notebooks/ED_2A_RSM.ipynb (+4, -435)
2. notebooks/ED_2B_Goodness.ipynb (+457, -0)
3. notebooks/ED_2C_LeastSquaresResponseSurface.ipynb (+515, -0)

#### + 4 - 435 notebooks/ED_2A_RSM.ipynb

 @@ -4,7 +4,9 @@ 4 4  "cell_type": "markdown", 5 5  "metadata": {}, 6 6  "source": [ 7 - "# Experiment Design Lecture 2A: Response Surface Methodology" 7 + "# Experiment Design Lecture 2A: Response Surface Methodology\n", 8 + "\n", 9 + "Link: [http://charlesreid1.com/wiki/Response_Surface_Methodology](http://charlesreid1.com/wiki/Response_Surface_Methodology)" 8 10  ] 9 11  }, 10 12  { @@ -435,440 +437,7 @@ 435 437  { 436 438  "cell_type": "markdown", 437 439  "metadata": {}, 438 - "source": [ 439 - "## Comparison Metrics\n", 440 - "\n", 441 - "### Analysis of Variance (ANOVA Table)\n", 442 - "\n", 443 - "ANOVA is a statistical technique that builds a table attributing the variance in the system response to various degrees of freedom. Each piece of data about the real system contributes one degree of freedom, which can go toward estimating a new coefficient, or can go toward estimating the variance in the model.\n", 444 - "\n", 445 - "Bibliography note: Mason ch. 6, 8 - derived in 6.1\n", 446 - "\n", 447 - "For ANOVA of linear regression models, need to define a few quantities.\n", 448 - "\n", 449 - "Total sum of squares: \n", 450 - "\n", 451 - "$$\n", 452 - "TSS = \\sum \\left( y_i - \\overline{y} \\right)^2\n", 453 - "$$\n", 454 - "\n", 455 - "Error sum of squares:\n", 456 - "\n", 457 - "$$\n", 458 - "SSE = \\sum \\left( y_i - \\hat{y}_i \\right)^2\n", 459 - "$$\n", 460 - "\n", 461 - "Model sum of squares (regression sum of squares):\n", 462 - "\n", 463 - "$$\n", 464 - "SSR = \\sum \\left( \\hat{y}_i - \\overline{y} \\right)^2\n", 465 - "$$\n", 466 - "\n", 467 - "### Univariate Linear Model\n", 468 - "\n", 469 - "(Sample ANOVA table constructed for a linear univariate function)\n", 470 - "\n", 471 - "### Multiple Linear Model\n", 472 - "\n", 473 - "Important not just to assess overall fit of prediction equation\n", 474 - "\n", 475 - "Also important to assess contribution of individual predictor variables to the fit\n", 476 - "\n", 477 - "Many commonly-reported 
measures of model fitness are part of ANOVA table\n", 478 - "\n", 479 - "Multivariate models: $p$ degrees of freedom for sum of squares due to regression (because $p$ coefficients $\{\beta_1, \beta_2, \dots, \beta_p\}$ must be estimated to obtain the regression sum of squares)\n", 480 - "\n", 481 - "(Sample ANOVA table constructed for multiple linear function)\n", 482 - "\n", 483 - "\n", 484 - "\n", 485 - "$p$ is number of predictor variables. To measure the adequacy of fitted model, determine the error standard deviation:\n", 486 - "\n", 487 - "$$\n", 488 - "s_e = (MSE)^{1/2}\n", 489 - "$$\n", 490 - "\n", 491 - "where $MSE = SSE/(n-p-1)$.\n", 492 - "\n", 493 - "Small $s_e$ means predicted responses closely approximate observed responses.\n", 494 - "\n", 495 - "Large $s_e$ means large random error, or poor selection of model form." 496 - ] 497 - }, 498 - { 499 - "cell_type": "markdown", 500 - "metadata": {}, 501 - "source": [ 502 - "### F-Statistic\n", 503 - "\n", 504 - "Main article: F-Statistic\n", 505 - "\n", 506 - "I found this short YouTube video very helpful for illustrating what the F-statistic means physically: [link](http://www.youtube.com/watch?v=TMwSS8DAVYk)\n", 507 - "\n", 508 - "The F-statistic can be thought of as a frequentist metric for hypothesis-testing. 
Once an F-statistic and corresponding p-value are calculated from the ANOVA table quantities, you can determine how confident you can be in a given hypothesis test (where the hypothesis is the model onto which you've chosen to regress your data).\n", 509 - "\n", 510 - "Mason: Different from tests for significance of factor effects in analysis of designed experiments\n", 511 - "\n", 512 - "Mason: example of acid-content data...\n", 513 - "\n", 514 - "For multiple linear regression/model, can use F-statistic to simultaneously test a hypothesis: whether all $\beta_j = 0$ (the null hypothesis), or whether at least one $\beta_j \neq 0$ (the alternative hypothesis, i.e., the outcome that the null hypothesis is not correct).\n", 515 - "\n", 516 - "$\beta_1 = \beta_2 = \dots = \beta_p = 0$ versus $\beta_j \neq 0$\n", 517 - "\n", 518 - "(While this seems silly, it's much more useful if you're doing this for a subset of coefficients - such as testing the hypothesis of whether any of a subset of coefficients should be non-zero.)\n", 519 - "\n", 520 - "### Lack of Fit F-Statistic\n", 521 - "\n", 522 - "For (deterministic) computer simulations (rather than experiments, which have random error), error is entirely due to lack of fit - there is no source of random error from real instruments or real experimental noise.\n", 523 - "\n", 524 - "In this case, an F-statistic specifically for lack-of-fit is not possible to calculate (defined as $MSE_{LOF}/MSE_{P}$, and $MSE_P = 0$ for deterministic functions)\n", 525 - "\n", 526 - "### Partial F-Test (Determination of Term Significance)\n", 527 - "\n", 528 - "Consider a full regression model with $p$ predictor variables.\n", 529 - "\n", 530 - "Now consider a reduced regression model with $k < p$ predictor variables.\n", 531 - "\n", 532 - "Full model = $M_1$\n", 533 - "\n", 534 - "Reduced model = $M_2$\n", 535 - "\n", 536 - "Reduction in error sum of squares resulting from fit of additional terms in full model:\n", 537 - "\n", 538 - "$$\n", 539 - 
"R(M_1 \\vert M_2) = SSE_2 - SSE_1\n", 540 - "$$\n", 541 - "\n", 542 - "$(p - k)$ more predictor variables\n", 543 - "\n", 544 - "F-statistic for determining statistical significance of this subset is:\n", 545 - "\n", 546 - "$$\n", 547 - "F = \\frac{ MSR(M_1 \\vert M_2) }{ MSE_1 }\n", 548 - "$$\n", 549 - "\n", 550 - "where $MSR(M_1 \\vert M_2) = \\frac{ R(M_1 \\vert M_2) }{ (p-k) }$\n", 551 - "\n", 552 - "If $k = p-1$ then the F-statistic is the square of the t-statistic from the full model corresponding to the term left out of the reduced model\n", 553 - "\n", 554 - "To determine if F-statistic is highly significant, use p value (95% likelihood of being significant if $p < 0.05$, 99% likelihood if $p < 0.01$)\n", 555 - "\n", 556 - "Example: use to determine if interaction effect is important to surrogate model.\n", 557 - "\n", 558 - "Using this procedure and using the t-statistic to test the significance of a given term are equivalent!" 559 - ] 560 - }, 561 - { 562 - "cell_type": "markdown", 563 - "metadata": {}, 564 - "source": [ 565 - "### T-Statistic\n", 566 - "\n", 567 - "**TODO:** \n", 568 - "\n", 569 - "Fix from here\n", 570 - "\n", 571 - "**Linear Univariate Model:**\n", 572 - "\n", 573 - "A t-statistic can be constructed to test $\\beta_1 = c$ vs. 
$\\beta_1 \\neq c$\n", 574 - "\n", 575 - "The following statistic has a Student t-distribution with $n-2$ degrees of freedom:\n", 576 - "\n", 577 - "$$\n", 578 - "\\begin{array}{rcl} t &=& \\frac{ b_1 - \\beta_1 }{ s_e / s_xx^{1/2} } \\\\ s_{xx} &=& \\sum \\left( x_i - \\overline{x} \\right)^2 \\end{array}\n", 579 - "$$\n", 580 - "\n", 581 - "and insert $\\beta_1 = c$.\n", 582 - "\n", 583 - "If you insert $c = 0$ and square the result you get the F-statistic from the ANOVA table.\n", 584 - "\n", 585 - "This t-variate can be used to form confidence intervals on the slope parameter.\n", 586 - "\n", 587 - "Following Chapter 2.4: limits for $\\beta_1$ are\n", 588 - "\n", 589 - "$$\n", 590 - "b_1 - t_{\\alpha / 2} \\frac{s_e}{s_{xx}^{1/2}} \\leq \\beta_1 \\leq b_1 + t_{\\alpha / 2} \\frac{s_e}{s_{xx}^{1/2}}\n", 591 - "$$\n", 592 - "\n", 593 - "where $t_{\\alpha / 2}$ is a $100(\\alpha / 2)%$ upper-tail t critical value with n-2 degrees of freedom\n", 594 - "\n", 595 - "Small model standard deviations will lead to small confidence intervals\n", 596 - "\n", 597 - "For the intercept parameter \\beta_0, use the following t-variate:\n", 598 - "\n", 599 - "$$\n", 600 - "t = \\frac{ b_0 - \\beta_0 }{ s_e \\left( n^{-1} + \\overline{x}^2 / s_{xx} \\right)^2 }\n", 601 - "$$\n", 602 - "\n", 603 - "**Multiple Linear Model:**\n", 604 - "\n", 605 - "Testing hypotheses on individual regression coefficients is of primary interest to someone performing regression analysis\n", 606 - "\n", 607 - "The t-statistic can be constructed to test $\\beta_j = c$ versus $\\beta_j \\neq c$.\n", 608 - "\n", 609 - "Test statistic used for this purpose is:\n", 610 - "\n", 611 - "$$\n", 612 - "t = \\frac{ b_j - c }{ s_e c_{jj}^{1/2} }\n", 613 - "$$\n", 614 - "\n", 615 - "where $s_e = (MSE)^{1/2}$ is the estimated error standard deviation.\n", 616 - "\n", 617 - "$$\n", 618 - "c_{jj} = \\left[ (n - 1) s_j^2 (1 - R_j^2) \\right]^{-1}\n", 619 - "$$\n", 620 - "\n", 621 - "$s_j^2$ is the sample variance of 
the n values of the jth predictor variable\n", 622 - "\n", 623 - "$R_j^2$ is the coefficient of determination for regression of $x_j$ on the constant term and the other $p-1$ predictor variables.\n", 624 - "\n", 625 - "A t-statistic with $c = 0$ can be used to test statistical significance of individual model parameters (usefulness of $x_j$ as predictor of response variable)\n", 626 - "\n", 627 - "NOTE: this test is only conditional, since $\beta_j$ is partial regression coefficient, and $b_j$, $c_{jj}$ are functions of other predictor variable values\n", 628 - "\n", 629 - "Only determines significance of jth predictor variable conditional on the presence of the other predictor variables\n", 630 - "\n", 631 - "e.g. \"Each individual predictor variable contributes significantly to the given fits, given that the other two predictor variables are also included in the model\" " 632 - ] 633 - }, 634 - { 635 - "cell_type": "markdown", 636 - "metadata": {}, 637 - "source": [ 638 - "### Response Confidence Intervals\n", 639 - "\n", 640 - "Want confidence intervals for the response model $\hat{y}$.\n", 641 - "\n", 642 - "**Linear Univariate Models:**\n", 643 - "\n", 644 - "Confidence interval constructed for the response model:\n", 645 - "\n", 646 - "$$\n", 647 - "\mu = \beta_0 + \beta_1 x\n", 648 - "$$\n", 649 - "\n", 650 - "Mason: for fixed values of x...???\n", 651 - "\n", 652 - "The predicted response $\hat{y}$ has a normal distribution (given certain assumptions).\n", 653 - "\n", 654 - "Thus mean and deviation given by:\n", 655 - "\n", 656 - "$$\n", 657 - "\mu = \beta_0 + \beta_1 x\n", 658 - "$$\n", 659 - "\n", 660 - "$$\n", 661 - "\sigma \left[ a_1 + \left( x - a_2 \right)^2 / s_{xx} \right]^{1/2}\n", 662 - "$$\n", 663 - "\n", 664 - "where:\n", 665 - "\n", 666 - "$$\n", 667 - "a_1 = \frac{1}{n}\n", 668 - "$$\n", 669 - "\n", 670 - "$$\n", 671 - "a_2 = \overline{x}\n", 672 - "$$\n", 673 - "\n", 674 - "$$\n", 675 - "s_{xx} = \sum \left( x_i - 
\\overline{x} \\right)^2\n", 676 - "$$\n", 677 - "\n", 678 - "And the following t-variate can be used to construct confidence intervals for $\\mu$:\n", 679 - "\n", 680 - "$$\n", 681 - "t = \\frac{ \\hat{y} - \\mu }{ s_e \\left( a_1 + \\left( x - a_2 \\right) / s_{xx} \\right)^2 }\n", 682 - "$$\n", 683 - "\n", 684 - "To form prediction interval for actual future response, not expected value of a response.\n", 685 - "\n", 686 - "Use this equation again, but replace $\\mu$ with $y_f$, and $a_1$ with $a_1 + 1$.\n", 687 - "\n", 688 - "$y_f$ is the future response\n", 689 - "\n", 690 - "$\\hat{y}$ is the predicted value of future response\n", 691 - "\n", 692 - "$a \\rightarrow a+1$ is because future response has standard deviation $\\sigma$ with an added variability\n", 693 - "\n", 694 - "Standard deviation of $\\hat{y} - y_f$ is:\n", 695 - "\n", 696 - "$$\n", 697 - "\\sigma = \\left( 1 + a_1 + \\left( x - a_2 \\right)^2 / s_{xx} \\right)^{1/2}\n", 698 - "$$\n", 699 - "\n", 700 - "**Multiple Linear Model:**\n", 701 - "\n", 702 - "Confidence interval for regression coefficients of multiple linear model:\n", 703 - "\n", 704 - "A $100(1-\\alpha)%$ confidence interval for $\\beta_j$ is given by:\n", 705 - "\n", 706 - "$$\n", 707 - "b_j - t_{\\alpha/2} s_e c_{jj}^{1/2} \\leq \\beta_j \\leq b_j + t_{\\alpha/2} s_e c_{jj}^{1/2}\n", 708 - "$$\n", 709 - "\n", 710 - "where $t_{\\alpha/2}$ is two-tailed $100 \\alpha %$ t critical value having $n - p - 1$ degrees of freedom.\n", 711 - "\n", 712 - "Simultaneous confidence intervals for all coefficients in multiple linear regression model cannot be computed using individual coefficient intervals.\n", 713 - "\n", 714 - "They ignore systematic variation of predictor variables and consequent correlation among coefficient estimators." 
715 - ] 716 - }, 717 - { 718 - "cell_type": "markdown", 719 - "metadata": {}, 720 - "source": [ 721 - "### Correlation Coefficient (R Squared)\n", 722 - "\n", 723 - "Measure of correlation between observed and predicted responses\n", 724 - "\n", 725 - "Univariate Linear Model\n", 726 - "\n", 727 - "$$\n", 728 - "R^2 = \\left[ corr(y,\\hat{y}) \\right]^2 = 1 - \\frac{SSE}{TSS}\n", 729 - "$$\n", 730 - "\n", 731 - "**Multiple Linear Model:**\n", 732 - "\n", 733 - "R-squared can be calculated as:\n", 734 - "\n", 735 - "$$\n", 736 - "R^2 = 1 - \\frac{SSE}{TSS}\n", 737 - "$$\n", 738 - "\n", 739 - "It should be acknowledged that as the number of predictor variables approaches the number of observations, this can become arbitrarily close to 1\n", 740 - "\n", 741 - "if $n = p+1$ then $R^2 = 1$.\n", 742 - "\n", 743 - "Adjusted $R^2$, denoted $R_a^2$:\n", 744 - "\n", 745 - "$$\n", 746 - "R_a^2 = 1 - a \\frac{SSE}{TSS}\n", 747 - "$$\n", 748 - "\n", 749 - "where\n", 750 - "\n", 751 - "$$\n", 752 - "a = \\frac{ n-1 }{ n - p - 1 }\n", 753 - "$$\n", 754 - "\n", 755 - "Differences between $R^2$ and $R_a^2$ are minor except for when $n$ and $p$ are close.\n", 756 - "\n", 757 - "Caution should be used in relying on single measure of fit (e.g. $R^2$)." 758 - ] 759 - }, 760 - { 761 - "cell_type": "markdown", 762 - "metadata": {}, 763 - "source": [ 764 - "### Contour Plots\n", 765 - "\n", 766 - "Contour plots can be used to determine sensitivities: if the response $y$ changes significantly in one parameter direction, it is sensitive to that parameter. If the contour shows a structure that is uniform in one parameter direction, the response is not sensitive to that parameter.\n", 767 - "\n", 768 - "For multiple responses, a contour plot for each response can be made, infeasible regions shaded gray, and the plots overlaid to yield the feasible region. 
" 769 - ] 770 - }, 771 - { 772 - "cell_type": "markdown", 773 - "metadata": {}, 774 - "source": [ 775 - "### Determination of Outlier Data Points\n", 776 - "\n", 777 - "Mason Ch. 18\n", 778 - "\n", 779 - "Various tests for outliers (p. 629, 631, etc.)\n", 780 - "\n", 781 - "Tests for response variable outliers\n", 782 - "\n", 783 - "Tests for predictor variable outliers " 784 - ] 785 - }, 786 - { 787 - "cell_type": "markdown", 788 - "metadata": {}, 789 - "source": [ 790 - "## Other Things to Look At\n", 791 - "\n", 792 - "* Correlation between input variables - e.g., for time and temperature: $\{t, T, tT, t^2, T^2\}$" 793 - ] 794 - }, 795 - { 796 - "cell_type": "markdown", 797 - "metadata": {}, 798 - "source": [ 799 - "# Issues\n", 800 - "\n", 801 - "## Importance of Interaction Terms\n", 802 - "\n", 803 - "We mentioned above that screening and highly fractionated factorial designs tend to ignore interaction effects by aliasing them with main effects.\n", 804 - "\n", 805 - "This important note from Mason indicates that this can be avoided by only including the collinear interaction terms that are most likely to be significant:\n", 806 - "\n", 807 - "
\n", 808 - "Ordinarily, one should not routinely insert products of all the predictors in a regression model. To do so might create unnecessary complications in the analysis and interpretation of the fitted models due to collinear predictor variables. The purpose of including interaction terms in regression models is to improve the fit either because theoretical considerations require a modeling of the joint effects or because an analysis of the regression data indicates that joint effects are needed in addition to the linear terms of the individual variables.\n", 809 - "
" 810 - ] 811 - }, 812 - { 813 - "cell_type": "markdown", 814 - "metadata": {}, 815 - "source": [ 816 - "## Dealing with Multimodal Variables\n", 817 - "\n", 818 - "Sometimes, when constructing response surfaces, modal variables appear. Modal variables are variables that have multiple modes, or distinct sets of values. There are two variations of modal variables:\n", 819 - "\n", 820 - "* One uncertainty range\n", 821 - "* N uncertainty ranges" 822 - ] 823 - }, 824 - { 825 - "cell_type": "markdown", 826 - "metadata": {}, 827 - "source": [ 828 - "### One Uncertainty Range\n", 829 - "\n", 830 - "These types of modal variables have a single range of uncertainty assigned to them, but the values within that range of uncertainty are discrete. In order to sample the parameter within the range of uncertainty, the parameter must be sampled at distinct, discrete values.\n", 831 - "\n", 832 - "For example, if I am using the discrete ordinates model (DOM) for radiation calculations, the DOM requires a number of ordinate directions. This is a discrete value with distinct sets of values - e.g. 3, 6, 8, 24, etc.\n", 833 - "\n", 834 - "Together, these discrete values compose a single range of uncertainty. Using the DOM example, that range of uncertainty would be $[3, 24]$." 835 - ] 836 - }, 837 - { 838 - "cell_type": "markdown", 839 - "metadata": {}, 840 - "source": [ 841 - "### N Uncertainty Ranges\n", 842 - "\n", 843 - "The other type of modal variable has several ranges of uncertainty assigned to it, with no restriction on values within those ranges of uncertainty being discrete or distinct. Essentially this can be thought of as a bimodal uncertainty distribution, where the two modes are distinct. Each mode can be sampled as usual; the only sticking point is that there is more than one mode, and that the modes are distinct.\n", 844 - "\n", 845 - "The case of mixing in a chemically reacting system provides an example of how to think about this. 
Suppose we are investigating two mass flowrates of material into a reactor where there is mixing occurring - mass flowrates of 1 kg/s and 2 kg/s. \n", 846 - "\n", 847 - "Each mode also has a range of uncertainty. This can be an absolute range (as in, $\\pm 0.1$ kg/s) or a relative range (as in, $\\pm 5 \\%$)." 848 - ] 849 - }, 850 - { 851 - "cell_type": "markdown", 852 - "metadata": {}, 853 - "source": [ 854 - "### How To Deal\n", 855 - "\n", 856 - "Multimodal variables can be dealt with in two ways:\n", 857 - "\n", 858 - "**Method 1: Separate Response Surfaces for Each Mode**\n", 859 - "\n", 860 - "The first way is to create a separate response surface for each distinct mode. This method works for both types of modal variables (1 uncertainty range represented by N distinct values, and N uncertainty ranges). This method is illustrated in the figures below. Each distinct mode (gray region) has its own computed response surface (blue dotted line), distinct from the response surface of the other modes.\n", 861 - "\n", 862 - "Of course, if the variable type is 1 uncertainty range represented by N distinct values, then there is no uncertainty range for each mode, and each gray region is, in fact, a delta function. As mentioned above, this means that the input variable is eliminated as a response surface parameter.\n", 863 - "\n", 864 - "If the variable type is N uncertainty ranges, then each uncertainty range is sampled as usual, and each response surface is constructed as usual. \n", 865 - "\n", 866 - "**Method 2: Single Response Surface (Ignore Modes)**\n", 867 - "\n", 868 - "A second way is to create a single response surface. This is typically only possible with N uncertainty ranges type of problems, because the parameter value is continuous, but it is only certain regions that are of interest. This approach is illustrated below.\n", 869 - "\n", 870 - "Essentially, this approach does away with any special treatment of modes. 
" 871 - ] 440 + "source": [] 872 441  } 873 442  ], 874 443  "metadata": {
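The ANOVA quantities defined in these notes (TSS, SSE, SSR) can be checked numerically. Below is a minimal sketch, using made-up data rather than anything from the notebooks, which also verifies the identity $TSS = SSR + SSE$ that holds for a least-squares fit with an intercept:

```python
import numpy as np

# Illustrative data (made up): response y observed at settings x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Ordinary least-squares fit of the univariate linear model y = b0 + b1*x
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# Sums of squares, following the definitions in the notes
TSS = np.sum((y - y.mean()) ** 2)      # total sum of squares
SSE = np.sum((y - y_hat) ** 2)         # error sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)  # regression (model) sum of squares

# For least squares with an intercept, the variance decomposition is exact
assert np.isclose(TSS, SSR + SSE)
```

The final assertion is the ANOVA decomposition itself: the total variability splits exactly into a part explained by the regression and a residual part.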

#### + 457 - 0 notebooks/ED_2B_Goodness.ipynb

 @@ -0,0 +1,457 @@ 1 +{ 2 + "cells": [ 3 + { 4 + "cell_type": "markdown", 5 + "metadata": {}, 6 + "source": [ 7 + "# Goodness of Fit Metrics\n", 8 + "\n", 9 + "Link: [http://charlesreid1.com/wiki/Response_Surface_Methodology](http://charlesreid1.com/wiki/Response_Surface_Methodology)\n", 10 + "\n", 11 + "## Analysis of Variance (ANOVA Table)\n", 12 + "\n", 13 + "ANOVA is a statistical technique that builds a table attributing the variance in the system response to various degrees of freedom. Each piece of data about the real system contributes one degree of freedom, which can go toward estimating a new coefficient, or can go toward estimating the variance in the model.\n", 14 + "\n", 15 + "Bibliography note: Mason ch. 6, 8 - derived in 6.1\n", 16 + "\n", 17 + "For ANOVA of linear regression models, need to define a few quantities.\n", 18 + "\n", 19 + "Total sum of squares: \n", 20 + "\n", 21 + "$$\n", 22 + "TSS = \sum \left( y_i - \overline{y} \right)^2\n", 23 + "$$\n", 24 + "\n", 25 + "Error sum of squares:\n", 26 + "\n", 27 + "$$\n", 28 + "SSE = \sum \left( y_i - \hat{y}_i \right)^2\n", 29 + "$$\n", 30 + "\n", 31 + "Model sum of squares (regression sum of squares):\n", 32 + "\n", 33 + "$$\n", 34 + "SSR = \sum \left( \hat{y}_i - \overline{y} \right)^2\n", 35 + "$$\n", 36 + "\n", 37 + "### Univariate Linear Model\n", 38 + "\n", 39 + "(Sample ANOVA table constructed for a linear univariate function)\n", 40 + "\n", 41 + "### Multiple Linear Model\n", 42 + "\n", 43 + "Important not just to assess overall fit of prediction equation\n", 44 + "\n", 45 + "Also important to assess contribution of individual predictor variables to the fit\n", 46 + "\n", 47 + "Many commonly-reported measures of model fitness are part of ANOVA table\n", 48 + "\n", 49 + "Multivariate models: $p$ degrees of freedom for sum of squares due to regression (because $p$ coefficients $\{\beta_1, \beta_2, \dots, \beta_p\}$ must be estimated to obtain the regression sum of squares)\n", 50 + "\n", 51 
+ "(Sample ANOVA table constructed for multiple linear function)\n", 52 + "\n", 53 + "\n", 54 + "\n", 55 + "$p$ is number of predictor variables. To measure the adequacy of fitted model, determine the error standard deviation:\n", 56 + "\n", 57 + "$$\n", 58 + "s_e = (MSE)^{1/2}\n", 59 + "$$\n", 60 + "\n", 61 + "where $MSE = SSE/(n-p-1)$.\n", 62 + "\n", 63 + "Small $s_e$ means predicted responses closely approximate observed responses.\n", 64 + "\n", 65 + "Large $s_e$ means large random error, or poor selection of model form." 66 + ] 67 + }, 68 + { 69 + "cell_type": "markdown", 70 + "metadata": {}, 71 + "source": [ 72 + "## F-Statistic\n", 73 + "\n", 74 + "Main article: F-Statistic\n", 75 + "\n", 76 + "I found this short YouTube video very helpful for illustrating what the F-statistic means physically: [link](http://www.youtube.com/watch?v=TMwSS8DAVYk)\n", 77 + "\n", 78 + "The F-statistic can be thought of as a frequentist metric for hypothesis-testing. Once an F-statistic and corresponding p-value are calculated from the ANOVA table quantities, you can determine how confident you can be in a given hypothesis test (where the hypothesis is the model onto which you've chosen to regress your data).\n", 79 + "\n", 80 + "Mason: Different from tests for significance of factor effects in analysis of designed experiments\n", 81 + "\n", 82 + "Mason: example of acid-content data...\n", 83 + "\n", 84 + "For multiple linear regression/model, can use F-statistic to simultaneously test a hypothesis: whether all $\beta_j = 0$ (the null hypothesis), or whether at least one $\beta_j \neq 0$ (the alternative hypothesis, i.e., the outcome that the null hypothesis is not correct).\n", 85 + "\n", 86 + "$\beta_1 = \beta_2 = \dots = \beta_p = 0$ versus $\beta_j \neq 0$\n", 87 + "\n", 88 + "(While this seems silly, it's much more useful if you're doing this for a subset of coefficients - such as testing the hypothesis of whether any of a subset of coefficients should be non-zero.)\n", 89 + "\n", 90 
+ "### Lack of Fit F-Statistic\n", 91 + "\n", 92 + "For (deterministic) computer simulations (rather than experiments, which have random error), error is entirely due to lack of fit - there is no source of random error from real instruments or real experimental noise.\n", 93 + "\n", 94 + "In this case, an F-statistic specifically for lack-of-fit is not possible to calculate (defined as $MSE_{LOF}/MSE_{P}$, and $MSE_P = 0$ for deterministic functions)\n", 95 + "\n", 96 + "### Partial F-Test (Determination of Term Significance)\n", 97 + "\n", 98 + "Consider a full regression model with $p$ predictor variables.\n", 99 + "\n", 100 + "Now consider a reduced regression model with $k < p$ predictor variables.\n", 101 + "\n", 102 + "Full model = $M_1$\n", 103 + "\n", 104 + "Reduced model = $M_2$\n", 105 + "\n", 106 + "Reduction in error sum of squares resulting from fit of additional terms in full model:\n", 107 + "\n", 108 + "$$\n", 109 + "R(M_1 \vert M_2) = SSE_2 - SSE_1\n", 110 + "$$\n", 111 + "\n", 112 + "The full model has $(p - k)$ more predictor variables than the reduced model\n", 113 + "\n", 114 + "F-statistic for determining statistical significance of this subset is:\n", 115 + "\n", 116 + "$$\n", 117 + "F = \frac{ MSR(M_1 \vert M_2) }{ MSE_1 }\n", 118 + "$$\n", 119 + "\n", 120 + "where $MSR(M_1 \vert M_2) = \frac{ R(M_1 \vert M_2) }{ (p-k) }$\n", 121 + "\n", 122 + "If $k = p-1$ then the F-statistic is the square of the t-statistic from the full model corresponding to the term left out of the reduced model\n", 123 + "\n", 124 + "To determine if the F-statistic is highly significant, use the p-value (significant at the 95% confidence level if $p < 0.05$, at the 99% level if $p < 0.01$)\n", 125 + "\n", 126 + "Example: use to determine if interaction effect is important to surrogate model.\n", 127 + "\n", 128 + "Using this procedure and using the t-statistic to test the significance of a given term are equivalent!" 
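The partial F-test above can be sketched numerically. This is an illustrative example only (made-up data, and `scipy` assumed available): a full model with an interaction term is compared against a reduced model without it, using $F = \frac{(SSE_2 - SSE_1)/(p-k)}{MSE_1}$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative data: n observations of two predictors and a response
n = 30
x1 = rng.uniform(0.0, 1.0, n)
x2 = rng.uniform(0.0, 1.0, n)
y = 1.0 + 2.0 * x1 + 0.5 * x1 * x2 + rng.normal(0.0, 0.1, n)

def sse(X, y):
    """Error sum of squares from a least-squares fit of y on columns of X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ b) ** 2)

# Reduced model M2 (k = 2 predictors) vs. full model M1 (p = 3, adds x1*x2)
X_red  = np.column_stack([np.ones(n), x1, x2])
X_full = np.column_stack([np.ones(n), x1, x2, x1 * x2])
p, k = 3, 2

SSE1, SSE2 = sse(X_full, y), sse(X_red, y)
MSE1 = SSE1 / (n - p - 1)
F = ((SSE2 - SSE1) / (p - k)) / MSE1       # F = MSR(M1|M2) / MSE1
p_value = stats.f.sf(F, p - k, n - p - 1)  # upper-tail p-value for H0
```

A p-value below 0.05 would indicate, at the 95% confidence level, that the interaction term belongs in the surrogate model.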
129 + ] 130 + }, 131 + { 132 + "cell_type": "markdown", 133 + "metadata": {}, 134 + "source": [ 135 + "## T-Statistic\n", 136 + "\n", 137 + "**TODO:** \n", 138 + "\n", 139 + "Fix from here\n", 140 + "\n", 141 + "**Linear Univariate Model:**\n", 142 + "\n", 143 + "A t-statistic can be constructed to test $\beta_1 = c$ vs. $\beta_1 \neq c$\n", 144 + "\n", 145 + "The following statistic has a Student t-distribution with $n-2$ degrees of freedom:\n", 146 + "\n", 147 + "$$\n", 148 + "\begin{array}{rcl} t &=& \frac{ b_1 - \beta_1 }{ s_e / s_{xx}^{1/2} } \\ s_{xx} &=& \sum \left( x_i - \overline{x} \right)^2 \end{array}\n", 149 + "$$\n", 150 + "\n", 151 + "and insert $\beta_1 = c$.\n", 152 + "\n", 153 + "If you insert $c = 0$ and square the result you get the F-statistic from the ANOVA table.\n", 154 + "\n", 155 + "This t-variate can be used to form confidence intervals on the slope parameter.\n", 156 + "\n", 157 + "Following Chapter 2.4: limits for $\beta_1$ are\n", 158 + "\n", 159 + "$$\n", 160 + "b_1 - t_{\alpha / 2} \frac{s_e}{s_{xx}^{1/2}} \leq \beta_1 \leq b_1 + t_{\alpha / 2} \frac{s_e}{s_{xx}^{1/2}}\n", 161 + "$$\n", 162 + "\n", 163 + "where $t_{\alpha / 2}$ is a $100(\alpha / 2)\%$ upper-tail t critical value with $n-2$ degrees of freedom\n", 164 + "\n", 165 + "Small model standard deviations will lead to small confidence intervals\n", 166 + "\n", 167 + "For the intercept parameter $\beta_0$, use the following t-variate:\n", 168 + "\n", 169 + "$$\n", 170 + "t = \frac{ b_0 - \beta_0 }{ s_e \left( n^{-1} + \overline{x}^2 / s_{xx} \right)^{1/2} }\n", 171 + "$$\n", 172 + "\n", 173 + "**Multiple Linear Model:**\n", 174 + "\n", 175 + "Testing hypotheses on individual regression coefficients is of primary interest to someone performing regression analysis\n", 176 + "\n", 177 + "The t-statistic can be constructed to test $\beta_j = c$ versus $\beta_j \neq c$.\n", 178 + "\n", 179 + "Test statistic used for this purpose is:\n", 180 + "\n", 181 + 
"$$\n", 182 + "t = \frac{ b_j - c }{ s_e c_{jj}^{1/2} }\n", 183 + "$$\n", 184 + "\n", 185 + "where $s_e = (MSE)^{1/2}$ is the estimated error standard deviation.\n", 186 + "\n", 187 + "$$\n", 188 + "c_{jj} = \left[ (n - 1) s_j^2 (1 - R_j^2) \right]^{-1}\n", 189 + "$$\n", 190 + "\n", 191 + "$s_j^2$ is the sample variance of the n values of the jth predictor variable\n", 192 + "\n", 193 + "$R_j^2$ is the coefficient of determination for regression of $x_j$ on the constant term and the other $p-1$ predictor variables.\n", 194 + "\n", 195 + "A t-statistic with $c = 0$ can be used to test statistical significance of individual model parameters (usefulness of $x_j$ as predictor of response variable)\n", 196 + "\n", 197 + "NOTE: this test is only conditional, since $\beta_j$ is partial regression coefficient, and $b_j$, $c_{jj}$ are functions of other predictor variable values\n", 198 + "\n", 199 + "Only determines significance of jth predictor variable conditional on the presence of the other predictor variables\n", 200 + "\n", 201 + "e.g. 
\"Each individual predictor variable contributes significantly to the given fits, given that the other two predictor variables are also included in the model\" " 202 + ] 203 + }, 204 + { 205 + "cell_type": "markdown", 206 + "metadata": {}, 207 + "source": [ 208 + "## Response Confidence Intervals\n", 209 + "\n", 210 + "Want confidence intervals for the response model $\hat{y}$.\n", 211 + "\n", 212 + "**Linear Univariate Models:**\n", 213 + "\n", 214 + "Confidence interval constructed for the response model:\n", 215 + "\n", 216 + "$$\n", 217 + "\mu = \beta_0 + \beta_1 x\n", 218 + "$$\n", 219 + "\n", 220 + "Mason: for fixed values of x...???\n", 221 + "\n", 222 + "The predicted response $\hat{y}$ has a normal distribution (given certain assumptions).\n", 223 + "\n", 224 + "Thus mean and deviation given by:\n", 225 + "\n", 226 + "$$\n", 227 + "\mu = \beta_0 + \beta_1 x\n", 228 + "$$\n", 229 + "\n", 230 + "$$\n", 231 + "\sigma \left[ a_1 + \left( x - a_2 \right)^2 / s_{xx} \right]^{1/2}\n", 232 + "$$\n", 233 + "\n", 234 + "where:\n", 235 + "\n", 236 + "$$\n", 237 + "a_1 = \frac{1}{n}\n", 238 + "$$\n", 239 + "\n", 240 + "$$\n", 241 + "a_2 = \overline{x}\n", 242 + "$$\n", 243 + "\n", 244 + "$$\n", 245 + "s_{xx} = \sum \left( x_i - \overline{x} \right)^2\n", 246 + "$$\n", 247 + "\n", 248 + "And the following t-variate can be used to construct confidence intervals for $\mu$:\n", 249 + "\n", 250 + "$$\n", 251 + "t = \frac{ \hat{y} - \mu }{ s_e \left( a_1 + \left( x - a_2 \right)^2 / s_{xx} \right)^{1/2} }\n", 252 + "$$\n", 253 + "\n", 254 + "To form prediction interval for actual future response, not expected value of a response.\n", 255 + "\n", 256 + "Use this equation again, but replace $\mu$ with $y_f$, and $a_1$ with $a_1 + 1$.\n", 257 + "\n", 258 + "$y_f$ is the future response\n", 259 + "\n", 260 + "$\hat{y}$ is the predicted value of future response\n", 261 + "\n", 262 + "$a_1 \rightarrow a_1 + 1$ is because future response has standard deviation 
$\sigma$ with an added variability\n", 263 + "\n", 264 + "Standard deviation of $\hat{y} - y_f$ is:\n", 265 + "\n", 266 + "$$\n", 267 + "\sigma \left( 1 + a_1 + \left( x - a_2 \right)^2 / s_{xx} \right)^{1/2}\n", 268 + "$$\n", 269 + "\n", 270 + "**Multiple Linear Model:**\n", 271 + "\n", 272 + "Confidence interval for regression coefficients of multiple linear model:\n", 273 + "\n", 274 + "A $100(1-\alpha)\%$ confidence interval for $\beta_j$ is given by:\n", 275 + "\n", 276 + "$$\n", 277 + "b_j - t_{\alpha/2} s_e c_{jj}^{1/2} \leq \beta_j \leq b_j + t_{\alpha/2} s_e c_{jj}^{1/2}\n", 278 + "$$\n", 279 + "\n", 280 + "where $t_{\alpha/2}$ is two-tailed $100 \alpha \%$ t critical value having $n - p - 1$ degrees of freedom.\n", 281 + "\n", 282 + "Simultaneous confidence intervals for all coefficients in multiple linear regression model cannot be computed using individual coefficient intervals.\n", 283 + "\n", 284 + "They ignore systematic variation of predictor variables and consequent correlation among coefficient estimators." 
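The coefficient t-statistics and confidence intervals above can be sketched as follows. This is a minimal illustration with made-up data (`scipy` assumed available); $c_{jj}$ is computed as the $j$th diagonal element of $(X^T X)^{-1}$, which for a model with an intercept is equivalent to the $\left[(n-1) s_j^2 (1-R_j^2)\right]^{-1}$ form given in the notes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative data: n observations, p = 2 predictors plus an intercept
n, p = 25, 2
X = np.column_stack([np.ones(n), rng.uniform(0.0, 1.0, (n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0.0, 0.2, n)

# Least-squares coefficient estimates b_j
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
MSE = np.sum(resid ** 2) / (n - p - 1)
s_e = np.sqrt(MSE)                   # estimated error standard deviation

# c_jj: diagonal of (X'X)^{-1}
c = np.diag(np.linalg.inv(X.T @ X))

# t-statistics for H0: beta_j = 0, and 95% confidence intervals
t_stats = b / (s_e * np.sqrt(c))
t_crit = stats.t.ppf(0.975, n - p - 1)   # two-tailed 5% critical value
lower = b - t_crit * s_e * np.sqrt(c)
upper = b + t_crit * s_e * np.sqrt(c)
```

As the notes caution, each interval is conditional on the other predictors being in the model; the three intervals together do not form a simultaneous confidence region.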
285 + ] 286 + }, 287 + { 288 + "cell_type": "markdown", 289 + "metadata": {}, 290 + "source": [ 291 + "## Correlation Coefficient (R Squared)\n", 292 + "\n", 293 + "A measure of correlation between observed and predicted responses.\n", 294 + "\n", 295 + "**Univariate Linear Model:**\n", 296 + "\n", 297 + "$$\n", 298 + "R^2 = \left[ \mathrm{corr}(y,\hat{y}) \right]^2 = 1 - \frac{SSE}{TSS}\n", 299 + "$$\n", 300 + "\n", 301 + "**Multiple Linear Model:**\n", 302 + "\n", 303 + "R-squared can be calculated as:\n", 304 + "\n", 305 + "$$\n", 306 + "R^2 = 1 - \frac{SSE}{TSS}\n", 307 + "$$\n", 308 + "\n", 309 + "It should be acknowledged that as the number of predictor variables approaches the number of observations, $R^2$ can become arbitrarily close to 1:\n", 310 + "\n", 311 + "if $n = p+1$, then $R^2 = 1$ exactly.\n", 312 + "\n", 313 + "Adjusted $R^2$, denoted $R_a^2$:\n", 314 + "\n", 315 + "$$\n", 316 + "R_a^2 = 1 - a \frac{SSE}{TSS}\n", 317 + "$$\n", 318 + "\n", 319 + "where\n", 320 + "\n", 321 + "$$\n", 322 + "a = \frac{ n-1 }{ n - p - 1 }\n", 323 + "$$\n", 324 + "\n", 325 + "Differences between $R^2$ and $R_a^2$ are minor except when $n$ and $p$ are close.\n", 326 + "\n", 327 + "Caution should be used in relying on a single measure of fit (e.g. $R^2$)." 328 + ] 329 + }, 330 + { 331 + "cell_type": "markdown", 332 + "metadata": {}, 333 + "source": [ 334 + "### Contour Plots\n", 335 + "\n", 336 + "Contour plots can be used to determine sensitivities: if the response $y$ changes significantly in one parameter direction, it is sensitive to that parameter. If the contour shows a structure that is uniform in one parameter direction, the response is not sensitive to that parameter.\n", 337 + "\n", 338 + "For multiple responses, a contour plot for each response can be made, infeasible regions shaded gray, and the plots overlaid to yield the feasible region. 
" 339 + ] 340 + }, 341 + { 342 + "cell_type": "markdown", 343 + "metadata": {}, 344 + "source": [ 345 + "### Determination of Outlier Data Points\n", 346 + "\n", 347 + "Mason Ch. 18\n", 348 + "\n", 349 + "Various test for outliers (p. 629, 631, etc.)\n", 350 + "\n", 351 + "Tests for response variable outliers\n", 352 + "\n", 353 + "Tests for predictor variable outliers " 354 + ] 355 + }, 356 + { 357 + "cell_type": "markdown", 358 + "metadata": {}, 359 + "source": [ 360 + "## Other Things to Look At\n", 361 + "\n", 362 + "* Correlation between input variables - e.g., for time and temperature: $\\{t, T, tT, t^2, T^2\\}$" 363 + ] 364 + }, 365 + { 366 + "cell_type": "markdown", 367 + "metadata": {}, 368 + "source": [ 369 + "# Issues\n", 370 + "\n", 371 + "## Importance of Interaction Terms\n", 372 + "\n", 373 + "We mentioned above that screening and highly fractionated factorial designs tend to ignore interaction effects by aliasing them with main effects.\n", 374 + "\n", 375 + "This important note from Mason indicates that this can be avoided by only including the colinear interaction terms that are most likely to be significant:\n", 376 + "\n", 377 + "
\n", 378 + "Ordinarily, one should not routinely insert products of all the predictors in a regression model. To do so might create unnecessary complications in the analysis and interpretation of the fitted models due to collinear predictor variables. The purpose of including interaction terms in regression models is to improve the fit either because theoretical considerations require a modeling of the joint effects or because an analysis of the regression data indicates that joint effects are needed in addition to the linear terms of the individual variables.\n", 379 + "
" 380 + ] 381 + }, 382 + { 383 + "cell_type": "markdown", 384 + "metadata": {}, 385 + "source": [ 386 + "## Dealing with Multimodal Variables\n", 387 + "\n", 388 + "Sometimes, when constructing response surfaces, modal variables appear. Modal variables are variables that have multiple modes, or distinct sets of values. There are two variations of modal variables:\n", 389 + "\n", 390 + "* One uncertainty range\n", 391 + "* N uncertainty range\n", 392 + "\n", 393 + "### One Uncertainty Range\n", 394 + "\n", 395 + "These types of modal variables have a single range of uncertainty assigned to them, but the values within that range of uncertainty are discrete. In order to sample the parameter within the range of uncertainty, the parameter must be sampled at distinct, discrete values.\n", 396 + "\n", 397 + "For example, if I am using the discrete ordinates model (DOM) for radiation calculations, the DOM requires a number of ordinate directions. This is a discrete value with distinct sets of values - e.g. 3, 6, 8, 24, etc.\n", 398 + "\n", 399 + "Each discrete value in this case composes a single range of uncertainty. Using the DOM example, that range of uncertainty would be $[3, 24]$.\n", 400 + "\n", 401 + "### N Uncertainty Ranges\n", 402 + "\n", 403 + "The other type of modal variables have several ranges of uncertainty assigned to them, with no restriction on values within that range of uncertainty being discrete or distinct. Essentially this can be thought of as a bimodal uncertainty distribution, where the two modes are distinct. Each mode can be sampled as usual, the only sticking point is that there is more than 1, and that they are distinct.\n", 404 + "\n", 405 + "The case of mixing in a chemically reacting system provides an example of how to think about this. Suppose we are investigating two mass flowrates of material into a reactor where there is mixing occurring - mass flowrates of 1 kg/s and 2 kg/s. 
\n", 406 + "\n", 407 + "Each mode also has a range of uncertainty. This can be an absolute range (as in, $\\pm 0.1$ kg/s) or a relative range (as in, $\\pm 5 \\%$).\n", 408 + "\n", 409 + "### How To Deal\n", 410 + "\n", 411 + "Multimodal variables can be dealt with in two ways:\n", 412 + "\n", 413 + "**Method 1: Separate Response Surfaces for Each Mode**\n", 414 + "\n", 415 + "The first way is to create a separate response surface for each distinct mode. This method works for both types of modal variables (1 uncertainty range represented by N distinct values, and N uncertainty ranges). This method is illustrated in the figures below. Each distinct mode (gray region) has its own computed response surface (blue dotted line), distinct from the response surface of the other modes.\n", 416 + "\n", 417 + "Of course, if the variable type is 1 uncertainty range represented by N distinct values, then there is no uncertainty range for each mode, and each gray region is, in fact, a delta function. As mentioned above, this means that the input variable is eliminated as a response surface parameter.\n", 418 + "\n", 419 + "If the variable type is N uncertainty ranges, then each uncertainty range is sampled as usual, and each response surface is constructed as usual. \n", 420 + "\n", 421 + "**Method 2: Single Response Surface (Ignore Modes)**\n", 422 + "\n", 423 + "A second way is to create a single response surface. This is typically only possible with N uncertainty ranges type of problems, because the parameter value is continuous, but it is only certain regions that are of interest. This approach is illustrated below.\n", 424 + "\n", 425 + "Essentially, this approach does away with any special treatment of modes. 
" 426 + ] 427 + }, 428 + { 429 + "cell_type": "code", 430 + "execution_count": null, 431 + "metadata": {}, 432 + "outputs": [], 433 + "source": [] 434 + } 435 + ], 436 + "metadata": { 437 + "kernelspec": { 438 + "display_name": "Python 3", 439 + "language": "python", 440 + "name": "python3" 441 + }, 442 + "language_info": { 443 + "codemirror_mode": { 444 + "name": "ipython", 445 + "version": 3 446 + }, 447 + "file_extension": ".py", 448 + "mimetype": "text/x-python", 449 + "name": "python", 450 + "nbconvert_exporter": "python", 451 + "pygments_lexer": "ipython3", 452 + "version": "3.6.3" 453 + } 454 + }, 455 + "nbformat": 4, 456 + "nbformat_minor": 2 457 +}

#### + 515 - 0 notebooks/ED_2C_LeastSquaresResponseSurface.ipynb View File

 @@ -0,0 +1,515 @@ 1 +{ 2 + "cells": [ 3 + { 4 + "cell_type": "markdown", 5 + "metadata": {}, 6 + "source": [ 7 + "# Least Squares for Response Surface Work\n", 8 + "\n", 9 + "Link: [http://charlesreid1.com/wiki/Empirical_Model-Building_and_Response_Surfaces#Chapter_3:_Least_Squares_for_Response_Surface_Work](http://charlesreid1.com/wiki/Empirical_Model-Building_and_Response_Surfaces#Chapter_3:_Least_Squares_for_Response_Surface_Work)" 10 + ] 11 + }, 12 + { 13 + "cell_type": "markdown", 14 + "metadata": {}, 15 + "source": [ 16 + "## Method of Least Squares\n", 17 + "\n", 18 + "Least squares is used to fit a model of the form:\n", 19 + "\n", 20 + "$$\n", 21 + "y = f(x,t) + \epsilon\n", 22 + "$$\n", 23 + "\n", 24 + "where:\n", 25 + "\n", 26 + "$$\n", 27 + "E(y) = \eta = f(x,t)\n", 28 + "$$\n", 29 + "\n", 30 + "is the mean level of the response $y$, which is affected by $k$ variables $(x_1, x_2, ..., x_k) = \mathbf{x}$\n", 31 + "\n", 32 + "It also involves $p$ parameters $(t_1, t_2, ..., t_p) = \mathbf{t}$\n", 33 + "\n", 34 + "$\epsilon$ is experimental error\n", 35 + "\n", 36 + "To examine this model, experiments would be run at $n$ different sets of conditions, $x_1, x_2, ..., x_n$\n", 37 + "\n", 38 + "We would then observe the corresponding values of the response, $y_1, y_2, ..., y_n$\n", 39 + "\n", 40 + "Two important questions:\n", 41 + "\n", 42 + "1. Does the postulated model accurately represent the data?\n", 43 + "\n", 44 + "2. 
If the model does accurately represent the data, what are the best estimates of the parameters $\mathbf{t}$?\n", 45 + "\n", 46 + "Start with the second question first.\n", 47 + "\n", 48 + "-----\n", 49 + "\n", 50 + "Given: function $f(x,t)$ for each experimental run\n", 51 + "\n", 52 + "$n$ discrepancies:\n", 53 + "\n", 54 + "$$\n", 55 + "{y_1 - f(x_1,t)}, {y_2 - f(x_2,t)}, ..., {y_n - f(x_n,t)}\n", 56 + "$$\n", 57 + "\n", 58 + "The method of least squares selects the values of $\mathbf{t}$ that make the sum of squares smallest:\n", 59 + "\n", 60 + "$$\n", 61 + "S(t) = \sum_{u=1}^{n} \left[ y_u - f \left( x_u, t \right) \right]^2\n", 62 + "$$\n", 63 + "\n", 64 + "$S(t)$ is the sum of squares function\n", 65 + "\n", 66 + "The minimizing choice of $t$ is denoted\n", 67 + "\n", 68 + "$$\n", 69 + "\hat{t}\n", 70 + "$$\n", 71 + "\n", 72 + "Are least-squares estimates of $t$ good?\n", 73 + "\n", 74 + "Their goodness depends on the nature of the distribution of their errors\n", 75 + "\n", 76 + "Least-squares estimates are appropriate if you can assume that the experimental errors:\n", 77 + "\n", 78 + "$$\n", 79 + "\epsilon_u = y_u - \eta_u\n", 80 + "$$\n", 81 + "\n", 82 + "are statistically independent, have constant variance, and are normally distributed\n", 83 + "\n", 84 + "These are the \"standard assumptions\"" 85 + ] 86 + }, 87 + { 88 + "cell_type": "markdown", 89 + "metadata": {}, 90 + "source": [ 91 + "## Linear models\n", 92 + "\n", 93 + "This is a limiting case, where\n", 94 + "\n", 95 + "$$\n", 96 + "\eta = f(x,t) = t_1 z_1 + t_2 z_2 + ... + t_p z_p\n", 97 + "$$\n", 98 + "\n", 99 + "adding experimental error $\epsilon = y - \eta$:\n", 100 + "\n", 101 + "$$\n", 102 + "y = t_1 z_1 + t_2 z_2 + ... 
+ t_p z_p + \\epsilon\n", 103 + "$$\n", 104 + "\n", 105 + "model of this form is linear in the parameters\n", 106 + "\n", 107 + "### Algorithm\n", 108 + "\n", 109 + "Formulate a problem with $n$ observed responses, $p$ parameters...\n", 110 + "\n", 111 + "This yields $n$ equations of the form:\n", 112 + "\n", 113 + "$$\n", 114 + "y_1 = t_1 z_{11} + t_2 z_{21} + ... \\\\\n", 115 + "y_2 = t_1 z_{21} + t_2 z_{22} + ...\n", 116 + "$$\n", 117 + "\n", 118 + "etc...\n", 119 + "\n", 120 + "This can be written in matrix form:\n", 121 + "\n", 122 + "$$\n", 123 + "\\mathbf{y} = \\mathbf{Z t} + \\boldsymbol{\\epsilon}\n", 124 + "$$\n", 125 + "\n", 126 + "and the dimensions of each matrix are:\n", 127 + "\n", 128 + "* $y = n \\times 1$\n", 129 + "* $Z = n \\times p$\n", 130 + "* $t = p \\times 1$\n", 131 + "* $\\epsilon = n \\times 1$\n", 132 + "\n", 133 + "the sum of squares function is given by:\n", 134 + "\n", 135 + "$$\n", 136 + "S(\\mathbf{t}) = \\sum_{u=1}^{n} \\left( y_u - t_1 z_{1u} - t_2 z_{2u} - ... - t_p z_{pu} \\right)^2\n", 137 + "$$\n", 138 + "\n", 139 + "or,\n", 140 + "\n", 141 + "$$\n", 142 + "S(t) = ( y - Zt )^{\\prime} ( y - Zt )\n", 143 + "$$\n", 144 + "\n", 145 + "this can be rewritten as:\n", 146 + "\n", 147 + "$$\n", 148 + "\\mathbf{ Z^{\\prime} Z t = Z^{\\prime} y }\n", 149 + "$$" 150 + ] 151 + }, 152 + { 153 + "cell_type": "markdown", 154 + "metadata": {}, 155 + "source": [ 156 + "### Rank of Z\n", 157 + "\n", 158 + "If there are relationships between the different input parameters $(z's)$, then the matrix $\\mathbf{Z}$ can become singular\n", 159 + "\n", 160 + "e.g. 
if there is a relationship $z_2 = c z_1$, then you can only estimate the linear combination $t_1 + c t_2$ \n", 161 + "\n", 162 + "reason: when $z_2 = c z_1$, changes in $z_1$ can't be distinguished from changes in $z_2$\n", 163 + "\n", 164 + "$Z$ (an $n \times p$ matrix) is said to be of full rank $p$ if there are no linear relationships of the form:\n", 165 + "\n", 166 + "$$\n", 167 + "a_1 z_1 + a_2 z_2 + ... + a_p z_p = 0\n", 168 + "$$\n", 169 + "\n", 170 + "if there are $q > 0$ independent linear relationships, then $Z$ has rank $p - q$" 171 + ] 172 + }, 173 + { 174 + "cell_type": "markdown", 175 + "metadata": {}, 176 + "source": [ 177 + "## Analysis of Variance: 1 regressor\n", 178 + "\n", 179 + "Assume simple model $y = \beta + \epsilon$\n", 180 + "\n", 181 + "This states that $y$ is varying about an unknown mean $\beta$\n", 182 + "\n", 183 + "Suppose we have 3 observations of $y$, $\mathbf{y} = (4, 1, 1)'$\n", 184 + "\n", 185 + "Then the model can be written as $y = z_1 t + \epsilon$\n", 186 + "\n", 187 + "and $z_1 = (1, 1, 1) '$\n", 188 + "\n", 189 + "and $t = \beta$\n", 190 + "\n", 191 + "so that\n", 192 + "\n", 193 + "$$\n", 194 + "\begin{bmatrix} 4 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} t + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{bmatrix}\n", 195 + "$$\n", 196 + "\n", 199 + "Supposing the linear model posited a value of the parameter $t$, e.g. $t_0 = 0.5$\n", 200 + "\n", 201 + "Then you could check the null hypothesis, e.g. $H_0 : t = t_0 = 0.5$\n", 202 + "\n", 203 + "If true, the mean observation vector is given by $\eta_0 = z_1 t_0$\n", 204 + "\n", 205 + "or,\n", 206 + "\n", 207 + "$$\n", 208 + "\begin{bmatrix} 0.5 \\ 0.5 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} 0.5\n", 209 + "$$\n", 210 + "\n", 213 + "and the appropriate \"observation breakdown\" (the decomposition of the observation vector) 
is:\n", 214 + "\n", 215 + "$$\n", 216 + "y - \eta_0 = ( \hat{y} - \eta_0 ) + ( y - \hat{y} )\n", 217 + "$$\n", 218 + "\n", 219 + "Associated with this observation breakdown is an analysis of variance table:\n", 220 + "\n", 221 + "| Source | Degrees of freedom (df) | Sum of squares (square of length), SS | Mean square, MS | Expected value of mean square, E(MS) |\n", 222 + "|---|---|---|---|---|\n", 223 + "| Model | 1 | $\vert \hat{y} - \eta_0 \vert^2 = ( \hat{t} - t_0 )^2 \sum z_1^2$ | 6.75 | $\sigma^2 + ( t - t_0 )^2 \sum z_1^2$ |\n", 224 + "| Residual | 2 | $\vert y - \hat{y} \vert^2 = \sum ( y - \hat{t} z_1 )^2$ | 3.00 | $\sigma^2$ |\n", 225 + "| Total | 3 | $\vert y - \eta_0 \vert^2 = \sum ( y - \eta_0 )^2 = 12.75$ | | |" 248 + ] 249 + }, 250 + { 251 + "cell_type": "markdown", 252 + "metadata": {}, 253 + "source": [ 254 + "Sum of squares: squared lengths of vectors\n", 255 + "\n", 256 + "Degrees of freedom: number of dimensions in which vector can move (geometric interpretation)\n", 257 + "\n", 258 + "The model $y = z_1 t + \epsilon$ says whatever the data is, the systematic part $\hat{y} - \eta_0 = ( \hat{t} - t_0) z_1$ of $y - \eta_0$ must lie in the direction of $z_1$, which gives $\hat{y} - \eta_0$ only one degree of freedom.\n", 259 + "\n", 260 + "Whatever the data, the residual vector must be perpendicular to $z_1$ (the normal equations force $z_1^{\prime} ( y - \hat{y} ) = 0$), and so it can move in 2 directions and has 2 degrees of freedom\n", 261 + "\n", 262 + "Now, looking at the null hypothesis: \n", 263 + "\n", 264 + "The component $\vert \hat{y} - \eta_0 \vert^2 = ( \hat{t} - t_0 )^2 \sum z_1^2$ is a measure of discrepancy between POSTULATED model $\eta_0 = z_1 t_0$ and ESTIMATED model $\hat{y} = z_1 \hat{t}$\n", 265 + "\n", 266 + "Making \"standard 
assumptions\" (earlier), the expected value of the sum of squares, assuming the model is true, is $( t - t_0 )^2 \sum z_1^2 + \sigma^2$\n", 267 + "\n", 268 + "For the residual component it is $2 \sigma^2$ (or, in general, $\nu_2 \sigma^2$, where $\nu_2$ is the number of degrees of freedom of the residuals)\n", 269 + "\n", 270 + "Thus a measure of discrepancy from the null hypothesis $t = t_0$ is $F = \frac{ \vert \hat{y} - \eta_0 \vert^2 / 1 }{ \vert y - \hat{y} \vert^2 / 2 }$\n", 271 + "\n", 272 + "if the null hypothesis were true, then the numerator and denominator would both estimate the same $\sigma^2$\n", 273 + "\n", 274 + "So if $F$ is different from 1, that indicates departure from the null hypothesis\n", 275 + "\n", 276 + "The MORE $F$ differs from 1, the more doubtful the null hypothesis becomes" 277 + ] 278 + }, 279 + { 280 + "cell_type": "markdown", 281 + "metadata": {}, 282 + "source": [ 283 + "## Least squares: 2 regressors\n", 284 + "\n", 285 + "The previous model, $y = \beta + \epsilon$, said $y$ was represented by a mean $\beta$ plus an error.\n", 286 + "\n", 287 + "Instead, suppose that there are systematic deviations from the mean, associated with an external variable (e.g. humidity in the lab).\n", 288 + "\n", 289 + "Now the equation is for a straight line: $y = \beta_0 + \beta_1 x + \epsilon$\n", 290 + "\n", 291 + "or, $y = z_1 t_1 + z_2 t_2 + \epsilon$\n", 292 + "\n", 293 + "So now the revised least-squares model is: $\eta = z_1 t_1 + z_2 t_2$\n", 294 + "\n", 295 + "$\eta = E(y)$ - i.e. 
$\eta$ is in the plane defined by linear combinations of vectors $z_1, z_2$\n", 296 + "\n", 297 + "because $z_1^{\prime} z_2 = \sum z_1 z_2 \neq 0$, these two vectors are NOT at right angles\n", 298 + "\n", 299 + "The least-squares values $\hat{t_1}, \hat{t_2}$ produce a vector $\hat{\hat{y}} = z_1 \hat{t_1} + z_2 \hat{t_2}$\n", 300 + "\n", 301 + "These least-squares values make the squared length $\sum ( y - \hat{\hat{y}} )^2 = \vert y - \hat{\hat{y}} \vert^2$ of the residual vector as small as possible\n", 302 + "\n", 303 + "The normal equations express the fact that the residual vector must be perpendicular to both $z_1$ and $z_2$:\n", 304 + "\n", 305 + "$$\n", 306 + "z_1^{\prime} ( y - \hat{\hat{y}} ) = 0 \\\n", 307 + "z_2^{\prime} ( y - \hat{\hat{y}} ) = 0\n", 308 + "$$\n", 309 + "\n", 310 + "also written as:\n", 311 + "\n", 312 + "\n", 313 + "\begin{align}\n", 314 + "\sum z_1 ( y - \hat{t_1} z_1 - \hat{t_2} z_2 ) &= 0 \\\n", 315 + "\sum z_2 ( y - \hat{t_1} z_1 - \hat{t_2} z_2 ) &= 0\n", 316 + "\end{align}\n", 317 + "\n", 318 + "\n", 319 + "also written (in matrix form) as:\n", 320 + "\n", 321 + "$$\n", 322 + "\mathbf{Z^{\prime}} ( \mathbf{y - Z \hat{t} } ) = 0\n", 323 + "$$\n", 324 + "\n", 325 + "\n", 326 + "\n", 327 + "Now suppose the null hypothesis was investigated for $t_1 = t_{10} = 0.5$ and $t_2 = t_{20} = 1.0$\n", 328 + "\n", 329 + "Then the mean observation vector $\eta_0$ is represented as $\eta_0 = t_{10} z_1 + t_{20} z_2$\n", 330 + "\n", 331 + "$$\n", 332 + "y - \eta_0 = \left( \hat{\hat{y}} - \eta_0 \right) + \left( y - \hat{\hat{y}} \right)\n", 333 + "$$\n", 334 + "\n", 335 + "and so\n", 336 + "\n", 337 + "$$F_0 = \frac{ \vert \hat{\hat{y}} - \eta_0 \vert^2 / 2 }{ \vert y - \hat{\hat{y}} \vert^2 / 1 } = 2.23\n", 338 + "$$" 339 + ] 340 + }, 341 + { 342 + "cell_type": "markdown", 343 + "metadata": {}, 344 + "source": [ 345 + "## Orthogonalizing second regressor\n", 346 + "\n", 347 + "In the above 
example, $z_1$ and $z_2$ are not orthogonal\n", 348 + "\n", 349 + "One can find the vectors $z_1$ and $z_{2 \cdot 1}$ that are orthogonal\n", 350 + "\n", 351 + "To do this, use the least-squares property that the residual vector is orthogonal to the space in which the predictor variables lie\n", 352 + "\n", 353 + "Regard $z_2$ as the \"response\" vector and $z_1$ as the predictor variable\n", 354 + "\n", 355 + "You then obtain $\hat{z_2} = 0.2 z_1$ (by regressing $z_2$ on $z_1$: $\hat{z_2} = \left( z_1^{\prime} z_2 / z_1^{\prime} z_1 \right) z_1$)\n", 356 + "\n", 357 + "so the residual vector is $z_{2 \cdot 1} = z_2 - \hat{z_2} = z_2 - 0.2 z_1$\n", 358 + "\n", 359 + "now the model can be rewritten as $\eta = \left( t_1 + 0.2 t_2 \right) z_1 + t_2 \left( z_2 - 0.2 z_1 \right) = t z_1 + t_2 z_{2 \cdot 1}$, where $t = t_1 + 0.2 t_2$\n", 360 + "\n", 361 + "This gives three least-squares equations:\n", 362 + "\n", 363 + "1. $\hat{y} = 2 z_1$\n", 364 + "2. $\hat{y} = 1.5 z_1 + 2.5 z_2$\n", 365 + "3. $\hat{y} = 2.0 z_1 + 2.5 z_{2 \cdot 1}$\n", 366 + "\n", 367 + "The analysis of variance becomes:\n", 368 + "\n", 369 + "| Source | DoF | Sum of Squares (SS) |\n", 370 + "|---|---|---|\n", 371 + "| Response function with $z_1$ only | 1 | $\vert \hat{y} - \eta_0 \vert^2 = \left( \hat{t} - t_0 \right)^2 \sum z_1^2 = 12.0$ |\n", 372 + "| Extra due to $z_2$ (given $z_1$) | 1 | $\vert \hat{\hat{y}} - \hat{y} \vert^2 = \hat{t}_2^2 \sum z_{2 \cdot 1}^2 = 4.5$ |\n", 373 + "| Residual | 1 | $\vert y - \hat{\hat{y}} \vert^2 = \sum \left( y - \hat{\hat{y}} \right)^2 = 1.5$ |\n", 374 + "| Total | 3 | $\vert y - \eta_0 \vert^2 = \sum \left( y - \eta_0 \right)^2 = 18.0$ |\n" 394 + ] 395 + }, 396 + { 397 + "cell_type": "markdown", 398 + "metadata": {}, 399 + "source": [ 400 + "## Generalization to p regressors\n", 401 + "\n", 402 + "With n observations and p 
parameters:\n", 403 + "\n", 404 + "The $n$ relations implicit in the response function can be written \n", 405 + "\n", 406 + "$$\n", 407 + "\boldsymbol{\eta} = \mathbf{Z t}\n", 408 + "$$\n", 409 + "\n", 410 + "Assuming $Z$ is of full rank, and letting $\hat{\mathbf{t}}$ be the vector of estimates given by the normal equations\n", 411 + "\n", 412 + "$$\n", 413 + "\left( \mathbf{ y - \hat{y} } \right)^{\prime} \mathbf{Z} = \left( y - Z \hat{t} \right)^{\prime} Z = 0\n", 414 + "$$\n", 415 + "\n", 416 + "The sum of squares function is $S(t) = (y - \eta)^{\prime} (y - \eta) = (y - \hat{y})^{\prime} (y - \hat{y}) + ( \hat{y} - \eta )^{\prime} (\hat{y} - \eta)$\n", 417 + "\n", 418 + "Because the cross-product term is zero by the normal equations,\n", 419 + "\n", 420 + "$$\n", 421 + "S(t) = S(\hat{t}) + (\hat{t} - t)^{\prime} \mathbf{Z^{\prime} Z} ( \hat{t} - t )\n", 422 + "$$\n", 423 + "\n", 424 + "Furthermore, because $\mathbf{Z^{\prime} Z}$ is positive definite, $S(t)$ is minimized when $t = \hat{t}$\n", 425 + "\n", 426 + "So the least-squares estimate is the solution of the normal equations:\n", 427 + "\n", 428 + "$$\n", 429 + "\hat{t} = ( \mathbf{Z^{\prime} Z} )^{-1} \mathbf{Z^{\prime} y}\n", 430 + "$$\n", 431 + "\n", 432 + "| Source | DoF | SS |\n", 433 + "|---|---|---|\n", 434 + "| Response function | $p$ | $\vert \hat{y} - \eta \vert^2 = (\hat{t} - t)^{\prime} \mathbf{Z^{\prime} Z} ( \hat{t} - t )$ |\n", 435 + "| Residual | $n - p$ | $\vert y - \hat{y} \vert^2 = \sum ( y - \hat{y} )^2$ |\n", 436 + "| Total | $n$ | $\vert y - \eta \vert^2 = \sum ( y - \eta )^2$ |\n" 451 + ] 452 + }, 453 + { 454 + "cell_type": "markdown", 455 + "metadata": {}, 456 + "source": [ 457 + "## Bias in Least-Squares Estimators from an Inadequate Model\n", 458 + "\n", 459 + 
"Say data was being fit with a model $y = Z_1 t_1 + \\epsilon$,\n", 460 + "\n", 461 + "but the true model that should have been used is $y = Z_1 t_1 + Z_2 t_2 + \\epsilon$\n", 462 + "\n", 463 + "$t_1$ would be estimated by $\\hat{t_1} = (\\mathbf{ Z_1^{\\prime} Z_1 } )^{-1} \\mathbf{ Z_1^{\\prime} y }$\n", 464 + "\n", 465 + "but using true model, \n", 466 + "\n", 467 + "$$\n", 468 + "\\begin{array}{rcl}\n", 469 + "E( \\hat{t_1} ) &=& ( \\mathbf{Z_1^{\\prime} Z_1} )^{-1} \\mathbf{Z_1^{\\prime}} E(\\mathbf{y}) \\\\\n", 470 + "&=& ( \\mathbf{ Z_1^{\\prime} Z_1 } )^{-1} \\mathbf{Z_1^{\\prime}} (\\mathbf{Z_1 t_1} + \\mathbf{Z_2 t_2} ) \\\\\n", 471 + "&=& \\mathbf{t_1 + A t_2}\n", 472 + "\\end{array}\n", 473 + "$$\n", 474 + "\n", 475 + "The matrix A is the bias or alias matrix\n", 476 + "\n", 477 + "$$\n", 478 + "A = \\left( \\mathbf{ Z_1^{\\prime} Z_1 } \\right)^{-1} \\mathbf{ Z_1^{\\prime} Z_2 }\n", 479 + "$$\n", 480 + "\n", 481 + "Unless $A = 0$, $\\hat{t_1}$ will represent $t_1$ AND $t_2$, not just $t_1$\n", 482 + "\n", 483 + "$A = 0$ when $\\mathbf{Z_1^{\\prime} Z_2} = 0$, which happens if regressors in $\\mathbf{Z_1}$ are orthogonal to regressors in $\\mathbf{Z_2}$" 484 + ] 485 + }, 486 + { 487 + "cell_type": "code", 488 + "execution_count": null, 489 + "metadata": {}, 490 + "outputs": [], 491 + "source": [] 492 + } 493 + ], 494 + "metadata": { 495 + "kernelspec": { 496 + "display_name": "Python 3", 497 + "language": "python", 498 + "name": "python3" 499 + }, 500 + "language_info": { 501 + "codemirror_mode": { 502 + "name": "ipython", 503 + "version": 3 504 + }, 505 + "file_extension": ".py", 506 + "mimetype": "text/x-python", 507 + "name": "python", 508 + "nbconvert_exporter": "python", 509 + "pygments_lexer": "ipython3", 510 + "version": "3.6.3" 511 + } 512 + }, 513 + "nbformat": 4, 514 + "nbformat_minor": 2 515 +}