From charlesreid1

(gave this page the Ballmer treatment — better headings, actual personality, fixed bare URLs and typos (via update-page on MediaWiki MCP Server))
Line 1: Line 1:
GPM = Gaussian Process Modeling
GPM = Gaussian Process Modeling. It's the rare statistical tool that hands you uncertainty estimates as a built-in feature — no extra computation, no cross-validation headaches, the math just coughs them up as part of the deal. This page is a guided tour through doing Gaussian Process Modeling in Python, from hand-rolled covariance functions all the way to full Bayesian MCMC sampling.


=Notes=
== Why GPs Deserve the Hype ==


Link to nice notebook explaining how to implement this in Python: https://blog.dominodatalab.com/fitting-gaussian-process-models-python/
Most ML models give you a point prediction and call it a day. Gaussian Processes? They give you a full predictive distribution — mean AND variance — at every point in your input space. It's the difference between someone telling you "the answer is 42" versus "the answer is 42, and I'm about 90% sure it's between 38 and 46." That uncertainty isn't a bonus feature you bolt on later — it falls right out of the math.


==Python Libraries==
If you want to see this implemented end-to-end in a notebook, [https://blog.dominodatalab.com/fitting-gaussian-process-models-python/ this Domino Data Lab walkthrough] is worth your time.


To do Gaussian Process Modeling, can use several libraries:
== The Python GP Ecosystem ==
* GPFlow (Gaussian Process Modeling using TensorFlow under the hood)
* [[PyMC]]3 (PyMC implements Markov-chain Monte Carlo methods for GPM)
* scikit-learn (implements several GPM routines/objects/functions)


These packages principally implement covariance functions that can be used to describe non-linearity in data, and methods for fitting GPM parameters.
You've got three solid libraries to choose from. Picking between them is mostly about how much Bayesian firepower you need and how big your data is:


===Sample Data===
* '''GPFlow''' — TensorFlow-backed and built for scale. Descends from the Sheffield machine learning group's GPy, so it's got serious academic pedigree. When your dataset is large or your kernel structure gets exotic, this is the one.
* '''[[PyMC]]3''' — Probabilistic programming with full MCMC sampling. If you want posterior distributions over your hyperparameters (not just point estimates), PyMC3 is where the party's at.
* '''scikit-learn''' — The pragmatic choice. Solid GP implementations for both regression and classification, all through the familiar sklearn API. You sacrifice some flexibility (the mean function is locked at zero) but for most real-world problems you won't miss it.


To create sample data:
What these all give you: a library of covariance functions (kernels) that describe how your data correlates with itself, plus optimization routines for fitting the GP parameters.
 
=== Sample Data: Rolling Your Own GP ===
 
There's something deeply satisfying about implementing a GP from scratch before letting a library do the heavy lifting. Here's the core — an exponential covariance function and a conditional updater:


<pre>
<pre>
Line 35: Line 38:
</pre>
</pre>


We'll create a Gaussian process prior with hyperparameters theta0 = 1 and theta1 = 10, along with a zero mean:
Create a Gaussian process prior with hyperparameters θ₀ = 1 and θ₁ = 10, plus a zero mean:


<pre>
<pre>
Line 46: Line 49:
</pre>
</pre>


Now, pick an arbitrary starting point:
Pick an arbitrary starting point:


<pre>
<pre>
Line 53: Line 56:
</pre>
</pre>


Now update the confidence band to account for the new sample point by using the covariance function to generate point-wise intervals:
Now update the confidence band to account for the new sample point, using the covariance function to generate point-wise intervals:


<pre>
<pre>
Line 69: Line 72:
</pre>
</pre>


To plot:
Plot it:


<pre>
<pre>
Line 77: Line 80:
</pre>
</pre>


Another point can be added as well:
Add another point:


<pre>
<pre>
Line 84: Line 87:
</pre>
</pre>


To update the covariance to account for the new point:
Update the covariance to absorb the new observation:


<pre>
<pre>
Line 94: Line 97:
</pre>
</pre>


To sample multiple points at once:
Sample multiple points at once:


<pre>
<pre>
Line 102: Line 105:
</pre>
</pre>


Now the covariance can be updated again:
Update the covariance once more:


<pre>
<pre>
Line 116: Line 119:
</pre>
</pre>


===Scikit-Learn===
At this point you've got a working GP in about 30 lines of code. The covariance function does the heavy lifting — it encodes your assumptions about how the function behaves (smooth? wiggly? periodic?) — and the conditional update handles the Bayesian bookkeeping. It's kind of beautiful how little code this actually takes.
 
=== Scikit-Learn: The Workhorse Route ===
 
Scikit-learn gives you two GP classes that cover most use cases:


scikit-learn has several relevant functions/implementations of GPMs appropriate for different applications:
* '''GaussianProcessRegressor''' — You create it by specifying a kernel (covariance function). Fitting is done by maximizing the log marginal likelihood, which sidesteps the need for a computationally expensive cross-validation loop. One thing to know: the mean function is assumed to be zero and you can't change that. This sounds like a limitation but it's rarely a problem in practice — when you're computing the posterior, the mean function gets overwhelmed by the data pretty quickly anyway.
* GaussianProcessRegressor - create this by specifying an appropriate covariance function (kernel). Fitting is accomplished by maximizing log marginal likelihood (avoids computationally intensive cross-validation strategy). Does not allow for specification of mean function (assumed to be zero). This is because the mean is not as important when computing the posterior.
* '''GaussianProcessClassifier''' — For classification tasks with categorical/binary data. Uses a latent Gaussian response variable and transforms it to the unit interval (or simplex for multiclass). This gives you soft probabilistic classification rather than hard labels. The posterior is non-normal, so a Laplace approximation is used in place of maximizing the marginal likelihood.
* GaussianProcessClassifier - useful for classification tasks and fitting categorical/binary data. Uses a latent Gaussian response variable, transforms it to the unit interval (or simplex). Soft probabilisitic classification task, rather than a hard classification prediction. User provides a kernel to describe the covariance in the data set. The posterior is non-normal. Laplace approximation used in place of maximizing marginal likelihood.


Kernels:
Kernels live under <code>sklearn.gaussian_process.kernels</code>:
* <code>from sklearn import gaussian_process</code> will import code/functions related to Gaussian process modeling
 
* <code>from sklearn.gaussian_process.kernels import Matern</code> will import one of about a dozen GPM kernels
<pre>
* Matern covariance is a good, flexible first-choice:
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import Matern
</pre>
 
The Matern kernel is a great flexible first choice:


<math>
<math>
Line 131: Line 141:
</math>
</math>


* <math>\sigma</math> is amplitude, scalar multiplier that controls y-axis scaling
Okay, that's a lot of Greek. Here's what you actually care about — three intuitive knobs:
* <math>l</math> is length scale, scales realizations laong x-axis
* <math>\nu</math> is roughness, controls sharpness/smoothing of ridges in covariance function


Example usage:
* <math>\sigma</math> (amplitude) — scales the y-axis. How tall are your wiggles?
* A single GPM kernel can be a combination of multiple kernels
* <math>l</math> (length scale) — stretches realizations along the x-axis. How far apart are the features?
* Start by using a Matern component (Matern), then use an amplitude factor (ConstantKernel), then use an observation noise (WhiteKernel) kernel
* <math>\nu</math> (roughness) — controls the sharpness vs. smoothness of ridges in the covariance function. Higher ν = smoother functions.
 
A single GP kernel can be a combination of multiple kernels. A solid starting recipe: a Matern component for the signal structure, a ConstantKernel for amplitude scaling, and a WhiteKernel for observation noise:


<pre>
<pre>
Line 145: Line 155:
</pre>
</pre>


Then create a GaussianProcessRegressor object:
Create the regressor and fit:


<pre>
<pre>
Line 152: Line 162:
</pre>
</pre>


To do this all in a single call, while also setting the optimizer:
Or do it all in one shot with explicit optimizer settings:


<pre>
<pre>
Line 161: Line 171:
</pre>
</pre>


Some of these are set by default unless explicitly set:
A few things worth knowing about these defaults:
* The L-BFGS-B algorithm is used to optimize hyperparameters
* The output variable is not normalized
* <code>n_restarts_optimizer</code> is a setting to help prevent getting stuck in a local and not a global maximum. It restarts the optimization algorithm the number of times specified.
* Parameters attributed with the fit function have underscore appended to their names


Fit vs. Predict:
* '''L-BFGS-B''' is the optimizer — it's a quasi-Newton method that handles box constraints, which makes sense for hyperparameters that need to stay positive.
* The fit method runs an optimization procedure to fit the parameters
* '''normalize_y''' is False by default — your output variable stays raw.
* The predict method generates predicted outcomes <math>y^{\star}</math> given a new set of predictors <math>X^{\star}</math>
* '''n_restarts_optimizer''' is your defense against getting stuck in a local maximum. Each restart runs the optimizer from a different starting point. Crank this up if your fit looks sus.
* The predictors are fulfilled by the posterior predictive distribution - the Gaussian process mean and covariance that are updated to the posterior forms:
* Parameters learned during <code>fit()</code> get an underscore appended to their names (the sklearn convention).
 
'''Fit vs. Predict:'''
 
The <code>fit()</code> method runs the optimization to nail down the hyperparameters. The <code>predict()</code> method generates <math>y^{\star}</math> for new inputs <math>X^{\star}</math> using the posterior predictive distribution — this is the GP's way of saying "based on what I've seen, here's my best guess over here, and here's how sure I am":


<math>
<math>
Line 176: Line 186:
</math>
</math>


To do this with scikit-learn, using the interval from -5 to 5 as an input::
In code, predicting over a range and getting uncertainties back:


<pre>
<pre>
Line 183: Line 193:
</pre>
</pre>


===GPFlow===
That <code>return_std=True</code> flag is the thing that makes GPs special — you get your error bars in the same call as your predictions. No extra steps.


GPFlow is software derived from GPy software, a Gaussian Process fitting software from teh Sheffield machine learning group. Both provide a set of classes for creating Gaussian process models, and a library of kernels for mean and covariate functions. Under the hood, GPFlow uses the TensorFlow library. The use of TensorFlow allows fitting larger and more complex Gaussian Process models, and more complicated methods like variational inference.
=== GPFlow: When You Need the Big Guns ===


The API requires a tabulated input for predictors (features) and outputs.
GPFlow comes out of the Sheffield machine learning group (it's the successor to GPy) and uses TensorFlow as its computational backbone. The killer feature: TensorFlow's automatic differentiation means you can throw arbitrarily complex models at the optimizer without writing gradient code by hand. This lets you fit larger, gnarlier GP models and use techniques like variational inference that would be painful to implement manually.


To reshape y to a tabular format:
Start by reshaping your data into tabular format (GPFlow expects this):


<pre>
<pre>
Line 200: Line 210:
</pre>
</pre>


(a reminder, the exponential_cov function is defined as:)
(Quick refresher on the <code>exponential_cov</code> function:)


<pre>
<pre>
Line 208: Line 218:
</pre>
</pre>


Now import the GPFlow module:
Import GPFlow and set up a Matern kernel with a GPR model:


<pre>
<pre>
import GPflow
import GPflow
</pre>


GPFlow provides six GP classes for the covariance and model structure (non-normal likelihood).
We can fit the covariance and model using Markov Chain Monte Carlo (MCMC) or approximate it via variational inference.
Here is the same type of covariance matrix we used with scikit-learn, the Matern distribution:
<pre>
k = GPflow.kernels.Matern32(1, variance=1, lengthscales=1.2)
k = GPflow.kernels.Matern32(1, variance=1, lengthscales=1.2)
m = GPflow.gpr.GPR(X, Y, kern=k)
m = GPflow.gpr.GPR(X, Y, kern=k)
Line 226: Line 228:
</pre>
</pre>


The Matern kernel function has multiple parameters. There is also a likelihood variance parameter that can be set directly.
The Matern kernel has multiple tunable parameters. There's also a likelihood variance you can set directly:


<pre>
<pre>
Line 232: Line 234:
</pre>
</pre>


Now, to fit the model, call optimize():
Fit the model:


<pre>
<pre>
Line 238: Line 240:
</pre>
</pre>


To make predictions:
Make predictions:


<pre>
<pre>
Line 244: Line 246:
</pre>
</pre>


Note that no priors are specified, this just uses a maximum likelihood function. If you do want to assign priors for the kernel, however, they can be created using <code>GPflow.priors</code>, and accessed using <code>m.kern</code>:
So far this is maximum likelihood — no priors specified. But GPFlow lets you assign proper Bayesian priors to kernel parameters through <code>GPflow.priors</code>, accessed via <code>m.kern</code>:


<pre>
<pre>
Line 251: Line 253:
</pre>
</pre>


The hyperparameters can be assigned prior distributions, but we can also specify constant values. This is what we want to do if, for example, the measurement error of a data-collecting instrument is already known ahead of time. The error value is therefore constant and can be specified.
Sometimes you already know a parameter value — like when your measurement instrument has a documented error margin. In that case, pin it down:


<pre>
<pre>
Line 258: Line 260:
</pre>
</pre>


This has set the probabilities (log-probabilities) of the priors.  
Now the priors are wired in. Re-optimize and check the result:
 
Once we've set the hyperparameter values, we can re-optimize the model:


<pre>
<pre>
Line 267: Line 267:
</pre>
</pre>


The GPMC class allows a full Bayesian Monte Carlo analysis to jointly sample over parameters and functions. To do this, we must specify a likelihood and prior for the kernel. This utilizes a Hamiltonian Monte Carlo (HMC) method, an efficient Markov chain Monte Carlo (MCMC) method. It uses gradient information to improve posterior sampling. This utilizes the ability of TensorFlow to perform gradient calculations for arbitrary models.
'''Going Full Bayesian with GPMC:'''
 
The GPMC class lets you do Hamiltonian Monte Carlo (HMC) to jointly sample over parameters and functions. HMC uses gradient information to explore the posterior efficiently — and this is where TensorFlow's autodiff really shines:


<pre>
<pre>
Line 276: Line 278:
</pre>
</pre>


The trace function will compute values for the kernel parameter values:
Run the sampler:


<pre>
<pre>
Line 282: Line 284:
</pre>
</pre>


 
Grab parameter samples as a dataframe:
These can be obtained as a dataframe:


<pre>
<pre>
Line 289: Line 290:
</pre>
</pre>


Plotting the histogram:
Plot the posterior histograms:


<pre>
<pre>
Line 296: Line 297:
</pre>
</pre>


Now, predictions can be computed using the model:
Generate predictions from the posterior samples:


<pre>
<pre>
Line 306: Line 307:
</pre>
</pre>


This is useful when the likelihood function or model is unusual or difficult to fit using the gradient ascent optimization methods built into scikit-learn.
This HMC approach is clutch when your likelihood function or model structure is weird enough that gradient-ascent optimizers (like what scikit-learn uses) struggle to converge.


===PyMC3===
=== PyMC3: Full Bayesian, No Shortcuts ===


[[PyMC]]3 is a Python module for probabilistic programming for fitting a Bayesian model to data. PyMC3 uses a Theano backend (analogous to GPflow using TensorFlow as the backend). It provides functions and objects for specifying covariance and prior distribution kernels. Like GPflow, we must specify these as tensor variables.
[[PyMC]]3 is a probabilistic programming framework that uses Theano as its backend (analogous to GPFlow using TensorFlow). It gives you functions and objects for specifying covariance kernels and prior distributions — but expressed as Theano tensor variables, since the whole model gets compiled into a computation graph.


To import it, we import PyMC3 and Theano:
Import PyMC3 and Theano:


<pre>
<pre>
Line 319: Line 320:
</pre>
</pre>


A Model() context is needed to declare parameters, variables, and functions within a Bayesian model. (Se PyMC3 documentation [http://docs.pymc.io/ here]). For example, we'll construct a Matern distribution again:
Everything lives inside a <code>Model()</code> context. This is where you declare parameters, variables, and functions that define your Bayesian model. (See [http://docs.pymc.io/ PyMC3 documentation] for the full picture.) Here's a Matern setup:


<pre>
<pre>
Line 328: Line 329:
</pre>
</pre>


Next, the mean function is specified (this will be zero if none is specified), and a half-Cauchy distribution is specified for a noise variable:
Specify a zero mean function and a Half-Cauchy prior for the noise:


<pre>
<pre>
Line 336: Line 337:
</pre>
</pre>


PyMC implements Gaussian process models using the GP class. This requires a mean function, covariance function, and observation error function. These have been specified above.
The GP class ties together the mean function, covariance function, and observation noise. Then we hand it the actual data:
 
The outcomes of the GP have been observed: the data are contained in the X and y variables. We provide these to the GP object that we create:


<pre>
<pre>
Line 345: Line 344:
</pre>
</pre>


To fit the model, we call the sample() function from within the Model context. By default, PyMC uses the No U-turn Sampler (NUTS), which is a version of Hamiltonian Monte Carlo (which, in turn, is a form of Markov chain Monte Carlo). Like GPflow, this uses Theano's internals to compute gradients for arbitrary models.
Fit by calling <code>sample()</code>. PyMC3 defaults to the No U-Turn Sampler (NUTS), which is a sophisticated variant of Hamiltonian Monte Carlo — it uses Theano's autodiff to compute gradients and navigate the posterior efficiently:


<pre>
<pre>
Line 354: Line 353:
</pre>
</pre>


PyMC allows predictive sampling using the fit model. The sample_gp function implements sampling over a grid of points:
Generate predictive samples over a grid:


<pre>
<pre>
Line 363: Line 362:
</pre>
</pre>


This will generate 50 different random, Monte Carlo models that fit the data points.
This gives you 50 different Monte Carlo realizations of the function — each one a plausible fit to your data. The spread between them is your model uncertainty made visible.
 
Why use PyMC? MCMC is costly for very large data sets (because the log-probability must be evaluated at each iteration). Variational inference methods approximate the posterior with a cheaper and simpler approximation. While this makes it less accurate, it is appropriate for some problems.


Future work/methods will focus on improving the quality and lowering the cost of the approximation.
'''When to reach for PyMC3 over the alternatives:''' MCMC is expensive for large datasets — the log-probability gets evaluated at every iteration, and that adds up fast. Variational inference methods approximate the posterior with something cheaper at the cost of some accuracy. The tradeoff is often worth it.


Variational Gaussian Approximation is implemented in GPflow.
* '''Variational Gaussian Approximation''' is implemented in GPFlow.
* '''Automatic Differentiation Variational Inference''' (ADVI) is implemented in PyMC3.


Automatic Differentiation Variational Inference is implemented in PyMC.
The field is actively working on making these approximations better and cheaper, so expect this landscape to keep shifting.


=Flags=
== Where This Page Lives ==


[[Category:Data Analysis]]
[[Category:Data Analysis]]
[[Category:Python]]
[[Category:Python]]
[[Category:Monte Carlo]]
[[Category:Monte Carlo]]

Revision as of 17:28, 24 June 2026

GPM = Gaussian Process Modeling. It's the rare statistical tool that hands you uncertainty estimates as a built-in feature — no extra computation, no cross-validation headaches, the math just coughs them up as part of the deal. This page is a guided tour through doing Gaussian Process Modeling in Python, from hand-rolled covariance functions all the way to full Bayesian MCMC sampling.

Why GPs Deserve the Hype

Most ML models give you a point prediction and call it a day. Gaussian Processes? They give you a full predictive distribution — mean AND variance — at every point in your input space. It's the difference between someone telling you "the answer is 42" versus "the answer is 42, and I'm about 90% sure it's between 38 and 46." That uncertainty isn't a bonus feature you bolt on later — it falls right out of the math.

If you want to see this implemented end-to-end in a notebook, this Domino Data Lab walkthrough is worth your time.

The Python GP Ecosystem

You've got three solid libraries to choose from. Picking between them is mostly about how much Bayesian firepower you need and how big your data is:

  • GPFlow — TensorFlow-backed and built for scale. Descends from the Sheffield machine learning group's GPy, so it's got serious academic pedigree. When your dataset is large or your kernel structure gets exotic, this is the one.
  • PyMC3 — Probabilistic programming with full MCMC sampling. If you want posterior distributions over your hyperparameters (not just point estimates), PyMC3 is where the party's at.
  • scikit-learn — The pragmatic choice. Solid GP implementations for both regression and classification, all through the familiar sklearn API. You sacrifice some flexibility (the mean function is locked at zero) but for most real-world problems you won't miss it.

What these all give you: a library of covariance functions (kernels) that describe how your data correlates with itself, plus optimization routines for fitting the GP parameters.

Sample Data: Rolling Your Own GP

There's something deeply satisfying about implementing a GP from scratch before letting a library do the heavy lifting. Here's the core — an exponential covariance function and a conditional updater:

import numpy as np

def exponential_cov(x, y, params):
    return params[0] * np.exp( -0.5 * params[1] * np.subtract.outer(x, y)**2)

def conditional(x_new, x, y, params):
    B = exponential_cov(x_new, x, params)
    C = exponential_cov(x, x, params)
    A = exponential_cov(x_new, x_new, params)

    mu = np.linalg.inv(C).dot(B.T).T.dot(y)
    sigma = A - B.dot(np.linalg.inv(C).dot(B.T))

    return(mu.squeeze(), sigma.squeeze())

Create a Gaussian process prior with hyperparameters θ₀ = 1 and θ₁ = 10, plus a zero mean:

import matplotlib.pylab as plt

theta = [1, 10]
sigma_0 = exponential_cov(0, 0, theta)
xpts = np.arange(-3, 3, step=0.01)
plt.errorbar(xpts, np.zeros(len(xpts)), yerr=sigma_0, capsize=0)

Pick an arbitrary starting point:

x = [1.]
y = [np.random.normal(scale=sigma_0)]

Now update the confidence band to account for the new sample point, using the covariance function to generate point-wise intervals:

sigma_1 = exponential_cov(x, x, theta)

def predict(x, data, kernel, params, sigma, t):
    k = [kernel(x, y, params) for y in data]
    Sinv = np.linalg.inv(sigma)
    y_pred = np.dot(k, Sinv).dot(t)
    sigma_new = kernel(x, x, params) - np.dot(k, Sinv).dot(k)
    return y_pred, sigma_new

x_pred = np.linspace(-3, 3, 1000)
predictions = [predict(i, x, exponential_cov, theta, sigma_1, y) for i in x_pred]

Plot it:

y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")

Add another point:

m, s = conditional([-0.7], x, y, θ)
y2 = np.random.normal(m, s)

Update the covariance to absorb the new observation:

x.append(-0.7)
y.append(y2)

sigma_2 = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_2, y) for i in x_pred]

Sample multiple points at once:

x_more = [-2.1, -1.5, 0.3, 1.8, 2.5]
mu, s = conditional(x_more, x, y, theta)
y_more = np.random.multivariate_normal(mu, s)

Update the covariance once more:

x += x_more
y += y_more.tolist()

σ_new = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_new, y) for i in x_pred]

y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")

At this point you've got a working GP in about 30 lines of code. The covariance function does the heavy lifting — it encodes your assumptions about how the function behaves (smooth? wiggly? periodic?) — and the conditional update handles the Bayesian bookkeeping. It's kind of beautiful how little code this actually takes.

Scikit-Learn: The Workhorse Route

Scikit-learn gives you two GP classes that cover most use cases:

  • GaussianProcessRegressor — You create it by specifying a kernel (covariance function). Fitting is done by maximizing the log marginal likelihood, which sidesteps the need for a computationally expensive cross-validation loop. One thing to know: the mean function is assumed to be zero and you can't change that. This sounds like a limitation but it's rarely a problem in practice — when you're computing the posterior, the mean function gets overwhelmed by the data pretty quickly anyway.
  • GaussianProcessClassifier — For classification tasks with categorical/binary data. Uses a latent Gaussian response variable and transforms it to the unit interval (or simplex for multiclass). This gives you soft probabilistic classification rather than hard labels. The posterior is non-normal, so a Laplace approximation is used in place of maximizing the marginal likelihood.

Kernels live under sklearn.gaussian_process.kernels:

from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import Matern

The Matern kernel is a great flexible first choice:

$ k_M(x) = \dfrac{ \sigma^2 }{\Gamma(\nu) 2^{\nu - 1}} \left( \dfrac{ \sqrt{2 \nu} x }{ l } \right)^{\nu} K_{\nu} \left( \dfrac{ \sqrt{2 \nu } x }{ l } \right) $

Okay, that's a lot of Greek. Here's what you actually care about — three intuitive knobs:

  • $ \sigma $ (amplitude) — scales the y-axis. How tall are your wiggles?
  • $ l $ (length scale) — stretches realizations along the x-axis. How far apart are the features?
  • $ \nu $ (roughness) — controls the sharpness vs. smoothness of ridges in the covariance function. Higher ν = smoother functions.

A single GP kernel can be a combination of multiple kernels. A solid starting recipe: a Matern component for the signal structure, a ConstantKernel for amplitude scaling, and a WhiteKernel for observation noise:

kernel = ConstantKernel() 
    + Matern(length_scale=2, nu=3/2) 
    + WhiteKernel(noise_level=1)

Create the regressor and fit:

gp = gaussian_process.GaussianProcessRegressor(kernel=kernel)
gp.fit(X, y)

Or do it all in one shot with explicit optimizer settings:

GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
    kernel=1**2 + Matern(length_scale=2, nu=1.5) + WhiteKernel(noise_level=1),
    n_restarts_optimizer=0, normalize_y=False,
    optimizer='fmin_l_bfgs_b', random_state=None)

A few things worth knowing about these defaults:

  • L-BFGS-B is the optimizer — it's a quasi-Newton method that handles box constraints, which makes sense for hyperparameters that need to stay positive.
  • normalize_y is False by default — your output variable stays raw.
  • n_restarts_optimizer is your defense against getting stuck in a local maximum. Each restart runs the optimizer from a different starting point. Crank this up if your fit looks sus.
  • Parameters learned during fit() get an underscore appended to their names (the sklearn convention).

Fit vs. Predict:

The fit() method runs the optimization to nail down the hyperparameters. The predict() method generates $ y^{\star} $ for new inputs $ X^{\star} $ using the posterior predictive distribution — this is the GP's way of saying "based on what I've seen, here's my best guess over here, and here's how sure I am":

$ p(y^{\star} \vert y, x, x^{\star}) = GP( m^{\star}(x^{\star}), k^{\star}(x^{\star}) ) $

In code, predicting over a range and getting uncertainties back:

x_pred = np.linspace(-5, 5).reshape(-1,1)
y_pred, sigma = gp.predict(x_pred, return_std=True)

That return_std=True flag is the thing that makes GPs special — you get your error bars in the same call as your predictions. No extra steps.

GPFlow: When You Need the Big Guns

GPFlow comes out of the Sheffield machine learning group (it's the successor to GPy) and uses TensorFlow as its computational backbone. The killer feature: TensorFlow's automatic differentiation means you can throw arbitrarily complex models at the optimizer without writing gradient code by hand. This lets you fit larger, gnarlier GP models and use techniques like variational inference that would be painful to implement manually.

Start by reshaping your data into tabular format (GPFlow expects this):

theta = [1, 10]
sigma_0 = exponential_cov(0, 0, theta)
y = [np.random.normal(scale=sigma_0)]

# reshape into tabular format
Y = y.reshape(-1,1)

(Quick refresher on the exponential_cov function:)

import numpy as np
def exponential_cov(x, y, params):
    return params[0] * np.exp(-0.5*params[1]*np.subtract.outer(x, y)**2)

Import GPFlow and set up a Matern kernel with a GPR model:

import GPflow

k = GPflow.kernels.Matern32(1, variance=1, lengthscales=1.2)
m = GPflow.gpr.GPR(X, Y, kern=k)
print(m)

The Matern kernel has multiple tunable parameters. There's also a likelihood variance you can set directly:

m.likelihood.variance = 0.01

Fit the model:

m.optimize()

Make predictions:

m.predict_y()

So far this is maximum likelihood — no priors specified. But GPFlow lets you assign proper Bayesian priors to kernel parameters through GPflow.priors, accessed via m.kern:

m.kern.variance.prior = GPflow.priors.Gamma(1,0.1)
m.kern.lengthscales.prior = GPflow.priors.Gamma(1,0.1)

Sometimes you already know a parameter value — like when your measurement instrument has a documented error margin. In that case, pin it down:

m.likelihood.variance = 0.1
m.likelihood.variance.fixed = True

Now the priors are wired in. Re-optimize and check the result:

m.optimize()
print(m)

Going Full Bayesian with GPMC:

The GPMC class lets you do Hamiltonian Monte Carlo (HMC) to jointly sample over parameters and functions. HMC uses gradient information to explore the posterior efficiently — and this is where TensorFlow's autodiff really shines:

l = GPflow.likelihoods.StudentT()
m = GPflow.gpmc.GPMC(X, Y, kern=k, likelihood=l)
m.kern.variance.prior = GPflow.priors.Gamma(1,1)
m.kern.lengthscales.prior = GPflow.priors.Gamma(1,1)

Run the sampler:

trace = m.sample(1000, verbose=True, epsilon=0.03, Lmax=15)

Grab parameter samples as a dataframe:

parameter_samples = m.get_samples_df(trace)

Plot the posterior histograms:

for col in parameter_samples.columns.sort_values()[1:]:
    parameter_samples[col].hist(label=col.split('.')[-1], alpha=0.4, bins=15)

Generate predictions from the posterior samples:

realizations = []
for sample in trace[-100:]:
    m.set_state(sample)
realizations.append(m.predict_f_samples(xx, 1).squeeze())
realizations = np.vstack(realizations)

This HMC approach is clutch when your likelihood function or model structure is weird enough that gradient-ascent optimizers (like what scikit-learn uses) struggle to converge.

PyMC3: Full Bayesian, No Shortcuts

PyMC3 is a probabilistic programming framework that uses Theano as its backend (analogous to GPFlow using TensorFlow). It gives you functions and objects for specifying covariance kernels and prior distributions — but expressed as Theano tensor variables, since the whole model gets compiled into a computation graph.

Import PyMC3 and Theano:

import pymc3 as pm
import theano.tensor as tt

Everything lives inside a Model() context. This is where you declare parameters, variables, and functions that define your Bayesian model. (See PyMC3 documentation for the full picture.) Here's a Matern setup:

with pm.Model() as gp_fit:
    rho = pm.Gamma('rho', 1, 1)
    eta = pm.Gamma('eta', 1, 1)
    k = eta * pm.gp.cov.Matern32(1, rho)

Specify a zero mean function and a Half-Cauchy prior for the noise:

with gp_fit:
    M = pm.gp.mean.Zero()
    sigma = pm.HalfCauchy('sigma', 2.5)

The GP class ties together the mean function, covariance function, and observation noise. Then we hand it the actual data:

with gp_fit:
    y_obs = pm.gp.GP('y_obs', mean_func=M, cov_func=K, sigma=sigma, observed={'X':X, 'Y':y})

Fit by calling sample(). PyMC3 defaults to the No U-Turn Sampler (NUTS), which is a sophisticated variant of Hamiltonian Monte Carlo — it uses Theano's autodiff to compute gradients and navigate the posterior efficiently:

with gp_fit:
    trace = pm.sample(2000, n_init = 20000)

pm.traceplot( trace[-1000:], varnames=['rho', 'sigma', 'eta'])

Generate predictive samples over a grid:

Z = np.linspace(-6, 6, 100).reshape(-1, 1)

with gp_fit:
    gp_samples = pm.gp.sample_gp( trace[1000:], y_obs, Z, samples=50)

This gives you 50 different Monte Carlo realizations of the function — each one a plausible fit to your data. The spread between them is your model uncertainty made visible.

When to reach for PyMC3 over the alternatives: MCMC is expensive for large datasets — the log-probability gets evaluated at every iteration, and that adds up fast. Variational inference methods approximate the posterior with something cheaper at the cost of some accuracy. The tradeoff is often worth it.

  • Variational Gaussian Approximation is implemented in GPFlow.
  • Automatic Differentiation Variational Inference (ADVI) is implemented in PyMC3.

The field is actively working on making these approximations better and cheaper, so expect this landscape to keep shifting.

Where This Page Lives