From charlesreid1

No edit summary
 
(7 intermediate revisions by the same user not shown)
Line 1: Line 1:
GPM = Gaussian Process Modeling
GPM = Gaussian Process Modeling


=Notes=
The rare statistical tool that hands you uncertainty estimates as a built-in feature - no extra computation, no cross-validation headaches, the math just coughs them up as part of the deal.


Link to nice notebook explaining how to implement this in Python: https://blog.dominodatalab.com/fitting-gaussian-process-models-python/
This page is a guided tour through doing Gaussian Process Modeling in Python, from hand-rolled covariance functions all the way to full Bayesian MCMC sampling.


==Python Libraries==
== Why GPs Deserve the Hype ==


To do Gaussian Process Modeling, can use several libraries:
Most ML models give you a point prediction and call it a day. Gaussian Processes? They give you a full predictive distribution - mean AND variance - at every point in your input space. It's the difference between someone telling you "the answer is 42" versus "the answer is 42, and I'm about 90% sure it's between 38 and 46." That uncertainty isn't a bonus feature you bolt on later - it falls right out of the math.
* GPFlow (Gaussian Process Modeling using TensorFlow under the hood)
* [[PyMC]]3 (PyMC implements Markov-chain Monte Carlo methods for GPM)
* scikit-learn (implements several GPM routines/objects/functions)


These packages principally implement covariance functions that can be used to describe non-linearity in data, and methods for fitting GPM parameters.
If you want to see this implemented end-to-end in a notebook, [https://blog.dominodatalab.com/fitting-gaussian-process-models-python/ this Domino Data Lab walkthrough] is worth your time.


===Scikit-Learn===
== The Python GP Ecosystem ==


scikit-learn has several relevant functions/implementations of GPMs appropriate for different applications:
You've got three solid libraries to choose from. Picking between them is mostly about how much Bayesian firepower you need and how big your data is:
* GaussianProcessRegressor - create this by specifying an appropriate covariance function (kernel). Fitting is accomplished by maximizing log marginal likelihood (avoids computationally intensive cross-validation strategy). Does not allow for specification of mean function (assumed to be zero). This is because the mean is not as important when computing the posterior.
* GaussianProcessClassifier - useful for classification tasks and fitting categorical/binary data. Uses a latent Gaussian response variable, transforms it to the unit interval (or simplex). Soft probabilisitic classification task, rather than a hard classification prediction. User provides a kernel to describe the covariance in the data set. The posterior is non-normal. Laplace approximation used in place of maximizing marginal likelihood.


Kernels:
* '''GPFlow''' - TensorFlow-backed and built for scale. Descends from the Sheffield machine learning group's GPy, so it's got serious academic pedigree. When your dataset is large or your kernel structure gets exotic, this is the one.
* <code>from sklearn import gaussian_process</code> will import code/functions related to Gaussian process modeling
* '''[[PyMC]]3''' - Probabilistic programming with full MCMC sampling. If you want posterior distributions over your hyperparameters (not just point estimates), PyMC3 is where the party's at.
* <code>from sklearn.gaussian_process.kernels import Matern</code> will import one of about a dozen GPM kernels
* '''scikit-learn''' - The pragmatic choice. Solid GP implementations for both regression and classification, all through the familiar sklearn API. You sacrifice some flexibility (the mean function is locked at zero) but for most real-world problems you won't miss it.
* Matern covariance is a good, flexible first-choice:
 
What these all give you: a library of covariance functions (kernels) that describe how your data correlates with itself, plus optimization routines for fitting the GP parameters.
 
=== Sample Data: Rolling Your Own GP ===
 
There's something deeply satisfying about implementing a GP from scratch before letting a library do the heavy lifting. Here's the core - an exponential covariance function and a conditional updater:
 
<pre>
import numpy as np
 
def exponential_cov(x, y, params):
    return params[0] * np.exp( -0.5 * params[1] * np.subtract.outer(x, y)**2)
 
def conditional(x_new, x, y, params):
    B = exponential_cov(x_new, x, params)
    C = exponential_cov(x, x, params)
    A = exponential_cov(x_new, x_new, params)
 
    mu = np.linalg.inv(C).dot(B.T).T.dot(y)
    sigma = A - B.dot(np.linalg.inv(C).dot(B.T))
 
    return(mu.squeeze(), sigma.squeeze())
</pre>
 
Create a Gaussian process prior with hyperparameters θ₀ = 1 and θ₁ = 10, plus a zero mean:
 
<pre>
import matplotlib.pylab as plt
 
theta = [1, 10]
sigma_0 = exponential_cov(0, 0, theta)
xpts = np.arange(-3, 3, step=0.01)
plt.errorbar(xpts, np.zeros(len(xpts)), yerr=sigma_0, capsize=0)
</pre>
 
Pick an arbitrary starting point:
 
<pre>
x = [1.]
y = [np.random.normal(scale=sigma_0)]
</pre>
 
Now update the confidence band to account for the new sample point, using the covariance function to generate point-wise intervals:
 
<pre>
sigma_1 = exponential_cov(x, x, theta)
 
def predict(x, data, kernel, params, sigma, t):
    k = [kernel(x, y, params) for y in data]
    Sinv = np.linalg.inv(sigma)
    y_pred = np.dot(k, Sinv).dot(t)
    sigma_new = kernel(x, x, params) - np.dot(k, Sinv).dot(k)
    return y_pred, sigma_new
 
x_pred = np.linspace(-3, 3, 1000)
predictions = [predict(i, x, exponential_cov, theta, sigma_1, y) for i in x_pred]
</pre>
 
Plot it:
 
<pre>
y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")
</pre>
 
Add another point:
 
<pre>
m, s = conditional([-0.7], x, y, θ)
y2 = np.random.normal(m, s)
</pre>
 
Update the covariance to absorb the new observation:
 
<pre>
x.append(-0.7)
y.append(y2)
 
sigma_2 = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_2, y) for i in x_pred]
</pre>
 
Sample multiple points at once:
 
<pre>
x_more = [-2.1, -1.5, 0.3, 1.8, 2.5]
mu, s = conditional(x_more, x, y, theta)
y_more = np.random.multivariate_normal(mu, s)
</pre>
 
Update the covariance once more:
 
<pre>
x += x_more
y += y_more.tolist()
 
σ_new = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_new, y) for i in x_pred]
 
y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")
</pre>
 
At this point you've got a working GP in about 30 lines of code. The covariance function does the heavy lifting - it encodes your assumptions about how the function behaves (smooth? wiggly? periodic?) - and the conditional update handles the Bayesian bookkeeping. It's kind of beautiful how little code this actually takes.
 
=== Scikit-Learn: The Workhorse Route ===
 
Scikit-learn gives you two GP classes that cover most use cases:
 
* '''GaussianProcessRegressor''' - You create it by specifying a kernel (covariance function). Fitting is done by maximizing the log marginal likelihood, which sidesteps the need for a computationally expensive cross-validation loop. One thing to know: the mean function is assumed to be zero and you can't change that. This sounds like a limitation but it's rarely a problem in practice - when you're computing the posterior, the mean function gets overwhelmed by the data pretty quickly anyway.
* '''GaussianProcessClassifier''' - For classification tasks with categorical/binary data. Uses a latent Gaussian response variable and transforms it to the unit interval (or simplex for multiclass). This gives you soft probabilistic classification rather than hard labels. The posterior is non-normal, so a Laplace approximation is used in place of maximizing the marginal likelihood.
 
Kernels live under <code>sklearn.gaussian_process.kernels</code>:
 
<pre>
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import Matern
</pre>
 
The Matern kernel is a great flexible first choice:


<math>
<math>
Line 29: Line 145:
</math>
</math>


* <math>\sigma</math> is amplitude, scalar multiplier that controls y-axis scaling
Okay, that's a lot of Greek. Here's what you actually care about - three intuitive knobs:
* <math>l</math> is length scale, scales realizations laong x-axis
* <math>\nu</math> is roughness, controls sharpness/smoothing of ridges in covariance function


Example usage:
* <math>\sigma</math> (amplitude) - scales the y-axis. How tall are your wiggles?
* A single GPM kernel can be a combination of multiple kernels
* <math>l</math> (length scale) - stretches realizations along the x-axis. How far apart are the features?
* Start by using a Matern component (Matern), then use an amplitude factor (ConstantKernel), then use an observation noise (WhiteKernel) kernel
* <math>\nu</math> (roughness) - controls the sharpness vs. smoothness of ridges in the covariance function. Higher ν = smoother functions.
 
A single GP kernel can be a combination of multiple kernels. A solid starting recipe: a Matern component for the signal structure, a ConstantKernel for amplitude scaling, and a WhiteKernel for observation noise:


<pre>
<pre>
Line 43: Line 159:
</pre>
</pre>


Then create a GaussianProcessRegressor object:
Create the regressor and fit:


<pre>
<pre>
Line 50: Line 166:
</pre>
</pre>


To do this all in a single call, while also setting the optimizer:
Or do it all in one shot with explicit optimizer settings:


<pre>
<pre>
Line 59: Line 175:
</pre>
</pre>


Some of these are set by default unless explicitly set:
A few things worth knowing about these defaults:
* The L-BFGS-B algorithm is used to optimize hyperparameters
 
* The output variable is not normalized
* '''L-BFGS-B''' is the optimizer - it's a quasi-Newton method that handles box constraints, which makes sense for hyperparameters that need to stay positive.
* <code>n_restarts_optimizer</code> is a setting to help prevent getting stuck in a local and not a global maximum. It restarts the optimization algorithm the number of times specified.
* '''normalize_y''' is False by default - your output variable stays raw.
* Parameters attributed with the fit function have underscore appended to their names
* '''n_restarts_optimizer''' is your defense against getting stuck in a local maximum. Each restart runs the optimizer from a different starting point. Crank this up if your fit looks sus.
* Parameters learned during <code>fit()</code> get an underscore appended to their names (the sklearn convention).
 
'''Fit vs. Predict:'''


Fit vs. Predict:
The <code>fit()</code> method runs the optimization to nail down the hyperparameters. The <code>predict()</code> method generates <math>y^{\star}</math> for new inputs <math>X^{\star}</math> using the posterior predictive distribution - this is the GP's way of saying "based on what I've seen, here's my best guess over here, and here's how sure I am":
* The fit method runs an optimization procedure to fit the parameters
* The predict method generates predicted outcomes <math>y^{\star}</math> given a new set of predictors <math>X^{\star}</math>
* The predictors are fulfilled by the posterior predictive distribution - the Gaussian process mean and covariance that are updated to the posterior forms:


<math>
<math>
Line 74: Line 190:
</math>
</math>


To do this with scikit-learn, using the interval from -5 to 5 as an input::
In code, predicting over a range and getting uncertainties back:


<pre>
<pre>
Line 81: Line 197:
</pre>
</pre>


===GPFlow===
That <code>return_std=True</code> flag is the thing that makes GPs special - you get your error bars in the same call as your predictions. No extra steps.


GPFlow is software derived from GPy software, a Gaussian Process fitting software from teh Sheffield machine learning group. Both provide a set of classes for creating Gaussian process models, and a library of kernels for mean and covariate functions. Under the hood, GPFlow uses the TensorFlow library. The use of TensorFlow allows fitting larger and more complex Gaussian Process models, and more complicated methods like variational inference.
=== GPFlow: When You Need the Big Guns ===


The API requires a tabulated input for predictors (features) and outputs.
GPFlow comes out of the Sheffield machine learning group (it's the successor to GPy) and uses TensorFlow as its computational backbone. The killer feature: TensorFlow's automatic differentiation means you can throw arbitrarily complex models at the optimizer without writing gradient code by hand. This lets you fit larger, gnarlier GP models and use techniques like variational inference that would be painful to implement manually.


To reshape y to a tabular format:
Start by reshaping your data into tabular format (GPFlow expects this):


<pre>
<pre>
Line 98: Line 214:
</pre>
</pre>


(a reminder, the exponential_cov function is defined as:)
(Quick refresher on the <code>exponential_cov</code> function:)


<pre>
<pre>
Line 106: Line 222:
</pre>
</pre>


Now import the GPFlow module:
Import GPFlow and set up a Matern kernel with a GPR model:


<pre>
<pre>
import GPflow
import GPflow
k = GPflow.kernels.Matern32(1, variance=1, lengthscales=1.2)
m = GPflow.gpr.GPR(X, Y, kern=k)
print(m)
</pre>
The Matern kernel has multiple tunable parameters. There's also a likelihood variance you can set directly:
<pre>
m.likelihood.variance = 0.01
</pre>
Fit the model:
<pre>
m.optimize()
</pre>
Make predictions:
<pre>
m.predict_y()
</pre>
</pre>


GPFlow provides six GP classes for the covariance and model structure (non-normal likelihood).
So far this is maximum likelihood - no priors specified. But GPFlow lets you assign proper Bayesian priors to kernel parameters through <code>GPflow.priors</code>, accessed via <code>m.kern</code>:


We can fit the covariance and model using Markov Chain Monte Carlo (MCMC) or approximate it via variational inference.
<pre>
m.kern.variance.prior = GPflow.priors.Gamma(1,0.1)
m.kern.lengthscales.prior = GPflow.priors.Gamma(1,0.1)
</pre>


Here is the same type of covariance matrix we used with scikit-learn, the Matern distribution:
Sometimes you already know a parameter value - like when your measurement instrument has a documented error margin. In that case, pin it down:


<pre>
<pre>
k = GPflow.kernels.Matern32(1, variance=1, lengthscales=1.2)
m.likelihood.variance = 0.1
m.likelihood.variance.fixed = True
</pre>


m = GPflow.gpr.GPR(X, Y, kern=k)
Now the priors are wired in. Re-optimize and check the result:


<pre>
m.optimize()
print(m)
print(m)
</pre>
</pre>


=Flags=
'''Going Full Bayesian with GPMC:'''
 
The GPMC class lets you do Hamiltonian Monte Carlo (HMC) to jointly sample over parameters and functions. HMC uses gradient information to explore the posterior efficiently - and this is where TensorFlow's autodiff really shines:
 
<pre>
l = GPflow.likelihoods.StudentT()
m = GPflow.gpmc.GPMC(X, Y, kern=k, likelihood=l)
m.kern.variance.prior = GPflow.priors.Gamma(1,1)
m.kern.lengthscales.prior = GPflow.priors.Gamma(1,1)
</pre>
 
Run the sampler:
 
<pre>
trace = m.sample(1000, verbose=True, epsilon=0.03, Lmax=15)
</pre>
 
Grab parameter samples as a dataframe:
 
<pre>
parameter_samples = m.get_samples_df(trace)
</pre>
 
Plot the posterior histograms:
 
<pre>
for col in parameter_samples.columns.sort_values()[1:]:
    parameter_samples[col].hist(label=col.split('.')[-1], alpha=0.4, bins=15)
</pre>
 
Generate predictions from the posterior samples:
 
<pre>
realizations = []
for sample in trace[-100:]:
    m.set_state(sample)
realizations.append(m.predict_f_samples(xx, 1).squeeze())
realizations = np.vstack(realizations)
</pre>
 
This HMC approach is clutch when your likelihood function or model structure is weird enough that gradient-ascent optimizers (like what scikit-learn uses) struggle to converge.
 
=== PyMC3: Full Bayesian, No Shortcuts ===
 
[[PyMC]]3 is a probabilistic programming framework that uses Theano as its backend (analogous to GPFlow using TensorFlow). It gives you functions and objects for specifying covariance kernels and prior distributions - but expressed as Theano tensor variables, since the whole model gets compiled into a computation graph.
 
Import PyMC3 and Theano:
 
<pre>
import pymc3 as pm
import theano.tensor as tt
</pre>
 
Everything lives inside a <code>Model()</code> context. This is where you declare parameters, variables, and functions that define your Bayesian model. (See [http://docs.pymc.io/ PyMC3 documentation] for the full picture.) Here's a Matern setup:
 
<pre>
with pm.Model() as gp_fit:
    rho = pm.Gamma('rho', 1, 1)
    eta = pm.Gamma('eta', 1, 1)
    k = eta * pm.gp.cov.Matern32(1, rho)
</pre>
 
Specify a zero mean function and a Half-Cauchy prior for the noise:
 
<pre>
with gp_fit:
    M = pm.gp.mean.Zero()
    sigma = pm.HalfCauchy('sigma', 2.5)
</pre>
 
The GP class ties together the mean function, covariance function, and observation noise. Then we hand it the actual data:
 
<pre>
with gp_fit:
    y_obs = pm.gp.GP('y_obs', mean_func=M, cov_func=K, sigma=sigma, observed={'X':X, 'Y':y})
</pre>
 
Fit by calling <code>sample()</code>. PyMC3 defaults to the No U-Turn Sampler (NUTS), which is a sophisticated variant of Hamiltonian Monte Carlo - it uses Theano's autodiff to compute gradients and navigate the posterior efficiently:
 
<pre>
with gp_fit:
    trace = pm.sample(2000, n_init = 20000)
 
pm.traceplot( trace[-1000:], varnames=['rho', 'sigma', 'eta'])
</pre>
 
Generate predictive samples over a grid:
 
<pre>
Z = np.linspace(-6, 6, 100).reshape(-1, 1)
 
with gp_fit:
    gp_samples = pm.gp.sample_gp( trace[1000:], y_obs, Z, samples=50)
</pre>
 
This gives you 50 different Monte Carlo realizations of the function - each one a plausible fit to your data. The spread between them is your model uncertainty made visible.
 
'''When to reach for PyMC3 over the alternatives:''' MCMC is expensive for large datasets - the log-probability gets evaluated at every iteration, and that adds up fast. Variational inference methods approximate the posterior with something cheaper at the cost of some accuracy. The tradeoff is often worth it.
 
* '''Variational Gaussian Approximation''' is implemented in GPFlow.
* '''Automatic Differentiation Variational Inference''' (ADVI) is implemented in PyMC3.
 
The field is actively working on making these approximations better and cheaper, so expect this landscape to keep shifting.


[[Category:Data Analysis]]
[[Category:Data Analysis]]
[[Category:Python]]
[[Category:Python]]
[[Category:Monte Carlo]]
[[Category:Monte Carlo]]

Latest revision as of 17:32, 24 June 2026

GPM = Gaussian Process Modeling

The rare statistical tool that hands you uncertainty estimates as a built-in feature - no extra computation, no cross-validation headaches, the math just coughs them up as part of the deal.

This page is a guided tour through doing Gaussian Process Modeling in Python, from hand-rolled covariance functions all the way to full Bayesian MCMC sampling.

Why GPs Deserve the Hype

Most ML models give you a point prediction and call it a day. Gaussian Processes? They give you a full predictive distribution - mean AND variance - at every point in your input space. It's the difference between someone telling you "the answer is 42" versus "the answer is 42, and I'm about 90% sure it's between 38 and 46." That uncertainty isn't a bonus feature you bolt on later - it falls right out of the math.

If you want to see this implemented end-to-end in a notebook, this Domino Data Lab walkthrough is worth your time.

The Python GP Ecosystem

You've got three solid libraries to choose from. Picking between them is mostly about how much Bayesian firepower you need and how big your data is:

  • GPFlow - TensorFlow-backed and built for scale. Descends from the Sheffield machine learning group's GPy, so it's got serious academic pedigree. When your dataset is large or your kernel structure gets exotic, this is the one.
  • PyMC3 - Probabilistic programming with full MCMC sampling. If you want posterior distributions over your hyperparameters (not just point estimates), PyMC3 is where the party's at.
  • scikit-learn - The pragmatic choice. Solid GP implementations for both regression and classification, all through the familiar sklearn API. You sacrifice some flexibility (the mean function is locked at zero) but for most real-world problems you won't miss it.

What these all give you: a library of covariance functions (kernels) that describe how your data correlates with itself, plus optimization routines for fitting the GP parameters.

Sample Data: Rolling Your Own GP

There's something deeply satisfying about implementing a GP from scratch before letting a library do the heavy lifting. Here's the core - an exponential covariance function and a conditional updater:

import numpy as np

def exponential_cov(x, y, params):
    return params[0] * np.exp( -0.5 * params[1] * np.subtract.outer(x, y)**2)

def conditional(x_new, x, y, params):
    B = exponential_cov(x_new, x, params)
    C = exponential_cov(x, x, params)
    A = exponential_cov(x_new, x_new, params)

    mu = np.linalg.inv(C).dot(B.T).T.dot(y)
    sigma = A - B.dot(np.linalg.inv(C).dot(B.T))

    return(mu.squeeze(), sigma.squeeze())

Create a Gaussian process prior with hyperparameters θ₀ = 1 and θ₁ = 10, plus a zero mean:

import matplotlib.pylab as plt

theta = [1, 10]
sigma_0 = exponential_cov(0, 0, theta)
xpts = np.arange(-3, 3, step=0.01)
plt.errorbar(xpts, np.zeros(len(xpts)), yerr=sigma_0, capsize=0)

Pick an arbitrary starting point:

x = [1.]
y = [np.random.normal(scale=sigma_0)]

Now update the confidence band to account for the new sample point, using the covariance function to generate point-wise intervals:

sigma_1 = exponential_cov(x, x, theta)

def predict(x, data, kernel, params, sigma, t):
    k = [kernel(x, y, params) for y in data]
    Sinv = np.linalg.inv(sigma)
    y_pred = np.dot(k, Sinv).dot(t)
    sigma_new = kernel(x, x, params) - np.dot(k, Sinv).dot(k)
    return y_pred, sigma_new

x_pred = np.linspace(-3, 3, 1000)
predictions = [predict(i, x, exponential_cov, theta, sigma_1, y) for i in x_pred]

Plot it:

y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")

Add another point:

m, s = conditional([-0.7], x, y, θ)
y2 = np.random.normal(m, s)

Update the covariance to absorb the new observation:

x.append(-0.7)
y.append(y2)

sigma_2 = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_2, y) for i in x_pred]

Sample multiple points at once:

x_more = [-2.1, -1.5, 0.3, 1.8, 2.5]
mu, s = conditional(x_more, x, y, theta)
y_more = np.random.multivariate_normal(mu, s)

Update the covariance once more:

x += x_more
y += y_more.tolist()

σ_new = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_new, y) for i in x_pred]

y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")

At this point you've got a working GP in about 30 lines of code. The covariance function does the heavy lifting - it encodes your assumptions about how the function behaves (smooth? wiggly? periodic?) - and the conditional update handles the Bayesian bookkeeping. It's kind of beautiful how little code this actually takes.

Scikit-Learn: The Workhorse Route

Scikit-learn gives you two GP classes that cover most use cases:

  • GaussianProcessRegressor - You create it by specifying a kernel (covariance function). Fitting is done by maximizing the log marginal likelihood, which sidesteps the need for a computationally expensive cross-validation loop. One thing to know: the mean function is assumed to be zero and you can't change that. This sounds like a limitation but it's rarely a problem in practice - when you're computing the posterior, the mean function gets overwhelmed by the data pretty quickly anyway.
  • GaussianProcessClassifier - For classification tasks with categorical/binary data. Uses a latent Gaussian response variable and transforms it to the unit interval (or simplex for multiclass). This gives you soft probabilistic classification rather than hard labels. The posterior is non-normal, so a Laplace approximation is used in place of maximizing the marginal likelihood.

Kernels live under sklearn.gaussian_process.kernels:

from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import Matern

The Matern kernel is a great flexible first choice:

$ k_M(x) = \dfrac{ \sigma^2 }{\Gamma(\nu) 2^{\nu - 1}} \left( \dfrac{ \sqrt{2 \nu} x }{ l } \right)^{\nu} K_{\nu} \left( \dfrac{ \sqrt{2 \nu } x }{ l } \right) $

Okay, that's a lot of Greek. Here's what you actually care about - three intuitive knobs:

  • $ \sigma $ (amplitude) - scales the y-axis. How tall are your wiggles?
  • $ l $ (length scale) - stretches realizations along the x-axis. How far apart are the features?
  • $ \nu $ (roughness) - controls the sharpness vs. smoothness of ridges in the covariance function. Higher ν = smoother functions.

A single GP kernel can be a combination of multiple kernels. A solid starting recipe: a Matern component for the signal structure, a ConstantKernel for amplitude scaling, and a WhiteKernel for observation noise:

kernel = ConstantKernel() 
    + Matern(length_scale=2, nu=3/2) 
    + WhiteKernel(noise_level=1)

Create the regressor and fit:

gp = gaussian_process.GaussianProcessRegressor(kernel=kernel)
gp.fit(X, y)

Or do it all in one shot with explicit optimizer settings:

GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
    kernel=1**2 + Matern(length_scale=2, nu=1.5) + WhiteKernel(noise_level=1),
    n_restarts_optimizer=0, normalize_y=False,
    optimizer='fmin_l_bfgs_b', random_state=None)

A few things worth knowing about these defaults:

  • L-BFGS-B is the optimizer - it's a quasi-Newton method that handles box constraints, which makes sense for hyperparameters that need to stay positive.
  • normalize_y is False by default - your output variable stays raw.
  • n_restarts_optimizer is your defense against getting stuck in a local maximum. Each restart runs the optimizer from a different starting point. Crank this up if your fit looks sus.
  • Parameters learned during fit() get an underscore appended to their names (the sklearn convention).

Fit vs. Predict:

The fit() method runs the optimization to nail down the hyperparameters. The predict() method generates $ y^{\star} $ for new inputs $ X^{\star} $ using the posterior predictive distribution - this is the GP's way of saying "based on what I've seen, here's my best guess over here, and here's how sure I am":

$ p(y^{\star} \vert y, x, x^{\star}) = GP( m^{\star}(x^{\star}), k^{\star}(x^{\star}) ) $

In code, predicting over a range and getting uncertainties back:

x_pred = np.linspace(-5, 5).reshape(-1,1)
y_pred, sigma = gp.predict(x_pred, return_std=True)

That return_std=True flag is the thing that makes GPs special - you get your error bars in the same call as your predictions. No extra steps.

GPFlow: When You Need the Big Guns

GPFlow comes out of the Sheffield machine learning group (it's the successor to GPy) and uses TensorFlow as its computational backbone. The killer feature: TensorFlow's automatic differentiation means you can throw arbitrarily complex models at the optimizer without writing gradient code by hand. This lets you fit larger, gnarlier GP models and use techniques like variational inference that would be painful to implement manually.

Start by reshaping your data into tabular format (GPFlow expects this):

theta = [1, 10]
sigma_0 = exponential_cov(0, 0, theta)
y = [np.random.normal(scale=sigma_0)]

# reshape into tabular format
Y = y.reshape(-1,1)

(Quick refresher on the exponential_cov function:)

import numpy as np
def exponential_cov(x, y, params):
    return params[0] * np.exp(-0.5*params[1]*np.subtract.outer(x, y)**2)

Import GPFlow and set up a Matern kernel with a GPR model:

import GPflow

k = GPflow.kernels.Matern32(1, variance=1, lengthscales=1.2)
m = GPflow.gpr.GPR(X, Y, kern=k)
print(m)

The Matern kernel has multiple tunable parameters. There's also a likelihood variance you can set directly:

m.likelihood.variance = 0.01

Fit the model:

m.optimize()

Make predictions:

m.predict_y()

So far this is maximum likelihood - no priors specified. But GPFlow lets you assign proper Bayesian priors to kernel parameters through GPflow.priors, accessed via m.kern:

m.kern.variance.prior = GPflow.priors.Gamma(1,0.1)
m.kern.lengthscales.prior = GPflow.priors.Gamma(1,0.1)

Sometimes you already know a parameter value - like when your measurement instrument has a documented error margin. In that case, pin it down:

m.likelihood.variance = 0.1
m.likelihood.variance.fixed = True

Now the priors are wired in. Re-optimize and check the result:

m.optimize()
print(m)

Going Full Bayesian with GPMC:

The GPMC class lets you do Hamiltonian Monte Carlo (HMC) to jointly sample over parameters and functions. HMC uses gradient information to explore the posterior efficiently - and this is where TensorFlow's autodiff really shines:

l = GPflow.likelihoods.StudentT()
m = GPflow.gpmc.GPMC(X, Y, kern=k, likelihood=l)
m.kern.variance.prior = GPflow.priors.Gamma(1,1)
m.kern.lengthscales.prior = GPflow.priors.Gamma(1,1)

Run the sampler:

trace = m.sample(1000, verbose=True, epsilon=0.03, Lmax=15)

Grab parameter samples as a dataframe:

parameter_samples = m.get_samples_df(trace)

Plot the posterior histograms:

for col in parameter_samples.columns.sort_values()[1:]:
    parameter_samples[col].hist(label=col.split('.')[-1], alpha=0.4, bins=15)

Generate predictions from the posterior samples:

realizations = []
for sample in trace[-100:]:
    m.set_state(sample)
realizations.append(m.predict_f_samples(xx, 1).squeeze())
realizations = np.vstack(realizations)

This HMC approach is clutch when your likelihood function or model structure is weird enough that gradient-ascent optimizers (like what scikit-learn uses) struggle to converge.

PyMC3: Full Bayesian, No Shortcuts

PyMC3 is a probabilistic programming framework that uses Theano as its backend (analogous to GPFlow using TensorFlow). It gives you functions and objects for specifying covariance kernels and prior distributions - but expressed as Theano tensor variables, since the whole model gets compiled into a computation graph.

Import PyMC3 and Theano:

import pymc3 as pm
import theano.tensor as tt

Everything lives inside a Model() context. This is where you declare parameters, variables, and functions that define your Bayesian model. (See PyMC3 documentation for the full picture.) Here's a Matern setup:

with pm.Model() as gp_fit:
    rho = pm.Gamma('rho', 1, 1)
    eta = pm.Gamma('eta', 1, 1)
    k = eta * pm.gp.cov.Matern32(1, rho)

Specify a zero mean function and a Half-Cauchy prior for the noise:

with gp_fit:
    M = pm.gp.mean.Zero()
    sigma = pm.HalfCauchy('sigma', 2.5)

The GP class ties together the mean function, covariance function, and observation noise. Then we hand it the actual data:

with gp_fit:
    y_obs = pm.gp.GP('y_obs', mean_func=M, cov_func=K, sigma=sigma, observed={'X':X, 'Y':y})

Fit by calling sample(). PyMC3 defaults to the No U-Turn Sampler (NUTS), which is a sophisticated variant of Hamiltonian Monte Carlo - it uses Theano's autodiff to compute gradients and navigate the posterior efficiently:

with gp_fit:
    trace = pm.sample(2000, n_init = 20000)

pm.traceplot( trace[-1000:], varnames=['rho', 'sigma', 'eta'])

Generate predictive samples over a grid:

Z = np.linspace(-6, 6, 100).reshape(-1, 1)

with gp_fit:
    gp_samples = pm.gp.sample_gp( trace[1000:], y_obs, Z, samples=50)

This gives you 50 different Monte Carlo realizations of the function - each one a plausible fit to your data. The spread between them is your model uncertainty made visible.

When to reach for PyMC3 over the alternatives: MCMC is expensive for large datasets - the log-probability gets evaluated at every iteration, and that adds up fast. Variational inference methods approximate the posterior with something cheaper at the cost of some accuracy. The tradeoff is often worth it.

  • Variational Gaussian Approximation is implemented in GPFlow.
  • Automatic Differentiation Variational Inference (ADVI) is implemented in PyMC3.

The field is actively working on making these approximations better and cheaper, so expect this landscape to keep shifting.