Gaussian Process Modeling

GPM = Gaussian Process Modeling

The rare statistical tool that hands you uncertainty estimates as a built-in feature - no extra computation, no cross-validation headaches, the math just coughs them up as part of the deal.

This page is a guided tour through doing Gaussian Process Modeling in Python, from hand-rolled covariance functions all the way to full Bayesian MCMC sampling.

Why GPs Deserve the Hype

Most ML models give you a point prediction and call it a day. Gaussian Processes? They give you a full predictive distribution - mean AND variance - at every point in your input space. It's the difference between someone telling you "the answer is 42" versus "the answer is 42, and I'm about 90% sure it's between 38 and 46." That uncertainty isn't a bonus feature you bolt on later - it falls right out of the math.

If you want to see this implemented end-to-end in a notebook, this Domino Data Lab walkthrough is worth your time.

The Python GP Ecosystem

You've got three solid libraries to choose from. Picking between them is mostly about how much Bayesian firepower you need and how big your data is:

GPFlow - TensorFlow-backed and built for scale. Descends from the Sheffield machine learning group's GPy, so it's got serious academic pedigree. When your dataset is large or your kernel structure gets exotic, this is the one.
PyMC3 - Probabilistic programming with full MCMC sampling. If you want posterior distributions over your hyperparameters (not just point estimates), PyMC3 is where the party's at.
scikit-learn - The pragmatic choice. Solid GP implementations for both regression and classification, all through the familiar sklearn API. You sacrifice some flexibility (the mean function is locked at zero) but for most real-world problems you won't miss it.

What these all give you: a library of covariance functions (kernels) that describe how your data correlates with itself, plus optimization routines for fitting the GP parameters.

Sample Data: Rolling Your Own GP

There's something deeply satisfying about implementing a GP from scratch before letting a library do the heavy lifting. Here's the core - an exponential covariance function and a conditional updater:

import numpy as np

def exponential_cov(x, y, params):
    return params[0] * np.exp( -0.5 * params[1] * np.subtract.outer(x, y)**2)

def conditional(x_new, x, y, params):
    B = exponential_cov(x_new, x, params)
    C = exponential_cov(x, x, params)
    A = exponential_cov(x_new, x_new, params)

    mu = np.linalg.inv(C).dot(B.T).T.dot(y)
    sigma = A - B.dot(np.linalg.inv(C).dot(B.T))

    return(mu.squeeze(), sigma.squeeze())

Create a Gaussian process prior with hyperparameters θ₀ = 1 and θ₁ = 10, plus a zero mean:

import matplotlib.pylab as plt

theta = [1, 10]
sigma_0 = exponential_cov(0, 0, theta)
xpts = np.arange(-3, 3, step=0.01)
plt.errorbar(xpts, np.zeros(len(xpts)), yerr=sigma_0, capsize=0)

Pick an arbitrary starting point:

x = [1.]
y = [np.random.normal(scale=sigma_0)]

Now update the confidence band to account for the new sample point, using the covariance function to generate point-wise intervals:

sigma_1 = exponential_cov(x, x, theta)

def predict(x, data, kernel, params, sigma, t):
    k = [kernel(x, y, params) for y in data]
    Sinv = np.linalg.inv(sigma)
    y_pred = np.dot(k, Sinv).dot(t)
    sigma_new = kernel(x, x, params) - np.dot(k, Sinv).dot(k)
    return y_pred, sigma_new

x_pred = np.linspace(-3, 3, 1000)
predictions = [predict(i, x, exponential_cov, theta, sigma_1, y) for i in x_pred]

Plot it:

y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")

Add another point:

m, s = conditional([-0.7], x, y, θ)
y2 = np.random.normal(m, s)

Update the covariance to absorb the new observation:

x.append(-0.7)
y.append(y2)

sigma_2 = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_2, y) for i in x_pred]

Sample multiple points at once:

x_more = [-2.1, -1.5, 0.3, 1.8, 2.5]
mu, s = conditional(x_more, x, y, theta)
y_more = np.random.multivariate_normal(mu, s)

Update the covariance once more:

x += x_more
y += y_more.tolist()

σ_new = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_new, y) for i in x_pred]

y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")

At this point you've got a working GP in about 30 lines of code. The covariance function does the heavy lifting - it encodes your assumptions about how the function behaves (smooth? wiggly? periodic?) - and the conditional update handles the Bayesian bookkeeping. It's kind of beautiful how little code this actually takes.

Scikit-Learn: The Workhorse Route

Scikit-learn gives you two GP classes that cover most use cases:

GaussianProcessRegressor - You create it by specifying a kernel (covariance function). Fitting is done by maximizing the log marginal likelihood, which sidesteps the need for a computationally expensive cross-validation loop. One thing to know: the mean function is assumed to be zero and you can't change that. This sounds like a limitation but it's rarely a problem in practice - when you're computing the posterior, the mean function gets overwhelmed by the data pretty quickly anyway.
GaussianProcessClassifier - For classification tasks with categorical/binary data. Uses a latent Gaussian response variable and transforms it to the unit interval (or simplex for multiclass). This gives you soft probabilistic classification rather than hard labels. The posterior is non-normal, so a Laplace approximation is used in place of maximizing the marginal likelihood.

Kernels live under sklearn.gaussian_process.kernels:

from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import Matern

The Matern kernel is a great flexible first choice:

$k_M(x) = \dfrac{ \sigma^2 }{\Gamma(\nu) 2^{\nu - 1}} \left( \dfrac{ \sqrt{2 \nu} x }{ l } \right)^{\nu} K_{\nu} \left( \dfrac{ \sqrt{2 \nu } x }{ l } \right)$

Okay, that's a lot of Greek. Here's what you actually care about - three intuitive knobs:

$\sigma$ (amplitude) - scales the y-axis. How tall are your wiggles?
$$ l $$ (length scale) - stretches realizations along the x-axis. How far apart are the features?
$\nu$ (roughness) - controls the sharpness vs. smoothness of ridges in the covariance function. Higher ν = smoother functions.

A single GP kernel can be a combination of multiple kernels. A solid starting recipe: a Matern component for the signal structure, a ConstantKernel for amplitude scaling, and a WhiteKernel for observation noise:

kernel = ConstantKernel() 
    + Matern(length_scale=2, nu=3/2) 
    + WhiteKernel(noise_level=1)

Create the regressor and fit:

gp = gaussian_process.GaussianProcessRegressor(kernel=kernel)
gp.fit(X, y)

Or do it all in one shot with explicit optimizer settings:

GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
    kernel=1**2 + Matern(length_scale=2, nu=1.5) + WhiteKernel(noise_level=1),
    n_restarts_optimizer=0, normalize_y=False,
    optimizer='fmin_l_bfgs_b', random_state=None)

A few things worth knowing about these defaults:

L-BFGS-B is the optimizer - it's a quasi-Newton method that handles box constraints, which makes sense for hyperparameters that need to stay positive.
normalize_y is False by default - your output variable stays raw.
n_restarts_optimizer is your defense against getting stuck in a local maximum. Each restart runs the optimizer from a different starting point. Crank this up if your fit looks sus.
Parameters learned during fit() get an underscore appended to their names (the sklearn convention).

Fit vs. Predict:

The fit() method runs the optimization to nail down the hyperparameters. The predict() method generates $y^{\star}$ for new inputs $X^{\star}$ using the posterior predictive distribution - this is the GP's way of saying "based on what I've seen, here's my best guess over here, and here's how sure I am":

$p(y^{\star} \vert y, x, x^{\star}) = GP( m^{\star}(x^{\star}), k^{\star}(x^{\star}) )$

In code, predicting over a range and getting uncertainties back:

x_pred = np.linspace(-5, 5).reshape(-1,1)
y_pred, sigma = gp.predict(x_pred, return_std=True)

That return_std=True flag is the thing that makes GPs special - you get your error bars in the same call as your predictions. No extra steps.

GPFlow: When You Need the Big Guns

GPFlow comes out of the Sheffield machine learning group (it's the successor to GPy) and uses TensorFlow as its computational backbone. The killer feature: TensorFlow's automatic differentiation means you can throw arbitrarily complex models at the optimizer without writing gradient code by hand. This lets you fit larger, gnarlier GP models and use techniques like variational inference that would be painful to implement manually.

Start by reshaping your data into tabular format (GPFlow expects this):

theta = [1, 10]
sigma_0 = exponential_cov(0, 0, theta)
y = [np.random.normal(scale=sigma_0)]

# reshape into tabular format
Y = y.reshape(-1,1)

(Quick refresher on the exponential_cov function:)

import numpy as np
def exponential_cov(x, y, params):
    return params[0] * np.exp(-0.5*params[1]*np.subtract.outer(x, y)**2)

Import GPFlow and set up a Matern kernel with a GPR model:

import GPflow

k = GPflow.kernels.Matern32(1, variance=1, lengthscales=1.2)
m = GPflow.gpr.GPR(X, Y, kern=k)
print(m)

The Matern kernel has multiple tunable parameters. There's also a likelihood variance you can set directly:

m.likelihood.variance = 0.01

Fit the model:

m.optimize()

Make predictions:

m.predict_y()

So far this is maximum likelihood - no priors specified. But GPFlow lets you assign proper Bayesian priors to kernel parameters through GPflow.priors, accessed via m.kern:

m.kern.variance.prior = GPflow.priors.Gamma(1,0.1)
m.kern.lengthscales.prior = GPflow.priors.Gamma(1,0.1)

Sometimes you already know a parameter value - like when your measurement instrument has a documented error margin. In that case, pin it down:

m.likelihood.variance = 0.1
m.likelihood.variance.fixed = True

Now the priors are wired in. Re-optimize and check the result:

m.optimize()
print(m)

Going Full Bayesian with GPMC:

The GPMC class lets you do Hamiltonian Monte Carlo (HMC) to jointly sample over parameters and functions. HMC uses gradient information to explore the posterior efficiently - and this is where TensorFlow's autodiff really shines:

l = GPflow.likelihoods.StudentT()
m = GPflow.gpmc.GPMC(X, Y, kern=k, likelihood=l)
m.kern.variance.prior = GPflow.priors.Gamma(1,1)
m.kern.lengthscales.prior = GPflow.priors.Gamma(1,1)

Run the sampler:

trace = m.sample(1000, verbose=True, epsilon=0.03, Lmax=15)

Grab parameter samples as a dataframe:

parameter_samples = m.get_samples_df(trace)

Plot the posterior histograms:

for col in parameter_samples.columns.sort_values()[1:]:
    parameter_samples[col].hist(label=col.split('.')[-1], alpha=0.4, bins=15)

Generate predictions from the posterior samples:

realizations = []
for sample in trace[-100:]:
    m.set_state(sample)
realizations.append(m.predict_f_samples(xx, 1).squeeze())
realizations = np.vstack(realizations)

This HMC approach is clutch when your likelihood function or model structure is weird enough that gradient-ascent optimizers (like what scikit-learn uses) struggle to converge.

PyMC3: Full Bayesian, No Shortcuts

PyMC3 is a probabilistic programming framework that uses Theano as its backend (analogous to GPFlow using TensorFlow). It gives you functions and objects for specifying covariance kernels and prior distributions - but expressed as Theano tensor variables, since the whole model gets compiled into a computation graph.

Import PyMC3 and Theano:

import pymc3 as pm
import theano.tensor as tt

Everything lives inside a Model() context. This is where you declare parameters, variables, and functions that define your Bayesian model. (See PyMC3 documentation for the full picture.) Here's a Matern setup:

with pm.Model() as gp_fit:
    rho = pm.Gamma('rho', 1, 1)
    eta = pm.Gamma('eta', 1, 1)
    k = eta * pm.gp.cov.Matern32(1, rho)

Specify a zero mean function and a Half-Cauchy prior for the noise:

with gp_fit:
    M = pm.gp.mean.Zero()
    sigma = pm.HalfCauchy('sigma', 2.5)

The GP class ties together the mean function, covariance function, and observation noise. Then we hand it the actual data:

with gp_fit:
    y_obs = pm.gp.GP('y_obs', mean_func=M, cov_func=K, sigma=sigma, observed={'X':X, 'Y':y})

Fit by calling sample(). PyMC3 defaults to the No U-Turn Sampler (NUTS), which is a sophisticated variant of Hamiltonian Monte Carlo - it uses Theano's autodiff to compute gradients and navigate the posterior efficiently:

with gp_fit:
    trace = pm.sample(2000, n_init = 20000)

pm.traceplot( trace[-1000:], varnames=['rho', 'sigma', 'eta'])

Generate predictive samples over a grid:

Z = np.linspace(-6, 6, 100).reshape(-1, 1)

with gp_fit:
    gp_samples = pm.gp.sample_gp( trace[1000:], y_obs, Z, samples=50)

This gives you 50 different Monte Carlo realizations of the function - each one a plausible fit to your data. The spread between them is your model uncertainty made visible.

When to reach for PyMC3 over the alternatives: MCMC is expensive for large datasets - the log-probability gets evaluated at every iteration, and that adds up fast. Variational inference methods approximate the posterior with something cheaper at the cost of some accuracy. The tradeoff is often worth it.

Variational Gaussian Approximation is implemented in GPFlow.
Automatic Differentiation Variational Inference (ADVI) is implemented in PyMC3.

The field is actively working on making these approximations better and cheaper, so expect this landscape to keep shifting.