Gaussian Process Modeling: Difference between revisions
From charlesreid1
(→Flags) |
(gave this page the Ballmer treatment — better headings, actual personality, fixed bare URLs and typos (via update-page on MediaWiki MCP Server)) |
||
| Line 1: | Line 1: | ||
GPM = Gaussian Process Modeling | GPM = Gaussian Process Modeling. It's the rare statistical tool that hands you uncertainty estimates as a built-in feature — no extra computation, no cross-validation headaches, the math just coughs them up as part of the deal. This page is a guided tour through doing Gaussian Process Modeling in Python, from hand-rolled covariance functions all the way to full Bayesian MCMC sampling. | ||
= | == Why GPs Deserve the Hype == | ||
Most ML models give you a point prediction and call it a day. Gaussian Processes? They give you a full predictive distribution — mean AND variance — at every point in your input space. It's the difference between someone telling you "the answer is 42" versus "the answer is 42, and I'm about 90% sure it's between 38 and 46." That uncertainty isn't a bonus feature you bolt on later — it falls right out of the math. | |||
If you want to see this implemented end-to-end in a notebook, [https://blog.dominodatalab.com/fitting-gaussian-process-models-python/ this Domino Data Lab walkthrough] is worth your time. | |||
== The Python GP Ecosystem == | |||
You've got three solid libraries to choose from. Picking between them is mostly about how much Bayesian firepower you need and how big your data is: | |||
* '''GPFlow''' — TensorFlow-backed and built for scale. Descends from the Sheffield machine learning group's GPy, so it's got serious academic pedigree. When your dataset is large or your kernel structure gets exotic, this is the one. | |||
* '''[[PyMC]]3''' — Probabilistic programming with full MCMC sampling. If you want posterior distributions over your hyperparameters (not just point estimates), PyMC3 is where the party's at. | |||
* '''scikit-learn''' — The pragmatic choice. Solid GP implementations for both regression and classification, all through the familiar sklearn API. You sacrifice some flexibility (the mean function is locked at zero) but for most real-world problems you won't miss it. | |||
What these all give you: a library of covariance functions (kernels) that describe how your data correlates with itself, plus optimization routines for fitting the GP parameters. | |||
=== Sample Data: Rolling Your Own GP === | |||
There's something deeply satisfying about implementing a GP from scratch before letting a library do the heavy lifting. Here's the core — an exponential covariance function and a conditional updater: | |||
<pre> | <pre> | ||
| Line 35: | Line 38: | ||
</pre> | </pre> | ||
Create a Gaussian process prior with hyperparameters θ₀ = 1 and θ₁ = 10, plus a zero mean: | |||
<pre> | <pre> | ||
| Line 46: | Line 49: | ||
</pre> | </pre> | ||
Pick an arbitrary starting point: | |||
<pre> | <pre> | ||
| Line 53: | Line 56: | ||
</pre> | </pre> | ||
Now update the confidence band to account for the new sample point | Now update the confidence band to account for the new sample point, using the covariance function to generate point-wise intervals: | ||
<pre> | <pre> | ||
| Line 69: | Line 72: | ||
</pre> | </pre> | ||
Plot it: | |||
<pre> | <pre> | ||
| Line 77: | Line 80: | ||
</pre> | </pre> | ||
Add another point: | |||
<pre> | <pre> | ||
| Line 84: | Line 87: | ||
</pre> | </pre> | ||
Update the covariance to absorb the new observation: | |||
<pre> | <pre> | ||
| Line 94: | Line 97: | ||
</pre> | </pre> | ||
Sample multiple points at once: | |||
<pre> | <pre> | ||
| Line 102: | Line 105: | ||
</pre> | </pre> | ||
Update the covariance once more: | |||
<pre> | <pre> | ||
| Line 116: | Line 119: | ||
</pre> | </pre> | ||
===Scikit-Learn=== | At this point you've got a working GP in about 30 lines of code. The covariance function does the heavy lifting — it encodes your assumptions about how the function behaves (smooth? wiggly? periodic?) — and the conditional update handles the Bayesian bookkeeping. It's kind of beautiful how little code this actually takes. | ||
=== Scikit-Learn: The Workhorse Route === | |||
Scikit-learn gives you two GP classes that cover most use cases: | |||
* '''GaussianProcessRegressor''' — You create it by specifying a kernel (covariance function). Fitting is done by maximizing the log marginal likelihood, which sidesteps the need for a computationally expensive cross-validation loop. One thing to know: the mean function is assumed to be zero and you can't change that. This sounds like a limitation but it's rarely a problem in practice — when you're computing the posterior, the mean function gets overwhelmed by the data pretty quickly anyway. | |||
* GaussianProcessRegressor | * '''GaussianProcessClassifier''' — For classification tasks with categorical/binary data. Uses a latent Gaussian response variable and transforms it to the unit interval (or simplex for multiclass). This gives you soft probabilistic classification rather than hard labels. The posterior is non-normal, so a Laplace approximation is used in place of maximizing the marginal likelihood. | ||
* GaussianProcessClassifier | |||
Kernels | Kernels live under <code>sklearn.gaussian_process.kernels</code>: | ||
<pre> | |||
from sklearn import gaussian_process | |||
from sklearn.gaussian_process.kernels import Matern | |||
</pre> | |||
The Matern kernel is a great flexible first choice: | |||
<math> | <math> | ||
| Line 131: | Line 141: | ||
</math> | </math> | ||
Okay, that's a lot of Greek. Here's what you actually care about — three intuitive knobs: | |||
* <math>\sigma</math> (amplitude) — scales the y-axis. How tall are your wiggles? | |||
* A single | * <math>l</math> (length scale) — stretches realizations along the x-axis. How far apart are the features? | ||
* <math>\nu</math> (roughness) — controls the sharpness vs. smoothness of ridges in the covariance function. Higher ν = smoother functions. | |||
A single GP kernel can be a combination of multiple kernels. A solid starting recipe: a Matern component for the signal structure, a ConstantKernel for amplitude scaling, and a WhiteKernel for observation noise: | |||
<pre> | <pre> | ||
| Line 145: | Line 155: | ||
</pre> | </pre> | ||
Create the regressor and fit: | |||
<pre> | <pre> | ||
| Line 152: | Line 162: | ||
</pre> | </pre> | ||
Or do it all in one shot with explicit optimizer settings: | |||
<pre> | <pre> | ||
| Line 161: | Line 171: | ||
</pre> | </pre> | ||
A few things worth knowing about these defaults: | |||
Fit vs. Predict: | * '''L-BFGS-B''' is the optimizer — it's a quasi-Newton method that handles box constraints, which makes sense for hyperparameters that need to stay positive. | ||
* '''normalize_y''' is False by default — your output variable stays raw. | |||
* '''n_restarts_optimizer''' is your defense against getting stuck in a local maximum. Each restart runs the optimizer from a different starting point. Crank this up if your fit looks sus. | |||
* Parameters learned during <code>fit()</code> get an underscore appended to their names (the sklearn convention). | |||
'''Fit vs. Predict:''' | |||
The <code>fit()</code> method runs the optimization to nail down the hyperparameters. The <code>predict()</code> method generates <math>y^{\star}</math> for new inputs <math>X^{\star}</math> using the posterior predictive distribution — this is the GP's way of saying "based on what I've seen, here's my best guess over here, and here's how sure I am": | |||
<math> | <math> | ||
| Line 176: | Line 186: | ||
</math> | </math> | ||
In code, predicting over a range and getting uncertainties back: | |||
<pre> | <pre> | ||
| Line 183: | Line 193: | ||
</pre> | </pre> | ||
= | That <code>return_std=True</code> flag is the thing that makes GPs special — you get your error bars in the same call as your predictions. No extra steps. | ||
GPFlow | === GPFlow: When You Need the Big Guns === | ||
GPFlow comes out of the Sheffield machine learning group (it's the successor to GPy) and uses TensorFlow as its computational backbone. The killer feature: TensorFlow's automatic differentiation means you can throw arbitrarily complex models at the optimizer without writing gradient code by hand. This lets you fit larger, gnarlier GP models and use techniques like variational inference that would be painful to implement manually. | |||
Start by reshaping your data into tabular format (GPFlow expects this): | |||
<pre> | <pre> | ||
| Line 200: | Line 210: | ||
</pre> | </pre> | ||
( | (Quick refresher on the <code>exponential_cov</code> function:) | ||
<pre> | <pre> | ||
| Line 208: | Line 218: | ||
</pre> | </pre> | ||
Import GPFlow and set up a Matern kernel with a GPR model: | |||
<pre> | <pre> | ||
import GPflow | import GPflow | ||
k = GPflow.kernels.Matern32(1, variance=1, lengthscales=1.2) | k = GPflow.kernels.Matern32(1, variance=1, lengthscales=1.2) | ||
m = GPflow.gpr.GPR(X, Y, kern=k) | m = GPflow.gpr.GPR(X, Y, kern=k) | ||
| Line 226: | Line 228: | ||
</pre> | </pre> | ||
The Matern kernel | The Matern kernel has multiple tunable parameters. There's also a likelihood variance you can set directly: | ||
<pre> | <pre> | ||
| Line 232: | Line 234: | ||
</pre> | </pre> | ||
Fit the model: | |||
<pre> | <pre> | ||
| Line 238: | Line 240: | ||
</pre> | </pre> | ||
Make predictions: | |||
<pre> | <pre> | ||
| Line 244: | Line 246: | ||
</pre> | </pre> | ||
So far this is maximum likelihood — no priors specified. But GPFlow lets you assign proper Bayesian priors to kernel parameters through <code>GPflow.priors</code>, accessed via <code>m.kern</code>: | |||
<pre> | <pre> | ||
| Line 251: | Line 253: | ||
</pre> | </pre> | ||
Sometimes you already know a parameter value — like when your measurement instrument has a documented error margin. In that case, pin it down: | |||
<pre> | <pre> | ||
| Line 258: | Line 260: | ||
</pre> | </pre> | ||
Now the priors are wired in. Re-optimize and check the result: | |||
<pre> | <pre> | ||
| Line 267: | Line 267: | ||
</pre> | </pre> | ||
The GPMC class | '''Going Full Bayesian with GPMC:''' | ||
The GPMC class lets you do Hamiltonian Monte Carlo (HMC) to jointly sample over parameters and functions. HMC uses gradient information to explore the posterior efficiently — and this is where TensorFlow's autodiff really shines: | |||
<pre> | <pre> | ||
| Line 276: | Line 278: | ||
</pre> | </pre> | ||
Run the sampler: | |||
<pre> | <pre> | ||
| Line 282: | Line 284: | ||
</pre> | </pre> | ||
Grab parameter samples as a dataframe: | |||
<pre> | <pre> | ||
| Line 289: | Line 290: | ||
</pre> | </pre> | ||
Plot the posterior histograms: | |||
<pre> | <pre> | ||
| Line 296: | Line 297: | ||
</pre> | </pre> | ||
Generate predictions from the posterior samples: | |||
<pre> | <pre> | ||
| Line 306: | Line 307: | ||
</pre> | </pre> | ||
This is | This HMC approach is clutch when your likelihood function or model structure is weird enough that gradient-ascent optimizers (like what scikit-learn uses) struggle to converge. | ||
===PyMC3=== | === PyMC3: Full Bayesian, No Shortcuts === | ||
[[PyMC]]3 is a | [[PyMC]]3 is a probabilistic programming framework that uses Theano as its backend (analogous to GPFlow using TensorFlow). It gives you functions and objects for specifying covariance kernels and prior distributions — but expressed as Theano tensor variables, since the whole model gets compiled into a computation graph. | ||
Import PyMC3 and Theano: | |||
<pre> | <pre> | ||
| Line 319: | Line 320: | ||
</pre> | </pre> | ||
Everything lives inside a <code>Model()</code> context. This is where you declare parameters, variables, and functions that define your Bayesian model. (See [http://docs.pymc.io/ PyMC3 documentation] for the full picture.) Here's a Matern setup: | |||
<pre> | <pre> | ||
| Line 328: | Line 329: | ||
</pre> | </pre> | ||
Specify a zero mean function and a Half-Cauchy prior for the noise: | |||
<pre> | <pre> | ||
| Line 336: | Line 337: | ||
</pre> | </pre> | ||
The GP class ties together the mean function, covariance function, and observation noise. Then we hand it the actual data: | |||
<pre> | <pre> | ||
| Line 345: | Line 344: | ||
</pre> | </pre> | ||
Fit by calling <code>sample()</code>. PyMC3 defaults to the No U-Turn Sampler (NUTS), which is a sophisticated variant of Hamiltonian Monte Carlo — it uses Theano's autodiff to compute gradients and navigate the posterior efficiently: | |||
<pre> | <pre> | ||
| Line 354: | Line 353: | ||
</pre> | </pre> | ||
Generate predictive samples over a grid: | |||
<pre> | <pre> | ||
| Line 363: | Line 362: | ||
</pre> | </pre> | ||
This | This gives you 50 different Monte Carlo realizations of the function — each one a plausible fit to your data. The spread between them is your model uncertainty made visible. | ||
'''When to reach for PyMC3 over the alternatives:''' MCMC is expensive for large datasets — the log-probability gets evaluated at every iteration, and that adds up fast. Variational inference methods approximate the posterior with something cheaper at the cost of some accuracy. The tradeoff is often worth it. | |||
Variational Gaussian Approximation is implemented in | * '''Variational Gaussian Approximation''' is implemented in GPFlow. | ||
* '''Automatic Differentiation Variational Inference''' (ADVI) is implemented in PyMC3. | |||
The field is actively working on making these approximations better and cheaper, so expect this landscape to keep shifting. | |||
= | == Where This Page Lives == | ||
[[Category:Data Analysis]] | [[Category:Data Analysis]] | ||
[[Category:Python]] | [[Category:Python]] | ||
[[Category:Monte Carlo]] | [[Category:Monte Carlo]] | ||
Revision as of 17:28, 24 June 2026
GPM = Gaussian Process Modeling. It's the rare statistical tool that hands you uncertainty estimates as a built-in feature — no extra computation, no cross-validation headaches, the math just coughs them up as part of the deal. This page is a guided tour through doing Gaussian Process Modeling in Python, from hand-rolled covariance functions all the way to full Bayesian MCMC sampling.
Why GPs Deserve the Hype
Most ML models give you a point prediction and call it a day. Gaussian Processes? They give you a full predictive distribution — mean AND variance — at every point in your input space. It's the difference between someone telling you "the answer is 42" versus "the answer is 42, and I'm about 90% sure it's between 38 and 46." That uncertainty isn't a bonus feature you bolt on later — it falls right out of the math.
If you want to see this implemented end-to-end in a notebook, this Domino Data Lab walkthrough is worth your time.
The Python GP Ecosystem
You've got three solid libraries to choose from. Picking between them is mostly about how much Bayesian firepower you need and how big your data is:
- GPFlow — TensorFlow-backed and built for scale. Descends from the Sheffield machine learning group's GPy, so it's got serious academic pedigree. When your dataset is large or your kernel structure gets exotic, this is the one.
- PyMC3 — Probabilistic programming with full MCMC sampling. If you want posterior distributions over your hyperparameters (not just point estimates), PyMC3 is where the party's at.
- scikit-learn — The pragmatic choice. Solid GP implementations for both regression and classification, all through the familiar sklearn API. You sacrifice some flexibility (the mean function is locked at zero) but for most real-world problems you won't miss it.
What these all give you: a library of covariance functions (kernels) that describe how your data correlates with itself, plus optimization routines for fitting the GP parameters.
Sample Data: Rolling Your Own GP
There's something deeply satisfying about implementing a GP from scratch before letting a library do the heavy lifting. Here's the core — an exponential covariance function and a conditional updater:
import numpy as np
def exponential_cov(x, y, params):
return params[0] * np.exp( -0.5 * params[1] * np.subtract.outer(x, y)**2)
def conditional(x_new, x, y, params):
B = exponential_cov(x_new, x, params)
C = exponential_cov(x, x, params)
A = exponential_cov(x_new, x_new, params)
mu = np.linalg.inv(C).dot(B.T).T.dot(y)
sigma = A - B.dot(np.linalg.inv(C).dot(B.T))
return(mu.squeeze(), sigma.squeeze())
Create a Gaussian process prior with hyperparameters θ₀ = 1 and θ₁ = 10, plus a zero mean:
import matplotlib.pylab as plt theta = [1, 10] sigma_0 = exponential_cov(0, 0, theta) xpts = np.arange(-3, 3, step=0.01) plt.errorbar(xpts, np.zeros(len(xpts)), yerr=sigma_0, capsize=0)
Pick an arbitrary starting point:
x = [1.] y = [np.random.normal(scale=sigma_0)]
Now update the confidence band to account for the new sample point, using the covariance function to generate point-wise intervals:
sigma_1 = exponential_cov(x, x, theta)
def predict(x, data, kernel, params, sigma, t):
k = [kernel(x, y, params) for y in data]
Sinv = np.linalg.inv(sigma)
y_pred = np.dot(k, Sinv).dot(t)
sigma_new = kernel(x, x, params) - np.dot(k, Sinv).dot(k)
return y_pred, sigma_new
x_pred = np.linspace(-3, 3, 1000)
predictions = [predict(i, x, exponential_cov, theta, sigma_1, y) for i in x_pred]
Plot it:
y_pred, sigmas = np.transpose(predictions) plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0) plt.plot(x, y, "ro")
Add another point:
m, s = conditional([-0.7], x, y, θ) y2 = np.random.normal(m, s)
Update the covariance to absorb the new observation:
x.append(-0.7) y.append(y2) sigma_2 = exponential_cov(x, x, theta) predictions = [predict(i, x, exponential_cov, theta, sigma_2, y) for i in x_pred]
Sample multiple points at once:
x_more = [-2.1, -1.5, 0.3, 1.8, 2.5] mu, s = conditional(x_more, x, y, theta) y_more = np.random.multivariate_normal(mu, s)
Update the covariance once more:
x += x_more y += y_more.tolist() σ_new = exponential_cov(x, x, theta) predictions = [predict(i, x, exponential_cov, theta, sigma_new, y) for i in x_pred] y_pred, sigmas = np.transpose(predictions) plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0) plt.plot(x, y, "ro")
At this point you've got a working GP in about 30 lines of code. The covariance function does the heavy lifting — it encodes your assumptions about how the function behaves (smooth? wiggly? periodic?) — and the conditional update handles the Bayesian bookkeeping. It's kind of beautiful how little code this actually takes.
Scikit-Learn: The Workhorse Route
Scikit-learn gives you two GP classes that cover most use cases:
- GaussianProcessRegressor — You create it by specifying a kernel (covariance function). Fitting is done by maximizing the log marginal likelihood, which sidesteps the need for a computationally expensive cross-validation loop. One thing to know: the mean function is assumed to be zero and you can't change that. This sounds like a limitation but it's rarely a problem in practice — when you're computing the posterior, the mean function gets overwhelmed by the data pretty quickly anyway.
- GaussianProcessClassifier — For classification tasks with categorical/binary data. Uses a latent Gaussian response variable and transforms it to the unit interval (or simplex for multiclass). This gives you soft probabilistic classification rather than hard labels. The posterior is non-normal, so a Laplace approximation is used in place of maximizing the marginal likelihood.
Kernels live under sklearn.gaussian_process.kernels:
from sklearn import gaussian_process from sklearn.gaussian_process.kernels import Matern
The Matern kernel is a great flexible first choice:
$ k_M(x) = \dfrac{ \sigma^2 }{\Gamma(\nu) 2^{\nu - 1}} \left( \dfrac{ \sqrt{2 \nu} x }{ l } \right)^{\nu} K_{\nu} \left( \dfrac{ \sqrt{2 \nu } x }{ l } \right) $
Okay, that's a lot of Greek. Here's what you actually care about — three intuitive knobs:
- $ \sigma $ (amplitude) — scales the y-axis. How tall are your wiggles?
- $ l $ (length scale) — stretches realizations along the x-axis. How far apart are the features?
- $ \nu $ (roughness) — controls the sharpness vs. smoothness of ridges in the covariance function. Higher ν = smoother functions.
A single GP kernel can be a combination of multiple kernels. A solid starting recipe: a Matern component for the signal structure, a ConstantKernel for amplitude scaling, and a WhiteKernel for observation noise:
kernel = ConstantKernel()
+ Matern(length_scale=2, nu=3/2)
+ WhiteKernel(noise_level=1)
Create the regressor and fit:
gp = gaussian_process.GaussianProcessRegressor(kernel=kernel) gp.fit(X, y)
Or do it all in one shot with explicit optimizer settings:
GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
kernel=1**2 + Matern(length_scale=2, nu=1.5) + WhiteKernel(noise_level=1),
n_restarts_optimizer=0, normalize_y=False,
optimizer='fmin_l_bfgs_b', random_state=None)
A few things worth knowing about these defaults:
- L-BFGS-B is the optimizer — it's a quasi-Newton method that handles box constraints, which makes sense for hyperparameters that need to stay positive.
- normalize_y is False by default — your output variable stays raw.
- n_restarts_optimizer is your defense against getting stuck in a local maximum. Each restart runs the optimizer from a different starting point. Crank this up if your fit looks sus.
- Parameters learned during
fit()get an underscore appended to their names (the sklearn convention).
Fit vs. Predict:
The fit() method runs the optimization to nail down the hyperparameters. The predict() method generates $ y^{\star} $ for new inputs $ X^{\star} $ using the posterior predictive distribution — this is the GP's way of saying "based on what I've seen, here's my best guess over here, and here's how sure I am":
$ p(y^{\star} \vert y, x, x^{\star}) = GP( m^{\star}(x^{\star}), k^{\star}(x^{\star}) ) $
In code, predicting over a range and getting uncertainties back:
x_pred = np.linspace(-5, 5).reshape(-1,1) y_pred, sigma = gp.predict(x_pred, return_std=True)
That return_std=True flag is the thing that makes GPs special — you get your error bars in the same call as your predictions. No extra steps.
GPFlow: When You Need the Big Guns
GPFlow comes out of the Sheffield machine learning group (it's the successor to GPy) and uses TensorFlow as its computational backbone. The killer feature: TensorFlow's automatic differentiation means you can throw arbitrarily complex models at the optimizer without writing gradient code by hand. This lets you fit larger, gnarlier GP models and use techniques like variational inference that would be painful to implement manually.
Start by reshaping your data into tabular format (GPFlow expects this):
theta = [1, 10] sigma_0 = exponential_cov(0, 0, theta) y = [np.random.normal(scale=sigma_0)] # reshape into tabular format Y = y.reshape(-1,1)
(Quick refresher on the exponential_cov function:)
import numpy as np
def exponential_cov(x, y, params):
return params[0] * np.exp(-0.5*params[1]*np.subtract.outer(x, y)**2)
Import GPFlow and set up a Matern kernel with a GPR model:
import GPflow k = GPflow.kernels.Matern32(1, variance=1, lengthscales=1.2) m = GPflow.gpr.GPR(X, Y, kern=k) print(m)
The Matern kernel has multiple tunable parameters. There's also a likelihood variance you can set directly:
m.likelihood.variance = 0.01
Fit the model:
m.optimize()
Make predictions:
m.predict_y()
So far this is maximum likelihood — no priors specified. But GPFlow lets you assign proper Bayesian priors to kernel parameters through GPflow.priors, accessed via m.kern:
m.kern.variance.prior = GPflow.priors.Gamma(1,0.1) m.kern.lengthscales.prior = GPflow.priors.Gamma(1,0.1)
Sometimes you already know a parameter value — like when your measurement instrument has a documented error margin. In that case, pin it down:
m.likelihood.variance = 0.1 m.likelihood.variance.fixed = True
Now the priors are wired in. Re-optimize and check the result:
m.optimize() print(m)
Going Full Bayesian with GPMC:
The GPMC class lets you do Hamiltonian Monte Carlo (HMC) to jointly sample over parameters and functions. HMC uses gradient information to explore the posterior efficiently — and this is where TensorFlow's autodiff really shines:
l = GPflow.likelihoods.StudentT() m = GPflow.gpmc.GPMC(X, Y, kern=k, likelihood=l) m.kern.variance.prior = GPflow.priors.Gamma(1,1) m.kern.lengthscales.prior = GPflow.priors.Gamma(1,1)
Run the sampler:
trace = m.sample(1000, verbose=True, epsilon=0.03, Lmax=15)
Grab parameter samples as a dataframe:
parameter_samples = m.get_samples_df(trace)
Plot the posterior histograms:
for col in parameter_samples.columns.sort_values()[1:]:
parameter_samples[col].hist(label=col.split('.')[-1], alpha=0.4, bins=15)
Generate predictions from the posterior samples:
realizations = []
for sample in trace[-100:]:
m.set_state(sample)
realizations.append(m.predict_f_samples(xx, 1).squeeze())
realizations = np.vstack(realizations)
This HMC approach is clutch when your likelihood function or model structure is weird enough that gradient-ascent optimizers (like what scikit-learn uses) struggle to converge.
PyMC3: Full Bayesian, No Shortcuts
PyMC3 is a probabilistic programming framework that uses Theano as its backend (analogous to GPFlow using TensorFlow). It gives you functions and objects for specifying covariance kernels and prior distributions — but expressed as Theano tensor variables, since the whole model gets compiled into a computation graph.
Import PyMC3 and Theano:
import pymc3 as pm import theano.tensor as tt
Everything lives inside a Model() context. This is where you declare parameters, variables, and functions that define your Bayesian model. (See PyMC3 documentation for the full picture.) Here's a Matern setup:
with pm.Model() as gp_fit:
rho = pm.Gamma('rho', 1, 1)
eta = pm.Gamma('eta', 1, 1)
k = eta * pm.gp.cov.Matern32(1, rho)
Specify a zero mean function and a Half-Cauchy prior for the noise:
with gp_fit:
M = pm.gp.mean.Zero()
sigma = pm.HalfCauchy('sigma', 2.5)
The GP class ties together the mean function, covariance function, and observation noise. Then we hand it the actual data:
with gp_fit:
y_obs = pm.gp.GP('y_obs', mean_func=M, cov_func=K, sigma=sigma, observed={'X':X, 'Y':y})
Fit by calling sample(). PyMC3 defaults to the No U-Turn Sampler (NUTS), which is a sophisticated variant of Hamiltonian Monte Carlo — it uses Theano's autodiff to compute gradients and navigate the posterior efficiently:
with gp_fit:
trace = pm.sample(2000, n_init = 20000)
pm.traceplot( trace[-1000:], varnames=['rho', 'sigma', 'eta'])
Generate predictive samples over a grid:
Z = np.linspace(-6, 6, 100).reshape(-1, 1)
with gp_fit:
gp_samples = pm.gp.sample_gp( trace[1000:], y_obs, Z, samples=50)
This gives you 50 different Monte Carlo realizations of the function — each one a plausible fit to your data. The spread between them is your model uncertainty made visible.
When to reach for PyMC3 over the alternatives: MCMC is expensive for large datasets — the log-probability gets evaluated at every iteration, and that adds up fast. Variational inference methods approximate the posterior with something cheaper at the cost of some accuracy. The tradeoff is often worth it.
- Variational Gaussian Approximation is implemented in GPFlow.
- Automatic Differentiation Variational Inference (ADVI) is implemented in PyMC3.
The field is actively working on making these approximations better and cheaper, so expect this landscape to keep shifting.