
GPM = Gaussian Process Modeling

Notes

Link to nice notebook explaining how to implement this in Python: https://blog.dominodatalab.com/fitting-gaussian-process-models-python/

Python Libraries

To do Gaussian Process Modeling, you can use any of several libraries:

  • GPFlow (Gaussian Process Modeling using TensorFlow under the hood)
  • PyMC3 (PyMC implements Markov-chain Monte Carlo methods for GPM)
  • scikit-learn (implements several GPM routines/objects/functions)

These packages principally implement covariance functions that can be used to describe non-linearity in data, and methods for fitting GPM parameters.
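As a rough orientation (package layouts assumed; each library exposes kernel/covariance objects and a fitting routine):

from sklearn.gaussian_process import GaussianProcessRegressor   # scikit-learn GPM estimator
from sklearn.gaussian_process.kernels import RBF                 # one of its covariance functions
import GPflow                                                    # TensorFlow-backed GP models (GPflow.kernels, GPflow.gpr, ...)
import pymc3 as pm                                               # covariance functions live in pm.gp.cov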

Sample Data

To create sample data:

import numpy as np

def exponential_cov(x, y, params):
    # Squared-exponential covariance: params[0] sets the amplitude,
    # params[1] controls the (inverse squared) length scale
    return params[0] * np.exp(-0.5 * params[1] * np.subtract.outer(x, y)**2)

def conditional(x_new, x, y, params):
    # Condition the GP on the observed (x, y) to get the mean and covariance at x_new
    B = exponential_cov(x_new, x, params)
    C = exponential_cov(x, x, params)
    A = exponential_cov(x_new, x_new, params)

    mu = np.linalg.inv(C).dot(B.T).T.dot(y)
    sigma = A - B.dot(np.linalg.inv(C).dot(B.T))

    return (mu.squeeze(), sigma.squeeze())
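The conditional function implements the standard formulas for conditioning a multivariate normal: with covariance blocks $A = k(x_{new}, x_{new})$, $B = k(x_{new}, x)$, and $C = k(x, x)$, the distribution of the outputs at $x_{new}$ given the observed $y$ is

$$\mu = B C^{-1} y, \qquad \Sigma = A - B C^{-1} B^{T}$$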

We'll create a Gaussian process prior with hyperparameters theta0 = 1 and theta1 = 10, along with a zero mean:

import matplotlib.pyplot as plt

theta = [1, 10]
sigma_0 = exponential_cov(0, 0, theta)
xpts = np.arange(-3, 3, step=0.01)
plt.errorbar(xpts, np.zeros(len(xpts)), yerr=sigma_0, capsize=0)

Now, pick an arbitrary starting point:

x = [1.]
y = [np.random.normal(scale=sigma_0)]

Now update the confidence band to account for the new sample point by using the covariance function to generate point-wise intervals:

sigma_1 = exponential_cov(x, x, theta)

def predict(x, data, kernel, params, sigma, t):
    # Posterior predictive mean and variance at a single new point x,
    # given observed inputs `data`, observed outputs `t`, and their covariance matrix `sigma`
    k = [kernel(x, y, params) for y in data]
    Sinv = np.linalg.inv(sigma)
    y_pred = np.dot(k, Sinv).dot(t)
    sigma_new = kernel(x, x, params) - np.dot(k, Sinv).dot(k)
    return y_pred, sigma_new

x_pred = np.linspace(-3, 3, 1000)
predictions = [predict(i, x, exponential_cov, theta, sigma_1, y) for i in x_pred]

To plot:

y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")

Another point can be added as well:

m, s = conditional([-0.7], x, y, theta)
y2 = np.random.normal(m, s)

To update the covariance to account for the new point:

x.append(-0.7)
y.append(y2)

sigma_2 = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_2, y) for i in x_pred]

To sample multiple points at once:

x_more = [-2.1, -1.5, 0.3, 1.8, 2.5]
mu, s = conditional(x_more, x, y, theta)
y_more = np.random.multivariate_normal(mu, s)

Now the covariance can be updated again:

x += x_more
y += y_more.tolist()

sigma_new = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_new, y) for i in x_pred]

y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")

Scikit-Learn

scikit-learn has several relevant functions/implementations of GPMs appropriate for different applications:

  • GaussianProcessRegressor - create this by specifying an appropriate covariance function (kernel). Fitting is accomplished by maximizing the log marginal likelihood, which avoids a computationally intensive cross-validation strategy. It does not allow specification of a mean function (the mean is assumed to be zero), because the mean is not as important when computing the posterior.
  • GaussianProcessClassifier - useful for classification tasks and fitting categorical/binary data. It uses a latent Gaussian response variable and transforms it to the unit interval (or simplex), giving a soft probabilistic classification rather than a hard classification prediction. The user provides a kernel to describe the covariance in the data set. The posterior is non-normal, so a Laplace approximation is used in place of maximizing the marginal likelihood. (A minimal classification sketch follows below.)
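A minimal classification sketch (the toy data and RBF kernel here are illustrative placeholders, not from the original notes):

from sklearn.datasets import make_moons
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

X_c, y_c = make_moons(noise=0.3, random_state=0)      # toy binary classification data
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(1.0))
gpc.fit(X_c, y_c)
print(gpc.predict_proba(X_c[:5]))                      # soft (probabilistic) class predictions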

Kernels:

  • from sklearn import gaussian_process will import code/functions related to Gaussian process modeling
  • from sklearn.gaussian_process.kernels import Matern will import one of about a dozen GPM kernels
  • Matern covariance is a good, flexible first choice (its standard form is sketched after this list):

  • σ² is the amplitude, a scalar multiplier that controls the y-axis scaling
  • ℓ is the length scale, which scales realizations along the x-axis
  • ν is the roughness, which controls the sharpness/smoothing of ridges in the covariance function
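For reference, the standard Matern form (with d the distance between inputs and K_ν a modified Bessel function of the second kind) is roughly:

$$k_{M}(d) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, d}{\ell}\right)^{\nu} K_{\nu}\!\left(\frac{\sqrt{2\nu}\, d}{\ell}\right)$$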

Example usage:

  • A single GPM kernel can be a combination of multiple kernels
  • Combine a Matern component (Matern) with an amplitude factor (ConstantKernel) and an observation noise kernel (WhiteKernel):

from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

kernel = ConstantKernel() + Matern(length_scale=2, nu=3/2) + WhiteKernel(noise_level=1)

Then create a GaussianProcessRegressor object:

gp = gaussian_process.GaussianProcessRegressor(kernel=kernel)
gp.fit(X, y)

To do this all in a single call, while also setting the optimizer:

GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
    kernel=1**2 + Matern(length_scale=2, nu=1.5) + WhiteKernel(noise_level=1),
    n_restarts_optimizer=0, normalize_y=False,
    optimizer='fmin_l_bfgs_b', random_state=None)

Some of these are default values that apply unless explicitly overridden:

  • The L-BFGS-B algorithm is used to optimize the hyperparameters
  • The output variable is not normalized
  • n_restarts_optimizer helps prevent getting stuck in a local (rather than global) maximum; it restarts the optimization algorithm the specified number of times
  • Attributes estimated by the fit method have an underscore appended to their names (see the sketch below)
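For instance (assuming gp has already been fit as above), the optimized kernel and the corresponding objective value are exposed as trailing-underscore attributes:

print(gp.kernel_)                           # kernel with hyperparameters chosen by the optimizer
print(gp.log_marginal_likelihood_value_)    # log marginal likelihood at those hyperparameters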

Fit vs. Predict:

  • The fit method runs an optimization procedure to fit the parameters
  • The predict method generates predicted outcomes given a new set of predictors
  • The predictions come from the posterior predictive distribution: the Gaussian process mean and covariance, updated to their posterior forms.

To do this with scikit-learn, using the interval from -5 to 5 as the input:

x_pred = np.linspace(-5, 5).reshape(-1,1)
y_pred, sigma = gp.predict(x_pred, return_std=True)
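To visualize the posterior predictive band (a plotting sketch in the same style as the earlier errorbar plots; X and y are the training data fit above):

plt.errorbar(x_pred.ravel(), y_pred, yerr=sigma, capsize=0)
plt.plot(X, y, "ro")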

GPFlow

GPFlow is derived from GPy, a Gaussian process fitting package from the Sheffield machine learning group. Both provide a set of classes for creating Gaussian process models, along with a library of mean and covariance functions. Under the hood, GPFlow uses the TensorFlow library, which allows it to fit larger and more complex Gaussian process models and to support more complicated methods such as variational inference.

The API requires a tabulated input for predictors (features) and outputs.

To reshape y to a tabular format:

theta = [1, 10]
sigma_0 = exponential_cov(0, 0, theta)
y = [np.random.normal(scale=sigma_0)]

# reshape into tabular (column-vector) format; y is a Python list, so convert it to an array first
Y = np.array(y).reshape(-1, 1)
X = np.array(x).reshape(-1, 1)   # the inputs x (from the earlier sample) are reshaped the same way

(a reminder, the exponential_cov function is defined as:)

import numpy as np
def exponential_cov(x, y, params):
    return params[0] * np.exp(-0.5*params[1]*np.subtract.outer(x, y)**2)

Now import the GPFlow module:

import GPflow

GPFlow provides six GP model classes, chosen according to the covariance structure and the model likelihood (Gaussian versus non-normal).

We can fit the covariance and model using Markov Chain Monte Carlo (MCMC) or approximate it via variational inference.

Here is the same type of covariance function we used with scikit-learn, the Matern covariance:

k = GPflow.kernels.Matern32(1, variance=1, lengthscales=1.2)
m = GPflow.gpr.GPR(X, Y, kern=k)
print(m)

The Matern kernel function has multiple parameters. There is also a likelihood variance parameter that can be set directly.

m.likelihood.variance = 0.01

Now, to fit the model, call optimize():

m.optimize()

To make predictions at new input locations, pass a grid of points to predict_y(), which returns the posterior mean and variance at those points:

x_pred = np.linspace(-5, 5, 100).reshape(-1, 1)   # prediction grid (example values)
y_mean, y_var = m.predict_y(x_pred)
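A quick plotting sketch of the GPflow fit (variable names taken from the snippet above; the two-standard-deviation band is just an illustrative choice):

plt.plot(X, Y, "ro")                       # training data
plt.plot(x_pred, y_mean, "b-")             # posterior predictive mean
plt.fill_between(x_pred.ravel(),
                 (y_mean - 2*np.sqrt(y_var)).ravel(),
                 (y_mean + 2*np.sqrt(y_var)).ravel(),
                 alpha=0.3)                # approximate 95% band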

Note that no priors have been specified, so this is just maximum likelihood estimation. If you do want to assign priors to the kernel hyperparameters, they can be created using GPflow.priors and attached via m.kern:

m.kern.variance.prior = GPflow.priors.Gamma(1,0.1)
m.kern.lengthscales.prior = GPflow.priors.Gamma(1,0.1)

The hyperparameters can be assigned prior distributions, but we can also specify constant values. This is what we want to do if, for example, the measurement error of a data-collecting instrument is already known ahead of time. The error value is therefore constant and can be specified.

m.likelihood.variance = 0.1
m.likelihood.variance.fixed = True

With the priors assigned, the optimization objective now includes the (log-)prior probabilities of the hyperparameters, so fitting becomes a MAP estimate rather than pure maximum likelihood.

Once we've set the hyperparameter values, we can re-optimize the model:

m.optimize()
print(m)

The GPMC class allows a full Bayesian Monte Carlo analysis, jointly sampling over the parameters and functions. To do this, we must specify a likelihood as well as priors for the kernel parameters. Sampling uses Hamiltonian Monte Carlo (HMC), an efficient Markov chain Monte Carlo method that uses gradient information to improve posterior sampling; this takes advantage of TensorFlow's ability to compute gradients for arbitrary models.

l = GPflow.likelihoods.StudentT()
m = GPflow.gpmc.GPMC(X, Y, kern=k, likelihood=l)
m.kern.variance.prior = GPflow.priors.Gamma(1,1)
m.kern.lengthscales.prior = GPflow.priors.Gamma(1,1)

Calling sample runs HMC and returns a trace containing sampled values of the kernel parameters:

trace = m.sample(1000, verbose=True, epsilon=0.03, Lmax=15)


These can be obtained as a dataframe:

parameter_samples = m.get_samples_df(trace)

Plotting the histogram:

for col in parameter_samples.columns.sort_values()[1:]:
    parameter_samples[col].hist(label=col.split('.')[-1], alpha=0.4, bins=15)

Now, predictions can be computed using the model:

# xx is a grid of prediction points, e.g. np.linspace(-5, 5, 100).reshape(-1, 1)
realizations = []
for sample in trace[-100:]:
    m.set_state(sample)                                        # load one posterior sample of the parameters
    realizations.append(m.predict_f_samples(xx, 1).squeeze())  # draw a function realization on the grid xx
realizations = np.vstack(realizations)
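To see what those posterior draws look like (a plotting sketch; xx is the prediction grid noted above):

for r in realizations:
    plt.plot(xx, r, color="SteelBlue", alpha=0.2)   # one posterior function draw per line
plt.plot(X, Y, "ro")                                # observed data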

This is useful when the likelihood function or model is unusual or difficult to fit using the gradient ascent optimization methods built into scikit-learn.

PyMC3

PyMC3 is a Python module for probabilistic programming, used for fitting Bayesian models to data. PyMC3 uses a Theano backend (analogous to GPflow's use of TensorFlow). It provides functions and objects for specifying covariance functions and prior distributions. As with GPflow, these must be specified as tensor variables.

To import it, we import PyMC3 and Theano:

import pymc3 as pm
import theano.tensor as tt

A Model() context is needed to declare parameters, variables, and functions within a Bayesian model (see the PyMC3 documentation). For example, we'll construct a Matern covariance again:

with pm.Model() as gp_fit:
    rho = pm.Gamma('rho', 1, 1)
    eta = pm.Gamma('eta', 1, 1)
    k = eta * pm.gp.cov.Matern32(1, rho)

Next, the mean function is specified (this will be zero if none is specified), and a half-Cauchy distribution is specified for a noise variable:

with gp_fit:
    M = pm.gp.mean.Zero()
    sigma = pm.HalfCauchy('sigma', 2.5)

PyMC implements Gaussian process models using the GP class. This requires a mean function, covariance function, and observation error function. These have been specified above.

The outcomes of the GP have been observed: the data are contained in the X and y variables. We provide these to the GP object that we create:

with gp_fit:
    y_obs = pm.gp.GP('y_obs', mean_func=M, cov_func=k, sigma=sigma, observed={'X':X, 'Y':y})

To fit the model, we call the sample() function from within the Model context. By default, PyMC uses the No U-turn Sampler (NUTS), which is a version of Hamiltonian Monte Carlo (which, in turn, is a form of Markov chain Monte Carlo). Like GPflow, this uses Theano's internals to compute gradients for arbitrary models.

with gp_fit:
    trace = pm.sample(2000, n_init = 20000)

pm.traceplot( trace[-1000:], varnames=['rho', 'sigma', 'eta'])

PyMC allows predictive sampling using the fit model. The sample_gp function implements sampling over a grid of points:

Z = np.linspace(-6, 6, 100).reshape(-1, 1)

with gp_fit:
    gp_samples = pm.gp.sample_gp( trace[1000:], y_obs, Z, samples=50)

This will generate 50 different random, Monte Carlo models that fit the data points.
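A plotting sketch for those draws (assuming sample_gp returns an iterable of sampled function values over the grid Z, one entry per sample):

for s in gp_samples:
    plt.plot(Z, s, color="SteelBlue", alpha=0.2)   # one sampled GP realization
plt.plot(X, y, "ro")                               # observed data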

Why use variational inference? MCMC is costly for very large data sets, because the log-probability must be evaluated at each iteration. Variational inference methods instead approximate the posterior with a cheaper, simpler distribution. While this makes the result less accurate, it is appropriate for some problems.

Future work/methods will focus on improving the quality and lowering the cost of the approximation.

Variational Gaussian Approximation is implemented in GPflow.
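For example, with the (older) GPflow API used above, a variational treatment of a non-Gaussian likelihood might look roughly like this (the VGP class and argument names are assumed from that API generation):

l = GPflow.likelihoods.StudentT()
m_vgp = GPflow.vgp.VGP(X, Y, kern=k, likelihood=l)   # variational Gaussian approximation to the posterior
m_vgp.optimize()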

Automatic Differentiation Variational Inference is implemented in PyMC.
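A rough sketch in PyMC3 (pm.fit with method='advi' exists in more recent PyMC3 releases; whether it supports the GP model above depends on the version):

with gp_fit:
    approx = pm.fit(n=20000, method='advi')   # fit a variational approximation to the posterior
    vi_trace = approx.sample(1000)            # draw samples from that approximation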
