Numerics/Interpolation Extrapolation: Difference between revisions

Latest revision as of 06:12, 20 June 2026

Interpolation & Extrapolation is about constructing new data points from a discrete set of known data points. Interpolation fills gaps *within* the known range; extrapolation ventures *outside* it — which is almost always a terrible idea.

The One-Sentence Opinion

Use cubic splines for smooth data, linear interpolation when you're in a hurry or the data is dense, and never trust an extrapolation unless you have a physical model backing it up. Polynomial interpolation of high degree is a trap — Runge's phenomenon will ruin your day.

Interpolation vs. Extrapolation

	Interpolation	Extrapolation
What	Estimate \(y\) for \(x\) inside the convex hull of known points	Estimate \(y\) for \(x\) outside the known range
Reliability	Generally trustworthy with a sensible method	Fragile, unreliable, sometimes catastrophically wrong
Use case	Upsampling, regridding, filling missing data	Forecasting, extending a trend — but you probably shouldn't
Opinion	Go ahead	Don't. Or at least know exactly why you're doing it.

Linear Interpolation

The simplest method: connect the dots with straight lines. Between \((x_i, y_i)\) and \((x_{i+1}, y_{i+1})\):

\[ y(x) = y_i + \frac{(y_{i+1} - y_i)(x - x_i)}{x_{i+1} - x_i} \]

When to use it

Densely sampled data (point spacing much smaller than feature scale)
You don't care about derivative continuity
Quick visualization

When not to

Sparse data with curvature — you'll get visible kinks
You need a smooth derivative
Almost never for extrapolation (the slope at the last segment is arbitrary)

Python

import numpy as np

def linear_interpolate(x_known, y_known, x_query):
    """Linear interpolation. x_known must be sorted."""
    idx = np.searchsorted(x_known, x_query)
    # Clamp to valid range
    idx = np.clip(idx, 1, len(x_known) - 1)
    x_left, x_right = x_known[idx - 1], x_known[idx]
    y_left, y_right = y_known[idx - 1], y_known[idx]
    t = (x_query - x_left) / (x_right - x_left)
    return y_left + t * (y_right - y_left)

Or just use NumPy:

y_query = np.interp(x_query, x_known, y_known)

np.interp is fast, vectorized, and does exactly what you think. It's the right tool about 40% of the time.

Polynomial Interpolation

Fit a single polynomial \(P_n(x)\) of degree \(n-1\) through \(n\) points. The polynomial exists and is unique, but that doesn't mean it's a good idea.

Vandermonde (don't do this)

The obvious approach — solve \(V \mathbf{c} = \mathbf{y}\) where \(V_{ij} = x_i^{\,j}\) — is numerically unstable for anything beyond degree 5 or 6. The Vandermonde matrix is ill-conditioned.

Lagrange Form

\[ P(x) = \sum_{i=0}^{n-1} y_i \prod_{j \neq i} \frac{x - x_j}{x_i - x_j} \]

Mathematically elegant, computationally inefficient (\(O(n^2)\) per evaluation), and prone to massive oscillations between points — Runge's phenomenon.

Runge's Phenomenon

If you interpolate \(f(x) = 1/(1 + 25x^2)\) on \([-1,1]\) with equispaced points, a high-degree polynomial goes berserk near the endpoints. The cure is *not* more points — that makes it worse. The cure is don't use high-degree polynomials.

Python (for when you must)

from numpy.polynomial import Polynomial

p = Polynomial.fit(x_known, y_known, deg=len(x_known)-1)
y_query = p(x_query)

Polynomial.fit uses a shifted-and-scaled basis internally, which helps with conditioning but does nothing to fix Runge's phenomenon.

When to use polynomial interpolation

Degree ≤ 3, and you know what you're doing
You're deriving a numerical method (e.g., Gaussian quadrature nodes)
That's about it

Cubic Splines — The Workhorse

A cubic spline fits piecewise cubic polynomials that match in value, first derivative, and second derivative at each knot. The result is visually smooth, computationally cheap to evaluate, and free from Runge-style oscillations.

Varieties

Natural spline: \(f = 0\) at endpoints. Simple, but can wobble near boundaries.

Clamped / "complete" spline: You specify \(f'\) at endpoints if you know it. Uncommon in practice.

Not-a-knot spline: Third derivative continuous across the second and penultimate knots. Default in MATLAB and scipy.interpolate.CubicSpline. Usually the safe choice.

Monotone spline (PCHIP): Preserves monotonicity of the data. If the data is monotone increasing, the interpolant is too — no spurious overshoots. Less smooth (only \(C^1\)).

Python: Cubic Spline

from scipy.interpolate import CubicSpline
import numpy as np

x = np.array([0., 1., 2., 3., 4.])
y = np.array([0., 2., 1., 3., 2.])

cs = CubicSpline(x, y, bc_type='not-a-knot')
# bc_type='natural' for natural spline
# bc_type='clamped' for derivative-specified

x_dense = np.linspace(0, 4, 200)
y_dense = cs(x_dense)          # evaluate
dy_dense = cs(x_dense, 1)      # first derivative
d2y_dense = cs(x_dense, 2)     # second derivative

Python: PCHIP (monotone)

from scipy.interpolate import PchipInterpolator

pchip = PchipInterpolator(x, y)
y_dense = pchip(x_dense)

When to use splines

Your default choice for smooth 1-D interpolation
Data with curvature that linear can't capture
You want \(C^2\) continuity and visually pleasing results
PCHIP when you can't tolerate overshoots (e.g., physical quantities that must stay positive)

Nearest-Neighbor Interpolation

Trivial: pick the value at the nearest known point. Piecewise constant, \(C^{-1}\) (discontinuous).

from scipy.interpolate import interp1d

nn = interp1d(x, y, kind='nearest')
y_query = nn(x_query)

Useful for categorical data, color lookups, or when you explicitly want to avoid introducing new values.

Radial Basis Functions (RBF)

For scattered (non-gridded) data in multiple dimensions, RBF interpolation fits:

\[ s(\mathbf{x}) = \sum_i w_i \,\phi(\|\mathbf{x} - \mathbf{x}_i\|) \]

where \(\phi\) is a radial basis function (Gaussian, multiquadric, thin-plate spline, etc.).

from scipy.interpolate import RBFInterpolator

rbf = RBFInterpolator(xy_known, z_known, kernel='thin_plate_spline')
z_query = rbf(xy_query)

RBF scales poorly (\(O(n^2)\) memory, \(O(n^3)\) solve). For large scattered datasets, consider local methods or Gaussian process regression.

Extrapolation: Here Be Dragons

Extrapolation assumes the pattern observed in the data continues outside the measured range. This assumption is wildly unjustified in most real-world settings.

Why extrapolation is dangerous

A polynomial that fits your data perfectly can diverge to \(\pm\infty\) immediately outside the range.
A spline extrapolates linearly by default — the slope at the last segment is an artifact of your last few points.
The real world has phase transitions, saturation, and regime changes you haven't observed.

The only defensible extrapolation

You have a physical model (e.g., \(y = A e^{-Bx}\)) and you're fitting parameters, not blindly extending a curve.
You're extrapolating a tiny distance relative to your point spacing and the function is known to be smooth.
You explicitly propagate the uncertainty and it's enormous — which is honest.

Python: enabling extrapolation in SciPy

By default, CubicSpline and interp1d raise ValueError outside bounds (which is the right default). To override:

cs = CubicSpline(x, y, extrapolate=True)   # linear extrapolation
# or
f = interp1d(x, y, kind='cubic', fill_value='extrapolate')

If you find yourself setting extrapolate=True, stop and ask why.

Opinionated Decision Flowchart

Do you have a physical model?
 ├─ Yes → Fit the model parameters. Don't interpolate arbitrarily.
 └─ No  → Are the x-values outside your known range?
           ├─ Yes → STOP. You're extrapolating. Reconsider.
           └─ No  → Is your data dense relative to feature scale?
                     ├─ Yes → np.interp (linear). Done.
                     └─ No  → Do you need C² smoothness?
                               ├─ Yes → CubicSpline (not-a-knot)
                               └─ No  → Is monotonicity critical?
                                         ├─ Yes → PchipInterpolator
                                         └─ No  → CubicSpline anyway; it's fine.

Summary Table

Method	Smoothness	Speed	Numerical Stability	Recommendation
`np.interp` (linear)	\(C^0\)	Fastest	Excellent	Good for dense data
Polynomial (high degree)	\(C^\infty\)	Fast eval, slow fit	Terrible	Avoid
`CubicSpline`	\(C^2\)	Fast	Good	Default choice
`PchipInterpolator`	\(C^1\)	Fast	Good	When overshoots are unacceptable
RBF	\(C^\infty\)	Slow (\(O(n^3)\))	OK for moderate n	Scattered multi-D data
Nearest-neighbor	\(C^{-1}\)	Fastest	Trivial	Categorical data only

@@ Line 7: / Line 7: @@
 == Interpolation vs. Extrapolation ==
-{| class="wikitable"
+{| class="wikitable" border="1"
 ! !! Interpolation !! Extrapolation
 |-
@@ Line 39: / Line 39: @@
 === Python ===
-```python
+<syntaxhighlight lang="python">
 import numpy as np
@@ Line 51: / Line 51: @@
      t = (x_query - x_left) / (x_right - x_left)
      return y_left + t * (y_right - y_left)
-```
+</syntaxhighlight>
 Or just use NumPy:
-```python
+<syntaxhighlight lang="python">
 y_query = np.interp(x_query, x_known, y_known)
-```
+</syntaxhighlight>
 <code>np.interp</code> is fast, vectorized, and does exactly what you think. It's the right tool about 40% of the time.
@@ Line 83: / Line 83: @@
 === Python (for when you must) ===
-```python
+<syntaxhighlight lang="python">
 from numpy.polynomial import Polynomial
 p = Polynomial.fit(x_known, y_known, deg=len(x_known)-1)
 y_query = p(x_query)
-```
+</syntaxhighlight>
 <code>Polynomial.fit</code> uses a shifted-and-scaled basis internally, which helps with conditioning but does nothing to fix Runge's phenomenon.
@@ Line 117: / Line 117: @@
 === Python: Cubic Spline ===
-```python
+<syntaxhighlight lang="python">
 from scipy.interpolate import CubicSpline
 import numpy as np
@@ Line 132: / Line 132: @@
 dy_dense = cs(x_dense, 1)      # first derivative
 d2y_dense = cs(x_dense, 2)     # second derivative
-```
+</syntaxhighlight>
 === Python: PCHIP (monotone) ===
-```python
+<syntaxhighlight lang="python">
 from scipy.interpolate import PchipInterpolator
 pchip = PchipInterpolator(x, y)
 y_dense = pchip(x_dense)
-```
+</syntaxhighlight>
 === When to use splines ===
@@ Line 153: / Line 153: @@
 Trivial: pick the value at the nearest known point. Piecewise constant, \(C^{-1}\) (discontinuous).
-```python
+<syntaxhighlight lang="python">
 from scipy.interpolate import interp1d
 nn = interp1d(x, y, kind='nearest')
 y_query = nn(x_query)
-```
+</syntaxhighlight>
 Useful for categorical data, color lookups, or when you explicitly want to avoid introducing new values.
@@ Line 172: / Line 172: @@
 where \(\phi\) is a radial basis function (Gaussian, multiquadric, thin-plate spline, etc.).
-```python
+<syntaxhighlight lang="python">
 from scipy.interpolate import RBFInterpolator
 rbf = RBFInterpolator(xy_known, z_known, kernel='thin_plate_spline')
 z_query = rbf(xy_query)
-```
+</syntaxhighlight>
 RBF scales poorly (\(O(n^2)\) memory, \(O(n^3)\) solve). For large scattered datasets, consider local methods or Gaussian process regression.
@@ Line 199: / Line 199: @@
 By default, <code>CubicSpline</code> and <code>interp1d</code> raise <code>ValueError</code> outside bounds (which is the right default). To override:
-```python
+<syntaxhighlight lang="python">
 cs = CubicSpline(x, y, extrapolate=True)   # linear extrapolation
 # or
 f = interp1d(x, y, kind='cubic', fill_value='extrapolate')
-```
+</syntaxhighlight>
 If you find yourself setting <code>extrapolate=True</code>, stop and ask why.
@@ Line 209: / Line 209: @@
 == Opinionated Decision Flowchart ==
-```
+<pre>
 Do you have a physical model?
   ├─ Yes → Fit the model parameters. Don't interpolate arbitrarily.
@@ Line 221: / Line 221: @@
                                           ├─ Yes → PchipInterpolator
                                           └─ No  → CubicSpline anyway; it's fine.
-```
+</pre>
 == Summary Table ==
-{| class="wikitable"
+{| class="wikitable" border="1"
 ! Method !! Smoothness !! Speed !! Numerical Stability !! Recommendation
 |-