From charlesreid1

(Create opinionated, Python-centric random numbers page (via create-page on MediaWiki MCP Server))
 
(replace all emdashes with regular dashes (via update-page on MediaWiki MCP Server))
 
(5 intermediate revisions by the same user not shown)
Line 1: Line 1:
'''DEVELOPERS! DEVELOPERS! DEVELOPERS!''' and the random numbers they need.
'''Random numbers''' are one of those things that sound dead simple - "just give me a number I can't predict, right?" - but then the rabbit hole opens up and suddenly you're comparing Mersenne Twister state spaces and wondering if your Monte Carlo simulation is actually converged.  


If you are reaching for C++ to generate random numbers, '''STOP.''' Put down that `<random>` header. Step away from the Mersenne Twister boilerplate. You are about to write 15 lines of ceremonial incantation just to get a number between 1 and 6. Python gives you that in '''one line'''. And it will be correct. And it will be readable. And you will not have to explain to your cryptographer friend why you seeded with `time(NULL)` (spoiler: that is how you get owned).
This page is the Python-centric tour through random number generation: what actually works, what's secretly broken, and how to wield these tools without accidentally turning your simulation into a deterministic paperweight. If you stumbled here looking for C++ <code>rand()</code> quirks, check out the sister page '''[[Random Numbers]]''' - but for numerical work, you want to be here.


Python's random-number story is the best in the business for numerical work. It ships with three battle-tested modules — `random`, `secrets`, and `numpy.random` — and each one knows its job. This page covers all three, calls out the footguns, and shows you the one obvious way to do it.
== The Two Kinds of Random ==


== The Holy Trinity of Python Randomness ==
There are really only two flavors of random numbers in the universe, and knowing which one you're dealing with changes everything:


{| class="wikitable" style="width:100%"
{| class="wikitable" border="1"
|-
|-
! Module !! Use Case !! Speed !! Cryptographic?
! Type !! Where it comes from !! Use it for...
|-
|-
| `random` || Simulations, games, shuffling, sampling || Fast (Mersenne Twister) || '''NO — never for security'''
| '''True random''' || Physical entropy - radioactive decay, thermal noise, keyboard mash-timing, hardware RNGs, <code>/dev/random</code> || Cryptography, key generation, anything where predictability = disaster
|-
|-
| `secrets` || Passwords, tokens, session keys, auth || Slower (OS entropy) || '''YES — designed for it'''
| '''Pseudorandom (PRNG)''' || A deterministic algorithm starting from a seed || Simulations, Monte Carlo, randomized algorithms, games, reproducible science
|-
| `numpy.random` || Massive arrays, Monte Carlo, stats || Vectorized, GPU-ready || '''NO — never for security'''
|}
|}


Memorize this table. Getting it wrong is how you ship a broken cryptosystem or a simulation that takes 100× too long.
True random is beautiful but slow and you can't replay it. Pseudorandom is fast, reproducible, and - if the algorithm is good - statistically indistinguishable from the real thing for most purposes. The rest of this page is about PRNGs, because that's what 99% of numerical work needs.


== `random` — Your Daily Driver ==
== Pseudorandom Number Generators: Controlled Chaos ==


The `random` module uses the Mersenne Twister (MT19937): period of <math>2^{19937} - 1</math>, passes Diehard and TestU01 (mostly), and ships with CPython. It is not cryptographically secure. Do not use it for secrets. '''Are we clear? Good.'''
The basic idea behind a PRNG is almost disappointingly simple: you take a number (the ''state''), scramble it through some mathematical function, output part of the result as your "random" number, and keep the rest as the new state. Lather, rinse, repeat.


=== The One-Liners You Will Use Every Day ===
=== The OG: Linear Congruential Generators ===


<syntaxhighlight lang="python">
<math>X_{n+1} = (aX_n + c) \bmod m</math>
import random
 
This is the LCG - the Honda Civic of random number generators. It's everywhere (including C's <code>rand()</code>), it's fast, and it's... fine, for very loose definitions of "fine." LCGs have lousy statistical properties in low dimensions - if you plot consecutive pairs <math>(X_n, X_{n+1})</math> as points in 2D, they fall on a small number of parallel planes. This is the Marsaglia effect and it's kind of horrifying once you see it.


# Integer in [a, b] — inclusive on BOTH ends
=== The Workhorse: Mersenne Twister (MT19937) ===
die_roll = random.randint(1, 6)


# Integer in [a, b) — exclusive upper, like range()
Python's <code>random</code> module uses the Mersenne Twister under the hood - specifically MT19937, with a period of <math>2^{19937} - 1</math> (yes, that's a Mersenne prime, hence the name). The period is so absurdly long that you will never, ever exhaust it. The state is 624 32-bit integers, or about 2.5 KB. It passes most statistical test suites with flying colors.
idx = random.randrange(0, len(my_list))


# Float in [0.0, 1.0)
It is ''not'' cryptographically secure. If an attacker can observe 624 consecutive outputs, they can reconstruct the entire internal state and predict every future number. For science and simulation this is irrelevant - for crypto it's a dealbreaker. Keep reading.
u = random.random()


# Float in [a, b]
== Python's <code>random</code> Module: Batteries Included ==
f = random.uniform(2.5, 7.5)


# Pick one
The <code>random</code> module is your everyday driver for random numbers in Python. It wraps MT19937 and gives you a clean, Pythonic API.
winner = random.choice(["Alice", "Bob", "Charlie", "Dana"])


# Pick k WITHOUT replacement (no duplicates)
=== The Core Functions You'll Actually Use ===
sample = random.sample(population, k=10)


# Shuffle IN PLACE — Fisher-Yates under the hood
{| class="wikitable" border="1"
random.shuffle(deck)
|-
</syntaxhighlight>
! Function !! What it gives you
|-
| <code>random.random()</code> || Float in [0.0, 1.0). This is the one you'll call 90% of the time.
|-
| <code>random.randint(a, b)</code> || Random integer in [a, b] - inclusive on both ends. Yes, <code>randint(0, 10)</code> can give you 10.
|-
| <code>random.randrange(start, stop, step)</code> || Like <code>range()</code> but returns one random element. <code>randrange(0, 10)</code> gives 0-9.
|-
| <code>random.choice(seq)</code> || Grab a random element from a sequence. O(1) for lists.
|-
| <code>random.shuffle(seq)</code> || Fisher-Yates in-place shuffle. Kind of elegant.
|-
| <code>random.sample(population, k)</code> || <code>k</code> unique random elements without replacement. Reservoir sampling when needed.
|-
| <code>random.gauss(mu, sigma)</code> || Gaussian (normal) distribution. Slightly faster than <code>normalvariate</code> but uses a different algorithm.
|-
| <code>random.uniform(a, b)</code> || Float in [a, b]. Also <code>random.triangular()</code>, <code>random.betavariate()</code>, <code>random.expovariate()</code>, etc.
|}


=== Seeds: Control Your Chaos ===
=== Seeding: The Ritual ===


<syntaxhighlight lang="python">
<syntaxhighlight lang="python">
random.seed(42)         # Deterministic run — great for tests
import random
random.seed()           # System time + os.urandom fallback
random.seed(42)
random.seed("hello world", version=2)  # Hash a string into a seed
print(random.random()# Always 0.6394267984578837 on CPython 3.x
</syntaxhighlight>
</syntaxhighlight>


'''OPINION: Always seed explicitly in scientific code.''' Reproducibility is not optional. If your results cannot be regenerated from a known seed, you do not have results — you have an anecdote. Save the seed alongside your output. Future you will weep with gratitude.
Call <code>seed()</code> once at the start of your program if you want reproducible runs. If you don't seed, Python seeds from <code>os.urandom()</code> on modern versions, which is fine - but your results won't be reproducible between runs.


=== Distributions: Beyond Uniform ===
'''One seed to rule them all:''' if you import other modules that also use <code>random</code>, seeding the global instance affects them too. This is either a feature or a deeply confusing bug depending on your perspective.


<syntaxhighlight lang="python">
== NumPy's Random: When the Stakes Get Higher ==
# Gaussian (mu=0, sigma=1)
random.gauss(0, 1)          # Slightly faster
random.normalvariate(0, 1)  # Thread-safe


# Exponentially distributed, lambda=1.5
The <code>random</code> module is fine for one-at-a-time random numbers, but numerical work often needs ''millions'' of them, fast. NumPy's random subsystem does exactly that - vectorized, tight C loops, and way more distributions than the stdlib.
random.expovariate(1.5)


# Gamma, Beta, von Mises, Pareto, Weibull...
=== The API Schism: Old Way vs. New Way ===
random.gammavariate(alpha=2.0, beta=3.0)
random.betavariate(alpha=0.5, beta=0.5)
random.paretovariate(alpha=1.5)
random.weibullvariate(alpha=1.5, beta=2.0)
</syntaxhighlight>


=== The Trap: `SystemRandom` (Just Use `secrets` Instead) ===
NumPy has been through a random-number glow-up. Here's the deal:


<syntaxhighlight lang="python">
{| class="wikitable" border="1"
# Exists, works, but WHY ARE YOU DOING THIS?
|-
sr = random.SystemRandom()
! Approach !! API !! Verdict
token = sr.randrange(2**256)
|-
</syntaxhighlight>
| '''Old school''' || <code>numpy.random.rand()</code>, <code>numpy.random.randn()</code>, <code>numpy.random.randint()</code> || Still works, still everywhere in legacy code. Uses a global <code>RandomState</code> singleton. Fine for quick scripts, annoying for reproducibility.
|-
| '''The new hotness''' || <code>numpy.random.default_rng()</code> → <code>Generator</code> object || Way better. Uses PCG64 by default (better statistical properties than MT19937). Explicit state, explicit seeding. No global state leakage.
|}


`SystemRandom` wraps `/dev/urandom` through the `random.Random` API. Fine, it works. But Python 3.6 gave us `secrets`, which has a purpose-built API for exactly this. Use the right tool.
The new way is the recommended path and it's just cleaner:
 
== `secrets` — When Random Isn't Just Random ==
 
If the outcome affects '''money, privacy, authentication, or user safety''', you are in `secrets` territory. This module pulls entropy directly from the operating system's CSPRNG (`/dev/urandom` on Linux, `CryptGenRandom` on Windows) and the API is deliberately narrow — you cannot accidentally use it for Monte Carlo.


<syntaxhighlight lang="python">
<syntaxhighlight lang="python">
import secrets
import numpy as np
 
rng = np.random.default_rng(seed=42)
# Cryptographically random integer in [0, n)
x = rng.random(1_000_000)       # A million uniform floats, just like that
secrets.randbelow(2**256)
y = rng.normal(0, 1, size=1000) # A thousand standard normals
 
z = rng.integers(0, 100, size=500) # 500 ints in [0, 100)
# Random integer with k random bits
secrets.randbits(256)
 
# Token — URL-safe Base64, nbytes → ceil(nbytes * 4/3) characters
secrets.token_hex(32)       # 64 hex chars
secrets.token_urlsafe(32)   # ~43 URL-safe chars
 
# Pick one — constant-time-ish choice
secrets.choice(["primary", "secondary", "fallback"])
</syntaxhighlight>
</syntaxhighlight>


'''OPINION: If you type `random` when you should have typed `secrets`, you have introduced a vulnerability.''' Full stop. The Mersenne Twister is predictable after 624 consecutive outputs. That is not a theoretical attack — it is a Saturday-afternoon script. Use `secrets` for anything adversarial.
The <code>Generator</code> API gives you basically every distribution known to statistics - uniform, normal, exponential, gamma, beta, binomial, Poisson, chi-square, F, t, Laplace, logistic, lognormal, multinomial, multivariate normal, Dirichlet, Wishart... the list goes on. It's a statistical candy store.


== `numpy.random` — The Heavy Artillery ==
=== Why PCG64 Over MT19937? ===


When you need '''millions''' of random numbers, Python's `random` module becomes a bottleneck. Each call crosses the Python/C boundary. NumPy's `numpy.random` generates entire arrays in C, vectorized, and the new API (1.17+) uses the superior PCG-64 generator by default.
PCG64 is a permuted congruential generator. It has better statistical properties than MT19937, a smaller state (128 bits - just two integers), and supports ''multiple independent streams'' via <code>SeedSequence</code>. It's also faster. The only real win MT19937 has is that the period is absurdly large, but PCG64's period of <math>2^{128}</math> is still "you'll never exhaust it in a billion lifetimes" territory.


=== The New API (Use This, Not the Old One) ===
== A Tour Through the Distributions ==


<syntaxhighlight lang="python">
Half the power of random numbers is in the transformations. You start with uniform [0,1) and then shape it into whatever distribution your model needs. Here are the greatest hits:
import numpy as np


# Create a generator — PCG64 is the new default
=== The Inverse Transform Method ===
rng = np.random.default_rng(seed=2024)


# Uniform [0, 1) — 10 million floats in < 100 ms
This is one of those ideas that's so elegant it hurts. If you have a CDF <math>F(x)</math> with inverse <math>F^{-1}</math>, and you generate <math>U \sim \text{Uniform}(0,1)</math>, then <math>X = F^{-1}(U)</math> follows the distribution with CDF <math>F</math>.
u = rng.random(10_000_000)


# Integers
That's it. That's the whole trick. It's why the exponential distribution is <math>-\lambda \ln(U)</math> - the CDF inversion just works out that way. Some distributions (normal) don't have a closed-form inverse CDF, so fancier methods like Box-Muller or Ziggurat step in. But the inverse transform is the conceptual backbone.
dice = rng.integers(1, 7, size=1000, endpoint=True)  # d6 × 1000


# Normal, vectorized
=== NumPy's Distribution Buffet ===
z = rng.standard_normal((1000, 1000))  # 1M standard normals


# Shuffle along axis
<syntaxhighlight lang="python">
rng.shuffle(arr, axis=0)
rng = np.random.default_rng()


# Choice with replacement and probabilities
rng.uniform(0, 1, size=1000)      # The building block
rng.choice(["H", "T"], size=1000, p=[0.5, 0.5])
rng.normal(mu, sigma, size=1000)    # The bell curve
rng.exponential(scale, size=1000)  # Waiting times
rng.poisson(lam, size=1000)        # Count data
rng.binomial(n, p, size=1000)      # k successes in n trials
rng.gamma(shape, scale, size=1000)  # Waiting time for k events
rng.beta(a, b, size=1000)          # Proportions, Bayesian priors
rng.chisquare(df, size=1000)        # Sum of squared normals
rng.choice([1,2,3,4,5], size=100)  # Discrete sampling with/without replacement
rng.permutation(arr)               # Shuffle in place
</syntaxhighlight>
</syntaxhighlight>


=== The Old API (Detect It, Then Kill It) ===
== Seeds and Reproducibility: Control Your Chaos ==


<syntaxhighlight lang="python">
Reproducibility is the quiet superpower of PRNGs. Set the same seed, get the same sequence - every single time, on any machine, any OS. This makes debugging stochastic code actually possible and lets other people replicate your results exactly.
# LEGACY: Global state, unpredictable seeding, slower PCG
np.random.seed(42)
np.random.rand(100)      # Uniform
np.random.randn(100)      # Normal
np.random.randint(0, 10, 100)


# EVERY call after np.random.seed() is a footgun in threaded code.
=== The Golden Rules ===
# Use np.random.default_rng() instead.
</syntaxhighlight>


'''OPINION: `np.random.seed()` is a code smell in 2024.''' The global RandomState is shared across all threads, libraries, and modules that import NumPy. If any of them call `np.random.seed()` or draw from the global state, your "deterministic" run is silently corrupted. The `Generator` API (`default_rng`) gives each component its own isolated stream. Use it.
# '''Seed once, at the top.''' Never inside a loop. You already knew this, but everyone does it at least once.
 
# '''Use <code>SeedSequence</code> for parallel work.''' If you're running 8 MPI processes or 32 <code>joblib</code> workers, you don't want them all using the same stream. <code>SeedSequence</code> derives independent seeds from one parent seed - it's basically entropy-aware seed spawning and it's kind of genius.
=== Generators: Pick Your Poison ===
# '''Record your seed.''' Stick it in a config file, a log line, or a comment. Future you will thank present you.


<syntaxhighlight lang="python">
<syntaxhighlight lang="python">
from numpy.random import PCG64, Philox, SFC64, MT19937
from numpy.random import SeedSequence, default_rng


# PCG64 — default, excellent all-rounder, tiny state (128 bits)
ss = SeedSequence(12345)
rng = np.random.Generator(PCG64(seed=42))
child_seeds = ss.spawn(4) # 4 independent streams
 
rngs = [default_rng(s) for s in child_seeds]
# Philox — counter-based, parallel- and GPU-friendly
rng = np.random.Generator(Philox(seed=42))
 
# SFC64 — fastest, small state, good statistical quality
rng = np.random.Generator(SFC64(seed=42))
 
# MT19937 — the old warhorse. For compatibility only.
rng = np.random.Generator(MT19937(seed=42))
</syntaxhighlight>
</syntaxhighlight>


== The Laws of Random Number Hygiene ==
=== Don't Seed in a Loop ===
 
=== Law 1: Seeds Are Sacred ===
 
Log your seed. Better yet, log the entire RNG state if you checkpoint. If you cannot replay your simulation bit-for-bit, your paper is a PDF full of hopes and feelings.
 
=== Law 2: `random` ≠ `secrets` ===
 
Print this on a sticky note and attach it to your monitor:
 
'''`random` is for dice. `secrets` is for keys.'''
 
If your code generates a session ID, password reset token, or API key with `random`, delete it and start over.
 
=== Law 3: Vectorize or Die ===
 
Generating 10 million random numbers in a Python `for` loop calling `random.random()` is the computational equivalent of eating soup with a fork. Use `numpy.random.Generator.random(10_000_000)`. It will be 50–200× faster and your CPU will not file a grievance.
 
=== Law 4: Never Seed With System Time in a Loop ===
 
Seeding with `time.time()` inside a tight loop (we see this in the wild) produces identical "random" sequences every iteration because the clock hasn't ticked. This is not a subtle bug — it is a disaster with a straight face. If you must reseed quickly, use `secrets.randbits(128)` as the seed.
 
=== Law 5: Beware the Birthday Paradox ===
 
You need only <math>\sqrt{n}</math> samples before collisions appear. For 32-bit random IDs, that is ~77,000. If you are generating IDs with `random.getrandbits(32)`, expect a duplicate by row 65,536. Use 128-bit tokens (`secrets.token_hex(16)`) and the collision probability becomes cosmically negligible.
 
== Common Patterns, Done Right ==
 
=== Pattern: Monte Carlo Integration ===


<syntaxhighlight lang="python">
<syntaxhighlight lang="python">
import numpy as np
# DON'T DO THIS
for i in range(1000):
    random.seed(42)
    x = random.random()  # x is the SAME every iteration
</syntaxhighlight>


def estimate_pi(n: int, rng=None) -> float:
You'd be surprised how often this shows up in actual code. The first random number after seeding with a fixed value is a deterministic function of that seed. Seeding inside a loop with a constant seed just gives you the same "random" number over and over. Use a single seed at the top and let the generator do its thing.
    """Estimate π by throwing darts at a unit square."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.random(n)
    y = rng.random(n)
    inside = np.sum(x**2 + y**2 <= 1.0)
    return 4.0 * inside / n


rng = np.random.default_rng(seed=42)
== When Random Isn't Random Enough: Cryptography ==
print(estimate_pi(10_000_000, rng=rng))  # 3.1415...
</syntaxhighlight>


=== Pattern: Reservoir Sampling (Streaming Data) ===
Sometimes "statistically random" isn't good enough - you need "your adversary, armed with a supercomputer and a copy of your algorithm, cannot predict the next bit." That's the crypto-grade bar.


<syntaxhighlight lang="python">
=== <code>secrets</code> and <code>os.urandom()</code> ===
import random
 
def reservoir_sample(stream, k: int, rng=None):
    """Reservoir-sample k items from a streaming iterable."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
</syntaxhighlight>


=== Pattern: Cryptographic Salt / Token ===
Python 3.6+ ships the <code>secrets</code> module, which wraps <code>os.urandom()</code> (the OS's cryptographically secure RNG - on Linux this comes from <code>/dev/urandom</code> and the kernel's entropy pool):


<syntaxhighlight lang="python">
<syntaxhighlight lang="python">
import secrets
import secrets
 
token = secrets.token_hex(32)    # 64-char hex string
def make_session_id() -> str:
key = secrets.randbits(256)       # 256 random bits
     """256-bit session ID, URL-safe. ~43 characters."""
url_safe = secrets.token_urlsafe() # For session IDs, CSRF tokens, etc.
    return secrets.token_urlsafe(32)
 
def make_api_key() -> str:
    """Hex-encoded 256-bit API key. 64 characters."""
    return secrets.token_hex(32)
</syntaxhighlight>
</syntaxhighlight>


=== Pattern: Train/Test Split (Reproducible) ===
=== Why MT19937 Is NOT Crypto-Safe ===


<syntaxhighlight lang="python">
The Mersenne Twister is a linear recurrence. Observe 624 consecutive 32-bit outputs and you can solve for the entire internal state with linear algebra. After that you can predict every future output and reconstruct every past output. The <code>random</code> module even warns about this in its docs - it's not suitable for security purposes. Use <code>secrets</code> or <code>os.urandom()</code> directly.
import numpy as np


rng = np.random.default_rng(seed=8675309)
== Practical Gotchas: Stuff That Will Bite You ==
indices = rng.permutation(len(data))
split = int(0.8 * len(data))
train_idx, test_idx = indices[:split], indices[split:]
</syntaxhighlight>


== What About `os.urandom`? ==
=== The <code>randint</code> Inclusive Surprise ===


`os.urandom(n)` is the bedrock — it returns ''n'' bytes from the OS CSPRNG. `secrets` is a thin, opinionated wrapper around it. Use `secrets` for structured randomness (tokens, integers, choices). Use `os.urandom` directly only when you need raw bytes or are building your own crypto primitives (and if you are doing that, you already know why you are here).
<code>random.randint(a, b)</code> includes <code>b</code>. <code>numpy.random.randint(a, b)</code> in the new API (<code>rng.integers</code>) does NOT include <code>b</code>. And to make it extra confusing, <code>numpy.random.randint</code> in the OLD API follows the Python convention and DOES include <code>b</code> for some argument patterns. Read the docs for whatever API you're using. This has caused real bugs in real papers.


== What About C++'s `<random>`? ==
=== Slow Generation Patterns ===


Look, I have written C++. I have loved C++. But generating a random integer in modern C++ looks like this:
Generating numbers one at a time in a Python loop is death by a thousand function calls. If you need a million random numbers, ask NumPy for them all at once:


<syntaxhighlight lang="cpp">
<syntaxhighlight lang="python">
#include <random>
# Slow (Python-loop overhead dominates):
#include <iostream>
vals = [random.random() for _ in range(1_000_000)]


int main() {
# Fast (vectorized, C-level):
    std::random_device rd;
vals = rng.random(1_000_000)
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dist(1, 6);
    std::cout << dist(gen) << '\n';
}
</syntaxhighlight>
</syntaxhighlight>


Python:
=== Parallel RNG: Independent Streams or Bust ===


<syntaxhighlight lang="python">
If you spawn 8 parallel processes and each seeds with the current time, you might get collisions. Worse, if you seed them all with the same seed (or with seeds 0,1,2,...7), the streams might be correlated. <code>SeedSequence</code> + PCG64 makes this basically foolproof - use it.
import random
print(random.randint(1, 6))
</syntaxhighlight>


The C++ version is five lines. It pulls in three headers. It instantiates three objects from three different classes. '''For a die roll.''' And every single one of those lines has a sharp edge: `std::random_device` can be deterministic on MinGW. `std::mt19937` produces biased results if you use modulo instead of `uniform_int_distribution`. The boilerplate-to-value ratio is off the charts.
=== The <code>random.random() * N</code> Subtle Bias ===


'''Use Python. Your numerical code will be shorter, correcter, and you will finish before lunch.'''
Multiplying a uniform float by N and flooring to get an integer index can introduce tiny biases due to floating-point representation. For casual use it's fine, but if you're doing something like lottery draw simulations where every bit of uniformity counts, use <code>randint</code> or <code>randrange</code> - they use rejection sampling internally to guarantee uniformity.


== See Also ==
== See Also ==


* [[Numerics]] the full numerical recipes catalog
* '''[[Random Numbers]]''' - The C++ page exploring <code>rand()</code> weirdness (the one that kicked all this off)
* [[Numerics/Monte Carlo]] — Monte Carlo methods done right
* '''[[Numerics/Statistical Descriptions of Data]]''' - Making sense of the numbers once you've generated them
* [[Numerics/Statistical Descriptions of Data]] — descriptive statistics
* [[Numerics/Classification and Inference]] — Bayesian and frequentist inference
* [https://docs.python.org/3/library/random.html Python `random` documentation]
* [https://docs.python.org/3/library/secrets.html Python `secrets` documentation]
* [https://numpy.org/doc/stable/reference/random/index.html NumPy random documentation]
* [https://www.pcg-random.org/ PCG Random — the PCG family explained]


[[Category:Numerics]]
[[Category:Numerics]]
[[Category:Python]]
[[Category:Python]]
[[Category:Random]]
[[Category:Random]]
{{NumericsFlag}}

Latest revision as of 10:58, 24 June 2026

Random numbers are one of those things that sound dead simple - "just give me a number I can't predict, right?" - but then the rabbit hole opens up and suddenly you're comparing Mersenne Twister state spaces and wondering if your Monte Carlo simulation is actually converged.

This page is the Python-centric tour through random number generation: what actually works, what's secretly broken, and how to wield these tools without accidentally turning your simulation into a deterministic paperweight. If you stumbled here looking for C++ rand() quirks, check out the sister page Random Numbers - but for numerical work, you want to be here.

The Two Kinds of Random

There are really only two flavors of random numbers in the universe, and knowing which one you're dealing with changes everything:

Type Where it comes from Use it for...
True random Physical entropy - radioactive decay, thermal noise, keyboard mash-timing, hardware RNGs, /dev/random Cryptography, key generation, anything where predictability = disaster
Pseudorandom (PRNG) A deterministic algorithm starting from a seed Simulations, Monte Carlo, randomized algorithms, games, reproducible science

True random is beautiful but slow and you can't replay it. Pseudorandom is fast, reproducible, and - if the algorithm is good - statistically indistinguishable from the real thing for most purposes. The rest of this page is about PRNGs, because that's what 99% of numerical work needs.

Pseudorandom Number Generators: Controlled Chaos

The basic idea behind a PRNG is almost disappointingly simple: you take a number (the state), scramble it through some mathematical function, output part of the result as your "random" number, and keep the rest as the new state. Lather, rinse, repeat.

The OG: Linear Congruential Generators

$ X_{n+1} = (aX_n + c) \bmod m $

This is the LCG - the Honda Civic of random number generators. It's everywhere (including C's rand()), it's fast, and it's... fine, for very loose definitions of "fine." LCGs have lousy statistical properties in low dimensions - if you plot consecutive pairs $ (X_n, X_{n+1}) $ as points in 2D, they fall on a small number of parallel planes. This is the Marsaglia effect and it's kind of horrifying once you see it.

The Workhorse: Mersenne Twister (MT19937)

Python's random module uses the Mersenne Twister under the hood - specifically MT19937, with a period of $ 2^{19937} - 1 $ (yes, that's a Mersenne prime, hence the name). The period is so absurdly long that you will never, ever exhaust it. The state is 624 32-bit integers, or about 2.5 KB. It passes most statistical test suites with flying colors.

It is not cryptographically secure. If an attacker can observe 624 consecutive outputs, they can reconstruct the entire internal state and predict every future number. For science and simulation this is irrelevant - for crypto it's a dealbreaker. Keep reading.

Python's random Module: Batteries Included

The random module is your everyday driver for random numbers in Python. It wraps MT19937 and gives you a clean, Pythonic API.

The Core Functions You'll Actually Use

Function What it gives you
random.random() Float in [0.0, 1.0). This is the one you'll call 90% of the time.
random.randint(a, b) Random integer in [a, b] - inclusive on both ends. Yes, randint(0, 10) can give you 10.
random.randrange(start, stop, step) Like range() but returns one random element. randrange(0, 10) gives 0-9.
random.choice(seq) Grab a random element from a sequence. O(1) for lists.
random.shuffle(seq) Fisher-Yates in-place shuffle. Kind of elegant.
random.sample(population, k) k unique random elements without replacement. Reservoir sampling when needed.
random.gauss(mu, sigma) Gaussian (normal) distribution. Slightly faster than normalvariate but uses a different algorithm.
random.uniform(a, b) Float in [a, b]. Also random.triangular(), random.betavariate(), random.expovariate(), etc.

Seeding: The Ritual

import random
random.seed(42)
print(random.random())  # Always 0.6394267984578837 on CPython 3.x

Call seed() once at the start of your program if you want reproducible runs. If you don't seed, Python seeds from os.urandom() on modern versions, which is fine - but your results won't be reproducible between runs.

One seed to rule them all: if you import other modules that also use random, seeding the global instance affects them too. This is either a feature or a deeply confusing bug depending on your perspective.

NumPy's Random: When the Stakes Get Higher

The random module is fine for one-at-a-time random numbers, but numerical work often needs millions of them, fast. NumPy's random subsystem does exactly that - vectorized, tight C loops, and way more distributions than the stdlib.

The API Schism: Old Way vs. New Way

NumPy has been through a random-number glow-up. Here's the deal:

Approach API Verdict
Old school numpy.random.rand(), numpy.random.randn(), numpy.random.randint() Still works, still everywhere in legacy code. Uses a global RandomState singleton. Fine for quick scripts, annoying for reproducibility.
The new hotness numpy.random.default_rng()Generator object Way better. Uses PCG64 by default (better statistical properties than MT19937). Explicit state, explicit seeding. No global state leakage.

The new way is the recommended path and it's just cleaner:

import numpy as np
rng = np.random.default_rng(seed=42)
x = rng.random(1_000_000)        # A million uniform floats, just like that
y = rng.normal(0, 1, size=1000)  # A thousand standard normals
z = rng.integers(0, 100, size=500)  # 500 ints in [0, 100)

The Generator API gives you basically every distribution known to statistics - uniform, normal, exponential, gamma, beta, binomial, Poisson, chi-square, F, t, Laplace, logistic, lognormal, multinomial, multivariate normal, Dirichlet, Wishart... the list goes on. It's a statistical candy store.

Why PCG64 Over MT19937?

PCG64 is a permuted congruential generator. It has better statistical properties than MT19937, a smaller state (128 bits - just two integers), and supports multiple independent streams via SeedSequence. It's also faster. The only real win MT19937 has is that the period is absurdly large, but PCG64's period of $ 2^{128} $ is still "you'll never exhaust it in a billion lifetimes" territory.

A Tour Through the Distributions

Half the power of random numbers is in the transformations. You start with uniform [0,1) and then shape it into whatever distribution your model needs. Here are the greatest hits:

The Inverse Transform Method

This is one of those ideas that's so elegant it hurts. If you have a CDF $ F(x) $ with inverse $ F^{-1} $, and you generate $ U \sim \text{Uniform}(0,1) $, then $ X = F^{-1}(U) $ follows the distribution with CDF $ F $.

That's it. That's the whole trick. It's why the exponential distribution is $ -\lambda \ln(U) $ - the CDF inversion just works out that way. Some distributions (normal) don't have a closed-form inverse CDF, so fancier methods like Box-Muller or Ziggurat step in. But the inverse transform is the conceptual backbone.

NumPy's Distribution Buffet

rng = np.random.default_rng()

rng.uniform(0, 1, size=1000)       # The building block
rng.normal(mu, sigma, size=1000)    # The bell curve
rng.exponential(scale, size=1000)   # Waiting times
rng.poisson(lam, size=1000)         # Count data
rng.binomial(n, p, size=1000)       # k successes in n trials
rng.gamma(shape, scale, size=1000)  # Waiting time for k events
rng.beta(a, b, size=1000)           # Proportions, Bayesian priors
rng.chisquare(df, size=1000)        # Sum of squared normals
rng.choice([1,2,3,4,5], size=100)   # Discrete sampling with/without replacement
rng.permutation(arr)                # Shuffle in place

Seeds and Reproducibility: Control Your Chaos

Reproducibility is the quiet superpower of PRNGs. Set the same seed, get the same sequence - every single time, on any machine, any OS. This makes debugging stochastic code actually possible and lets other people replicate your results exactly.

The Golden Rules

  1. Seed once, at the top. Never inside a loop. You already knew this, but everyone does it at least once.
  2. Use SeedSequence for parallel work. If you're running 8 MPI processes or 32 joblib workers, you don't want them all using the same stream. SeedSequence derives independent seeds from one parent seed - it's basically entropy-aware seed spawning and it's kind of genius.
  3. Record your seed. Stick it in a config file, a log line, or a comment. Future you will thank present you.
from numpy.random import SeedSequence, default_rng

ss = SeedSequence(12345)
child_seeds = ss.spawn(4)  # 4 independent streams
rngs = [default_rng(s) for s in child_seeds]

Don't Seed in a Loop

# DON'T DO THIS
for i in range(1000):
    random.seed(42)
    x = random.random()  # x is the SAME every iteration

You'd be surprised how often this shows up in actual code. The first random number after seeding with a fixed value is a deterministic function of that seed. Seeding inside a loop with a constant seed just gives you the same "random" number over and over. Use a single seed at the top and let the generator do its thing.

When Random Isn't Random Enough: Cryptography

Sometimes "statistically random" isn't good enough - you need "your adversary, armed with a supercomputer and a copy of your algorithm, cannot predict the next bit." That's the crypto-grade bar.

secrets and os.urandom()

Python 3.6+ ships the secrets module, which wraps os.urandom() (the OS's cryptographically secure RNG - on Linux this comes from /dev/urandom and the kernel's entropy pool):

import secrets
token = secrets.token_hex(32)     # 64-char hex string
key = secrets.randbits(256)        # 256 random bits
url_safe = secrets.token_urlsafe() # For session IDs, CSRF tokens, etc.

Why MT19937 Is NOT Crypto-Safe

The Mersenne Twister is a linear recurrence. Observe 624 consecutive 32-bit outputs and you can solve for the entire internal state with linear algebra. After that you can predict every future output and reconstruct every past output. The random module even warns about this in its docs - it's not suitable for security purposes. Use secrets or os.urandom() directly.

Practical Gotchas: Stuff That Will Bite You

The randint Inclusive Surprise

random.randint(a, b) includes b. numpy.random.randint(a, b) in the new API (rng.integers) does NOT include b. And to make it extra confusing, numpy.random.randint in the OLD API follows the Python convention and DOES include b for some argument patterns. Read the docs for whatever API you're using. This has caused real bugs in real papers.

Slow Generation Patterns

Generating numbers one at a time in a Python loop is death by a thousand function calls. If you need a million random numbers, ask NumPy for them all at once:

# Slow (Python-loop overhead dominates):
vals = [random.random() for _ in range(1_000_000)]

# Fast (vectorized, C-level):
vals = rng.random(1_000_000)

Parallel RNG: Independent Streams or Bust

If you spawn 8 parallel processes and each seeds with the current time, you might get collisions. Worse, if you seed them all with the same seed (or with seeds 0,1,2,...7), the streams might be correlated. SeedSequence + PCG64 makes this basically foolproof - use it.

The random.random() * N Subtle Bias

Multiplying a uniform float by N and flooring to get an integer index can introduce tiny biases due to floating-point representation. For casual use it's fine, but if you're doing something like lottery draw simulations where every bit of uniformity counts, use randint or randrange - they use rejection sampling internally to guarantee uniformity.

See Also