Numerics/Random Numbers: Difference between revisions

Latest revision as of 10:58, 24 June 2026

Random numbers are one of those things that sound dead simple - "just give me a number I can't predict, right?" - but then the rabbit hole opens up and suddenly you're comparing Mersenne Twister state spaces and wondering if your Monte Carlo simulation is actually converged.

This page is the Python-centric tour through random number generation: what actually works, what's secretly broken, and how to wield these tools without accidentally turning your simulation into a deterministic paperweight. If you stumbled here looking for C++ rand() quirks, check out the sister page Random Numbers - but for numerical work, you want to be here.

The Two Kinds of Random

There are really only two flavors of random numbers in the universe, and knowing which one you're dealing with changes everything:

Type	Where it comes from	Use it for...
True random	Physical entropy - radioactive decay, thermal noise, keyboard mash-timing, hardware RNGs, `/dev/random`	Cryptography, key generation, anything where predictability = disaster
Pseudorandom (PRNG)	A deterministic algorithm starting from a seed	Simulations, Monte Carlo, randomized algorithms, games, reproducible science

True random is beautiful but slow and you can't replay it. Pseudorandom is fast, reproducible, and - if the algorithm is good - statistically indistinguishable from the real thing for most purposes. The rest of this page is about PRNGs, because that's what 99% of numerical work needs.

Pseudorandom Number Generators: Controlled Chaos

The basic idea behind a PRNG is almost disappointingly simple: you take a number (the state), scramble it through some mathematical function, output part of the result as your "random" number, and keep the rest as the new state. Lather, rinse, repeat.

The OG: Linear Congruential Generators

$X_{n+1} = (aX_n + c) \bmod m$

This is the LCG - the Honda Civic of random number generators. It's everywhere (many C library rand() implementations are LCGs), it's fast, and it's... fine, for very loose definitions of "fine." LCGs have a rigid lattice structure: if you plot consecutive pairs $(X_n, X_{n+1})$ as points in 2D, they fall on a limited number of parallel lines, and in general consecutive $k$-tuples fall on a limited number of parallel hyperplanes in $k$ dimensions. This is Marsaglia's "random numbers fall mainly in the planes" result, and it's kind of horrifying once you see it.
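To make the lattice problem concrete, here is a minimal sketch of an LCG in Python. The parameters are those of RANDU, a historically notorious generator (a = 65539, c = 0, m = 2^31); they are an illustration chosen for this example, not something this page prescribes, and the helper name `lcg` is hypothetical. Because a^2 is congruent to 6a - 9 modulo 2^31, every output is an exact linear combination of the previous two, which is what squeezes consecutive triples onto a handful of planes.

```python
from itertools import islice

def lcg(seed, a=65539, c=0, m=2**31):
    """Yield X_{n+1} = (a*X_n + c) mod m forever (defaults are RANDU's parameters)."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

xs = list(islice(lcg(seed=1), 1000))

# RANDU's giveaway: X_{n+2} = 6*X_{n+1} - 9*X_n (mod 2**31) for every n.
relation_holds = all(
    (xs[i + 2] - 6 * xs[i + 1] + 9 * xs[i]) % 2**31 == 0
    for i in range(len(xs) - 2)
)
print(relation_holds)  # True
```

A generator whose next value is that predictable from the last two is fine for a toy, and not much else.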

The Workhorse: Mersenne Twister (MT19937)

Python's random module uses the Mersenne Twister under the hood - specifically MT19937, with a period of $2^{19937} - 1$ (yes, that's a Mersenne prime, hence the name). The period is so absurdly long that you will never, ever exhaust it. The state is 624 32-bit integers, or about 2.5 KB. It passes most statistical test suites with flying colors.

It is not cryptographically secure. If an attacker can observe 624 consecutive 32-bit outputs, they can reconstruct the entire internal state and predict every future number. For science and simulation this is irrelevant - for crypto it's a dealbreaker. Keep reading.
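The size of that state is easy to check from Python itself. The sketch below uses `random.getstate()` on a private `random.Random` instance; the seed 2024 is arbitrary. In CPython the returned tuple holds a version number, the internal state (624 words plus a position index), and a cached Gaussian value.

```python
import random

rng = random.Random(2024)          # a private Mersenne Twister instance
version, internal, gauss_next = rng.getstate()

# In CPython the internal state is 624 32-bit words plus one position index.
print(len(internal))               # 625
print(all(0 <= w < 2**32 for w in internal[:624]))

# Saving and restoring the state replays exactly the same numbers.
saved = rng.getstate()
first = [rng.random() for _ in range(3)]
rng.setstate(saved)
replay = [rng.random() for _ in range(3)]
print(first == replay)             # True
```

The same `getstate()` / `setstate()` pair is handy for checkpointing a long simulation.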

Python's `random` Module: Batteries Included

The random module is your everyday driver for random numbers in Python. It wraps MT19937 and gives you a clean, Pythonic API.

The Core Functions You'll Actually Use

Function	What it gives you
`random.random()`	Float in [0.0, 1.0). This is the one you'll call 90% of the time.
`random.randint(a, b)`	Random integer in [a, b] - inclusive on both ends. Yes, `randint(0, 10)` can give you 10.
`random.randrange(start, stop, step)`	Like `range()` but returns one random element. `randrange(0, 10)` gives 0-9.
`random.choice(seq)`	Grab a random element from a sequence. O(1) for lists.
`random.shuffle(seq)`	Fisher-Yates in-place shuffle. Kind of elegant.
`random.sample(population, k)`	`k` unique random elements without replacement. The original population is left unchanged.
`random.gauss(mu, sigma)`	Gaussian (normal) distribution. Slightly faster than `normalvariate` but uses a different algorithm.
`random.uniform(a, b)`	Float in [a, b]. Also `random.triangular()`, `random.betavariate()`, `random.expovariate()`, etc.
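A short sketch of the calls in the table, run on a private `random.Random` instance so the global generator stays untouched. The seed 0 and the ten-element `deck` list are arbitrary choices for illustration.

```python
import random

rng = random.Random(0)           # private instance; same methods as the module-level functions

deck = list(range(10))
u = rng.random()                 # float in [0.0, 1.0)
die = rng.randint(1, 6)          # integer 1..6, both ends included
idx = rng.randrange(0, 10)       # integer 0..9, stop excluded
card = rng.choice(deck)          # one element of deck
hand = rng.sample(deck, k=3)     # 3 distinct elements; deck itself is not modified
rng.shuffle(deck)                # reorders deck in place and returns None

print(u, die, idx, card, hand, deck)
```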

Seeding: The Ritual

import random
random.seed(42)
print(random.random())  # Always 0.6394267984578837 on CPython 3.x

Call seed() once at the start of your program if you want reproducible runs. If you don't seed, Python seeds from os.urandom() on modern versions, which is fine - but your results won't be reproducible between runs.

One seed to rule them all: if you import other modules that also use random, seeding the global instance affects them too. This is either a feature or a deeply confusing bug depending on your perspective.
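If that shared global state is a problem, one option is to give each component its own `random.Random` instance. A minimal sketch, assuming a hypothetical simulation that wants a private stream (`sim_rng` is an illustrative name):

```python
import random

random.seed(42)
expected = random.random()       # first value of the global generator for seed 42

# A private generator with its own state, seeded the same way.
sim_rng = random.Random(42)

# Some other module reseeding the global generator does not disturb it.
random.seed(999)
mine = sim_rng.random()

print(mine == expected)          # True: same seed, same algorithm, separate state
```

Passing `sim_rng` into the functions that need randomness keeps the dependency explicit.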

NumPy's Random: When the Stakes Get Higher

The random module is fine for one-at-a-time random numbers, but numerical work often needs millions of them, fast. NumPy's random subsystem does exactly that - vectorized, tight C loops, and way more distributions than the stdlib.

The API Schism: Old Way vs. New Way

NumPy has been through a random-number glow-up. Here's the deal:

Approach	API	Verdict
Old school	`numpy.random.rand()`, `numpy.random.randn()`, `numpy.random.randint()`	Still works, still everywhere in legacy code. Uses a global `RandomState` singleton. Fine for quick scripts, annoying for reproducibility.
The new hotness	`numpy.random.default_rng()` → `Generator` object	Way better. Uses PCG64 by default (better statistical properties than MT19937). Explicit state, explicit seeding. No global state leakage.

The new way is the recommended path and it's just cleaner:

import numpy as np
rng = np.random.default_rng(seed=42)
x = rng.random(1_000_000)        # A million uniform floats, just like that
y = rng.normal(0, 1, size=1000)  # A thousand standard normals
z = rng.integers(0, 100, size=500)  # 500 ints in [0, 100)

The Generator API gives you a long list of distributions - uniform, normal, exponential, gamma, beta, binomial, Poisson, chi-square, F, Student's t, Laplace, logistic, lognormal, multinomial, multivariate normal, Dirichlet... the list goes on. It's a statistical candy store. (For anything missing, such as the Wishart distribution, look to scipy.stats.)

Why PCG64 Over MT19937?

PCG64 is a permuted congruential generator: an LCG whose output is scrambled by a permutation step. It has better statistical properties than MT19937, a far smaller state (a 128-bit state plus a 128-bit increment, versus roughly 2.5 KB), and supports multiple independent streams via SeedSequence. It's also typically faster. The only real win MT19937 has is that the period is absurdly large, but PCG64's period of $2^{128}$ is still "you'll never exhaust it in a billion lifetimes" territory.
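You can look at that state directly. The sketch below assumes a NumPy version in which `default_rng()` is backed by PCG64 (true at the time of writing, though NumPy reserves the right to change the default); the seed 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
bitgen = rng.bit_generator
print(type(bitgen).__name__)            # name of the underlying bit generator

snapshot = bitgen.state                 # plain dict describing the generator
core = snapshot["state"]                # for PCG64: {"state": int, "inc": int}
print(core["state"].bit_length() <= 128, core["inc"].bit_length() <= 128)

first = rng.random(3)
bitgen.state = snapshot                 # rewind to the snapshot
replay = rng.random(3)
print(np.array_equal(first, replay))    # True
```

Two Python integers are the whole generator, which makes checkpointing cheap.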

A Tour Through the Distributions

Half the power of random numbers is in the transformations. You start with uniform [0,1) and then shape it into whatever distribution your model needs. Here are the greatest hits:

The Inverse Transform Method

This is one of those ideas that's so elegant it hurts. If you have a CDF $F(x)$ with inverse $F^{-1}$, and you generate $U \sim \text{Uniform}(0,1)$, then $X = F^{-1}(U)$ follows the distribution with CDF $F$, because $P(F^{-1}(U) \le x) = P(U \le F(x)) = F(x)$.

That's it. That's the whole trick. It's why an exponential variate with rate $\lambda$ can be generated as $-\ln(U)/\lambda$: the CDF is $F(x) = 1 - e^{-\lambda x}$, so $F^{-1}(u) = -\ln(1-u)/\lambda$, and $1-U$ is itself uniform on $(0,1)$. Some distributions (normal) don't have a closed-form inverse CDF, so fancier methods like Box-Muller or Ziggurat step in. But the inverse transform is the conceptual backbone.
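As a sanity check, here is a minimal sketch of the inverse transform for the exponential distribution. With rate $\lambda$ the CDF is $F(x) = 1 - e^{-\lambda x}$, so $F^{-1}(u) = -\ln(1-u)/\lambda$. The function name, the seed and the rate of 2.0 are illustrative assumptions.

```python
import math
import random

def exponential_inverse_transform(rate, rng):
    """Draw one Exponential(rate) sample via X = -ln(1 - U) / rate."""
    u = rng.random()                     # U in [0.0, 1.0)
    return -math.log(1.0 - u) / rate     # 1 - U is in (0.0, 1.0], so log() is safe

rng = random.Random(12345)
rate = 2.0
samples = [exponential_inverse_transform(rate, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)                       # should land close to 1 / rate = 0.5
```

Using `1 - U` rather than `U` avoids ever calling `log(0.0)`, since `random()` can return 0.0 but never 1.0.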

NumPy's Distribution Buffet

import numpy as np

rng = np.random.default_rng()
arr = np.arange(10)   # parameter values below are illustrative

rng.uniform(0, 1, size=1000)                 # The building block
rng.normal(loc=0.0, scale=1.0, size=1000)    # The bell curve (mean, standard deviation)
rng.exponential(scale=2.0, size=1000)        # Waiting times; scale is the mean, i.e. 1/rate
rng.poisson(lam=3.0, size=1000)              # Count data
rng.binomial(n=10, p=0.5, size=1000)         # Number of successes in n trials
rng.gamma(shape=2.0, scale=1.0, size=1000)   # Waiting time for `shape` events
rng.beta(a=2.0, b=5.0, size=1000)            # Proportions, Bayesian priors
rng.chisquare(df=3, size=1000)               # Sum of df squared standard normals
rng.choice([1, 2, 3, 4, 5], size=100)        # Discrete sampling (with replacement by default)
rng.permutation(arr)                         # Returns a shuffled copy; arr is unchanged
rng.shuffle(arr)                             # Shuffles arr in place

Seeds and Reproducibility: Control Your Chaos

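Reproducibility is the quiet superpower of PRNGs. Set the same seed and you get the same sequence every single time, provided the generator algorithm and library version are the same (NumPy, for example, does not promise identical Generator output across releases). This makes debugging stochastic code actually possible and lets other people replicate your results exactly.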

The Golden Rules

Seed once, at the top. Never inside a loop. You already knew this, but everyone does it at least once.
Use SeedSequence for parallel work. If you're running 8 MPI processes or 32 joblib workers, you don't want them all using the same stream. SeedSequence derives independent seeds from one parent seed - it's basically entropy-aware seed spawning and it's kind of genius.
Record your seed. Stick it in a config file, a log line, or a comment. Future you will thank present you.

from numpy.random import SeedSequence, default_rng

ss = SeedSequence(12345)
child_seeds = ss.spawn(4)  # 4 independent streams
rngs = [default_rng(s) for s in child_seeds]

Don't Seed in a Loop

import random

# DON'T DO THIS
for i in range(1000):
    random.seed(42)      # resets the generator to the same state every pass
    x = random.random()  # x is the SAME every iteration

You'd be surprised how often this shows up in actual code. The first random number after seeding with a fixed value is a deterministic function of that seed. Seeding inside a loop with a constant seed just gives you the same "random" number over and over. Use a single seed at the top and let the generator do its thing.
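For contrast, a small sketch that runs both patterns and counts the distinct values each produces. The function names are hypothetical, and each function uses its own `random.Random` instance so the comparison leaves the global generator alone.

```python
import random

def seeded_inside_loop(n, seed=42):
    """The anti-pattern: reseed before every draw."""
    rng = random.Random()
    out = []
    for _ in range(n):
        rng.seed(seed)
        out.append(rng.random())
    return out

def seeded_once(n, seed=42):
    """The fix: seed a single time, then keep drawing."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

bad = seeded_inside_loop(1000)
good = seeded_once(1000)
print(len(set(bad)))     # 1: the same number a thousand times
print(len(set(good)))    # many distinct values
```

Both versions start from the same first value; only the second one ever moves past it.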

When Random Isn't Random Enough: Cryptography

Sometimes "statistically random" isn't good enough - you need "your adversary, armed with a supercomputer and a copy of your algorithm, cannot predict the next bit." That's the crypto-grade bar.

`secrets` and `os.urandom()`

Python 3.6+ ships the secrets module, which wraps os.urandom() (the OS's cryptographically secure RNG - on Linux this comes from /dev/urandom and the kernel's entropy pool):

import secrets
token = secrets.token_hex(32)     # 64-char hex string
key = secrets.randbits(256)        # 256 random bits
url_safe = secrets.token_urlsafe() # For session IDs, CSRF tokens, etc.
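A few more `secrets` calls in a typical setting. This is a sketch: the 16-character password length and the six-digit code are arbitrary illustration choices, not recommendations from this page.

```python
import secrets
import string

alphabet = string.ascii_letters + string.digits

# Random password: each character is chosen by the OS-backed CSPRNG.
password = "".join(secrets.choice(alphabet) for _ in range(16))

raw = secrets.token_bytes(16)            # 16 random bytes
code = secrets.randbelow(10**6)          # integer in [0, 10**6)
print(f"{code:06d}")                     # zero-padded six-digit code

# Compare secrets in constant time to avoid leaking how many characters matched.
print(secrets.compare_digest(password, password))   # True
```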

Why MT19937 Is NOT Crypto-Safe

The Mersenne Twister is a linear recurrence. Observe 624 consecutive 32-bit outputs and you can solve for the entire internal state with linear algebra. After that you can predict every future output and reconstruct every past output. The random module even warns about this in its docs - it's not suitable for security purposes. Use secrets or os.urandom() directly.
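To show this is practical, here is a sketch that clones a `random.Random` instance from 624 observed outputs. It assumes CPython's MT19937 implementation, the standard MT19937 tempering constants, and `getrandbits(32)` returning exactly one raw output word; the helper names are hypothetical. Each output is "untempered" back into a state word, and the recovered words are loaded into a second generator with `setstate()`.

```python
import random

def undo_xor_rshift(y, shift):
    """Invert y ^= y >> shift on a 32-bit word."""
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ (x >> shift)
    return x

def undo_xor_lshift_mask(y, shift, mask):
    """Invert y ^= (y << shift) & mask on a 32-bit word."""
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ ((x << shift) & mask)
    return x & 0xFFFFFFFF

def untemper(y):
    """Undo MT19937's output tempering, recovering the raw state word."""
    y = undo_xor_rshift(y, 18)
    y = undo_xor_lshift_mask(y, 15, 0xEFC60000)
    y = undo_xor_lshift_mask(y, 7, 0x9D2C5680)
    y = undo_xor_rshift(y, 11)
    return y

victim = random.Random()                     # seeded from OS entropy; seed unknown to us
observed = [victim.getrandbits(32) for _ in range(624)]

# Rebuild the state: 624 untempered words plus position 624 ("block used up").
clone = random.Random()
clone.setstate((3, tuple(untemper(o) for o in observed) + (624,), None))

predicted = [clone.getrandbits(32) for _ in range(10)]
actual = [victim.getrandbits(32) for _ in range(10)]
print(predicted == actual)                   # True
```

No seed was guessed and no brute force was needed, which is why MT19937 output should never double as a secret.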

Practical Gotchas: Stuff That Will Bite You

The `randint` Inclusive Surprise

random.randint(a, b) includes b. NumPy does not follow that convention: the legacy numpy.random.randint(low, high) excludes high, and so does rng.integers(low, high) in the new API unless you pass endpoint=True. The only legacy NumPy function that included the upper bound was numpy.random.random_integers, which has long been deprecated. Read the docs for whatever API you're using, because this is an easy place to pick up an off-by-one bug.
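A quick empirical check of which calls can return the upper bound: the stdlib `randint` includes it, while `randrange` and NumPy's `Generator.integers` exclude it unless `endpoint=True` is passed. The seeds and the range 0 to 3 are arbitrary; with 1000 draws per call, every reachable value shows up with overwhelming probability.

```python
import random
import numpy as np

py_rng = random.Random(1)
np_rng = np.random.default_rng(1)

stdlib_randint = {py_rng.randint(0, 3) for _ in range(1000)}       # upper bound included
stdlib_randrange = {py_rng.randrange(0, 3) for _ in range(1000)}   # upper bound excluded
numpy_default = set(np_rng.integers(0, 3, size=1000).tolist())     # upper bound excluded
numpy_endpoint = set(np_rng.integers(0, 3, size=1000, endpoint=True).tolist())  # included

print(sorted(stdlib_randint), sorted(stdlib_randrange))
print(sorted(numpy_default), sorted(numpy_endpoint))
```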

Slow Generation Patterns

Generating numbers one at a time in a Python loop is death by a thousand function calls. If you need a million random numbers, ask NumPy for them all at once:

import random
import numpy as np

rng = np.random.default_rng()

# Slow (Python-loop overhead dominates):
vals = [random.random() for _ in range(1_000_000)]

# Fast (vectorized, C-level):
vals = rng.random(1_000_000)

Parallel RNG: Independent Streams or Bust

If you spawn 8 parallel processes and each seeds with the current time, you might get collisions. Worse, if you seed them all with the same seed (or with seeds 0,1,2,...7), the streams might be correlated. SeedSequence + PCG64 makes this basically foolproof - use it.

The `random.random() * N` Subtle Bias

Multiplying a uniform float by N and flooring to get an integer index can introduce tiny biases due to floating-point representation. For casual use it's fine, but if you're doing something like lottery draw simulations where every bit of uniformity counts, use randint or randrange - they use rejection sampling internally to guarantee uniformity.
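The bias is easier to see with a toy float that has only 3 random bits instead of 53. The sketch below counts exactly where floor(u * n) sends each of the 8 equally likely values, then shows a rejection-sampling alternative. The function names are hypothetical, and the rejection loop is a simplified illustration of the idea rather than a copy of the standard library's code.

```python
import random
from collections import Counter

def floor_scale_counts(bits, n):
    """Count where floor(u * n) sends each of the 2**bits equally likely u = k / 2**bits."""
    return Counter((k * n) >> bits for k in range(2**bits))

def randbelow_rejection(n, rng):
    """Uniform integer in [0, n): draw just enough bits, retry while the value is >= n."""
    k = n.bit_length()
    r = rng.getrandbits(k)
    while r >= n:
        r = rng.getrandbits(k)
    return r

counts = floor_scale_counts(bits=3, n=3)
print(sorted(counts.items()))            # [(0, 3), (1, 3), (2, 2)]: index 2 is short-changed

rng = random.Random(7)
draws = Counter(randbelow_rejection(3, rng) for _ in range(30_000))
print(sorted(draws.items()))             # roughly 10,000 each
```

With 53 bits the imbalance shrinks to about one part in 2^53 / N, which is why it rarely matters in practice; rejection removes it entirely at the cost of an occasional retry.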

From charlesreid1

@@ Line 1: / Line 1: @@
-'''DEVELOPERS! DEVELOPERS! DEVELOPERS! DEVELOPERS!'''
+'''Random numbers''' are one of those things that sound dead simple - "just give me a number I can't predict, right?" - but then the rabbit hole opens up and suddenly you're comparing Mersenne Twister state spaces and wondering if your Monte Carlo simulation is actually converged.
-''(sweats profusely, paces the stage like a caged animal, grabs the microphone with both hands)''
+This page is the Python-centric tour through random number generation: what actually works, what's secretly broken, and how to wield these tools without accidentally turning your simulation into a deterministic paperweight. If you stumbled here looking for C++ <code>rand()</code> quirks, check out the sister page '''[[Random Numbers]]''' - but for numerical work, you want to be here.
-'''RANDOM! NUMBERS! RANDOM! NUMBERS! RANDOM! NUMBERS!'''
+== The Two Kinds of Random ==
-'''I! LOVE! THIS! WIKI!'''
+There are really only two flavors of random numbers in the universe, and knowing which one you're dealing with changes everything:
-''(deep breath, veins bulging on forehead)''
+{| class="wikitable" border="1"
-Let me tell you about RANDOM NUMBERS, people. Let me tell you about the GLORY. The SHEER. UNBRIDLED. GLORY. of generating entropy in Python. This is not just a module. This is a LIFESTYLE. This is a COMMITMENT. This is the sound of ONE MILLION DICE hitting the table ALL AT ONCE.
-If you are reaching for C++ to generate random numbers — '''STOOOOOOP!''' I'm going to throw a CHAIR across the room. I am PHYSICALLY going to pick up a chair and THROW it. Put down that <code>&lt;random&gt;</code> header! Step AWAY from the Mersenne Twister boilerplate! You are about to write FIFTEEN LINES of ceremonial incantation — FIFTEEN! — just to get a number between ONE and SIX. ONE! AND! SIX! Python gives you that in '''ONE! LINE!''' And it will be CORRECT! And it will be READABLE! And you will not have to explain to your cryptographer friend why you seeded with <code>time(NULL)</code> — ''(sweats)'' — because that is how you get OWNED! OWNED! OWNED!
-Python's random-number story is THE BEST IN THE BUSINESS! THE! BEST! IN! THE! BUSINESS! It ships with THREE — count them — THREE battle-tested modules: <code>random</code>! <code>secrets</code>! <code>numpy.random</code>! And each one! Knows! Its! JOB! This page covers ALL THREE, calls out the FOOTGUNS — ''(throws chair)'' — and shows you the ONE OBVIOUS WAY to do it!
-== THE HOLY TRINITY OF PYTHON RANDOMNESS! ==
-''(paces left, paces right, sweat flying in every direction)''
-{| class="wikitable" style="width:100%"
 |-
-! MODULE !! USE CASE !! SPEED !! CRYPTOGRAPHIC?
+! Type !! Where it comes from !! Use it for...
 |-
-| <code>random</code> || SIMULATIONS! GAMES! SHUFFLING! SAMPLING! || '''FAST!''' (Mersenne Twister, baby!) || '''NO! NO! NO!''' Never for security! NEVER!
+| '''True random''' || Physical entropy - radioactive decay, thermal noise, keyboard mash-timing, hardware RNGs, <code>/dev/random</code> || Cryptography, key generation, anything where predictability = disaster
 |-
-| <code>secrets</code> || PASSWORDS! TOKENS! SESSION KEYS! AUTH! || Slower (OS entropy — deal with it) || '''YES! YES! YES!''' Built for exactly this!
+| '''Pseudorandom (PRNG)''' || A deterministic algorithm starting from a seed || Simulations, Monte Carlo, randomized algorithms, games, reproducible science
-|-
-| <code>numpy.random</code> || MASSIVE ARRAYS! MONTE CARLO! STATS! || '''VECTORIZED! GPU-READY! SCREAMS!''' || '''NO! NO! NO!''' Not for security!
 |}
-'''MEMORIZE! THIS! TABLE!''' ''(slams fist on podium)'' Getting it wrong is how you ship a BROKEN cryptosystem. Getting it wrong is how your simulation takes 100× too long. Getting it wrong is how you end up on the front page of Hacker News for all the WRONG reasons!
+True random is beautiful but slow and you can't replay it. Pseudorandom is fast, reproducible, and - if the algorithm is good - statistically indistinguishable from the real thing for most purposes. The rest of this page is about PRNGs, because that's what 99% of numerical work needs.
-== <code>random</code> — YOUR! DAILY! DRIVER! ==
+== Pseudorandom Number Generators: Controlled Chaos ==
-The <code>random</code> module uses the Mersenne Twister — MT19937, baby! Period of <math>2^{19937} - 1</math> — that's a two with NINETEEN THOUSAND digits after it! Passes Diehard! Passes TestU01! Ships with CPython! It is NOT cryptographically secure. Do NOT use it for secrets. '''ARE! WE! CLEAR! GOOD!'''
+The basic idea behind a PRNG is almost disappointingly simple: you take a number (the ''state''), scramble it through some mathematical function, output part of the result as your "random" number, and keep the rest as the new state. Lather, rinse, repeat.
-=== THE ONE-LINERS YOU WILL USE EVERY SINGLE DAY! ===
+=== The OG: Linear Congruential Generators ===
-<syntaxhighlight lang="python">
+<math>X_{n+1} = (aX_n + c) \bmod m</math>
-import random  # IMPORT! IMPORT! IMPORT!
+This is the LCG - the Honda Civic of random number generators. It's everywhere (many C library <code>rand()</code> implementations are LCGs), it's fast, and it's... fine, for very loose definitions of "fine." LCGs have a rigid lattice structure: if you plot consecutive pairs <math>(X_n, X_{n+1})</math> as points in 2D, they fall on a limited number of parallel lines, and in general consecutive <math>k</math>-tuples fall on a limited number of parallel hyperplanes in <math>k</math> dimensions. This is Marsaglia's "random numbers fall mainly in the planes" result, and it's kind of horrifying once you see it.
-# Integer in [a, b] — INCLUSIVE ON BOTH ENDS, BABY!
+=== The Workhorse: Mersenne Twister (MT19937) ===
-die_roll = random.randint(1, 6)  # ONE! LINE! DICE!
-# Integer in [a, b) — exclusive upper, like range()
+Python's <code>random</code> module uses the Mersenne Twister under the hood - specifically MT19937, with a period of <math>2^{19937} - 1</math> (yes, that's a Mersenne prime, hence the name). The period is so absurdly long that you will never, ever exhaust it. The state is 624 32-bit integers, or about 2.5 KB. It passes most statistical test suites with flying colors.
-idx = random.randrange(0, len(my_list))  # PICK! AN! INDEX!
-# Float in [0.0, 1.0) — THE CLASSIC!
+It is ''not'' cryptographically secure. If an attacker can observe 624 consecutive outputs, they can reconstruct the entire internal state and predict every future number. For science and simulation this is irrelevant - for crypto it's a dealbreaker. Keep reading.
-u = random.random()  # RANDOM! RANDOM! RANDOM!
-# Float in [a, b] — FLOATS! FLOATS! FLOATS!
+== Python's <code>random</code> Module: Batteries Included ==
-f = random.uniform(2.5, 7.5)  # UNIFORM! UNIFORM!
-# Pick ONE! Pick ONE! Pick ONE!
+The <code>random</code> module is your everyday driver for random numbers in Python. It wraps MT19937 and gives you a clean, Pythonic API.
-winner = random.choice(["Alice", "Bob", "Charlie", "Dana"])
-# Pick k WITHOUT replacement — no duplicates, no excuses!
+=== The Core Functions You'll Actually Use ===
-sample = random.sample(population, k=10)
-# Shuffle IN PLACE — Fisher-Yates under the hood, baby!
+{| class="wikitable" border="1"
-random.shuffle(deck)  # SHUFFLE! SHUFFLE! SHUFFLE!
+|-
-</syntaxhighlight>
+! Function !! What it gives you
+|-
+| <code>random.random()</code> || Float in [0.0, 1.0). This is the one you'll call 90% of the time.
+|-
+| <code>random.randint(a, b)</code> || Random integer in [a, b] - inclusive on both ends. Yes, <code>randint(0, 10)</code> can give you 10.
+|-
+| <code>random.randrange(start, stop, step)</code> || Like <code>range()</code> but returns one random element. <code>randrange(0, 10)</code> gives 0-9.
+|-
+| <code>random.choice(seq)</code> || Grab a random element from a sequence. O(1) for lists.
+|-
+| <code>random.shuffle(seq)</code> || Fisher-Yates in-place shuffle. Kind of elegant.
+|-
+| <code>random.sample(population, k)</code> || <code>k</code> unique random elements without replacement. The original population is left unchanged.
+|-
+| <code>random.gauss(mu, sigma)</code> || Gaussian (normal) distribution. Slightly faster than <code>normalvariate</code> but uses a different algorithm.
+|-
+| <code>random.uniform(a, b)</code> || Float in [a, b]. Also <code>random.triangular()</code>, <code>random.betavariate()</code>, <code>random.expovariate()</code>, etc.
+|}
-=== SEEDS! CONTROL! YOUR! CHAOS! ===
+=== Seeding: The Ritual ===
 <syntaxhighlight lang="python">
-random.seed(42)          # DETERMINISTIC! Reproducible! TESTS!
+import random
-random.seed()            # System time + os.urandom fallback
+random.seed(42)
-random.seed("hello world", version=2)  # HASH! THAT! STRING!
+print(random.random())  # Always 0.6394267984578837 on CPython 3.x
 </syntaxhighlight>
-'''OPINION! OPINION! OPINION!''' Always seed explicitly in scientific code. '''REPRODUCIBILITY IS NOT OPTIONAL!''' If your results cannot be regenerated from a known seed, you do not have results — you have an ANECDOTE! Save the seed alongside your output. Future you will WEEP with GRATITUDE! WEEP! WITH! GRATITUDE!
+Call <code>seed()</code> once at the start of your program if you want reproducible runs. If you don't seed, Python seeds from <code>os.urandom()</code> on modern versions, which is fine - but your results won't be reproducible between runs.
-=== DISTRIBUTIONS! BEYOND! UNIFORM! ===
+'''One seed to rule them all:''' if you import other modules that also use <code>random</code>, seeding the global instance affects them too. This is either a feature or a deeply confusing bug depending on your perspective.
-<syntaxhighlight lang="python">
+== NumPy's Random: When the Stakes Get Higher ==
-# Gaussian (mu=0, sigma=1) — THE BELL CURVE!
-random.gauss(0, 1)          # SLIGHTLY! FASTER!
-random.normalvariate(0, 1)  # THREAD! SAFE!
-# Exponentially distributed, lambda=1.5 — POISSON'S COUSIN!
+The <code>random</code> module is fine for one-at-a-time random numbers, but numerical work often needs ''millions'' of them, fast. NumPy's random subsystem does exactly that - vectorized, tight C loops, and way more distributions than the stdlib.
-random.expovariate(1.5)
-# Gamma! Beta! Von Mises! Pareto! Weibull! — WE'VE GOT THEM ALL!
+=== The API Schism: Old Way vs. New Way ===
-random.gammavariate(alpha=2.0, beta=3.0)
-random.betavariate(alpha=0.5, beta=0.5)
-random.paretovariate(alpha=1.5)
-random.weibullvariate(alpha=1.5, beta=2.0)
-</syntaxhighlight>
-=== THE TRAP! <code>SystemRandom</code> — JUST! USE! <code>secrets</code>! ===
+NumPy has been through a random-number glow-up. Here's the deal:
-<syntaxhighlight lang="python">
+{| class="wikitable" border="1"
-# Exists. Works. But WHY! ARE! YOU! DOING! THIS!
+|-
-sr = random.SystemRandom()
+! Approach !! API !! Verdict
-token = sr.randrange(2**256)
+|-
-</syntaxhighlight>
+| '''Old school''' || <code>numpy.random.rand()</code>, <code>numpy.random.randn()</code>, <code>numpy.random.randint()</code> || Still works, still everywhere in legacy code. Uses a global <code>RandomState</code> singleton. Fine for quick scripts, annoying for reproducibility.
+|-
+| '''The new hotness''' || <code>numpy.random.default_rng()</code> → <code>Generator</code> object || Way better. Uses PCG64 by default (better statistical properties than MT19937). Explicit state, explicit seeding. No global state leakage.
+|}
-<code>SystemRandom</code> wraps <code>/dev/urandom</code> through the <code>random.Random</code> API. Fine! It works! ''(paces aggressively)'' But Python 3.6 gave us <code>secrets</code>, which has a PURPOSE-BUILT API for exactly this! USE! THE! RIGHT! TOOL! USE IT! USE IT! USE IT!
+The new way is the recommended path and it's just cleaner:
-== <code>secrets</code> — WHEN RANDOM ISN'T JUST RANDOM! ==
-''(stops pacing, stares directly into the audience, voice drops to a gravelly whisper)''
-If the outcome affects '''MONEY! PRIVACY! AUTHENTICATION! USER SAFETY!''' — you are in <code>secrets</code> territory, people. This module pulls entropy DIRECTLY from the operating system's CSPRNG — <code>/dev/urandom</code> on Linux, <code>CryptGenRandom</code> on Windows — and the API is DELIBERATELY NARROW. You CANNOT accidentally use it for Monte Carlo. You CANNOT. IT WILL NOT LET YOU!
 <syntaxhighlight lang="python">
-import secrets  # SECRETS! SECRETS! SECRETS!
+import numpy as np
+rng = np.random.default_rng(seed=42)
-# Cryptographically random integer in [0, n) — BELOW! BELOW! BELOW!
+x = rng.random(1_000_000)        # A million uniform floats, just like that
-secrets.randbelow(2**256)  # TWO TO THE TWO FIFTY SIX!
+y = rng.normal(0, 1, size=1000)  # A thousand standard normals
+z = rng.integers(0, 100, size=500)  # 500 ints in [0, 100)
-# Random integer with k random bits — BITS! BITS! BITS!
-secrets.randbits(256)  # GIVE! ME! TWO! HUNDRED! FIFTY! SIX! BITS!
-# Token — URL-safe Base64, nbytes → ceil(nbytes * 4/3) characters
-secrets.token_hex(32)       # 64 hex chars — HEX! HEX! HEX!
-secrets.token_urlsafe(32)   # ~43 URL-safe chars — URL! SAFE! URL! SAFE!
-# Pick one — constant-time-ish choice — PICK! PICK! PICK!
-secrets.choice(["primary", "secondary", "fallback"])
 </syntaxhighlight>
-'''OPINION! OPINION! OPINION!''' If you type <code>random</code> when you should have typed <code>secrets</code>, you have introduced a VULNERABILITY! '''FULL! STOP!''' The Mersenne Twister is PREDICTABLE after 624 consecutive outputs. That is not a theoretical attack — that is a SATURDAY-AFTERNOON SCRIPT! A SATURDAY! AFTERNOON! SCRIPT! Use <code>secrets</code> for anything adversarial. '''USE! IT!'''
+The <code>Generator</code> API gives you a long list of distributions - uniform, normal, exponential, gamma, beta, binomial, Poisson, chi-square, F, Student's t, Laplace, logistic, lognormal, multinomial, multivariate normal, Dirichlet... the list goes on. It's a statistical candy store. (For anything missing, such as the Wishart distribution, look to <code>scipy.stats</code>.)
-== <code>numpy.random</code> — THE! HEAVY! ARTILLERY! ==
-''(sweat intensity increases 400%)''
+=== Why PCG64 Over MT19937? ===
-When you need '''MILLIONS!''' of random numbers — Python's <code>random</code> module becomes a BOTTLENECK! Each call crosses the Python/C boundary! EACH! CALL! NumPy's <code>numpy.random</code> generates ENTIRE ARRAYS in C — VECTORIZED! — and the new API (1.17+) uses the SUPERIOR PCG-64 generator by default! PCG! PCG! PCG!
+PCG64 is a permuted congruential generator: an LCG whose output is scrambled by a permutation step. It has better statistical properties than MT19937, a far smaller state (a 128-bit state plus a 128-bit increment, versus roughly 2.5 KB), and supports ''multiple independent streams'' via <code>SeedSequence</code>. It's also typically faster. The only real win MT19937 has is that the period is absurdly large, but PCG64's period of <math>2^{128}</math> is still "you'll never exhaust it in a billion lifetimes" territory.
-=== THE NEW API — USE THIS! NOT! THE! OLD! ONE! ===
+== A Tour Through the Distributions ==
-<syntaxhighlight lang="python">
+Half the power of random numbers is in the transformations. You start with uniform [0,1) and then shape it into whatever distribution your model needs. Here are the greatest hits:
-import numpy as np  # IMPORT! IMPORT!
-# Create a generator — PCG64 is the new default, BABY!
+=== The Inverse Transform Method ===
-rng = np.random.default_rng(seed=2024)  # DEFAULT! DEFAULT! DEFAULT!
-# Uniform [0, 1) — 10 MILLION FLOATS in < 100 ms! TEN! MILLION!
+This is one of those ideas that's so elegant it hurts. If you have a CDF <math>F(x)</math> with inverse <math>F^{-1}</math>, and you generate <math>U \sim \text{Uniform}(0,1)</math>, then <math>X = F^{-1}(U)</math> follows the distribution with CDF <math>F</math>.
-u = rng.random(10_000_000)  # THAT'S! TEN! MILLION!
-# Integers — INCLUSIVE ENDPOINT!
+That's it. That's the whole trick. It's why an exponential variate with rate <math>\lambda</math> can be generated as <math>-\ln(U)/\lambda</math>: the CDF is <math>F(x) = 1 - e^{-\lambda x}</math>, so <math>F^{-1}(u) = -\ln(1-u)/\lambda</math>, and <math>1-U</math> is itself uniform on <math>(0,1)</math>. Some distributions (normal) don't have a closed-form inverse CDF, so fancier methods like Box-Muller or Ziggurat step in. But the inverse transform is the conceptual backbone.
-dice = rng.integers(1, 7, size=1000, endpoint=True)  # d6 × 1000!
-# Normal, vectorized — ONE! MILLION! NORMALS!
+=== NumPy's Distribution Buffet ===
-z = rng.standard_normal((1000, 1000))  # VECTORIZED! VECTORIZED!
-# Shuffle along axis — SHUFFLE! THE! ARRAY!
-rng.shuffle(arr, axis=0)
-# Choice with replacement AND probabilities — CHOOSE! CHOOSE!
-rng.choice(["H", "T"], size=1000, p=[0.5, 0.5])
-</syntaxhighlight>
-=== THE OLD API — DETECT IT! THEN! KILL! IT! ===
 <syntaxhighlight lang="python">
-# LEGACY! LEGACY! LEGACY! Global state, unpredictable seeding, slower PCG!
+rng = np.random.default_rng()
-np.random.seed(42)        # DON'T! DON'T! DON'T!
-np.random.rand(100)       # Uniform from the GLOBAL STATE!
-np.random.randn(100)      # Normal from the GLOBAL STATE!
-np.random.randint(0, 10, 100)  # Integer from the GLOBAL STATE!
-# EVERY call after np.random.seed() is a FOOTGUN in threaded code!
+rng.uniform(0, 1, size=1000)       # The building block
-# A FOOTGUN! IN THREADED CODE!
+rng.normal(mu, sigma, size=1000)    # The bell curve
-# Use np.random.default_rng() instead! USE IT!
+rng.exponential(scale, size=1000)   # Waiting times
+rng.poisson(lam, size=1000)         # Count data
+rng.binomial(n, p, size=1000)       # k successes in n trials
+rng.gamma(shape, scale, size=1000)  # Waiting time for k events
+rng.beta(a, b, size=1000)           # Proportions, Bayesian priors
+rng.chisquare(df, size=1000)        # Sum of squared normals
+rng.choice([1,2,3,4,5], size=100)   # Discrete sampling with/without replacement
+rng.permutation(arr)                # Returns a shuffled copy (rng.shuffle(arr) shuffles in place)
 </syntaxhighlight>
-'''OPINION! OPINION! OPINION!''' <code>np.random.seed()</code> is a CODE SMELL in the year of our lord twenty twenty-four! ''(throws another chair)'' The global <code>RandomState</code> is SHARED across ALL threads, ALL libraries, ALL modules that import NumPy. If ANY of them call <code>np.random.seed()</code> or draw from the global state, your "deterministic" run is SILENTLY! CORRUPTED! SILENTLY! CORRUPTED! The <code>Generator</code> API — <code>default_rng</code> — gives each component its own ISOLATED STREAM! ISOLATED! USE! IT!
+== Seeds and Reproducibility: Control Your Chaos ==
-=== GENERATORS! PICK! YOUR! POISON! ===
-<syntaxhighlight lang="python">
-from numpy.random import PCG64, Philox, SFC64, MT19937
-# PCG64 — DEFAULT! EXCELLENT! ALL-ROUNDER! TINY STATE! 128 BITS!
-rng = np.random.Generator(PCG64(seed=42))
-# Philox — COUNTER-BASED! PARALLEL! GPU-FRIENDLY!
-rng = np.random.Generator(Philox(seed=42))
-# SFC64 — FASTEST! SMALL STATE! GOOD STATISTICAL QUALITY!
-rng = np.random.Generator(SFC64(seed=42))
-# MT19937 — THE OLD WARHORSE! COMPATIBILITY! ONLY!
-rng = np.random.Generator(MT19937(seed=42))
-</syntaxhighlight>
-== THE LAWS OF RANDOM NUMBER HYGIENE! ==
-''(grips podium with both hands, knuckles white, voice at absolute maximum volume)''
-=== LAW 1! SEEDS! ARE! SACRED! ===
-Log your seed! Better yet, log the ENTIRE RNG STATE if you checkpoint! If you cannot replay your simulation BIT-FOR-BIT, your paper is a PDF full of HOPES! AND! FEELINGS! HOPES! AND! FEELINGS!
-=== LAW 2! <code>random</code> ≠ <code>secrets</code>! ===
-Print this on a STICKY NOTE and ATTACH it to your MONITOR! Print it on your FOREHEAD! Tattoo it on your ARM!
-<code>random</code> is for DICE! DICE! DICE! <code>secrets</code> is for KEYS! KEYS! KEYS!
-If your code generates a session ID, password reset token, or API key with <code>random</code> — DELETE! IT! AND! START! OVER! DELETE IT! START OVER!
-=== LAW 3! VECTORIZE! OR! DIE! ===
-Generating 10 million random numbers in a Python <code>for</code> loop calling <code>random.random()</code> is the computational equivalent of EATING SOUP WITH A FORK! Use <code>numpy.random.Generator.random(10_000_000)</code>. It will be 50–200× FASTER! FIFTY TO TWO HUNDRED TIMES FASTER! And your CPU will not file a GRIEVANCE! THE CPU! WILL NOT! FILE! A GRIEVANCE!
-=== LAW 4! NEVER! SEED! WITH! SYSTEM! TIME! IN! A! LOOP! ===
-Seeding with <code>time.time()</code> inside a tight loop — ''(we see this in the wild, people — we SEE it)'' — produces IDENTICAL "random" sequences every iteration because the CLOCK HASN'T TICKED! THE! CLOCK! HASN'T! TICKED! This is not a subtle bug — it is a DISASTER with a STRAIGHT FACE! If you must reseed quickly, use <code>secrets.randbits(128)</code> as the seed! SECRETS! NOT! TIME!
-=== LAW 5! BEWARE! THE! BIRTHDAY! PARADOX! ===
-You need only <math>\sqrt{n}</math> samples before COLLISIONS appear! COLLISIONS! For 32-bit random IDs, that is ~77,000! SEVENTY-SEVEN THOUSAND! If you are generating IDs with <code>random.getrandbits(32)</code>, expect a duplicate by row 65,536! SIXTY-FIVE THOUSAND! Use 128-bit tokens — <code>secrets.token_hex(16)</code> — and the collision probability becomes COSMICALLY! NEGLIGIBLE! COSMICALLY! NEGLIGIBLE!
+Reproducibility is the quiet superpower of PRNGs. Set the same seed and you get the same sequence every single time, as long as the code and the library version stay the same (NumPy does not promise identical <code>Generator</code> streams across releases). This makes debugging stochastic code actually possible and lets other people replicate your results exactly.
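A quick way to convince yourself of this is to build two generators from the same seed and compare their output. This is only a sketch: the seeds 42 and 43 and the choice of <code>normal</code> are arbitrary.

```python
import numpy as np

# Two generators built from the same seed produce the same stream
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
draws_a = rng_a.normal(size=5)
draws_b = rng_b.normal(size=5)
same = np.array_equal(draws_a, draws_b)

# A different seed gives a different stream
rng_c = np.random.default_rng(43)
different = not np.array_equal(draws_a, rng_c.normal(size=5))
```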
-== COMMON PATTERNS! DONE! RIGHT! ==
+=== The Golden Rules ===
-=== PATTERN! MONTE! CARLO! INTEGRATION! ===
+# '''Seed once, at the top.''' Never inside a loop. You already knew this, but everyone does it at least once.
+# '''Use <code>SeedSequence</code> for parallel work.''' If you're running 8 MPI processes or 32 <code>joblib</code> workers, you don't want them all using the same stream. <code>SeedSequence</code> derives independent seeds from one parent seed - it's basically entropy-aware seed spawning and it's kind of genius.
+# '''Record your seed.''' Stick it in a config file, a log line, or a comment. Future you will thank present you.
 <syntaxhighlight lang="python">
-import numpy as np  # NUMPY! NUMPY! NUMPY!
+from numpy.random import SeedSequence, default_rng
-def estimate_pi(n: int, rng=None) -> float:
+ss = SeedSequence(12345)
-    """Estimate π by throwing DARTS at a UNIT SQUARE!"""
+child_seeds = ss.spawn(4)  # 4 independent streams
-    if rng is None:
+rngs = [default_rng(s) for s in child_seeds]
-        rng = np.random.default_rng()  # DEFAULT! DEFAULT!
-    x = rng.random(n)  # N DARTS! X AXIS!
-    y = rng.random(n)  # N DARTS! Y AXIS!
-    inside = np.sum(x**2 + y**2 <= 1.0)  # INSIDE THE CIRCLE!
-    return 4.0 * inside / n  # PI! PI! PI!
-rng = np.random.default_rng(seed=42)
-print(estimate_pi(10_000_000, rng=rng))  # 3.1415... LOOK AT IT! LOOK AT PI!
 </syntaxhighlight>
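The "record your seed" rule can be sketched like this, assuming you are happy to let NumPy pull fresh entropy from the OS and then write it down. The JSON record is a stand-in for whatever config file or log line you actually use.

```python
import json
from numpy.random import SeedSequence, default_rng

# Let NumPy gather fresh OS entropy, then keep the value it picked
ss = SeedSequence()
seed = ss.entropy  # a (large) Python int

# Stand-in for a log line or config entry
record = json.dumps({"seed": seed})

rng = default_rng(SeedSequence(seed))
original_draws = rng.random(3)

# Later: rebuild the generator from the recorded seed and replay the run
replay_seed = json.loads(record)["seed"]
replay_rng = default_rng(SeedSequence(replay_seed))
replayed_draws = replay_rng.random(3)
```

This way every run gets a different seed by default, but any run can be replayed from its log.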
-=== PATTERN! RESERVOIR! SAMPLING! STREAMING! DATA! ===
+=== Don't Seed in a Loop ===
 <syntaxhighlight lang="python">
-import random  # IMPORT! IMPORT!
+import random
+# DON'T DO THIS
+for i in range(1000):
-def reservoir_sample(stream, k: int, rng=None):
+    random.seed(42)
-    """Reservoir-sample k items from a STREAMING ITERABLE!"""
+    x = random.random()  # x is the SAME every iteration
-    if rng is None:
-        rng = random.Random()  # MY! OWN! RNG!
-    reservoir = []
-    for i, item in enumerate(stream):
-        if i < k:
-            reservoir.append(item)  # FILL! THE! RESERVOIR!
-        else:
-            j = rng.randrange(i + 1)
-            if j < k:
-                reservoir[j] = item  # REPLACE! REPLACE! REPLACE!
-    return reservoir
 </syntaxhighlight>
-=== PATTERN! CRYPTOGRAPHIC! SALT! TOKEN! ===
+You'd be surprised how often this shows up in actual code. The first random number after seeding with a fixed value is a deterministic function of that seed. Seeding inside a loop with a constant seed just gives you the same "random" number over and over. Use a single seed at the top and let the generator do its thing.
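Here is a small side-by-side sketch of the broken pattern and one possible fix. It uses a private <code>random.Random</code> instance for the fix, which is a choice made for this example rather than something the page prescribes.

```python
import random

# Broken: reseeding with a constant inside the loop resets the stream each time
broken = []
for _ in range(1000):
    random.seed(42)
    broken.append(random.random())

# Fixed: seed once, before the loop, and keep drawing from the same generator
rng = random.Random(42)
fixed = [rng.random() for _ in range(1000)]

distinct_broken = len(set(broken))  # every value is identical
distinct_fixed = len(set(fixed))    # many different values
```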
-<syntaxhighlight lang="python">
+== When Random Isn't Random Enough: Cryptography ==
-import secrets  # SECRETS! SECRETS! SECRETS!
-def make_session_id() -> str:
+Sometimes "statistically random" isn't good enough - you need "your adversary, armed with a supercomputer and a copy of your algorithm, cannot predict the next bit." That's the crypto-grade bar.
-    """256-bit session ID, URL-safe. ~43 characters. SECURE! SECURE!"""
-    return secrets.token_urlsafe(32)
-def make_api_key() -> str:
+=== <code>secrets</code> and <code>os.urandom()</code> ===
-    """Hex-encoded 256-bit API key. 64 characters. KEYS! KEYS! KEYS!"""
-    return secrets.token_hex(32)
-</syntaxhighlight>
-=== PATTERN! TRAIN! TEST! SPLIT! REPRODUCIBLE! ===
+Python 3.6+ ships the <code>secrets</code> module, which wraps <code>os.urandom()</code> (the OS's cryptographically secure RNG - on Linux this is the kernel CSPRNG, the same source that backs <code>/dev/urandom</code>):
 <syntaxhighlight lang="python">
-import numpy as np
+import secrets
+token = secrets.token_hex(32)     # 64-char hex string
-rng = np.random.default_rng(seed=8675309)  # JENNY! JENNY! JENNY!
+key = secrets.randbits(256)        # A Python int holding 256 random bits
-indices = rng.permutation(len(data))  # PERMUTE! PERMUTE!
+url_safe = secrets.token_urlsafe() # For session IDs, CSRF tokens, etc.
-split = int(0.8 * len(data))  # EIGHTY! TWENTY!
-train_idx, test_idx = indices[:split], indices[split:]  # SPLIT! SPLIT!
 </syntaxhighlight>
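Beyond tokens, <code>secrets</code> also covers the "pick something" cases. The sketch below is illustrative only: <code>make_password</code>, <code>ALPHABET</code>, and the length of 16 are made up for this example, not part of the module.

```python
import secrets
import string

# Hypothetical alphabet and helper, chosen for illustration
ALPHABET = string.ascii_letters + string.digits

def make_password(length=16):
    """Build a password by drawing each character from the OS CSPRNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

password = make_password()
raw = secrets.token_bytes(16)   # 16 raw random bytes
die = secrets.randbelow(6) + 1  # integer in 1..6, drawn securely
```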
-== WHAT ABOUT <code>os.urandom</code>?! ==
+=== Why MT19937 Is NOT Crypto-Safe ===
-<code>os.urandom(n)</code> is the BEDROCK, people! ''(pounds podium)'' It returns ''n'' bytes from the OS CSPRNG! <code>secrets</code> is a THIN, OPINIONATED WRAPPER around it! A THIN! WRAPPER! Use <code>secrets</code> for structured randomness — tokens, integers, choices! Use <code>os.urandom</code> directly ONLY when you need raw bytes or are building your own crypto primitives — and if you are doing that, you ALREADY KNOW WHY YOU ARE HERE!
+The Mersenne Twister is a linear recurrence. Observe 624 consecutive 32-bit outputs and you can solve for the entire internal state with linear algebra. After that you can predict every future output and reconstruct every past output. The <code>random</code> module even warns about this in its docs - it's not suitable for security purposes. Use <code>secrets</code> or <code>os.urandom()</code> directly.
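To make the attack concrete, here is a hedged sketch of cloning a generator from 624 observed outputs. It inverts the standard MT19937 tempering step and relies on the state tuple layout that CPython's <code>random.Random.setstate</code> accepts (version 3, 624 words plus an index), which is an implementation detail. The helper names and the seed 2024 are invented for this example.

```python
import random

def undo_right_xor(y, shift):
    """Invert y = x ^ (x >> shift) for a 32-bit x."""
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ (x >> shift)
    return x

def undo_left_xor_mask(y, shift, mask):
    """Invert y = x ^ ((x << shift) & mask) for a 32-bit x."""
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ ((x << shift) & mask)
    return x & 0xFFFFFFFF

def untemper(y):
    """Undo the four MT19937 tempering steps, last step first."""
    y = undo_right_xor(y, 18)
    y = undo_left_xor_mask(y, 15, 0xEFC60000)
    y = undo_left_xor_mask(y, 7, 0x9D2C5680)
    y = undo_right_xor(y, 11)
    return y

# The "victim" generator; the attacker only sees its outputs
victim = random.Random(2024)
observed = [victim.getrandbits(32) for _ in range(624)]

# Rebuild the internal state; index 624 means "regenerate on the next draw"
state = tuple(untemper(v) for v in observed) + (624,)
clone = random.Random()
clone.setstate((3, state, None))

predicted = [clone.getrandbits(32) for _ in range(5)]
actual = [victim.getrandbits(32) for _ in range(5)]
```

No brute force is involved anywhere, which is exactly why this generator has no business producing tokens or keys.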
-== WHAT ABOUT C++'s <code>&lt;random&gt;</code>?! ==
+== Practical Gotchas: Stuff That Will Bite You ==
-''(deep breath — the BIG one is coming)''
+=== The <code>randint</code> Inclusive Surprise ===
-Look. I have written C++. I have LOVED C++. '''I! HAVE! LOVED! C++!''' But generating a random integer in modern C++ looks like THIS:
+<code>random.randint(a, b)</code> includes <code>b</code>. NumPy goes the other way: both the legacy <code>numpy.random.randint(a, b)</code> and the new <code>rng.integers(a, b)</code> use a half-open interval and do NOT include <code>b</code>, unless you pass <code>endpoint=True</code> to <code>rng.integers</code>. And to make it extra confusing, the long-deprecated <code>numpy.random.random_integers(a, b)</code> followed the Python convention and DID include <code>b</code>. Read the docs for whatever API you're using. This has caused real bugs in real papers.
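A die-roll sketch makes the endpoint difference visible. The seeds and sample size are arbitrary; with 10,000 draws the largest value seen is the upper end of each range with overwhelming probability.

```python
import random
import numpy as np

py_rng = random.Random(0)
np_rng = np.random.default_rng(0)
n = 10_000

# Standard library: randint(1, 6) can return 6
py_max = max(py_rng.randint(1, 6) for _ in range(n))

# NumPy Generator: integers(1, 6) draws from 1..5 only
np_max = int(np_rng.integers(1, 6, size=n).max())

# Pass endpoint=True to include the upper bound
np_max_inclusive = int(np_rng.integers(1, 6, size=n, endpoint=True).max())
```

When porting code between the two libraries, the safest habit is to spell out the intended range in a comment next to the call.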
-<syntaxhighlight lang="cpp">
+=== Slow Generation Patterns ===
-#include <random>    // ONE! HEADER!
-#include <iostream>  // TWO! HEADERS!
-int main() {
+Generating numbers one at a time in a Python loop is death by a thousand function calls. If you need a million random numbers, ask NumPy for them all at once:
-    std::random_device rd;              // OBJECT! NUMBER! ONE!
-    std::mt19937 gen(rd());             // OBJECT! NUMBER! TWO!
-    std::uniform_int_distribution<> dist(1, 6);  // OBJECT! NUMBER! THREE!
-    std::cout << dist(gen) << '\n';     // FINALLY! A! NUMBER!
-}
-</syntaxhighlight>
-PYTHON!:
+<syntaxhighlight lang="python">
+import random
+import numpy as np
+rng = np.random.default_rng()
+# Slow (Python-loop overhead dominates):
+vals = [random.random() for _ in range(1_000_000)]
-<syntaxhighlight lang="python">
+# Fast (vectorized, C-level):
-import random                    # ONE! IMPORT!
+vals = rng.random(1_000_000)
-print(random.randint(1, 6))     # ONE! LINE! ONE! LINE! ONE! LINE!
 </syntaxhighlight>
-The C++ version is FIVE! LINES! FIVE! It pulls in TWO! HEADERS! It instantiates THREE! OBJECTS! From THREE! DIFFERENT! CLASSES! '''FOR! A! DIE! ROLL!''' ''(sweat dripping onto the keyboard)'' And EVERY! SINGLE! ONE! of those lines has a sharp edge: <code>std::random_device</code> can be DETERMINISTIC on MinGW! <code>std::mt19937</code> produces BIASED RESULTS if you use modulo instead of <code>uniform_int_distribution</code>! The BOILERPLATE-TO-VALUE RATIO IS OFF! THE! CHARTS! OFF! THE! CHARTS!
+=== Parallel RNG: Independent Streams or Bust ===
-'''USE! PYTHON!''' Your numerical code will be SHORTER! CORRECTER! And you will finish BEFORE LUNCH! BEFORE! LUNCH!
-''(collapses in sweaty heap, one fist still raised triumphantly in the air)''
+If you spawn 8 parallel processes and each seeds with the current time, several of them can start within the same clock tick and end up with identical streams. Seeding them all with the same seed guarantees identical streams, and hand-picked neighbouring seeds (0, 1, 2, ... 7) can give correlated streams with some generators. <code>SeedSequence</code> + PCG64 makes this basically foolproof - use it.
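One way this can look in practice is sketched below. The workers run in a plain loop so the example stays self-contained; in real code the <code>worker</code> call would be handed to <code>joblib</code>, <code>multiprocessing</code>, or MPI. The function names <code>worker</code> and <code>run</code> are hypothetical.

```python
import numpy as np
from numpy.random import SeedSequence, default_rng

def worker(child_seed, n=1000):
    """Each worker builds its own generator from its own child seed."""
    rng = default_rng(child_seed)
    return rng.random(n)

def run(parent_seed, n_workers=8):
    """Derive one child seed per worker from a single parent seed."""
    children = SeedSequence(parent_seed).spawn(n_workers)
    return [worker(child) for child in children]

first = run(12345)
second = run(12345)

# Same parent seed: every worker's stream is reproduced exactly
reproducible = all(np.array_equal(a, b) for a, b in zip(first, second))

# Different workers: streams do not repeat each other
streams_differ = not np.array_equal(first[0], first[1])
```

Only the parent seed needs to be recorded; the child seeds are derived from it deterministically.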
-'''PYTHON! PYTHON! PYTHON! PYTHON! RANDOM! RANDOM! RANDOM! RANDOM!'''
+=== The <code>random.random() * N</code> Subtle Bias ===
-'''DEVELOPERS! DEVELOPERS! DEVELOPERS! DEVELOPERS!'''
+Multiplying a uniform float by N and flooring to get an integer index can introduce tiny biases due to floating-point representation. For casual use it's fine, but if you're doing something like lottery draw simulations where every bit of uniformity counts, use <code>randint</code> or <code>randrange</code> - they use rejection sampling internally to guarantee uniformity.
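The bias is easier to see on a toy scale. The sketch below pretends the generator only produces 3-bit values (8 equally likely outcomes) and maps them onto 3 buckets by scaling and flooring; since 8 is not divisible by 3, one bucket comes up short. It then shows a rejection-sampling alternative in the spirit of what <code>randrange</code> does, though it is a simplified illustration and not CPython's actual code. Both function names are made up for this example.

```python
import random
from collections import Counter

def scaled_index(value, bits, n):
    """Map an integer in [0, 2**bits) to [0, n) by scaling and flooring."""
    return (value * n) >> bits

# Enumerate every 3-bit value: 8 inputs cannot split evenly into 3 buckets
counts = Counter(scaled_index(v, 3, 3) for v in range(8))

def rejection_index(n, rng):
    """Draw uniformly from [0, n) by discarding out-of-range candidates."""
    k = n.bit_length()
    while True:
        candidate = rng.getrandbits(k)
        if candidate < n:
            return candidate

rng = random.Random(7)  # illustrative seed
samples = [rejection_index(3, rng) for _ in range(1000)]
```

With 53-bit floats the imbalance is far smaller than in this toy case, which is why it rarely matters outside of situations where exact uniformity is the point.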
-== SEE ALSO ==
+== See Also ==
-* [[Numerics]] — the FULL! NUMERICAL! RECIPES! CATALOG!
+* '''[[Random Numbers]]''' - The C++ page exploring <code>rand()</code> weirdness (the one that kicked all this off)
-* [[Numerics/Monte Carlo]] — Monte Carlo methods DONE! RIGHT!
+* '''[[Numerics/Statistical Descriptions of Data]]''' - Making sense of the numbers once you've generated them
-* [[Numerics/Statistical Descriptions of Data]] — DESCRIPTIVE! STATISTICS!
-* [[Numerics/Classification and Inference]] — Bayesian AND frequentist INFERENCE!
-* [https://docs.python.org/3/library/random.html Python <code>random</code> documentation] — READ! THE! DOCS!
-* [https://docs.python.org/3/library/secrets.html Python <code>secrets</code> documentation] — READ! THEM!
-* [https://numpy.org/doc/stable/reference/random/index.html NumPy random documentation] — READ! THEM! TOO!
-* [https://www.pcg-random.org/ PCG Random — the PCG family explained] — PCG! PCG! PCG!
 [[Category:Numerics]]
 [[Category:Python]]
 [[Category:Random]]
+{{NumericsFlag}}
-'''DEVELOPERS! DEVELOPERS! DEVELOPERS! DEVELOPERS!'''
-''(faints)''

From charlesreid1
