Probability and Statistics

1. Probability Spaces

1.1 Sample Spaces and Events

A probability space is a triple $(\Omega, \mathcal{F}, P)$ where:

  • $\Omega$ is the sample space (the set of all possible outcomes).
  • $\mathcal{F}$ is a sigma-algebra (collection of events) on $\Omega$.
  • $P : \mathcal{F} \to [0,1]$ is a probability measure.

1.2 Axioms of Probability (Kolmogorov)

  1. Non-negativity: $P(A) \geq 0$ for all $A \in \mathcal{F}$.
  2. Normalisation: $P(\Omega) = 1$.
  3. Countable additivity: if $A_1, A_2, \ldots$ are pairwise disjoint events, then

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$

1.3 Basic Properties

Proposition 1.1. $P(\emptyset) = 0$.

Proof. $\Omega = \Omega \cup \emptyset$ is a disjoint union, so additivity gives $P(\Omega) = P(\Omega) + P(\emptyset)$, hence $P(\emptyset) = 0$. $\blacksquare$

Proposition 1.2 (Complement). $P(A^c) = 1 - P(A)$.

Proposition 1.3 (Monotonicity). If $A \subseteq B$, then $P(A) \leq P(B)$.

Proposition 1.4 (Inclusion-Exclusion). For any two events $A, B$:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Proposition 1.5 (Boole's Inequality). $P(A \cup B) \leq P(A) + P(B)$. More generally,

$$P\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i=1}^n P(A_i)$$
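Inclusion-exclusion and Boole's inequality can be checked exactly on a small finite sample space. A minimal sketch in Python; the fair die and the events $A$ = "even", $B$ = "at least 4" are illustrative choices, not from the text:

```python
from fractions import Fraction

# Exact probabilities on a fair six-sided die, using rationals to avoid rounding.
omega = set(range(1, 7))
A = {x for x in omega if x % 2 == 0}   # {2, 4, 6}
B = {x for x in omega if x >= 4}       # {4, 5, 6}

def P(event):
    return Fraction(len(event), len(omega))

lhs = P(A | B)                         # P(A ∪ B)
rhs = P(A) + P(B) - P(A & B)           # inclusion-exclusion
boole_ok = P(A | B) <= P(A) + P(B)     # Boole's inequality
```

Here both sides equal $4/6$, and Boole's bound $P(A) + P(B) = 1$ holds with room to spare.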

1.4 Conditional Probability

Definition. The conditional probability of $A$ given $B$ (where $P(B) > 0$) is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Theorem 1.1 (Law of Total Probability). If $B_1, \ldots, B_n$ partition $\Omega$ with $P(B_i) > 0$:

$$P(A) = \sum_{i=1}^n P(A \mid B_i) P(B_i)$$

Theorem 1.2 (Bayes' Theorem). For events $A$ and $B$ with $P(B) > 0$:

$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$$

In the partition form:

$$P(B_j \mid A) = \frac{P(A \mid B_j) P(B_j)}{\sum_{i=1}^n P(A \mid B_i) P(B_i)}$$
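The partition form is straightforward to compute directly. A minimal sketch in Python; the prevalence and test-accuracy figures are made up for a hypothetical diagnostic test and do not come from the text:

```python
def bayes_posterior(priors, likelihoods):
    """P(B_j | A) for each j, given priors P(B_j) and likelihoods P(A | B_j)."""
    # The denominator is the law of total probability: sum_i P(A | B_i) P(B_i).
    total = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / total for p, l in zip(priors, likelihoods)]

# B_1 = "has condition", B_2 = "does not"; A = "test positive" (illustrative numbers).
priors = [0.01, 0.99]
likelihoods = [0.95, 0.05]
posterior = bayes_posterior(priors, likelihoods)
```

Even with a 95%-sensitive test, the posterior $P(B_1 \mid A)$ is only about 0.16 here, because the prior is small.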

2. Random Variables

2.1 Definition

A random variable is a measurable function $X : \Omega \to \mathbb{R}$.

  • Discrete random variable: takes values in a countable set.
  • Continuous random variable: has a probability density function (PDF).

2.2 Cumulative Distribution Function

The CDF of a random variable $X$ is

$$F_X(x) = P(X \leq x)$$

Properties:

  1. $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$.
  2. $F$ is non-decreasing and right-continuous.
  3. $P(a < X \leq b) = F(b) - F(a)$.

2.3 Probability Mass Function (Discrete)

For a discrete random variable $X$ with values $\{x_1, x_2, \ldots\}$:

$$f_X(x) = P(X = x) = \begin{cases} p_i & \text{if } x = x_i \\ 0 & \text{otherwise} \end{cases}$$

2.4 Probability Density Function (Continuous)

A random variable $X$ is continuous if there exists a function $f_X \geq 0$ such that

$$P(a \leq X \leq b) = \int_a^b f_X(x)\, dx$$

and $\int_{-\infty}^{\infty} f_X(x)\, dx = 1$.

Note: $f_X(x)$ is not a probability; it is a probability density. For continuous $X$, $P(X = x) = 0$ for any individual $x$.

3. Common Distributions

3.1 Discrete Distributions

Bernoulli. $X \sim \mathrm{Bernoulli}(p)$: $P(X = 1) = p$, $P(X = 0) = 1 - p$.

$$E[X] = p, \quad \mathrm{Var}(X) = p(1 - p)$$

Binomial. $X \sim \mathrm{Bin}(n, p)$: the number of successes in $n$ independent Bernoulli trials.

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n$$

$$E[X] = np, \quad \mathrm{Var}(X) = np(1-p)$$

Poisson. $X \sim \mathrm{Poisson}(\lambda)$: models counts of rare events occurring at rate $\lambda$.

$$P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, 2, \ldots$$

$$E[X] = \lambda, \quad \mathrm{Var}(X) = \lambda$$

Proof that $E[X] = \lambda$:

$$E[X] = \sum_{k=0}^{\infty} k \frac{e^{-\lambda} \lambda^k}{k!} = e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^k}{(k-1)!} = e^{-\lambda} \lambda \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{-\lambda} \lambda\, e^{\lambda} = \lambda$$

$\blacksquare$
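The identities $E[X] = \mathrm{Var}(X) = \lambda$ can also be checked numerically by summing the PMF; $\lambda = 3.5$ and the truncation point 100 are arbitrary choices (the terms decay factorially, so the truncated tail is negligible):

```python
import math

def poisson_pmf(k, lam):
    # P(X = k) = e^{-lambda} lambda^k / k!
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 3.5
mean = sum(k * poisson_pmf(k, lam) for k in range(100))
second_moment = sum(k * k * poisson_pmf(k, lam) for k in range(100))
variance = second_moment - mean**2     # Var(X) = E[X^2] - (E[X])^2
```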

3.2 Continuous Distributions

Uniform. $X \sim \mathrm{Uniform}(a, b)$:

$$f(x) = \begin{cases} \frac{1}{b - a} & \text{if } a \leq x \leq b \\ 0 & \text{otherwise} \end{cases}$$

$$E[X] = \frac{a + b}{2}, \quad \mathrm{Var}(X) = \frac{(b-a)^2}{12}$$

Exponential. $X \sim \mathrm{Exp}(\lambda)$:

$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases}$$

$$E[X] = \frac{1}{\lambda}, \quad \mathrm{Var}(X) = \frac{1}{\lambda^2}$$

Normal (Gaussian). $X \sim N(\mu, \sigma^2)$:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \quad x \in \mathbb{R}$$

$$E[X] = \mu, \quad \mathrm{Var}(X) = \sigma^2$$

The standard normal $Z \sim N(0,1)$ has CDF denoted $\Phi(z)$. For any $X \sim N(\mu, \sigma^2)$:

$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$

Theorem 3.1. The sum of independent normal random variables is normal: if $X_i \sim N(\mu_i, \sigma_i^2)$ are independent, then $\sum X_i \sim N\left(\sum \mu_i, \sum \sigma_i^2\right)$.
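Theorem 3.1 can be sanity-checked by simulation; the parameters and sample size below are arbitrary choices:

```python
import random
import statistics

random.seed(0)
# X1 ~ N(1, 2^2) and X2 ~ N(-3, 1.5^2), independent.
# Theory: X1 + X2 ~ N(1 + (-3), 4 + 2.25) = N(-2, 6.25).
n = 200_000
sums = [random.gauss(1, 2) + random.gauss(-3, 1.5) for _ in range(n)]
sample_mean = statistics.fmean(sums)
sample_var = statistics.pvariance(sums)
```

With 200,000 draws the sample mean and variance land close to the theoretical $-2$ and $6.25$.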

4. Expectation, Variance, and Moment Generating Functions

4.1 Expectation

Definition. The expected value of $X$ is

$$E[X] = \begin{cases} \sum_x x\, f_X(x) & \text{(discrete)} \\ \int_{-\infty}^{\infty} x\, f_X(x)\, dx & \text{(continuous)} \end{cases}$$

Proposition 4.1 (LOTUS). For any function $g$ for which the sum or integral converges:

$$E[g(X)] = \begin{cases} \sum_x g(x)\, f_X(x) & \text{(discrete)} \\ \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx & \text{(continuous)} \end{cases}$$

Theorem 4.1 (Properties of Expectation).

  1. $E[aX + b] = aE[X] + b$ (linearity).
  2. $E[X + Y] = E[X] + E[Y]$ (always, even without independence).
  3. If $X$ and $Y$ are independent, $E[XY] = E[X]E[Y]$.

4.2 Variance

$$\mathrm{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$

Theorem 4.2.

  1. $\mathrm{Var}(aX + b) = a^2 \mathrm{Var}(X)$.
  2. If $X, Y$ are independent: $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.

4.3 Moment Generating Functions

The moment generating function (MGF) of $X$ is

$$M_X(t) = E[e^{tX}]$$

(provided the expectation exists in a neighbourhood of $t = 0$).

Theorem 4.3. If $M_X(t)$ exists in a neighbourhood of $0$, then $E[X^n] = M_X^{(n)}(0)$.

Theorem 4.4 (Uniqueness). If $M_X(t) = M_Y(t)$ for all $t$ in a neighbourhood of $0$, then $X$ and $Y$ have the same distribution.

Theorem 4.5. If $X$ and $Y$ are independent, $M_{X+Y}(t) = M_X(t) M_Y(t)$.

Worked Example. Find the MGF of $X \sim \mathrm{Exp}(\lambda)$.

$$M_X(t) = E[e^{tX}] = \int_0^{\infty} e^{tx} \lambda e^{-\lambda x}\, dx = \lambda \int_0^{\infty} e^{-(\lambda - t)x}\, dx$$

For $t < \lambda$ the integral converges and $M_X(t) = \frac{\lambda}{\lambda - t}$.

$M_X'(t) = \frac{\lambda}{(\lambda - t)^2}$, so $E[X] = M_X'(0) = 1/\lambda$. $\blacksquare$
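The closed form $\lambda/(\lambda - t)$ can be checked against a Monte Carlo estimate of $E[e^{tX}]$; $\lambda = 2$ and $t = 0.5$ are arbitrary choices satisfying $t < \lambda$:

```python
import math
import random
import statistics

random.seed(1)
lam, t = 2.0, 0.5                      # must satisfy t < lam
samples = [random.expovariate(lam) for _ in range(100_000)]
mgf_mc = statistics.fmean(math.exp(t * x) for x in samples)
mgf_exact = lam / (lam - t)            # = 4/3
```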

5. Joint Distributions

5.1 Joint PDF and CDF

For two random variables $X$ and $Y$, the joint CDF is $F_{X,Y}(x,y) = P(X \leq x, Y \leq y)$.

The joint PDF (in the continuous case) satisfies $P((X,Y) \in A) = \iint_A f_{X,Y}(x,y)\, dx\, dy$.

5.2 Marginal Distributions

The marginal PDF of $X$ is obtained by integrating out $Y$:

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dy$$

5.3 Independence

$X$ and $Y$ are independent if and only if

$$f_{X,Y}(x,y) = f_X(x) f_Y(y) \quad \text{for all } x, y$$

Theorem 5.1. If $X$ and $Y$ are independent, then $E[XY] = E[X]E[Y]$ and $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.

5.4 Covariance and Correlation

$$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$$

The correlation coefficient is

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$

Properties:

  • $-1 \leq \rho_{X,Y} \leq 1$.
  • $\rho = \pm 1$ if and only if $Y = aX + b$ almost surely for some $a \neq 0$ (a perfect linear relationship).
  • $\rho = 0$ does not imply independence; it only says $X$ and $Y$ are uncorrelated.

6. Limit Theorems

6.1 Law of Large Numbers

Theorem 6.1 (Weak Law of Large Numbers). Let $X_1, X_2, \ldots$ be i.i.d. with $E[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 < \infty$. Then for every $\varepsilon > 0$:

$$\lim_{n \to \infty} P\left(\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right| > \varepsilon\right) = 0$$

Proof. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Then $E[\bar{X}_n] = \mu$ and $\mathrm{Var}(\bar{X}_n) = \sigma^2/n$. By Chebyshev's inequality:

$$P(|\bar{X}_n - \mu| > \varepsilon) \leq \frac{\mathrm{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0$$

$\blacksquare$
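The proof suggests a direct experiment: estimate $P(|\bar{X}_n - \mu| > \varepsilon)$ by repeated sampling and compare it with the Chebyshev bound $\sigma^2/(n\varepsilon^2)$. The sketch below uses $\mathrm{Uniform}(0,1)$ draws ($\mu = 1/2$, $\sigma^2 = 1/12$); the values of $\varepsilon$, $n$, and the trial count are arbitrary choices:

```python
import random

random.seed(2)
mu, var = 0.5, 1 / 12                  # Uniform(0, 1): mean and variance
eps, trials = 0.05, 2000

def sample_mean(n):
    return sum(random.random() for _ in range(n)) / n

freqs, bounds = {}, {}
for n in (10, 100, 1000):
    hits = sum(abs(sample_mean(n) - mu) > eps for _ in range(trials))
    freqs[n] = hits / trials           # empirical P(|mean - mu| > eps)
    bounds[n] = var / (n * eps**2)     # Chebyshev bound from the proof
```

The empirical frequency shrinks towards zero as $n$ grows, and never exceeds the Chebyshev bound.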

Theorem 6.2 (Strong Law of Large Numbers). Under the same hypotheses:

$$P\left(\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^n X_i = \mu\right) = 1$$

6.2 Central Limit Theorem

Theorem 6.3 (Central Limit Theorem). Let $X_1, X_2, \ldots$ be i.i.d. with $E[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 > 0$. Then

$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1)$$

That is, for all $a < b$:

$$\lim_{n \to \infty} P\left(a < \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} < b\right) = \Phi(b) - \Phi(a)$$

6.3 Worked Example

Problem. A factory produces light bulbs with mean lifetime 1000 hours and standard deviation 200 hours. What is the probability that the mean lifetime of 100 bulbs exceeds 1040 hours?

Solution. By the CLT, $\bar{X}_{100}$ is approximately $N(1000, 200^2/100) = N(1000, 400)$, so its standard deviation is $20$.

$$P(\bar{X} > 1040) = P\left(Z > \frac{1040 - 1000}{20}\right) = P(Z > 2) \approx 0.0228$$

$\blacksquare$
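The arithmetic above can be reproduced by expressing the standard normal CDF through the error function, $\Phi(z) = \tfrac{1}{2}(1 + \mathrm{erf}(z/\sqrt{2}))$:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma, n = 1000, 200, 100
se = sigma / math.sqrt(n)              # standard error = 20
z = (1040 - mu) / se                   # = 2.0
p = 1 - phi(z)                         # ≈ 0.0228
```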

7. Maximum Likelihood Estimation

7.1 Likelihood Function

Given i.i.d. observations $x_1, \ldots, x_n$ from a distribution with parameter $\theta$, the likelihood function is

$$L(\theta) = \prod_{i=1}^n f(x_i \mid \theta)$$

and the log-likelihood is

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log f(x_i \mid \theta)$$

7.2 MLE Procedure

The maximum likelihood estimator (MLE) $\hat{\theta}_{\mathrm{MLE}}$ maximises $L(\theta)$ (equivalently, $\ell(\theta)$):

$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta L(\theta)$$

It is typically found by solving $\ell'(\theta) = 0$ and verifying $\ell''(\hat{\theta}) < 0$.

7.3 Worked Example

Problem. Find the MLE for $\lambda$ given i.i.d. observations $x_1, \ldots, x_n$ from $\mathrm{Exp}(\lambda)$.

Solution.

$$L(\lambda) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum x_i}$$

$$\ell(\lambda) = n \log \lambda - \lambda \sum_{i=1}^n x_i$$

$$\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^n x_i = 0 \implies \hat{\lambda} = \frac{n}{\sum x_i} = \frac{1}{\bar{x}}$$

Verify: $\frac{d^2\ell}{d\lambda^2} = -\frac{n}{\lambda^2} < 0$, confirming this is a maximum. $\blacksquare$
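A numeric sanity check: draw synthetic $\mathrm{Exp}(\lambda)$ data ($\lambda = 2$ is an arbitrary true value), compute $\hat{\lambda} = 1/\bar{x}$, and confirm the log-likelihood dips on either side of it:

```python
import math
import random

random.seed(3)
true_lam, n = 2.0, 5000
xs = [random.expovariate(true_lam) for _ in range(n)]

def loglik(lam):
    # ell(lambda) = n log(lambda) - lambda * sum(x_i)
    return n * math.log(lam) - lam * sum(xs)

lam_hat = n / sum(xs)                  # MLE: 1 / sample mean
```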

Common Pitfall

The MLE is not always unbiased. For example, the MLE $\hat{\sigma}^2 = \frac{1}{n}\sum (X_i - \bar{X})^2$ for the normal variance is biased; the unbiased estimator uses $n - 1$ in the denominator.

8. Hypothesis Testing

8.1 Framework

A hypothesis test evaluates two competing statements:

  • Null hypothesis $H_0$: the status quo (e.g., $\mu = \mu_0$).
  • Alternative hypothesis $H_1$: what we want to show (e.g., $\mu > \mu_0$).

8.2 Test Statistics and Decisions

A test statistic $T$ is a function of the data. We reject $H_0$ if $T$ falls in the rejection region (critical region).

Type I error: rejecting $H_0$ when it is true (false positive). Its probability is $\alpha$, the significance level.

Type II error: failing to reject $H_0$ when it is false (false negative). Its probability is $\beta$.

The power of a test is $1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ is true})$.

8.3 p-Values

The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming $H_0$ is true. We reject $H_0$ if the p-value is less than $\alpha$.

8.4 Z-Test for a Mean

If $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ with known $\sigma$, to test $H_0: \mu = \mu_0$:

$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$

Under $H_0$, $Z \sim N(0, 1)$.

  • For $H_1: \mu > \mu_0$: reject if $Z > z_\alpha$.
  • For $H_1: \mu < \mu_0$: reject if $Z < -z_\alpha$.
  • For $H_1: \mu \neq \mu_0$: reject if $|Z| > z_{\alpha/2}$.
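The three rejection rules correspond to three p-value formulas. A minimal sketch of a one-sample z-test; the function name and interface are my own, not a standard API:

```python
import math

def z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """One-sample z-test with known sigma; returns (z, p_value)."""
    phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))   # standard normal CDF
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    if alternative == "greater":
        p = 1 - phi(z)
    elif alternative == "less":
        p = phi(z)
    else:                              # two-sided
        p = 2 * (1 - phi(abs(z)))
    return z, p
```

Reject $H_0$ whenever the returned p-value falls below the chosen $\alpha$.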

8.5 t-Test for a Mean (Unknown Variance)

If $\sigma$ is unknown, replace it with the sample standard deviation $S$:

$$T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$$

Under $H_0$, $T \sim t_{n-1}$ (Student's t-distribution with $n - 1$ degrees of freedom).

8.6 Worked Example

Problem. A process produces bolts with mean diameter 10 mm. A sample of 25 bolts has mean 10.12 mm and standard deviation 0.3 mm. Test $H_0: \mu = 10$ vs $H_1: \mu \neq 10$ at $\alpha = 0.05$.

Solution. Use the t-test: $T = \frac{10.12 - 10}{0.3/\sqrt{25}} = \frac{0.12}{0.06} = 2$.

Under $H_0$, $T \sim t_{24}$. The two-sided critical value is $t_{24,\,0.025} \approx 2.064$.

Since $|T| = 2 < 2.064$, we fail to reject $H_0$ at the 5% significance level. There is insufficient evidence to conclude that the mean diameter differs from 10 mm. $\blacksquare$
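The computation above, as code; the critical value 2.064 is read from a t-table rather than computed, since the Python standard library has no t-distribution CDF:

```python
import math

xbar, mu0, s, n = 10.12, 10.0, 0.3, 25
t_stat = (xbar - mu0) / (s / math.sqrt(n))   # = 0.12 / 0.06 = 2.0
t_crit = 2.064                               # t_{24, 0.025}, from a table
reject = abs(t_stat) > t_crit                # False: fail to reject H_0
```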

Common Pitfall

"Failing to reject H0H_0" is not the same as "accepting H0H_0". The test only provides evidence against H0H_0; absence of evidence is not evidence of absence. The distinction is critical in scientific reasoning.