Probability and Statistics

1. Probability Spaces
1.1 Sample Spaces and Events
A probability space is a triple $(\Omega, \mathcal{F}, P)$ where:
- $\Omega$ is the sample space (the set of all possible outcomes).
- $\mathcal{F}$ is a sigma-algebra (a collection of events) on $\Omega$.
- $P : \mathcal{F} \to [0,1]$ is a probability measure.
1.2 Axioms of Probability (Kolmogorov)
- Non-negativity: $P(A) \geq 0$ for all $A \in \mathcal{F}$.
- Normalisation: $P(\Omega) = 1$.
- Countable additivity: if $A_1, A_2, \ldots$ are pairwise disjoint events, then
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$
1.3 Basic Properties
Proposition 1.1. $P(\emptyset) = 0$.

Proof. $\Omega = \Omega \cup \emptyset$ is a disjoint union, so $P(\Omega) = P(\Omega) + P(\emptyset)$, hence $P(\emptyset) = 0$. $\blacksquare$
Proposition 1.2 (Complement). $P(A^c) = 1 - P(A)$.

Proposition 1.3 (Monotonicity). If $A \subseteq B$, then $P(A) \leq P(B)$.

Proposition 1.4 (Inclusion-Exclusion). For any two events $A, B$:
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Proposition 1.5 (Boole's Inequality). $P(A \cup B) \leq P(A) + P(B)$. More generally,
$$P\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i=1}^n P(A_i)$$
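Inclusion-exclusion and Boole's inequality can be checked exactly by enumerating a small sample space. A minimal Python sketch, using two made-up events on a pair of fair dice (the events $A$, $B$ here are illustrative, not from the text):

```python
from fractions import Fraction
from itertools import product

# Omega for two fair dice: 36 equally likely outcomes.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) under the uniform measure on Omega, as an exact fraction."""
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

A = lambda w: w[0] % 2 == 0      # hypothetical event: first die even
B = lambda w: w[0] + w[1] > 7    # hypothetical event: sum exceeds 7

lhs = prob(lambda w: A(w) or B(w))                        # P(A ∪ B)
rhs = prob(A) + prob(B) - prob(lambda w: A(w) and B(w))   # inclusion-exclusion
print(lhs, rhs)   # both 2/3; and lhs <= prob(A) + prob(B), as Boole guarantees
```

Working with `Fraction` keeps the check exact, so the two sides agree to equality rather than floating-point tolerance.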
1.4 Conditional Probability
Definition. The conditional probability of $A$ given $B$ (where $P(B) > 0$) is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
Theorem 1.1 (Law of Total Probability). If $B_1, \ldots, B_n$ partition $\Omega$ with $P(B_i) > 0$:
$$P(A) = \sum_{i=1}^n P(A \mid B_i) P(B_i)$$

Theorem 1.2 (Bayes' Theorem). For events $A$ and $B$ with $P(B) > 0$:
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$$

In the partition form:
$$P(B_j \mid A) = \frac{P(A \mid B_j) P(B_j)}{\sum_{i=1}^n P(A \mid B_i) P(B_i)}$$
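The partition form is exactly how Bayes' theorem is applied in practice. A short sketch with hypothetical diagnostic-test numbers (the prevalence, sensitivity, and false-positive rate below are assumptions, chosen for illustration):

```python
# B1 = "disease", B2 = "no disease" partition Omega; A = "test positive".
p_disease = 0.01                 # P(B1): assumed prevalence
p_pos_given_disease = 0.95       # P(A | B1): assumed sensitivity
p_pos_given_healthy = 0.10       # P(A | B2) = 1 - specificity, assumed

# Denominator via the law of total probability: P(A) = sum_i P(A|Bi) P(Bi).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

posterior = p_pos_given_disease * p_disease / p_pos   # P(B1 | A)
print(round(posterior, 4))                            # 0.0876
```

Note how a quite accurate test still yields a small posterior when the prior $P(B_1)$ is small.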
2. Random Variables
2.1 Definition
A random variable is a measurable function $X : \Omega \to \mathbb{R}$.
- Discrete random variable: takes values in a countable set.
- Continuous random variable: has a probability density function (PDF).
2.2 Cumulative Distribution Function
The CDF of a random variable $X$ is
$$F_X(x) = P(X \leq x)$$

Properties:
- $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$.
- $F$ is non-decreasing and right-continuous.
- $P(a < X \leq b) = F(b) - F(a)$.
2.3 Probability Mass Function (Discrete)
For a discrete random variable $X$ with values $\{x_1, x_2, \ldots\}$:
$$f_X(x) = P(X = x) = \begin{cases} p_i & \text{if } x = x_i \\ 0 & \text{otherwise} \end{cases}$$
2.4 Probability Density Function (Continuous)
A random variable $X$ is continuous if there exists a function $f_X \geq 0$ such that
$$P(a \leq X \leq b) = \int_a^b f_X(x)\, dx$$
and $\int_{-\infty}^{\infty} f_X(x)\, dx = 1$.

Note: $f_X(x)$ is not a probability; it is a probability density. For continuous $X$, $P(X = x) = 0$ for any individual $x$.
3. Common Distributions
3.1 Discrete Distributions
Bernoulli. $X \sim \mathrm{Bernoulli}(p)$: $P(X = 1) = p$, $P(X = 0) = 1 - p$.
$$E[X] = p, \quad \mathrm{Var}(X) = p(1 - p)$$
Binomial. $X \sim \mathrm{Bin}(n, p)$: the number of successes in $n$ independent Bernoulli trials.
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n$$
$$E[X] = np, \quad \mathrm{Var}(X) = np(1-p)$$
Poisson. $X \sim \mathrm{Poisson}(\lambda)$: models rare events occurring at rate $\lambda$.
$$P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, 2, \ldots$$
$$E[X] = \lambda, \quad \mathrm{Var}(X) = \lambda$$

Proof that $E[X] = \lambda$:
$$E[X] = \sum_{k=0}^{\infty} k \frac{e^{-\lambda} \lambda^k}{k!} = e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^k}{(k-1)!} = e^{-\lambda} \lambda \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = \lambda \quad \blacksquare$$
3.2 Continuous Distributions
Uniform. $X \sim \mathrm{Uniform}(a, b)$:
$$f(x) = \begin{cases} \frac{1}{b - a} & \text{if } a \leq x \leq b \\ 0 & \text{otherwise} \end{cases}$$
$$E[X] = \frac{a + b}{2}, \quad \mathrm{Var}(X) = \frac{(b-a)^2}{12}$$
Exponential. $X \sim \mathrm{Exp}(\lambda)$:
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases}$$
$$E[X] = \frac{1}{\lambda}, \quad \mathrm{Var}(X) = \frac{1}{\lambda^2}$$
Normal (Gaussian). $X \sim N(\mu, \sigma^2)$:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \quad x \in \mathbb{R}$$
$$E[X] = \mu, \quad \mathrm{Var}(X) = \sigma^2$$

The standard normal $Z \sim N(0,1)$ has CDF denoted $\Phi(z)$. For any $X \sim N(\mu, \sigma^2)$:
$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$
Theorem 3.1. The sum of independent normal random variables is normal: if $X_i \sim N(\mu_i, \sigma_i^2)$ are independent, then
$$\sum X_i \sim N\left(\sum \mu_i, \sum \sigma_i^2\right)$$
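Theorem 3.1 is easy to probe by simulation. A Monte Carlo sketch with toy parameters $X_1 \sim N(1, 4)$ and $X_2 \sim N(-3, 2.25)$, whose independent sum should be $N(-2, 6.25)$:

```python
import random
import statistics

random.seed(0)
n = 200_000
# Independent draws: note random.gauss takes (mu, sigma), i.e. the
# standard deviation, not the variance.
sums = [random.gauss(1, 2) + random.gauss(-3, 1.5) for _ in range(n)]

print(statistics.fmean(sums))      # ≈ -2    (= 1 + (-3))
print(statistics.pvariance(sums))  # ≈ 6.25  (= 4 + 2.25)
```

The simulation checks only the mean and variance; that the sum is itself normal is the content of the theorem (provable via MGFs, Section 4.3).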
4. Expectation, Variance, and Moment Generating Functions
4.1 Expectation
Definition. The expected value of $X$ is
$$E[X] = \begin{cases} \sum_x x\, f_X(x) & \text{(discrete)} \\ \int_{-\infty}^{\infty} x\, f_X(x)\, dx & \text{(continuous)} \end{cases}$$
Proposition 4.1 (LOTUS). For any function $g$:
$$E[g(X)] = \begin{cases} \sum_x g(x)\, f_X(x) & \text{(discrete)} \\ \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx & \text{(continuous)} \end{cases}$$
Theorem 4.1 (Properties of Expectation).
- $E[aX + b] = aE[X] + b$ (linearity).
- $E[X + Y] = E[X] + E[Y]$ (always, even without independence).
- If $X$ and $Y$ are independent, $E[XY] = E[X]E[Y]$.
4.2 Variance
$$\mathrm{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$
Theorem 4.2.
- $\mathrm{Var}(aX + b) = a^2 \mathrm{Var}(X)$.
- If $X, Y$ are independent: $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.
4.3 Moment Generating Functions
The moment generating function (MGF) of $X$ is
$$M_X(t) = E[e^{tX}]$$
(provided the expectation exists in a neighbourhood of $t = 0$).
Theorem 4.3. If $M_X(t)$ exists in a neighbourhood of $0$, then $E[X^n] = M_X^{(n)}(0)$.

Theorem 4.4 (Uniqueness). If $M_X(t) = M_Y(t)$ for all $t$ in a neighbourhood of $0$, then $X$ and $Y$ have the same distribution.

Theorem 4.5. If $X$ and $Y$ are independent, $M_{X+Y}(t) = M_X(t) M_Y(t)$.
Worked Example. Find the MGF of $X \sim \mathrm{Exp}(\lambda)$.
$$M_X(t) = E[e^{tX}] = \int_0^{\infty} e^{tx} \lambda e^{-\lambda x}\, dx = \lambda \int_0^{\infty} e^{-(\lambda - t)x}\, dx$$
For $t < \lambda$: $M_X(t) = \frac{\lambda}{\lambda - t}$. Then $M_X'(t) = \frac{\lambda}{(\lambda - t)^2}$, so $E[X] = M_X'(0) = 1/\lambda$. $\blacksquare$
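The closed form $M_X(t) = \lambda/(\lambda - t)$ can be sanity-checked by estimating $E[e^{tX}]$ directly from simulated draws, at one arbitrary toy point $(\lambda, t) = (2, 0.5)$ with $t < \lambda$:

```python
import math
import random
import statistics

random.seed(1)
lam, t = 2.0, 0.5
# random.expovariate takes the rate lambda (mean 1/lambda).
xs = [random.expovariate(lam) for _ in range(200_000)]

mgf_mc = statistics.fmean([math.exp(t * x) for x in xs])  # E[e^{tX}] estimate
mgf_closed = lam / (lam - t)                              # = 4/3
print(mgf_mc, mgf_closed)
```

For $t \geq \lambda$ the integral diverges, which is why the worked example restricts to $t < \lambda$.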
5. Joint Distributions
5.1 Joint PDF and CDF
For two random variables $X$ and $Y$, the joint CDF is $F_{X,Y}(x,y) = P(X \leq x, Y \leq y)$. The joint PDF (in the continuous case) satisfies $P((X,Y) \in A) = \iint_A f_{X,Y}(x,y)\, dx\, dy$.
5.2 Marginal Distributions
The marginal PDF of $X$ is obtained by integrating out $Y$:
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dy$$
5.3 Independence
$X$ and $Y$ are independent if and only if
$$f_{X,Y}(x,y) = f_X(x) f_Y(y) \quad \text{for all } x, y$$
Theorem 5.1. If $X$ and $Y$ are independent, then $E[XY] = E[X]E[Y]$ and $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.
5.4 Covariance and Correlation
$$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$$

The correlation coefficient is
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
Properties:
- $-1 \leq \rho_{X,Y} \leq 1$.
- $\rho = \pm 1$ if and only if $Y = aX + b$ almost surely for some constants $a \neq 0$ and $b$.
- $\rho = 0$ does not imply independence (only uncorrelatedness).
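The standard counterexample for the last point: take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$. Then $\mathrm{Cov}(X, Y) = E[X^3] - E[X]E[X^2] = 0$, yet $Y$ is a deterministic function of $X$. An exact check:

```python
from fractions import Fraction

support = [-1, 0, 1]
p = Fraction(1, 3)                          # uniform on the support

E = lambda g: sum(p * g(x) for x in support)

EX, EY = E(lambda x: x), E(lambda x: x**2)  # E[X] = 0, E[Y] = 2/3
EXY = E(lambda x: x * x**2)                 # E[XY] = E[X^3] = 0
cov = EXY - EX * EY
print(cov)                                  # 0, yet Y = X^2 depends on X

# Dependence: P(Y = 0 | X = 0) = 1, but the marginal P(Y = 0) = 1/3.
```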
6. Limit Theorems
6.1 Law of Large Numbers
Theorem 6.1 (Weak Law of Large Numbers). Let $X_1, X_2, \ldots$ be i.i.d. with $E[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 < \infty$. Then for every $\varepsilon > 0$:
$$\lim_{n \to \infty} P\left(\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right| > \varepsilon\right) = 0$$

Proof. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Then $E[\bar{X}_n] = \mu$ and $\mathrm{Var}(\bar{X}_n) = \sigma^2/n$. By Chebyshev's inequality:
$$P(|\bar{X}_n - \mu| > \varepsilon) \leq \frac{\mathrm{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0 \quad \blacksquare$$
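The Chebyshev bound in the proof can be observed empirically. A sketch with toy choices ($\mathrm{Uniform}(0,1)$ draws, so $\mu = 1/2$ and $\sigma^2 = 1/12$, with $n = 500$ and $\varepsilon = 0.05$):

```python
import random

random.seed(2)
n, runs, eps = 500, 2000, 0.05
mu, var = 0.5, 1 / 12

# Count how often the sample mean misses mu by more than eps.
misses = 0
for _ in range(runs):
    xbar = sum(random.random() for _ in range(n)) / n
    if abs(xbar - mu) > eps:
        misses += 1

bound = var / (n * eps**2)        # sigma^2 / (n eps^2), Chebyshev's bound
print(misses / runs, bound)       # empirical frequency sits below the bound
```

Chebyshev's bound is loose here: the observed frequency is far smaller than $\sigma^2/(n\varepsilon^2)$, which is typical, but the bound is all the proof needs.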
Theorem 6.2 (Strong Law of Large Numbers). Under the same hypotheses:
$$P\left(\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^n X_i = \mu\right) = 1$$
6.2 Central Limit Theorem
Theorem 6.3 (Central Limit Theorem). Let $X_1, X_2, \ldots$ be i.i.d. with $E[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 > 0$. Then
$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1)$$
That is, for all $a < b$:
$$\lim_{n \to \infty} P\left(a < \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} < b\right) = \Phi(b) - \Phi(a)$$
6.3 Worked Example
Problem. A factory produces light bulbs with mean lifetime 1000 hours and standard deviation 200 hours. What is the probability that the mean lifetime of 100 bulbs exceeds 1040 hours?

Solution. By the CLT, $\bar{X}_{100} \approx N(1000, 200^2/100) = N(1000, 400)$, so
$$P(\bar{X} > 1040) = P\left(Z > \frac{1040 - 1000}{20}\right) = P(Z > 2) \approx 0.0228 \quad \blacksquare$$
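The tail probability $P(Z > 2)$ can be computed rather than looked up, using the identity $\Phi(z) = \tfrac{1}{2}\left(1 + \operatorname{erf}(z/\sqrt{2})\right)$ for the standard-normal CDF:

```python
import math

mu, sigma, n = 1000, 200, 100
se = sigma / math.sqrt(n)          # standard error: 200 / 10 = 20
z = (1040 - mu) / se               # 2.0

phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))   # Phi(2)
print(round(1 - phi, 4))                       # 0.0228
```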
7. Maximum Likelihood Estimation
7.1 Likelihood Function
Given i.i.d. observations $x_1, \ldots, x_n$ from a distribution with parameter $\theta$, the likelihood function is
$$L(\theta) = \prod_{i=1}^n f(x_i \mid \theta)$$
and the log-likelihood is
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log f(x_i \mid \theta)$$
7.2 MLE Procedure
The maximum likelihood estimator (MLE) $\hat{\theta}_{\mathrm{MLE}}$ maximises $L(\theta)$ (equivalently, $\ell(\theta)$):
$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta L(\theta)$$
It is typically found by solving $\ell'(\theta) = 0$ and verifying $\ell''(\hat{\theta}) < 0$.
7.3 Worked Example
Problem. Find the MLE for $\lambda$ given i.i.d. observations $x_1, \ldots, x_n$ from $\mathrm{Exp}(\lambda)$.

Solution.
$$L(\lambda) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum x_i}$$
$$\ell(\lambda) = n \log \lambda - \lambda \sum_{i=1}^n x_i$$
$$\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^n x_i = 0 \implies \hat{\lambda} = \frac{n}{\sum x_i} = \frac{1}{\bar{x}}$$
Verify: $\frac{d^2\ell}{d\lambda^2} = -\frac{n}{\lambda^2} < 0$, confirming this is a maximum. $\blacksquare$
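A quick simulation sanity check of $\hat{\lambda} = 1/\bar{x}$, with a toy true rate $\lambda = 2.5$: estimates from a large sample should land near the value used to generate the data.

```python
import random
import statistics

random.seed(3)
lam_true = 2.5
# random.expovariate is parameterised by the rate, matching Exp(lambda).
xs = [random.expovariate(lam_true) for _ in range(100_000)]

lam_hat = 1 / statistics.fmean(xs)   # the MLE: reciprocal of the sample mean
print(lam_hat)                       # ≈ 2.5
```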
The MLE is not always unbiased. For example, the MLE $\hat{\sigma}^2 = \frac{1}{n}\sum (X_i - \bar{X})^2$ for the normal variance is biased; the unbiased estimator uses $n - 1$ in the denominator.
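The bias is visible in simulation: averaging $\hat{\sigma}^2$ over many small samples lands near $\frac{n-1}{n}\sigma^2$ rather than $\sigma^2$ (toy parameters below: $n = 5$, $\sigma^2 = 4$):

```python
import random

random.seed(4)
n, reps, sigma2 = 5, 50_000, 4.0

acc = 0.0
for _ in range(reps):
    xs = [random.gauss(0, 2) for _ in range(n)]       # N(0, 4) draws
    xbar = sum(xs) / n
    acc += sum((x - xbar) ** 2 for x in xs) / n       # MLE: divide by n

print(acc / reps)   # ≈ (4/5) * 4 = 3.2, systematically below sigma^2 = 4
```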
8. Hypothesis Testing
8.1 Framework
A hypothesis test evaluates two competing statements:
- Null hypothesis $H_0$: the status quo (e.g., $\mu = \mu_0$).
- Alternative hypothesis $H_1$: what we want to show (e.g., $\mu > \mu_0$).
8.2 Test Statistics and Decisions
A test statistic $T$ is a function of the data. We reject $H_0$ if $T$ falls in the rejection region (critical region).
- Type I error: rejecting $H_0$ when it is true (false positive). Probability $= \alpha$ (the significance level).
- Type II error: failing to reject $H_0$ when it is false (false negative). Probability $= \beta$.

The power of a test is $1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ is true})$.
8.3 p-Values
The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming $H_0$ is true. We reject $H_0$ if the p-value is less than $\alpha$.
8.4 Z-Test for a Mean
If $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ with known $\sigma$, to test $H_0: \mu = \mu_0$ use
$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$
Under $H_0$, $Z \sim N(0, 1)$.
- For $H_1: \mu > \mu_0$: reject if $Z > z_\alpha$.
- For $H_1: \mu < \mu_0$: reject if $Z < -z_\alpha$.
- For $H_1: \mu \neq \mu_0$: reject if $|Z| > z_{\alpha/2}$.
8.5 t-Test for a Mean (Unknown Variance)
If $\sigma$ is unknown, replace $\sigma$ with the sample standard deviation $S$:
$$T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$$
Under $H_0$, $T \sim t_{n-1}$ (Student's t-distribution with $n - 1$ degrees of freedom).
8.6 Worked Example
Problem. A process produces bolts with mean diameter 10 mm. A sample of 25 bolts has mean 10.12 mm and standard deviation 0.3 mm. Test $H_0: \mu = 10$ vs $H_1: \mu \neq 10$ at $\alpha = 0.05$.

Solution. Use the t-test: $T = \frac{10.12 - 10}{0.3/\sqrt{25}} = \frac{0.12}{0.06} = 2$. Under $H_0$, $T \sim t_{24}$, and the two-sided critical value is $t_{24,\,0.025} \approx 2.064$. Since $|T| = 2 < 2.064$, we fail to reject $H_0$ at the 5% significance level: there is insufficient evidence to conclude that the mean diameter differs from 10 mm. $\blacksquare$
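The bolt calculation as code, working from the summary statistics alone; the critical value 2.064 for $t_{24}$ at two-sided $\alpha = 0.05$ is taken from tables, as in the text (the standard library has no t-distribution quantile function):

```python
import math

n, xbar, s, mu0 = 25, 10.12, 0.3, 10.0
t_stat = (xbar - mu0) / (s / math.sqrt(n))   # 0.12 / 0.06 = 2
t_crit = 2.064                               # t_{24, 0.025}, from tables

print(round(t_stat, 2))                      # 2.0
print("reject" if abs(t_stat) > t_crit else "fail to reject")
```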
"Failing to reject $H_0$" is not the same as "accepting $H_0$". The test only provides evidence against $H_0$; absence of evidence is not evidence of absence. The distinction is critical in scientific reasoning.