Ω is the sample space (set of all possible outcomes).
F is a sigma-algebra (collection of events) on Ω.
P:F→[0,1] is a probability measure.
1.2 Sigma-Algebras
Definition. A sigma-algebra (or σ-algebra) $\mathcal{F}$ on a set $\Omega$ is a collection of subsets of $\Omega$ satisfying:
1. $\Omega \in \mathcal{F}$.
2. If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$ (closed under complementation).
3. If $A_1, A_2, A_3, \ldots \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$ (closed under countable unions).
Remark. From (2) and (3), $\mathcal{F}$ is also closed under countable intersections (by De Morgan's laws). The pair $(\Omega, \mathcal{F})$ is called a measurable space.
Example 1.1. For any set $\Omega$, the trivial sigma-algebra is $\{\emptyset, \Omega\}$, and the power set $\mathcal{P}(\Omega)$ is also a sigma-algebra.
Example 1.2. If $\Omega = \{1,2,3,4,5,6\}$ (a fair die), then $\mathcal{F} = \mathcal{P}(\Omega)$ contains all $2^6 = 64$ subsets. The power set is the standard choice of sigma-algebra for finite sample spaces.
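As a small sanity check of Example 1.2, the sketch below enumerates the power set of the die's sample space and tests the three sigma-algebra axioms by brute force. The variable names are illustrative, and the union check relies on the fact that on a finite space countable unions reduce to pairwise unions.

```python
from itertools import chain, combinations

omega = frozenset(range(1, 7))  # sample space of one die roll

# Power set: every subset of omega, of every size 0..6.
power_set = {
    frozenset(s)
    for s in chain.from_iterable(combinations(omega, r) for r in range(len(omega) + 1))
}

print(len(power_set))  # 64

# Sigma-algebra axioms (pairwise unions suffice because omega is finite).
contains_omega = omega in power_set
closed_complement = all(omega - a in power_set for a in power_set)
closed_union = all(a | b in power_set for a in power_set for b in power_set)
print(contains_omega, closed_complement, closed_union)  # True True True
```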
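Example 1.3. For $\Omega = \mathbb{R}$, the Borel sigma-algebra $\mathcal{B}$ is the smallest sigma-algebra containing all open intervals. Equivalently, it is the sigma-algebra generated by the open sets, so it is closed under complements and under countable unions and intersections. We write $(\mathbb{R}, \mathcal{B})$.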
Proposition 1.0. The intersection of any collection of sigma-algebras on Ω is a sigma-algebra.
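Proof. Let $\{\mathcal{F}_\alpha\}$ be a collection of sigma-algebras on $\Omega$. Then: (1) $\Omega \in \mathcal{F}_\alpha$ for all $\alpha$, so $\Omega \in \bigcap_\alpha \mathcal{F}_\alpha$. (2) If $A \in \bigcap_\alpha \mathcal{F}_\alpha$, then $A \in \mathcal{F}_\alpha$ for all $\alpha$, so $A^c \in \mathcal{F}_\alpha$ for all $\alpha$, hence $A^c \in \bigcap_\alpha \mathcal{F}_\alpha$. (3) If $A_1, A_2, \ldots \in \bigcap_\alpha \mathcal{F}_\alpha$, then every $\mathcal{F}_\alpha$ contains each $A_i$ and therefore their union, so $\bigcup_i A_i \in \bigcap_\alpha \mathcal{F}_\alpha$. ■
Intuition. This proposition guarantees that for any collection of subsets $\mathcal{E}$, there exists a smallest sigma-algebra containing $\mathcal{E}$ (the intersection of all sigma-algebras that contain it), called the sigma-algebra generated by $\mathcal{E}$ and denoted $\sigma(\mathcal{E})$.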
1.3 Axioms of Probability (Kolmogorov)
Non-negativity: P(A)≥0 for all A∈F.
Normalisation: P(Ω)=1.
Countable additivity: If $A_1, A_2, \ldots$ are pairwise disjoint events, then
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$
1.4 Basic Properties
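Proposition 1.1. $P(\emptyset) = 0$.
Proof. Write $\Omega = \Omega \cup \emptyset \cup \emptyset \cup \cdots$ (a disjoint union). Countable additivity gives $P(\Omega) = P(\Omega) + \sum_{i=1}^{\infty} P(\emptyset)$, which is only possible if $P(\emptyset) = 0$. In particular, finite additivity follows by padding a finite disjoint union with copies of $\emptyset$. ■
Proposition 1.2 (Complement). $P(A^c) = 1 - P(A)$.
Proposition 1.3 (Monotonicity). If $A \subseteq B$, then $P(A) \le P(B)$.
Proof. Write $B = A \cup (B \setminus A)$, a disjoint union. By additivity, $P(B) = P(A) + P(B \setminus A) \ge P(A)$ since $P(B \setminus A) \ge 0$. ■
Proposition 1.4 (Inclusion-Exclusion). For any two events $A, B$:
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
Proof. Write $A \cup B = A \cup (B \setminus A)$ as a disjoint union. Then $P(A \cup B) = P(A) + P(B \setminus A)$. Since $B = (B \setminus A) \cup (A \cap B)$ is also a disjoint union, $P(B) = P(B \setminus A) + P(A \cap B)$, so $P(B \setminus A) = P(B) - P(A \cap B)$. Substituting gives the result. ■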
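Proposition 1.5 (General Inclusion-Exclusion). For events $A_1, \ldots, A_n$:
$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{k=1}^{n} (-1)^{k+1} \sum_{1 \le i_1 < \cdots < i_k \le n} P(A_{i_1} \cap \cdots \cap A_{i_k})$$
Proof (sketch). Induct on $n$: apply Proposition 1.4 to $\left(\bigcup_{i=1}^{n} A_i\right) \cup A_{n+1}$, then apply the induction hypothesis to both union terms and rearrange to obtain the formula for $n+1$. ■
Proposition 1.6 (Boole's Inequality). $P(A \cup B) \le P(A) + P(B)$. More generally,
$$P\left(\bigcup_{i=1}^{n} A_i\right) \le \sum_{i=1}^{n} P(A_i)$$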
1.5 Conditional Probability
Definition. The conditional probability of A given B (where P(B)>0) is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
Intuition. Conditioning on B restricts the sample space to B and rescales so that P(B∣B)=1.
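Theorem 1.1 (Law of Total Probability). If $B_1, \ldots, B_n$ partition $\Omega$ with $P(B_i) > 0$ for each $i$:
$$P(A) = \sum_{i=1}^{n} P(A \mid B_i) P(B_i)$$
Proof. Since $B_1, \ldots, B_n$ partition $\Omega$, we have $A = \bigcup_{i=1}^{n} (A \cap B_i)$ (a disjoint union). By additivity:
$$P(A) = \sum_{i=1}^{n} P(A \cap B_i) = \sum_{i=1}^{n} P(A \mid B_i) P(B_i)$$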
■
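Theorem 1.2 (Bayes' Theorem). For events $A$ and $B$ with $P(A) > 0$ and $P(B) > 0$:
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$$
In the partition form:
$$P(B_j \mid A) = \frac{P(A \mid B_j) P(B_j)}{\sum_{i=1}^{n} P(A \mid B_i) P(B_i)}$$
Proof. By definition, $P(A \mid B) = P(A \cap B)/P(B)$ and $P(B \mid A) = P(A \cap B)/P(A)$. Solving the second for $P(A \cap B) = P(B \mid A) P(A)$ and substituting into the first gives Bayes' theorem. The partition form follows by exchanging the roles of the two events (conditioning $B_j$ on $A$) and applying the law of total probability to the denominator $P(A)$. ■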
1.6 Worked Examples
Problem 1.1. A bag contains 4 red and 6 blue marbles. Two marbles are drawn without replacement. What is the probability that both are red?
Solution
Let $R_1$ be the event "first marble is red" and $R_2$ be "second marble is red." Then:
$$P(R_1 \cap R_2) = P(R_1) P(R_2 \mid R_1) = \frac{4}{10} \cdot \frac{3}{9} = \frac{12}{90} = \frac{2}{15}$$
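The multiplication rule used in Problem 1.1 can be cross-checked by enumerating every ordered draw of two marbles. This is a minimal sketch with exact rational arithmetic; the list `marbles` and the other names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

marbles = ["R"] * 4 + ["B"] * 6

# Multiplication rule: P(R1) * P(R2 | R1).
p_rule = Fraction(4, 10) * Fraction(3, 9)

# Brute force: all ordered draws of two distinct marbles are equally likely.
draws = list(permutations(range(len(marbles)), 2))
both_red = sum(1 for i, j in draws if marbles[i] == "R" and marbles[j] == "R")
p_enum = Fraction(both_red, len(draws))

print(p_rule, p_enum)  # 2/15 2/15
```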
Problem 1.2. A disease affects 1% of a population. A test has sensitivity 95% ($P(\text{positive} \mid \text{disease}) = 0.95$) and specificity 90% ($P(\text{negative} \mid \text{healthy}) = 0.90$). If a person tests positive, what is the probability they have the disease?
Solution
Let D = “has disease” and + = “tests positive.” We want P(D∣+).
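By the law of total probability, $P(+) = P(+ \mid D) P(D) + P(+ \mid D^c) P(D^c) = 0.95 \times 0.01 + 0.10 \times 0.99 = 0.1085$, where $P(+ \mid D^c) = 1 - 0.90 = 0.10$. By Bayes' theorem:
$$P(D \mid +) = \frac{P(+ \mid D) P(D)}{P(+)} = \frac{0.0095}{0.1085} \approx 0.0876$$
So a positive result corresponds to only about an 8.8% chance of disease. ■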
Problem 1.3. Three machines A, B, C produce items. Machine A produces 50% of items with 2% defective, B produces 30% with 1% defective, and C produces 20% with 3% defective. An item is found to be defective. What is the probability it was produced by machine A?
Solution
Let D = “defective” and A, B, C denote production by each machine.
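The partition form of Bayes' theorem applies directly to this setup. The sketch below uses a hypothetical helper `posterior` (not part of the original notes) that multiplies priors by likelihoods and normalises by the total probability.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem in partition form: returns (posteriors, total probability)."""
    joint = {k: priors[k] * likelihoods[k] for k in priors}
    total = sum(joint.values())  # law of total probability
    return {k: v / total for k, v in joint.items()}, total

priors = {"A": 0.50, "B": 0.30, "C": 0.20}        # share of production
defect_rates = {"A": 0.02, "B": 0.01, "C": 0.03}  # P(defective | machine)

post, p_defective = posterior(priors, defect_rates)
print(round(p_defective, 4))  # 0.019
print(round(post["A"], 4))    # 0.5263
```

So $P(D) = 0.019$ and $P(A \mid D) = 0.010/0.019 \approx 0.526$.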
:::caution Common Pitfall
People often confuse $P(A \mid B)$ with $P(B \mid A)$. In medical testing, $P(\text{disease} \mid \text{positive})$ is often much lower than $P(\text{positive} \mid \text{disease})$ because of low base rates. Always apply Bayes' theorem rigorously.
:::
2. Random Variables
2.1 Definition
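A random variable is a measurable function $X: \Omega \to \mathbb{R}$. Measurability means that for every Borel set $B \subseteq \mathbb{R}$, the preimage $X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\}$ belongs to $\mathcal{F}$.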
Discrete random variable: takes values in a countable set.
Continuous random variable: has a probability density function (PDF).
Example 2.1 (Discrete). Roll a fair die. Define $X(\omega) = \omega$. Then $X$ takes values in $\{1,2,3,4,5,6\}$ with $P(X = k) = 1/6$ for each $k$.
Example 2.2 (Discrete, Indicator). For any event $A$, the indicator random variable $\mathbf{1}_A$ equals 1 if $A$ occurs and 0 otherwise. Then $E[\mathbf{1}_A] = P(A)$ and $\text{Var}(\mathbf{1}_A) = P(A)(1 - P(A))$.
Example 2.3 (Continuous). Let $X \sim \text{Uniform}(0,1)$. Then $X$ can be realised as the identity map on $\Omega = (0,1)$, and it has PDF $f(x) = 1$ for $x \in (0,1)$ and $f(x) = 0$ otherwise.
Example 2.4 (Mixed). A random variable can be neither purely discrete nor purely continuous. For instance, if $X = 0$ with probability $1/2$ and $X \sim \text{Exp}(1)$ with probability $1/2$, then $X$ has an atom at 0 and a continuous part on $(0, \infty)$.
2.2 Cumulative Distribution Function
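The CDF of a random variable $X$ is
$$F_X(x) = P(X \le x)$$
Theorem 2.1 (Properties of the CDF).
1. $\lim_{x \to -\infty} F(x) = 0$, $\lim_{x \to +\infty} F(x) = 1$.
2. $F$ is non-decreasing: if $a < b$, then $F(a) \le F(b)$.
3. $F$ is right-continuous: $\lim_{x \to a^+} F(x) = F(a)$.
4. $P(a < X \le b) = F(b) - F(a)$.
5. $P(X = a) = F(a) - F(a^-)$ where $F(a^-) = \lim_{x \to a^-} F(x)$.
6. $F$ has at most countably many points of discontinuity.
Proof. (1) As $x \to \infty$, $\{X \le x\} \uparrow \Omega$, so by continuity from below, $F(x) \to P(\Omega) = 1$. Similarly, as $x \to -\infty$, $\{X \le x\} \downarrow \emptyset$, so by continuity from above, $F(x) \to 0$.
(2) If $a < b$, then $\{X \le a\} \subseteq \{X \le b\}$, so by monotonicity of $P$, $F(a) \le F(b)$.
(3) Let $x_n \downarrow a$. Then $\{X \le x_n\} \downarrow \{X \le a\}$ (since $\{X \le a\} = \bigcap_n \{X \le x_n\}$). By continuity from above of $P$, $F(x_n) \to F(a)$.
(4) $\{X \le b\} = \{X \le a\} \cup \{a < X \le b\}$ is a disjoint union, so $F(b) = F(a) + P(a < X \le b)$.
(5) $\{X < a\} = \bigcup_n \{X \le a - 1/n\}$, so $P(X < a) = F(a^-)$ by continuity from below, and $P(X = a) = P(X \le a) - P(X < a)$.
(6) Since $F$ is non-decreasing with values in $[0,1]$, every discontinuity is a jump and the jumps sum to at most 1. Hence for each $n$ there are at most $n$ jumps of size at least $1/n$, and the set of discontinuities is a countable union of finite sets. ■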
2.3 Probability Mass Function (Discrete)
For a discrete random variable $X$ with values $\{x_1, x_2, \ldots\}$:
$$f_X(x) = P(X = x) = \begin{cases} p_i & \text{if } x = x_i \\ 0 & \text{otherwise} \end{cases}$$
where $p_i \ge 0$ and $\sum_i p_i = 1$.
2.4 Probability Density Function (Continuous)
A random variable $X$ is continuous if there exists a function $f_X \ge 0$ such that
$$P(a \le X \le b) = \int_a^b f_X(x)\,dx$$
for all $a \le b$, and $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$.
Note: $f_X(x)$ is not a probability; it is a probability density. For continuous $X$, $P(X = x) = 0$ for any individual $x$.
2.5 Functions of Random Variables
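Proposition 2.1. If $X$ is a continuous random variable with PDF $f_X$ and $g$ is a strictly monotone differentiable function, then $Y = g(X)$ has PDF
$$f_Y(y) = f_X(g^{-1}(y)) \cdot \left| \frac{d}{dy} g^{-1}(y) \right|$$
for $y$ in the range of $g$ (and $f_Y(y) = 0$ elsewhere).
Proof. Suppose $g$ is strictly increasing. Then $F_Y(y) = P(Y \le y) = P(X \le g^{-1}(y)) = F_X(g^{-1}(y))$. Differentiating: $f_Y(y) = f_X(g^{-1}(y)) \cdot (g^{-1})'(y)$. For decreasing $g$, the inequality reverses, so $F_Y(y) = 1 - F_X(g^{-1}(y))$ and differentiating introduces a minus sign, which is cancelled by the negative derivative $(g^{-1})'(y)$. Both cases are captured by the absolute value. ■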
Problem 2.1. Let X∼Uniform(0,1). Find the distribution of Y=−lnX.
Solution
Here $g(x) = -\ln x$, which is strictly decreasing on $(0,1)$. The inverse is $g^{-1}(y) = e^{-y}$ for $y > 0$. We have $(g^{-1})'(y) = -e^{-y}$.
$$f_Y(y) = f_X(e^{-y}) \cdot \left| -e^{-y} \right| = 1 \cdot e^{-y} = e^{-y}, \quad y > 0$$
This is the Exp(1) distribution. ■
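The result of Problem 2.1 can be checked empirically. The sketch below is a simulation under stated assumptions (fixed seed, 200,000 draws, `1 - random.random()` used so the logarithm never sees zero); it compares the sample mean with $E[Y] = 1$ and the tail frequency with $P(Y > 1) = e^{-1}$ for an Exp(1) variable.

```python
import math
import random

random.seed(0)
n = 200_000

# 1 - random.random() lies in (0, 1], so the logarithm is always defined.
samples = [-math.log(1.0 - random.random()) for _ in range(n)]

sample_mean = sum(samples) / n
tail_freq = sum(1 for y in samples if y > 1.0) / n

print(sample_mean)               # close to 1
print(tail_freq, math.exp(-1))   # both close to 0.368
```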
Problem 2.2. Let X∼N(0,1). Find the distribution of Y=X2.
Solution
The function g(x)=x2 is not monotone, so we must split into cases.
For y>0:
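$$F_Y(y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}) = \Phi(\sqrt{y}) - \Phi(-\sqrt{y}) = 2\Phi(\sqrt{y}) - 1$$
Differentiating:
$$f_Y(y) = 2 \cdot \phi(\sqrt{y}) \cdot \frac{1}{2\sqrt{y}} = \frac{1}{\sqrt{y}} \cdot \frac{1}{\sqrt{2\pi}} e^{-y/2} = \frac{1}{\sqrt{2\pi y}} e^{-y/2}$$
This is the PDF of a $\chi^2_1$ (chi-squared with 1 degree of freedom) distribution, which equals $\text{Gamma}(1/2, 1/2)$. ■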
2.6 Quantile Function
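Definition. The quantile function (or inverse CDF) of a random variable $X$ with CDF $F$ is
$$F^{-1}(p) = \inf\{x : F(x) \ge p\}, \quad 0 < p < 1$$
Remark. If $F$ is continuous and strictly increasing, then $F^{-1}(p)$ is the unique $x$ such that $F(x) = p$. For discrete distributions, $F^{-1}$ is the generalised inverse.
Theorem 2.2 (Probability Integral Transform). If $X$ has a continuous CDF $F$, then $F(X) \sim \text{Uniform}(0,1)$.
Proof. For $u \in (0,1)$: $P(F(X) \le u) = P(X \le F^{-1}(u)) = F(F^{-1}(u)) = u$, where the last step uses the continuity of $F$. (The middle equality is immediate when $F$ is strictly increasing; the general continuous case needs slightly more care.) ■
Intuition. This theorem is the foundation of inverse transform sampling: to generate from any distribution with CDF $F$, draw $U \sim \text{Uniform}(0,1)$ and compute $X = F^{-1}(U)$.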
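Inverse transform sampling can be sketched for one continuous and one discrete case. The helper names `exp_quantile` and `discrete_quantile` are illustrative, not from the original notes; the first uses the closed-form inverse of $F(x) = 1 - e^{-\lambda x}$, the second implements the generalised inverse $\inf\{x : F(x) \ge p\}$ with a binary search over cumulative probabilities.

```python
import bisect
import math
import random

def exp_quantile(p, rate):
    """F^{-1}(p) for Exp(rate), where F(x) = 1 - exp(-rate * x)."""
    return -math.log(1.0 - p) / rate

def discrete_quantile(p, values, probs):
    """Generalised inverse inf{x : F(x) >= p} for a finite distribution."""
    cdf, running = [], 0.0
    for q in probs:
        running += q
        cdf.append(running)
    # First index whose cumulative probability is >= p; the clamp guards against rounding.
    return values[min(bisect.bisect_left(cdf, p), len(values) - 1)]

random.seed(1)
n = 100_000
exp_draws = [exp_quantile(random.random(), rate=2.0) for _ in range(n)]
die_draws = [discrete_quantile(random.random(), [1, 2, 3, 4, 5, 6], [1 / 6] * 6) for _ in range(n)]

print(sum(exp_draws) / n)  # close to 1 / rate = 0.5
print(sum(die_draws) / n)  # close to 3.5
```

Rebuilding the cumulative table on every call keeps the sketch short; real code would precompute it once.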
2.7 Order Statistics
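Definition. Let $X_1, \ldots, X_n$ be i.i.d. with CDF $F$ and PDF $f$. The order statistics $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$ are the sorted values.
Theorem 2.3. The PDF of the $k$-th order statistic $X_{(k)}$ is
$$f_{X_{(k)}}(x) = \frac{n!}{(k-1)!\,(n-k)!} F(x)^{k-1} [1 - F(x)]^{n-k} f(x)$$
Proof. For $X_{(k)} \le x$ to hold, at least $k$ of the $X_i$ must be $\le x$. The event $X_{(k)} \in (x, x+dx)$ requires exactly $k-1$ observations below $x$, one in $(x, x+dx)$, and $n-k$ above $x+dx$. Counting the ways to assign the observations to these three groups gives
$$P(X_{(k)} \in (x, x+dx)) \approx \frac{n!}{(k-1)!\,1!\,(n-k)!} F(x)^{k-1} \cdot f(x)\,dx \cdot [1 - F(x)]^{n-k}$$
and dividing by $dx$ gives the stated density. ■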
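Remark. The exponential distribution is the only continuous distribution with the memoryless property. This makes it the natural model for waiting times between Poisson events.
Normal (Gaussian). $X \sim N(\mu, \sigma^2)$:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad x \in \mathbb{R}$$
$$E[X] = \mu, \quad \text{Var}(X) = \sigma^2$$
The standard normal $Z \sim N(0,1)$ has CDF denoted $\Phi(z)$. For any $X \sim N(\mu, \sigma^2)$:
$$Z = \frac{X - \mu}{\sigma} \sim N(0,1)$$
$$M_X(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)$$
Verification that $f$ integrates to 1. Consider $I = \int_{-\infty}^{\infty} e^{-x^2/2}\,dx$. Then $I^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2+y^2)/2}\,dx\,dy$. Switching to polar coordinates ($x = r\cos\theta$, $y = r\sin\theta$, $dx\,dy = r\,dr\,d\theta$):
$$I^2 = \int_0^{2\pi} \int_0^{\infty} e^{-r^2/2}\, r\,dr\,d\theta = 2\pi \left[-e^{-r^2/2}\right]_0^{\infty} = 2\pi$$
So $I = \sqrt{2\pi}$ and the standard normal density integrates to 1; the general case follows from the substitution $z = (x - \mu)/\sigma$.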
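Derivation of the MGF. For $Z \sim N(0,1)$, completing the square in the exponent gives
$$M_Z(t) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{tz - z^2/2}\,dz = e^{t^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(z-t)^2/2}\,dz = e^{t^2/2}$$
since the integrand is the PDF of $N(t,1)$, integrated over all of $\mathbb{R}$. ■
For $X = \mu + \sigma Z$: $M_X(t) = E[e^{t(\mu + \sigma Z)}] = e^{\mu t} E[e^{(\sigma t) Z}] = e^{\mu t} e^{\sigma^2 t^2/2}$.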
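Theorem 3.2. The sum of independent normal random variables is normal: if $X_i \sim N(\mu_i, \sigma_i^2)$ are independent, then $\sum X_i \sim N\left(\sum \mu_i, \sum \sigma_i^2\right)$.
Proof (using MGFs). $M_{\sum X_i}(t) = \prod_i M_{X_i}(t) = \prod_i \exp\left(\mu_i t + \sigma_i^2 t^2/2\right) = \exp\left(\left(\sum \mu_i\right) t + \frac{\left(\sum \sigma_i^2\right) t^2}{2}\right)$, which is the MGF of $N\left(\sum \mu_i, \sum \sigma_i^2\right)$. By the uniqueness theorem for MGFs, the result follows. ■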
3.3 Relationships Between Distributions
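Theorem 3.3 (Poisson Limit Theorem). If $X_n \sim \text{Bin}(n, p_n)$ where $n p_n \to \lambda$ as $n \to \infty$, then $X_n \xrightarrow{d} \text{Poisson}(\lambda)$.
Proof. Write $\lambda_n = n p_n$. For fixed $k$:
$$P(X_n = k) = \binom{n}{k} p_n^k (1 - p_n)^{n-k} = \frac{\lambda_n^k}{k!} \left[\prod_{j=0}^{k-1} \frac{n-j}{n}\right] \left(1 - \frac{\lambda_n}{n}\right)^{n} \left(1 - \frac{\lambda_n}{n}\right)^{-k}$$
As $n \to \infty$: $\lambda_n \to \lambda$; $\frac{n-j}{n} \to 1$ for each fixed $j$; $\left(1 - \frac{\lambda_n}{n}\right)^n \to e^{-\lambda}$; and $\left(1 - \frac{\lambda_n}{n}\right)^{-k} \to 1$.
Therefore: $P(X_n = k) \to \frac{\lambda^k e^{-\lambda}}{k!} = P(\text{Poisson}(\lambda) = k)$. ■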
Intuition. The Poisson distribution approximates the binomial when n is large, p is small, and np is moderate.
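Theorem 3.4 (Normal approximation to the Binomial). If $X \sim \text{Bin}(n,p)$ with $n$ large, then approximately $X \approx N(np, np(1-p))$. More precisely, using a continuity correction, for integers $a \le b$:
$$P(a \le X \le b) \approx \Phi\left(\frac{b + 0.5 - np}{\sqrt{np(1-p)}}\right) - \Phi\left(\frac{a - 0.5 - np}{\sqrt{np(1-p)}}\right)$$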
Problem 3.2. The lifetime of a component is exponentially distributed with mean 500 hours. Given that the Component has lasted 300 hours, what is the probability it lasts at least another 200 hours?
Solution
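The mean is $1/\lambda = 500$, so $\lambda = 1/500$. By the memoryless property:
$$P(X > 500 \mid X > 300) = P(X > 200) = e^{-200/500} = e^{-0.4} \approx 0.670$$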
Definition. The moment generating function (MGF) of $X$ is $M_X(t) = E[e^{tX}]$ (provided the expectation exists in a neighbourhood of $t = 0$).
Theorem 4.3. If $M_X(t)$ exists in a neighbourhood of 0, then $E[X^n] = M_X^{(n)}(0)$.
Proof. $M_X(t) = E[e^{tX}] = \sum_{n=0}^{\infty} \frac{t^n}{n!} E[X^n]$ (by expanding the Taylor series and exchanging summation and expectation, justified by dominated convergence). The coefficient of $t^n/n!$ is $E[X^n]$, so $E[X^n] = M_X^{(n)}(0)$. ■
Theorem 4.4 (Uniqueness). If $M_X(t) = M_Y(t)$ for all $t$ in a neighbourhood of 0, then $X$ and $Y$ have the same distribution.
Theorem 4.5. If $X$ and $Y$ are independent, $M_{X+Y}(t) = M_X(t) M_Y(t)$.
Proof. $M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] E[e^{tY}] = M_X(t) M_Y(t)$, where the third equality uses independence. ■
4.4 Important Inequalities
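Theorem 4.6a (Markov's Inequality). If $X \ge 0$ and $a > 0$:
$$P(X \ge a) \le \frac{E[X]}{a}$$
Theorem 4.6b (Chebyshev's Inequality). For any random variable $X$ with finite mean $\mu$ and variance $\sigma^2$, and any $k > 0$:
$$P(|X - \mu| \ge k) \le \frac{\sigma^2}{k^2}$$
Proof. Apply Markov's inequality to $(X - \mu)^2$ with $a = k^2$: $P(|X - \mu| \ge k) = P((X - \mu)^2 \ge k^2) \le \frac{E[(X - \mu)^2]}{k^2} = \frac{\sigma^2}{k^2}$. ■
Theorem 4.6c (Jensen's Inequality). If $\varphi$ is convex, then $E[\varphi(X)] \ge \varphi(E[X])$. If $\varphi$ is concave, the inequality reverses.
Proof (sketch). For a convex function $\varphi$, the tangent line at any point (or a supporting line, if $\varphi$ is not differentiable there) lies below the graph: $\varphi(x) \ge \varphi(\mu) + \varphi'(\mu)(x - \mu)$ where $\mu = E[X]$. Taking expectations of both sides: $E[\varphi(X)] \ge \varphi(\mu) + \varphi'(\mu) \cdot 0 = \varphi(E[X])$. ■
Remark. Important applications: $E[X^2] \ge (E[X])^2$ (variance is non-negative, since $x^2$ is convex); $E[\log X] \le \log E[X]$ for $X > 0$ (the logarithm is concave; this is used in proving the information inequality).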
4.5 Cauchy-Schwarz Inequality for Random Variables
Theorem 4.6 (Cauchy-Schwarz). For any random variables $X, Y$ with finite second moments:
$$(E[XY])^2 \le E[X^2] E[Y^2]$$
Proof. For any real $t$, $E[(X + tY)^2] = E[X^2] + 2t E[XY] + t^2 E[Y^2] \ge 0$. This is a quadratic in $t$ that is always non-negative, so its discriminant must be non-positive:
$$(2E[XY])^2 - 4E[Y^2]E[X^2] \le 0 \implies (E[XY])^2 \le E[X^2]E[Y^2]$$
■
Corollary 4.1. $|\rho_{X,Y}| \le 1$.
Proof. Apply Cauchy-Schwarz to $X - E[X]$ and $Y - E[Y]$:
$$\text{Cov}(X,Y)^2 \le \text{Var}(X)\,\text{Var}(Y)$$
So $|\rho_{X,Y}| = \frac{|\text{Cov}(X,Y)|}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}} \le 1$. ■
Theorem 4.7 (Conditional Variance Formula).
Var(Y)=E[Var(Y∣X)]+Var(E[Y∣X])
Proof. $E[\text{Var}(Y \mid X)] = E[E[Y^2 \mid X]] - E[(E[Y \mid X])^2] = E[Y^2] - E[(E[Y \mid X])^2]$. Also $\text{Var}(E[Y \mid X]) = E[(E[Y \mid X])^2] - (E[E[Y \mid X]])^2 = E[(E[Y \mid X])^2] - (E[Y])^2$. Adding: $E[Y^2] - (E[Y])^2 = \text{Var}(Y)$. ■
Intuition. Total variance equals the average "within-group" variance plus the "between-group" variance. This decomposition is the foundation of ANOVA (Analysis of Variance).
4.6 Worked Examples
Worked Example. Find the MGF of X∼Bin(n,p) and use it to derive E[X] and Var(X).
Problem 4.3. The number of accidents at an intersection per week follows Poisson(λ) with λ=2. Find P(X≤1) and, using Markov’s inequality, give an upper bound for P(X≥6).
Solution
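$$P(X \le 1) = P(X = 0) + P(X = 1) = e^{-2} + 2e^{-2} = 3e^{-2} \approx 0.406$$
By Markov's inequality (since $X \ge 0$):
$$P(X \ge 6) \le \frac{E[X]}{6} = \frac{2}{6} = \frac{1}{3} \approx 0.333$$
For comparison, the exact value is $P(X \ge 6) = 1 - P(X \le 5) = 1 - e^{-2}\left(1 + 2 + 2 + \frac{8}{6} + \frac{16}{24} + \frac{32}{120}\right) \approx 0.0166$. Markov's bound is very loose but requires no knowledge beyond the mean.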
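The three numbers in this solution are easy to reproduce. The sketch below evaluates the Poisson PMF directly; the helper `poisson_pmf` is illustrative.

```python
import math

lam = 2.0

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

p_le_1 = poisson_pmf(0, lam) + poisson_pmf(1, lam)
exact_tail = 1.0 - sum(poisson_pmf(k, lam) for k in range(6))  # P(X >= 6)
markov_bound = lam / 6  # E[X] / a with E[X] = lam and a = 6

print(round(p_le_1, 3), round(exact_tail, 4), round(markov_bound, 3))
# 0.406 0.0166 0.333
```

The Markov bound exceeds the exact tail by a factor of about 20 here, which illustrates how loose it can be.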
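$$M_X(t) = E[e^{tX}] = \int_0^{\infty} e^{tx} \lambda e^{-\lambda x}\,dx = \lambda \int_0^{\infty} e^{-(\lambda - t)x}\,dx$$
For $t < \lambda$: $M_X(t) = \frac{\lambda}{\lambda - t}$.
$M_X'(t) = \frac{\lambda}{(\lambda - t)^2}$, so $E[X] = M_X'(0) = 1/\lambda$. ■
Problem 4.1. Find $E[X^2]$ and $\text{Var}(X)$ for $X \sim \text{Exp}(\lambda)$ using the MGF.
Solution
$M_X''(t) = \frac{2\lambda}{(\lambda - t)^3}$, so $E[X^2] = M_X''(0) = \frac{2\lambda}{\lambda^3} = \frac{2}{\lambda^2}$.
$$\text{Var}(X) = E[X^2] - (E[X])^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}$$
Problem 4.2. A fair die is rolled. Let $X$ be the value shown. Compute $E[X]$, $E[X^2]$, and $\text{Var}(X)$.
:::caution Common Pitfall
$\text{Var}(X) = E[X^2] - (E[X])^2$, not $(E[X])^2 - E[X^2]$. The variance is always non-negative, so if you obtain a negative value, you have made an arithmetic error.
:::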
5. Joint Distributions
5.1 Joint PDF and CDF
For two random variables $X$ and $Y$, the joint CDF is $F_{X,Y}(x,y) = P(X \le x, Y \le y)$.
The joint PDF (for continuous random variables) satisfies $P((X,Y) \in A) = \iint_A f_{X,Y}(x,y)\,dx\,dy$.
5.2 Marginal Distributions
The marginal PDF of $X$ is obtained by integrating out $Y$:
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy$$
Similarly, $f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx$.
5.3 Conditional Distributions
Definition. The conditional PDF of $X$ given $Y = y$ (when $f_Y(y) > 0$) is
$$f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}$$
Definition. The conditional expectation is
$$E[X \mid Y = y] = \int_{-\infty}^{\infty} x\, f_{X \mid Y}(x \mid y)\,dx$$
Theorem 5.0 (Law of Iterated Expectations / Tower Property).
$$E[E[X \mid Y]] = E[X]$$
Proof. Expand Var(∑Xi)=E[(∑Xi)2]−(E[∑Xi])2 and collect terms. ■
Remark. If the $X_i$ are pairwise uncorrelated (which includes independence as a special case), the covariance terms vanish and the variance of the sum equals the sum of the variances.
5.6 Transformation of Random Variables (Jacobian Method)
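Theorem 5.2. Let $(X,Y)$ have joint PDF $f_{X,Y}(x,y)$ and let $(U,V) = g(X,Y)$ where $g$ is a differentiable bijection from $\mathbb{R}^2$ to $\mathbb{R}^2$ with differentiable inverse $g^{-1}$:
$$u = u(x,y), \quad v = v(x,y)$$
Then the joint PDF of $(U,V)$ is
$$f_{U,V}(u,v) = f_{X,Y}(x(u,v), y(u,v)) \cdot |J|$$
where the Jacobian determinant is
$$J = \det \begin{pmatrix} \dfrac{\partial x}{\partial u} & \dfrac{\partial x}{\partial v} \\ \dfrac{\partial y}{\partial u} & \dfrac{\partial y}{\partial v} \end{pmatrix}$$
Problem 5.1. Let $X, Y$ be independent with $X, Y \sim \text{Exp}(\lambda)$. Find the joint distribution of $U = X + Y$ and $V = X/(X+Y)$.
Solution
The inverse transformation is $x = uv$, $y = u(1-v)$ for $u > 0$, $0 < v < 1$.
The Jacobian:
$$J = \det \begin{pmatrix} v & u \\ 1 - v & -u \end{pmatrix} = -uv - u(1-v) = -u$$
So $|J| = u$.
The joint PDF of $(X,Y)$ is $f_{X,Y}(x,y) = \lambda^2 e^{-\lambda(x+y)}$ for $x, y > 0$.
$$f_{U,V}(u,v) = \lambda^2 e^{-\lambda(uv + u(1-v))} \cdot u = \lambda^2 u e^{-\lambda u}, \quad u > 0,\ 0 < v < 1$$
This factors as $f_U(u) \cdot f_V(v)$ where $f_U(u) = \lambda^2 u e^{-\lambda u}$ ($\text{Gamma}(2, \lambda)$) and $f_V(v) = 1$ on $(0,1)$ ($\text{Uniform}(0,1)$). Hence $U$ and $V$ are independent. ■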
5.7 Bivariate Normal Distribution
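Definition. $(X,Y)$ has a bivariate normal distribution with parameters $\mu_X, \mu_Y, \sigma_X^2, \sigma_Y^2, \rho$ if the joint PDF is
$$f_{X,Y}(x,y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)} \left[\frac{(x-\mu_X)^2}{\sigma_X^2} - \frac{2\rho (x-\mu_X)(y-\mu_Y)}{\sigma_X \sigma_Y} + \frac{(y-\mu_Y)^2}{\sigma_Y^2}\right]\right)$$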
Problem 5.3. Let X∼N(0,1) and Y=X2. Compute Cov(X,Y) and ρX,Y.
Solution
$E[X] = 0$, $E[Y] = E[X^2] = 1$, and $E[XY] = E[X^3] = 0$ (since $x^3$ is an odd function and the distribution of $X$ is symmetric about 0).
$$\text{Cov}(X,Y) = E[XY] - E[X]E[Y] = 0 - 0 = 0$$
So $\rho_{X,Y} = 0$. However, $X$ and $Y$ are not independent (knowing $X$ determines $Y$ exactly). This demonstrates that zero correlation does not imply independence. ■
6. Limit Theorems
6.1 Convergence in Probability and Distribution
Definition. $X_n \xrightarrow{p} X$ (convergence in probability) if for every $\varepsilon > 0$:
$$\lim_{n \to \infty} P(|X_n - X| > \varepsilon) = 0$$
Definition. $X_n \xrightarrow{d} X$ (convergence in distribution) if $\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$ at all continuity points of $F_X$.
Remark. Convergence in probability implies convergence in distribution. The converse does not hold in general, but it does hold when the limit is a constant.
Proposition 6.1. If $X_n \xrightarrow{p} c$ (a constant), then $X_n \xrightarrow{d} c$.
Proof. The CDF of the constant $c$ is $F_c(x) = 0$ for $x < c$ and $F_c(x) = 1$ for $x \ge c$, so its only discontinuity is at $x = c$. For $x > c$: $F_{X_n}(x) = 1 - P(X_n > x) \ge 1 - P(|X_n - c| > x - c) \to 1 = F_c(x)$. For $x < c$: $F_{X_n}(x) = P(X_n \le x) \le P(|X_n - c| \ge c - x) \to 0 = F_c(x)$. Since $F_c$ is continuous at every $x \ne c$, convergence holds at all continuity points. ■
Definition. $X_n \xrightarrow{a.s.} X$ (almost sure convergence) if $P(\lim_{n \to \infty} X_n = X) = 1$.
Remark. The hierarchy of convergence is: almost sure $\implies$ in probability $\implies$ in distribution. None of the reverse implications holds in general.
6.2 Law of Large Numbers
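Theorem 6.1 (Weak Law of Large Numbers). Let $X_1, X_2, \ldots$ be i.i.d. with $E[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2 < \infty$. Then for every $\varepsilon > 0$:
$$\lim_{n \to \infty} P\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| > \varepsilon\right) = 0$$
Proof. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then $E[\bar{X}_n] = \mu$ and $\text{Var}(\bar{X}_n) = \sigma^2/n$. By Chebyshev's inequality:
$$P(|\bar{X}_n - \mu| > \varepsilon) \le \frac{\text{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0$$
■
Theorem 6.2 (Strong Law of Large Numbers). Under the same hypotheses:
$$P\left(\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} X_i = \mu\right) = 1$$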
6.3 Central Limit Theorem
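Theorem 6.3 (Central Limit Theorem). Let $X_1, X_2, \ldots$ be i.i.d. with $E[X_i] = \mu$ and $0 < \text{Var}(X_i) = \sigma^2 < \infty$. Then
$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0,1)$$
That is, for all $a < b$:
$$\lim_{n \to \infty} P\left(a < \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} < b\right) = \Phi(b) - \Phi(a)$$
Proof (sketch using MGFs, assuming the MGF exists near 0). Let $Y_i = (X_i - \mu)/\sigma$, so $E[Y_i] = 0$ and $\text{Var}(Y_i) = 1$. Define $Z_n = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} Y_i$. We show $M_{Z_n}(t) \to e^{t^2/2}$ (the standard normal MGF).
Let $M(t) = E[e^{tY_1}]$ be the MGF of $Y_1$. Then:
$$M_{Z_n}(t) = \left[M\left(\frac{t}{\sqrt{n}}\right)\right]^n$$
By Taylor expansion of $M$ around 0: $M(h) = 1 + hM'(0) + \frac{h^2}{2}M''(0) + o(h^2) = 1 + 0 + \frac{h^2}{2} + o(h^2)$ (since $E[Y_1] = 0$ and $E[Y_1^2] = 1$).
Therefore:
$$M_{Z_n}(t) = \left[1 + \frac{t^2}{2n} + o\left(\frac{1}{n}\right)\right]^n \to e^{t^2/2}$$
as $n \to \infty$. By the continuity theorem for MGFs, $Z_n \xrightarrow{d} N(0,1)$. ■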
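The statement of the CLT can be illustrated by simulation. The sketch below makes several arbitrary choices that are not in the original notes: Uniform(0,1) summands, $n = 30$, 20,000 replications, a fixed seed, and the interval $(a, b) = (-1, 1)$. It compares the empirical frequency of the standardised mean falling in $(a, b)$ with $\Phi(b) - \Phi(a)$.

```python
import math
import random

random.seed(2)
n, reps = 30, 20_000
mu, sigma = 0.5, math.sqrt(1 / 12)  # mean and sd of Uniform(0, 1)

def standardised_mean():
    """(sample mean - mu) / (sigma / sqrt(n)) for one sample of size n."""
    xbar = sum(random.random() for _ in range(n)) / n
    return (xbar - mu) / (sigma / math.sqrt(n))

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

z = [standardised_mean() for _ in range(reps)]
a, b = -1.0, 1.0
empirical = sum(1 for v in z if a < v < b) / reps

print(empirical, phi(b) - phi(a))  # both close to 0.683
```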
6.4 Slutsky’s Theorem
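Theorem 6.4 (Slutsky). If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$ (a constant), then:
1. $X_n + Y_n \xrightarrow{d} X + c$.
2. $Y_n X_n \xrightarrow{d} cX$.
3. $X_n / Y_n \xrightarrow{d} X/c$ (provided $c \ne 0$).
Intuition. Slutsky's theorem allows us to replace a convergent-in-probability sequence by its limit inside expressions that converge in distribution. This is essential for deriving the asymptotic distribution of t-statistics, for instance.
Corollary 6.1 (Asymptotic distribution of the t-statistic). If $X_1, \ldots, X_n$ are i.i.d. with mean $\mu$, variance $\sigma^2 > 0$, and finite fourth moment, then
$$\frac{\bar{X}_n - \mu}{S_n/\sqrt{n}} \xrightarrow{d} N(0,1)$$
where $S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$.
Proof. By the CLT, $\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} N(0,1)$. By the WLLN, $S_n^2 \xrightarrow{p} \sigma^2$, so $S_n \xrightarrow{p} \sigma$. By the continuous mapping theorem, $\sigma/S_n \xrightarrow{p} 1$. Applying Slutsky's theorem:
$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \cdot \frac{\sigma}{S_n} \xrightarrow{d} N(0,1) \cdot 1 = N(0,1) \quad ■$$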
6.5 Delta Method
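Theorem 6.5 (Delta Method). If $\sqrt{n}(T_n - \theta) \xrightarrow{d} N(0, \sigma^2)$ and $g$ is differentiable at $\theta$ with $g'(\theta) \ne 0$, then
$$\sqrt{n}(g(T_n) - g(\theta)) \xrightarrow{d} N(0, \sigma^2 [g'(\theta)]^2)$$
Proof (sketch). By a first-order Taylor expansion: $g(T_n) \approx g(\theta) + g'(\theta)(T_n - \theta)$. Multiplying by $\sqrt{n}$: $\sqrt{n}(g(T_n) - g(\theta)) \approx g'(\theta) \cdot \sqrt{n}(T_n - \theta)$. The right side converges in distribution to $g'(\theta) \cdot N(0, \sigma^2) = N(0, \sigma^2 [g'(\theta)]^2)$. Slutsky's theorem makes this rigorous. ■
Problem 6.4. Let $X_1, \ldots, X_n$ be i.i.d. $\text{Poisson}(\lambda)$. Find the asymptotic distribution of $\bar{X}_n - e^{-\bar{X}_n}$ (suitably centred and scaled by $\sqrt{n}$) using the delta method.
Solution
By the CLT, $\sqrt{n}(\bar{X}_n - \lambda) \xrightarrow{d} N(0, \lambda)$ (since $\text{Var}(X_i) = \lambda$).
Let $g(t) = t - e^{-t}$. Then $g'(t) = 1 + e^{-t}$, so $g'(\lambda) = 1 + e^{-\lambda}$.
By the delta method:
$$\sqrt{n}(g(\bar{X}_n) - g(\lambda)) \xrightarrow{d} N(0, \lambda(1 + e^{-\lambda})^2)$$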
6.6 Worked Examples
Problem 6.1. A factory produces light bulbs with mean lifetime 1000 hours and standard deviation 200 hours. What is the probability that the mean lifetime of 100 bulbs exceeds 1040 hours?
Solution
By the CLT, $\bar{X}_{100} \approx N(1000, 200^2/100) = N(1000, 400)$.
$$P(\bar{X} > 1040) = P\left(Z > \frac{1040 - 1000}{20}\right) = P(Z > 2) \approx 0.0228$$
■
Problem 6.2. A coin is flipped 10,000 times. Approximate the probability that the number of heads is between 4,900 and 5,100.
Solution
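Let $X \sim \text{Bin}(10000, 0.5)$, so $E[X] = 5000$ and $\text{Var}(X) = 2500$ (standard deviation 50). By the normal approximation with continuity correction, reading "between" as inclusive:
$$P(4900 \le X \le 5100) \approx \Phi\left(\frac{5100.5 - 5000}{50}\right) - \Phi\left(\frac{4899.5 - 5000}{50}\right) = 2\Phi(2.01) - 1 \approx 0.9556$$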
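As a cross-check of the continuity-corrected approximation, the sketch below also evaluates the binomial probability exactly with integer arithmetic, which is possible here because $p = 1/2$. It assumes the inclusive reading $4900 \le X \le 5100$ and needs Python 3.8 or later for `math.comb`.

```python
import math

n, p = 10_000, 0.5
mean = n * p
sd = math.sqrt(n * p * (1 - p))  # 50

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Normal approximation with continuity correction for P(4900 <= X <= 5100).
approx = phi((5100.5 - mean) / sd) - phi((4899.5 - mean) / sd)

# Exact binomial probability: count favourable outcomes out of 2**n equally likely ones.
exact = sum(math.comb(n, k) for k in range(4900, 5101)) / 2**n

print(round(approx, 4), round(exact, 4))
```

With $n$ this large the two values should agree to several decimal places.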
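Problem 6.3. Let $X_1, \ldots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. Let $S^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2$. Show that $S^2 \xrightarrow{p} \sigma^2$.
Solution
First, note $E[S^2] = \sigma^2$ (it is unbiased). Write $S^2 = \frac{n}{n-1}\left[\frac{1}{n}\sum X_i^2 - \bar{X}^2\right]$. By the weak law of large numbers, $\frac{1}{n}\sum X_i^2 \xrightarrow{p} E[X_1^2] = \sigma^2 + \mu^2$ and $\bar{X} \xrightarrow{p} \mu$, so by the continuous mapping theorem $\bar{X}^2 \xrightarrow{p} \mu^2$. Since $n/(n-1) \to 1$, it follows that $S^2 \xrightarrow{p} \sigma^2 + \mu^2 - \mu^2 = \sigma^2$. (The law of large numbers for $X_i^2$ needs only $E[X_i^2] < \infty$; the Chebyshev proof of Theorem 6.1 would additionally require a finite fourth moment.) ■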
7. Maximum Likelihood Estimation
7.1 Likelihood Function
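Given i.i.d. observations $x_1, \ldots, x_n$ from a distribution with parameter $\theta$, the likelihood function is
$$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
and the log-likelihood is
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$
7.2 MLE Procedure
The maximum likelihood estimator (MLE) $\hat{\theta}_{\text{MLE}}$ maximises $L(\theta)$ (equivalently, $\ell(\theta)$):
$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} L(\theta)$$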
found by solving ℓ′(θ)=0 and verifying ℓ′′(θ^)<0.
7.3 Properties of MLEs
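Theorem 7.1 (Consistency, Sketch). Under regularity conditions, $\hat{\theta}_{\text{MLE}} \xrightarrow{p} \theta_0$ (the true parameter).
Proof sketch. By the law of large numbers, $\frac{1}{n}\ell(\theta) \xrightarrow{p} E_{\theta_0}[\log f(X \mid \theta)]$ for each $\theta$. The Kullback-Leibler divergence $D(\theta_0 \| \theta) = E_{\theta_0}[\log f(X \mid \theta_0)] - E_{\theta_0}[\log f(X \mid \theta)]$ is minimised (at zero) when $\theta = \theta_0$ by the information inequality, so the limit $E_{\theta_0}[\log f(X \mid \theta)]$ is maximised at $\theta_0$. Therefore the maximiser of $\ell(\theta)$ converges in probability to $\theta_0$.
Theorem 7.2 (Asymptotic Normality). Under regularity conditions:
$$\sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} N\left(0, \frac{1}{I(\theta_0)}\right)$$
where $I(\theta_0)$ is the Fisher information.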
7.4 Fisher Information and the Cramer-Rao Bound
Definition. The Fisher information for a single observation is
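$$I(\theta) = E\left[\left(\frac{\partial}{\partial\theta} \log f(X \mid \theta)\right)^2\right] = -E\left[\frac{\partial^2}{\partial\theta^2} \log f(X \mid \theta)\right]$$
provided the interchange of differentiation and integration is valid.
Theorem 7.3 (Cramer-Rao Lower Bound). For any unbiased estimator $T$ of $\theta$ based on $n$ i.i.d. observations:
$$\text{Var}(T) \ge \frac{1}{nI(\theta)}$$
Intuition. The Cramer-Rao bound gives a theoretical minimum for the variance of any unbiased estimator. An estimator that achieves this bound is called efficient.
Example 7.1. For $X \sim N(\mu, \sigma^2)$ with $\sigma^2$ known, the Fisher information about $\mu$ is $I(\mu) = 1/\sigma^2$. The MLE $\hat{\mu} = \bar{X}$ has $\text{Var}(\bar{X}) = \sigma^2/n = 1/(n \cdot I(\mu))$, so the sample mean achieves the Cramer-Rao bound and is efficient.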
7.5 Confidence Intervals
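Definition. A $100(1-\alpha)\%$ confidence interval for $\theta$ is a random interval $[L, U]$ such that
$$P_\theta(L \le \theta \le U) = 1 - \alpha$$
Theorem 7.4. Under the asymptotic normality of the MLE, an approximate $100(1-\alpha)\%$ confidence interval for $\theta$ is
$$\hat{\theta} \pm z_{\alpha/2} \cdot \frac{1}{\sqrt{nI(\hat{\theta})}}$$
where $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$.
Example 7.2. For $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ with $\sigma$ known, the exact $100(1-\alpha)\%$ confidence interval for $\mu$ is:
$$\bar{X} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$$
When $\sigma$ is unknown, replace $\sigma$ with $S$ and $z_{\alpha/2}$ with $t_{n-1,\alpha/2}$:
$$\bar{X} \pm t_{n-1,\alpha/2} \cdot \frac{S}{\sqrt{n}}$$
Theorem 7.5 (Wald Confidence Interval). For a scalar parameter $\theta$ with MLE $\hat{\theta}$ satisfying asymptotic normality, the Wald confidence interval is
$$\hat{\theta} \pm z_{\alpha/2}\, \text{SE}(\hat{\theta})$$
where $\text{SE}(\hat{\theta}) = 1/\sqrt{nI(\hat{\theta})}$ is the estimated standard error.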
Problem 7.4. In a survey of 400 people, 220 support a policy. Construct a 95% confidence interval for the true proportion $p$.
Solution
$\hat{p} = 220/400 = 0.55$. For a Bernoulli: $I(p) = 1/[p(1-p)]$, so $\text{SE} = \sqrt{\hat{p}(1-\hat{p})/n} = \sqrt{0.55 \times 0.45/400} = \sqrt{0.000619} \approx 0.0249$.
$$95\%\ \text{CI}: \quad 0.55 \pm 1.96 \times 0.0249 = 0.55 \pm 0.0488 = (0.501, 0.599)$$
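The Wald interval for a proportion takes only a few lines to compute. This sketch hard-codes the survey counts from the problem and uses 1.96 as the approximate 97.5% standard normal quantile.

```python
import math

successes, n = 220, 400
p_hat = successes / n

# Standard error from the Bernoulli Fisher information I(p) = 1 / (p (1 - p)).
se = math.sqrt(p_hat * (1 - p_hat) / n)

z = 1.96  # approximate 97.5% quantile of N(0, 1)
lower, upper = p_hat - z * se, p_hat + z * se

print(round(se, 4), round(lower, 3), round(upper, 3))  # 0.0249 0.501 0.599
```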
7.6 Worked Examples
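Problem 7.1. Find the MLE for $\lambda$ given i.i.d. observations $x_1, \ldots, x_n$ from $\text{Exp}(\lambda)$.
Solution
$$L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum x_i}$$
$$\ell(\lambda) = n \log \lambda - \lambda \sum_{i=1}^{n} x_i$$
$$\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0 \implies \hat{\lambda} = \frac{n}{\sum x_i} = \frac{1}{\bar{x}}$$
Verify: $\frac{d^2\ell}{d\lambda^2} = -\frac{n}{\lambda^2} < 0$, confirming this is a maximum. ■
Problem 7.2. Find the MLE for $p$ given i.i.d. observations from $\text{Bin}(n,p)$ (observed counts $x_1, \ldots, x_m$).
Problem 7.3. Compute the Fisher information for $\lambda$ in the exponential model and construct a 95% confidence interval.
Solution
For $X \sim \text{Exp}(\lambda)$: $f(x \mid \lambda) = \lambda e^{-\lambda x}$, so $\log f = \log \lambda - \lambda x$.
$$\frac{\partial}{\partial\lambda} \log f = \frac{1}{\lambda} - x$$
$$I(\lambda) = -E\left[\frac{\partial^2}{\partial\lambda^2} \log f\right] = -E\left[-\frac{1}{\lambda^2}\right] = \frac{1}{\lambda^2}$$
The MLE $\hat{\lambda} = 1/\bar{X}$ is approximately $N(\lambda, 1/(n \cdot I(\lambda))) = N(\lambda, \lambda^2/n)$.
A 95% confidence interval is:
$$\hat{\lambda} \pm 1.96 \cdot \frac{\hat{\lambda}}{\sqrt{n}}$$
:::caution Common Pitfall
The MLE is not always unbiased. For example, the MLE $\hat{\sigma}^2 = \frac{1}{n}\sum(X_i - \bar{X})^2$ for the normal variance is biased; the unbiased estimator uses $n-1$ in the denominator.
:::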
8. Hypothesis Testing
8.1 Framework
A hypothesis test evaluates two competing statements:
Null hypothesis $H_0$: the status quo (e.g., $\mu = \mu_0$).
Alternative hypothesis $H_1$: what we want to show (e.g., $\mu > \mu_0$).
8.2 Test Statistics and Decisions
A test statistic $T$ is a function of the data. We reject $H_0$ if $T$ falls in the rejection region (critical region).
Type I error: rejecting $H_0$ when it is true (false positive). Probability = $\alpha$ (significance level).
Type II error: failing to reject $H_0$ when it is false (false negative). Probability = $\beta$.
The power of a test is $1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ is true})$.
8.3 Neyman-Pearson Lemma
Theorem 8.1 (Neyman-Pearson Lemma). Consider testing $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ based on a single observation $X$ with PDF $f(x \mid \theta)$. The most powerful test of level $\alpha$ rejects $H_0$ when the likelihood ratio exceeds a threshold:
$$\Lambda(x) = \frac{f(x \mid \theta_1)}{f(x \mid \theta_0)} > k$$
for some $k$ chosen so that $P(\Lambda(X) > k \mid H_0) = \alpha$.
Proof (sketch). Consider any test $\phi$ with level $\alpha$ (i.e., $E_{\theta_0}[\phi(X)] \le \alpha$). The power under $H_1$ is $E_{\theta_1}[\phi(X)] = \int \phi(x) f(x \mid \theta_1)\,dx$. Write this as $\int \phi(x) \Lambda(x) f(x \mid \theta_0)\,dx$. The likelihood ratio test $\phi^*$ rejects when $\Lambda > k$ and randomises on the boundary, so it assigns the largest $\phi^*(x)$ values to the largest $\Lambda(x)$ values. Any other level-$\alpha$ test assigns less rejection probability to large-$\Lambda$ regions and cannot exceed the power of $\phi^*$. ■
8.4 Likelihood Ratio Tests (General)
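For composite hypotheses $H_0: \theta \in \Theta_0$ vs $H_1: \theta \in \Theta_1$, with $\Theta = \Theta_0 \cup \Theta_1$, the generalised likelihood ratio statistic is
$$\Lambda = \frac{\sup_{\theta \in \Theta_0} L(\theta)}{\sup_{\theta \in \Theta} L(\theta)}$$
We reject $H_0$ when $\Lambda$ is small (equivalently, when $-2\log\Lambda$ is large).
Theorem 8.2 (Wilks' Theorem). Under $H_0$ and regularity conditions:
$$-2\log\Lambda \xrightarrow{d} \chi^2_d$$
where $d = \dim(\Theta) - \dim(\Theta_0)$ is the difference in the number of free parameters.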
8.5 p-Values
The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming $H_0$ is true. We reject $H_0$ if the p-value is less than $\alpha$.
8.6 Z-Test for a Mean
If $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ with known $\sigma$, to test $H_0: \mu = \mu_0$:
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$
Under $H_0$, $Z \sim N(0,1)$.
For $H_1: \mu > \mu_0$: reject if $Z > z_\alpha$.
For $H_1: \mu < \mu_0$: reject if $Z < -z_\alpha$.
For $H_1: \mu \ne \mu_0$: reject if $|Z| > z_{\alpha/2}$.
8.7 t-Test for a Mean (Unknown Variance)
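If $\sigma$ is unknown, replace $\sigma$ with the sample standard deviation $S$:
$$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$
Under $H_0$, $T \sim t_{n-1}$ (Student's t-distribution with $n-1$ degrees of freedom).
8.8 Chi-Squared Test for Variance
To test $H_0: \sigma^2 = \sigma_0^2$ for $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$:
$$\chi^2 = \frac{(n-1)S^2}{\sigma_0^2}$$
Under $H_0$, $\chi^2 \sim \chi^2_{n-1}$.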
8.9 Chi-Squared Goodness-of-Fit Test
To test whether observed data follow a specified discrete distribution, partition the sample space into $k$ cells with expected counts $e_i$ and observed counts $o_i$. The test statistic is
$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$$
Under $H_0$ (and provided expected counts are sufficiently large), $\chi^2$ is approximately $\chi^2_{k-1-p}$ distributed, where $p$ is the number of parameters estimated from the data.
Problem 8.4. A die is rolled 60 times with the following frequencies: $\{1: 8,\ 2: 12,\ 3: 9,\ 4: 11,\ 5: 13,\ 6: 7\}$. Test whether the die is fair at $\alpha = 0.05$.
Solution
H0: die is fair (each face has probability 1/6). Expected count per face: ei=60/6=10.
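With every expected count equal to 10, the goodness-of-fit statistic for these frequencies can be evaluated directly. In the sketch below the critical value 11.070 for $\chi^2_5$ at $\alpha = 0.05$ is taken from standard tables (an assumption of the sketch, not computed by it), and no parameters are estimated, so the degrees of freedom are $6 - 1 = 5$.

```python
observed = {1: 8, 2: 12, 3: 9, 4: 11, 5: 13, 6: 7}

n = sum(observed.values())    # 60 rolls
expected = n / len(observed)  # 10 per face under H0

chi2 = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1        # k - 1 - p with p = 0 estimated parameters
critical = 11.070             # table value for chi-squared, 5 df, alpha = 0.05

print(round(chi2, 2), df, chi2 > critical)  # 2.8 5 False
```

Since $\chi^2 = 2.8 < 11.070$, the test fails to reject $H_0$: these data give no evidence that the die is unfair.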
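Problem 8.5. A study compares two teaching methods. Method A (20 students): mean score 78, standard deviation 8. Method B (25 students): mean score 72, standard deviation 10. Test $H_0: \mu_A = \mu_B$ vs $H_1: \mu_A \ne \mu_B$ at $\alpha = 0.05$.
Solution
Using Welch's two-sample t-test (unequal variances):
$$T = \frac{78 - 72}{\sqrt{\frac{8^2}{20} + \frac{10^2}{25}}} = \frac{6}{\sqrt{7.2}} \approx 2.236$$
The Welch-Satterthwaite approximation gives $\nu \approx 43$ degrees of freedom. The critical values for a two-sided test at $\alpha = 0.05$ are approximately $\pm 2.017$.
Since $|T| = 2.236 > 2.017$, we reject $H_0$ at the 5% significance level. There is evidence that the two teaching methods produce different mean scores. ■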
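The test statistic and degrees of freedom quoted here are consistent with Welch's unequal-variance t-test. The sketch below recomputes both under that assumption; the variable names are illustrative.

```python
import math

n_a, mean_a, sd_a = 20, 78.0, 8.0
n_b, mean_b, sd_b = 25, 72.0, 10.0

# Estimated variance of each sample mean.
va, vb = sd_a**2 / n_a, sd_b**2 / n_b

t_stat = (mean_a - mean_b) / math.sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom.
df = (va + vb) ** 2 / (va**2 / (n_a - 1) + vb**2 / (n_b - 1))

print(round(t_stat, 3), round(df, 1))  # 2.236 43.0
```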
8.10 Worked Examples
Problem 8.1. A process produces bolts with mean diameter 10mm. A sample of 25 bolts has mean 10.12mm and standard deviation 0.3mm. Test $H_0: \mu = 10$ vs $H_1: \mu \ne 10$ at $\alpha = 0.05$.
Solution
Use the t-test: $T = \frac{10.12 - 10}{0.3/\sqrt{25}} = \frac{0.12}{0.06} = 2$.
Under $H_0$, $T \sim t_{24}$. The critical values are $\pm t_{24,0.025} \approx \pm 2.064$.
Since $|T| = 2 < 2.064$, we fail to reject $H_0$ at the 5% significance level. There is insufficient evidence to conclude the mean diameter differs from 10mm. ■
Problem 8.2. A pharmaceutical company claims a drug reduces blood pressure by 5mmHg on average. In a trial of 50 patients, the observed mean reduction was 4.2mmHg with standard deviation 3.1mmHg. Test the claim at $\alpha = 0.05$.
Solution
$H_0: \mu = 5$ vs $H_1: \mu \ne 5$.
$$T = \frac{4.2 - 5}{3.1/\sqrt{50}} = \frac{-0.8}{0.4384} \approx -1.825$$
Under $H_0$, $T \sim t_{49}$. The critical values for a two-sided test at $\alpha = 0.05$ are approximately $\pm 2.010$.
Since $|T| = 1.825 < 2.010$, we fail to reject $H_0$. There is insufficient evidence to refute the company's claim. ■
Problem 8.3. Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ with $\sigma^2$ known. Derive the likelihood ratio test for $H_0: \mu = \mu_0$ vs $H_1: \mu \ne \mu_0$.
We reject $H_0$ when $\Lambda$ is small, i.e., when $\frac{n(\bar{x} - \mu_0)^2}{\sigma^2}$ is large, which is equivalent to $\left|\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}\right| > c$. This recovers the Z-test. ■
:::caution Common Pitfall
"Failing to reject $H_0$" is not the same as "accepting $H_0$". The test only provides evidence against $H_0$; absence of evidence is not evidence of absence. The distinction is critical in scientific reasoning.
:::
9. Problem Set
Problem 1. Let $A, B, C$ be events with $P(A) = 0.4$, $P(B) = 0.5$, $P(C) = 0.3$, $P(A \cap B) = 0.2$, $P(A \cap C) = 0.1$, $P(B \cap C) = 0.15$, and $P(A \cap B \cap C) = 0.05$. Compute $P(A \cup B \cup C)$.
If you get this wrong, revise: Section 4.2 (Variance).
Problem 18. Let $X_1, \ldots, X_n \sim N(\mu, 1)$ with $\mu$ unknown. Find the likelihood ratio test statistic for $H_0: \mu = \mu_0$ vs $H_1: \mu \ne \mu_0$ and show it is equivalent to the Z-test.
If you get this wrong, revise: Section 7.2 (MLE Procedure).
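Problem 20. Use the delta method to find the asymptotic distribution of $\hat{p}(1 - \hat{p})$ where $\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $X_i \sim \text{Bernoulli}(p)$.
Solution
By the CLT, $\sqrt{n}(\hat{p} - p) \xrightarrow{d} N(0, p(1-p))$.
Let $g(t) = t(1-t) = t - t^2$. Then $g'(t) = 1 - 2t$, so $g'(p) = 1 - 2p$.
By the delta method (for $p \ne 1/2$, so that $g'(p) \ne 0$):
$$\sqrt{n}\left(\hat{p}(1 - \hat{p}) - p(1-p)\right) \xrightarrow{d} N\left(0, p(1-p)(1-2p)^2\right)$$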
If you get this wrong, revise: Section 6.5 (Delta Method) and Section 6.3 (CLT).
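Problem 21. Let $X_1, \ldots, X_n$ be i.i.d. from a distribution with finite mean $\mu$ and finite variance $\sigma^2$. Show that the sample mean is a consistent estimator of $\mu$ using Chebyshev's inequality.
Solution
$E[\bar{X}_n] = \mu$ (unbiased) and $\text{Var}(\bar{X}_n) = \sigma^2/n$.
By Chebyshev's inequality, for any $\varepsilon > 0$:
$$P(|\bar{X}_n - \mu| \ge \varepsilon) \le \frac{\text{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2}$$
As $n \to \infty$, the right side goes to 0, so $\bar{X}_n \xrightarrow{p} \mu$. ■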
If you get this wrong, revise: Section 6.2 (Law of Large Numbers).
Problem 22. A random sample of size 64 is drawn from a population with unknown mean and standard deviation σ=4. Find the probability that the sample mean differs from the population mean by more than 1.
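One way to approach this problem is through the CLT: $\bar{X}$ is approximately $N(\mu, \sigma^2/n)$, so the standard error is $4/\sqrt{64} = 0.5$ and the event $|\bar{X} - \mu| > 1$ corresponds to $|Z| > 2$. The sketch below evaluates that two-sided tail under the normal approximation.

```python
import math

sigma, n, delta = 4.0, 64, 1.0

se = sigma / math.sqrt(n)  # standard error of the mean: 0.5
z = delta / se             # 2.0

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

prob = 2.0 * (1.0 - phi(z))  # P(|Z| > 2)
print(round(prob, 4))  # 0.0455
```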
This is the standard Cauchy distribution. Note that $E[|X/Y|] = \infty$, so the mean does not exist. ■
If you get this wrong, revise: Section 5.6 (Jacobian Method).
Problem 25. Prove that for any events A,B,C:
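$$P(A \cap B \cap C) = P(A)\, P(B \mid A)\, P(C \mid A \cap B)$$
Solution
By the definition of conditional probability applied twice (assuming $P(A \cap B) > 0$):
$$P(B \cap C \mid A) = \frac{P(A \cap B \cap C)}{P(A)}$$
$$P(C \mid A \cap B) = \frac{P(A \cap B \cap C)}{P(A \cap B)}$$
From the second: $P(A \cap B \cap C) = P(C \mid A \cap B)\, P(A \cap B)$. And $P(A \cap B) = P(B \mid A)\, P(A)$.
Substituting: $P(A \cap B \cap C) = P(A)\, P(B \mid A)\, P(C \mid A \cap B)$. ■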
This is the chain rule of probability, which generalises to n events.
If you get this wrong, revise: Section 1.5 (Conditional Probability).
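Problem 26. Show that the Poisson distribution is infinitely divisible: if $X \sim \text{Poisson}(\lambda)$, then $X$ can be expressed as the sum of $n$ i.i.d. random variables for any positive integer $n$.
Solution
The MGF of $X \sim \text{Poisson}(\lambda)$ is $M_X(t) = \exp(\lambda(e^t - 1))$.
For any integer $n \ge 1$, we can write:
$$M_X(t) = \left[\exp\left(\frac{\lambda}{n}(e^t - 1)\right)\right]^n$$
Each factor $\exp\left(\frac{\lambda}{n}(e^t - 1)\right)$ is the MGF of $\text{Poisson}(\lambda/n)$. By Theorem 4.5 and the uniqueness theorem for MGFs, $X$ has the same distribution as $Y_1 + \cdots + Y_n$ where $Y_i \sim \text{Poisson}(\lambda/n)$ are i.i.d. ■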
If you get this wrong, revise: Section 4.3 (MGFs) and Section 3.1 (Poisson Distribution).
Common Pitfalls
Confusing PDF and CDF. PDF $f(x)$: probability density; CDF $F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$. Fix: $F'(x) = f(x)$; for continuous $X$, $P(a < X < b) = F(b) - F(a)$.
Wrong central limit theorem application. The CLT applies to the sample mean, not individual observations, and requires sufficiently large $n$. Fix: for large $n$, $\bar{X}_n$ is approximately $N(\mu, \sigma^2/n)$; precisely, $\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} N(0,1)$.
Confusing type I and type II errors. Type I: rejecting $H_0$ when it is true ($\alpha$). Type II: failing to reject $H_0$ when it is false ($\beta$). Fix: Type I = false positive; Type II = false negative. For a fixed sample size, decreasing one increases the other.