1. Probability Spaces

1.1 Sample Spaces and Events

A probability space is a triple $(\Omega, \mathcal{F}, P)$ where:

- $\Omega$ is the sample space (the set of all possible outcomes).
- $\mathcal{F}$ is a sigma-algebra on $\Omega$.
- $P : \mathcal{F} \to [0, 1]$ is a probability measure.

Definition. A sigma-algebra $\mathcal{F}$ on $\Omega$ is a collection of subsets of $\Omega$ satisfying:

1. $\Omega \in \mathcal{F}$.
2. If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$ (closed under complementation).
3. If $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$ (closed under countable unions).

Definition. A probability measure $P$ satisfies:

1. Non-negativity: $P(A) \geq 0$ for all $A \in \mathcal{F}$.
2. Normalisation: $P(\Omega) = 1$.
3. Countable additivity: if $A_1, A_2, \ldots \in \mathcal{F}$ are pairwise disjoint, then $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$.

1.2 Basic Properties

Proposition 1.1. For any probability space:
1. $P(\emptyset) = 0$.
2. $P(A^c) = 1 - P(A)$.
3. If $A \subseteq B$, then $P(A) \leq P(B)$.
4. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ (inclusion-exclusion).
5. Boole's inequality: $P\left(\bigcup_{i=1}^{n} A_i\right) \leq \sum_{i=1}^{n} P(A_i)$.
6. Bonferroni inequality: $P\left(\bigcap_{i=1}^{n} A_i\right) \geq 1 - \sum_{i=1}^{n} (1 - P(A_i))$.

Proof. (1) Apply countable additivity to the disjoint union $\Omega = \Omega \cup \emptyset \cup \emptyset \cup \cdots$: $1 = 1 + P(\emptyset) + P(\emptyset) + \cdots$, so $P(\emptyset) = 0$.

(3) $B = A \cup (B \setminus A)$ is a disjoint union, so $P(B) = P(A) + P(B \setminus A) \geq P(A)$.

(4) $A \cup B = A \cup (B \setminus A)$ and $B = (A \cap B) \cup (B \setminus A)$ are disjoint unions, so $P(A \cup B) = P(A) + P(B \setminus A) = P(A) + P(B) - P(A \cap B)$. $\blacksquare$
1.3 Conditional Probability and Independence

Definition. The conditional probability of $A$ given $B$ (with $P(B) > 0$) is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
Theorem 1.2 (Law of Total Probability). If $B_1, \ldots, B_n$ form a partition of $\Omega$ with $P(B_i) > 0$ for all $i$, then

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)$$

Theorem 1.3 (Bayes' Theorem). Under the same conditions, and provided $P(A) > 0$:

$$P(B_j \mid A) = \frac{P(A \mid B_j)\, P(B_j)}{\sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)}$$
Definition. Events $A$ and $B$ are independent if $P(A \cap B) = P(A)\,P(B)$.

Proposition 1.4. If $A$ and $B$ are independent with $P(B) > 0$, then $P(A \mid B) = P(A)$.

Proof. $P(A \mid B) = P(A \cap B)/P(B) = P(A)P(B)/P(B) = P(A)$. $\blacksquare$
Definition. Events $A_1, \ldots, A_n$ are mutually independent if for every subset $J \subseteq \{1, \ldots, n\}$:

$$P\left(\bigcap_{j \in J} A_j\right) = \prod_{j \in J} P(A_j)$$
Pairwise independence does not imply mutual independence.
Worked Example: Pairwise vs Mutual Independence

Solution. Roll two fair dice. Let $A$ = "first die is even", $B$ = "second die is even", $C$ = "sum is even".

$P(A) = P(B) = P(C) = 1/2$.

$P(A \cap B) = 1/4 = P(A)P(B)$.

$P(A \cap C) = P(\text{first even, sum even}) = P(\text{both even}) = 1/4 = P(A)P(C)$, because when the first die is even the sum is even exactly when the second die is even.

$P(B \cap C) = 1/4 = P(B)P(C)$ by the same argument. So $A$, $B$, $C$ are pairwise independent.

But $P(A \cap B \cap C) = P(\text{both even, sum even}) = P(\text{both even}) = 1/4 \neq 1/8 = P(A)P(B)P(C)$.

So $A$, $B$, $C$ are pairwise independent but not mutually independent. $\blacksquare$
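The dice example can be checked by brute-force enumeration. The sketch below is only an illustration: the helper names `prob` and `both` are made up for this snippet, and exact fractions are used so that the equalities are tested without rounding.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of two fair dice.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    favourable = sum(1 for w in omega if event(w))
    return Fraction(favourable, len(omega))

A = lambda w: w[0] % 2 == 0            # first die is even
B = lambda w: w[1] % 2 == 0            # second die is even
C = lambda w: (w[0] + w[1]) % 2 == 0   # sum is even

def both(e1, e2):
    """Intersection of two events."""
    return lambda w: e1(w) and e2(w)

# Pairwise independence: the product rule holds for every pair.
pairwise = all(
    prob(both(e1, e2)) == prob(e1) * prob(e2)
    for e1, e2 in [(A, B), (A, C), (B, C)]
)

# Mutual independence additionally needs the product rule for the triple.
triple = prob(lambda w: A(w) and B(w) and C(w))
mutual = pairwise and triple == prob(A) * prob(B) * prob(C)

print(pairwise, mutual, triple)
```

Enumeration like this only works for small finite sample spaces, but it is a quick way to sanity-check a hand computation.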
2. Random Variables

2.1 Definition and Distribution Functions

Definition. A random variable is a measurable function $X : \Omega \to \mathbb{R}$. The cumulative distribution function (CDF) of $X$ is

$$F_X(x) = P(X \leq x)$$
Proposition 2.1 (Properties of the CDF).
1. $F$ is non-decreasing: if $a \leq b$, then $F(a) \leq F(b)$.
2. $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$.
3. $F$ is right-continuous: $\lim_{x \to a^+} F(x) = F(a)$.

Proof. (1) If $a \leq b$, then $\{X \leq a\} \subseteq \{X \leq b\}$, so $F(a) = P(X \leq a) \leq P(X \leq b) = F(b)$ by Proposition 1.1(3).

(2) As $x \to -\infty$, the events $\{X \leq x\}$ decrease to $\emptyset$, so by continuity from above of probability measures, $F(x) \to 0$. As $x \to +\infty$, the events increase to $\Omega$, so by continuity from below, $F(x) \to 1$.

(3) As $x \to a^+$, the events $\{X \leq x\}$ decrease to $\{X \leq a\}$, so continuity from above gives right-continuity. $\blacksquare$
2.2 Discrete Random Variables

A random variable is discrete if its range is countable. The probability mass function (PMF) is $p_X(x) = P(X = x)$.

Definition (Expected Value). For a discrete random variable:

$$E[X] = \sum_{x} x\, p_X(x)$$

provided the sum converges absolutely.

Definition (Variance). $\mathrm{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$, where $\mu = E[X]$.
Proposition 2.2 (Linearity of Expectation). $E[aX + bY] = aE[X] + bE[Y]$ for any random variables $X$, $Y$ with finite expectations and any constants $a$, $b$.

Proof. Direct computation from the definition of expected value. For the discrete case, summing the joint PMF over one variable gives the marginal PMF of the other, so

$$E[aX + bY] = \sum_{x,y} (ax + by)\, p_{X,Y}(x,y) = a\sum_x x\, p_X(x) + b\sum_y y\, p_Y(y) = aE[X] + bE[Y]$$

$\blacksquare$
2.3 Continuous Random Variables

A random variable is continuous if its CDF is absolutely continuous, i.e., there exists a probability density function (PDF) $f_X$ such that

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt$$

Key properties:

1. $f_X(x) \geq 0$ for all $x$.
2. $\int_{-\infty}^{\infty} f_X(x)\, dx = 1$.
3. $P(a \leq X \leq b) = \int_a^b f_X(x)\, dx$.
4. $P(X = a) = 0$ for any single point $a$.

2.4 Common Distributions

Discrete distributions:

| Distribution | PMF | $E[X]$ | $\mathrm{Var}(X)$ |
| --- | --- | --- | --- |
| Bernoulli$(p)$ | $p^x(1-p)^{1-x}$, $x \in \{0,1\}$ | $p$ | $p(1-p)$ |
| Binomial$(n,p)$ | $\binom{n}{x}p^x(1-p)^{n-x}$ | $np$ | $np(1-p)$ |
| Poisson$(\lambda)$ | $e^{-\lambda}\lambda^x / x!$ | $\lambda$ | $\lambda$ |
| Geometric$(p)$ | $(1-p)^{x-1}p$, $x \geq 1$ | $1/p$ | $(1-p)/p^2$ |
Continuous distributions:
| Distribution | PDF | $E[X]$ | $\mathrm{Var}(X)$ |
| --- | --- | --- | --- |
| Uniform$(a,b)$ | $1/(b-a)$ on $[a,b]$ | $(a+b)/2$ | $(b-a)^2/12$ |
| Exponential$(\lambda)$ | $\lambda e^{-\lambda x}$, $x \geq 0$ | $1/\lambda$ | $1/\lambda^2$ |
| $N(\mu, \sigma^2)$ | $\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)}$ | $\mu$ | $\sigma^2$ |
2.5 The Normal Distribution

Definition. $X \sim N(\mu, \sigma^2)$ if $X$ has PDF $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$.

Theorem 2.3 (Standardisation). If $X \sim N(\mu, \sigma^2)$, then $Z = (X - \mu)/\sigma \sim N(0, 1)$.

Proof. The CDF of $Z$ is

$$P(Z \leq z) = P(X \leq \mu + \sigma z) = \int_{-\infty}^{\mu + \sigma z} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(t-\mu)^2/(2\sigma^2)}\, dt$$

Substituting $u = (t - \mu)/\sigma$, so that $dt = \sigma\, du$:

$$P(Z \leq z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du$$

which is the CDF of $N(0, 1)$. $\blacksquare$
Theorem 2.4 (Moment Generating Function). If $X \sim N(\mu, \sigma^2)$, then

$$M_X(t) = E[e^{tX}] = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)$$

Proof. $M_X(t) = \int_{-\infty}^{\infty} e^{tx}\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}\, dx$. Completing the square in the exponent and evaluating the Gaussian integral gives the result. $\blacksquare$
2.6 Moment Generating Functions

Definition. The moment generating function (MGF) of $X$ is $M_X(t) = E[e^{tX}]$ (when it exists in a neighbourhood of $t = 0$).

Theorem 2.5. If the MGF exists in a neighbourhood of 0, it uniquely determines the distribution. Furthermore, $E[X^n] = M_X^{(n)}(0)$, the $n$-th derivative of $M_X$ evaluated at $0$.
2.7 Worked Examples

Problem. Let $X \sim \mathrm{Poisson}(3)$ and $Y \sim \mathrm{Poisson}(5)$ be independent. Find the distribution of $X + Y$.

Solution. The MGF of $X \sim \mathrm{Poisson}(\lambda)$ is $M_X(t) = e^{\lambda(e^t - 1)}$. By independence,

$$M_{X+Y}(t) = M_X(t) \cdot M_Y(t) = e^{3(e^t - 1)} \cdot e^{5(e^t - 1)} = e^{8(e^t - 1)}$$

This is the MGF of $\mathrm{Poisson}(8)$. Since the MGF uniquely determines the distribution, $X + Y \sim \mathrm{Poisson}(8)$.

$\blacksquare$
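As a numerical cross-check of the MGF argument, the PMF of $X + Y$ can also be computed by direct convolution and compared with the $\mathrm{Poisson}(8)$ PMF. This is a minimal sketch; the function names `poisson_pmf` and `pmf_of_sum` are illustrative, and only the first 30 values of $k$ are compared.

```python
import math

def poisson_pmf(lam, k):
    """P(N = k) for N ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def pmf_of_sum(lam_x, lam_y, k):
    """P(X + Y = k) for independent Poisson X and Y, by direct convolution."""
    return sum(
        poisson_pmf(lam_x, j) * poisson_pmf(lam_y, k - j)
        for j in range(k + 1)
    )

# Largest discrepancy between the convolution and the Poisson(8) PMF.
max_gap = max(abs(pmf_of_sum(3, 5, k) - poisson_pmf(8, k)) for k in range(30))
print(max_gap)
```

The gap should be at the level of floating-point rounding error, which agrees with the conclusion above but is not a substitute for the uniqueness argument.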
Worked Example: Minimum of Exponential Random Variables

Problem. Let $X_1, \ldots, X_n$ be independent with $X_i \sim \mathrm{Exp}(\lambda_i)$. Find the distribution of $M = \min(X_1, \ldots, X_n)$.

Solution. The minimum exceeds $t$ exactly when every $X_i$ exceeds $t$, so for $t \geq 0$, by independence:

$$P(M > t) = P(X_1 > t, \ldots, X_n > t) = \prod_{i=1}^{n} P(X_i > t) = \prod_{i=1}^{n} e^{-\lambda_i t} = e^{-(\lambda_1 + \cdots + \lambda_n)t}$$

So $P(M \leq t) = 1 - e^{-\lambda t}$ where $\lambda = \sum_{i=1}^{n} \lambda_i$. This means $M \sim \mathrm{Exp}(\lambda)$. $\blacksquare$
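A quick simulation can make this result concrete. The sketch below assumes three hypothetical rates (1, 2 and 3) and compares the sample mean of the simulated minima with the theoretical mean $1/\lambda = 1/6$. The seed and the sample size are arbitrary choices.

```python
import random

random.seed(0)                 # fixed seed so the run is reproducible
rates = [1.0, 2.0, 3.0]        # hypothetical rates lambda_1, lambda_2, lambda_3
n_samples = 100_000

# random.expovariate takes the rate parameter, so its mean is 1 / rate.
minima = [
    min(random.expovariate(r) for r in rates)
    for _ in range(n_samples)
]

sample_mean = sum(minima) / n_samples
theory_mean = 1 / sum(rates)   # mean of Exp(lambda_1 + lambda_2 + lambda_3)
print(sample_mean, theory_mean)
```

Agreement of the means is only a consistency check. Comparing the empirical survival function with $e^{-\lambda t}$ at several values of $t$ would be a stronger test.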
3. Joint Distributions and Independence

3.1 Joint Distribution Functions

Definition. The joint CDF of $(X, Y)$ is $F_{X,Y}(x, y) = P(X \leq x, Y \leq y)$.

Definition. The joint PDF (for continuous random variables) is a function $f_{X,Y}(x, y) \geq 0$ such that

$$F_{X,Y}(x, y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f_{X,Y}(u, v)\, dv\, du$$

Definition. The marginal PDF of $X$ is $f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy$.
3.2 Covariance and Correlation

Definition. The covariance of $X$ and $Y$ is

$$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$$

Proposition 2.6. $\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$ and $\mathrm{Cov}(aX + b, cY + d) = ac\,\mathrm{Cov}(X, Y)$.

Definition. The correlation coefficient is

$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$

Theorem 2.7 (Cauchy-Schwarz for Random Variables). $|\rho(X, Y)| \leq 1$, with equality if and only if $Y = aX + b$ almost surely for some constants $a \neq 0$ and $b$.
3.3 Independence of Random Variables

Definition. $X$ and $Y$ are independent if $F_{X,Y}(x, y) = F_X(x)\, F_Y(y)$ for all $x, y$.

For continuous random variables, this is equivalent to $f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$.

Proposition 2.8. If $X$ and $Y$ are independent, then $\mathrm{Cov}(X, Y) = 0$. The converse is false.
Worked Example: Uncorrelated but Dependent

Solution. Let $X \sim N(0, 1)$ and $Y = X^2$. Then $\mathrm{Cov}(X, Y) = E[X^3] - E[X]E[X^2] = 0 - 0 \cdot 1 = 0$ (since the third moment of a standard normal is 0).

But $Y$ is completely determined by $X$, so they are not independent. For instance, $P(|X| \leq 1, Y > 1) = 0$, while $P(|X| \leq 1)$ and $P(Y > 1)$ are both positive. $\blacksquare$
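The same phenomenon can be verified exactly in a small discrete analogue. Note that this is a swapped-in example, not the normal case above: it assumes $X$ is uniform on $\{-1, 0, 1\}$ and $Y = X^2$, so that every quantity is a finite sum of fractions.

```python
from fractions import Fraction

# Discrete stand-in for the example: X uniform on {-1, 0, 1}, Y = X^2.
support = [-1, 0, 1]
p = Fraction(1, 3)              # P(X = x) for each x in the support

E_X = sum(p * x for x in support)
E_Y = sum(p * x**2 for x in support)
E_XY = sum(p * x * x**2 for x in support)   # E[XY] = E[X^3]
cov = E_XY - E_X * E_Y

# Independence would require P(X = 0, Y = 0) = P(X = 0) * P(Y = 0).
p_joint = p                     # X = 0 forces Y = 0, so the joint probability is 1/3
p_product = p * p               # P(Y = 0) = P(X = 0) = 1/3
print(cov, p_joint, p_product)
```

The covariance vanishes because the distribution of $X$ is symmetric about 0, yet the product rule fails for the event $\{X = 0, Y = 0\}$.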
4. Limit Theorems

4.1 The Law of Large Numbers

Theorem 4.1 (Weak Law of Large Numbers). Let $X_1, X_2, \ldots$ be i.i.d. with $E[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 < \infty$. Then for every $\varepsilon > 0$:

$$\lim_{n \to \infty} P\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \geq \varepsilon\right) = 0$$

Proof. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ denote the sample mean. Then $E[\bar{X}_n] = \mu$ by linearity, and $\mathrm{Var}(\bar{X}_n) = \sigma^2/n$ by independence. By Chebyshev's inequality:

$$P(|\bar{X}_n - \mu| \geq \varepsilon) \leq \frac{\mathrm{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0 \quad \text{as } n \to \infty$$

$\blacksquare$
Theorem 4.2 (Strong Law of Large Numbers). Under the same conditions:

$$P\left(\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} X_i = \mu\right) = 1$$
The sample mean converges to the population mean almost surely.
4.2 The Central Limit Theorem

Theorem 4.3 (Central Limit Theorem). Let $X_1, X_2, \ldots$ be i.i.d. with $E[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 \in (0, \infty)$. Then

$$\frac{S_n - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0, 1)$$

where $S_n = \sum_{i=1}^{n} X_i$ and $\xrightarrow{d}$ denotes convergence in distribution.

Equivalently, for large $n$:

$$P\left(\frac{S_n - n\mu}{\sigma\sqrt{n}} \leq z\right) \approx \Phi(z)$$

where $\Phi$ is the CDF of the standard normal.
Proof (using characteristic functions). Let $Y_i = X_i - \mu$, so that $E[Y_i] = 0$ and $E[Y_i^2] = \sigma^2$, and let $\varphi_Y(t) = E[e^{itY_1}]$ be the characteristic function of $Y_1$. Since $(S_n - n\mu)/(\sigma\sqrt{n}) = \frac{1}{\sigma\sqrt{n}}\sum_{i=1}^{n} Y_i$ and the $Y_i$ are i.i.d., the characteristic function of the standardised sum is:

$$\varphi_n(t) = \left[\varphi_Y\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n$$

In terms of the characteristic function $\varphi_X$ of $X_1$ this equals $\left[\varphi_X\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n \cdot e^{-it\sqrt{n}\mu/\sigma}$. Centring first removes the first-order term, which is of order $1/\sqrt{n}$ and would otherwise prevent a direct use of the limit below.

Expanding $\varphi_Y$ around 0: $\varphi_Y(s) = 1 - \frac{\sigma^2 s^2}{2} + o(s^2)$. Substituting $s = t/(\sigma\sqrt{n})$:

$$\varphi_n(t) = \left[1 - \frac{t^2}{2n} + o\left(\frac{1}{n}\right)\right]^n$$

Using $\lim_{n \to \infty}(1 + a_n/n)^n = e^{\lim a_n}$ with $a_n \to -t^2/2$:

$$\lim_{n \to \infty} \varphi_n(t) = e^{-t^2/2}$$

This is the characteristic function of $N(0, 1)$. By Lévy's continuity theorem, the convergence in distribution follows. $\blacksquare$
4.3 Worked Examples

Problem. A fair die is rolled 100 times. Approximate the probability that the sum exceeds 370.

Solution. Let $X_i$ be the value of the $i$-th roll. Then $E[X_i] = 7/2 = 3.5$ and $\mathrm{Var}(X_i) = 35/12 \approx 2.917$.

Let $S_{100} = \sum_{i=1}^{100} X_i$, so $E[S_{100}] = 350$ and $\mathrm{Var}(S_{100}) = 100 \cdot 35/12 \approx 291.7$. By the CLT:

$$\frac{S_{100} - 350}{\sqrt{100 \cdot 35/12}} \approx N(0, 1)$$

$$P(S_{100} > 370) \approx P\left(Z > \frac{370 - 350}{\sqrt{291.7}}\right) \approx P(Z > 1.17) \approx 0.121$$

$\blacksquare$
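For a problem this small, the normal approximation can be compared with the exact distribution of the sum. The sketch below computes the CLT estimate with `math.erfc` and the exact tail probability by convolving the PMF of one die 100 times. The names `normal_tail`, `clt_estimate` and `exact` are illustrative only.

```python
import math

def normal_tail(z):
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

n, mu, var = 100, 3.5, 35 / 12
z = (370 - n * mu) / math.sqrt(n * var)
clt_estimate = normal_tail(z)

# Exact distribution of the sum: convolve with one die's PMF, n times.
dist = {0: 1.0}
for _ in range(n):
    new = {}
    for total, p in dist.items():
        for face in range(1, 7):
            new[total + face] = new.get(total + face, 0.0) + p / 6
    dist = new

# The sum is integer-valued, so "exceeds 370" means "at least 371".
exact = sum(p for total, p in dist.items() if total > 370)
print(round(z, 2), clt_estimate, exact)
```

Any difference between the two numbers reflects both the finite-$n$ error of the CLT and the fact that the sum is discrete. A continuity correction (using 370.5 in place of 370) is a common refinement.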
Worked Example: Sample Mean Distribution

Problem. A population has mean 50 and standard deviation 10. Find the probability that the mean of a sample of 64 observations exceeds 52.

Solution. By the CLT, $\bar{X} \approx N(50, 100/64) = N(50, 1.5625)$, so the standard deviation of $\bar{X}$ is $\sqrt{1.5625} = 1.25$.

$$P(\bar{X} > 52) \approx P\left(Z > \frac{52 - 50}{\sqrt{1.5625}}\right) = P(Z > 1.6) \approx 0.0548$$

$\blacksquare$
4.4 Common Pitfalls

- The CLT does not apply to small samples. The CLT is an asymptotic result. For small $n$ (a common rule of thumb is $n < 30$), the normal approximation can be poor unless the underlying distribution is already close to normal. Use the Berry-Esseen theorem for finite-sample bounds.
- Independence is critical for the LLN and CLT. If the $X_i$ are dependent, the sample mean may not converge to the population mean, or the convergence rate may differ. For stationary sequences with weak dependence, versions of these theorems still hold, but the proofs are more involved.
- Convergence in distribution is weaker than convergence in probability. The CLT gives convergence in distribution of the standardised sum, not convergence of the sum itself. The weak LLN gives convergence in probability of the sample mean to $\mu$.

5.1 Distribution of a Function of a Random Variable

Theorem 5.1 (CDF Method). If $Y = g(X)$, where $X$ is a continuous random variable and $g$ is strictly monotone, then

$$F_Y(y) = P(g(X) \leq y) = \begin{cases} F_X(g^{-1}(y)) & \text{if } g \text{ is increasing} \\ 1 - F_X(g^{-1}(y)) & \text{if } g \text{ is decreasing} \end{cases}$$

Theorem 5.2 (Change of Variables). If $Y = g(X)$ where $g$ is differentiable and strictly monotone, then

$$f_Y(y) = f_X(g^{-1}(y)) \cdot \left|\frac{d}{dy} g^{-1}(y)\right|$$
Worked Example: Distribution of $X^2$ where $X \sim N(0, 1)$

Solution. Let $Y = X^2$ where $X \sim N(0, 1)$. Here $g(x) = x^2$ is not monotone, so we work directly with the CDF. For $y \geq 0$:

$$F_Y(y) = P(X^2 \leq y) = P(-\sqrt{y} \leq X \leq \sqrt{y}) = \Phi(\sqrt{y}) - \Phi(-\sqrt{y}) = 2\Phi(\sqrt{y}) - 1$$

Differentiating for $y > 0$, with $\phi = \Phi'$ the standard normal PDF:

$$f_Y(y) = \frac{d}{dy}\left[2\Phi(\sqrt{y}) - 1\right] = 2\phi(\sqrt{y}) \cdot \frac{1}{2\sqrt{y}} = \frac{1}{\sqrt{2\pi y}}\, e^{-y/2}$$

This is the PDF of the $\chi^2(1)$ distribution. $\blacksquare$
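The differentiation step can be sanity-checked numerically. The sketch below compares a central finite difference of $F_Y(y) = 2\Phi(\sqrt{y}) - 1$ with the closed-form density at a single point. The evaluation point $y = 1$ and the step size are arbitrary choices, and the function names are illustrative.

```python
import math

def Phi(x):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def cdf_Y(y):
    """CDF of Y = X^2 for X ~ N(0, 1), valid for y >= 0."""
    return 2 * Phi(math.sqrt(y)) - 1

def pdf_Y(y):
    """Claimed density of Y: exp(-y/2) / sqrt(2*pi*y), valid for y > 0."""
    return math.exp(-y / 2) / math.sqrt(2 * math.pi * y)

y, h = 1.0, 1e-5
numeric_derivative = (cdf_Y(y + h) - cdf_Y(y - h)) / (2 * h)
print(numeric_derivative, pdf_Y(y))
```

The check should be kept away from $y = 0$, where the density is unbounded and a finite difference is unreliable.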
5.2 Convolution

Theorem 5.3. If $X$ and $Y$ are independent continuous random variables, the PDF of $Z = X + Y$ is

$$f_Z(z) = (f_X * f_Y)(z) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx$$

Proof. By independence, $f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$, so

$$F_Z(z) = P(X + Y \leq z) = \iint_{x+y \leq z} f_{X,Y}(x, y)\, dx\, dy = \int_{-\infty}^{\infty} f_X(x)\left[\int_{-\infty}^{z-x} f_Y(y)\, dy\right] dx = \int_{-\infty}^{\infty} f_X(x)\, F_Y(z - x)\, dx$$

Differentiating with respect to $z$ under the integral sign: $f_Z(z) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx$. $\blacksquare$
Corollary 5.4. The sum of independent normals is normal: if $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ are independent, then $X + Y \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.

Proof. The convolution of two Gaussian PDFs is Gaussian. The quickest way to see this is via the MGF: by independence and Theorem 2.4, $M_{X+Y}(t) = M_X(t)M_Y(t) = \exp\left((\mu_1 + \mu_2)t + (\sigma_1^2 + \sigma_2^2)t^2/2\right)$, which is the MGF of $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$. The claim follows because the MGF determines the distribution (Theorem 2.5). $\blacksquare$
Common Pitfalls

- Dropping negative signs during algebraic manipulation. Substitute back to verify your answer.
- Misreading the question, particularly with 'hence' vs 'hence or otherwise': the former requires using previous work.
- Confusing the domain and range of functions, or not considering restrictions (e.g., a denominator cannot be zero).
- Confusing $P(A \mid B)$ with $P(B \mid A)$. These are related by Bayes' theorem but are not equal in general.
Worked Examples

Example 1: Law of Total Probability

Problem. Factory A produces 60% of items and factory B produces 40%. Defect rates are 2% and 5% respectively. Find the probability a randomly selected item is defective.

Solution. $P(D) = P(D \mid A)P(A) + P(D \mid B)P(B) = 0.02 \times 0.6 + 0.05 \times 0.4 = 0.012 + 0.020 = 0.032$.

Given a defective item, $P(A \mid D) = \frac{P(D \mid A)P(A)}{P(D)} = \frac{0.012}{0.032} = 0.375$ (Bayes' theorem).

$\blacksquare$
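The same computation generalises to any finite partition. The sketch below redoes the factory example with exact fractions. The dictionary names `prior`, `defect_rate` and `posterior` are illustrative.

```python
from fractions import Fraction

# Partition of the sample space: which factory made the item.
prior = {"A": Fraction(60, 100), "B": Fraction(40, 100)}
# Conditional probability of a defect given the factory.
defect_rate = {"A": Fraction(2, 100), "B": Fraction(5, 100)}

# Law of total probability: P(D) = sum over factories of P(D | f) P(f).
p_defect = sum(defect_rate[f] * prior[f] for f in prior)

# Bayes' theorem: P(f | D) = P(D | f) P(f) / P(D).
posterior = {f: defect_rate[f] * prior[f] / p_defect for f in prior}

print(p_defect, posterior["A"], posterior["B"])
```

The posterior probabilities sum to 1 by construction, which is a useful check when there are more than two factories.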
Example 2: Generating Functions

Problem. A fair coin is tossed until the first head appears. Find the expected number of tosses using the probability generating function.

Solution. $X \sim \text{Geo}(p = 0.5)$, so $P(X = k) = 0.5^{k-1} \cdot 0.5 = 0.5^k$ for $k = 1, 2, \ldots$

PGF: $G(s) = E[s^X] = \sum_{k=1}^{\infty} 0.5^k s^k = \frac{0.5s}{1 - 0.5s}$ for $|s| < 2$.

Differentiating, $G'(s) = \frac{0.5}{(1 - 0.5s)^2}$, so $E(X) = G'(1) = \frac{0.5}{(1-0.5)^2} = \frac{0.5}{0.25} = 2$.

$\blacksquare$
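Both routes to $E(X)$ can be checked numerically: differentiating the closed-form PGF at $s = 1$, and summing $k \cdot P(X = k)$ directly. The sketch below is illustrative only. The step size `h` and the truncation point of the series are arbitrary choices.

```python
def G(s):
    """PGF of the number of tosses until the first head of a fair coin."""
    return 0.5 * s / (1 - 0.5 * s)

# Central finite difference for G'(1); G is smooth for |s| < 2.
h = 1e-6
derivative_at_1 = (G(1 + h) - G(1 - h)) / (2 * h)

# Direct expectation: sum of k * P(X = k), truncated where terms are negligible.
series_mean = sum(k * 0.5**k for k in range(1, 200))

print(derivative_at_1, series_mean)
```

Note that $G(1) = 1$, as it must be for any PGF, since the probabilities sum to 1.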
Summary

- Sample spaces, events, and sigma-algebras provide the rigorous foundation for probability theory.
- Random variables: discrete (PMF) and continuous (PDF); CDF $F(x) = P(X \leq x)$.
- Expectation: $E(X) = \sum_x x \cdot P(X = x)$ or $\int x f(x)\,dx$; linearity $E(aX + bY) = aE(X) + bE(Y)$.
- Variance: $\text{Var}(X) = E(X^2) - [E(X)]^2$; $\text{Var}(aX + b) = a^2\text{Var}(X)$.
- Generating functions (PGF, MGF) encode distribution information; moments are obtained by differentiation at specific points ($s = 1$ for the PGF, $t = 0$ for the MGF).