Probability Theory

1. Probability Spaces

1.1 Sample Spaces and Events

A probability space is a triple $(\Omega, \mathcal{F}, P)$ where:

  • $\Omega$ is the sample space (the set of all possible outcomes).
  • $\mathcal{F}$ is a sigma-algebra on $\Omega$.
  • $P : \mathcal{F} \to [0, 1]$ is a probability measure.

Definition. A sigma-algebra $\mathcal{F}$ on $\Omega$ is a collection of subsets of $\Omega$ satisfying:

  1. $\Omega \in \mathcal{F}$.
  2. If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$ (closed under complementation).
  3. If $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$ (closed under countable unions).

Definition. A probability measure $P$ satisfies:

  1. Non-negativity: $P(A) \geq 0$ for all $A \in \mathcal{F}$.
  2. Normalisation: $P(\Omega) = 1$.
  3. Countable additivity: If $A_1, A_2, \ldots \in \mathcal{F}$ are pairwise disjoint, then $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$.

1.2 Basic Properties

Proposition 1.1. For any probability space and any events $A, B, A_1, \ldots, A_n \in \mathcal{F}$:

  1. $P(\emptyset) = 0$.
  2. $P(A^c) = 1 - P(A)$.
  3. If $A \subseteq B$, then $P(A) \leq P(B)$.
  4. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ (inclusion-exclusion).
  5. Boole's inequality: $P\left(\bigcup_{i=1}^{n} A_i\right) \leq \sum_{i=1}^{n} P(A_i)$.
  6. Bonferroni inequality: $P\left(\bigcap_{i=1}^{n} A_i\right) \geq 1 - \sum_{i=1}^{n} (1 - P(A_i))$.

Proof. (1) Apply countable additivity to the disjoint union $\Omega = \Omega \cup \emptyset \cup \emptyset \cup \cdots$: $1 = 1 + P(\emptyset) + P(\emptyset) + \cdots$, so $P(\emptyset) = 0$.

(3) $B = A \cup (B \setminus A)$ is a disjoint union, so $P(B) = P(A) + P(B \setminus A) \geq P(A)$.

(4) $P(A \cup B) = P(A) + P(B \setminus A) = P(A) + P(B) - P(A \cap B)$. $\blacksquare$

1.3 Conditional Probability and Independence

Definition. The conditional probability of $A$ given $B$ (with $P(B) > 0$) is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Theorem 1.2 (Law of Total Probability). If $B_1, \ldots, B_n$ form a partition of $\Omega$ with $P(B_i) > 0$ for all $i$, then

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)$$

Theorem 1.3 (Bayes' Theorem). Under the same conditions, and for $P(A) > 0$:

$$P(B_j \mid A) = \frac{P(A \mid B_j)\, P(B_j)}{\sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)}$$

Definition. Events $A$ and $B$ are independent if $P(A \cap B) = P(A)\,P(B)$.

Proposition 1.4. If $A$ and $B$ are independent with $P(B) > 0$, then $P(A \mid B) = P(A)$.

Proof. $P(A \mid B) = P(A \cap B)/P(B) = P(A)P(B)/P(B) = P(A)$. $\blacksquare$

Definition. Events $A_1, \ldots, A_n$ are mutually independent if for every subset $J \subseteq \{1, \ldots, n\}$:

$$P\left(\bigcap_{j \in J} A_j\right) = \prod_{j \in J} P(A_j)$$

Pairwise independence does not imply mutual independence.

Worked Example: Pairwise vs Mutual Independence

Solution. Roll two fair dice. Let $A$ = "first die is even", $B$ = "second die is even", $C$ = "sum is even".

$P(A) = P(B) = P(C) = 1/2$.

$P(A \cap B) = 1/4 = P(A)P(B)$. Given that the first die is even, the sum is even exactly when the second die is even, so $P(A \cap C) = P(\text{first even, sum even}) = P(\text{first even, second even}) = 1/4 = P(A)P(C)$.

$P(B \cap C) = 1/4 = P(B)P(C)$. So $A$, $B$, $C$ are pairwise independent.

But $P(A \cap B \cap C) = P(\text{both even, sum even}) = P(\text{both even}) = 1/4 \neq 1/8 = P(A)P(B)P(C)$.

So $A$, $B$, $C$ are pairwise independent but not mutually independent. $\blacksquare$
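
The example above can be checked by brute force over the 36 equally likely outcomes. The following is a minimal sketch using exact fractions; the helper names `prob` and `both` are illustrative and not part of the original notes.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event given as a predicate on (d1, d2)."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

def A(o):
    return o[0] % 2 == 0            # first die is even

def B(o):
    return o[1] % 2 == 0            # second die is even

def C(o):
    return (o[0] + o[1]) % 2 == 0   # sum is even

def both(e1, e2):
    """Intersection of two events."""
    return lambda o: e1(o) and e2(o)

# Pairwise independence: the product rule holds for every pair.
pairwise_ok = all(
    prob(both(e1, e2)) == prob(e1) * prob(e2)
    for e1, e2 in [(A, B), (A, C), (B, C)]
)

# Mutual independence additionally needs the product rule for the triple.
triple = prob(lambda o: A(o) and B(o) and C(o))
mutual_ok = triple == prob(A) * prob(B) * prob(C)

print(pairwise_ok, mutual_ok, triple)  # True False 1/4
```

Exact `Fraction` arithmetic is used so that the equality checks are not affected by floating-point rounding.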

2. Random Variables

2.1 Definition and Distribution Functions

Definition. A random variable is a measurable function $X : \Omega \to \mathbb{R}$. The cumulative distribution function (CDF) of $X$ is

$$F_X(x) = P(X \leq x)$$

Proposition 2.1 (Properties of the CDF).

  1. $F$ is non-decreasing: if $a \leq b$, then $F(a) \leq F(b)$.
  2. $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$.
  3. $F$ is right-continuous: $\lim_{x \to a^+} F(x) = F(a)$.

Proof. (1) If $a \leq b$, then $\{X \leq a\} \subseteq \{X \leq b\}$, so $F(a) = P(X \leq a) \leq P(X \leq b) = F(b)$ by Proposition 1.1(3).

(2) As $x \to -\infty$, the events $\{X \leq x\}$ decrease to $\emptyset$, so by continuity from above of probability measures, $F(x) \to 0$. As $x \to +\infty$, the events increase to $\Omega$, so $F(x) \to 1$.

(3) As $x \to a^+$, the events $\{X \leq x\}$ decrease to $\{X \leq a\}$, giving right-continuity. $\blacksquare$

2.2 Discrete Random Variables

A random variable is discrete if its range is countable. The probability mass function (PMF) is $p_X(x) = P(X = x)$.

Definition (Expected Value). For a discrete random variable:

$$E[X] = \sum_{x} x\, p_X(x)$$

provided the sum converges absolutely.

Definition (Variance). $\mathrm{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$ where $\mu = E[X]$.

Proposition 2.2 (Linearity of Expectation). $E[aX + bY] = aE[X] + bE[Y]$ for any random variables $X$, $Y$ with finite expectations and constants $a$, $b$.

Proof. Direct computation from the definition of expected value. For the discrete case, summing the joint PMF over one variable gives the marginal PMF of the other, so:

$$E[aX + bY] = \sum_{x,y} (ax + by)\, p_{X,Y}(x,y) = a\sum_x x\, p_X(x) + b\sum_y y\, p_Y(y) = aE[X] + bE[Y]$$

$\blacksquare$

2.3 Continuous Random Variables

A random variable is continuous if its CDF is absolutely continuous, i.e., there exists a probability density function (PDF) $f_X$ such that

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt$$

Key properties:

  1. $f_X(x) \geq 0$ for all $x$.
  2. $\int_{-\infty}^{\infty} f_X(x)\, dx = 1$.
  3. $P(a \leq X \leq b) = \int_a^b f_X(x)\, dx$.
  4. $P(X = a) = 0$ for any single point $a$.

2.4 Common Distributions

Discrete distributions:

| Distribution | PMF | $E[X]$ | $\mathrm{Var}(X)$ |
|---|---|---|---|
| Bernoulli$(p)$ | $p^x(1-p)^{1-x}$, $x \in \{0,1\}$ | $p$ | $p(1-p)$ |
| Binomial$(n,p)$ | $\binom{n}{x}p^x(1-p)^{n-x}$ | $np$ | $np(1-p)$ |
| Poisson$(\lambda)$ | $e^{-\lambda}\lambda^x / x!$ | $\lambda$ | $\lambda$ |
| Geometric$(p)$ | $(1-p)^{x-1}p$, $x \geq 1$ | $1/p$ | $(1-p)/p^2$ |

Continuous distributions:

| Distribution | PDF | $E[X]$ | $\mathrm{Var}(X)$ |
|---|---|---|---|
| Uniform$(a,b)$ | $1/(b-a)$ on $[a,b]$ | $(a+b)/2$ | $(b-a)^2/12$ |
| Exponential$(\lambda)$ | $\lambda e^{-\lambda x}$, $x \geq 0$ | $1/\lambda$ | $1/\lambda^2$ |
| $N(\mu, \sigma^2)$ | $\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)}$ | $\mu$ | $\sigma^2$ |
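
Entries in tables like these can be sanity-checked directly from the definitions of expectation and variance. The sketch below does this for the binomial row, with arbitrarily chosen illustrative parameters $n = 10$, $p = 0.3$ (these values are an assumption, not taken from the notes).

```python
from math import comb

def binomial_pmf(n, p, x):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.3  # illustrative parameters only

# E[X] and E[X^2] computed straight from the PMF.
mean = sum(x * binomial_pmf(n, p, x) for x in range(n + 1))
second_moment = sum(x**2 * binomial_pmf(n, p, x) for x in range(n + 1))
variance = second_moment - mean**2

print(mean, n * p)                 # both close to 3.0
print(variance, n * p * (1 - p))   # both close to 2.1
```

The same pattern works for any discrete row once the sum is truncated at a point where the remaining tail is negligible.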

2.5 The Normal Distribution

Definition. $X \sim N(\mu, \sigma^2)$ if $X$ has PDF $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$.

Theorem 2.3 (Standardisation). If $X \sim N(\mu, \sigma^2)$, then $Z = (X - \mu)/\sigma \sim N(0, 1)$.

Proof. The CDF of $Z$: $P(Z \leq z) = P(X \leq \mu + \sigma z) = \int_{-\infty}^{\mu + \sigma z} \frac{1}{\sigma\sqrt{2\pi}} e^{-(t-\mu)^2/(2\sigma^2)}\, dt$. Substituting $u = (t - \mu)/\sigma$, so that $dt = \sigma\, du$: $= \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\, du$, which is the CDF of $N(0, 1)$. $\blacksquare$

Theorem 2.4 (Moment Generating Function). If $X \sim N(\mu, \sigma^2)$, then

$$M_X(t) = E[e^{tX}] = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)$$

Proof. $M_X(t) = \int_{-\infty}^{\infty} e^{tx} \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}\, dx$. Completing the square in the exponent and evaluating the Gaussian integral gives the result. $\blacksquare$

2.6 Moment Generating Functions

Definition. The moment generating function (MGF) of $X$ is $M_X(t) = E[e^{tX}]$ (when it exists in a neighbourhood of $t = 0$).

Theorem 2.5. If the MGF exists in a neighbourhood of 0, it uniquely determines the distribution. Furthermore, $E[X^n] = M_X^{(n)}(0)$, the $n$-th derivative of $M_X$ evaluated at 0.
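
As a worked illustration of the moment formula, here is a short sketch that recovers the mean and variance of $X \sim N(\mu, \sigma^2)$ by differentiating the MGF from Theorem 2.4. It uses only the chain and product rules.

$$
\begin{aligned}
M_X(t) &= \exp\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right), \\
M_X'(t) &= (\mu + \sigma^2 t)\, M_X(t) &&\Rightarrow\quad E[X] = M_X'(0) = \mu, \\
M_X''(t) &= \left[\sigma^2 + (\mu + \sigma^2 t)^2\right] M_X(t) &&\Rightarrow\quad E[X^2] = M_X''(0) = \sigma^2 + \mu^2, \\
\mathrm{Var}(X) &= E[X^2] - (E[X])^2 = \sigma^2.
\end{aligned}
$$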

2.7 Worked Examples

Problem. Let $X \sim \mathrm{Poisson}(3)$ and $Y \sim \mathrm{Poisson}(5)$ be independent. Find the distribution of $X + Y$.

Solution

The MGF of $X \sim \mathrm{Poisson}(\lambda)$ is $M_X(t) = e^{\lambda(e^t - 1)}$.

Since $X$ and $Y$ are independent, $M_{X+Y}(t) = M_X(t) \cdot M_Y(t) = e^{3(e^t - 1)} \cdot e^{5(e^t - 1)} = e^{8(e^t - 1)}$.

This is the MGF of $\mathrm{Poisson}(8)$. Since the MGF uniquely determines the distribution, $X + Y \sim \mathrm{Poisson}(8)$.

$\blacksquare$
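
The same conclusion can be cross-checked numerically without MGFs, by convolving the two PMFs directly. This is a sketch of that alternative check, not the method used in the solution above; the function names are illustrative.

```python
from math import exp, factorial

def poisson_pmf(lam, k):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

def sum_pmf(lam1, lam2, k):
    """P(X + Y = k) by discrete convolution, for independent Poisson X and Y."""
    return sum(
        poisson_pmf(lam1, j) * poisson_pmf(lam2, k - j) for j in range(k + 1)
    )

# Largest discrepancy between the convolution and the Poisson(8) PMF on k = 0..29.
max_gap = max(abs(sum_pmf(3, 5, k) - poisson_pmf(8, k)) for k in range(30))
print(max_gap)  # at the level of floating-point rounding
```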

Worked Example: Minimum of Exponential Random Variables

Solution. Let $X_1, \ldots, X_n$ be independent with $X_i \sim \mathrm{Exp}(\lambda_i)$. Find the distribution of $M = \min(X_1, \ldots, X_n)$. For $t \geq 0$:

$$P(M > t) = P(X_1 > t, \ldots, X_n > t) = \prod_{i=1}^{n} P(X_i > t) = \prod_{i=1}^{n} e^{-\lambda_i t} = e^{-(\lambda_1 + \cdots + \lambda_n)t}$$

So $P(M \leq t) = 1 - e^{-\lambda t}$ where $\lambda = \sum_{i=1}^{n} \lambda_i$. This means $M \sim \mathrm{Exp}(\lambda)$. $\blacksquare$
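
A quick Monte Carlo sketch can make this result concrete. The rates, the evaluation point `t`, the sample size, and the seed below are all arbitrary illustrative choices, and the simulation only approximates the survival probability.

```python
import math
import random

def simulate_min(rates, n_samples, seed=0):
    """Draw n_samples realisations of min(X_1, ..., X_n) with X_i ~ Exp(rates[i])."""
    rng = random.Random(seed)
    return [min(rng.expovariate(r) for r in rates) for _ in range(n_samples)]

rates = [0.5, 1.0, 1.5]   # hypothetical lambda_i
total_rate = sum(rates)   # lambda = 3.0
samples = simulate_min(rates, 100_000)

t = 0.4
empirical = sum(1 for m in samples if m > t) / len(samples)   # estimate of P(M > t)
theoretical = math.exp(-total_rate * t)                       # e^{-lambda t}
print(empirical, theoretical)
```

`random.expovariate` takes the rate $\lambda$ (not the mean) as its argument, which matches the parametrisation used in these notes.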

3. Joint Distributions and Independence

3.1 Joint Distribution Functions

Definition. The joint CDF of $(X, Y)$ is $F_{X,Y}(x, y) = P(X \leq x, Y \leq y)$.

Definition. The joint PDF (for continuous random variables) is a function $f_{X,Y}(x, y) \geq 0$ such that

$$F_{X,Y}(x, y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f_{X,Y}(u, v)\, dv\, du$$

Definition. The marginal PDF of $X$ is $f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy$.

3.2 Covariance and Correlation

Definition. The covariance of $X$ and $Y$ is

$$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$$

Proposition 2.6. $\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$ and $\mathrm{Cov}(aX + b, cY + d) = ac\,\mathrm{Cov}(X, Y)$.

Definition. The correlation coefficient is

$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$

Theorem 2.7 (Cauchy-Schwarz for Random Variables). $|\rho(X, Y)| \leq 1$, with equality if and only if $Y = aX + b$ almost surely for some constants $a \neq 0$ and $b$.

3.3 Independence of Random Variables

Definition. $X$ and $Y$ are independent if $F_{X,Y}(x, y) = F_X(x)\, F_Y(y)$ for all $x, y$.

For continuous random variables, this is equivalent to $f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$.

Proposition 2.8. If $X$ and $Y$ are independent, then $\mathrm{Cov}(X, Y) = 0$. The converse is false.

Worked Example: Uncorrelated but Dependent

Solution. Let $X \sim N(0, 1)$ and $Y = X^2$. Then $\mathrm{Cov}(X, Y) = E[X^3] - E[X]E[X^2] = 0 - 0 \cdot 1 = 0$ (since the third moment of a standard normal is 0).

But $Y$ is completely determined by $X$, so they are not independent. $\blacksquare$
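
The same phenomenon shows up in a small discrete analogue that can be checked exactly. The sketch below swaps the standard normal for $X$ uniform on $\{-1, 0, 1\}$ (an assumption made purely so that exact arithmetic is possible) and keeps $Y = X^2$.

```python
from fractions import Fraction

support = [-1, 0, 1]
p = Fraction(1, 3)   # P(X = x) for each x in the support

E_X = sum(p * x for x in support)             # 0
E_Y = sum(p * x**2 for x in support)          # 2/3
E_XY = sum(p * x * x**2 for x in support)     # E[X^3] = 0
cov = E_XY - E_X * E_Y

# Independence would require P(X = 0, Y = 0) = P(X = 0) * P(Y = 0).
p_joint = p          # X = 0 forces Y = 0, so the joint probability is 1/3
p_product = p * p    # P(X = 0) * P(Y = 0) = 1/3 * 1/3

print(cov, p_joint, p_product)  # 0 1/3 1/9
```

The covariance vanishes by symmetry, while the joint probability fails the product rule, so the two variables are uncorrelated but dependent.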

4. Limit Theorems

4.1 The Law of Large Numbers

Theorem 4.1 (Weak Law of Large Numbers). Let $X_1, X_2, \ldots$ be i.i.d. with $E[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 < \infty$. Then for every $\varepsilon > 0$:

$$\lim_{n \to \infty} P\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \geq \varepsilon\right) = 0$$

Proof. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ denote the sample mean. Then $E[\bar{X}_n] = \mu$ and, by independence, $\mathrm{Var}(\bar{X}_n) = \sigma^2/n$. By Chebyshev's inequality:

$$P(|\bar{X}_n - \mu| \geq \varepsilon) \leq \frac{\mathrm{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0 \quad \text{as } n \to \infty$$

$\blacksquare$
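
The Chebyshev bound in the proof, $\sigma^2/(n\varepsilon^2)$, also gives a crude sample-size rule: to guarantee the bound is at most $\delta$, it suffices to take $n \geq \sigma^2/(\delta\varepsilon^2)$. The sketch below evaluates this for a fair die; the tolerance `eps = 0.1` and level `delta = 0.05` are illustrative assumptions.

```python
import math

def chebyshev_bound(variance, n, eps):
    """Upper bound variance / (n * eps^2) on P(|sample mean - mu| >= eps), capped at 1."""
    return min(1.0, variance / (n * eps**2))

def samples_needed(variance, eps, delta):
    """Smallest n for which the Chebyshev bound is at most delta."""
    return math.ceil(variance / (delta * eps**2))

die_variance = 35 / 12   # variance of one fair six-sided die roll
n_required = samples_needed(die_variance, eps=0.1, delta=0.05)
print(n_required)                                    # 5834
print(chebyshev_bound(die_variance, n_required, 0.1))
```

Chebyshev's inequality holds for any distribution with finite variance, so this sample size is typically far larger than what is actually needed.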

Theorem 4.2 (Strong Law of Large Numbers). Under the same conditions:

$$P\left(\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} X_i = \mu\right) = 1$$

That is, the sample mean converges to the population mean almost surely.

4.2 The Central Limit Theorem

Theorem 4.3 (Central Limit Theorem). Let $X_1, X_2, \ldots$ be i.i.d. with $E[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 \in (0, \infty)$. Then

$$\frac{S_n - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0, 1)$$

where $S_n = \sum_{i=1}^{n} X_i$ and $\xrightarrow{d}$ denotes convergence in distribution.

Equivalently, for large $n$:

$$P\left(\frac{S_n - n\mu}{\sigma\sqrt{n}} \leq z\right) \approx \Phi(z)$$

where $\Phi$ is the CDF of the standard normal.

Proof (using characteristic functions). Writing $Y_i = X_i - \mu$ gives $S_n - n\mu = \sum_{i=1}^{n} Y_i$ with $E[Y_i] = 0$ and $\mathrm{Var}(Y_i) = \sigma^2$, so we may assume without loss of generality that $\mu = 0$. Let $\varphi_X(t) = E[e^{itX_1}]$ be the characteristic function of $X_1$. By independence, the characteristic function of $S_n/(\sigma\sqrt{n})$ is:

$$\varphi_n(t) = \left[\varphi_X\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n$$

Since $E[X_1] = 0$ and $E[X_1^2] = \sigma^2$, expanding $\varphi_X$ around 0 gives $\varphi_X(s) = 1 - \frac{\sigma^2 s^2}{2} + o(s^2)$. Substituting $s = t/(\sigma\sqrt{n})$:

$$\varphi_n(t) = \left[1 - \frac{t^2}{2n} + o\left(\frac{1}{n}\right)\right]^n$$

Using $\lim_{n \to \infty}(1 + a_n/n)^n = e^{\lim a_n}$ with $a_n \to -t^2/2$:

$$\lim_{n \to \infty} \varphi_n(t) = e^{-t^2/2}$$

This is the characteristic function of $N(0, 1)$. By Lévy's continuity theorem, the convergence in distribution follows. $\blacksquare$

4.3 Worked Examples

Problem. A fair die is rolled 100 times. Approximate the probability that the sum exceeds 370.

Solution

Let $X_i$ be the value of the $i$-th roll. Then $E[X_i] = 7/2 = 3.5$ and $\mathrm{Var}(X_i) = 35/12 \approx 2.917$.

$S_{100} = \sum_{i=1}^{100} X_i$. By the CLT:

$$\frac{S_{100} - 350}{\sqrt{100 \cdot 35/12}} \approx N(0, 1)$$

$$P(S_{100} > 370) \approx P\left(Z > \frac{370 - 350}{\sqrt{291.7}}\right) \approx P(Z > 1.17) \approx 0.121$$

$\blacksquare$
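
Because a die has only six faces, the exact distribution of $S_{100}$ can be built up one roll at a time and compared with the normal approximation. The following is a sketch of that comparison; the helper names are illustrative, and the normal tail is computed with the standard-library `math.erfc`.

```python
import math

def normal_tail(z):
    """P(Z > z) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def exact_tail_dice(n_rolls, threshold):
    """Exact P(S_n > threshold) for the sum of n_rolls fair six-sided dice."""
    dist = {0: 1.0}                      # distribution of the running sum
    for _ in range(n_rolls):
        new = {}
        for total, prob in dist.items():
            for face in range(1, 7):
                new[total + face] = new.get(total + face, 0.0) + prob / 6
        dist = new
    return sum(prob for total, prob in dist.items() if total > threshold)

n, mu, var = 100, 3.5, 35 / 12
z = (370 - n * mu) / math.sqrt(n * var)   # standardised threshold
clt_estimate = normal_tail(z)
exact = exact_tail_dice(n, 370)

print(round(z, 2))            # 1.17
print(clt_estimate, exact)    # CLT approximation vs exact probability
```

The two numbers differ slightly because the sum is integer-valued while the normal approximation is continuous.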

Worked Example: Sample Mean Distribution

Solution. A population has mean 50 and standard deviation 10. Find the probability that the mean of a sample of 64 observations exceeds 52.

By the CLT, $\bar{X} \approx N(50, 100/64) = N(50, 1.5625)$.

$$P(\bar{X} > 52) = P\left(Z > \frac{52 - 50}{\sqrt{1.5625}}\right) = P(Z > 1.6) \approx 0.0548$$

$\blacksquare$

4.4 Common Pitfalls

  • The CLT approximation can be unreliable for small samples. The CLT is an asymptotic result. For small $n$ (a common rule of thumb is $n < 30$), the normal approximation can be poor unless the underlying distribution is already close to normal. Use the Berry-Esseen theorem for finite-sample bounds.
  • Independence is critical for the LLN and CLT. If the $X_i$ are dependent, the sample mean may not converge to the population mean, or the convergence rate may differ. For stationary sequences with weak dependence, versions of these theorems still hold, but the proofs are more involved.
  • Convergence in distribution is weaker than convergence in probability. The CLT gives convergence in distribution of the standardised sum; it does not say that the sum or the sample mean itself converges. The weak LLN gives convergence in probability of the sample mean to $\mu$.

5. Transformations and Convolutions

5.1 Distribution of a Function of a Random Variable

Theorem 5.1 (CDF Method). If $Y = g(X)$ where $X$ is continuous and $g$ is strictly monotone, then

$$F_Y(y) = P(g(X) \leq y) = \begin{cases} F_X(g^{-1}(y)) & \text{if } g \text{ is increasing} \\ 1 - F_X(g^{-1}(y)) & \text{if } g \text{ is decreasing} \end{cases}$$

Theorem 5.2 (Change of Variables). If $Y = g(X)$ where $g$ is differentiable and strictly monotone, then for $y$ in the range of $g$

$$f_Y(y) = f_X(g^{-1}(y)) \cdot \left|\frac{d}{dy} g^{-1}(y)\right|$$

Worked Example: Distribution of $X^2$ where $X \sim N(0, 1)$

Solution. Let $Y = X^2$ where $X \sim N(0, 1)$. Since $g(x) = x^2$ is not monotone, work with the CDF directly. For $y \geq 0$:

$$F_Y(y) = P(X^2 \leq y) = P(-\sqrt{y} \leq X \leq \sqrt{y}) = \Phi(\sqrt{y}) - \Phi(-\sqrt{y}) = 2\Phi(\sqrt{y}) - 1$$

$$f_Y(y) = \frac{d}{dy}[2\Phi(\sqrt{y}) - 1] = 2\phi(\sqrt{y}) \cdot \frac{1}{2\sqrt{y}} = \frac{1}{\sqrt{2\pi y}}\, e^{-y/2}$$

This is the PDF of the $\chi^2(1)$ distribution. $\blacksquare$
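
The differentiation step can be checked numerically by comparing a central finite difference of $F_Y$ with the closed-form density. This is a sketch only; the step size `h` and the test points are arbitrary choices.

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def F_Y(y):
    """CDF of Y = X^2 for X ~ N(0, 1), valid for y >= 0."""
    return 2 * Phi(math.sqrt(y)) - 1

def f_Y(y):
    """Claimed density of Y for y > 0."""
    return math.exp(-y / 2) / math.sqrt(2 * math.pi * y)

h = 1e-6
gaps = []
for y in (0.5, 1.0, 2.0, 4.0):
    numeric = (F_Y(y + h) - F_Y(y - h)) / (2 * h)   # central difference of the CDF
    gaps.append(abs(numeric - f_Y(y)))
    print(y, numeric, f_Y(y))
```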

5.2 Convolution

Theorem 5.3. If $X$ and $Y$ are independent continuous random variables, the PDF of $Z = X + Y$ is

$$f_Z(z) = (f_X * f_Y)(z) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx$$

Proof. $F_Z(z) = P(X + Y \leq z) = \iint_{x+y \leq z} f_{X,Y}(x, y)\, dx\, dy = \int_{-\infty}^{\infty} f_X(x)\left[\int_{-\infty}^{z-x} f_Y(y)\, dy\right] dx = \int_{-\infty}^{\infty} f_X(x)\, F_Y(z - x)\, dx$, where the third equality uses $f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$ by independence.

Differentiating with respect to $z$: $f_Z(z) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx$. $\blacksquare$
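
As an illustration of the convolution formula, the sum of two independent Uniform$(0,1)$ variables has the triangular density $f_Z(z) = z$ on $[0, 1]$ and $2 - z$ on $(1, 2]$. The sketch below approximates the convolution integral with a midpoint rule and compares it with that closed form; the example distribution and the grid size are illustrative choices, not taken from the notes.

```python
def uniform_pdf(x):
    """PDF of Uniform(0, 1)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def convolve_at(z, f, g, lo=0.0, hi=1.0, steps=10_000):
    """Midpoint-rule approximation of the integral of f(x) * g(z - x) over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += f(x) * g(z - x)
    return total * dx

def triangular_pdf(z):
    """Closed-form density of the sum of two independent Uniform(0, 1) variables."""
    if 0.0 <= z <= 1.0:
        return z
    if 1.0 < z <= 2.0:
        return 2.0 - z
    return 0.0

points = (0.25, 0.5, 1.0, 1.5)
gaps = [abs(convolve_at(z, uniform_pdf, uniform_pdf) - triangular_pdf(z)) for z in points]
print(gaps)
```

The integration range is restricted to $[0, 1]$ because $f_X$ vanishes elsewhere.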

Corollary 5.4. The sum of independent normals is normal: if $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ are independent, then $X + Y \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.

Proof. The convolution of two Gaussian PDFs is Gaussian. This follows from the MGF: $M_{X+Y}(t) = M_X(t)M_Y(t) = \exp((\mu_1 + \mu_2)t + (\sigma_1^2 + \sigma_2^2)t^2/2)$, which is the MGF of $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$. $\blacksquare$

Common Pitfalls

  1. Dropping negative signs during algebraic manipulation — substitute back to verify your answer.

  2. Misreading the question, particularly with ‘hence’ vs ‘hence or otherwise’ — the former requires using previous work.

  3. Confusing the domain and range of functions, or not considering restrictions (e.g., denominator cannot be zero).

  4. Confusing $P(A \mid B)$ with $P(B \mid A)$: these are related by Bayes' theorem but are not equal in general.

Worked Examples

Example 1: Law of Total Probability

Problem. Factory A produces 60% of items and factory B produces 40%. Defect rates are 2% and 5% respectively. Find the probability a randomly selected item is defective.

Solution. Let $D$ denote the event that the item is defective. Then $P(D) = P(D \mid A)P(A) + P(D \mid B)P(B) = 0.02 \times 0.6 + 0.05 \times 0.4 = 0.012 + 0.020 = 0.032$.

Given a defective item, $P(A \mid D) = \frac{0.012}{0.032} = 0.375$ (Bayes' theorem).

$\blacksquare$
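
The computation above follows a pattern that is easy to mechanise: multiply each prior by its likelihood, sum to get the total probability, then normalise. Below is a minimal sketch with exact fractions; the dictionary layout is an illustrative choice.

```python
from fractions import Fraction

# Priors P(factory) and likelihoods P(defective | factory) from the problem statement.
prior = {"A": Fraction(60, 100), "B": Fraction(40, 100)}
defect_rate = {"A": Fraction(2, 100), "B": Fraction(5, 100)}

# Law of total probability: P(D) = sum over factories of P(D | f) P(f).
p_defect = sum(defect_rate[f] * prior[f] for f in prior)

# Bayes' theorem: P(f | D) = P(D | f) P(f) / P(D).
posterior = {f: defect_rate[f] * prior[f] / p_defect for f in prior}

print(p_defect, posterior["A"], posterior["B"])  # 4/125 3/8 5/8
```

Here $4/125 = 0.032$ and $3/8 = 0.375$, matching the hand calculation, and the posteriors sum to 1 as they must.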

Example 2: Generating Functions

Problem. A fair coin is tossed until the first head appears. Find the expected number of tosses using the probability generating function.

Solution. $X \sim \text{Geo}(p = 0.5)$. $P(X = k) = 0.5^k$ for $k = 1, 2, \ldots$

PGF: $G(s) = \sum_{k=1}^{\infty} 0.5^k s^k = \frac{0.5s}{1 - 0.5s}$ for $|s| < 2$.

Differentiating, $G'(s) = \frac{0.5}{(1 - 0.5s)^2}$, so $E(X) = G'(1) = \frac{0.5}{(1-0.5)^2} = \frac{0.5}{0.25} = 2$.

$\blacksquare$
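
The PGF calculation can be cross-checked numerically in two ways: by differentiating a truncated version of the series at $s = 1$, and by summing $k \cdot P(X = k)$ directly. This is a sketch; the truncation at 200 terms and the step size `h` are arbitrary choices.

```python
def pgf_truncated(s, terms=200):
    """Partial sum of G(s) = sum_{k>=1} 0.5^k s^k for the fair-coin geometric variable."""
    return sum(0.5**k * s**k for k in range(1, terms + 1))

def pgf_closed_form(s):
    """Closed form 0.5 s / (1 - 0.5 s), valid for |s| < 2."""
    return 0.5 * s / (1 - 0.5 * s)

# E(X) = G'(1), approximated by a central finite difference.
h = 1e-5
mean_from_pgf = (pgf_truncated(1 + h) - pgf_truncated(1 - h)) / (2 * h)

# E(X) directly from the definition of expectation.
mean_from_series = sum(k * 0.5**k for k in range(1, 200))

print(mean_from_pgf, mean_from_series)  # both close to 2
```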

Summary

  • Sample spaces, events, and sigma-algebras provide the rigorous foundation for probability theory.
  • Random variables: discrete (PMF) and continuous (PDF); CDF $F(x) = P(X \leq x)$.
  • Expectation: $E(X) = \sum x \cdot P(X=x)$ or $\int x f(x)\,dx$; linearity $E(aX + bY) = aE(X) + bE(Y)$.
  • Variance: $\mathrm{Var}(X) = E(X^2) - [E(X)]^2$; $\mathrm{Var}(aX + b) = a^2\mathrm{Var}(X)$.
  • Generating functions (PGF, MGF) encode distribution information; moments obtained by differentiation at specific points.
