Probability and Statistics

1. Probability Spaces

1.1 Sample Spaces and Events

A probability space is a triple $(\Omega, \mathcal{F}, P)$ where:

  • $\Omega$ is the sample space (set of all possible outcomes).
  • $\mathcal{F}$ is a sigma-algebra (collection of events) on $\Omega$.
  • $P : \mathcal{F} \to [0,1]$ is a probability measure.

1.2 Sigma-Algebras

Definition. A sigma-algebra (or $\sigma$-algebra) $\mathcal{F}$ on a set $\Omega$ is a collection of subsets of $\Omega$ satisfying:

  1. $\Omega \in \mathcal{F}$.
  2. If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$ (closed under complementation).
  3. If $A_1, A_2, A_3, \ldots \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$ (closed under countable unions).

Remark. From (2) and (3), $\mathcal{F}$ is also closed under countable intersections (by De Morgan's laws). The pair $(\Omega, \mathcal{F})$ is called a measurable space.

Example 1.1. For any set $\Omega$, the trivial sigma-algebra is $\{\emptyset, \Omega\}$, and the power set $\mathcal{P}(\Omega)$ is also a sigma-algebra.

Example 1.2. If $\Omega = \{1, 2, 3, 4, 5, 6\}$ (a fair die), then $\mathcal{F} = \mathcal{P}(\Omega)$ contains all $2^6 = 64$ subsets. The power set is the usual choice of sigma-algebra for finite sample spaces.

Example 1.3. For $\Omega = \mathbb{R}$, the Borel sigma-algebra $\mathcal{B}$ is the smallest sigma-algebra containing all open intervals. Equivalently, it is the sigma-algebra generated by the open sets, so it is closed under countable unions, countable intersections, and complements. We write $(\mathbb{R}, \mathcal{B})$.

Proposition 1.0. The intersection of any nonempty collection of sigma-algebras on $\Omega$ is a sigma-algebra.

Proof. Let $\{\mathcal{F}_\alpha\}$ be a collection of sigma-algebras. Then: (1) $\Omega \in \mathcal{F}_\alpha$ for all $\alpha$, so $\Omega \in \bigcap_\alpha \mathcal{F}_\alpha$. (2) If $A \in \bigcap_\alpha \mathcal{F}_\alpha$, then $A \in \mathcal{F}_\alpha$ for all $\alpha$, so $A^c \in \mathcal{F}_\alpha$ for all $\alpha$, hence $A^c \in \bigcap_\alpha \mathcal{F}_\alpha$. (3) If $A_1, A_2, \ldots \in \bigcap_\alpha \mathcal{F}_\alpha$, then every $\mathcal{F}_\alpha$ contains each $A_i$ and therefore $\bigcup_i A_i$, so the union lies in the intersection. $\blacksquare$

Intuition. This proposition guarantees that for any collection of subsets $\mathcal{E}$, there exists a smallest sigma-algebra containing $\mathcal{E}$, called the sigma-algebra generated by $\mathcal{E}$ and denoted $\sigma(\mathcal{E})$. It is the intersection of all sigma-algebras containing $\mathcal{E}$; the collection is nonempty because the power set is one of them.
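For a finite $\Omega$, the generated sigma-algebra can be computed by brute force. The Python sketch below is illustrative only: the function name `generated_sigma_algebra` and the example generators are assumptions made for this example. It closes a family of sets under complements and pairwise unions until nothing new appears, which is enough on a finite set.

```python
from itertools import combinations


def generated_sigma_algebra(omega, generators):
    """Smallest family containing `generators` that is closed under
    complement and union. Valid for finite omega only."""
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(family)  # snapshot, so we can add while looping
        for a in current:
            complement = omega - a
            if complement not in family:
                family.add(complement)
                changed = True
        for a, b in combinations(current, 2):
            union = a | b
            if union not in family:
                family.add(union)
                changed = True
    return family


# Hypothetical example: generators {1} and {1, 2} on a four-point space.
sigma = generated_sigma_algebra({1, 2, 3, 4}, [{1}, {1, 2}])
print(sorted(sorted(s) for s in sigma))
```

In this example the atoms are $\{1\}$, $\{2\}$ and $\{3, 4\}$, so the generated sigma-algebra has $2^3 = 8$ members; the set $\{3\}$ is not among them.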

1.3 Axioms of Probability (Kolmogorov)

  1. Non-negativity: $P(A) \geq 0$ for all $A \in \mathcal{F}$.
  2. Normalisation: $P(\Omega) = 1$.
  3. Countable additivity: If $A_1, A_2, \ldots$ are pairwise disjoint events, then

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$

1.4 Basic Properties

Proposition 1.1. $P(\emptyset) = 0$.

Proof. Apply countable additivity to the pairwise disjoint sequence $\Omega, \emptyset, \emptyset, \ldots$, whose union is $\Omega$: $P(\Omega) = P(\Omega) + \sum_{i=2}^{\infty} P(\emptyset)$. Since $P(\Omega) = 1$ is finite, this forces $P(\emptyset) = 0$. Padding a finite disjoint family with empty sets then gives finite additivity. $\blacksquare$

Proposition 1.2 (Complement). $P(A^c) = 1 - P(A)$. This follows from the disjoint union $\Omega = A \cup A^c$ and normalisation.

Proposition 1.3 (Monotonicity). If $A \subseteq B$, then $P(A) \leq P(B)$.

Proof. Write $B = A \cup (B \setminus A)$, a disjoint union. By finite additivity, $P(B) = P(A) + P(B \setminus A) \geq P(A)$ since $P(B \setminus A) \geq 0$. $\blacksquare$

Proposition 1.4 (Inclusion-Exclusion). For any two events $A, B$:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Proof. Write $A \cup B = A \cup (B \setminus A)$ as a disjoint union. Then $P(A \cup B) = P(A) + P(B \setminus A)$. Since $B = (B \setminus A) \cup (A \cap B)$ is also disjoint, $P(B) = P(B \setminus A) + P(A \cap B)$, so $P(B \setminus A) = P(B) - P(A \cap B)$. Substituting gives the result. $\blacksquare$

Proposition 1.5 (General Inclusion-Exclusion). For events $A_1, \ldots, A_n$:

$$P\left(\bigcup_{i=1}^n A_i\right) = \sum_{i} P(A_i) - \sum_{i \lt j} P(A_i \cap A_j) + \sum_{i \lt j \lt k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P(A_1 \cap \cdots \cap A_n)$$

Proposition 1.6 (Boole's Inequality). $P(A \cup B) \leq P(A) + P(B)$. More generally,

$$P\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i=1}^n P(A_i)$$
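The two-event identity and Boole's inequality can be checked exactly on a small example. The sketch below assumes a fair six-sided die with the uniform measure; the events `A` and `B` and the helper `prob` are illustrative choices, and `fractions.Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # outcomes of one fair die roll


def prob(event):
    """P(event) under the uniform measure on OMEGA."""
    return Fraction(len(event & OMEGA), len(OMEGA))


A = frozenset({2, 4, 6})  # "the roll is even"
B = frozenset({4, 5, 6})  # "the roll is at least 4"

lhs = prob(A | B)                      # P(A ∪ B)
rhs = prob(A) + prob(B) - prob(A & B)  # inclusion-exclusion
boole_bound = prob(A) + prob(B)        # Boole's upper bound

print(lhs, rhs, boole_bound)  # 2/3 2/3 1
```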

1.5 Conditional Probability

Definition. The conditional probability of $A$ given $B$ (where $P(B) \gt 0$) is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Intuition. Conditioning on $B$ restricts the sample space to $B$ and rescales so that $P(B \mid B) = 1$.

Theorem 1.1 (Law of Total Probability). If $B_1, \ldots, B_n$ partition $\Omega$ with $P(B_i) \gt 0$:

$$P(A) = \sum_{i=1}^n P(A \mid B_i) P(B_i)$$

Proof. Since $B_1, \ldots, B_n$ partition $\Omega$, we have $A = \bigcup_{i=1}^n (A \cap B_i)$ (disjoint union). By additivity over this finite disjoint union:

$$P(A) = \sum_{i=1}^n P(A \cap B_i) = \sum_{i=1}^n P(A \mid B_i)\, P(B_i)$$

$\blacksquare$

Theorem 1.2 (Bayes' Theorem). For events $A$ and $B$ with $P(A) \gt 0$ and $P(B) \gt 0$:

$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$$

In the partition form, for a partition $B_1, \ldots, B_n$ of $\Omega$ with $P(B_i) \gt 0$ and an event $A$ with $P(A) \gt 0$:

$$P(B_j \mid A) = \frac{P(A \mid B_j) P(B_j)}{\sum_{i=1}^n P(A \mid B_i) P(B_i)}$$

Proof. By definition, $P(A \mid B) = P(A \cap B)/P(B)$ and $P(B \mid A) = P(A \cap B)/P(A)$. Solving the second for $P(A \cap B) = P(B \mid A) P(A)$ and substituting into the first gives Bayes' theorem. The partition form follows by exchanging the roles of the two events (taking $B_j$ as the event of interest and $A$ as the conditioning event) and expanding the denominator $P(A)$ with the law of total probability. $\blacksquare$

1.6 Worked Examples

Problem 1.1. A bag contains 4 red and 6 blue marbles. Two marbles are drawn without replacement. What is the probability that both are red?

Solution

Let $R_1$ be the event "first marble is red" and $R_2$ be "second marble is red." Then:

$$P(R_1 \cap R_2) = P(R_1)\, P(R_2 \mid R_1) = \frac{4}{10} \cdot \frac{3}{9} = \frac{12}{90} = \frac{2}{15}$$

Problem 1.2. A disease affects 1% of a population. A test has sensitivity 95% ($P(\mathrm{positive} \mid \mathrm{disease}) = 0.95$) and specificity 90% ($P(\mathrm{negative} \mid \mathrm{healthy}) = 0.90$). If a person tests positive, what is the probability they have the disease?

Solution

Let $D$ = "has disease" and $+$ = "tests positive." We want $P(D \mid +)$. The false-positive rate is $P(+ \mid D^c) = 1 - 0.90 = 0.10$.

By Bayes' theorem:

$$P(D \mid +) = \frac{P(+ \mid D)\, P(D)}{P(+ \mid D)\, P(D) + P(+ \mid D^c)\, P(D^c)}$$

$$= \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.10 \times 0.99} = \frac{0.0095}{0.0095 + 0.099} = \frac{0.0095}{0.1085} \approx 0.0876$$

So even with a positive test, there is only about an 8.8% chance of having the disease, because the prior probability is low. Overlooking this effect is known as the base rate fallacy. $\blacksquare$
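The calculation is easy to reproduce and to rerun with other inputs. The function below is a minimal sketch; the name `posterior` and its parameter names are chosen for this example only.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) from Bayes' theorem with the
    two-set partition {disease, healthy}."""
    false_positive_rate = 1.0 - specificity  # P(+ | healthy)
    evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / evidence


p = posterior(prior=0.01, sensitivity=0.95, specificity=0.90)
print(round(p, 4))  # 0.0876
```

Raising the prior to 10% with the same test gives `posterior(0.10, 0.95, 0.90)`, roughly 0.51, which shows how strongly the answer depends on the base rate.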

Proposition 1.7 (General Inclusion-Exclusion). For events $A_1, \ldots, A_n$:

$$P\left(\bigcup_{i=1}^n A_i\right) = \sum_{k=1}^n (-1)^{k+1} \sum_{1 \leq i_1 \lt \cdots \lt i_k \leq n} P(A_{i_1} \cap \cdots \cap A_{i_k})$$

Proof. By induction on $n$. The base case $n = 1$ is trivial. Assume the result holds for $n$ events. For $n + 1$ events, the two-event formula applied to $\bigcup_{i=1}^n A_i$ and $A_{n+1}$ gives:

$$P\left(\bigcup_{i=1}^{n+1} A_i\right) = P\left(\bigcup_{i=1}^n A_i\right) + P(A_{n+1}) - P\left(\bigcup_{i=1}^n (A_i \cap A_{n+1})\right)$$

Both union terms involve $n$ events ($A_1, \ldots, A_n$ and $A_1 \cap A_{n+1}, \ldots, A_n \cap A_{n+1}$ respectively). Applying the induction hypothesis to each and rearranging yields the formula for $n + 1$. $\blacksquare$

Problem 1.3. Three machines A, B, C produce items. Machine A produces 50% of items with 2% defective, B produces 30% with 1% defective, and C produces 20% with 3% defective. An item is found to be defective. What is the probability it was produced by machine A?

Solution

Let $D$ = "defective" and $A$, $B$, $C$ denote production by each machine.

$P(A) = 0.5$, $P(B) = 0.3$, $P(C) = 0.2$. $P(D \mid A) = 0.02$, $P(D \mid B) = 0.01$, $P(D \mid C) = 0.03$.

$$P(A \mid D) = \frac{P(D \mid A)\, P(A)}{P(D \mid A)\, P(A) + P(D \mid B)\, P(B) + P(D \mid C)\, P(C)}$$

$$= \frac{0.02 \times 0.5}{0.02 \times 0.5 + 0.01 \times 0.3 + 0.03 \times 0.2} = \frac{0.010}{0.010 + 0.003 + 0.006} = \frac{0.010}{0.019} \approx 0.526$$
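The partition form of Bayes' theorem gives the posterior for every machine at once. The sketch below uses exact fractions; `bayes_partition` is a hypothetical helper name introduced for this example.

```python
from fractions import Fraction


def bayes_partition(priors, likelihoods):
    """Posterior P(B_j | A) for every block B_j of a partition.

    priors[j] = P(B_j), likelihoods[j] = P(A | B_j).
    """
    evidence = sum(priors[j] * likelihoods[j] for j in priors)  # P(A)
    return {j: priors[j] * likelihoods[j] / evidence for j in priors}


priors = {"A": Fraction(1, 2), "B": Fraction(3, 10), "C": Fraction(1, 5)}
defect_rates = {"A": Fraction(2, 100), "B": Fraction(1, 100), "C": Fraction(3, 100)}

post = bayes_partition(priors, defect_rates)
print(post["A"], float(post["A"]))  # 10/19, about 0.526
```

The three posteriors are $10/19$, $3/19$ and $6/19$, which sum to 1 as they must.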

:::caution Common Pitfall
People often confuse $P(A \mid B)$ with $P(B \mid A)$. In medical testing, $P(\mathrm{disease} \mid \mathrm{positive})$ is much lower than $P(\mathrm{positive} \mid \mathrm{disease})$ when the base rate is low. Always apply Bayes' theorem rigorously.
:::

2. Random Variables

2.1 Definition

A random variable is a measurable function $X : \Omega \to \mathbb{R}$. Measurability means that for every Borel set $B \subseteq \mathbb{R}$, the preimage $X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\}$ belongs to $\mathcal{F}$.

  • Discrete random variable: takes values in a countable set.
  • Continuous random variable: has a probability density function (PDF).

Example 2.1 (Discrete). Roll a fair die. Define $X(\omega) = \omega$. Then $X$ takes values in $\{1, 2, 3, 4, 5, 6\}$ with $P(X = k) = 1/6$ for each $k$.

Example 2.2 (Discrete: Indicator). For any event $A$, the indicator random variable $\mathbf{1}_A$ equals 1 if $A$ occurs and 0 otherwise. Then $E[\mathbf{1}_A] = P(A)$ and $\mathrm{Var}(\mathbf{1}_A) = P(A)(1 - P(A))$.

Example 2.3 (Continuous). Let $X \sim \mathrm{Uniform}(0, 1)$. One concrete realisation takes $\Omega = (0, 1)$ with Lebesgue measure and lets $X$ be the identity, $X(\omega) = \omega$. The PDF is $f(x) = 1$ for $x \in (0, 1)$ and $f(x) = 0$ otherwise.

Example 2.4 (Mixed). A random variable can be neither purely discrete nor purely continuous. For instance, if $X = 0$ with probability $1/2$ and $X$ is drawn from $\mathrm{Exp}(1)$ with probability $1/2$, then $X$ has an atom at 0 and a continuous part on $(0, \infty)$.

2.2 Cumulative Distribution Function

The CDF of a random variable $X$ is

$$F_X(x) = P(X \leq x)$$

Theorem 2.1 (Properties of the CDF).

  1. $\lim_{x \to -\infty} F(x) = 0$, $\lim_{x \to +\infty} F(x) = 1$.
  2. $F$ is non-decreasing: if $a \lt b$, then $F(a) \leq F(b)$.
  3. $F$ is right-continuous: $\lim_{x \to a^+} F(x) = F(a)$.
  4. $P(a \lt X \leq b) = F(b) - F(a)$.
  5. $P(X = a) = F(a) - F(a^-)$ where $F(a^-) = \lim_{x \to a^-} F(x)$.
  6. $F$ has at most countably many points of discontinuity.

Proof. (1) By monotonicity of $P$, $F(x) = P(X \leq x) \leq P(\Omega) = 1$. As $x \to \infty$, the events $\{X \leq x\}$ increase to $\Omega$, so by continuity from below, $F(x) \to 1$. Similarly, as $x \to -\infty$, the events $\{X \leq x\}$ decrease to $\emptyset$, and continuity from above gives $F(x) \to 0$.

(2) If $a \lt b$, then $\{X \leq a\} \subseteq \{X \leq b\}$, so by monotonicity of $P$, $F(a) \leq F(b)$.

(3) Let $x_n \downarrow a$. Then $\{X \leq x_n\} \downarrow \{X \leq a\}$ (since $\{X \leq a\} = \bigcap_n \{X \leq x_n\}$). By continuity from above of $P$, $F(x_n) \to F(a)$.

(4) The event $\{X \leq b\}$ is the disjoint union of $\{X \leq a\}$ and $\{a \lt X \leq b\}$, so $P(a \lt X \leq b) = P(X \leq b) - P(X \leq a) = F(b) - F(a)$.

(5) $P(X = a) = P(X \leq a) - P(X \lt a) = F(a) - \lim_{x \uparrow a} F(x) = F(a) - F(a^-)$, where $P(X \lt a) = \lim_{x \uparrow a} F(x)$ by continuity from below.

(6) Since $F$ is non-decreasing, every discontinuity is a jump. The jumps sum to at most 1, so for each $n \geq 1$ there are at most $n$ jumps of size greater than $1/n$. The set of discontinuities is therefore a countable union of finite sets. $\blacksquare$

2.3 Probability Mass Function (Discrete)

For a discrete random variable $X$ with values $\{x_1, x_2, \ldots\}$:

$$f_X(x) = P(X = x) = \begin{cases} p_i & \text{if } x = x_i \\ 0 & \text{otherwise} \end{cases}$$

where $p_i \geq 0$ and $\sum_i p_i = 1$.

2.4 Probability Density Function (Continuous)

A random variable $X$ is continuous if there exists a function $f_X \geq 0$ such that, for all $a \leq b$,

$$P(a \leq X \leq b) = \int_a^b f_X(x)\, dx$$

and $\int_{-\infty}^{\infty} f_X(x)\, dx = 1$.

Note: $f_X(x)$ is not a probability; it is a probability density. For continuous $X$, $P(X = x) = 0$ for any individual $x$.

2.5 Functions of Random Variables

Proposition 2.1. If $X$ is a continuous random variable with PDF $f_X$ and $g$ is a strictly monotone function whose inverse is differentiable, then $Y = g(X)$ has PDF

$$f_Y(y) = f_X(g^{-1}(y)) \cdot \left|\frac{d}{dy} g^{-1}(y)\right|$$

for $y$ in the range of $g$, and $f_Y(y) = 0$ elsewhere.

Proof. Suppose $g$ is strictly increasing. Then $F_Y(y) = P(Y \leq y) = P(X \leq g^{-1}(y)) = F_X(g^{-1}(y))$. Differentiating: $f_Y(y) = f_X(g^{-1}(y)) \cdot (g^{-1})'(y)$. For decreasing $g$, the inequality reverses, so $F_Y(y) = 1 - F_X(g^{-1}(y))$ and differentiating introduces a minus sign, which is cancelled by $(g^{-1})'(y) \lt 0$. Both cases are captured by the absolute value. $\blacksquare$

Problem 2.1. Let $X \sim \mathrm{Uniform}(0, 1)$. Find the distribution of $Y = -\ln X$.

Solution

Here $g(x) = -\ln x$, which is strictly decreasing on $(0, 1)$. The inverse is $g^{-1}(y) = e^{-y}$ for $y \gt 0$. We have $(g^{-1})'(y) = -e^{-y}$.

$$f_Y(y) = f_X(e^{-y}) \cdot |-e^{-y}| = 1 \cdot e^{-y} = e^{-y}, \quad y \gt 0$$

This is the $\mathrm{Exp}(1)$ distribution. $\blacksquare$

Problem 2.2. Let $X \sim N(0, 1)$. Find the distribution of $Y = X^2$.

Solution

The function $g(x) = x^2$ is not monotone, so Proposition 2.1 does not apply directly; we compute the CDF instead.

For $y \gt 0$, using $\Phi(-a) = 1 - \Phi(a)$:

$$F_Y(y) = P(X^2 \leq y) = P(-\sqrt{y} \leq X \leq \sqrt{y}) = \Phi(\sqrt{y}) - \Phi(-\sqrt{y}) = 2\Phi(\sqrt{y}) - 1$$

Differentiating:

$$f_Y(y) = 2 \cdot \phi(\sqrt{y}) \cdot \frac{1}{2\sqrt{y}} = \frac{1}{\sqrt{y}} \cdot \frac{1}{\sqrt{2\pi}} e^{-y/2} = \frac{1}{\sqrt{2\pi y}} e^{-y/2}, \quad y \gt 0$$

This is the PDF of a $\chi^2_1$ (chi-squared with 1 degree of freedom) distribution, which equals $\mathrm{Gamma}(1/2, 1/2)$ in the shape-rate parametrisation. $\blacksquare$

2.6 Quantile Function

Definition. The quantile function (or inverse CDF) of a random variable $X$ with CDF $F$ is

$$F^{-1}(p) = \inf\{x : F(x) \geq p\}, \quad 0 \lt p \lt 1$$

Remark. If $F$ is continuous and strictly increasing, then $F^{-1}(p)$ is the unique $x$ such that $F(x) = p$. When $F$ has jumps or flat stretches (for example, for discrete distributions), the infimum definition gives the generalised inverse.

Theorem 2.2 (Probability Integral Transform). If $X$ has a continuous CDF $F$, then $F(X) \sim \mathrm{Uniform}(0, 1)$.

Proof. For $u \in (0, 1)$: $P(F(X) \leq u) = P(X \leq F^{-1}(u)) = F(F^{-1}(u)) = u$. The first equality is immediate when $F$ is strictly increasing; in general the two events differ only on an interval where $F$ is flat, which has probability zero. The last equality uses continuity of $F$. $\blacksquare$

Intuition. This theorem is the foundation of inverse transform sampling: to generate from any distribution with CDF $F$, draw $U \sim \mathrm{Uniform}(0, 1)$ and compute $X = F^{-1}(U)$.
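As an illustration of inverse transform sampling, take the exponential distribution with rate $\lambda$, whose CDF $F(x) = 1 - e^{-\lambda x}$ inverts in closed form to $F^{-1}(u) = -\ln(1 - u)/\lambda$. The sketch below uses only the standard library; the function names, the rate, the sample size and the seed are arbitrary choices for this example.

```python
import math
import random


def exp_quantile(u, rate):
    """Inverse CDF of Exp(rate): solves 1 - exp(-rate * x) = u for x."""
    return -math.log(1.0 - u) / rate


def sample_exponential(n, rate, seed=0):
    """Draw n values by applying the quantile function to uniforms."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so 1 - u is in (0, 1] and log is safe.
    return [exp_quantile(rng.random(), rate) for _ in range(n)]


samples = sample_exponential(100_000, rate=2.0)
sample_mean = sum(samples) / len(samples)
print(sample_mean)  # should be close to the true mean 1/rate = 0.5
```

The same two-step recipe works for any distribution whose quantile function can be evaluated, in closed form or numerically.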

2.7 Order Statistics

Definition. Let $X_1, \ldots, X_n$ be i.i.d. with CDF $F$ and PDF $f$. The order statistics $X_{(1)} \leq X_{(2)} \leq \cdots \leq X_{(n)}$ are the sorted values.

Theorem 2.3. The PDF of the $k$-th order statistic $X_{(k)}$ is

$$f_{X_{(k)}}(x) = \frac{n!}{(k-1)!(n-k)!}\, [F(x)]^{k-1}\, [1 - F(x)]^{n-k}\, f(x)$$

Proof (heuristic). For $X_{(k)} \leq x$ to hold, at least $k$ of the $X_i$ must be $\leq x$. The event $X_{(k)} \in (x, x + dx)$ requires exactly $k - 1$ observations below $x$, one in $(x, x + dx)$, and $n - k$ above $x + dx$. Counting the ways to assign observations to these three groups with the multinomial coefficient $\binom{n}{k-1,\, 1,\, n-k} = \frac{n!}{(k-1)!\, 1!\, (n-k)!}$:

$$f_{X_{(k)}}(x)\, dx = \binom{n}{k-1,\, 1,\, n-k}\, [F(x)]^{k-1}\, f(x)\, dx\, [1 - F(x)]^{n-k}$$

which gives the result after cancelling $dx$. $\blacksquare$

Problem 2.3. Let $X_1, X_2, X_3$ be i.i.d. $\mathrm{Uniform}(0, 1)$. Find the PDF of the median $X_{(2)}$.

Solution

Here $n = 3$, $k = 2$, $F(x) = x$, $f(x) = 1$ on $(0, 1)$.

$$f_{X_{(2)}}(x) = \frac{3!}{1! \cdot 1!}\, x^{1}\, (1 - x)^{1} \cdot 1 = 6x(1 - x), \quad 0 \lt x \lt 1$$

This is a $\mathrm{Beta}(2, 2)$ distribution. $\blacksquare$
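A quick simulation can corroborate this density. Integrating $6x(1 - x)$ gives the CDF $3x^2 - 2x^3$ on $(0, 1)$, so $P(X_{(2)} \leq 1/4) = 0.15625$. The sketch below compares that value with an empirical frequency; the seed and sample size are arbitrary, and the agreement is only statistical.

```python
import random


def median_cdf(x):
    """CDF of the median of three Uniform(0,1) draws: integral of 6t(1-t)."""
    return 3 * x**2 - 2 * x**3


rng = random.Random(1)
n = 100_000

# Middle value of three independent uniforms, repeated n times.
medians = [sorted(rng.random() for _ in range(3))[1] for _ in range(n)]

empirical = sum(m <= 0.25 for m in medians) / n
print(empirical, median_cdf(0.25))  # the two numbers should be close
```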

3. Common Distributions

3.1 Discrete Distributions

Bernoulli. $X \sim \mathrm{Bernoulli}(p)$: $P(X = 1) = p$, $P(X = 0) = 1 - p$.

$$E[X] = p, \quad \mathrm{Var}(X) = p(1 - p)$$

$$M_X(t) = 1 - p + pe^t$$

Proof of MGF: $M_X(t) = E[e^{tX}] = e^{t \cdot 0}(1 - p) + e^{t \cdot 1} \cdot p = 1 - p + pe^t$. $\blacksquare$

Binomial. $X \sim \mathrm{Bin}(n, p)$: number of successes in $n$ independent Bernoulli trials.

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n$$

$$E[X] = np, \quad \mathrm{Var}(X) = np(1-p)$$

$$M_X(t) = (1 - p + pe^t)^n$$

Proof of MGF: Since $X = \sum_{i=1}^n X_i$ where $X_i \sim \mathrm{Bernoulli}(p)$ are independent:

$$M_X(t) = \prod_{i=1}^n M_{X_i}(t) = (1 - p + pe^t)^n \quad \blacksquare$$

Geometric. $X \sim \mathrm{Geom}(p)$: number of trials until the first success (counting the success).

$$P(X = k) = (1 - p)^{k-1} p, \quad k = 1, 2, 3, \ldots$$

$$E[X] = \frac{1}{p}, \quad \mathrm{Var}(X) = \frac{1 - p}{p^2}$$

$$M_X(t) = \frac{pe^t}{1 - (1 - p)e^t}, \quad \text{for } t \lt -\ln(1 - p)$$

Proof that $E[X] = 1/p$:

$$E[X] = \sum_{k=1}^{\infty} k(1-p)^{k-1}p = p \sum_{k=1}^{\infty} k(1-p)^{k-1} = p \cdot \frac{1}{(1 - (1-p))^2} = \frac{p}{p^2} = \frac{1}{p}$$

using the identity $\sum_{k=1}^{\infty} kr^{k-1} = 1/(1-r)^2$ for $|r| \lt 1$. $\blacksquare$

Negative Binomial. $X \sim \mathrm{NegBin}(r, p)$: number of trials until the $r$-th success.

$$P(X = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}, \quad k = r, r+1, \ldots$$

$$E[X] = \frac{r}{p}, \quad \mathrm{Var}(X) = \frac{r(1-p)}{p^2}$$

Hypergeometric. $X \sim \mathrm{Hypergeometric}(N, K, n)$: number of successes when sampling $n$ items without replacement from a population of $N$ containing $K$ "successes."

$$P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}, \quad k = \max(0, n + K - N), \ldots, \min(n, K)$$

$$E[X] = n \cdot \frac{K}{N}, \quad \mathrm{Var}(X) = n \cdot \frac{K}{N} \cdot \frac{N-K}{N} \cdot \frac{N-n}{N-1}$$

Remark. The factor $(N - n)/(N - 1)$ is the finite population correction. When $n \ll N$, the Hypergeometric is well-approximated by $\mathrm{Bin}(n, K/N)$.

Poisson. $X \sim \mathrm{Poisson}(\lambda)$: models rare events occurring at rate $\lambda$.

$$P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, 2, \ldots$$

$$E[X] = \lambda, \quad \mathrm{Var}(X) = \lambda$$

$$M_X(t) = \exp\left(\lambda(e^t - 1)\right)$$

Proof of MGF:

$$M_X(t) = \sum_{k=0}^{\infty} e^{tk} \frac{e^{-\lambda} \lambda^k}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{(\lambda e^t)^k}{k!} = e^{-\lambda} \cdot e^{\lambda e^t} = \exp(\lambda(e^t - 1))$$

$\blacksquare$

Proof that $E[X] = \lambda$:

$$E[X] = \sum_{k=0}^{\infty} k \frac{e^{-\lambda} \lambda^k}{k!} = e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^k}{(k-1)!} = e^{-\lambda} \lambda \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = \lambda$$

$\blacksquare$

3.2 Continuous Distributions

Uniform. $X \sim \mathrm{Uniform}(a, b)$:

$$f(x) = \begin{cases} \frac{1}{b - a} & \text{if } a \leq x \leq b \\ 0 & \text{otherwise} \end{cases}$$

$$E[X] = \frac{a + b}{2}, \quad \mathrm{Var}(X) = \frac{(b-a)^2}{12}$$

$$M_X(t) = \begin{cases} \frac{e^{tb} - e^{ta}}{t(b - a)} & \text{if } t \neq 0 \\ 1 & \text{if } t = 0 \end{cases}$$

Gamma. $X \sim \mathrm{Gamma}(\alpha, \lambda)$ (shape $\alpha \gt 0$, rate $\lambda \gt 0$):

$$f(x) = \frac{\lambda^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\lambda x}, \quad x \gt 0$$

where $\Gamma(\alpha) = \int_0^\infty t^{\alpha - 1} e^{-t}\, dt$ is the Gamma function.

$$E[X] = \frac{\alpha}{\lambda}, \quad \mathrm{Var}(X) = \frac{\alpha}{\lambda^2}$$

$$M_X(t) = \left(\frac{\lambda}{\lambda - t}\right)^\alpha, \quad t \lt \lambda$$

Remark. Special cases: $\mathrm{Gamma}(1, \lambda) = \mathrm{Exp}(\lambda)$; $\mathrm{Gamma}(n/2, 1/2) = \chi^2_n$.

Theorem 3.0 (Sum of Independent Gammas). If $X \sim \mathrm{Gamma}(\alpha_1, \lambda)$ and $Y \sim \mathrm{Gamma}(\alpha_2, \lambda)$ are independent, then $X + Y \sim \mathrm{Gamma}(\alpha_1 + \alpha_2, \lambda)$.

Proof. For $t \lt \lambda$, $M_{X+Y}(t) = M_X(t)\, M_Y(t) = \left(\frac{\lambda}{\lambda - t}\right)^{\alpha_1} \left(\frac{\lambda}{\lambda - t}\right)^{\alpha_2} = \left(\frac{\lambda}{\lambda - t}\right)^{\alpha_1 + \alpha_2}$, which is the MGF of $\mathrm{Gamma}(\alpha_1 + \alpha_2, \lambda)$. The claim follows because an MGF that exists near 0 determines the distribution. $\blacksquare$

Chi-Squared. $X \sim \chi^2_k$ (chi-squared with $k$ degrees of freedom):

$$f(x) = \frac{1}{2^{k/2}\, \Gamma(k/2)}\, x^{k/2 - 1}\, e^{-x/2}, \quad x \gt 0$$

$$E[X] = k, \quad \mathrm{Var}(X) = 2k$$

$$M_X(t) = (1 - 2t)^{-k/2}, \quad t \lt 1/2$$

Remark. If $Z_1, \ldots, Z_k \sim N(0, 1)$ are independent, then $\sum_{i=1}^k Z_i^2 \sim \chi^2_k$.

Beta. $X \sim \mathrm{Beta}(\alpha, \beta)$ with $\alpha, \beta \gt 0$:

$$f(x) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)}, \quad 0 \lt x \lt 1$$

where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$ is the Beta function.

$$E[X] = \frac{\alpha}{\alpha + \beta}, \quad \mathrm{Var}(X) = \frac{\alpha \beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$$

Exponential. $X \sim \mathrm{Exp}(\lambda)$:

$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \geq 0 \\ 0 & \text{if } x \lt 0 \end{cases}$$

$$E[X] = \frac{1}{\lambda}, \quad \mathrm{Var}(X) = \frac{1}{\lambda^2}$$

Theorem 3.1 (Memoryless Property). If $X \sim \mathrm{Exp}(\lambda)$, then for all $s, t \gt 0$:

$$P(X \gt s + t \mid X \gt s) = P(X \gt t)$$

Proof. The survival function is $P(X \gt x) = e^{-\lambda x}$ for $x \geq 0$, and $\{X \gt s + t\} \subseteq \{X \gt s\}$, so

$$P(X \gt s + t \mid X \gt s) = \frac{P(X \gt s + t)}{P(X \gt s)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X \gt t)$$

$\blacksquare$

Remark. The exponential distribution is the only continuous distribution on $[0, \infty)$ with the memoryless property. This makes it the natural model for waiting times between Poisson events.

Normal (Gaussian). $X \sim N(\mu, \sigma^2)$:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \quad x \in \mathbb{R}$$

$$E[X] = \mu, \quad \mathrm{Var}(X) = \sigma^2$$

The standard normal $Z \sim N(0,1)$ has PDF denoted $\phi(z)$ and CDF denoted $\Phi(z)$. For any $X \sim N(\mu, \sigma^2)$:

$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$

$$M_X(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)$$

Verification that $f$ integrates to 1. Consider $I = \int_{-\infty}^{\infty} e^{-x^2/2}\, dx$. Then $I^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)/2}\, dx\, dy$. Switching to polar coordinates: $x = r\cos\theta$, $y = r\sin\theta$, $dx\, dy = r\, dr\, d\theta$.

$$I^2 = \int_0^{2\pi}\int_0^{\infty} e^{-r^2/2}\, r\, dr\, d\theta = 2\pi \int_0^{\infty} e^{-r^2/2}\, r\, dr = 2\pi \left[-e^{-r^2/2}\right]_0^{\infty} = 2\pi$$

So $I = \sqrt{2\pi}$, confirming that $\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-x^2/2}\, dx = 1$. For the general $N(\mu, \sigma^2)$, the substitution $z = (x - \mu)/\sigma$ reduces to the standard case.

Verification that $E[X] = \mu$: For $Z \sim N(0, 1)$, the integrand $z \cdot \phi(z)$ is an odd, integrable function of $z$, so $E[Z] = \int_{-\infty}^{\infty} z\, \phi(z)\, dz = 0$. For $X = \mu + \sigma Z$: $E[X] = \mu + \sigma \cdot 0 = \mu$.

Verification that $\mathrm{Var}(Z) = 1$: Integration by parts with $u = z$, $dv = z\, \phi(z)\, dz$, so that $v = -\phi(z)$ because $\phi'(z) = -z\,\phi(z)$:

$$E[Z^2] = \int_{-\infty}^{\infty} z^2 \phi(z)\, dz = \left[-z\phi(z)\right]_{-\infty}^{\infty} + \int_{-\infty}^{\infty} \phi(z)\, dz = 0 + 1 = 1$$

Hence $\mathrm{Var}(Z) = E[Z^2] - (E[Z])^2 = 1$.

Proof of MGF for $Z \sim N(0, 1)$:

$$M_Z(t) = \int_{-\infty}^{\infty} e^{tz} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\, dz = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(z^2 - 2tz)/2}\, dz$$

Completing the square: $z^2 - 2tz = (z - t)^2 - t^2$, so:

$$M_Z(t) = e^{t^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(z-t)^2/2}\, dz = e^{t^2/2} \cdot 1 = e^{t^2/2}$$

since the integrand is the PDF of $N(t, 1)$, which integrates to 1 over $\mathbb{R}$. $\blacksquare$

For $X = \mu + \sigma Z$: $M_X(t) = E[e^{t(\mu + \sigma Z)}] = e^{\mu t} E[e^{(\sigma t)Z}] = e^{\mu t} e^{\sigma^2 t^2/2}$.

Theorem 3.2. The sum of independent normal random variables is normal: if $X_i \sim N(\mu_i, \sigma_i^2)$ are independent, then $\sum X_i \sim N(\sum \mu_i, \sum \sigma_i^2)$.

Proof (using MGFs). $M_{\sum X_i}(t) = \prod_i M_{X_i}(t) = \prod_i \exp(\mu_i t + \sigma_i^2 t^2/2) = \exp\left((\sum \mu_i)t + \frac{(\sum \sigma_i^2) t^2}{2}\right)$, which is the MGF of $N(\sum \mu_i, \sum \sigma_i^2)$. By the uniqueness theorem for MGFs, the result follows. $\blacksquare$

3.3 Relationships Between Distributions

Theorem 3.3 (Poisson Limit Theorem). If $X_n \sim \mathrm{Bin}(n, p_n)$ where $np_n \to \lambda$ as $n \to \infty$, then $X_n \xrightarrow{d} \mathrm{Poisson}(\lambda)$.

Proof. Let $\lambda_n = np_n$ so that $p_n = \lambda_n / n$. Then for each fixed $k$:

$$P(X_n = k) = \binom{n}{k} p_n^k (1 - p_n)^{n-k} = \frac{n(n-1)\cdots(n-k+1)}{k!} \left(\frac{\lambda_n}{n}\right)^k \left(1 - \frac{\lambda_n}{n}\right)^{n-k}$$

$$= \frac{\lambda_n^k}{k!} \cdot \frac{n}{n} \cdot \frac{n-1}{n} \cdots \frac{n-k+1}{n} \cdot \left(1 - \frac{\lambda_n}{n}\right)^n \cdot \left(1 - \frac{\lambda_n}{n}\right)^{-k}$$

As $n \to \infty$: $\lambda_n \to \lambda$; $\frac{n-j}{n} \to 1$ for each fixed $j$; $\left(1 - \frac{\lambda_n}{n}\right)^n \to e^{-\lambda}$; and $\left(1 - \frac{\lambda_n}{n}\right)^{-k} \to 1$.

Therefore: $P(X_n = k) \to \frac{\lambda^k e^{-\lambda}}{k!} = P(\mathrm{Poisson}(\lambda) = k)$. For integer-valued random variables, convergence of the PMF at every $k$ implies convergence in distribution. $\blacksquare$

Intuition. The Poisson distribution approximates the binomial when $n$ is large, $p$ is small, and $np$ is moderate.
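The convergence can be observed numerically. The sketch below fixes $\lambda = 3$ (an arbitrary choice) and, for growing $n$, reports the largest pointwise gap between the $\mathrm{Bin}(n, \lambda/n)$ and $\mathrm{Poisson}(\lambda)$ PMFs over $k = 0, \ldots, 20$. The helper names are illustrative.

```python
import math


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p); math.comb returns 0 when k > n."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def max_gap(n, lam, kmax=20):
    """Largest |binomial pmf - Poisson pmf| over k = 0..kmax."""
    p = lam / n
    return max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(kmax + 1))


lam = 3.0
gaps = {n: max_gap(n, lam) for n in (10, 100, 1000, 10000)}
for n, gap in gaps.items():
    print(n, gap)  # the gap shrinks as n grows
```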

Theorem 3.4 (Normal approximation to the Binomial). If $X \sim \mathrm{Bin}(n, p)$ with $n$ large, then approximately $X \approx N(np, np(1-p))$. More precisely, for integers $a \leq b$, using a continuity correction:

$$P(a \leq X \leq b) \approx \Phi\left(\frac{b + 0.5 - np}{\sqrt{np(1-p)}}\right) - \Phi\left(\frac{a - 0.5 - np}{\sqrt{np(1-p)}}\right)$$

Intuition. By the CLT, the sum of $n$ i.i.d. $\mathrm{Bernoulli}(p)$ variables (each with mean $p$ and variance $p(1-p)$) is approximately normal.
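To see how close the continuity-corrected approximation is, the following sketch compares it with the exact binomial sum for one hypothetical case, $n = 100$, $p = 0.5$, $a = 45$, $b = 55$. The standard normal CDF is computed from `math.erf` via $\Phi(z) = \tfrac{1}{2}\left(1 + \mathrm{erf}(z/\sqrt{2})\right)$.

```python
import math


def std_normal_cdf(z):
    """Phi(z), expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binom_interval(a, b, n, p):
    """Exact P(a <= X <= b) for X ~ Bin(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(a, b + 1))


def normal_approx(a, b, n, p):
    """Normal approximation with the +/- 0.5 continuity correction."""
    mu = n * p
    sd = math.sqrt(n * p * (1 - p))
    return std_normal_cdf((b + 0.5 - mu) / sd) - std_normal_cdf((a - 0.5 - mu) / sd)


exact = binom_interval(45, 55, 100, 0.5)
approx = normal_approx(45, 55, 100, 0.5)
print(exact, approx)  # both are close to 0.729
```

The approximation degrades when $p$ is near 0 or 1 or when $n$ is small, so the agreement seen here should not be assumed in general.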

3.4 Worked Examples

Problem 3.1. A call centre receives an average of 4.5 calls per minute. What is the probability of receiving more than 6 calls in a given minute?

Solution

Model the number of calls as $X \sim \mathrm{Poisson}(4.5)$.

$$P(X \gt 6) = 1 - P(X \leq 6) = 1 - \sum_{k=0}^{6} \frac{e^{-4.5} \cdot 4.5^k}{k!}$$

$$= 1 - e^{-4.5}\left(1 + 4.5 + \frac{4.5^2}{2} + \frac{4.5^3}{6} + \frac{4.5^4}{24} + \frac{4.5^5}{120} + \frac{4.5^6}{720}\right)$$

$$= 1 - e^{-4.5}(1 + 4.5 + 10.125 + 15.1875 + 17.0859 + 15.3773 + 11.5330)$$

$$= 1 - e^{-4.5} \times 74.8088 \approx 1 - 0.8311 \approx 0.1689$$
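The partial sum is tedious by hand, so it is worth evaluating directly. The sketch below computes the Poisson CDF from its definition with the standard library; `poisson_cdf` is a helper name chosen for this example.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i)
               for i in range(k + 1))


cdf_at_6 = poisson_cdf(6, 4.5)
tail = 1.0 - cdf_at_6
print(round(cdf_at_6, 4), round(tail, 4))  # 0.8311 0.1689
```

Since $e^{-4.5} \approx 0.011109$ and the bracketed sum is about $74.8088$, the product is about $0.8311$, giving a tail probability of roughly $0.169$.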

Problem 3.2. The lifetime of a component is exponentially distributed with mean 500 hours. Given that the component has lasted 300 hours, what is the probability it lasts at least another 200 hours?

Solution

The mean is $1/\lambda = 500$, so $\lambda = 1/500$. By the memoryless property:

$$P(X \gt 300 + 200 \mid X \gt 300) = P(X \gt 200) = e^{-200/500} = e^{-0.4} \approx 0.6703$$

4. Expectation, Variance, and Moment Generating Functions

4.1 Expectation

Definition. The expected value of $X$ is

$$E[X] = \begin{cases} \sum_x x\, f_X(x) & \text{(discrete)} \\ \int_{-\infty}^{\infty} x\, f_X(x)\, dx & \text{(continuous)} \end{cases}$$

provided the sum or integral converges absolutely.

Proposition 4.1 (LOTUS: Law of the Unconscious Statistician). For any measurable function $g$ for which the expectation exists:

$$E[g(X)] = \begin{cases} \sum_x g(x)\, f_X(x) & \text{(discrete)} \\ \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx & \text{(continuous)} \end{cases}$$

Intuition. We do not need to find the distribution of $Y = g(X)$ to compute $E[Y]$; we integrate with respect to the distribution of $X$ directly.

Theorem 4.1 (Linearity of Expectation).

  1. $E[aX + b] = aE[X] + b$.
  2. $E[X + Y] = E[X] + E[Y]$ for any random variables $X, Y$ with finite expectations (no independence required).

Proof. We prove (2) for the continuous case with joint density $f_{X,Y}$; the discrete case is analogous.

$$E[X + Y] = \iint_{\mathbb{R}^2} (x + y)\, f_{X,Y}(x,y)\, dx\, dy$$

$$= \iint_{\mathbb{R}^2} x\, f_{X,Y}(x,y)\, dx\, dy + \iint_{\mathbb{R}^2} y\, f_{X,Y}(x,y)\, dx\, dy$$

$$= \int_{-\infty}^{\infty} x \left(\int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dy\right) dx + \int_{-\infty}^{\infty} y \left(\int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dx\right) dy$$

$$= \int_{-\infty}^{\infty} x\, f_X(x)\, dx + \int_{-\infty}^{\infty} y\, f_Y(y)\, dy = E[X] + E[Y] \quad \blacksquare$$

  3. If $X$ and $Y$ are independent, $E[XY] = E[X]E[Y]$.

Proof. Independence means the joint density factorises, $f_{X,Y}(x,y) = f_X(x) f_Y(y)$, so $E[XY] = \iint xy\, f_X(x)f_Y(y)\, dx\, dy = \left(\int x f_X(x)\, dx\right)\left(\int y f_Y(y)\, dy\right) = E[X]E[Y]$. $\blacksquare$

4.2 Variance

$$\mathrm{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$

Theorem 4.2.

  1. $\mathrm{Var}(aX + b) = a^2 \mathrm{Var}(X)$.
  2. If $X, Y$ are independent: $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.

Proof. (1) $\mathrm{Var}(aX + b) = E[(aX + b - aE[X] - b)^2] = E[a^2(X - E[X])^2] = a^2 \mathrm{Var}(X)$.

(2) Expanding the square and writing $\mathrm{Cov}(X,Y) = E[XY] - E[X]E[Y]$:

$$\begin{aligned} \mathrm{Var}(X + Y) &= E[(X + Y)^2] - (E[X] + E[Y])^2 \\ &= E[X^2] + 2E[XY] + E[Y^2] - E[X]^2 - 2E[X]E[Y] - E[Y]^2 \\ &= (E[X^2] - E[X]^2) + (E[Y^2] - E[Y]^2) + 2(E[XY] - E[X]E[Y]) \\ &= \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X,Y) \end{aligned}$$

If $X, Y$ are independent, $E[XY] = E[X]E[Y]$ and hence $\mathrm{Cov}(X,Y) = 0$. $\blacksquare$
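The general identity $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$ can be verified exactly on a small dependent example. The joint PMF below is invented purely for illustration; the variables are positively correlated, so the covariance term does not vanish.

```python
from fractions import Fraction

# Hypothetical joint pmf of (X, Y) on {0, 1} x {0, 1}; entries sum to 1.
joint = {
    (0, 0): Fraction(1, 2),
    (1, 1): Fraction(1, 4),
    (0, 1): Fraction(1, 8),
    (1, 0): Fraction(1, 8),
}


def expect(g):
    """E[g(X, Y)] under the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in joint.items())


ex, ey = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: x * x) - ex**2
var_y = expect(lambda x, y: y * y) - ey**2
cov = expect(lambda x, y: x * y) - ex * ey
var_sum = expect(lambda x, y: (x + y) ** 2) - expect(lambda x, y: x + y) ** 2

print(var_sum, var_x + var_y + 2 * cov)  # 11/16 11/16
```

Here $\mathrm{Cov}(X, Y) = 7/64$, so dropping the covariance term would understate the variance of the sum.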

4.3 Moment Generating Functions

The moment generating function (MGF) of $X$ is

$$M_X(t) = E[e^{tX}]$$

(provided the expectation exists in a neighbourhood of $t = 0$).

Theorem 4.3. If $M_X(t)$ exists in a neighbourhood of $0$, then $E[X^n] = M_X^{(n)}(0)$.

Proof. $M_X(t) = E[e^{tX}] = \sum_{n=0}^{\infty} \frac{t^n}{n!} E[X^n]$ (by expanding the Taylor series and exchanging summation and expectation, justified by dominated convergence). The coefficient of $t^n/n!$ is $E[X^n]$, so $E[X^n] = M_X^{(n)}(0)$. $\blacksquare$
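As a numerical illustration of reading moments off an MGF, the sketch below approximates $M'(0)$ and $M''(0)$ by central finite differences for the Poisson MGF $\exp(\lambda(e^t - 1))$ with $\lambda = 2$. The step size `h` and the helper names are assumptions of this example, and finite differences give approximations only.

```python
import math


def poisson_mgf(t, lam):
    """MGF of Poisson(lam): exp(lam * (e^t - 1))."""
    return math.exp(lam * (math.exp(t) - 1.0))


def first_two_moments(mgf, h=1e-4):
    """Approximate M'(0) and M''(0) by central differences."""
    m_plus, m_zero, m_minus = mgf(h), mgf(0.0), mgf(-h)
    first = (m_plus - m_minus) / (2.0 * h)
    second = (m_plus - 2.0 * m_zero + m_minus) / h**2
    return first, second


lam = 2.0
m1, m2 = first_two_moments(lambda t: poisson_mgf(t, lam))
variance = m2 - m1**2
print(m1, m2, variance)  # approximately 2, 6 and 2
```

The values agree with $E[X] = \lambda$, $E[X^2] = \lambda + \lambda^2$ and $\mathrm{Var}(X) = \lambda$ up to discretisation and rounding error.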

Theorem 4.4 (Uniqueness). If MX(t)=MY(t)M_X(t) = M_Y(t) for all tt in a neighbourhood of 00Then XX And YY have the same distribution.

Theorem 4.5. If XX and YY are independent, MX+Y(t)=MX(t)MY(t)M_{X+Y}(t) = M_X(t) M_Y(t).

Proof. $M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}]\, E[e^{tY}] = M_X(t)\, M_Y(t)$, where the third equality uses independence. $\blacksquare$

4.4 Important Inequalities

Theorem 4.6a (Markov’s Inequality). If X0X \geq 0 and a>0a \gt 0:

P(Xa)E[X]aP(X \geq a) \leq \frac{E[X]}{a}

Proof. E[X]=0xdF(x)axdF(x)aadF(x)=aP(Xa)E[X] = \int_0^\infty x\, dF(x) \geq \int_a^\infty x\, dF(x) \geq a \int_a^\infty dF(x) = a\, P(X \geq a). \blacksquare

Theorem 4.6b (Chebyshev’s Inequality). For any random variable $X$ with finite mean $\mu$ and variance $\sigma^2$, and any $k \gt 0$:

P(Xμk)σ2k2P(|X - \mu| \geq k) \leq \frac{\sigma^2}{k^2}

Proof. Apply Markov’s inequality to (Xμ)2(X - \mu)^2 with a=k2a = k^2: P(Xμk)=P((Xμ)2k2)E[(Xμ)2]k2=σ2k2P(|X - \mu| \geq k) = P((X - \mu)^2 \geq k^2) \leq \frac{E[(X-\mu)^2]}{k^2} = \frac{\sigma^2}{k^2}. \blacksquare

Theorem 4.6c (Jensen’s Inequality). If φ\varphi is convex, then E[φ(X)]φ(E[X])E[\varphi(X)] \geq \varphi(E[X]). If φ\varphi is concave, the inequality reverses.

Proof (sketch). For a convex function $\varphi$, the tangent line at any point lies below the graph: $\varphi(x) \geq \varphi(\mu) + \varphi'(\mu)(x - \mu)$, where $\mu = E[X]$ (if $\varphi$ is not differentiable at $\mu$, a supporting line with some slope $c$ plays the same role). Taking expectations of both sides: $E[\varphi(X)] \geq \varphi(\mu) + \varphi'(\mu) \cdot 0 = \varphi(E[X])$. $\blacksquare$

Remark. Important applications: $E[X^2] \geq (E[X])^2$ (variance is non-negative, since $x^2$ is convex); $E[\log X] \leq \log E[X]$ for $X \gt 0$ (the logarithm is concave; this is used in proving the information inequality).

4.5 Cauchy-Schwarz Inequality for Random Variables

Theorem 4.6 (Cauchy-Schwarz). For any random variables X,YX, Y with finite second moments:

(E[XY])2E[X2]E[Y2](E[XY])^2 \leq E[X^2]\, E[Y^2]

Proof. For any real $t$, $E[(X + tY)^2] = E[X^2] + 2t\, E[XY] + t^2\, E[Y^2] \geq 0$. This is a quadratic in $t$ that is always non-negative, so its discriminant must be non-positive:

(2E[XY])24E[Y2]E[X2]0    (E[XY])2E[X2]E[Y2](2E[XY])^2 - 4\, E[Y^2]\, E[X^2] \leq 0 \implies (E[XY])^2 \leq E[X^2]\, E[Y^2]

\blacksquare

Corollary 4.1. ρX,Y1|\rho_{X,Y}| \leq 1.

Proof. Apply Cauchy-Schwarz to XE[X]X - E[X] and YE[Y]Y - E[Y]:

Cov(X,Y)2Var(X)Var(Y)\mathrm{Cov}(X,Y)^2 \leq \mathrm{Var}(X)\, \mathrm{Var}(Y)

So ρX,Y=Cov(X,Y)Var(X)Var(Y)1|\rho_{X,Y}| = \frac{|\mathrm{Cov}(X,Y)|}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} \leq 1. \blacksquare

Theorem 4.7 (Conditional Variance Formula).

Var(Y)=E[Var(YX)]+Var(E[YX])\mathrm{Var}(Y) = E[\mathrm{Var}(Y \mid X)] + \mathrm{Var}(E[Y \mid X])

Proof. E[Var(YX)]=E[E[Y2X]]E[(E[YX])2]=E[Y2]E[(E[YX])2]E[\mathrm{Var}(Y \mid X)] = E[E[Y^2 \mid X]] - E[(E[Y \mid X])^2] = E[Y^2] - E[(E[Y \mid X])^2]. Also Var(E[YX])=E[(E[YX])2](E[E[YX]])2=E[(E[YX])2](E[Y])2\mathrm{Var}(E[Y \mid X]) = E[(E[Y \mid X])^2] - (E[E[Y \mid X]])^2 = E[(E[Y \mid X])^2] - (E[Y])^2. Adding: E[Y2](E[Y])2=Var(Y)E[Y^2] - (E[Y])^2 = \mathrm{Var}(Y). \blacksquare

Intuition. Total variance equals the average “within-group” variance plus the “between-group” variance. This decomposition is the foundation of ANOVA (Analysis of Variance).
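The conditional variance formula can be checked exactly on a small discrete example. The sketch below uses a hypothetical joint pmf (the numbers are illustrative only, not taken from the text) and exact rational arithmetic, so the two sides agree without rounding error.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(X = x, Y = y); values chosen only for illustration.
joint = {
    (0, 1): F(1, 4), (0, 3): F(1, 4),
    (1, 2): F(1, 8), (1, 6): F(3, 8),
}

def mean(pmf):
    """Mean of a pmf given as {value: probability}."""
    return sum(v * p for v, p in pmf.items())

def var(pmf):
    """Variance of a pmf given as {value: probability}."""
    m = mean(pmf)
    return sum((v - m) ** 2 * p for v, p in pmf.items())

# Marginal pmfs of X and Y.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

# Conditional pmf of Y given X = x, for each x.
cond = {
    x: {y: p / p_x[x] for (xx, y), p in joint.items() if xx == x}
    for x in p_x
}

# E[Var(Y | X)]: average of the within-group variances.
within = sum(p_x[x] * var(cond[x]) for x in p_x)

# Var(E[Y | X]): variance of the group means, weighted by P(X = x).
cond_mean_pmf = {}
for x in p_x:
    m = mean(cond[x])
    cond_mean_pmf[m] = cond_mean_pmf.get(m, 0) + p_x[x]
between = var(cond_mean_pmf)

total = var(p_y)
print(within, between, total)
```

Here the within-group and between-group parts sum exactly to the total variance of $Y$.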

4.6 Worked Examples

Worked Example. Find the MGF of XBin(n,p)X \sim \mathrm{Bin}(n, p) and use it to derive E[X]E[X] and Var(X)\mathrm{Var}(X).

MX(t)=(1p+pet)nM_X(t) = (1 - p + pe^t)^n

MX(t)=n(1p+pet)n1petM_X'(t) = n(1 - p + pe^t)^{n-1} \cdot pe^t

E[X]=MX(0)=n(1)n1p=npE[X] = M_X'(0) = n(1)^{n-1} \cdot p = np

MX(t)=n(n1)(1p+pet)n2(pet)2+n(1p+pet)n1petM_X''(t) = n(n-1)(1 - p + pe^t)^{n-2}(pe^t)^2 + n(1 - p + pe^t)^{n-1} \cdot pe^t

E[X2]=MX(0)=n(n1)p2+npE[X^2] = M_X''(0) = n(n-1)p^2 + np

Var(X)=n(n1)p2+npn2p2=npnp2=np(1p)\mathrm{Var}(X) = n(n-1)p^2 + np - n^2p^2 = np - np^2 = np(1-p) \quad \blacksquare
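As a numerical sanity check of the derivation above, the following sketch differentiates the closed-form MGF by central finite differences and compares the result with $np$ and $np(1-p)$. The values $n = 10$, $p = 0.3$ and the step size `h` are illustrative assumptions.

```python
from math import comb, exp

def binomial_mgf(t, n, p):
    """Closed form M_X(t) = (1 - p + p e^t)^n."""
    return (1 - p + p * exp(t)) ** n

def mgf_from_pmf(t, n, p):
    """E[e^{tX}] computed directly from the Bin(n, p) pmf."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) * exp(t * k)
        for k in range(n + 1)
    )

n, p = 10, 0.3   # illustrative parameters
h = 1e-4         # finite-difference step (an assumption, not tuned)

# Central differences approximate M'(0) and M''(0).
m1 = (binomial_mgf(h, n, p) - binomial_mgf(-h, n, p)) / (2 * h)
m2 = (binomial_mgf(h, n, p) - 2 * binomial_mgf(0.0, n, p)
      + binomial_mgf(-h, n, p)) / h**2
variance = m2 - m1**2

print(m1, n * p)                    # both close to 3
print(variance, n * p * (1 - p))    # both close to 2.1
```

The helper `mgf_from_pmf` confirms that the closed form agrees with the defining sum $E[e^{tX}] = \sum_k e^{tk} P(X = k)$.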

Problem 4.3. The number of accidents at an intersection per week follows Poisson(λ)\mathrm{Poisson}(\lambda) with λ=2\lambda = 2. Find P(X1)P(X \leq 1) and, using Markov’s inequality, give an upper bound for P(X6)P(X \geq 6).

Solution

P(X1)=P(X=0)+P(X=1)=e2+2e2=3e20.406P(X \leq 1) = P(X = 0) + P(X = 1) = e^{-2} + 2e^{-2} = 3e^{-2} \approx 0.406

By Markov’s inequality (since X0X \geq 0):

P(X6)E[X]6=26=130.333P(X \geq 6) \leq \frac{E[X]}{6} = \frac{2}{6} = \frac{1}{3} \approx 0.333

For comparison, the exact value: P(X6)=1P(X5)=1e2(1+2+2+8/6+16/24+32/120)0.0166P(X \geq 6) = 1 - P(X \leq 5) = 1 - e^{-2}(1 + 2 + 2 + 8/6 + 16/24 + 32/120) \approx 0.0166. Markov’s bound is very loose but requires no knowledge beyond the mean.
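The numbers in this solution can be reproduced with a few lines of standard-library Python. The sketch also adds Chebyshev’s bound for comparison, using the inclusion $\{X \geq 6\} \subseteq \{|X - 2| \geq 4\}$ and $\mathrm{Var}(X) = \lambda$; that extra comparison is not part of the original problem.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

lam = 2.0

p_le_1 = poisson_pmf(0, lam) + poisson_pmf(1, lam)

# Markov: P(X >= 6) <= E[X] / 6.
markov_bound = lam / 6

# Chebyshev: P(X >= 6) <= P(|X - 2| >= 4) <= Var(X) / 4^2.
chebyshev_bound = lam / (6 - lam) ** 2

# Exact tail probability from the pmf.
exact_tail = 1.0 - sum(poisson_pmf(k, lam) for k in range(6))

print(f"P(X <= 1)       = {p_le_1:.4f}")
print(f"Markov bound    = {markov_bound:.4f}")
print(f"Chebyshev bound = {chebyshev_bound:.4f}")
print(f"exact P(X >= 6) = {exact_tail:.4f}")
```

Both bounds hold, and Chebyshev’s is tighter here because it also uses the variance.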

Worked Example. Find the MGF of $X \sim \mathrm{Exp}(\lambda)$ and use it to derive $E[X]$.

$$M_X(t) = E[e^{tX}] = \int_0^{\infty} e^{tx} \lambda e^{-\lambda x}\, dx = \lambda \int_0^{\infty} e^{-(\lambda - t)x}\, dx$$

For $t \lt \lambda$: $M_X(t) = \frac{\lambda}{\lambda - t}$.

$M_X'(t) = \frac{\lambda}{(\lambda - t)^2}$, so $E[X] = M_X'(0) = 1/\lambda$. $\blacksquare$

Problem 4.1. Find E[X2]E[X^2] and Var(X)\mathrm{Var}(X) for XExp(λ)X \sim \mathrm{Exp}(\lambda) using the MGF.

Solution

$M_X''(t) = \frac{2\lambda}{(\lambda - t)^3}$, so $E[X^2] = M_X''(0) = \frac{2\lambda}{\lambda^3} = \frac{2}{\lambda^2}$.

Var(X)=E[X2](E[X])2=2λ21λ2=1λ2\mathrm{Var}(X) = E[X^2] - (E[X])^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}

Problem 4.2. A fair die is rolled. Let $X$ be the value shown. Compute $E[X]$, $E[X^2]$, and $\mathrm{Var}(X)$.

Solution

E[X]=16(1+2+3+4+5+6)=216=3.5E[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = \frac{21}{6} = 3.5

E[X2]=16(1+4+9+16+25+36)=91615.167E[X^2] = \frac{1}{6}(1 + 4 + 9 + 16 + 25 + 36) = \frac{91}{6} \approx 15.167

Var(X)=916(72)2=916494=18214712=35122.917\mathrm{Var}(X) = \frac{91}{6} - \left(\frac{7}{2}\right)^2 = \frac{91}{6} - \frac{49}{4} = \frac{182 - 147}{12} = \frac{35}{12} \approx 2.917
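The same computation can be done with exact fractions, which avoids any rounding in the intermediate steps. This is a minimal sketch; the variable names are illustrative.

```python
from fractions import Fraction

faces = range(1, 7)
prob = Fraction(1, 6)   # each face of a fair die is equally likely

e_x = sum(prob * x for x in faces)        # E[X]
e_x2 = sum(prob * x**2 for x in faces)    # E[X^2]
var_x = e_x2 - e_x**2                     # Var(X) = E[X^2] - (E[X])^2

print(e_x, e_x2, var_x)   # 7/2 91/6 35/12
```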

:::caution Common Pitfall $\mathrm{Var}(X) = E[X^2] - (E[X])^2$, not $(E[X])^2 - E[X^2]$. The variance is always non-negative, so if you obtain a negative value, you have made an arithmetic error. :::

5. Joint Distributions

5.1 Joint PDF and CDF

For two random variables $X$ and $Y$, the joint CDF is $F_{X,Y}(x,y) = P(X \leq x, Y \leq y)$.

The joint PDF (for continuous) satisfies P((X,Y)A)=AfX,Y(x,y)dxdyP((X,Y) \in A) = \iint_A f_{X,Y}(x,y)\, dx\, dy.

5.2 Marginal Distributions

The marginal PDF of XX is obtained by integrating out YY:

fX(x)=fX,Y(x,y)dyf_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dy

Similarly, fY(y)=fX,Y(x,y)dxf_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dx.

5.3 Conditional Distributions

Definition. The conditional PDF of XX given Y=yY = y (when fY(y)>0f_Y(y) \gt 0) is

fXY(xy)=fX,Y(x,y)fY(y)f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}

Definition. The conditional expectation is

E[XY=y]=xfXY(xy)dxE[X \mid Y = y] = \int_{-\infty}^{\infty} x\, f_{X \mid Y}(x \mid y)\, dx

Theorem 5.0 (Law of Iterated Expectations / Tower Property).

E[X]=E[E[XY]]E[X] = E[E[X \mid Y]]

Proof. For the continuous case:

E[E[XY]]=E[XY=y]fY(y)dy=(xfXY(xy)dx)fY(y)dyE[E[X \mid Y]] = \int_{-\infty}^{\infty} E[X \mid Y = y]\, f_Y(y)\, dy = \int_{-\infty}^{\infty} \left(\int_{-\infty}^{\infty} x\, f_{X \mid Y}(x \mid y)\, dx\right) f_Y(y)\, dy

=xfX,Y(x,y)fY(y)fY(y)dxdy=xfX,Y(x,y)dxdy=E[X]= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x\, \frac{f_{X,Y}(x,y)}{f_Y(y)}\, f_Y(y)\, dx\, dy = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x\, f_{X,Y}(x,y)\, dx\, dy = E[X]

\blacksquare

5.4 Independence

XX and YY are independent if and only if

$$f_{X,Y}(x,y) = f_X(x) f_Y(y) \quad \text{for all } x, y$$

Theorem 5.1. If XX and YY are independent, then E[XY]=E[X]E[Y]E[XY] = E[X]E[Y] and Var(X+Y)=Var(X)+Var(Y)\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y).

Proposition 5.1 (Independence Criteria). The following are equivalent for continuous X,YX, Y:

  1. XX and YY are independent.
  2. fX,Y(x,y)=fX(x)fY(y)f_{X,Y}(x,y) = f_X(x) f_Y(y) for all x,yx, y.
  3. fXY(xy)=fX(x)f_{X \mid Y}(x \mid y) = f_X(x) for all x,yx, y with fY(y)>0f_Y(y) \gt 0.
  4. FX,Y(x,y)=FX(x)FY(y)F_{X,Y}(x,y) = F_X(x)\, F_Y(y) for all x,yx, y.

5.5 Covariance and Correlation

Cov(X,Y)=E[(XE[X])(YE[Y])]=E[XY]E[X]E[Y]\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]

The correlation coefficient is

ρX,Y=Cov(X,Y)Var(X)Var(Y)\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}}

Properties:

  • Cov(aX+b,cY+d)=acCov(X,Y)\mathrm{Cov}(aX + b, cY + d) = ac\,\mathrm{Cov}(X, Y).
  • Cov(X,X)=Var(X)\mathrm{Cov}(X, X) = \mathrm{Var}(X).
  • Cov(X+Z,Y)=Cov(X,Y)+Cov(Z,Y)\mathrm{Cov}(X + Z, Y) = \mathrm{Cov}(X, Y) + \mathrm{Cov}(Z, Y) (bilinearity).
  • 1ρX,Y1-1 \leq \rho_{X,Y} \leq 1.
  • $\rho = \pm 1$ if and only if $Y = aX + b$ almost surely for some constants $a \neq 0$ and $b$ (an exact linear relationship).
  • ρ=0\rho = 0 does not imply independence (only uncorrelatedness).

Proposition 5.2 (Variance of a Sum). For any random variables X1,,XnX_1, \ldots, X_n:

Var(i=1nXi)=i=1nVar(Xi)+21i<jnCov(Xi,Xj)\mathrm{Var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \mathrm{Var}(X_i) + 2\sum_{1 \leq i \lt j \leq n} \mathrm{Cov}(X_i, X_j)

Proof. Expand Var(Xi)=E[(Xi)2](E[Xi])2\mathrm{Var}(\sum X_i) = E[(\sum X_i)^2] - (E[\sum X_i])^2 and collect terms. \blacksquare

Remark. If the $X_i$ are pairwise uncorrelated (which includes independence as a special case), the covariance terms vanish and the variance of the sum equals the sum of the variances.

5.6 Transformation of Random Variables (Jacobian Method)

Theorem 5.2. Let $(X, Y)$ have joint PDF $f_{X,Y}(x,y)$ and let $(U, V) = g(X, Y)$, where $g$ is a bijection from $\mathbb{R}^2$ to $\mathbb{R}^2$ (or between the supports of $(X, Y)$ and $(U, V)$) with continuously differentiable inverse $g^{-1}$:

u=u(x,y),v=v(x,y)u = u(x, y), \quad v = v(x, y)

Then the joint PDF of (U,V)(U, V) is

fU,V(u,v)=fX,Y(x(u,v),y(u,v))Jf_{U,V}(u,v) = f_{X,Y}(x(u,v), y(u,v)) \cdot |J|

where the Jacobian determinant is

J=det(xuxvyuyv)J = \det \begin{pmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{pmatrix}

Problem 5.1. Let X,YX, Y be independent with X,YExp(λ)X, Y \sim \mathrm{Exp}(\lambda). Find the joint distribution of U=X+YU = X + Y and V=X/(X+Y)V = X / (X + Y).

Solution

The inverse transformation is X=UVX = UV, Y=U(1V)Y = U(1 - V) for u>0,0<v<1u \gt 0, 0 \lt v \lt 1.

The Jacobian:

J=det(vu1vu)=uvu(1v)=uJ = \det \begin{pmatrix} v & u \\ 1 - v & -u \end{pmatrix} = -uv - u(1 - v) = -u

So J=u|J| = u.

The joint PDF of (X,Y)(X, Y) is fX,Y(x,y)=λ2eλ(x+y)f_{X,Y}(x,y) = \lambda^2 e^{-\lambda(x+y)} for x,y>0x, y \gt 0.

fU,V(u,v)=λ2eλ(uv+u(1v))u=λ2ueλu,u>0,0<v<1f_{U,V}(u,v) = \lambda^2 e^{-\lambda(uv + u(1-v))} \cdot u = \lambda^2 u\, e^{-\lambda u}, \quad u \gt 0, 0 \lt v \lt 1

This factors as $f_U(u) \cdot f_V(v)$, where $f_U(u) = \lambda^2 u\, e^{-\lambda u}$ ($\mathrm{Gamma}(2, \lambda)$) and $f_V(v) = 1$ on $(0, 1)$ ($\mathrm{Uniform}(0, 1)$). Hence $U$ and $V$ are independent. $\blacksquare$

5.7 Bivariate Normal Distribution

Definition. $(X, Y)$ has a bivariate normal distribution with parameters $\mu_X, \mu_Y, \sigma_X^2, \sigma_Y^2, \rho$ (with $|\rho| \lt 1$) if the joint PDF is

f(x,y)=12πσXσY1ρ2exp(12(1ρ2)[(xμX)2σX22ρ(xμX)(yμY)σXσY+(yμY)2σY2])f(x,y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x - \mu_X)^2}{\sigma_X^2} - \frac{2\rho(x - \mu_X)(y - \mu_Y)}{\sigma_X \sigma_Y} + \frac{(y - \mu_Y)^2}{\sigma_Y^2}\right]\right)

Key Properties:

  1. Both marginals are normal: XN(μX,σX2)X \sim N(\mu_X, \sigma_X^2) and YN(μY,σY2)Y \sim N(\mu_Y, \sigma_Y^2).
  2. ρ=Corr(X,Y)\rho = \mathrm{Corr}(X, Y).
  3. XX and YY are independent if and only if ρ=0\rho = 0.
  4. Every linear combination aX+bYaX + bY is normally distributed.
  5. The conditional distribution YX=xY \mid X = x is normal with E[YX=x]=μY+ρσYσX(xμX)E[Y \mid X = x] = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X) and Var(YX=x)=σY2(1ρ2)\mathrm{Var}(Y \mid X = x) = \sigma_Y^2(1 - \rho^2).
  6. The joint MGF is

MX,Y(t1,t2)=exp(μXt1+μYt2+12(σX2t12+2ρσXσYt1t2+σY2t22))M_{X,Y}(t_1, t_2) = \exp\left(\mu_X t_1 + \mu_Y t_2 + \frac{1}{2}(\sigma_X^2 t_1^2 + 2\rho\sigma_X\sigma_Y t_1 t_2 + \sigma_Y^2 t_2^2)\right)

Problem 5.4. Let (X,Y)(X, Y) be bivariate normal with μX=0\mu_X = 0, μY=0\mu_Y = 0, σX=σY=1\sigma_X = \sigma_Y = 1, ρ=1/2\rho = 1/2. Find P(Y>1X=0.5)P(Y \gt 1 \mid X = 0.5).

Solution

The conditional distribution YX=0.5Y \mid X = 0.5 is normal with:

E[YX=0.5]=0+121(0.50)=0.25E[Y \mid X = 0.5] = 0 + \frac{1}{2} \cdot 1 \cdot (0.5 - 0) = 0.25

Var(YX=0.5)=1(11/4)=3/4,σ=3/2\mathrm{Var}(Y \mid X = 0.5) = 1 \cdot (1 - 1/4) = 3/4, \quad \sigma = \sqrt{3}/2

P(Y>1X=0.5)=P(Z>10.253/2)=P(Z>0.75×23)=P(Z>0.866)0.193P(Y \gt 1 \mid X = 0.5) = P\left(Z \gt \frac{1 - 0.25}{\sqrt{3}/2}\right) = P\left(Z \gt \frac{0.75 \times 2}{\sqrt{3}}\right) = P(Z \gt 0.866) \approx 0.193
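The conditional-normal calculation above can be sketched in code using Key Property 5. The function name `conditional_tail` is a hypothetical helper, and the standard normal CDF is built from `math.erf`.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """Phi(z) for the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def conditional_tail(y, x, mu_x, mu_y, sigma_x, sigma_y, rho):
    """P(Y > y | X = x) for a bivariate normal (X, Y)."""
    cond_mean = mu_y + rho * sigma_y / sigma_x * (x - mu_x)
    cond_sd = sigma_y * sqrt(1.0 - rho**2)
    return 1.0 - std_normal_cdf((y - cond_mean) / cond_sd)

prob = conditional_tail(y=1.0, x=0.5, mu_x=0.0, mu_y=0.0,
                        sigma_x=1.0, sigma_y=1.0, rho=0.5)
print(f"{prob:.3f}")   # close to 0.193
```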

5.8 Worked Examples

Problem 5.2. Let $f_{X,Y}(x,y) = 8xy$ for $0 \leq x \leq 1, 0 \leq y \leq x$. Find the marginal distributions of $X$ and $Y$, and determine whether $X$ and $Y$ are independent.

Solution

fX(x)=0x8xydy=8xx22=4x3,0x1f_X(x) = \int_0^x 8xy\, dy = 8x \cdot \frac{x^2}{2} = 4x^3, \quad 0 \leq x \leq 1

fY(y)=y18xydx=8y1y22=4y(1y2),0y1f_Y(y) = \int_y^1 8xy\, dx = 8y \cdot \frac{1 - y^2}{2} = 4y(1 - y^2), \quad 0 \leq y \leq 1

Check: fX(x)fY(y)=4x34y(1y2)=16x3y(1y2)8xy=fX,Y(x,y)f_X(x) f_Y(y) = 4x^3 \cdot 4y(1 - y^2) = 16x^3 y(1 - y^2) \neq 8xy = f_{X,Y}(x,y).

Therefore XX and YY are not independent. \blacksquare

Problem 5.3. Let XN(0,1)X \sim N(0, 1) and Y=X2Y = X^2. Compute Cov(X,Y)\mathrm{Cov}(X, Y) and ρX,Y\rho_{X,Y}.

Solution

$E[X] = 0$, $E[Y] = E[X^2] = 1$, and $E[XY] = E[X^3] = 0$ (since $x^3$ is an odd function and the distribution of $X$ is symmetric about $0$).

Cov(X,Y)=E[XY]E[X]E[Y]=00=0\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y] = 0 - 0 = 0

So ρX,Y=0\rho_{X,Y} = 0. However, XX and YY are not independent (knowing XX determines YY exactly). This demonstrates that zero correlation does not imply independence. \blacksquare
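The same phenomenon can be verified exactly in a discrete setting. The sketch below swaps in a simpler stand-in for the normal example: $X$ uniform on $\{-1, 0, 1\}$ with $Y = X^2$. This distribution is an illustrative assumption, chosen because every quantity is an exact fraction.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1} and Y = X^2 (discrete stand-in for N(0, 1)).
support = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(p * x for x in support)            # E[X]  = 0
e_y = sum(p * x**2 for x in support)         # E[Y]  = 2/3
e_xy = sum(p * x * x**2 for x in support)    # E[XY] = E[X^3] = 0
cov = e_xy - e_x * e_y

# Independence would require P(X = 0, Y = 0) = P(X = 0) * P(Y = 0).
p_joint = p          # X = 0 forces Y = 0, so the joint probability is 1/3
p_product = p * p    # P(X = 0) * P(Y = 0) = 1/3 * 1/3 = 1/9

print(cov, p_joint, p_product)   # 0 1/3 1/9
```

The covariance is exactly zero, yet the joint probability differs from the product of the marginals, so $X$ and $Y$ are dependent.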

6. Limit Theorems

6.1 Convergence in Probability and Distribution

Definition. XnpXX_n \xrightarrow{p} X (convergence in probability) if for every ε>0\varepsilon \gt 0:

limnP(XnX>ε)=0\lim_{n \to \infty} P(|X_n - X| \gt \varepsilon) = 0

Definition. $X_n \xrightarrow{d} X$ (convergence in distribution) if $\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$ at all continuity points of $F_X$.

Remark. Convergence in probability implies convergence in distribution. The converse does not hold in general, but it does hold when the limit is a constant.

Proposition 6.1. If $X_n \xrightarrow{p} c$ (a constant), then $X_n \xrightarrow{d} c$.

Proof. Let $F_c$ be the CDF of the constant $c$, so $F_c(x) = 0$ for $x \lt c$ and $F_c(x) = 1$ for $x \geq c$. For $x \gt c$: $F_{X_n}(x) = 1 - P(X_n \gt x) \geq 1 - P(|X_n - c| \gt x - c) \to 1 = F_c(x)$. For $x \lt c$: $F_{X_n}(x) = P(X_n \leq x) \leq P(|X_n - c| \geq c - x) \to 0 = F_c(x)$. Since $c$ is the only discontinuity point of $F_c$, convergence holds at every continuity point. $\blacksquare$

Definition. $X_n \xrightarrow{a.s.} X$ (almost sure convergence) if $P(\lim_{n \to \infty} X_n = X) = 1$.

Remark. The hierarchy of convergence is: almost sure $\implies$ in probability $\implies$ in distribution. None of the reverse implications holds in general.

6.2 Law of Large Numbers

Theorem 6.1 (Weak Law of Large Numbers). Let $X_1, X_2, \ldots$ be i.i.d. with $E[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 \lt \infty$. Then for every $\varepsilon \gt 0$:

limnP(1ni=1nXiμ>ε)=0\lim_{n \to \infty} P\left(\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right| \gt \varepsilon\right) = 0

Proof. Let Xˉn=1ni=1nXi\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i. Then E[Xˉn]=μE[\bar{X}_n] = \mu and Var(Xˉn)=σ2/n\mathrm{Var}(\bar{X}_n) = \sigma^2/n. By Chebyshev’s inequality:

P(Xˉnμ>ε)Var(Xˉn)ε2=σ2nε20P(|\bar{X}_n - \mu| \gt \varepsilon) \leq \frac{\mathrm{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0

\blacksquare

Theorem 6.2 (Strong Law of Large Numbers). Under the same hypotheses:

P(limn1ni=1nXi=μ)=1P\left(\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^n X_i = \mu\right) = 1

6.3 Central Limit Theorem

Theorem 6.3 (Central Limit Theorem). Let $X_1, X_2, \ldots$ be i.i.d. with $E[X_i] = \mu$ and $0 \lt \mathrm{Var}(X_i) = \sigma^2 \lt \infty$. Then

Xˉnμσ/ndN(0,1)\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1)

That is, for all a<ba \lt b:

limnP(a<Xˉnμσ/n<b)=Φ(b)Φ(a)\lim_{n \to \infty} P\left(a \lt \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \lt b\right) = \Phi(b) - \Phi(a)

Proof (sketch using MGFs). Assume the MGF of $X_i$ exists in a neighbourhood of $0$. Let $Y_i = (X_i - \mu)/\sigma$, so $E[Y_i] = 0$ and $\mathrm{Var}(Y_i) = 1$. Define $Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n Y_i$. We show $M_{Z_n}(t) \to e^{t^2/2}$ (the standard normal MGF).

Let M(t)=E[etY1]M(t) = E[e^{tY_1}] be the MGF of Y1Y_1. Then:

MZn(t)=[M(tn)]nM_{Z_n}(t) = \left[M\left(\frac{t}{\sqrt{n}}\right)\right]^n

By Taylor expansion of MM around 0: M(h)=1+hM(0)+h22M(0)+o(h2)=1+0+h22+o(h2)M(h) = 1 + h\, M'(0) + \frac{h^2}{2} M''(0) + o(h^2) = 1 + 0 + \frac{h^2}{2} + o(h^2) (since E[Y1]=0E[Y_1] = 0 and E[Y12]=1E[Y_1^2] = 1).

Therefore:

MZn(t)=[1+t22n+o(1n)]net2/2M_{Z_n}(t) = \left[1 + \frac{t^2}{2n} + o\left(\frac{1}{n}\right)\right]^n \to e^{t^2/2}

as $n \to \infty$. By the continuity theorem for MGFs, $Z_n \xrightarrow{d} N(0, 1)$. $\blacksquare$

6.4 Slutsky’s Theorem

Theorem 6.4 (Slutsky). If XndXX_n \xrightarrow{d} X and YnpcY_n \xrightarrow{p} c (a constant), then:

  1. Xn+YndX+cX_n + Y_n \xrightarrow{d} X + c.
  2. YnXndcXY_n X_n \xrightarrow{d} cX.
  3. Xn/YndX/cX_n / Y_n \xrightarrow{d} X / c (provided c0c \neq 0).

Intuition. Slutsky’s theorem allows us to replace a sequence that converges in probability by its limit inside expressions that converge in distribution. This is essential for deriving the asymptotic distribution of $t$-statistics, for instance.

Corollary 6.1 (Asymptotic distribution of the t-statistic). If $X_1, \ldots, X_n$ are i.i.d. with mean $\mu$, variance $\sigma^2 \gt 0$, and finite fourth moment, then

XˉnμSn/ndN(0,1)\frac{\bar{X}_n - \mu}{S_n / \sqrt{n}} \xrightarrow{d} N(0, 1)

where $S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2$.

Proof. By the CLT, $\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} N(0, 1)$. By the WLLN, $S_n^2 \xrightarrow{p} \sigma^2$, so by the continuous mapping theorem $S_n \xrightarrow{p} \sigma$ and $\sigma / S_n \xrightarrow{p} 1$. Applying Slutsky’s theorem:

n(Xˉnμ)σσSndN(0,1)1=N(0,1)\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \cdot \frac{\sigma}{S_n} \xrightarrow{d} N(0, 1) \cdot 1 = N(0, 1) \quad \blacksquare

6.5 Delta Method

Theorem 6.5 (Delta Method). If $\sqrt{n}(T_n - \theta) \xrightarrow{d} N(0, \sigma^2)$ and $g$ is differentiable at $\theta$ with $g'(\theta) \neq 0$, then

n(g(Tn)g(θ))dN(0,σ2[g(θ)]2)\sqrt{n}(g(T_n) - g(\theta)) \xrightarrow{d} N(0, \sigma^2 [g'(\theta)]^2)

Proof (sketch). By a first-order Taylor expansion: g(Tn)g(θ)+g(θ)(Tnθ)g(T_n) \approx g(\theta) + g'(\theta)(T_n - \theta). Multiplying by n\sqrt{n}: n(g(Tn)g(θ))g(θ)n(Tnθ)\sqrt{n}(g(T_n) - g(\theta)) \approx g'(\theta) \cdot \sqrt{n}(T_n - \theta). The right side converges in distribution to g(θ)N(0,σ2)=N(0,σ2[g(θ)]2)g'(\theta) \cdot N(0, \sigma^2) = N(0, \sigma^2[g'(\theta)]^2). Slutsky’s theorem makes this rigorous. \blacksquare

Problem 6.4. Let $X_1, \ldots, X_n$ be i.i.d. $\mathrm{Poisson}(\lambda)$. Find the asymptotic distribution of $g(\bar{X}_n) = \bar{X}_n - e^{-\bar{X}_n}$, suitably centred and scaled, using the delta method.

Solution

By the CLT, n(Xˉnλ)dN(0,λ)\sqrt{n}(\bar{X}_n - \lambda) \xrightarrow{d} N(0, \lambda) (since Var(Xi)=λ\mathrm{Var}(X_i) = \lambda).

Let $g(t) = t - e^{-t}$. Then $g'(t) = 1 + e^{-t}$, so $g'(\lambda) = 1 + e^{-\lambda}$.

By the delta method:

n(g(Xˉn)g(λ))dN(0,λ(1+eλ)2)\sqrt{n}(g(\bar{X}_n) - g(\lambda)) \xrightarrow{d} N\left(0, \lambda(1 + e^{-\lambda})^2\right)

6.6 Worked Examples

Problem 6.1. A factory produces light bulbs with mean lifetime 1000 hours and standard deviation 200 hours. What is the probability that the mean lifetime of 100 bulbs exceeds 1040 hours?

Solution

By the CLT, Xˉ100N(1000,2002/100)=N(1000,400)\bar{X}_{100} \approx N(1000, 200^2/100) = N(1000, 400).

P(Xˉ>1040)=P(Z>1040100020)=P(Z>2)0.0228P(\bar{X} \gt 1040) = P\left(Z \gt \frac{1040 - 1000}{20}\right) = P(Z \gt 2) \approx 0.0228

\blacksquare

Problem 6.2. A coin is flipped 10,000 times. Approximate the probability that the number of heads is between 4,900 and 5,100.

Solution

Let $X \sim \mathrm{Bin}(10000, 0.5)$, so $E[X] = 5000$ and $\mathrm{Var}(X) = 2500$ (standard deviation $50$). By the normal approximation with continuity correction:

P(4900X5100)P(4899.5500050Z5100.5500050)P(4900 \leq X \leq 5100) \approx P\left(\frac{4899.5 - 5000}{50} \leq Z \leq \frac{5100.5 - 5000}{50}\right)

=P(2.01Z2.01)Φ(2.01)Φ(2.01)2×0.97781=0.9556= P(-2.01 \leq Z \leq 2.01) \approx \Phi(2.01) - \Phi(-2.01) \approx 2 \times 0.9778 - 1 = 0.9556
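To see how good the continuity-corrected approximation is, the sketch below compares it with the exact binomial probability, computed with arbitrary-precision integers via `math.comb`. It assumes a fair coin, as in the solution above.

```python
from math import comb, erf, sqrt

def std_normal_cdf(z):
    """Phi(z) for the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, lo, hi = 10_000, 4_900, 5_100
mean, sd = n * 0.5, sqrt(n * 0.25)   # 5000 and 50

# Normal approximation with continuity correction.
approx = (std_normal_cdf((hi + 0.5 - mean) / sd)
          - std_normal_cdf((lo - 0.5 - mean) / sd))

# Exact value: count the favourable outcomes among the 2^n equally
# likely sequences; Python integers do not overflow.
exact = sum(comb(n, k) for k in range(lo, hi + 1)) / 2**n

print(f"approx = {approx:.4f}, exact = {exact:.4f}")
```

The division at the end is performed on exact integers and only then rounded to a float, so no intermediate overflow or underflow occurs.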

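Problem 6.3. Let $X_1, \ldots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. Let $S^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2$. Show that $S^2 \xrightarrow{p} \sigma^2$.

Solution

Expanding the square gives $\sum (X_i - \bar{X})^2 = \sum X_i^2 - n\bar{X}^2$, so

$$S^2 = \frac{n}{n-1}\left(\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2\right)$$

By the weak law of large numbers, $\bar{X} \xrightarrow{p} \mu$ and $\frac{1}{n}\sum X_i^2 \xrightarrow{p} E[X_i^2] = \sigma^2 + \mu^2$ (Theorem 6.1 as stated requires $X_i^2$ to have finite variance, i.e. a finite fourth moment; Khinchin’s version of the WLLN needs only a finite mean). By the continuous mapping theorem $\bar{X}^2 \xrightarrow{p} \mu^2$, and $n/(n-1) \to 1$, so $S^2 \xrightarrow{p} 1 \cdot (\sigma^2 + \mu^2 - \mu^2) = \sigma^2$. $\blacksquare$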

7. Maximum Likelihood Estimation

7.1 Likelihood Function

Given i.i.d. observations $x_1, \ldots, x_n$ from a distribution with parameter $\theta$, the likelihood function is

L(θ)=i=1nf(xiθ)L(\theta) = \prod_{i=1}^n f(x_i \mid \theta)

and the log-likelihood is

(θ)=logL(θ)=i=1nlogf(xiθ)\ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log f(x_i \mid \theta)

7.2 MLE Procedure

The maximum likelihood estimator (MLE) $\hat{\theta}_{\mathrm{MLE}}$ maximises $L(\theta)$ (equivalently, $\ell(\theta)$):

$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta L(\theta)$$

When $\ell$ is differentiable and the maximum lies in the interior of the parameter space, it is typically found by solving $\ell'(\theta) = 0$ and verifying $\ell''(\hat{\theta}) \lt 0$.

7.3 Properties of MLEs

Theorem 7.1 (Consistency, Sketch). Under regularity conditions, $\hat{\theta}_{\mathrm{MLE}} \xrightarrow{p} \theta_0$ (the true parameter).

Proof sketch. By the law of large numbers, $\frac{1}{n}\ell(\theta) \xrightarrow{p} E_{\theta_0}[\log f(X \mid \theta)]$ for each $\theta$. The Kullback-Leibler divergence $D(\theta_0 \| \theta) = -E_{\theta_0}[\log f(X \mid \theta)] + E_{\theta_0}[\log f(X \mid \theta_0)]$ is minimised (at zero) when $\theta = \theta_0$, by the information inequality, so the limit $E_{\theta_0}[\log f(X \mid \theta)]$ is maximised at $\theta_0$. Therefore the maximiser of $\ell(\theta)$ converges in probability to $\theta_0$.

Theorem 7.2 (Asymptotic Normality). Under regularity conditions:

$$\sqrt{n}(\hat{\theta}_{\mathrm{MLE}} - \theta_0) \xrightarrow{d} N\left(0, \frac{1}{I(\theta_0)}\right)$$

where $I(\theta_0)$ is the Fisher information.

7.4 Fisher Information and the Cramer-Rao Bound

Definition. The Fisher information for a single observation is

I(θ)=E[(θlogf(Xθ))2]=E[2θ2logf(Xθ)]I(\theta) = E\left[\left(\frac{\partial}{\partial \theta} \log f(X \mid \theta)\right)^2\right] = -E\left[\frac{\partial^2}{\partial \theta^2} \log f(X \mid \theta)\right]

provided the interchange of differentiation and integration is valid.

Theorem 7.3 (Cramer-Rao Lower Bound). For any unbiased estimator $T$ of $\theta$ based on $n$ i.i.d. observations:

Var(T)1nI(θ)\mathrm{Var}(T) \geq \frac{1}{n\, I(\theta)}

Intuition. The Cramer-Rao bound gives a theoretical minimum for the variance of any unbiased estimator. An estimator that achieves this bound is called efficient.

Example 7.1. For $X \sim N(\mu, \sigma^2)$ with $\sigma^2$ known, the Fisher information about $\mu$ is $I(\mu) = 1/\sigma^2$. The MLE $\hat{\mu} = \bar{X}$ has $\mathrm{Var}(\bar{X}) = \sigma^2/n = 1/(n \cdot I(\mu))$, so the sample mean achieves the Cramer-Rao bound and is efficient.

7.5 Confidence Intervals

Definition. A 100(1α)%100(1 - \alpha)\% confidence interval for θ\theta is a random interval [L,U][L, U] such that

Pθ(LθU)=1αP_\theta(L \leq \theta \leq U) = 1 - \alpha

Theorem 7.4. Under the asymptotic normality of the MLE, an approximate $100(1-\alpha)\%$ confidence interval for $\theta$ is

θ^±zα/21nI(θ^)\hat{\theta} \pm z_{\alpha/2} \cdot \frac{1}{\sqrt{n\, I(\hat{\theta})}}

where $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$.

Example 7.2. For $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ with $\sigma$ known, the exact $100(1-\alpha)\%$ confidence interval for $\mu$ is:

Xˉ±zα/2σn\bar{X} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}

When σ\sigma is unknown, replace σ\sigma with SS and zα/2z_{\alpha/2} with tn1,α/2t_{n-1, \alpha/2}:

Xˉ±tn1,α/2Sn\bar{X} \pm t_{n-1, \alpha/2} \cdot \frac{S}{\sqrt{n}}

Theorem 7.5 (Wald Confidence Interval). For a scalar parameter $\theta$ with MLE $\hat{\theta}$ satisfying asymptotic normality, the Wald confidence interval is

θ^±zα/2SE(θ^)^\hat{\theta} \pm z_{\alpha/2}\, \widehat{\mathrm{SE}(\hat{\theta})}

where $\widehat{\mathrm{SE}(\hat{\theta})} = 1/\sqrt{n\, I(\hat{\theta})}$ is the estimated standard error.

Problem 7.4. In a survey of 400 people, 220 support a policy. Construct a 95% confidence interval for the true proportion $p$.

Solution

$\hat{p} = 220/400 = 0.55$. For a Bernoulli observation, $I(p) = 1/[p(1-p)]$, so $\widehat{\mathrm{SE}} = \sqrt{\hat{p}(1 - \hat{p})/n} = \sqrt{0.55 \times 0.45 / 400} = \sqrt{0.000619} \approx 0.02488$.

95%CI:0.55±1.96×0.02488=0.55±0.0488=(0.501,0.599)95\%\, \mathrm{CI}: 0.55 \pm 1.96 \times 0.02488 = 0.55 \pm 0.0488 = (0.501, 0.599)
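The Wald interval for a proportion is easy to compute directly. In this sketch the function name `wald_interval` is a hypothetical helper and the default $z = 1.96$ corresponds to a 95% level.

```python
from math import sqrt

def wald_interval(successes, n, z=1.96):
    """Approximate Wald confidence interval for a Bernoulli proportion."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error
    return p_hat - z * se, p_hat + z * se

lower, upper = wald_interval(220, 400)
print(f"({lower:.3f}, {upper:.3f})")   # close to (0.501, 0.599)
```

The Wald interval relies on the normal approximation, so it can perform poorly when $n$ is small or $\hat{p}$ is close to $0$ or $1$.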

7.6 Worked Examples

Problem 7.1. Find the MLE for $\lambda$ given i.i.d. observations $x_1, \ldots, x_n$ from $\mathrm{Exp}(\lambda)$.

Solution

L(λ)=i=1nλeλxi=λneλxiL(\lambda) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum x_i}

(λ)=nlogλλi=1nxi\ell(\lambda) = n \log \lambda - \lambda \sum_{i=1}^n x_i

ddλ=nλi=1nxi=0    λ^=nxi=1xˉ\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^n x_i = 0 \implies \hat{\lambda} = \frac{n}{\sum x_i} = \frac{1}{\bar{x}}

Verify: $\frac{d^2\ell}{d\lambda^2} = -\frac{n}{\lambda^2} \lt 0$, confirming this is a maximum. $\blacksquare$

Problem 7.2. Find the MLE for $p$ given i.i.d. observations from $\mathrm{Bin}(n, p)$ with $n$ known (observed counts $x_1, \ldots, x_m$).

Solution

L(p)=j=1m(nxj)pxj(1p)nxjL(p) = \prod_{j=1}^m \binom{n}{x_j} p^{x_j} (1-p)^{n - x_j}

(p)=j=1m[log(nxj)+xjlogp+(nxj)log(1p)]\ell(p) = \sum_{j=1}^m \left[\log \binom{n}{x_j} + x_j \log p + (n - x_j) \log(1 - p)\right]

ddp=j=1m[xjpnxj1p]=xjpmnxj1p=0\frac{d\ell}{dp} = \sum_{j=1}^m \left[\frac{x_j}{p} - \frac{n - x_j}{1 - p}\right] = \frac{\sum x_j}{p} - \frac{mn - \sum x_j}{1 - p} = 0

Solving: $(1 - p)\sum x_j = p(mn - \sum x_j)$, so $\sum x_j = pmn$, hence $\hat{p} = \frac{\sum x_j}{mn} = \frac{\bar{x}}{n}$.

Verify: d2dp2=xjp2mnxj(1p)2<0\frac{d^2\ell}{dp^2} = -\frac{\sum x_j}{p^2} - \frac{mn - \sum x_j}{(1-p)^2} \lt 0. \blacksquare

Problem 7.3. Compute the Fisher information for $\lambda$ in the exponential model and construct an approximate 95% confidence interval.

Solution

For $X \sim \mathrm{Exp}(\lambda)$: $f(x \mid \lambda) = \lambda e^{-\lambda x}$, so $\log f = \log \lambda - \lambda x$.

λlogf=1λx\frac{\partial}{\partial \lambda} \log f = \frac{1}{\lambda} - x

I(λ)=E[2λ2logf]=E[1λ2]=1λ2I(\lambda) = -E\left[\frac{\partial^2}{\partial \lambda^2} \log f\right] = -E\left[-\frac{1}{\lambda^2}\right] = \frac{1}{\lambda^2}

The MLE λ^=1/Xˉ\hat{\lambda} = 1/\bar{X} is approximately N(λ,1/(nI(λ)))=N(λ,λ2/n)N(\lambda, 1/(n \cdot I(\lambda))) = N(\lambda, \lambda^2/n).

A 95% confidence interval is:

λ^±1.96λ^n\hat{\lambda} \pm 1.96 \cdot \frac{\hat{\lambda}}{\sqrt{n}}
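Putting the MLE and the interval together, a minimal sketch might look as follows. The function name `exp_mle_with_ci` and the sample `waiting_times` are hypothetical, and with only five observations the asymptotic interval is shown for illustration rather than as a reliable inference.

```python
from math import sqrt

def exp_mle_with_ci(xs, z=1.96):
    """MLE of the rate of an Exp(lambda) sample, with a Wald interval.

    Uses lambda_hat = n / sum(xs) and the asymptotic standard error
    lambda_hat / sqrt(n), which follows from I(lambda) = 1 / lambda^2.
    """
    n = len(xs)
    lam_hat = n / sum(xs)
    half_width = z * lam_hat / sqrt(n)
    return lam_hat, (lam_hat - half_width, lam_hat + half_width)

# Hypothetical data, chosen so that the sample mean is 1.
waiting_times = [0.5, 1.2, 0.3, 2.0, 1.0]
lam_hat, (ci_low, ci_high) = exp_mle_with_ci(waiting_times)
print(lam_hat, ci_low, ci_high)
```

Because the interval is symmetric about $\hat{\lambda}$, its lower end can fall close to or below zero for small $n$, even though $\lambda \gt 0$.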

:::caution Common Pitfall The MLE is not always unbiased. For example, the MLE $\hat{\sigma}^2 = \frac{1}{n}\sum (X_i - \bar{X})^2$ for the normal variance is biased; the unbiased estimator uses $n - 1$ in the denominator. :::

8. Hypothesis Testing

8.1 Framework

A hypothesis test evaluates two competing statements:

  • Null hypothesis H0H_0: the status quo (e.g., μ=μ0\mu = \mu_0).
  • Alternative hypothesis H1H_1: what we want to show (e.g., μ>μ0\mu \gt \mu_0).

8.2 Test Statistics and Decisions

A test statistic $T$ is a function of the data. We reject $H_0$ if $T$ falls in the rejection region (critical region).

Type I error: rejecting $H_0$ when it is true (false positive). Probability = $\alpha$ (significance level).

Type II error: failing to reject $H_0$ when it is false (false negative). Probability = $\beta$.

The power of a test is $1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ is true})$.

8.3 Neyman-Pearson Lemma

Theorem 8.1 (Neyman-Pearson Lemma). Consider testing $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ based on a single observation $X$ with PDF $f(x \mid \theta)$. The most powerful test of level $\alpha$ rejects $H_0$ when the likelihood ratio exceeds a threshold:

$$\Lambda(x) = \frac{f(x \mid \theta_1)}{f(x \mid \theta_0)} \gt k$$

for some $k$ chosen so that $P(\Lambda(X) \gt k \mid H_0) = \alpha$.

Proof (sketch). Consider any test $\phi$ with level $\alpha$ (i.e., $E_{\theta_0}[\phi(X)] \leq \alpha$). The power under $H_1$ is $E_{\theta_1}[\phi(X)] = \int \phi(x) f(x \mid \theta_1)\, dx$. Write this as $\int \phi(x) \Lambda(x) f(x \mid \theta_0)\, dx$. The likelihood ratio test $\phi^*$ rejects when $\Lambda \gt k$ and randomises on the boundary, so it assigns the largest $\phi^*(x)$ values to the largest $\Lambda(x)$ values. Any other level-$\alpha$ test assigns less rejection probability to large-$\Lambda$ regions and cannot exceed the power of $\phi^*$. $\blacksquare$

8.4 Likelihood Ratio Tests (General)

For composite hypotheses $H_0: \theta \in \Theta_0$ vs $H_1: \theta \in \Theta_1$, where $\Theta = \Theta_0 \cup \Theta_1$, the generalised likelihood ratio statistic is

Λ=supθΘ0L(θ)supθΘL(θ)\Lambda = \frac{\sup_{\theta \in \Theta_0} L(\theta)}{\sup_{\theta \in \Theta} L(\theta)}

We reject H0H_0 when Λ\Lambda is small (equivalently, when 2logΛ-2\log \Lambda is large).

Theorem 8.2 (Wilks’ Theorem). Under H0H_0 and regularity conditions:

2logΛdχd2-2 \log \Lambda \xrightarrow{d} \chi^2_d

where $d = \dim(\Theta) - \dim(\Theta_0)$ is the difference in the number of free parameters.

8.5 p-Values

The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming $H_0$ is true. We reject $H_0$ if the p-value is less than $\alpha$.

8.6 Z-Test for a Mean

If $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ with known $\sigma$, to test $H_0: \mu = \mu_0$ use:

Z=Xˉμ0σ/nZ = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}

Under H0H_0, ZN(0,1)Z \sim N(0, 1).

  • For H1:μ>μ0H_1: \mu \gt \mu_0: reject if Z>zαZ \gt z_\alpha.
  • For H1:μ<μ0H_1: \mu \lt \mu_0: reject if Z<zαZ \lt -z_\alpha.
  • For H1:μμ0H_1: \mu \neq \mu_0: reject if Z>zα/2|Z| \gt z_{\alpha/2}.
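The three rejection rules can be expressed through p-values in a small helper. This is a sketch under the stated normal model; the function name `z_test` and its `alternative` argument are illustrative choices, and the example reuses the light-bulb numbers from Problem 6.1 as if 1040 were an observed sample mean.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """Phi(z) for the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def z_test(sample_mean, mu0, sigma, n, alternative="two-sided"):
    """Return the Z statistic and p-value for H0: mu = mu0 (sigma known)."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    if alternative == "greater":        # H1: mu > mu0
        p_value = 1.0 - std_normal_cdf(z)
    elif alternative == "less":         # H1: mu < mu0
        p_value = std_normal_cdf(z)
    elif alternative == "two-sided":    # H1: mu != mu0
        p_value = 2.0 * (1.0 - std_normal_cdf(abs(z)))
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    return z, p_value

z, p_value = z_test(sample_mean=1040.0, mu0=1000.0, sigma=200.0, n=100,
                    alternative="greater")
print(f"Z = {z:.2f}, p-value = {p_value:.4f}")
```

Rejecting when the p-value is below $\alpha$ is equivalent to the critical-value rules listed above.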

8.7 t-Test for a Mean (Unknown Variance)

If σ\sigma is unknown, replace σ\sigma with the sample standard deviation SS:

T=Xˉμ0S/nT = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}

Under H0H_0, Ttn1T \sim t_{n-1} (Student’s t-distribution with n1n - 1 degrees of freedom).

8.8 Chi-Squared Test for Variance

To test H0:σ2=σ02H_0: \sigma^2 = \sigma_0^2 for X1,,XnN(μ,σ2)X_1, \ldots, X_n \sim N(\mu, \sigma^2):

χ2=(n1)S2σ02\chi^2 = \frac{(n-1)S^2}{\sigma_0^2}

Under H0H_0, χ2χn12\chi^2 \sim \chi^2_{n-1}.

8.9 Chi-Squared Goodness-of-Fit Test

To test whether observed data follow a specified discrete distribution, partition the sample space into $k$ cells with expected counts $e_i$ and observed counts $o_i$. The test statistic is

$$\chi^2 = \sum_{i=1}^k \frac{(o_i - e_i)^2}{e_i}$$

Under $H_0$ (and provided expected counts are sufficiently large), $\chi^2$ is approximately $\chi^2_{k - 1 - p}$-distributed, where $p$ is the number of parameters estimated from the data.

Problem 8.4. A die is rolled 60 times with the following frequencies: $\{1: 8, 2: 12, 3: 9, 4: 11, 5: 13, 6: 7\}$. Test whether the die is fair at $\alpha = 0.05$.

Solution

H0H_0: die is fair (each face has probability 1/61/6). Expected count per face: ei=60/6=10e_i = 60/6 = 10.

χ2=(810)210+(1210)210+(910)210+(1110)210+(1310)210+(710)210\chi^2 = \frac{(8 - 10)^2}{10} + \frac{(12 - 10)^2}{10} + \frac{(9 - 10)^2}{10} + \frac{(11 - 10)^2}{10} + \frac{(13 - 10)^2}{10} + \frac{(7 - 10)^2}{10}

=4+4+1+1+9+910=2810=2.8= \frac{4 + 4 + 1 + 1 + 9 + 9}{10} = \frac{28}{10} = 2.8

Under $H_0$, $\chi^2 \sim \chi^2_5$. The critical value at $\alpha = 0.05$ is $\chi^2_{5, 0.05} = 11.07$.

Since $2.8 \lt 11.07$, we fail to reject $H_0$. There is insufficient evidence that the die is unfair. $\blacksquare$
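The arithmetic in Problem 8.4 can be reproduced with a short Python sketch. The variable names are illustrative, and the critical value 11.07 is taken from the solution above rather than computed.

```python
# Observed face counts from Problem 8.4 (faces 1..6).
observed = [8, 12, 9, 11, 13, 7]
n_rolls = sum(observed)

# Under H0 (fair die) every face has expected count n/6.
expected = [n_rolls / 6] * 6

# Pearson chi-squared statistic: sum of (o - e)^2 / e over cells.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1  # k - 1 - p with no estimated parameters (p = 0)

critical_value = 11.07  # chi^2_{5, 0.05}, quoted from the solution
reject = chi2 > critical_value
print(round(chi2, 2), df, reject)
```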

8.10 Two-Sample Tests

Two-sample Z-test. To test H0:μ1μ2=0H_0: \mu_1 - \mu_2 = 0 for independent samples with known variances σ12,σ22\sigma_1^2, \sigma_2^2:

Z=Xˉ1Xˉ2σ12/n1+σ22/n2Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}

Under H0H_0, ZN(0,1)Z \sim N(0, 1).

Two-sample t-test (Welch’s). When variances are unknown and possibly unequal:

T=Xˉ1Xˉ2S12/n1+S22/n2T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}

The degrees of freedom are approximated by Welch’s formula:

ν=(S12/n1+S22/n2)2(S12/n1)2n11+(S22/n2)2n21\nu = \frac{(S_1^2/n_1 + S_2^2/n_2)^2}{\frac{(S_1^2/n_1)^2}{n_1 - 1} + \frac{(S_2^2/n_2)^2}{n_2 - 1}}

Problem 8.5. A study compares two teaching methods. Method A (20 students): mean score 78, standard deviation 8. Method B (25 students): mean score 72, standard deviation 10. Test $H_0: \mu_A = \mu_B$ vs $H_1: \mu_A \neq \mu_B$ at $\alpha = 0.05$.

Solution

Using Welch’s t-test:

T=787264/20+100/25=63.2+4=67.2=62.6832.236T = \frac{78 - 72}{\sqrt{64/20 + 100/25}} = \frac{6}{\sqrt{3.2 + 4}} = \frac{6}{\sqrt{7.2}} = \frac{6}{2.683} \approx 2.236

ν=(3.2+4)23.22/19+42/24=51.840.539+0.667=51.841.20642.98\nu = \frac{(3.2 + 4)^2}{3.2^2/19 + 4^2/24} = \frac{51.84}{0.539 + 0.667} = \frac{51.84}{1.206} \approx 42.98

Use $\nu \approx 43$ degrees of freedom. The critical values for a two-sided test at $\alpha = 0.05$ are approximately $\pm 2.017$.

Since $|T| = 2.236 \gt 2.017$, we reject $H_0$ at the 5% significance level. There is evidence that the two teaching methods produce different mean scores. $\blacksquare$
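The Welch statistic and degrees of freedom from Problem 8.5 can be checked with the sketch below. The helper name `welch_t` and its argument order are assumptions made for illustration.

```python
from math import sqrt


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return Welch's T statistic and approximate degrees of freedom."""
    v1 = sd1 ** 2 / n1  # S_1^2 / n_1
    v2 = sd2 ** 2 / n2  # S_2^2 / n_2
    t_stat = (mean1 - mean2) / sqrt(v1 + v2)
    nu = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, nu


# Summary statistics from Problem 8.5 (Method A, then Method B).
t_stat, nu = welch_t(78, 8, 20, 72, 10, 25)
print(round(t_stat, 3), round(nu, 1))
```

Carrying full precision gives $\nu$ very close to 43, in line with the rounded hand calculation.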

8.11 Worked Examples

Problem 8.1. A process produces bolts with mean diameter 10mm. A sample of 25 bolts has mean 10.12mm and standard deviation 0.3mm. Test $H_0: \mu = 10$ vs $H_1: \mu \neq 10$ at $\alpha = 0.05$.

Solution

Use the t-test: T=10.12100.3/25=0.120.06=2T = \frac{10.12 - 10}{0.3/\sqrt{25}} = \frac{0.12}{0.06} = 2.

Under $H_0$, $T \sim t_{24}$. The critical values are $\pm t_{24, 0.025} \approx \pm 2.064$.

Since $|T| = 2 \lt 2.064$, we fail to reject $H_0$ at the 5% significance level. There is insufficient evidence to conclude the mean diameter differs from 10mm. $\blacksquare$

Problem 8.2. A pharmaceutical company claims a drug reduces blood pressure by 5mmHg on average. In a trial of 50 patients, the observed mean reduction was 4.2mmHg with standard deviation 3.1mmHg. Test the claim at $\alpha = 0.05$.

Solution

H0:μ=5H_0: \mu = 5 vs H1:μ5H_1: \mu \neq 5.

T=4.253.1/50=0.80.43841.825T = \frac{4.2 - 5}{3.1 / \sqrt{50}} = \frac{-0.8}{0.4384} \approx -1.825

Under H0H_0, Tt49T \sim t_{49}. The critical values for a two-sided test at α=0.05\alpha = 0.05 are approximately ±2.010\pm 2.010.

Since $|T| = 1.825 \lt 2.010$, we fail to reject $H_0$. There is insufficient evidence to refute the company’s claim. $\blacksquare$

Problem 8.3. Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ with $\sigma^2$ known. Derive the likelihood ratio test for $H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$.

Solution

Under H0H_0: supμ=μ0L(μ,σ2)=L(μ0,σ2)\sup_{\mu = \mu_0} L(\mu, \sigma^2) = L(\mu_0, \sigma^2).

Under H1H0H_1 \cup H_0: supμL(μ,σ2)=L(xˉ,σ2)\sup_\mu L(\mu, \sigma^2) = L(\bar{x}, \sigma^2).

Λ=L(μ0,σ2)L(xˉ,σ2)=exp(12σ2[(xiμ0)2(xixˉ)2])\Lambda = \frac{L(\mu_0, \sigma^2)}{L(\bar{x}, \sigma^2)} = \exp\left(-\frac{1}{2\sigma^2}\left[\sum(x_i - \mu_0)^2 - \sum(x_i - \bar{x})^2\right]\right)

Now $\sum(x_i - \mu_0)^2 = \sum(x_i - \bar{x})^2 + n(\bar{x} - \mu_0)^2$, so:

Λ=exp(n(xˉμ0)22σ2)\Lambda = \exp\left(-\frac{n(\bar{x} - \mu_0)^2}{2\sigma^2}\right)

We reject $H_0$ when $\Lambda$ is small, i.e., when $\frac{n(\bar{x} - \mu_0)^2}{\sigma^2}$ is large, which is equivalent to $\left|\frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}\right| \gt c$ for a suitable constant $c$. This recovers the Z-test. $\blacksquare$

:::caution Common Pitfall
“Failing to reject $H_0$” is not the same as “accepting $H_0$”. The test only provides evidence against $H_0$; absence of evidence is not evidence of absence. The distinction is critical in scientific reasoning.
:::

9. Problem Set

Problem 1. Let $A, B, C$ be events with $P(A) = 0.4$, $P(B) = 0.5$, $P(C) = 0.3$, $P(A \cap B) = 0.2$, $P(A \cap C) = 0.1$, $P(B \cap C) = 0.15$, and $P(A \cap B \cap C) = 0.05$. Compute $P(A \cup B \cup C)$.

Solution

By the general inclusion-exclusion principle:

P(ABC)=P(A)+P(B)+P(C)P(AB)P(AC)P(BC)+P(ABC)P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)

=0.4+0.5+0.30.20.10.15+0.05=0.80= 0.4 + 0.5 + 0.3 - 0.2 - 0.1 - 0.15 + 0.05 = 0.80

If you get this wrong, revise: Section 1.5 (Inclusion-Exclusion).

Problem 2. Cards are drawn without replacement from a standard 52-card deck. What is the probability that the third ace appears on the 10th draw?

Solution

We need exactly 2 aces in the first 9 draws and the 10th card is an ace.

$$P = \frac{\binom{4}{2}\binom{48}{7}}{\binom{52}{9}} \times \frac{2}{43} = \frac{6 \times 73629072}{3679075400} \times \frac{2}{43} \approx 0.00559$$

If you get this wrong, revise: Section 1.6 (Conditional Probability).
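Binomial coefficients of this size are easy to mistype, so a quick numerical cross-check is worthwhile. The sketch below evaluates the expression for Problem 2 in two ways: the hypergeometric route used in the solution, and an alternative count (an assumption of this sketch, not taken from the notes) that treats the four ace positions as a uniformly random 4-subset of the 52 deck positions.

```python
from math import comb

# Route 1: exactly 2 aces among the first 9 cards, then an ace on draw 10
# (2 aces remain among the 43 undrawn cards).
p_hypergeometric = comb(4, 2) * comb(48, 7) / comb(52, 9) * 2 / 43

# Route 2: choose the 4 ace positions out of 52. The third ace sits at
# position 10 when two aces occupy positions 1-9, one occupies position 10,
# and the last occupies one of the remaining 42 positions.
p_positions = comb(9, 2) * 42 / comb(52, 4)

print(p_hypergeometric, p_positions)
```

Both routes should agree up to floating-point rounding.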

Problem 3. Prove that for events AA and BB: P(AB)P(A)+P(B)1P(A \cap B) \geq P(A) + P(B) - 1.

Solution

From inclusion-exclusion: P(AB)=P(A)+P(B)P(AB)1P(A \cup B) = P(A) + P(B) - P(A \cap B) \leq 1.

Therefore $P(A) + P(B) - P(A \cap B) \leq 1$, which gives $P(A \cap B) \geq P(A) + P(B) - 1$. $\blacksquare$

This is known as the Bonferroni inequality.

If you get this wrong, revise: Section 1.4 (Basic Properties).

Problem 4. Let $X$ be a continuous random variable with PDF $f(x) = cx^2$ for $0 \leq x \leq 1$ and $f(x) = 0$ otherwise. Find $c$, the CDF, $E[X]$, and $\mathrm{Var}(X)$.

Solution

Normalisation: $\int_0^1 cx^2\, dx = c/3 = 1$, so $c = 3$.

CDF: F(x)=0x3t2dt=x3F(x) = \int_0^x 3t^2\, dt = x^3 for 0x10 \leq x \leq 1.

E[X]=01x3x2dx=301x3dx=34E[X] = \int_0^1 x \cdot 3x^2\, dx = 3\int_0^1 x^3\, dx = \frac{3}{4}

E[X2]=01x23x2dx=301x4dx=35E[X^2] = \int_0^1 x^2 \cdot 3x^2\, dx = 3\int_0^1 x^4\, dx = \frac{3}{5}

Var(X)=35(34)2=35916=484580=380\mathrm{Var}(X) = \frac{3}{5} - \left(\frac{3}{4}\right)^2 = \frac{3}{5} - \frac{9}{16} = \frac{48 - 45}{80} = \frac{3}{80}

If you get this wrong, revise: Section 2.4 (PDF) and Section 4.1-4.2 (Expectation and Variance).

Problem 5. If $X \sim \mathrm{Bin}(n, p)$, use LOTUS to show that $E[X(X - 1)] = n(n-1)p^2$. Use this to derive $\mathrm{Var}(X) = np(1 - p)$.

Solution

E[X(X1)]=k=0nk(k1)(nk)pk(1p)nkE[X(X - 1)] = \sum_{k=0}^n k(k-1) \binom{n}{k} p^k (1-p)^{n-k}

For k2k \geq 2: k(k1)(nk)=k(k1)n!k!(nk)!=n!(k2)!(nk)!=n(n1)(n2k2)k(k-1)\binom{n}{k} = k(k-1) \cdot \frac{n!}{k!(n-k)!} = \frac{n!}{(k-2)!(n-k)!} = n(n-1)\binom{n-2}{k-2}.

E[X(X1)]=n(n1)p2k=2n(n2k2)pk2(1p)nk=n(n1)p21=n(n1)p2E[X(X-1)] = n(n-1)p^2 \sum_{k=2}^n \binom{n-2}{k-2} p^{k-2}(1-p)^{n-k} = n(n-1)p^2 \cdot 1 = n(n-1)p^2

(the sum is the binomial theorem for Bin(n2,p)\mathrm{Bin}(n-2, p)).

Now E[X2]=E[X(X1)]+E[X]=n(n1)p2+npE[X^2] = E[X(X-1)] + E[X] = n(n-1)p^2 + np.

Var(X)=n(n1)p2+npn2p2=npnp2=np(1p)\mathrm{Var}(X) = n(n-1)p^2 + np - n^2p^2 = np - np^2 = np(1-p) \quad \blacksquare

If you get this wrong, revise: Section 3.1 (Binomial Distribution) and Section 4.1 (LOTUS).

Problem 6. Let XPoisson(λ)X \sim \mathrm{Poisson}(\lambda). Find E[X(X1)(X2)]E[X(X-1)(X-2)] and use it to compute Var(X)\mathrm{Var}(X).

Solution

E[X(X1)(X2)]=k=0k(k1)(k2)eλλkk!=eλk=3λk(k3)!E[X(X-1)(X-2)] = \sum_{k=0}^{\infty} k(k-1)(k-2) \frac{e^{-\lambda}\lambda^k}{k!} = e^{-\lambda} \sum_{k=3}^{\infty} \frac{\lambda^k}{(k-3)!}

=eλλ3j=0λjj!=λ3= e^{-\lambda} \lambda^3 \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} = \lambda^3

So E[X3]=E[X(X1)(X2)]+3E[X2]2E[X]=λ3+3(λ2+λ)2λ=λ3+3λ2+λE[X^3] = E[X(X-1)(X-2)] + 3E[X^2] - 2E[X] = \lambda^3 + 3(\lambda^2 + \lambda) - 2\lambda = \lambda^3 + 3\lambda^2 + \lambda.

For variance: E[X2]=E[X(X1)]+E[X]=λ2+λE[X^2] = E[X(X-1)] + E[X] = \lambda^2 + \lambda.

Var(X)=(λ2+λ)λ2=λ\mathrm{Var}(X) = (\lambda^2 + \lambda) - \lambda^2 = \lambda \quad \blacksquare

If you get this wrong, revise: Section 3.1 (Poisson Distribution).

Problem 7. Let XX and YY be independent with XN(2,4)X \sim N(2, 4) and YN(3,9)Y \sim N(3, 9). Find the distribution of 2X3Y+52X - 3Y + 5.

Solution

E[2X3Y+5]=2(2)3(3)+5=49+5=0E[2X - 3Y + 5] = 2(2) - 3(3) + 5 = 4 - 9 + 5 = 0.

Var(2X3Y+5)=4Var(X)+9Var(Y)=4(4)+9(9)=16+81=97\mathrm{Var}(2X - 3Y + 5) = 4\,\mathrm{Var}(X) + 9\,\mathrm{Var}(Y) = 4(4) + 9(9) = 16 + 81 = 97.

Since linear combinations of independent normals are normal: 2X3Y+5N(0,97)2X - 3Y + 5 \sim N(0, 97). \blacksquare

If you get this wrong, revise: Section 3.2 (Normal Distribution) and Theorem 3.2.

Problem 8. Let fX,Y(x,y)=32(x2+y2)f_{X,Y}(x,y) = \frac{3}{2}(x^2 + y^2) for 0x1,0y10 \leq x \leq 1, 0 \leq y \leq 1. Find P(X>Y)P(X \gt Y).

Solution

P(X>Y)=01y132(x2+y2)dxdyP(X \gt Y) = \int_0^1 \int_y^1 \frac{3}{2}(x^2 + y^2)\, dx\, dy

=3201[x33+xy2]x=yx=1dy=3201(13+y2y33y3)dy= \frac{3}{2} \int_0^1 \left[\frac{x^3}{3} + xy^2\right]_{x=y}^{x=1}\, dy = \frac{3}{2} \int_0^1 \left(\frac{1}{3} + y^2 - \frac{y^3}{3} - y^3\right)\, dy

=3201(13+y24y33)dy=32[y3+y33y43]01= \frac{3}{2} \int_0^1 \left(\frac{1}{3} + y^2 - \frac{4y^3}{3}\right)\, dy = \frac{3}{2} \left[\frac{y}{3} + \frac{y^3}{3} - \frac{y^4}{3}\right]_0^1

=32(13+1313)=3213=12= \frac{3}{2} \left(\frac{1}{3} + \frac{1}{3} - \frac{1}{3}\right) = \frac{3}{2} \cdot \frac{1}{3} = \frac{1}{2}

If you get this wrong, revise: Section 5.1 (Joint PDF).
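The value $1/2$ in Problem 8 can be sanity-checked numerically. The sketch below is a plain midpoint Riemann sum over the region $x \gt y$ of the unit square. The function name, the grid size, and the half weight given to cells that straddle the diagonal are choices made for this illustration.

```python
def prob_x_greater_y(n=200):
    """Midpoint-rule estimate of P(X > Y) for f(x, y) = 1.5 * (x^2 + y^2)."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        for j in range(i + 1):  # only cells with y-index <= x-index
            y = (j + 0.5) * h
            # Diagonal cells lie half inside the region x > y.
            weight = 0.5 if i == j else 1.0
            total += weight * 1.5 * (x * x + y * y) * h * h
    return total


estimate = prob_x_greater_y()
print(estimate)
```

Because the density is symmetric in $x$ and $y$, the estimate should sit very close to $0.5$, with the small gap coming from the discretisation.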

Problem 9. Let $X_1, X_2$ be i.i.d. with common PDF $f(x) = 2x$ for $0 \leq x \leq 1$. Find the PDF of $M = \max(X_1, X_2)$.

Solution

The CDF of each XiX_i is F(x)=x2F(x) = x^2 for 0x10 \leq x \leq 1.

FM(m)=P(max(X1,X2)m)=P(X1m)P(X2m)=F(m)2=m4F_M(m) = P(\max(X_1, X_2) \leq m) = P(X_1 \leq m)\, P(X_2 \leq m) = F(m)^2 = m^4

fM(m)=ddmFM(m)=4m3,0m1f_M(m) = \frac{d}{dm} F_M(m) = 4m^3, \quad 0 \leq m \leq 1

If you get this wrong, revise: Section 2.2 (CDF Properties) and Section 5.3 (Independence).

Problem 10. Let $X$ have MGF $M_X(t) = \frac{1}{3}e^t + \frac{1}{3}e^{2t} + \frac{1}{3}e^{3t}$. What is the distribution of $X$? Compute $E[X]$ and $\mathrm{Var}(X)$.

Solution

The MGF is a weighted sum of exponentials, corresponding to a discrete distribution:

P(X=1)=P(X=2)=P(X=3)=13P(X = 1) = P(X = 2) = P(X = 3) = \frac{1}{3}

This is the discrete uniform distribution on \\{1, 2, 3\\}.

E[X]=MX(0)=13e0+23e0+33e0=1+2+33=2E[X] = M_X'(0) = \frac{1}{3}e^0 + \frac{2}{3}e^0 + \frac{3}{3}e^0 = \frac{1 + 2 + 3}{3} = 2

E[X2]=MX(0)=13+43+93=143E[X^2] = M_X''(0) = \frac{1}{3} + \frac{4}{3} + \frac{9}{3} = \frac{14}{3}

Var(X)=1434=23\mathrm{Var}(X) = \frac{14}{3} - 4 = \frac{2}{3}

If you get this wrong, revise: Section 4.3 (MGFs).

Problem 11. Use the CLT to approximate P(Bin(100,0.3)35)P(\mathrm{Bin}(100, 0.3) \leq 35).

Solution

XBin(100,0.3)X \sim \mathrm{Bin}(100, 0.3)So E[X]=30E[X] = 30, Var(X)=21\mathrm{Var}(X) = 21, σ=214.583\sigma = \sqrt{21} \approx 4.583.

With continuity correction:

P(X35)P(Z35.53021)=P(Z5.54.583)=P(Z1.20)0.8849P(X \leq 35) \approx P\left(Z \leq \frac{35.5 - 30}{\sqrt{21}}\right) = P\left(Z \leq \frac{5.5}{4.583}\right) = P(Z \leq 1.20) \approx 0.8849

If you get this wrong, revise: Section 6.3 (CLT) and Section 3.3 (Normal Approximation).
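To see how good the continuity-corrected approximation in Problem 11 is, one can compare it with the exact binomial sum. The sketch below uses only the Python standard library, and the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, k = 100, 0.3, 35

# Exact P(X <= k) by summing the Bin(n, p) PMF.
exact = sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

# Normal approximation with continuity correction: evaluate at k + 0.5.
mean = n * p
sd = sqrt(n * p * (1 - p))
approx = NormalDist(mean, sd).cdf(k + 0.5)

print(round(exact, 4), round(approx, 4))
```

The two values should agree to roughly two decimal places, which is typical for $n = 100$ with $p$ not too close to 0 or 1.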

Problem 12. Let X1,,X100X_1, \ldots, X_{100} be i.i.d. Uniform(0,1)\mathrm{Uniform}(0, 1). Approximate P(0.48<Xˉ<0.52)P(0.48 \lt \bar{X} \lt 0.52).

Solution

E[Xi]=1/2E[X_i] = 1/2, Var(Xi)=1/12\mathrm{Var}(X_i) = 1/12. By the CLT:

XˉN(12,11200),σXˉ=112000.02887\bar{X} \approx N\left(\frac{1}{2}, \frac{1}{1200}\right), \quad \sigma_{\bar{X}} = \frac{1}{\sqrt{1200}} \approx 0.02887

P(0.48<Xˉ<0.52)=P(0.480.500.02887<Z<0.520.500.02887)=P(0.693<Z<0.693)P(0.48 \lt \bar{X} \lt 0.52) = P\left(\frac{0.48 - 0.50}{0.02887} \lt Z \lt \frac{0.52 - 0.50}{0.02887}\right) = P(-0.693 \lt Z \lt 0.693)

Φ(0.693)Φ(0.693)2(0.7557)1=0.5114\approx \Phi(0.693) - \Phi(-0.693) \approx 2(0.7557) - 1 = 0.5114

If you get this wrong, revise: Section 6.3 (CLT).

Problem 13. Find the MLE for θ\theta given a single observation xx from the Pareto distribution with PDF f(xθ)=θ/xθ+1f(x \mid \theta) = \theta / x^{\theta + 1} for x1x \geq 1 and θ>0\theta \gt 0.

Solution

L(θ)=θxθ+1L(\theta) = \frac{\theta}{x^{\theta + 1}}

(θ)=logθ(θ+1)logx\ell(\theta) = \log \theta - (\theta + 1)\log x

ddθ=1θlogx=0    θ^=1logx\frac{d\ell}{d\theta} = \frac{1}{\theta} - \log x = 0 \implies \hat{\theta} = \frac{1}{\log x}

Verify: $\frac{d^2\ell}{d\theta^2} = -\frac{1}{\theta^2} \lt 0$, confirming a maximum. $\blacksquare$

If you get this wrong, revise: Section 7.2 (MLE Procedure).

Problem 14. Compute the Fisher information I(θ)I(\theta) for the Pareto distribution in Problem 13.

Solution

θlogf(xθ)=1θlogx\frac{\partial}{\partial \theta} \log f(x \mid \theta) = \frac{1}{\theta} - \log x

2θ2logf(xθ)=1θ2\frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) = -\frac{1}{\theta^2}

I(θ)=E[1θ2]=1θ2I(\theta) = -E\left[-\frac{1}{\theta^2}\right] = \frac{1}{\theta^2}

If you get this wrong, revise: Section 7.4 (Fisher Information).

Problem 15. A test of H0:μ=50H_0: \mu = 50 vs H1:μ>50H_1: \mu \gt 50 is conducted at α=0.01\alpha = 0.01 with n=16n = 16 and known σ=8\sigma = 8. What is the power of the test if the true mean is μ=54\mu = 54?

Solution

Under H0H_0, XˉN(50,64/16)=N(50,4)\bar{X} \sim N(50, 64/16) = N(50, 4). The critical value in terms of Xˉ\bar{X} is:

xˉc=50+z0.0184=50+2.326×2=54.652\bar{x}_c = 50 + z_{0.01} \cdot \frac{8}{4} = 50 + 2.326 \times 2 = 54.652

Under H1H_1 (true μ=54\mu = 54): XˉN(54,4)\bar{X} \sim N(54, 4).

Power=P(Xˉ>54.652μ=54)=P(Z>54.652542)=P(Z>0.326)0.372\mathrm{Power} = P(\bar{X} \gt 54.652 \mid \mu = 54) = P\left(Z \gt \frac{54.652 - 54}{2}\right) = P(Z \gt 0.326) \approx 0.372

If you get this wrong, revise: Section 8.6 (Z-Test) and Section 8.2 (Power).
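The power calculation in Problem 15 generalises to any upper-tailed Z-test. The following sketch wraps it in a function. The name `power_upper_z` and its signature are assumptions for illustration, and `NormalDist.inv_cdf` supplies $z_\alpha$ instead of a table lookup.

```python
from math import sqrt
from statistics import NormalDist


def power_upper_z(mu0, mu_true, sigma, n, alpha):
    """Power of the level-alpha test of H0: mu = mu0 vs H1: mu > mu0."""
    se = sigma / sqrt(n)                      # standard error of the mean
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    critical_mean = mu0 + z_alpha * se        # reject when xbar exceeds this
    # Probability that xbar exceeds the cutoff when the true mean is mu_true.
    return 1 - NormalDist(mu_true, se).cdf(critical_mean)


power = power_upper_z(mu0=50, mu_true=54, sigma=8, n=16, alpha=0.01)
print(round(power, 3))
```

Calling the function with a larger `n` shows how the power rises as the standard error shrinks.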

Problem 16. Let XExp(λ)X \sim \mathrm{Exp}(\lambda) and YExp(μ)Y \sim \mathrm{Exp}(\mu) be independent. Show that P(X<Y)=λ/(λ+μ)P(X \lt Y) = \lambda / (\lambda + \mu).

Solution

P(X<Y)=0xλeλxμeμydydx=0λeλxeμxdxP(X \lt Y) = \int_0^{\infty} \int_x^{\infty} \lambda e^{-\lambda x} \mu e^{-\mu y}\, dy\, dx = \int_0^{\infty} \lambda e^{-\lambda x} e^{-\mu x}\, dx

=λ0e(λ+μ)xdx=λλ+μ= \lambda \int_0^{\infty} e^{-(\lambda + \mu)x}\, dx = \frac{\lambda}{\lambda + \mu} \quad \blacksquare

If you get this wrong, revise: Section 5.1 (Joint PDF) and Section 3.2 (Exponential Distribution).

Problem 17. Prove Chebyshev’s inequality: for any random variable $X$ with finite mean $\mu$ and variance $\sigma^2$ and any $k \gt 0$:

P(Xμkσ)1k2P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}

Solution

σ2=E[(Xμ)2]=(xμ)2dF(x)\sigma^2 = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2\, dF(x)

xμkσ(xμ)2dF(x)xμkσk2σ2dF(x)=k2σ2P(Xμkσ)\geq \int_{|x - \mu| \geq k\sigma} (x - \mu)^2\, dF(x) \geq \int_{|x - \mu| \geq k\sigma} k^2 \sigma^2\, dF(x) = k^2 \sigma^2\, P(|X - \mu| \geq k\sigma)

Therefore P(Xμkσ)1/k2P(|X - \mu| \geq k\sigma) \leq 1/k^2. \blacksquare

If you get this wrong, revise: Section 4.2 (Variance).

Problem 18. Let X1,,XnN(μ,1)X_1, \ldots, X_n \sim N(\mu, 1) with μ\mu unknown. Find the likelihood ratio test statistic for H0:μ=μ0H_0: \mu = \mu_0 vs H1:μμ0H_1: \mu \neq \mu_0 and show it is equivalent to the Z-test.

Solution

Under H0H_0: supL=(2π)n/2exp(12(xiμ0)2)\sup L = (2\pi)^{-n/2} \exp\left(-\frac{1}{2}\sum(x_i - \mu_0)^2\right).

Under H1H0H_1 \cup H_0: supL=(2π)n/2exp(12(xixˉ)2)\sup L = (2\pi)^{-n/2} \exp\left(-\frac{1}{2}\sum(x_i - \bar{x})^2\right).

Λ=L(μ0)L(xˉ)=exp(12[(xiμ0)2(xixˉ)2])=exp(n(xˉμ0)22)\Lambda = \frac{L(\mu_0)}{L(\bar{x})} = \exp\left(-\frac{1}{2}\left[\sum(x_i - \mu_0)^2 - \sum(x_i - \bar{x})^2\right]\right) = \exp\left(-\frac{n(\bar{x} - \mu_0)^2}{2}\right)

2logΛ=n(xˉμ0)2=(xˉμ01/n)2=Z2-2\log \Lambda = n(\bar{x} - \mu_0)^2 = \left(\frac{\bar{x} - \mu_0}{1/\sqrt{n}}\right)^2 = Z^2

Under $H_0$, $Z^2 \sim \chi^2_1$, so we reject when $Z^2 \gt \chi^2_{1, \alpha} = z_{\alpha/2}^2$, i.e., when $|Z| \gt z_{\alpha/2}$. This is exactly the Z-test. $\blacksquare$

If you get this wrong, revise: Section 8.4 (Likelihood Ratio Tests) and Section 8.6 (Z-Test).

Problem 19. Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ with both parameters unknown. Find the MLE for $(\mu, \sigma^2)$ and show that $\hat{\sigma}^2_{\mathrm{MLE}}$ is biased.

Solution

L(μ,σ2)=(2πσ2)n/2exp(12σ2i=1n(xiμ)2)L(\mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right)

(μ,σ2)=n2log(2π)n2log(σ2)12σ2i=1n(xiμ)2\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2

μ=1σ2i=1n(xiμ)=0    μ^=xˉ\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0 \implies \hat{\mu} = \bar{x}

σ2=n2σ2+12(σ2)2i=1n(xiμ)2=0\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i - \mu)^2 = 0

Substituting μ^=xˉ\hat{\mu} = \bar{x}: σ^2=1ni=1n(xixˉ)2\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.

To check bias: E[σ^2]=E[n1nS2]=n1nσ2σ2E[\hat{\sigma}^2] = E\left[\frac{n-1}{n} S^2\right] = \frac{n-1}{n} \sigma^2 \neq \sigma^2.

The bias is σ2/n-\sigma^2/n. \blacksquare

If you get this wrong, revise: Section 7.2 (MLE Procedure).

Problem 20. Use the delta method to find the asymptotic distribution of p^(1p^)\hat{p}(1 - \hat{p}) where p^=1ni=1nXi\hat{p} = \frac{1}{n}\sum_{i=1}^n X_i and XiBernoulli(p)X_i \sim \mathrm{Bernoulli}(p).

Solution

By the CLT, n(p^p)dN(0,p(1p))\sqrt{n}(\hat{p} - p) \xrightarrow{d} N(0, p(1-p)).

Let g(t)=t(1t)=tt2g(t) = t(1 - t) = t - t^2. Then g(t)=12tg'(t) = 1 - 2tSo g(p)=12pg'(p) = 1 - 2p.

By the delta method:

n(p^(1p^)p(1p))dN(0,p(1p)(12p)2)\sqrt{n}\left(\hat{p}(1 - \hat{p}) - p(1 - p)\right) \xrightarrow{d} N(0, p(1 - p)(1 - 2p)^2)

If you get this wrong, revise: Section 6.5 (Delta Method) and Section 6.3 (CLT).
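A small Monte Carlo experiment can make the delta-method result in Problem 20 concrete. The sketch below simulates $\sqrt{n}\,(\hat{p}(1 - \hat{p}) - p(1 - p))$ many times and compares its sample variance with $p(1 - p)(1 - 2p)^2$. The values $p = 0.3$, $n = 400$, the number of replications, and the seed are arbitrary assumptions.

```python
import random
from statistics import variance


def simulate_scaled_errors(p, n, reps, seed=0):
    """Draw reps values of sqrt(n) * (g(p_hat) - g(p)) with g(t) = t(1 - t)."""
    rng = random.Random(seed)
    target = p * (1 - p)
    errors = []
    for _ in range(reps):
        # p_hat is the mean of n Bernoulli(p) draws.
        p_hat = sum(rng.random() < p for _ in range(n)) / n
        errors.append(n ** 0.5 * (p_hat * (1 - p_hat) - target))
    return errors


p, n = 0.3, 400
errors = simulate_scaled_errors(p, n, reps=2000)
empirical_var = variance(errors)
asymptotic_var = p * (1 - p) * (1 - 2 * p) ** 2  # delta-method variance
print(round(empirical_var, 4), round(asymptotic_var, 4))
```

At $p = 1/2$ the factor $(1 - 2p)^2$ vanishes, so the first-order delta method gives a degenerate limit there.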

Problem 21. Let $X_1, \ldots, X_n$ be i.i.d. from a distribution with finite mean $\mu$ and finite variance $\sigma^2$. Show that the sample mean is a consistent estimator of $\mu$ using Chebyshev’s inequality.

Solution

E[Xˉn]=μE[\bar{X}_n] = \mu (unbiased) and Var(Xˉn)=σ2/n\mathrm{Var}(\bar{X}_n) = \sigma^2/n.

By Chebyshev’s inequality, for any ε>0\varepsilon \gt 0:

P(Xˉnμε)Var(Xˉn)ε2=σ2nε2P(|\bar{X}_n - \mu| \geq \varepsilon) \leq \frac{\mathrm{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2}

As $n \to \infty$, the right side goes to 0, so $\bar{X}_n \xrightarrow{p} \mu$. $\blacksquare$

If you get this wrong, revise: Section 6.2 (Law of Large Numbers).

Problem 22. A random sample of size 64 is drawn from a population with unknown mean and standard deviation σ=4\sigma = 4. Find the probability that the sample mean differs from the population mean by more than 1.

Solution

By the CLT, XˉN(μ,σ2/n)=N(μ,16/64)=N(μ,1/4)\bar{X} \approx N(\mu, \sigma^2/n) = N(\mu, 16/64) = N(\mu, 1/4).

P(Xˉμ>1)=P(Xˉμ0.5>2)=P(Z>2)=2(1Φ(2))2(0.0228)=0.0456P(|\bar{X} - \mu| \gt 1) = P\left(\left|\frac{\bar{X} - \mu}{0.5}\right| \gt 2\right) = P(|Z| \gt 2) = 2(1 - \Phi(2)) \approx 2(0.0228) = 0.0456

If you get this wrong, revise: Section 6.3 (CLT).

Problem 23. Let $X$ and $Y$ have joint PDF $f(x, y) = e^{-x - y}$ for $x \gt 0, y \gt 0$. Compute $E[X \mid Y = y]$ and verify the law of iterated expectations.

Solution

The marginal of $Y$: $f_Y(y) = \int_0^{\infty} e^{-x-y}\, dx = e^{-y}$, so $Y \sim \mathrm{Exp}(1)$.

The conditional PDF: $f_{X \mid Y}(x \mid y) = \frac{e^{-x-y}}{e^{-y}} = e^{-x}$ for $x \gt 0$.

Note that $f_{X \mid Y}(x \mid y)$ does not depend on $y$, confirming $X$ and $Y$ are independent.

E[XY=y]=0xexdx=1E[X \mid Y = y] = \int_0^{\infty} x\, e^{-x}\, dx = 1

By the law of iterated expectations: E[E[XY]]=E[1]=1=E[X]E[E[X \mid Y]] = E[1] = 1 = E[X] (since XExp(1)X \sim \mathrm{Exp}(1)). \blacksquare

If you get this wrong, revise: Section 5.3 (Conditional Distributions) and Section 5.3 (Tower Property).

Problem 24. Let $X \sim N(0, 1)$ and $Y \sim N(0, 1)$ be independent. Show that $X/Y$ follows a Cauchy distribution.

Solution

We use the Jacobian method. Let U=X/YU = X/Y and V=YV = Y. Then X=UVX = UV, Y=VY = V with Jacobian J=v|J| = |v|.

fU,V(u,v)=12πe(uv)2/2ev2/2v=v2πev2(u2+1)/2f_{U,V}(u,v) = \frac{1}{2\pi} e^{-(uv)^2/2}\, e^{-v^2/2}\, |v| = \frac{|v|}{2\pi} e^{-v^2(u^2 + 1)/2}

Integrating out vv:

fU(u)=v2πev2(1+u2)/2dv=1π0vev2(1+u2)/2dvf_U(u) = \int_{-\infty}^{\infty} \frac{|v|}{2\pi} e^{-v^2(1+u^2)/2}\, dv = \frac{1}{\pi} \int_0^{\infty} v\, e^{-v^2(1+u^2)/2}\, dv

Let $w = v^2(1+u^2)/2$, so $dw = v(1+u^2)\, dv$:

fU(u)=1π(1+u2)0ewdw=1π(1+u2)f_U(u) = \frac{1}{\pi(1+u^2)} \int_0^{\infty} e^{-w}\, dw = \frac{1}{\pi(1+u^2)}

This is the standard Cauchy distribution. Note that $E[|X/Y|] = \infty$, so the mean does not exist. $\blacksquare$

If you get this wrong, revise: Section 5.6 (Jacobian Method).
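A simulation gives a quick plausibility check of Problem 24. The standard Cauchy CDF is $F(u) = \frac{1}{2} + \frac{1}{\pi}\arctan u$, so $P(|U| \leq 1) = \frac{1}{2}$. The sketch below estimates that probability from ratios of independent standard normals. The sample size and seed are arbitrary assumptions.

```python
import random


def ratio_samples(size, seed=0):
    """Return `size` draws of X / Y with X, Y independent N(0, 1)."""
    rng = random.Random(seed)
    # A denominator of exactly 0.0 has negligible probability and is ignored.
    return [rng.gauss(0, 1) / rng.gauss(0, 1) for _ in range(size)]


samples = ratio_samples(100_000)

# For a standard Cauchy variable the quartiles are -1 and +1,
# so about half of the draws should satisfy |U| <= 1.
frac_within_one = sum(abs(u) <= 1 for u in samples) / len(samples)
print(round(frac_within_one, 3))
```

The sample mean of these ratios is deliberately not reported: since the mean does not exist, it does not settle down as the sample grows.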

Problem 25. Prove that for any events A,B,CA, B, C:

P(ABC)=P(A)P(BA)P(CAB)P(A \cap B \cap C) = P(A)\, P(B \mid A)\, P(C \mid A \cap B)

Solution

By the definition of conditional probability applied twice:

P(BCA)=P(ABC)P(A)P(B \cap C \mid A) = \frac{P(A \cap B \cap C)}{P(A)}

P(CAB)=P(ABC)P(AB)P(C \mid A \cap B) = \frac{P(A \cap B \cap C)}{P(A \cap B)}

From the second: P(ABC)=P(CAB)P(AB)P(A \cap B \cap C) = P(C \mid A \cap B)\, P(A \cap B). And P(AB)=P(BA)P(A)P(A \cap B) = P(B \mid A)\, P(A).

Substituting: P(ABC)=P(A)P(BA)P(CAB)P(A \cap B \cap C) = P(A)\, P(B \mid A)\, P(C \mid A \cap B). \blacksquare

This is the chain rule of probability, which generalises to nn events.

If you get this wrong, revise: Section 1.5 (Conditional Probability).

Problem 26. Show that the Poisson distribution is infinitely divisible: if $X \sim \mathrm{Poisson}(\lambda)$, then $X$ can be expressed as the sum of $n$ i.i.d. random variables for any positive integer $n$.

Solution

The MGF of XPoisson(λ)X \sim \mathrm{Poisson}(\lambda) is MX(t)=exp(λ(et1))M_X(t) = \exp(\lambda(e^t - 1)).

For any integer $n \geq 1$, we can write:

$$M_X(t) = \left[\exp\left(\frac{\lambda}{n}(e^t - 1)\right)\right]^n$$

Each factor $\exp\left(\frac{\lambda}{n}(e^t - 1)\right)$ is the MGF of $\mathrm{Poisson}(\lambda/n)$, and a product of MGFs is the MGF of a sum of independent random variables. Therefore $X \overset{d}{=} Y_1 + \cdots + Y_n$, where the $Y_i \sim \mathrm{Poisson}(\lambda/n)$ are i.i.d. $\blacksquare$

If you get this wrong, revise: Section 4.3 (MGFs) and Section 3.1 (Poisson Distribution).

Common Pitfalls

  • Confusing PDF and CDF. PDF $f(x)$: probability density; CDF $F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\, dt$. Fix: $F'(x) = f(x)$; $P(a < X < b) = F(b) - F(a)$.
  • Wrong central limit theorem application. The CLT describes the distribution of the sample mean, not of individual observations, and requires sufficiently large $n$. Fix: $\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} N(0, 1)$ as $n \to \infty$, so $\bar{X}_n$ is approximately $N(\mu, \sigma^2/n)$ for large $n$.
  • Confusing type I and type II errors. Type I: rejecting $H_0$ when it is true (probability $\alpha$). Type II: failing to reject $H_0$ when it is false (probability $\beta$). Fix: Type I = false positive; Type II = false negative. For a fixed sample size, decreasing one generally increases the other.

Worked Examples

Example 1: Normal distribution

Problem. $X \sim N(100, 15^2)$. Find $P(X > 130)$.

Solution. $Z = \frac{130 - 100}{15} = 2.0$. $P(X > 130) = P(Z > 2) = 1 - \Phi(2) \approx 1 - 0.9772 = 0.0228$.

\blacksquare

Example 2: Hypothesis test

Problem. Test $H_0: \mu = 50$ vs $H_1: \mu > 50$ given $\bar{x} = 53$, $s = 8$, $n = 25$, $\alpha = 0.05$.

Solution. $t = \frac{53 - 50}{8/\sqrt{25}} = \frac{3}{1.6} = 1.875$. Critical value: $t_{0.05, 24} = 1.711$. Since $1.875 > 1.711$, reject $H_0$ at the 5% level.

\blacksquare

Summary

  • Continuous distributions: PDF integrates to 1; CDF gives cumulative probability.
  • Normal distribution: $X \sim N(\mu, \sigma^2)$; standardise: $Z = (X - \mu)/\sigma$.
  • Central limit theorem: sample mean is approximately normal for large $n$.
  • Hypothesis testing: state $H_0$ and $H_1$, choose significance level, compute test statistic, compare with critical value.

Cross-References

| Topic | Site | Link |
| --- | --- | --- |
| Probability and Statistics (Overview) | WyattsNotes | View |
| Probability | WyattsNotes | View |
| Real Analysis | WyattsNotes | View |
| Probability — Harvard Stat 110 | Harvard | View |