Skip to content

Machine Learning Fundamentals

1. Linear Regression

1.1 Model

Given features xRd\mathbf{x} \in \mathbb{R}^d, predict a continuous target yy:

y^=wTx+b=w1x1+w2x2++wdxd+b\hat{y} = \mathbf{w}^T \mathbf{x} + b = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b

In matrix form (with bias absorbed): y^=wTx\hat{y} = \mathbf{w}^T \mathbf{x} where x=[1,x1,,xd]T\mathbf{x} = [1, x_1, \ldots, x_d]^T.

1.2 Ordinary Least Squares (OLS)

Minimize the mean squared error:

L(w)=1ni=1n(yiwTxi)2=1nyXw2\mathcal{L}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 = \frac{1}{n} \|\mathbf{y} - X\mathbf{w}\|^2

Closed-form solution:

w=(XTX)1XTy\mathbf{w}^* = (X^T X)^{-1} X^T \mathbf{y}

Assumptions:

  • XTXX^T X is invertible (features are not linearly dependent).
  • OLS is unbiased: E[w^]=wtrue\mathbb{E}[\hat{\mathbf{w}}] = \mathbf{w}_{\text{true}}.

1.3 Gradient Descent

When closed-form is expensive (dd large), use iterative optimization:

GRADIENT_DESCENT(X, y, α, iterations):
w = zeros(d+1) // including bias
for t = 1 to iterations:
predictions = X @ w
gradient = (2/n) * X.T @ (predictions - y)
w = w - α * gradient
return w

Learning rate α\alpha: Too large → divergence; too small → slow convergence.

Batch gradient descent: Uses all nn examples per step. Stochastic gradient descent (SGD): Uses one example per step. Noisy but fast. Mini-batch SGD: Uses BB examples per step. Balances efficiency and stability.

1.4 Variants

SGD(X, y, α, epochs, batch_size):
w = zeros(d+1)
for epoch = 1 to epochs:
shuffle data
for batch in get_batches(X, y, batch_size):
predictions = batch.X @ w
gradient = (2/|batch|) * batch.X.T @ (predictions - batch.y)
w = w - α * gradient
return w

2. Logistic Regression

2.1 Model

Binary classification: predict P(y=1x)P(y = 1 | \mathbf{x}):

y^=σ(wTx+b)=11+e(wTx+b)\hat{y} = \sigma(\mathbf{w}^T \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}

Sigmoid function σ(z)\sigma(z): Maps any real number to (0,1)(0, 1).

σ(z)=11+ez,σ"(z)=σ(z)(1σ(z))\sigma(z) = \frac{1}{1+e^{-z}}, \quad \sigma"(z) = \sigma(z)(1 - \sigma(z))

2.2 Loss Function: Binary Cross-Entropy

L=1ni=1n[yilog(y^i)+(1yi)log(1y^i)]\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]

Gradient:

wL=1ni=1n(y^iyi)xi\nabla_{\mathbf{w}} \mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \mathbf{x}_i

Update rule:

wwα1ni=1n(y^iyi)xi\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \mathbf{x}_i

2.3 Multi-Class: Softmax Regression

For KK classes, use the softmax function:

P(y=kx)=ezkj=1Kezj,zk=wkTx+bkP(y = k | \mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \quad z_k = \mathbf{w}_k^T \mathbf{x} + b_k

Loss: Categorical cross-entropy:

L=1ni=1nk=1Kyiklog(y^ik)\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log(\hat{y}_{ik})

where yik=1y_{ik} = 1 if example ii belongs to class kk, else 00.

3. Decision Trees

3.1 Structure

A tree where:

  • Internal nodes test a feature (e.g., “is x13x_1 \leq 3?”).
  • Leaves assign a class label (or regression value).
  • Branches correspond to test outcomes.

3.2 Splitting Criteria (Classification)

Gini impurity:

Gini(S)=1k=1Kpk2\text{Gini}(S) = 1 - \sum_{k=1}^{K} p_k^2

where pkp_k is the proportion of class kk in set SS.

Information gain (Entropy):

H(S)=k=1Kpklog2pkH(S) = -\sum_{k=1}^{K} p_k \log_2 p_k

Gain(S,A)=H(S)vValues(A)SvSH(Sv)\text{Gain}(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)

Choose the feature and threshold that maximizes gain (or minimizes impurity).

3.3 Splitting Criteria (Regression)

Reduction in variance:

Split Score=Var(S)vSvSVar(Sv)\text{Split Score} = \text{Var}(S) - \sum_{v} \frac{|S_v|}{|S|} \text{Var}(S_v)

3.4 Training Algorithm

BUILD_TREE(data, max_depth, min_samples):
if all labels same: return Leaf(majority_label)
if max_depth == 0 or len(data) < min_samples:
return Leaf(majority_label)
best_feature, best_threshold = None, None
best_score = -∞
for each feature f:
for each threshold t in unique values of f:
score = IMPURITY(data, f, t)
if score > best_score:
best_score = score
best_feature = f
best_threshold = t
left = data where data[f] <= t
right = data where data[f] > t
return Node(best_feature, best_threshold,
BUILD_TREE(left, max_depth-1, min_samples),
BUILD_TREE(right, max_depth-1, min_samples))

3.5 Pruning

Pre-pruning: Stop growing early (max depth, min samples per leaf, min impurity decrease).

Post-pruning (cost-complexity pruning):

Grow full tree, then prune branches:

Rα(T)=R(T)+αTR_\alpha(T) = R(T) + \alpha |T|

where R(T)R(T) = training error, T|T| = number of leaves, α\alpha = complexity parameter.

3.6 Ensemble Methods

Bagging (Random Forest):

RANDOM_FOREST(data, n_trees, B):
trees = []
for i = 1 to n_trees:
sample = bootstrap_sample(data, size = len(data))
tree = BUILD_TREE(sample, random feature subset at each split)
trees.append(tree)
PREDICT(x):
return majority_vote([tree.predict(x) for tree in trees])

Boosting (Gradient Boosting):

GRADIENT_BOOST(data, n_estimators, lr):
F(x) = 0 // initial prediction
residuals = y
for m = 1 to n_estimators:
tree = BUILD_TREE(X, residuals, shallow)
update = tree.predict(X)
F = F + lr * update
residuals = y - F // new residuals (negative gradient)
return F

4. Neural Networks

4.1 Perceptron

A single neuron:

y^=σ(wTx+b)=σ(j=1dwjxj+b)\hat{y} = \sigma(\mathbf{w}^T \mathbf{x} + b) = \sigma\left(\sum_{j=1}^{d} w_j x_j + b\right)

Limitation: Can only learn linearly separable functions (single-layer cannot solve XOR).

4.2 Multi-Layer Perceptron (MLP)

Layers of neurons connected by weights:

Layer l=0 (input): a[0] = x (d neurons)
Layer l=1 (hidden): z[1] = W[1]a[0] + b[1]
a[1] = activation(z[1]) (h neurons)
Layer l=2 (output): z[2] = W[2]a[1] + b[2]
a[2] = activation(z[2]) (K neurons)

4.3 Activation Functions

FunctionFormulaRangeProperties
Sigmoidσ(z)=11+ez\sigma(z) = \frac{1}{1+e^{-z}}(0,1)(0, 1)Smooth, vanishing gradient
Tanhtanh(z)=ezezez+ez\tanh(z) = \frac{e^z-e^{-z}}{e^z+e^{-z}}(1,1)(-1, 1)Zero-centered
ReLUReLU(z)=max(0,z)\text{ReLU}(z) = \max(0, z)[0,)[0, \infty)Fast, but “dead neurons”
Leaky ReLUmax(αz,z)\max(\alpha z, z), α=0.01\alpha = 0.01(,)(-\infty, \infty)Fixes dead neurons
Softmaxezkjezj\frac{e^{z_k}}{\sum_j e^{z_j}}(0,1)K(0, 1)^KFor multi-class output

4.4 Backpropagation

Compute gradients layer by layer using the chain rule.

FORWARD(x):
a[0] = x
for l = 1 to L:
z[l] = W[l] @ a[l-1] + b[l]
a[l] = activation(z[l])
return a[L]
BACKWARD(x, y):
// Forward pass to compute activations
activations = FORWARD(x)
// Output layer gradient
δ[L] = ∂L/∂z[L] = (a[L] - y) * activation'(z[L])
// Backward pass
for l = L-1 downto 1:
δ[l] = (W[l+1].T @ δ[l+1]) * activation'(z[l])
// Compute parameter gradients
for l = 1 to L:
dW[l] = δ[l] @ a[l-1].T / n
db[l] = mean(δ[l], axis=1)
return dW, db

Update rule (SGD):

W[l]W[l]αdW[l]W[l] \leftarrow W[l] - \alpha \cdot dW[l] b[l]b[l]αdb[l]b[l] \leftarrow b[l] - \alpha \cdot db[l]

4.5 Universal Approximation Theorem

A feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of Rd\mathbb{R}^d to arbitrary precision.

Caveat: The theorem does not say the network will learn readily, or that a shallow network is practical.

5. Loss Functions

5.1 Regression Losses

Mean Squared Error (MSE):

LMSE=1ni=1n(yiy^i)2\mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Sensitive to outliers (quadratic penalty).

Mean Absolute Error (MAE):

LMAE=1ni=1nyiy^i\mathcal{L}_{\text{MAE}} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

More robust to outliers (linear penalty).

Huber loss: MSE for small errors, MAE for large errors:

Lδ={12(yy^)2if yy^δδ(yy^δ2)otherwise\mathcal{L}_\delta = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\ \delta(|y - \hat{y}| - \frac{\delta}{2}) & \text{otherwise} \end{cases}

5.2 Classification Losses

Binary Cross-Entropy:

LBCE=1ni=1n[yilogy^i+(1yi)log(1y^i)]\mathcal{L}_{\text{BCE}} = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)]

Categorical Cross-Entropy:

LCE=k=1Kyklogy^k\mathcal{L}_{\text{CE}} = -\sum_{k=1}^{K} y_k \log \hat{y}_k

Hinge loss (SVM):

Lhinge=max(0,1yy^)\mathcal{L}_{\text{hinge}} = \max(0, 1 - y \cdot \hat{y})

5.3 Loss Selection Guide

TaskLoss Function
RegressionMSE, MAE, Huber
BinaryBinary Cross-Entropy
Multi-classCategorical Cross-Entropy
ImbalancedFocal loss, weighted CE

6. Regularization

6.1 L1 Regularization (Lasso)

LL1=Lbase+λj=1dwj\mathcal{L}_{\text{L1}} = \mathcal{L}_{\text{base}} + \lambda \sum_{j=1}^{d} |w_j|

Effect: Produces sparse solutions (drives some weights to exactly zero). Useful for feature selection.

6.2 L2 Regularization (Ridge)

LL2=Lbase+λj=1dwj2=Lbase+λw2\mathcal{L}_{\text{L2}} = \mathcal{L}_{\text{base}} + \lambda \sum_{j=1}^{d} w_j^2 = \mathcal{L}_{\text{base}} + \lambda \|\mathbf{w}\|^2

Effect: Penalizes large weights. Shrinks all weights toward zero but rarely makes them exactly zero.

Update rule (L2):

wwα(L+2λw)\mathbf{w} \leftarrow \mathbf{w} - \alpha \left(\nabla \mathcal{L} + 2\lambda \mathbf{w}\right)

6.3 Elastic Net

Combines L1 and L2:

L=Lbase+λ1w1+λ2w2\mathcal{L} = \mathcal{L}_{\text{base}} + \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|^2

Benefits of both: sparse features + stable coefficients.

6.4 Dropout

During training, randomly set neuron activations to zero with probability pp:

DROPOUT_LAYER(a, p, training):
if training:
mask = random_bernoulli(1-p, size=a.shape) / (1-p)
return a * mask
else:
return a

Effect: Prevents co-adaptation of neurons. Implicit ensemble of sub-networks. Test time: no dropout (use full network, scaled by 1p1-p).

6.5 Early Stopping

Monitor validation loss during training. Stop when validation loss starts increasing:

EARLY_STOPPING(train_fn, patience):
best_val_loss = INF
patience_counter = 0
for epoch in epochs:
train_fn(epoch)
val_loss = evaluate_validation()
if val_loss < best_val_loss:
best_val_loss = val_loss
save_model()
patience_counter = 0
else:
patience_counter += 1
if patience_counter >= patience:
break
return best_model

6.6 Regularization Comparison

MethodEffectWhen to Use
L1 (Lasso)Sparsity, feature selectionMany irrelevant features
L2 (Ridge)Weight shrinkageAll features potentially useful
Elastic NetBothCorrelated features
DropoutPrevents co-adaptationNeural networks
Early stoppingPrevents overfittingAll models, especially NNs

7. Evaluation Metrics

7.1 Confusion Matrix

Predicted PositivePredicted Negative
Actual PositiveTP (True Positive)FN (False Negative)
Actual NegativeFP (False Positive)TN (True Negative)

7.2 Classification Metrics

Accuracy: TP+TNTP+TN+FP+FN\frac{TP + TN}{TP + TN + FP + FN}

Misleading with imbalanced classes (e.g., 99% negative: predict all negative → 99% accuracy).

Precision: TPTP+FP\frac{TP}{TP + FP} (of all positive predictions, how many were correct?)

Recall (Sensitivity): TPTP+FN\frac{TP}{TP + FN} (of all actual positives, how many were found?)

F1 Score: Harmonic mean of precision and recall:

F1=2PrecisionRecallPrecision+RecallF_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Specificity: TNTN+FP\frac{TN}{TN + FP} (of all actual negatives, how many were correctly identified?)

FβF_\beta Score: Weighted harmonic mean (gives β\beta times more importance to recall):

Fβ=(1+β2)PrecisionRecall(β2Precision)+RecallF_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{(\beta^2 \cdot \text{Precision}) + \text{Recall}}

7.3 ROC Curve and AUC

ROC curve: Plots TPR (Recall) vs. FPR (FPFP+TN\frac{FP}{FP + TN}) at various classification thresholds.

AUC (Area Under Curve):

  • AUC = 1.0: Perfect classifier.
  • AUC = 0.5: Random guessing.
  • AUC < 0.5: Worse than random (invert predictions).

Interpretation: Probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example.

7.4 Regression Metrics

MSE: 1n(yiy^i)2\frac{1}{n}\sum(y_i - \hat{y}_i)^2

RMSE: MSE\sqrt{\text{MSE}} (same units as yy)

MAE: 1nyiy^i\frac{1}{n}\sum|y_i - \hat{y}_i|

R2R^2 (Coefficient of Determination):

R2=1(yiy^i)2(yiyˉ)2R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}

R2=1R^2 = 1: Perfect fit. R2=0R^2 = 0: Predicts mean. R2<0R^2 < 0: Worse than predicting mean.

7.5 Cross-Validation

k-fold cross-validation:

K_FOLD_CV(model_fn, X, y, k):
indices = shuffle(range(len(X)))
fold_size = len(X) // k
scores = []
for i in 0 to k-1:
val_idx = indices[i*fold_size : (i+1)*fold_size]
train_idx = all other indices
model = model_fn(X[train_idx], y[train_idx])
score = evaluate(model, X[val_idx], y[val_idx])
scores.append(score)
return mean(scores), std(scores)

Stratified k-fold: Maintains class distribution in each fold (important for imbalanced data).

8. Bias-Variance Tradeoff

8.1 Decomposition

For a model predicting f^(x)\hat{f}(\mathbf{x}) of true function f(x)f(\mathbf{x}) with noise ϵ\epsilon:

Expected Error=Bias2+Variance+Irreducible Error\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}

where:

Bias2=(E[f^(x)]f(x))2\text{Bias}^2 = (\mathbb{E}[\hat{f}(\mathbf{x})] - f(\mathbf{x}))^2 Variance=E[(f^(x)E[f^(x)])2]\text{Variance} = \mathbb{E}[(\hat{f}(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})])^2] Irreducible Error=σϵ2\text{Irreducible Error} = \sigma^2_\epsilon

8.2 Tradeoff

Model ComplexityBiasVarianceTotal Error
Low (underfit)HighLowHigh
OptimalModerateModerateMinimal
High (overfit)LowHighHigh

8.3 Overfitting and Underfitting

Overfitting: Model learns noise in training data. Training error \ll validation error.

Signs:

  • High training accuracy, low validation accuracy.
  • Large gap between training and test performance.

Remedies:

  • More training data.
  • Reduce model complexity (fewer parameters, shallower tree, smaller network).
  • Regularization (L1, L2, dropout).
  • Early stopping.
  • Cross-validation for hyperparameter tuning.

Underfitting: Model fails to capture the underlying pattern. Both training and validation error are high.

Remedies:

  • Increase model complexity.
  • Add more features.
  • Reduce regularization.
  • Train longer.

8.4 Learning Curves

Plot training and validation error vs. training set size:

  • High bias (underfit): Both curves plateau at high error, close together.
  • High variance (overfit): Training error low, validation error high with a gap.
  • Good fit: Both errors low and converging.

9. Common Pitfalls

  1. Training without splitting data. Evaluating on training data gives overly optimistic metrics. Always use train/validation/test splits or cross-validation.

  2. Using accuracy on imbalanced datasets. A 99% negative dataset with a model that always predicts negative achieves 99% accuracy but learns nothing. Use precision, recall, F1, or AUC instead.

  3. Leaking information from test to train. Fitting preprocessing (scaling, imputation) on the entire dataset before splitting. Fit transformers on training data only, then apply to validation/test.

  4. Ignoring feature scaling. Gradient descent converges faster with standardized features. Unscaled features cause slow convergence and may bias regularized models.

  5. Setting learning rate too high or too low. Too high: loss diverges or oscillates. Too low: training is prohibitively slow. Use learning rate schedules or adaptive optimizers (Adam, RMSprop).

  6. Overfitting to validation set through hyperparameter tuning. Repeated tuning on the same validation set causes “validation leakage.” Use nested cross-validation or a held-out test set touched only once.

  7. Ignoring data distribution shifts. If training and test data come from different distributions (e.g., different time periods), model performance may degrade. Monitor for covariate shift and concept drift.

Worked Examples

Example 1: Perceptron Learning Algorithm

Problem: Train a perceptron to learn the AND function with inputs (0,0)->0, (0,1)->0, (1,0)->0, (1,1)->1. Initial weights w=(0,0), bias b=0, learning rate eta=1. Solution: (0,0): 0>=0, correct. (0,1): 0>=0, correct. (1,0): 0>=0, correct. (1,1): 0>=0, incorrect. Update: w1+=11=1, w2+=11=1, b+=-10=0. Next epoch: (1,1): 1+1=2>=0, correct. (0,1): 0+1=1>=0, correct (should be 0 — error). Update: w1+=10=1, w2+=1*(-1)=0, b+=-11=-1. (0,1): 0+0-1=-1<0, correct. (1,0): 1+0-1=0>=0, correct (should be 0 — error). Update: w1+=1(-1)=0, w2+=1*0=0, b+=-1*(-1)=0. Converged: (0,0)->0, (0,1)->0, (1,0)->0, (1,1)->0 (wrong!). Perceptron with this update rule may not converge for all functions.

Example 2: K-Nearest Neighbours Classification

Problem: Given training data: (1,1)->A, (1,3)->A, (3,1)->B, (3,3)->B. Classify the point (2,2) using k=3. Solution: Distances: to (1,1)=sqrt(2)=1.41 (A), to (1,3)=sqrt(2)=1.41 (A), to (3,1)=sqrt(2)=1.41 (B), to (3,3)=sqrt(2)=1.41 (B). All distances are equal. Nearest 3 points depend on tie-breaking; any combination of 3 points yields either 2A+1B or 2B+1A. The point (2,2) lies equidistant from all training data, so the classification is ambiguous.

Summary

  • Linear regression (OLS, gradient descent) predicts continuous targets; logistic regression predicts probabilities for binary classification.
  • Decision trees split on features using impurity measures (Gini, entropy); ensembles (Random Forest, Gradient Boosting) improve performance.
  • Neural networks learn hierarchical features through backpropagation with activation functions (ReLU, sigmoid, softmax).
  • Loss functions (MSE, cross-entropy, Huber) measure prediction error and guide optimization.
  • Regularization (L1, L2, dropout, early stopping) prevents overfitting by penalizing complexity.
  • Evaluation metrics (accuracy, precision, recall, F1, ROC-AUC, R2R^2) must match the task and class distribution.
  • Bias-variance tradeoff guides model complexity: underfitting (high bias) vs. overfitting (high variance).

Cross-References

TopicLink
Algorithms OverviewView
Complexity TheoryView
Algorithm DesignView