Machine Learning Fundamentals
1. Linear Regression
1.1 Model
Given features , predict a continuous target :
In matrix form (with bias absorbed): where .
1.2 Ordinary Least Squares (OLS)
Minimize the mean squared error:
Closed-form solution:
Assumptions:
- is invertible (features are not linearly dependent).
- OLS is unbiased: .
1.3 Gradient Descent
When closed-form is expensive ( large), use iterative optimization:
GRADIENT_DESCENT(X, y, α, iterations): w = zeros(d+1) // including bias for t = 1 to iterations: predictions = X @ w gradient = (2/n) * X.T @ (predictions - y) w = w - α * gradient return wLearning rate : Too large → divergence; too small → slow convergence.
Batch gradient descent: Uses all examples per step. Stochastic gradient descent (SGD): Uses one example per step. Noisy but fast. Mini-batch SGD: Uses examples per step. Balances efficiency and stability.
1.4 Variants
SGD(X, y, α, epochs, batch_size): w = zeros(d+1) for epoch = 1 to epochs: shuffle data for batch in get_batches(X, y, batch_size): predictions = batch.X @ w gradient = (2/|batch|) * batch.X.T @ (predictions - batch.y) w = w - α * gradient return w2. Logistic Regression
2.1 Model
Binary classification: predict :
Sigmoid function : Maps any real number to .
2.2 Loss Function: Binary Cross-Entropy
Gradient:
Update rule:
2.3 Multi-Class: Softmax Regression
For classes, use the softmax function:
Loss: Categorical cross-entropy:
where if example belongs to class , else .
3. Decision Trees
3.1 Structure
A tree where:
- Internal nodes test a feature (e.g., “is ?”).
- Leaves assign a class label (or regression value).
- Branches correspond to test outcomes.
3.2 Splitting Criteria (Classification)
Gini impurity:
where is the proportion of class in set .
Information gain (Entropy):
Choose the feature and threshold that maximizes gain (or minimizes impurity).
3.3 Splitting Criteria (Regression)
Reduction in variance:
3.4 Training Algorithm
BUILD_TREE(data, max_depth, min_samples): if all labels same: return Leaf(majority_label) if max_depth == 0 or len(data) < min_samples: return Leaf(majority_label)
best_feature, best_threshold = None, None best_score = -∞ for each feature f: for each threshold t in unique values of f: score = IMPURITY(data, f, t) if score > best_score: best_score = score best_feature = f best_threshold = t
left = data where data[f] <= t right = data where data[f] > t return Node(best_feature, best_threshold, BUILD_TREE(left, max_depth-1, min_samples), BUILD_TREE(right, max_depth-1, min_samples))3.5 Pruning
Pre-pruning: Stop growing early (max depth, min samples per leaf, min impurity decrease).
Post-pruning (cost-complexity pruning):
Grow full tree, then prune branches:
where = training error, = number of leaves, = complexity parameter.
3.6 Ensemble Methods
Bagging (Random Forest):
RANDOM_FOREST(data, n_trees, B): trees = [] for i = 1 to n_trees: sample = bootstrap_sample(data, size = len(data)) tree = BUILD_TREE(sample, random feature subset at each split) trees.append(tree)
PREDICT(x): return majority_vote([tree.predict(x) for tree in trees])Boosting (Gradient Boosting):
GRADIENT_BOOST(data, n_estimators, lr): F(x) = 0 // initial prediction residuals = y for m = 1 to n_estimators: tree = BUILD_TREE(X, residuals, shallow) update = tree.predict(X) F = F + lr * update residuals = y - F // new residuals (negative gradient) return F4. Neural Networks
4.1 Perceptron
A single neuron:
Limitation: Can only learn linearly separable functions (single-layer cannot solve XOR).
4.2 Multi-Layer Perceptron (MLP)
Layers of neurons connected by weights:
Layer l=0 (input): a[0] = x (d neurons)Layer l=1 (hidden): z[1] = W[1]a[0] + b[1] a[1] = activation(z[1]) (h neurons)Layer l=2 (output): z[2] = W[2]a[1] + b[2] a[2] = activation(z[2]) (K neurons)4.3 Activation Functions
| Function | Formula | Range | Properties |
|---|---|---|---|
| Sigmoid | Smooth, vanishing gradient | ||
| Tanh | Zero-centered | ||
| ReLU | Fast, but “dead neurons” | ||
| Leaky ReLU | , | Fixes dead neurons | |
| Softmax | For multi-class output |
4.4 Backpropagation
Compute gradients layer by layer using the chain rule.
FORWARD(x): a[0] = x for l = 1 to L: z[l] = W[l] @ a[l-1] + b[l] a[l] = activation(z[l]) return a[L]
BACKWARD(x, y): // Forward pass to compute activations activations = FORWARD(x)
// Output layer gradient δ[L] = ∂L/∂z[L] = (a[L] - y) * activation'(z[L])
// Backward pass for l = L-1 downto 1: δ[l] = (W[l+1].T @ δ[l+1]) * activation'(z[l])
// Compute parameter gradients for l = 1 to L: dW[l] = δ[l] @ a[l-1].T / n db[l] = mean(δ[l], axis=1)
return dW, dbUpdate rule (SGD):
4.5 Universal Approximation Theorem
A feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of to arbitrary precision.
Caveat: The theorem does not say the network will learn readily, or that a shallow network is practical.
5. Loss Functions
5.1 Regression Losses
Mean Squared Error (MSE):
Sensitive to outliers (quadratic penalty).
Mean Absolute Error (MAE):
More robust to outliers (linear penalty).
Huber loss: MSE for small errors, MAE for large errors:
5.2 Classification Losses
Binary Cross-Entropy:
Categorical Cross-Entropy:
Hinge loss (SVM):
5.3 Loss Selection Guide
| Task | Loss Function |
|---|---|
| Regression | MSE, MAE, Huber |
| Binary | Binary Cross-Entropy |
| Multi-class | Categorical Cross-Entropy |
| Imbalanced | Focal loss, weighted CE |
6. Regularization
6.1 L1 Regularization (Lasso)
Effect: Produces sparse solutions (drives some weights to exactly zero). Useful for feature selection.
6.2 L2 Regularization (Ridge)
Effect: Penalizes large weights. Shrinks all weights toward zero but rarely makes them exactly zero.
Update rule (L2):
6.3 Elastic Net
Combines L1 and L2:
Benefits of both: sparse features + stable coefficients.
6.4 Dropout
During training, randomly set neuron activations to zero with probability :
DROPOUT_LAYER(a, p, training): if training: mask = random_bernoulli(1-p, size=a.shape) / (1-p) return a * mask else: return aEffect: Prevents co-adaptation of neurons. Implicit ensemble of sub-networks. Test time: no dropout (use full network, scaled by ).
6.5 Early Stopping
Monitor validation loss during training. Stop when validation loss starts increasing:
EARLY_STOPPING(train_fn, patience): best_val_loss = INF patience_counter = 0 for epoch in epochs: train_fn(epoch) val_loss = evaluate_validation() if val_loss < best_val_loss: best_val_loss = val_loss save_model() patience_counter = 0 else: patience_counter += 1 if patience_counter >= patience: break return best_model6.6 Regularization Comparison
| Method | Effect | When to Use |
|---|---|---|
| L1 (Lasso) | Sparsity, feature selection | Many irrelevant features |
| L2 (Ridge) | Weight shrinkage | All features potentially useful |
| Elastic Net | Both | Correlated features |
| Dropout | Prevents co-adaptation | Neural networks |
| Early stopping | Prevents overfitting | All models, especially NNs |
7. Evaluation Metrics
7.1 Confusion Matrix
| Predicted Positive | Predicted Negative | |
|---|---|---|
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |
7.2 Classification Metrics
Accuracy:
Misleading with imbalanced classes (e.g., 99% negative: predict all negative → 99% accuracy).
Precision: (of all positive predictions, how many were correct?)
Recall (Sensitivity): (of all actual positives, how many were found?)
F1 Score: Harmonic mean of precision and recall:
Specificity: (of all actual negatives, how many were correctly identified?)
Score: Weighted harmonic mean (gives times more importance to recall):
7.3 ROC Curve and AUC
ROC curve: Plots TPR (Recall) vs. FPR () at various classification thresholds.
AUC (Area Under Curve):
- AUC = 1.0: Perfect classifier.
- AUC = 0.5: Random guessing.
- AUC < 0.5: Worse than random (invert predictions).
Interpretation: Probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example.
7.4 Regression Metrics
MSE:
RMSE: (same units as )
MAE:
(Coefficient of Determination):
: Perfect fit. : Predicts mean. : Worse than predicting mean.
7.5 Cross-Validation
k-fold cross-validation:
K_FOLD_CV(model_fn, X, y, k): indices = shuffle(range(len(X))) fold_size = len(X) // k scores = [] for i in 0 to k-1: val_idx = indices[i*fold_size : (i+1)*fold_size] train_idx = all other indices model = model_fn(X[train_idx], y[train_idx]) score = evaluate(model, X[val_idx], y[val_idx]) scores.append(score) return mean(scores), std(scores)Stratified k-fold: Maintains class distribution in each fold (important for imbalanced data).
8. Bias-Variance Tradeoff
8.1 Decomposition
For a model predicting of true function with noise :
where:
8.2 Tradeoff
| Model Complexity | Bias | Variance | Total Error |
|---|---|---|---|
| Low (underfit) | High | Low | High |
| Optimal | Moderate | Moderate | Minimal |
| High (overfit) | Low | High | High |
8.3 Overfitting and Underfitting
Overfitting: Model learns noise in training data. Training error validation error.
Signs:
- High training accuracy, low validation accuracy.
- Large gap between training and test performance.
Remedies:
- More training data.
- Reduce model complexity (fewer parameters, shallower tree, smaller network).
- Regularization (L1, L2, dropout).
- Early stopping.
- Cross-validation for hyperparameter tuning.
Underfitting: Model fails to capture the underlying pattern. Both training and validation error are high.
Remedies:
- Increase model complexity.
- Add more features.
- Reduce regularization.
- Train longer.
8.4 Learning Curves
Plot training and validation error vs. training set size:
- High bias (underfit): Both curves plateau at high error, close together.
- High variance (overfit): Training error low, validation error high with a gap.
- Good fit: Both errors low and converging.
9. Common Pitfalls
Training without splitting data. Evaluating on training data gives overly optimistic metrics. Always use train/validation/test splits or cross-validation.
Using accuracy on imbalanced datasets. A 99% negative dataset with a model that always predicts negative achieves 99% accuracy but learns nothing. Use precision, recall, F1, or AUC instead.
Leaking information from test to train. Fitting preprocessing (scaling, imputation) on the entire dataset before splitting. Fit transformers on training data only, then apply to validation/test.
Ignoring feature scaling. Gradient descent converges faster with standardized features. Unscaled features cause slow convergence and may bias regularized models.
Setting learning rate too high or too low. Too high: loss diverges or oscillates. Too low: training is prohibitively slow. Use learning rate schedules or adaptive optimizers (Adam, RMSprop).
Overfitting to validation set through hyperparameter tuning. Repeated tuning on the same validation set causes “validation leakage.” Use nested cross-validation or a held-out test set touched only once.
Ignoring data distribution shifts. If training and test data come from different distributions (e.g., different time periods), model performance may degrade. Monitor for covariate shift and concept drift.
Worked Examples
Example 1: Perceptron Learning Algorithm
Problem: Train a perceptron to learn the AND function with inputs (0,0)->0, (0,1)->0, (1,0)->0, (1,1)->1. Initial weights w=(0,0), bias b=0, learning rate eta=1. Solution: (0,0): 0>=0, correct. (0,1): 0>=0, correct. (1,0): 0>=0, correct. (1,1): 0>=0, incorrect. Update: w1+=11=1, w2+=11=1, b+=-10=0. Next epoch: (1,1): 1+1=2>=0, correct. (0,1): 0+1=1>=0, correct (should be 0 — error). Update: w1+=10=1, w2+=1*(-1)=0, b+=-11=-1. (0,1): 0+0-1=-1<0, correct. (1,0): 1+0-1=0>=0, correct (should be 0 — error). Update: w1+=1(-1)=0, w2+=1*0=0, b+=-1*(-1)=0. Converged: (0,0)->0, (0,1)->0, (1,0)->0, (1,1)->0 (wrong!). Perceptron with this update rule may not converge for all functions.
Example 2: K-Nearest Neighbours Classification
Problem: Given training data: (1,1)->A, (1,3)->A, (3,1)->B, (3,3)->B. Classify the point (2,2) using k=3. Solution: Distances: to (1,1)=sqrt(2)=1.41 (A), to (1,3)=sqrt(2)=1.41 (A), to (3,1)=sqrt(2)=1.41 (B), to (3,3)=sqrt(2)=1.41 (B). All distances are equal. Nearest 3 points depend on tie-breaking; any combination of 3 points yields either 2A+1B or 2B+1A. The point (2,2) lies equidistant from all training data, so the classification is ambiguous.
Summary
- Linear regression (OLS, gradient descent) predicts continuous targets; logistic regression predicts probabilities for binary classification.
- Decision trees split on features using impurity measures (Gini, entropy); ensembles (Random Forest, Gradient Boosting) improve performance.
- Neural networks learn hierarchical features through backpropagation with activation functions (ReLU, sigmoid, softmax).
- Loss functions (MSE, cross-entropy, Huber) measure prediction error and guide optimization.
- Regularization (L1, L2, dropout, early stopping) prevents overfitting by penalizing complexity.
- Evaluation metrics (accuracy, precision, recall, F1, ROC-AUC, ) must match the task and class distribution.
- Bias-variance tradeoff guides model complexity: underfitting (high bias) vs. overfitting (high variance).
Cross-References
| Topic | Link |
|---|---|
| Algorithms Overview | View |
| Complexity Theory | View |
| Algorithm Design | View |