Classical Machine Learning: The Foundations Behind Intelligent Systems

March 1, 2026 · 22 min read (57 min read total) · 2 parts

Statistical & Analytical Foundations

Statistics is the backbone of data science. In an interview, always connect math to business meaning. The goal is not to recite formulas but to explain WHY something matters.

Probability Distributions: The Building Blocks

A probability distribution describes how the values of a random variable are spread out. It tells us which outcomes are likely and how likely they are. Some distributions describe discrete outcomes (such as counts or binary events), while others describe continuous values (such as measurements).

Understanding these distributions helps us:

  • model uncertainty
  • understand real-world data
  • build statistical and machine learning models.

Mean and Standard Deviation

Mean (μ)

The mean represents the average value of a distribution. It indicates where the center of the data lies.

If exam scores average 70, then μ = 70.

Standard Deviation (σ)

The standard deviation measures how spread out the data is around the mean.

Small σ → values are close to the mean

Large σ → values are more spread out

μ = 70, σ = 5 → most scores close to 70
μ = 70, σ = 20 → scores vary widely
Tip (The crowd analogy)

Imagine a group photo of people standing in a line. The mean is the person standing in the center of the group, the average position where everyone balances out. The standard deviation tells you how spread out the crowd is from that center person. If everyone stands close to the center, the spread is small (small σ). If people are scattered far away, the spread is large (large σ).
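Both quantities are one-liners in NumPy. A minimal sketch with made-up exam scores matching the μ = 70 example above:

```python
import numpy as np

# Hypothetical exam scores clustered around 70
scores = np.array([65, 68, 70, 70, 72, 75])

mu = scores.mean()          # center of the data
sigma = scores.std(ddof=0)  # population standard deviation

print(f"mean = {mu:.1f}, std = {sigma:.2f}")  # mean = 70.0, std = 3.11
```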

Key Distributions You Must Know Cold

Note (How distributions are connected)

A Bernoulli distribution describes one trial with two outcomes.

  • Now imagine repeating that experiment many times. If you repeat a Bernoulli trial n times and count the number of successes, the result follows a Binomial distribution.
  • Sometimes we care about rare events over time. When the number of trials becomes very large but the probability of success is small, the Binomial distribution behaves like a Poisson distribution.
  • If we look at the average of many observations, something interesting happens. According to the Central Limit Theorem, the distribution of sample averages tends to look Normal (bell-shaped), even if the original data is not perfectly normal.
  • Now consider the time between events. If events happen randomly at a constant average rate (as modeled by the Poisson distribution), the waiting time between those events follows an Exponential distribution.
  • Sometimes we are uncertain about probabilities themselves. The Beta distribution is often used to represent uncertainty about a probability value, such as the true success rate of a Bernoulli process.
  • Finally, when working with small datasets, we cannot estimate variability very precisely. In that case, we use the t-distribution, which is similar to the normal distribution but has heavier tails to account for extra uncertainty.
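One of these links is easy to check numerically. The sketch below compares a Binomial with many trials and a small success probability against the Poisson with rate n·p; the numbers (n = 1000, p = 0.003) are arbitrary illustrative choices:

```python
from scipy import stats

# Many trials, small success probability -> Binomial approaches Poisson(n*p)
n, p = 1000, 0.003
lam = n * p

# Probability of exactly 3 successes under each distribution
binom_pmf = stats.binom.pmf(3, n, p)
pois_pmf = stats.poisson.pmf(3, lam)
print(f"Binomial: {binom_pmf:.4f}, Poisson: {pois_pmf:.4f}")  # nearly identical
```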
Bernoulli (p = 0.3)

A single binary trial — success (1) or failure (0). The atomic building block of all discrete distributions.

Key properties

  • E[X] = p = 0.3
  • Var[X] = p(1−p) = 0.21
  • Special case: Binomial with n=1

Use cases

  • Single coin flip
  • Single ad click
  • One pass/fail test
Important

If asked ‘When would you use Poisson vs Normal?’ say: Poisson for count data with rare events in fixed intervals; Normal for continuous measurements when the CLT applies. Always give a concrete business example.

Hypothesis Testing

Hypothesis testing is a statistical framework used to make decisions based on data. It helps determine whether an observed effect is real or simply due to random chance. Understanding this process is fundamental in statistics, data science, and machine learning.

The 5-Step Framework

  1. State the hypotheses, define two competing statements:

    • H₀ (Null Hypothesis): No effect or no difference exists.
    • H₁ (Alternative Hypothesis): There is an effect or difference.
  2. Choose a significance level (α). Select the threshold for accepting false positives. A common choice is α = 0.05, meaning we accept a 5% chance of incorrectly rejecting H₀.

  3. Collect data and compute a test statistic. Use the sample data to calculate a statistic such as z, t, χ² (chi-square), or F, depending on the test.

  4. Compute the p-value. The p-value represents the probability of observing results at least as extreme as the current data, assuming the null hypothesis is true.

  5. Make a decision

    • If p < α, reject the null hypothesis.
    • If p ≥ α, fail to reject the null hypothesis.
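The five steps map directly onto a few lines of SciPy. A sketch with a hypothetical sample, testing whether its mean differs from a claimed value of 70:

```python
from scipy import stats

# Step 1: H0: mean = 70, H1: mean != 70.  Step 2: alpha = 0.05
alpha = 0.05
sample = [74, 68, 75, 71, 73, 70, 76, 72, 69, 74]  # hypothetical scores

# Steps 3-4: test statistic and p-value in one call
t_stat, p_value = stats.ttest_1samp(sample, popmean=70)

# Step 5: decide
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.2f}, p = {p_value:.3f} -> {decision}")  # p < 0.05 here
```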

Type I and Type II Errors

When performing a hypothesis test, two types of mistakes can occur. Type I is when we reject the null hypothesis even though it is actually true. In other words, we detect an effect that does not really exist. Type II is when we fail to reject the null hypothesis even though there is a real effect. In other words, we miss something that actually exists.

Type I → Seeing something that isn't there.
Type II → Missing something that is there.

Confidence Intervals

CI = point estimate ± z × (standard deviation / √n)

A confidence interval (CI) gives a range of values where the true value is likely to lie based on the data we collected. Think of it as an estimated range for the real answer.

For example, if we estimate the average height of a population to be 170 cm, a 95% confidence interval might look like:

167 cm — 173 cm
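The interval comes straight from the formula above. A sketch with made-up height measurements, using the z value for simplicity (with a sample this small, the t-distribution would be the stricter choice):

```python
import numpy as np
from scipy import stats

# Hypothetical heights with a sample mean near 170 cm
heights = np.array([168, 172, 169, 171, 170, 173, 167, 170])
n = len(heights)
mean = heights.mean()
se = heights.std(ddof=1) / np.sqrt(n)  # standard error of the mean

z = stats.norm.ppf(0.975)              # 1.96 for a 95% interval
ci_low, ci_high = mean - z * se, mean + z * se
print(f"95% CI: {ci_low:.1f} cm - {ci_high:.1f} cm")  # 168.6 cm - 171.4 cm
```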

Correlation vs Causation

Correlation measures how strongly two variables move together. It ranges from –1 to +1:

  • +1 → perfect positive relationship (both increase together)

  • –1 → perfect negative relationship (one increases while the other decreases)

  • 0 → no linear relationship

However, correlation does not mean causation. Causation means that one variable directly causes a change in another.
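np.corrcoef makes the r scale concrete. A sketch with synthetic data where one pair moves together and the other is independent (the coefficients and noise scale are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_pos = 2 * x + rng.normal(scale=0.5, size=200)  # moves with x
y_none = rng.normal(size=200)                    # unrelated to x

r_pos = np.corrcoef(x, y_pos)[0, 1]    # close to +1
r_none = np.corrcoef(x, y_none)[0, 1]  # close to 0
print(f"correlated: {r_pos:.2f}, unrelated: {r_none:.2f}")
```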

Machine Learning algorithms

When learning any machine learning algorithm, aim to understand it at three levels:

  1. Intuition: what problem the algorithm solves and how it works conceptually
  2. Mathematical foundation: the objective function and optimization method
  3. Practical usage: when to use it and how to tune its hyperparameters

Linear Regression — The Foundation

Linear regression models the relationship between a target variable Y and input features X as a linear combination. It is one of the simplest and most interpretable models in machine learning.

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \epsilon

Where:

  • \beta_0 = intercept (the baseline value when all features are zero)
  • \beta_i = how much Y changes for a one-unit increase in feature x_i
  • \epsilon = random noise the model cannot explain

The loss function being minimized is Mean Squared Error — the average of squared prediction errors:

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n}(y_i - \hat{y}_i)^2

The closed-form solution — called Ordinary Least Squares — finds the optimal coefficients directly:

\beta = (X^T X)^{-1} X^T y
Intuition (What is linear regression really doing?)

Imagine plotting house sizes against prices on a scatter plot. Linear regression draws the one straight line that minimises the total squared distance from every point to the line. It is the “line of least regret” — wrong about every point by the smallest possible amount overall.
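The closed-form OLS solution is a few lines of NumPy. A sketch on synthetic house-size data (true intercept 30, true slope 0.5), applying the normal equation literally; in practice np.linalg.lstsq is the numerically safer call:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(50, 200, size=n)                # house size (m^2)
y = 30 + 0.5 * x + rng.normal(scale=5, size=n)  # price = 30 + 0.5*size + noise

X = np.column_stack([np.ones(n), x])      # prepend an intercept column
beta = np.linalg.inv(X.T @ X) @ X.T @ y   # beta = (X^T X)^{-1} X^T y
print(beta)  # recovers roughly [30, 0.5]
```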

Assumptions (the LINE acronym)

Every statistical test you run on a linear model assumes these hold. Violating them does not break the model but it does break your inference.

| Letter | Assumption | Consequence if violated |
|---|---|---|
| L | Linearity — the relationship between X and Y is linear | Model is systematically biased |
| I | Independence — residuals do not depend on each other | Standard errors underestimated |
| N | Normality — residuals are normally distributed | Hypothesis tests become invalid |
| E | Equal variance (homoscedasticity) — residual spread is constant | Inefficient estimates, invalid CIs |

Regularization — Keeping the Model Honest

A model that fits training data perfectly is usually memorising noise, not learning the pattern. Regularization adds a penalty to the loss function that discourages overly large coefficients.

| Method | Penalty added | Behaviour | Best for |
|---|---|---|---|
| Ridge (L2) | \lambda \sum \beta_i^2 | Shrinks all coefficients toward zero, never exactly zero | Multicollinearity; all features likely relevant |
| Lasso (L1) | \lambda \sum \|\beta_i\| | Can zero out coefficients entirely — automatic feature selection | Many irrelevant features; want a sparse model |
| ElasticNet | \lambda_1 L_1 + \lambda_2 L_2 | Combines both effects | High-dimensional data with correlated features |

\lambda is the regularization strength — larger \lambda = heavier penalty. Tune it with cross-validation.

Intuition (The rubber band analogy)

Without regularization, a curve wiggles wildly to pass through every noisy data point. Ridge adds a rubber band that pulls the curve straight — it still fits the data but cannot wiggle too much. Lasso goes further: it can snap the band entirely for features it deems irrelevant, removing them from the model.
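The snapping behaviour is easy to see on synthetic data. In the sketch below only the first two of five features matter; the alpha values are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only features 0 and 1 drive y; features 2-4 are pure noise
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)

print(ridge.coef_.round(3))  # everything shrunk a little, nothing exactly zero
print(lasso.coef_.round(3))  # irrelevant features snapped to exactly zero
```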

Logistic Regression

Despite the name, logistic regression is a classification model. It predicts the probability that an observation belongs to a class by applying a sigmoid function to a linear combination of features.

P(y=1 \mid x) = \sigma(w^T x) = \frac{1}{1 + e^{-w^T x}}

The sigmoid squishes any real number into (0, 1), making the output a valid probability.

Loss function — cross-entropy (log loss), not MSE:

\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\Big]
Important (Why not MSE for classification?)

MSE combined with the sigmoid creates a non-convex loss surface full of local minima — gradient descent can get stuck. Cross-entropy is convex, which guarantees convergence to the global minimum.

The default decision boundary is P > 0.5 → class 1, but this threshold is adjustable. Moving it lower increases recall (catches more positives at the cost of more false alarms).

Bias-Variance Tradeoff

Every model’s total error decomposes into three parts:

\text{Total Error} = \underbrace{\text{Bias}^2}_{\text{systematic error}} + \underbrace{\text{Variance}}_{\text{sensitivity to data}} + \underbrace{\text{Irreducible Noise}}_{\text{can't fix this}}

|  | High Bias (Underfitting) | High Variance (Overfitting) |
|---|---|---|
| Cause | Model too simple, wrong assumptions | Model too complex, memorises noise |
| Training error | High | Low |
| Test error | High | High |
| Example | Linear model on curved data | Deep decision tree on 100 rows |
| Fix | More features, complex model, less regularization | More data, regularization, simpler model, ensembles |
Intuition (The dartboard analogy)

Imagine throwing darts at a bullseye. High bias — darts consistently land to the left of centre (systematic aim error, wrong assumptions). High variance — darts scattered all over the board (inconsistent, sensitive to tiny hand movements). The ideal model has both low bias and low variance: darts clustered around the bullseye.

Tip (Interview visual)

Draw the U-shaped test error curve against model complexity. Show training error always decreasing, test error bottoming out then rising. This single diagram communicates the entire tradeoff and tends to impress interviewers who expect only verbal explanations.

Decision Trees

A decision tree splits data by asking yes/no questions on features, building a hierarchy of if-else rules. Each leaf node holds a prediction.

How the tree decides which question to ask — it picks the split that gives the greatest reduction in impurity:

\text{Gini Impurity} = 1 - \sum_i p_i^2 \qquad \text{(lower = purer node)}

\text{Entropy} = -\sum_i p_i \log_2(p_i)

\text{Information Gain} = \text{Entropy(parent)} - \text{weighted Entropy(children)}
| Hyperparameter | What it controls |
|---|---|
| max_depth | How deep the tree grows — lower depth = simpler model |
| min_samples_split | Minimum samples needed to attempt a split |
| min_samples_leaf | Minimum samples required in any leaf |
| max_features | Number of features randomly considered per split |
Intuition (Decision trees are flowcharts)

A decision tree is exactly the kind of yes/no flowchart you would draw on a whiteboard: “Is the customer’s age above 30? If yes, is their income above $50k? If no, predict churn.” The algorithm automates finding the best questions to ask and in what order — based purely on what maximises class purity at each step.
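The impurity formulas above can be verified by hand for a perfectly separating split. A minimal sketch on an 8-sample toy node:

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 50/50 -> maximally impure
left, right = parent[:4], parent[4:]         # a perfect split

print(gini(parent))     # 0.5  (worst case for two classes)
print(entropy(parent))  # 1.0 bit
gain = entropy(parent) - (0.5 * entropy(left) + 0.5 * entropy(right))
print(gain)             # information gain = 1.0 (children are pure)
```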

Ensemble Methods — The Interview Workhorses

Random Forest

Random Forest trains many deep decision trees independently on different random subsets of data and features, then averages their predictions.

Why does this work? Each tree makes different errors because it sees different data and features. Averaging many imperfect trees cancels out individual mistakes, reducing variance without increasing bias.

Key ideas:

  • Bagging — each tree is trained on a bootstrap sample (roughly 63% of data, sampled with replacement)
  • Random feature subsets — at each split, only a random subset of features is considered. This is what decorrelates the trees and makes RF better than plain bagging.
  • Out-of-bag (OOB) error — the ~37% of samples not used in each tree act as a free validation set, no cross-validation needed.
Tip (When to reach for Random Forest)

Tabular data, high-dimensional features, quick reliable baseline, and situations where you need feature importance without extra work. Robust to outliers and noisy features.
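Both the OOB validation estimate and the free feature importances are one flag away in scikit-learn. A sketch on a synthetic dataset (make_classification parameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic tabular data: 20 features, only 5 informative
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=42)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

print(f"OOB accuracy: {rf.oob_score_:.3f}")         # free validation estimate
print(rf.feature_importances_.argsort()[::-1][:5])  # top 5 features, no extra work
```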

Gradient Boosting (XGBoost, LightGBM, CatBoost)

Where Random Forest builds trees in parallel, Gradient Boosting builds them sequentially. Each new tree corrects the mistakes of the current ensemble by fitting the residual errors (the negative gradient of the loss).

F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)

where h_m is the new tree correcting the residuals, and \eta is the learning rate that controls how much weight each new tree gets.

| Library | Key innovation |
|---|---|
| XGBoost | Regularization in the tree objective; handles missing values; column subsampling |
| LightGBM | Leaf-wise growth (faster); histogram-based splitting (memory efficient) |
| CatBoost | Built-in categorical encoding; ordered boosting prevents target leakage |
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=1000,         # Set high — early stopping will find the real number
    learning_rate=0.05,        # Lower = better generalization, slower training
    max_depth=4,               # Keep shallow for boosting (3–6 is typical)
    subsample=0.8,             # Fraction of rows per tree
    colsample_bytree=0.8,      # Fraction of features per tree
    early_stopping_rounds=50,  # Stop if val metric doesn't improve for 50 rounds
    eval_metric='auc',
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=100)
Important (Always use early stopping with boosting)

Setting early_stopping_rounds is the single most impactful practice when using gradient boosting. It automatically finds the optimal number of trees by monitoring held-out performance, preventing overfitting without manually tuning n_estimators.

Model Evaluation — Choosing the Right Metric

The choice of metric is as important as the choice of model. Using the wrong one can lead you to ship a model that looks great on paper but fails in production.

Classification Metrics

| Metric | What it measures | Use when |
|---|---|---|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Classes are balanced — misleading otherwise |
| Precision | TP/(TP+FP) | False positives are costly (spam filter, fraud alert) |
| Recall | TP/(TP+FN) | False negatives are costly (cancer screening, safety) |
| F1 Score | 2 \cdot PR/(P+R) | Need balance between precision and recall |
| ROC-AUC | Area under ROC curve | Threshold-invariant; good general-purpose metric |
| PR-AUC | Area under Precision-Recall curve | Heavily imbalanced datasets |

Regression Metrics

| Metric | Formula | Use when |
|---|---|---|
| MAE | mean of \|y_i - \hat{y}_i\| | Outliers present; interpretable in original units |
| MSE | mean of (y_i - \hat{y}_i)^2 | Want to heavily penalize large errors |
| RMSE | \sqrt{\text{MSE}} | Same units as target — easier to communicate |
| MAPE | mean of \|y_i - \hat{y}_i\|/y_i \times 100\% | Percentage error — intuitive for business audiences |
| R² | 1 - SS_{res}/SS_{tot} | Proportion of variance explained (0 to 1) |
Important (Always ask this before choosing a metric)

What is the cost of a false positive versus a false negative in your business context? A fraud detection model where missing fraud costs $10,000 but a false alert costs $5 should optimise for recall, not accuracy. This question signals business maturity to interviewers.
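The precision/recall definitions above are quick to verify on a toy prediction vector (the counts are chosen by hand so the arithmetic is easy to check):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]  # 3 TP, 1 FN, 2 FP, 4 TN

print(precision_score(y_true, y_pred))  # 3 / (3 + 2) = 0.6
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print(f1_score(y_true, y_pred))         # 2*0.6*0.75 / 1.35 ≈ 0.667
```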


Python for Data Science

Pandas — Core Operations

import pandas as pd
import numpy as np
# Load and inspect
df = pd.read_csv('data.csv')
print(df.shape, df.dtypes, df.describe(), df.isnull().sum())
# Filtering
df[df['age'] > 30]
df.query('age > 30 and salary > 50000')
# GroupBy — the most tested operation in DS interviews
df.groupby('segment')['revenue'].agg(['mean', 'sum', 'count']).reset_index()
# Apply a custom function row-by-row
df['new_col'] = df['col'].apply(lambda x: x ** 2 if x > 0 else 0)
# Pivot table
df.pivot_table(values='sales', index='region', columns='quarter', aggfunc='sum')
# Merge (like SQL JOIN)
pd.merge(df1, df2, on='customer_id', how='left')
# Handle missing values
df['col'].fillna(df['col'].median()) # impute with median
df.dropna(subset=['critical_col']) # drop rows where this column is missing

Scikit-learn — The Interview Pipeline

Always wrap preprocessing and model steps in a Pipeline. This is not just a style preference — it prevents data leakage, the single most common mistake in ML interviews and production systems.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
numeric_features = ['age', 'income']
categorical_features = ['region', 'segment']
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42)),
])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
print('ROC-AUC:', roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1]))
Important (Why Pipeline prevents data leakage)

If you fit a StandardScaler on the full dataset before the train/test split, the scaler has seen the test set’s mean and variance — your model has indirectly peeked at unseen data. Wrapping everything in Pipeline guarantees that any fitting step (scaling, encoding) only ever sees training data, keeping the test set truly held-out.

Handling Class Imbalance

When one class is far rarer than another (e.g., 1% fraud, 99% legitimate), accuracy becomes meaningless — a model that always predicts “not fraud” is 99% accurate and completely useless. Four strategies in order of how to try them:

# 1. Class weights — easiest, try this first
RandomForestClassifier(class_weight='balanced')
# 2. SMOTE — creates synthetic minority samples to balance the dataset
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
# 3. Threshold tuning — 0.5 is an arbitrary default
# Lower threshold = more recall (catches more positives, more false alarms)
y_pred = (model.predict_proba(X_test)[:, 1] > 0.3).astype(int)
# 4. Evaluate with PR-AUC, not ROC-AUC
from sklearn.metrics import average_precision_score
pr_auc = average_precision_score(y_test, model.predict_proba(X_test)[:, 1])

SQL — Window Functions & Advanced Patterns

SQL is tested at every senior data science interview. Knowing window functions is what separates junior from senior answers.

Window Functions

A window function computes a value across a set of rows related to the current row without collapsing the result into one row — unlike GROUP BY. The OVER() clause defines which rows to look at.

-- General syntax
function() OVER (PARTITION BY col ORDER BY col ROWS BETWEEN ... AND ...)

Ranking

SELECT *,
ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC) AS row_num,
RANK() OVER (PARTITION BY region ORDER BY sales DESC) AS rank,
DENSE_RANK() OVER (PARTITION BY region ORDER BY sales DESC) AS dense_rank
FROM sales_table;
| Function | Tie behaviour | Example output |
|---|---|---|
| ROW_NUMBER() | Always unique | 1, 2, 3, 4 |
| RANK() | Skips after tie | 1, 2, 2, 4 |
| DENSE_RANK() | No skip after tie | 1, 2, 2, 3 |

Running Totals and Moving Averages

SELECT date, sales,
SUM(sales) OVER (ORDER BY date) AS running_total,
AVG(sales) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7day
FROM daily_sales;

LAG / LEAD — Compare Rows to Previous or Next

SELECT date, revenue,
LAG(revenue, 1) OVER (ORDER BY date) AS prev_day_revenue,
LEAD(revenue, 1) OVER (ORDER BY date) AS next_day_revenue,
revenue - LAG(revenue, 1) OVER (ORDER BY date) AS day_over_day_change
FROM daily_revenue;

NTILE — Percentile Buckets

SELECT customer_id, lifetime_value,
NTILE(4) OVER (ORDER BY lifetime_value DESC) AS ltv_quartile,
NTILE(10) OVER (ORDER BY lifetime_value DESC) AS ltv_decile
FROM customers;

CTEs and Complex Patterns

-- Top 3 customers by revenue per region
WITH customer_revenue AS (
    SELECT customer_id, region,
           SUM(order_amount) AS total_revenue
    FROM orders
    WHERE order_date >= '2024-01-01'
    GROUP BY customer_id, region
),
ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_revenue DESC) AS rn
    FROM customer_revenue
)
SELECT * FROM ranked WHERE rn <= 3;
-- Month-over-month growth rate
WITH monthly AS (
    SELECT DATE_TRUNC('month', order_date) AS month,
           SUM(amount) AS revenue
    FROM orders
    GROUP BY 1
)
SELECT month,
       revenue,
       LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
       ROUND(
           100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
           / NULLIF(LAG(revenue) OVER (ORDER BY month), 0),
           2
       ) AS mom_growth_pct
FROM monthly;

Classic Interview SQL Problems

-- Second highest salary
SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1;
-- Customers who bought in January but not February
SELECT DISTINCT customer_id FROM orders WHERE MONTH(order_date) = 1
EXCEPT
SELECT DISTINCT customer_id FROM orders WHERE MONTH(order_date) = 2;
-- First date cumulative sales crossed 1 million
WITH running AS (
    SELECT date, SUM(sales) OVER (ORDER BY date) AS cum_sales
    FROM daily_sales
)
SELECT MIN(date) FROM running WHERE cum_sales >= 1000000;
-- 7-day retention: users who were active 7 days after their first activity
WITH first_seen AS (
    SELECT user_id, MIN(activity_date) AS cohort_date
    FROM user_activity
    GROUP BY user_id
)
SELECT f.user_id
FROM first_seen f
JOIN user_activity a
  ON f.user_id = a.user_id
 AND a.activity_date = f.cohort_date + INTERVAL '7 days';

A/B Testing & Causal Inference

A/B Test Design — End to End

Before running an experiment, answer six questions. Skipping any of them leads to results you cannot trust.

  1. Primary metric — one metric only. Testing many metrics simultaneously inflates the false positive rate.
  2. Minimum Detectable Effect (MDE) — what is the smallest improvement that is meaningful to the business?
  3. Required sample size — calculated from significance level, power, MDE, and baseline conversion rate.
  4. Randomization unit — user, session, or request? The unit must match the level at which the metric is measured.
  5. Duration — run until the target sample size is reached, and for at least one full business cycle (e.g., one full week to capture weekday/weekend behaviour).
  6. Guardrail metrics — metrics that must not worsen (e.g., latency, revenue per user) even if the primary metric improves.

Sample Size Calculation

from scipy import stats
import numpy as np
def sample_size(baseline_rate, mde, alpha=0.05, power=0.80):
    """
    baseline_rate : current conversion rate, e.g. 0.10 for 10%
    mde           : minimum detectable effect, e.g. 0.02 to detect a 2pp lift
    alpha         : significance level (false positive rate)
    power         : 1 - beta (probability of detecting a real effect)
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde
    pooled = (p1 + p2) / 2
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-tailed
    z_beta = stats.norm.ppf(power)
    n = (
        z_alpha * np.sqrt(2 * pooled * (1 - pooled)) +
        z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2 / mde ** 2
    return int(np.ceil(n))

# 10% baseline, want to detect a 2 percentage point lift
print(sample_size(0.10, 0.02))  # → ~3,841 per group

Common A/B Testing Pitfalls

Note (The peeking problem)

Checking results before the planned sample size is reached inflates the Type I error rate dramatically. If you peek five times during an experiment, your effective false positive rate is closer to 22% than 5%. Commit to a fixed sample design, or use sequential testing methods (SPRT) that are designed for continuous monitoring.

Note (Network effects and SUTVA violation)

If users in control and treatment interact with each other (social features, marketplaces, referral programs), the treatment bleeds into the control group and the experiment is contaminated. Use cluster randomization — assign entire geographies, cohorts, or friend groups to one arm — instead of individual-level randomization.

Note (Multiple comparisons and p-hacking)

Testing 20 metrics at \alpha = 0.05 expects one false positive by chance alone. Apply Bonferroni correction: use \alpha_{\text{adjusted}} = 0.05 / \text{num\_tests} as your threshold. Better still — pre-register your primary metric before the experiment starts.

Note (Simpson's Paradox)

Aggregated results can reverse when broken into subgroups. A treatment that looks harmful overall can be beneficial for every individual subgroup, and vice versa. Always segment your analysis after seeing the overall result.
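The reversal is easy to reproduce in a dozen lines of pandas. The numbers below are invented so that the treatment wins inside every segment yet loses in aggregate (the segments have opposite traffic mixes):

```python
import pandas as pd

# Hypothetical experiment data: conversion by segment and variant
df = pd.DataFrame({
    'segment':   ['mobile'] * 2 + ['desktop'] * 2,
    'variant':   ['treatment', 'control'] * 2,
    'users':     [900, 100, 100, 900],
    'converted': [90, 8, 30, 200],
})
df['rate'] = df['converted'] / df['users']

by_segment = df.pivot(index='segment', columns='variant', values='rate')
overall = df.groupby('variant')[['converted', 'users']].sum()
overall['rate'] = overall['converted'] / overall['users']

print(by_segment)  # treatment beats control in BOTH segments
print(overall)     # ...yet control wins overall (0.208 vs 0.12)
```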

Causal Inference — When You Cannot Randomize

A/B tests require the ability to randomize. In practice, randomization is often impossible: a policy was already rolled out, an ethical concern prevents a control group, or the treatment is something historical. Quasi-experimental methods let you estimate causal effects from observational data.

Difference-in-Differences (DiD)

When to use: treatment was rolled out to some groups (cities, cohorts, stores) but not others, and you have data from before and after the rollout.

\text{DiD} = (\bar{Y}_{T,\text{after}} - \bar{Y}_{T,\text{before}}) - (\bar{Y}_{C,\text{after}} - \bar{Y}_{C,\text{before}})

The key assumption is parallel trends: without the treatment, the treated and control groups would have evolved at the same rate. You cannot prove this, but you can make it plausible.

import statsmodels.formula.api as smf

# The coefficient on treated:post is the DiD estimate
result = smf.ols(
    'outcome ~ treated + post + treated:post + controls',
    data=df
).fit()
print(result.summary())
Tip (Validating parallel trends)

Plot treated and control group outcomes over time before the treatment date. If they track each other closely, the assumption is plausible. Run a placebo test: apply the DiD method to a pre-treatment period where you know the effect should be zero. A significant result there signals a problem.

Regression Discontinuity Design (RDD)

When to use: treatment is assigned based on a hard threshold on a continuous variable — credit score ≥ 700 gets a loan, test score ≥ 70 earns a scholarship.

Intuition: people just above and just below the threshold are essentially identical except for which side they fell on. Comparing their outcomes in a narrow window around the cutoff gives a clean causal estimate.

  • Sharp RDD — everyone above the threshold gets treatment (hard cutoff)
  • Fuzzy RDD — treatment probability jumps at the threshold (use IV to estimate)
  • McCrary test — verify that no one is gaming their score to cross the cutoff (check for a discontinuity in the density of the running variable at the threshold)

Instrumental Variables (IV)

When to use: your variable of interest X is correlated with unobserved confounders (endogeneity problem), making OLS estimates biased.

A valid instrument Z must satisfy two conditions:

  1. Relevance — Z affects X (testable: first-stage F-statistic should be > 10)
  2. Exclusion restriction — Z affects Y only through X, not directly (untestable, requires theory)

Two-Stage Least Squares (2SLS):

  • Stage 1: regress X on Z → get \hat{X} (the part of X explained only by Z)
  • Stage 2: regress Y on \hat{X} → the coefficient is the causal estimate

Classic examples: draft lottery as an instrument for military service; distance to college as an instrument for years of education.
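The two stages are easy to simulate. A sketch with synthetic data and a deliberately unobserved confounder u (all names and coefficients are hypothetical); for real work use a dedicated IV routine, since the naive stage-2 standard errors here are wrong:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 5000
z = rng.normal(size=n)                       # instrument
u = rng.normal(size=n)                       # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)         # x driven by both z and u
y = 2.0 * x + 3.0 * u + rng.normal(size=n)   # true causal effect of x is 2.0

# Naive OLS is biased upward because u raises both x and y
ols = LinearRegression().fit(x.reshape(-1, 1), y)

# Stage 1: keep only the part of x explained by z
x_hat = LinearRegression().fit(z.reshape(-1, 1), x).predict(z.reshape(-1, 1))
# Stage 2: regress y on x_hat
iv = LinearRegression().fit(x_hat.reshape(-1, 1), y)

print(f"OLS (biased): {ols.coef_[0]:.2f}")  # well above the true 2.0
print(f"IV estimate:  {iv.coef_[0]:.2f}")   # close to 2.0
```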

Caveat (IV estimates LATE, not ATE)

IV estimates the Local Average Treatment Effect — the causal effect for compliers (people whose treatment status was actually changed by the instrument). This may not generalize to the broader population.


Forecasting, Anomaly Detection & Price Elasticity

Time Series Forecasting

Any time series can be decomposed into four components:

| Component | Description | Example |
|---|---|---|
| Trend | Long-term direction | Revenue growing year over year |
| Seasonality | Periodic repeating pattern | Holiday sales spike every December |
| Cyclical | Multi-year economic cycles | Revenue declining during a recession |
| Noise | Unpredictable random variation | Day-to-day fluctuations |

ARIMA

ARIMA is the classical statistical approach for univariate time series. The three parameters control what information the model uses:

| Parameter | What it adds | How to choose |
|---|---|---|
| p (AR order) | Past values of Y | Look at the Partial Autocorrelation Function (PACF) |
| d (differencing) | Remove trend to make series stationary | Augmented Dickey-Fuller test — difference until p < 0.05 |
| q (MA order) | Past forecast errors | Look at the Autocorrelation Function (ACF) |

For seasonal data: SARIMA(p,d,q)(P,D,Q,s) where s is the seasonal period (e.g., 12 for monthly data with annual seasonality).

Prophet

Prophet is Meta’s open-source library designed for business time series with strong seasonality, holidays, and missing data. It fits an additive model:

y(t) = \text{trend}(t) + \text{seasonality}(t) + \text{holidays}(t) + \epsilon

from prophet import Prophet
import pandas as pd

# Prophet requires columns named 'ds' (date) and 'y' (value)
df = pd.read_csv('sales.csv').rename(columns={'date': 'ds', 'sales': 'y'})

model = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=True,
    daily_seasonality=False,
    changepoint_prior_scale=0.05,  # Flexibility of the trend — higher = more flexible
)
model.add_country_holidays(country_name='US')
model.fit(df)

future = model.make_future_dataframe(periods=90)  # 90-day forecast
forecast = model.predict(future)
model.plot(forecast)             # Full forecast with uncertainty bands
model.plot_components(forecast)  # Decomposition: trend + each seasonality component
Tip (ARIMA vs Prophet)

Use ARIMA when you need statistical rigour and you have a clean, stationary univariate series. Use Prophet when you have multiple strong seasonalities, missing data, known holidays, and need a model that non-experts can run and interpret — it handles the messiness of real business data gracefully.

Price Elasticity of Demand

Price elasticity measures how sensitive demand is to a change in price.

E = \frac{\% \Delta Q}{\% \Delta P} = \frac{\partial Q}{\partial P} \cdot \frac{P}{Q}

| Elasticity | Meaning | Typical examples |
|---|---|---|
| \|E\| > 1 | Elastic — demand drops significantly with a price increase | Luxury goods, commodities, price-sensitive categories |
| \|E\| < 1 | Inelastic — demand barely changes | Necessities, utilities, strong brand loyalty |
| \|E\| = 1 | Unit elastic — revenue stays constant | Rare in practice |

Estimate elasticity with a log-log regression — the coefficient on log price is the elasticity directly:

\ln(Q) = \alpha + E \cdot \ln(P) + \text{controls}
import numpy as np
from sklearn.linear_model import LinearRegression

df['log_sales'] = np.log(df['sales'])
df['log_price'] = np.log(df['price'])

X = df[['log_price', 'log_income', 'log_competitor_price']]
y = df['log_sales']
model = LinearRegression().fit(X, y)
elasticity = model.coef_[0]  # Typically negative: higher price → lower demand
print(f'Price elasticity: {elasticity:.3f}')

# Optimal price (Lerner condition — valid only when |E| > 1)
marginal_cost = 10
optimal_price = marginal_cost * elasticity / (elasticity + 1)
print(f'Optimal price: ${optimal_price:.2f}')

Anomaly Detection

| Method | How it works | Use when |
|---|---|---|
| Z-score | Flag points more than 3σ from the mean | Univariate, stationary, normally distributed |
| IQR method | Flag outside [Q_1 - 1.5 \cdot IQR,\ Q_3 + 1.5 \cdot IQR] | Skewed data, robust to outliers |
| Isolation Forest | Anomalies are isolated in fewer tree splits | Multivariate, high-dimensional |
| Local Outlier Factor (LOF) | Compares local density to neighbours | Varying cluster densities |
| Autoencoder | High reconstruction error flags anomalies | Complex high-dimensional patterns |
| Prophet residuals | Fit a forecast, flag large residuals | Time series anomaly detection |
| CUSUM / EWMA | Sequential monitoring of cumulative deviations | Real-time production monitoring |
from sklearn.ensemble import IsolationForest

model = IsolationForest(contamination=0.05, random_state=42)  # 5% expected anomaly rate
preds = model.fit_predict(X)  # -1 = anomaly, 1 = normal
anomalies = X[preds == -1]

Liked this article? Share it with a friend. Have a question, feedback, or simply wish to contact me privately? Shoot me a DM and I'll do my best to get back to you.

Have a wonderful day.

– Sarath