Classical Machine Learning: The Foundations Behind Intelligent Systems

March 1, 2026 · 22 min read (57 min read total) · 2 parts

Statistical & Analytical Foundations

Statistics is the backbone of data science. In an interview, always connect math to business meaning. The goal is not to recite formulas but to explain WHY something matters.

Probability Distributions: The Building Blocks

A probability distribution describes how the values of a random variable are spread out. It tells us which outcomes are likely and how likely they are. Some distributions describe discrete outcomes (such as counts or binary events), while others describe continuous values (such as measurements).

Understanding these distributions helps us:

  • model uncertainty
  • understand real-world data
  • build statistical and machine learning models.

Mean and Standard Deviation

Mean (μ)

The mean represents the average value of a distribution. It indicates where the center of the data lies.

If exam scores average 70, then μ = 70.

Standard Deviation (σ)

The standard deviation measures how spread out the data is around the mean.

Small σ → values are close to the mean

Large σ → values are more spread out

μ = 70, σ = 5 → most scores close to 70
μ = 70, σ = 20 → scores vary widely
Tip (The crowd analogy)

Imagine a group photo of people standing in a line. The mean is the person standing in the center of the group, the average position where everyone balances out. The standard deviation tells you how spread out the crowd is from that center person. If everyone stands close to the center, the spread is small (small σ). If people are scattered far away, the spread is large (large σ).
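Both quantities are one-liners in NumPy. A minimal sketch with made-up exam scores matching the μ = 70 example above:

```python
import numpy as np

# Hypothetical exam scores clustered around 70
scores = np.array([65, 68, 70, 70, 72, 75])

mu = scores.mean()          # center of the data
sigma = scores.std(ddof=0)  # population standard deviation

print(f"mean = {mu:.1f}, std = {sigma:.2f}")  # mean = 70.0, std = 3.11
```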

Key Distributions You Must Know Cold

Note (How distributions are connected)

A Bernoulli distribution describes one trial with two outcomes.

  • Now imagine repeating that experiment many times. If you repeat a Bernoulli trial n times and count the number of successes, the result follows a Binomial distribution.
  • Sometimes we care about rare events over time. When the number of trials becomes very large but the probability of success is small, the Binomial distribution behaves like a Poisson distribution.
  • If we look at the average of many observations, something interesting happens. According to the Central Limit Theorem, the distribution of sample averages tends to look Normal (bell-shaped), even if the original data is not perfectly normal.
  • Now consider the time between events. If events happen randomly at a constant average rate (as modeled by the Poisson distribution), the waiting time between those events follows an Exponential distribution.
  • Sometimes we are uncertain about probabilities themselves. The Beta distribution is often used to represent uncertainty about a probability value, such as the true success rate of a Bernoulli process.
  • Finally, when working with small datasets, we cannot estimate variability very precisely. In that case, we use the t-distribution, which is similar to the normal distribution but has heavier tails to account for extra uncertainty.
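One of these links is easy to check numerically. The sketch below compares a Binomial with many trials and a small success probability against the Poisson with rate n·p; the numbers (n = 1000, p = 0.003) are arbitrary illustrative choices:

```python
from scipy import stats

# Many trials, small success probability -> Binomial approaches Poisson(n*p)
n, p = 1000, 0.003
lam = n * p

# Probability of exactly 3 successes under each distribution
binom_pmf = stats.binom.pmf(3, n, p)
pois_pmf = stats.poisson.pmf(3, lam)
print(f"Binomial: {binom_pmf:.4f}, Poisson: {pois_pmf:.4f}")  # nearly identical
```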
Bernoulli (p = 0.3)

A single binary trial — success (1) or failure (0). The atomic building block of all discrete distributions.

Key properties

  • E[X] = p = 0.3
  • Var[X] = p(1−p) = 0.21
  • Special case: Binomial with n=1

Use cases

  • Single coin flip
  • Single ad click
  • One pass/fail test
Important

If asked ‘When would you use Poisson vs Normal?’ say: Poisson for count data with rare events in fixed intervals; Normal for continuous measurements when the CLT applies. Always give a concrete business example.

Hypothesis Testing

Hypothesis testing is a statistical framework used to make decisions based on data. It helps determine whether an observed effect is real or simply due to random chance. Understanding this process is fundamental in statistics, data science, and machine learning.

The 5-Step Framework

  1. State the hypotheses, define two competing statements:

    • H₀ (Null Hypothesis): No effect or no difference exists.
    • H₁ (Alternative Hypothesis): There is an effect or difference.
  2. Choose a significance level (α). Select the threshold for accepting false positives. A common choice is α = 0.05, meaning we accept a 5% chance of incorrectly rejecting H₀.

  3. Collect data and compute a test statistic. Use the sample data to calculate a statistic such as z, t, χ² (chi-square), or F, depending on the test.

  4. Compute the p-value. The p-value represents the probability of observing results at least as extreme as the current data, assuming the null hypothesis is true.

  5. Make a decision

    • If p < α, reject the null hypothesis.
    • If p ≥ α, fail to reject the null hypothesis.
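The five steps map directly onto a few lines of SciPy. A sketch with a hypothetical sample, testing whether its mean differs from a claimed value of 70:

```python
from scipy import stats

# Step 1: H0: mean = 70, H1: mean != 70.  Step 2: alpha = 0.05
alpha = 0.05
sample = [74, 68, 75, 71, 73, 70, 76, 72, 69, 74]  # hypothetical scores

# Steps 3-4: test statistic and p-value in one call
t_stat, p_value = stats.ttest_1samp(sample, popmean=70)

# Step 5: decide
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.2f}, p = {p_value:.3f} -> {decision}")  # p < 0.05 here
```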

Type I and Type II Errors

When performing a hypothesis test, two types of mistakes can occur. Type I is when we reject the null hypothesis even though it is actually true. In other words, we detect an effect that does not really exist. Type II is when we fail to reject the null hypothesis even though there is a real effect. In other words, we miss something that actually exists.

Type I → Seeing something that isn't there.
Type II → Missing something that is there.

Confidence Intervals

CI = point estimate ± z × (standard deviation / √n)

A confidence interval (CI) gives a range of values where the true value is likely to lie based on the data we collected. Think of it as an estimated range for the real answer.

For example, if we estimate the average height of a population to be 170 cm, a 95% confidence interval might look like:

167 cm — 173 cm
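The interval comes straight from the formula above. A sketch with made-up height measurements, using the z value for simplicity (with a sample this small, the t-distribution would be the stricter choice):

```python
import numpy as np
from scipy import stats

# Hypothetical heights with a sample mean near 170 cm
heights = np.array([168, 172, 169, 171, 170, 173, 167, 170])
n = len(heights)
mean = heights.mean()
se = heights.std(ddof=1) / np.sqrt(n)  # standard error of the mean

z = stats.norm.ppf(0.975)              # 1.96 for a 95% interval
ci_low, ci_high = mean - z * se, mean + z * se
print(f"95% CI: {ci_low:.1f} cm - {ci_high:.1f} cm")  # 168.6 cm - 171.4 cm
```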

Correlation vs Causation

Correlation measures how strongly two variables move together. It ranges from –1 to +1:

  • +1 → perfect positive relationship (both increase together)

  • –1 → perfect negative relationship (one increases while the other decreases)

  • 0 → no linear relationship

However, correlation does not mean causation. Causation means that one variable directly causes a change in another.
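np.corrcoef makes the r scale concrete. A sketch with synthetic data where one pair moves together and the other is independent (the coefficients and noise scale are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_pos = 2 * x + rng.normal(scale=0.5, size=200)  # moves with x
y_none = rng.normal(size=200)                    # unrelated to x

r_pos = np.corrcoef(x, y_pos)[0, 1]    # close to +1
r_none = np.corrcoef(x, y_none)[0, 1]  # close to 0
print(f"correlated: {r_pos:.2f}, unrelated: {r_none:.2f}")
```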

Machine Learning algorithms

When learning any machine learning algorithm, aim to understand it at three levels:

  1. Intuition: what problem the algorithm solves and how it works conceptually
  2. Mathematical foundation: the objective function and optimization method
  3. Practical usage: when to use it and how to tune its hyperparameters

Linear Regression — The Foundation

Linear regression models the relationship between a target variable Y and input features X as a linear combination. It is one of the simplest and most interpretable models in machine learning.

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \epsilon

Where:

  • \beta_0 = intercept (the baseline value when all features are zero)
  • \beta_i = how much Y changes for a one-unit increase in feature x_i
  • \epsilon = random noise the model cannot explain

The loss function being minimized is Mean Squared Error — the average of squared prediction errors:

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n}(y_i - \hat{y}_i)^2

The closed-form solution — called Ordinary Least Squares — finds the optimal coefficients directly:

\beta = (X^T X)^{-1} X^T y
Intuition (What is linear regression really doing?)

Imagine plotting house sizes against prices on a scatter plot. Linear regression draws the one straight line that minimises the total squared distance from every point to the line. It is the “line of least regret” — wrong about every point by the smallest possible amount overall.
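The closed-form OLS solution is a few lines of NumPy. A sketch on synthetic house-size data (true intercept 30, true slope 0.5), applying the normal equation literally; in practice np.linalg.lstsq is the numerically safer call:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(50, 200, size=n)                # house size (m^2)
y = 30 + 0.5 * x + rng.normal(scale=5, size=n)  # price = 30 + 0.5*size + noise

X = np.column_stack([np.ones(n), x])      # prepend an intercept column
beta = np.linalg.inv(X.T @ X) @ X.T @ y   # beta = (X^T X)^{-1} X^T y
print(beta)  # recovers roughly [30, 0.5]
```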

Assumptions (the LINE acronym)

Every statistical test you run on a linear model assumes these hold. Violating them does not break the model but it does break your inference.

| Letter | Assumption | Consequence if violated |
|---|---|---|
| L | Linearity — the relationship between X and Y is linear | Model is systematically biased |
| I | Independence — residuals do not depend on each other | Standard errors underestimated |
| N | Normality — residuals are normally distributed | Hypothesis tests become invalid |
| E | Equal variance (homoscedasticity) — residual spread is constant | Inefficient estimates, invalid CIs |

Regularization — Keeping the Model Honest

A model that fits training data perfectly is usually memorising noise, not learning the pattern. Regularization adds a penalty to the loss function that discourages overly large coefficients.

| Method | Penalty added | Behaviour | Best for |
|---|---|---|---|
| Ridge (L2) | \lambda \sum \beta_i^2 | Shrinks all coefficients toward zero, never exactly zero | Multicollinearity; all features likely relevant |
| Lasso (L1) | \lambda \sum \|\beta_i\| | Can zero out coefficients entirely — automatic feature selection | Many irrelevant features; want a sparse model |
| ElasticNet | \lambda_1 L_1 + \lambda_2 L_2 | Combines both effects | High-dimensional data with correlated features |

\lambda is the regularization strength — larger \lambda = heavier penalty. Tune it with cross-validation.

Intuition (The rubber band analogy)

Without regularization, a curve wiggles wildly to pass through every noisy data point. Ridge adds a rubber band that pulls the curve straight — it still fits the data but cannot wiggle too much. Lasso goes further: it can snap the band entirely for features it deems irrelevant, removing them from the model.
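The snapping behaviour is easy to see on synthetic data. In the sketch below only the first two of five features matter; the alpha values are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only features 0 and 1 drive y; features 2-4 are pure noise
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)

print(ridge.coef_.round(3))  # everything shrunk a little, nothing exactly zero
print(lasso.coef_.round(3))  # irrelevant features snapped to exactly zero
```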

Logistic Regression

Despite the name, logistic regression is a classification model. It predicts the probability that an observation belongs to a class by applying a sigmoid function to a linear combination of features.

P(y=1 \mid x) = \sigma(w^T x) = \frac{1}{1 + e^{-w^T x}}

The sigmoid squishes any real number into (0, 1), making the output a valid probability.

Loss function — cross-entropy (log loss), not MSE:

\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\Big]
Important (Why not MSE for classification?)

MSE combined with the sigmoid creates a non-convex loss surface full of local minima — gradient descent can get stuck. Cross-entropy is convex, which guarantees convergence to the global minimum.

The default decision boundary is P > 0.5 → class 1, but this threshold is adjustable. Moving it lower increases recall (catches more positives at the cost of more false alarms).

Bias-Variance Tradeoff

Every model’s total error decomposes into three parts:

\text{Total Error} = \underbrace{\text{Bias}^2}_{\text{systematic error}} + \underbrace{\text{Variance}}_{\text{sensitivity to data}} + \underbrace{\text{Irreducible Noise}}_{\text{can't fix this}}

|  | High Bias (Underfitting) | High Variance (Overfitting) |
|---|---|---|
| Cause | Model too simple, wrong assumptions | Model too complex, memorises noise |
| Training error | High | Low |
| Test error | High | High |
| Example | Linear model on curved data | Deep decision tree on 100 rows |
| Fix | More features, complex model, less regularization | More data, regularization, simpler model, ensembles |
Intuition (The dartboard analogy)

Imagine throwing darts at a bullseye. High bias — darts consistently land to the left of centre (systematic aim error, wrong assumptions). High variance — darts scattered all over the board (inconsistent, sensitive to tiny hand movements). The ideal model has both low bias and low variance: darts clustered around the bullseye.

Tip (Interview visual)

Draw the U-shaped test error curve against model complexity. Show training error always decreasing, test error bottoming out then rising. This single diagram communicates the entire tradeoff and tends to impress interviewers who expect only verbal explanations.

Decision Trees

A decision tree splits data by asking yes/no questions on features, building a hierarchy of if-else rules. Each leaf node holds a prediction.

How the tree decides which question to ask — it picks the split that gives the greatest reduction in impurity:

\text{Gini Impurity} = 1 - \sum_i p_i^2 \qquad \text{(lower = purer node)}

\text{Entropy} = -\sum_i p_i \log_2(p_i)

\text{Information Gain} = \text{Entropy(parent)} - \text{weighted Entropy(children)}
| Hyperparameter | What it controls |
|---|---|
| max_depth | How deep the tree grows — lower depth = simpler model |
| min_samples_split | Minimum samples needed to attempt a split |
| min_samples_leaf | Minimum samples required in any leaf |
| max_features | Number of features randomly considered per split |
Intuition (Decision trees are flowcharts)

A decision tree is exactly the kind of yes/no flowchart you would draw on a whiteboard: “Is the customer’s age above 30? If yes, is their income above $50k? If no, predict churn.” The algorithm automates finding the best questions to ask and in what order — based purely on what maximises class purity at each step.
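The impurity formulas above can be verified by hand for a perfectly separating split. A minimal sketch on an 8-sample toy node:

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 50/50 -> maximally impure
left, right = parent[:4], parent[4:]         # a perfect split

print(gini(parent))     # 0.5  (worst case for two classes)
print(entropy(parent))  # 1.0 bit
gain = entropy(parent) - (0.5 * entropy(left) + 0.5 * entropy(right))
print(gain)             # information gain = 1.0 (children are pure)
```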

Ensemble Methods — The Interview Workhorses

Random Forest

Random Forest trains many deep decision trees independently on different random subsets of data and features, then averages their predictions.

Why does this work? Each tree makes different errors because it sees different data and features. Averaging many imperfect trees cancels out individual mistakes, reducing variance without increasing bias.

Key ideas:

  • Bagging — each tree is trained on a bootstrap sample (roughly 63% of data, sampled with replacement)
  • Random feature subsets — at each split, only a random subset of features is considered. This is what decorrelates the trees and makes RF better than plain bagging.
  • Out-of-bag (OOB) error — the ~37% of samples not used in each tree act as a free validation set, no cross-validation needed.
Tip (When to reach for Random Forest)

Tabular data, high-dimensional features, quick reliable baseline, and situations where you need feature importance without extra work. Robust to outliers and noisy features.
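Both the OOB validation estimate and the free feature importances are one flag away in scikit-learn. A sketch on a synthetic dataset (make_classification parameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic tabular data: 20 features, only 5 informative
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=42)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

print(f"OOB accuracy: {rf.oob_score_:.3f}")         # free validation estimate
print(rf.feature_importances_.argsort()[::-1][:5])  # top 5 features, no extra work
```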

Gradient Boosting (XGBoost, LightGBM, CatBoost)

Where Random Forest builds trees in parallel, Gradient Boosting builds them sequentially. Each new tree corrects the mistakes of the current ensemble by fitting the residual errors (the negative gradient of the loss).

F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)

where h_m is the new tree correcting the residuals, and \eta is the learning rate that controls how much weight each new tree gets.

| Library | Key innovation |
|---|---|
| XGBoost | Regularization in the tree objective; handles missing values; column subsampling |
| LightGBM | Leaf-wise growth (faster); histogram-based splitting (memory efficient) |
| CatBoost | Built-in categorical encoding; ordered boosting prevents target leakage |
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=1000,         # Set high — early stopping will find the real number
    learning_rate=0.05,        # Lower = better generalization, slower training
    max_depth=4,               # Keep shallow for boosting (3–6 is typical)
    subsample=0.8,             # Fraction of rows per tree
    colsample_bytree=0.8,      # Fraction of features per tree
    early_stopping_rounds=50,  # Stop if val metric doesn't improve for 50 rounds
    eval_metric='auc',
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=100)
Important (Always use early stopping with boosting)

Setting early_stopping_rounds is the single most impactful practice when using gradient boosting. It automatically finds the optimal number of trees by monitoring held-out performance, preventing overfitting without manually tuning n_estimators.

Model Evaluation — Choosing the Right Metric

The choice of metric is as important as the choice of model. Using the wrong one can lead you to ship a model that looks great on paper but fails in production.

Classification Metrics

| Metric | What it measures | Use when |
|---|---|---|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Classes are balanced — misleading otherwise |
| Precision | TP/(TP+FP) | False positives are costly (spam filter, fraud alert) |
| Recall | TP/(TP+FN) | False negatives are costly (cancer screening, safety) |
| F1 Score | 2 \cdot PR/(P+R) | Need balance between precision and recall |
| ROC-AUC | Area under ROC curve | Threshold-invariant; good general-purpose metric |
| PR-AUC | Area under Precision-Recall curve | Heavily imbalanced datasets |

Regression Metrics

| Metric | Formula | Use when |
|---|---|---|
| MAE | mean of \|y_i - \hat{y}_i\| | Outliers present; interpretable in original units |
| MSE | mean of (y_i - \hat{y}_i)^2 | Want to heavily penalize large errors |
| RMSE | \sqrt{\text{MSE}} | Same units as target — easier to communicate |
| MAPE | mean of \|y_i - \hat{y}_i\|/y_i \times 100\% | Percentage error — intuitive for business audiences |
| R² | 1 - SS_{res}/SS_{tot} | Proportion of variance explained (0 to 1) |
Important (Always ask this before choosing a metric)

What is the cost of a false positive versus a false negative in your business context? A fraud detection model where missing fraud costs $10,000 but a false alert costs $5 should optimise for recall, not accuracy. This question signals business maturity to interviewers.
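The precision/recall definitions above are quick to verify on a toy prediction vector (the counts are chosen by hand so the arithmetic is easy to check):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]  # 3 TP, 1 FN, 2 FP, 4 TN

print(precision_score(y_true, y_pred))  # 3 / (3 + 2) = 0.6
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print(f1_score(y_true, y_pred))         # 2*0.6*0.75 / 1.35 ≈ 0.667
```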


Python for Data Science

Pandas — Core Operations

import pandas as pd
import numpy as np
# Load and inspect
df = pd.read_csv('data.csv')
print(df.shape, df.dtypes, df.describe(), df.isnull().sum())
# Filtering
df[df['age'] > 30]
df.query('age > 30 and salary > 50000')
# GroupBy — the most tested operation in DS interviews
df.groupby('segment')['revenue'].agg(['mean', 'sum', 'count']).reset_index()
# Apply a custom function row-by-row
df['new_col'] = df['col'].apply(lambda x: x ** 2 if x > 0 else 0)
# Pivot table
df.pivot_table(values='sales', index='region', columns='quarter', aggfunc='sum')
# Merge (like SQL JOIN)
pd.merge(df1, df2, on='customer_id', how='left')
# Handle missing values
df['col'].fillna(df['col'].median()) # impute with median
df.dropna(subset=['critical_col']) # drop rows where this column is missing

Scikit-learn — The Interview Pipeline

Always wrap preprocessing and model steps in a Pipeline. This is not just a style preference — it prevents data leakage, the single most common mistake in ML interviews and production systems.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
numeric_features = ['age', 'income']
categorical_features = ['region', 'segment']
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42)),
])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
print('ROC-AUC:', roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1]))
Important (Why Pipeline prevents data leakage)

If you fit a StandardScaler on the full dataset before the train/test split, the scaler has seen the test set’s mean and variance — your model has indirectly peeked at unseen data. Wrapping everything in Pipeline guarantees that any fitting step (scaling, encoding) only ever sees training data, keeping the test set truly held-out.

Handling Class Imbalance

When one class is far rarer than another (e.g., 1% fraud, 99% legitimate), accuracy becomes meaningless — a model that always predicts “not fraud” is 99% accurate and completely useless. Four strategies in order of how to try them:

# 1. Class weights — easiest, try this first
RandomForestClassifier(class_weight='balanced')
# 2. SMOTE — creates synthetic minority samples to balance the dataset
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
# 3. Threshold tuning — 0.5 is an arbitrary default
# Lower threshold = more recall (catches more positives, more false alarms)
y_pred = (model.predict_proba(X_test)[:, 1] > 0.3).astype(int)
# 4. Evaluate with PR-AUC, not ROC-AUC
from sklearn.metrics import average_precision_score
pr_auc = average_precision_score(y_test, model.predict_proba(X_test)[:, 1])

SQL — Window Functions & Advanced Patterns

SQL is tested at every senior data science interview. Knowing window functions is what separates junior from senior answers.

Window Functions

A window function computes a value across a set of rows related to the current row without collapsing the result into one row — unlike GROUP BY. The OVER() clause defines which rows to look at.

-- General syntax
function() OVER (PARTITION BY col ORDER BY col ROWS BETWEEN ... AND ...)

Ranking

SELECT *,
ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC) AS row_num,
RANK() OVER (PARTITION BY region ORDER BY sales DESC) AS rank,
DENSE_RANK() OVER (PARTITION BY region ORDER BY sales DESC) AS dense_rank
FROM sales_table;
| Function | Tie behaviour | Example output |
|---|---|---|
| ROW_NUMBER() | Always unique | 1, 2, 3, 4 |
| RANK() | Skips after tie | 1, 2, 2, 4 |
| DENSE_RANK() | No skip after tie | 1, 2, 2, 3 |

Running Totals and Moving Averages

SELECT date, sales,
SUM(sales) OVER (ORDER BY date) AS running_total,
AVG(sales) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7day
FROM daily_sales;

LAG / LEAD — Compare Rows to Previous or Next

SELECT date, revenue,
LAG(revenue, 1) OVER (ORDER BY date) AS prev_day_revenue,
LEAD(revenue, 1) OVER (ORDER BY date) AS next_day_revenue,
revenue - LAG(revenue, 1) OVER (ORDER BY date) AS day_over_day_change
FROM daily_revenue;

NTILE — Percentile Buckets

SELECT customer_id, lifetime_value,
NTILE(4) OVER (ORDER BY lifetime_value DESC) AS ltv_quartile,
NTILE(10) OVER (ORDER BY lifetime_value DESC) AS ltv_decile
FROM customers;

CTEs and Complex Patterns

-- Top 3 customers by revenue per region
WITH customer_revenue AS (
    SELECT customer_id, region,
           SUM(order_amount) AS total_revenue
    FROM orders
    WHERE order_date >= '2024-01-01'
    GROUP BY customer_id, region
),
ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_revenue DESC) AS rn
    FROM customer_revenue
)
SELECT * FROM ranked WHERE rn <= 3;
-- Month-over-month growth rate
WITH monthly AS (
    SELECT DATE_TRUNC('month', order_date) AS month,
           SUM(amount) AS revenue
    FROM orders
    GROUP BY 1
)
SELECT month,
       revenue,
       LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
       ROUND(
           100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
           / NULLIF(LAG(revenue) OVER (ORDER BY month), 0),
           2
       ) AS mom_growth_pct
FROM monthly;

Classic Interview SQL Problems

-- Second highest salary
SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1;
-- Customers who bought in January but not February
SELECT DISTINCT customer_id FROM orders WHERE MONTH(order_date) = 1
EXCEPT
SELECT DISTINCT customer_id FROM orders WHERE MONTH(order_date) = 2;
-- First date cumulative sales crossed 1 million
WITH running AS (
    SELECT date, SUM(sales) OVER (ORDER BY date) AS cum_sales
    FROM daily_sales
)
SELECT MIN(date) FROM running WHERE cum_sales >= 1000000;
-- 7-day retention: users who were active 7 days after their first activity
WITH first_seen AS (
    SELECT user_id, MIN(activity_date) AS cohort_date
    FROM user_activity
    GROUP BY user_id
)
SELECT f.user_id
FROM first_seen f
JOIN user_activity a
  ON f.user_id = a.user_id
 AND a.activity_date = f.cohort_date + INTERVAL '7 days';

A/B Testing & Causal Inference

A/B Test Design — End to End

Before running an experiment, answer six questions. Skipping any of them leads to results you cannot trust.

  1. Primary metric — one metric only. Testing many metrics simultaneously inflates the false positive rate.
  2. Minimum Detectable Effect (MDE) — what is the smallest improvement that is meaningful to the business?
  3. Required sample size — calculated from significance level, power, MDE, and baseline conversion rate.
  4. Randomization unit — user, session, or request? The unit must match the level at which the metric is measured.
  5. Duration — run until the target sample size is reached, and for at least one full business cycle (e.g., one full week to capture weekday/weekend behaviour).
  6. Guardrail metrics — metrics that must not worsen (e.g., latency, revenue per user) even if the primary metric improves.

Sample Size Calculation

from scipy import stats
import numpy as np
def sample_size(baseline_rate, mde, alpha=0.05, power=0.80):
    """
    baseline_rate : current conversion rate, e.g. 0.10 for 10%
    mde           : minimum detectable effect, e.g. 0.02 to detect a 2pp lift
    alpha         : significance level (false positive rate)
    power         : 1 - beta (probability of detecting a real effect)
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde
    pooled = (p1 + p2) / 2
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-tailed
    z_beta = stats.norm.ppf(power)
    n = (
        z_alpha * np.sqrt(2 * pooled * (1 - pooled)) +
        z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2 / mde ** 2
    return int(np.ceil(n))

# 10% baseline, want to detect a 2 percentage point lift
print(sample_size(0.10, 0.02))  # → ~3,841 per group

Common A/B Testing Pitfalls

Note (The peeking problem)

Checking results before the planned sample size is reached inflates the Type I error rate dramatically. If you peek five times during an experiment, your effective false positive rate is closer to 22% than 5%. Commit to a fixed sample design, or use sequential testing methods (SPRT) that are designed for continuous monitoring.

Note (Network effects and SUTVA violation)

If users in control and treatment interact with each other (social features, marketplaces, referral programs), the treatment bleeds into the control group and the experiment is contaminated. Use cluster randomization — assign entire geographies, cohorts, or friend groups to one arm — instead of individual-level randomization.

Note (Multiple comparisons and p-hacking)

Testing 20 metrics at \alpha = 0.05 expects one false positive by chance alone. Apply Bonferroni correction: use \alpha_{\text{adjusted}} = 0.05 / \text{num\_tests} as your threshold. Better still — pre-register your primary metric before the experiment starts.

Note (Simpson's Paradox)

Aggregated results can reverse when broken into subgroups. A treatment that looks harmful overall can be beneficial for every individual subgroup, and vice versa. Always segment your analysis after seeing the overall result.
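The reversal is easy to reproduce in a dozen lines of pandas. The numbers below are invented so that the treatment wins inside every segment yet loses in aggregate (the segments have opposite traffic mixes):

```python
import pandas as pd

# Hypothetical experiment data: conversion by segment and variant
df = pd.DataFrame({
    'segment':   ['mobile'] * 2 + ['desktop'] * 2,
    'variant':   ['treatment', 'control'] * 2,
    'users':     [900, 100, 100, 900],
    'converted': [90, 8, 30, 200],
})
df['rate'] = df['converted'] / df['users']

by_segment = df.pivot(index='segment', columns='variant', values='rate')
overall = df.groupby('variant')[['converted', 'users']].sum()
overall['rate'] = overall['converted'] / overall['users']

print(by_segment)  # treatment beats control in BOTH segments
print(overall)     # ...yet control wins overall (0.208 vs 0.12)
```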

Causal Inference — When You Cannot Randomize

A/B tests require the ability to randomize. In practice, randomization is often impossible: a policy was already rolled out, an ethical concern prevents a control group, or the treatment is something historical. Quasi-experimental methods let you estimate causal effects from observational data.

Difference-in-Differences (DiD)

When to use: treatment was rolled out to some groups (cities, cohorts, stores) but not others, and you have data from before and after the rollout.

\text{DiD} = (\bar{Y}_{T,\text{after}} - \bar{Y}_{T,\text{before}}) - (\bar{Y}_{C,\text{after}} - \bar{Y}_{C,\text{before}})

The key assumption is parallel trends: without the treatment, the treated and control groups would have evolved at the same rate. You cannot prove this, but you can make it plausible.

import statsmodels.formula.api as smf

# The coefficient on treated:post is the DiD estimate
result = smf.ols(
    'outcome ~ treated + post + treated:post + controls',
    data=df
).fit()
print(result.summary())
Tip (Validating parallel trends)

Plot treated and control group outcomes over time before the treatment date. If they track each other closely, the assumption is plausible. Run a placebo test: apply the DiD method to a pre-treatment period where you know the effect should be zero. A significant result there signals a problem.

Regression Discontinuity Design (RDD)

When to use: treatment is assigned based on a hard threshold on a continuous variable — credit score ≥ 700 gets a loan, test score ≥ 70 earns a scholarship.

Intuition: people just above and just below the threshold are essentially identical except for which side they fell on. Comparing their outcomes in a narrow window around the cutoff gives a clean causal estimate.

  • Sharp RDD — everyone above the threshold gets treatment (hard cutoff)
  • Fuzzy RDD — treatment probability jumps at the threshold (use IV to estimate)
  • McCrary test — verify that no one is gaming their score to cross the cutoff (check for a discontinuity in the density of the running variable at the threshold)

Instrumental Variables (IV)

When to use: your variable of interest X is correlated with unobserved confounders (endogeneity problem), making OLS estimates biased.

A valid instrument Z must satisfy two conditions:

  1. Relevance — Z affects X (testable: first-stage F-statistic should be > 10)
  2. Exclusion restriction — Z affects Y only through X, not directly (untestable, requires theory)

Two-Stage Least Squares (2SLS):

  • Stage 1: regress X on Z → get \hat{X} (the part of X explained only by Z)
  • Stage 2: regress Y on \hat{X} → the coefficient is the causal estimate

Classic examples: draft lottery as an instrument for military service; distance to college as an instrument for years of education.
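The two stages are easy to simulate. A sketch with synthetic data and a deliberately unobserved confounder u (all names and coefficients are hypothetical); for real work use a dedicated IV routine, since the naive stage-2 standard errors here are wrong:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 5000
z = rng.normal(size=n)                       # instrument
u = rng.normal(size=n)                       # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)         # x driven by both z and u
y = 2.0 * x + 3.0 * u + rng.normal(size=n)   # true causal effect of x is 2.0

# Naive OLS is biased upward because u raises both x and y
ols = LinearRegression().fit(x.reshape(-1, 1), y)

# Stage 1: keep only the part of x explained by z
x_hat = LinearRegression().fit(z.reshape(-1, 1), x).predict(z.reshape(-1, 1))
# Stage 2: regress y on x_hat
iv = LinearRegression().fit(x_hat.reshape(-1, 1), y)

print(f"OLS (biased): {ols.coef_[0]:.2f}")  # well above the true 2.0
print(f"IV estimate:  {iv.coef_[0]:.2f}")   # close to 2.0
```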

Caveat (IV estimates LATE, not ATE)

IV estimates the Local Average Treatment Effect — the causal effect for compliers (people whose treatment status was actually changed by the instrument). This may not generalize to the broader population.


Forecasting, Anomaly Detection & Price Elasticity

Time Series Forecasting

Any time series can be decomposed into four components:

| Component | Description | Example |
|---|---|---|
| Trend | Long-term direction | Revenue growing year over year |
| Seasonality | Periodic repeating pattern | Holiday sales spike every December |
| Cyclical | Multi-year economic cycles | Revenue declining during a recession |
| Noise | Unpredictable random variation | Day-to-day fluctuations |

ARIMA

ARIMA is the classical statistical approach for univariate time series. The three parameters control what information the model uses:

| Parameter | What it adds | How to choose |
|---|---|---|
| p (AR order) | Past values of Y | Look at the Partial Autocorrelation Function (PACF) |
| d (differencing) | Remove trend to make series stationary | Augmented Dickey-Fuller test — difference until p < 0.05 |
| q (MA order) | Past forecast errors | Look at the Autocorrelation Function (ACF) |

For seasonal data: SARIMA(p,d,q)(P,D,Q,s) where s is the seasonal period (e.g., 12 for monthly data with annual seasonality).

Prophet

Prophet is Meta’s open-source library designed for business time series with strong seasonality, holidays, and missing data. It fits an additive model:

y(t) = \text{trend}(t) + \text{seasonality}(t) + \text{holidays}(t) + \epsilon

from prophet import Prophet
import pandas as pd

# Prophet requires columns named 'ds' (date) and 'y' (value)
df = pd.read_csv('sales.csv').rename(columns={'date': 'ds', 'sales': 'y'})

model = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=True,
    daily_seasonality=False,
    changepoint_prior_scale=0.05,  # Flexibility of the trend — higher = more flexible
)
model.add_country_holidays(country_name='US')
model.fit(df)

future = model.make_future_dataframe(periods=90)  # 90-day forecast
forecast = model.predict(future)
model.plot(forecast)             # Full forecast with uncertainty bands
model.plot_components(forecast)  # Decomposition: trend + each seasonality component
Tip (ARIMA vs Prophet)

Use ARIMA when you need statistical rigour and you have a clean, stationary univariate series. Use Prophet when you have multiple strong seasonalities, missing data, known holidays, and need a model that non-experts can run and interpret — it handles the messiness of real business data gracefully.

Price Elasticity of Demand

Price elasticity measures how sensitive demand is to a change in price.

E = \frac{\% \Delta Q}{\% \Delta P} = \frac{\partial Q}{\partial P} \cdot \frac{P}{Q}

| Elasticity | Meaning | Typical examples |
|---|---|---|
| \|E\| > 1 | Elastic — demand drops significantly with a price increase | Luxury goods, commodities, price-sensitive categories |
| \|E\| < 1 | Inelastic — demand barely changes | Necessities, utilities, strong brand loyalty |
| \|E\| = 1 | Unit elastic — revenue stays constant | Rare in practice |

Estimate elasticity with a log-log regression — the coefficient on log price is the elasticity directly:

\ln(Q) = \alpha + E \cdot \ln(P) + \text{controls}
import numpy as np
from sklearn.linear_model import LinearRegression

df['log_sales'] = np.log(df['sales'])
df['log_price'] = np.log(df['price'])

X = df[['log_price', 'log_income', 'log_competitor_price']]
y = df['log_sales']
model = LinearRegression().fit(X, y)
elasticity = model.coef_[0]  # Typically negative: higher price → lower demand
print(f'Price elasticity: {elasticity:.3f}')

# Optimal price (Lerner condition — valid only when |E| > 1)
marginal_cost = 10
optimal_price = marginal_cost * elasticity / (elasticity + 1)
print(f'Optimal price: ${optimal_price:.2f}')

Anomaly Detection

| Method | How it works | Use when |
|---|---|---|
| Z-score | Flag points more than 3σ from the mean | Univariate, stationary, normally distributed |
| IQR method | Flag outside [Q_1 - 1.5 \cdot IQR,\ Q_3 + 1.5 \cdot IQR] | Skewed data, robust to outliers |
| Isolation Forest | Anomalies are isolated in fewer tree splits | Multivariate, high-dimensional |
| Local Outlier Factor (LOF) | Compares local density to neighbours | Varying cluster densities |
| Autoencoder | High reconstruction error flags anomalies | Complex high-dimensional patterns |
| Prophet residuals | Fit a forecast, flag large residuals | Time series anomaly detection |
| CUSUM / EWMA | Sequential monitoring of cumulative deviations | Real-time production monitoring |
from sklearn.ensemble import IsolationForest

model = IsolationForest(contamination=0.05, random_state=42)  # 5% expected anomaly rate
preds = model.fit_predict(X)  # -1 = anomaly, 1 = normal
anomalies = X[preds == -1]

Liked this article? Share it with a friend. Have a question, feedback, or simply wish to contact me privately? Shoot me a DM and I'll do my best to get back to you.

Have a wonderful day.

– Sarath