Statistical & Analytical Foundations
Statistics is the backbone of data science. In an interview, always connect math to business meaning. The goal is not to recite formulas but to explain WHY something matters.
Probability Distributions: The Building Blocks
A probability distribution describes how the values of a random variable are spread out. It tells us which outcomes are likely and how likely they are. Some distributions describe discrete outcomes (such as counts or binary events), while others describe continuous values (such as measurements).
Understanding these distributions helps us:
- model uncertainty
- understand real-world data
- build statistical and machine learning models.
Mean and Standard Deviation
Mean (μ)
The mean represents the average value of a distribution. It indicates where the center of the data lies.
If exam scores average 70, then μ = 70.

Standard Deviation (σ)
The standard deviation measures how spread out the data is around the mean.
Small σ → values are close to the mean
Large σ → values are more spread out
μ = 70, σ = 5 → most scores close to 70
μ = 70, σ = 20 → scores vary widely

Tip (The crowd analogy)
Imagine a group photo of people standing in a line. The mean is the person standing in the center of the group, the average position where everyone balances out. The standard deviation tells you how spread out the crowd is from that center person. If everyone stands close to the center, the spread is small (small σ). If people are scattered far away, the spread is large (large σ).
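The crowd analogy maps directly to code. A minimal sketch with NumPy (the score values are made up for illustration):

```python
import numpy as np

scores = np.array([65, 68, 70, 72, 75])  # hypothetical exam scores

mu = scores.mean()          # center of the data — where the crowd balances
sigma = scores.std(ddof=0)  # population standard deviation — how spread out the crowd is

print(mu)     # 70.0
print(sigma)  # small value → scores cluster tightly around the mean
```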
Key Distributions You Must Know Cold
Note (How distributions are connected)
A Bernoulli distribution describes one trial with two outcomes.
- Now imagine repeating that experiment many times. If you repeat a Bernoulli trial n times and count the number of successes, the result follows a Binomial distribution.
- Sometimes we care about rare events over time. When the number of trials becomes very large but the probability of success is small, the Binomial distribution behaves like a Poisson distribution.
- If we look at the average of many observations, something interesting happens. According to the Central Limit Theorem, the distribution of sample averages tends to look Normal(bell-shaped), even if the original data is not perfectly normal.
- Now consider the time between events. If events happen randomly at a constant average rate (as modeled by the Poisson distribution), the waiting time between those events follows an Exponential distribution.
- Sometimes we are uncertain about probabilities themselves. The Beta distribution is often used to represent uncertainty about a probability value, such as the true success rate of a Bernoulli process.
- Finally, when working with small datasets, we cannot estimate variability very precisely. In that case, we use the t-distribution, which is similar to the normal distribution but has heavier tails to account for extra uncertainty.
Bernoulli (p = 0.3)
A single binary trial — success (1) or failure (0). The atomic building block of all discrete distributions.
E[X] = p = 0.3, Var[X] = p(1−p) = 0.21. Special case: Binomial with n = 1.

Important
If asked 'When would you use Poisson vs Normal?' say: Poisson for count data with rare events in fixed intervals; Normal for continuous measurements when CLT applies. Always give a concrete business example.
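The Binomial-to-Poisson connection described above can be checked numerically. A small sketch with SciPy (the visitor/purchase numbers are hypothetical):

```python
from scipy import stats

# Many trials, small success probability — Binomial ≈ Poisson with λ = np
n, p = 1000, 0.003   # hypothetical: 1000 visitors, 0.3% purchase rate
lam = n * p          # λ = 3 expected purchases

binom_pmf = stats.binom.pmf(2, n, p)   # P(exactly 2 purchases), exact Binomial
pois_pmf = stats.poisson.pmf(2, lam)   # same probability under the Poisson limit

print(binom_pmf, pois_pmf)  # nearly identical values
```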
Hypothesis Testing
Hypothesis testing is a statistical framework used to make decisions based on data. It helps determine whether an observed effect is real or simply due to random chance. Understanding this process is fundamental in statistics, data science, and machine learning.
The 5-Step Framework
1. State the hypotheses — define two competing statements:
   - H₀ (Null Hypothesis): No effect or no difference exists.
   - H₁ (Alternative Hypothesis): There is an effect or difference.
2. Choose a significance level (α). Select the threshold for tolerating false positives. A common choice is α = 0.05, meaning we accept a 5% chance of incorrectly rejecting H₀.
3. Collect data and compute a test statistic. Use the sample data to calculate a statistic such as z, t, χ² (chi-square), or F, depending on the test.
4. Compute the p-value. The p-value represents the probability of observing results at least as extreme as the current data, assuming the null hypothesis is true.
5. Make a decision:
   - If p < α, reject the null hypothesis.
   - If p ≥ α, fail to reject the null hypothesis.
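The five steps map to a few lines of SciPy. A sketch on synthetic data (group means, spread, and sizes are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=100, scale=15, size=500)    # step 1: H0 — same mean
treatment = rng.normal(loc=110, scale=15, size=500)  # synthetic data with a real lift of 10

alpha = 0.05                                          # step 2: significance level
t_stat, p_value = stats.ttest_ind(treatment, control) # steps 3-4: statistic and p-value

if p_value < alpha:                                   # step 5: decision
    print('Reject H0: the difference is statistically significant')
else:
    print('Fail to reject H0')
```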
Type I and Type II Errors
When performing a hypothesis test, two types of mistakes can occur. Type I is when we reject the null hypothesis even though it is actually true. In other words, we detect an effect that does not really exist. Type II is when we fail to reject the null hypothesis even though there is a real effect. In other words, we miss something that actually exists.
Type I → Seeing something that isn't there.
Type II → Missing something that is there.

Confidence Intervals
A confidence interval (CI) gives a range of values where the true value is likely to lie based on the data we collected. Think of it as an estimated range for the real answer.
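Computing a CI takes only a few lines. A sketch with SciPy (the height sample is made up):

```python
import numpy as np
from scipy import stats

heights = np.array([168, 171, 169, 173, 170, 167, 172, 170])  # hypothetical sample (cm)

mean = heights.mean()
sem = stats.sem(heights)  # standard error of the mean (uses sample std, ddof=1)

# 95% CI using the t-distribution (appropriate for small samples)
ci = stats.t.interval(0.95, df=len(heights) - 1, loc=mean, scale=sem)

print(mean)  # 170.0
print(ci)    # a range around 170 where the true mean plausibly lies
```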
For example, if we estimate the average height of a population to be 170 cm, a 95% confidence interval might look like:
167 cm — 173 cm

Correlation vs Causation
Correlation measures how strongly two variables move together. It ranges from –1 to +1:
- +1 → perfect positive relationship (both increase together)
- –1 → perfect negative relationship (one increases while the other decreases)
- 0 → no linear relationship
However, correlation does not mean causation. Causation means that one variable directly causes a change in another.
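The classic illustration is two variables driven by a common confounder. A sketch with NumPy (the ice-cream/drowning figures are invented — both are driven by summer temperature, neither causes the other):

```python
import numpy as np

ice_cream_sales = np.array([20, 35, 50, 65, 80])  # hypothetical monthly sales
drownings = np.array([2, 3, 5, 6, 8])             # hypothetical monthly incidents

# Pearson correlation coefficient
r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print(r)  # close to +1 — strongly correlated, yet there is no causal link
```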
Machine Learning algorithms
When learning any machine learning algorithm, aim to understand it at three levels:
- Intuition: what problem the algorithm solves and how it works conceptually
- Mathematical foundation: the objective function and optimization method
- Practical usage: when to use it and how to tune its hyperparameters
Linear Regression, The Foundation
Linear regression models the relationship between a target variable Y and input features X as a linear combination. It is one of the simplest and most interpretable models in machine learning.

Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ + ε

Where:
- β₀ = intercept (the baseline value when all features are zero)
- βᵢ = how much Y changes for a one-unit increase in feature Xᵢ
- ε = random noise the model cannot explain

The loss function being minimized is Mean Squared Error — the average of squared prediction errors:

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

The closed-form solution — called Ordinary Least Squares — finds the optimal coefficients directly:

β̂ = (XᵀX)⁻¹ Xᵀ y
Intuition (What is linear regression really doing?)
Imagine plotting house sizes against prices on a scatter plot. Linear regression draws the one straight line that minimises the total squared distance from every point to the line. It is the “line of least regret” — wrong about every point by the smallest possible amount overall.
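The closed-form OLS solution can be verified in a few lines of NumPy. A sketch on synthetic house-size data (the true line β₀ = 50, β₁ = 2 is chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
size = rng.uniform(50, 200, 100)                 # synthetic house sizes in m²
price = 50 + 2.0 * size + rng.normal(0, 5, 100)  # true line: intercept 50, slope 2, plus noise

X = np.column_stack([np.ones(100), size])        # add a column of 1s for the intercept
beta = np.linalg.inv(X.T @ X) @ X.T @ price      # β̂ = (XᵀX)⁻¹ Xᵀ y

print(beta)  # ≈ [50, 2] — recovers the true coefficients
```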
Assumptions (the LINE acronym)
Every statistical test you run on a linear model assumes these hold. Violating them does not break the model but it does break your inference.
| Letter | Assumption | Consequence if violated |
|---|---|---|
| L | Linearity — the relationship between X and Y is linear | Model is systematically biased |
| I | Independence — residuals do not depend on each other | Standard errors underestimated |
| N | Normality — residuals are normally distributed | Hypothesis tests become invalid |
| E | Equal variance (homoscedasticity) — residual spread is constant | Inefficient estimates, invalid CIs |
Regularization — Keeping the Model Honest
A model that fits training data perfectly is usually memorising noise, not learning the pattern. Regularization adds a penalty to the loss function that discourages overly large coefficients.
| Method | Penalty added | Behaviour | Best for |
|---|---|---|---|
| Ridge (L2) | λ Σ βⱼ² | Shrinks all coefficients toward zero, never exactly zero | Multicollinearity; all features likely relevant |
| Lasso (L1) | λ Σ \|βⱼ\| | Can zero out coefficients entirely — automatic feature selection | Many irrelevant features; want a sparse model |
| ElasticNet | mix of L1 and L2 penalties | Combines both effects | High-dimensional data with correlated features |
λ is the regularization strength — larger λ = heavier penalty. Tune it with cross-validation.
Intuition (The rubber band analogy)
Without regularization, a curve wiggles wildly to pass through every noisy data point. Ridge adds a rubber band that pulls the curve straight — it still fits the data but cannot wiggle too much. Lasso goes further: it can snap the band entirely for features it deems irrelevant, removing them from the model.
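The "snap the band" behaviour of Lasso is easy to demonstrate. A sketch with scikit-learn on synthetic data where only 2 of 10 features matter (the α values are illustrative, not tuned):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 200)  # only features 0 and 1 are real

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print(np.sum(np.abs(ridge.coef_) < 1e-6))  # 0 — ridge shrinks but never zeroes
print(np.sum(np.abs(lasso.coef_) < 1e-6))  # lasso zeroes out the irrelevant features
```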
Logistic Regression
Despite the name, logistic regression is a classification model. It predicts the probability that an observation belongs to a class by applying a sigmoid function to a linear combination of features:

P(y = 1 | x) = σ(β₀ + β₁x₁ + … + βₚxₚ), where σ(z) = 1 / (1 + e⁻ᶻ)

The sigmoid squishes any real number into (0, 1), making the output a valid probability.

Loss function — cross-entropy (log loss), not MSE:

Loss = −(1/n) Σᵢ [ yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ) ]
Important (Why not MSE for classification?)
MSE combined with the sigmoid creates a non-convex loss surface full of local minima — gradient descent can get stuck. Cross-entropy is convex, which guarantees convergence to the global minimum.
The default decision boundary is p > 0.5 → class 1, but this threshold is adjustable. Lowering it increases recall (catches more positives at the cost of more false alarms).
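Threshold adjustment is one line of code. A sketch with scikit-learn on synthetic imbalanced data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Synthetic 2-class data, roughly 90/10 imbalance
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

proba = model.predict_proba(X)[:, 1]        # P(class 1) from the sigmoid
default_preds = (proba > 0.5).astype(int)   # default boundary
recall_preds = (proba > 0.3).astype(int)    # lower threshold → more positives flagged

print(default_preds.sum(), recall_preds.sum())  # the lower threshold flags at least as many
```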
Bias-Variance Tradeoff
Every model’s total error decomposes into three parts:

Total Error = Bias² + Variance + Irreducible Noise
| | High Bias (Underfitting) | High Variance (Overfitting) |
|---|---|---|
| Cause | Model too simple, wrong assumptions | Model too complex, memorises noise |
| Training error | High | Low |
| Test error | High | High |
| Example | Linear model on curved data | Deep decision tree on 100 rows |
| Fix | More features, complex model, less regularization | More data, regularization, simpler model, ensembles |
Intuition (The dartboard analogy)
Imagine throwing darts at a bullseye. High bias — darts consistently land to the left of centre (systematic aim error, wrong assumptions). High variance — darts scattered all over the board (inconsistent, sensitive to tiny hand movements). The ideal model has both low bias and low variance: darts clustered around the bullseye.
Tip (Interview visual)
Draw the U-shaped test error curve against model complexity. Show training error always decreasing, test error bottoming out then rising. This single diagram communicates the entire tradeoff and tends to impress interviewers who expect only verbal explanations.
Decision Trees
A decision tree splits data by asking yes/no questions on features, building a hierarchy of if-else rules. Each leaf node holds a prediction.
How the tree decides which question to ask: it picks the split that gives the greatest reduction in impurity, measured by Gini impurity (1 − Σₖ pₖ²) or entropy (−Σₖ pₖ log pₖ).
| Hyperparameter | What it controls |
|---|---|
| `max_depth` | How deep the tree grows — lower depth = simpler model |
| `min_samples_split` | Minimum samples needed to attempt a split |
| `min_samples_leaf` | Minimum samples required in any leaf |
| `max_features` | Number of features randomly considered per split |
Intuition (Decision trees are flowcharts)
A decision tree is exactly the kind of yes/no flowchart you would draw on a whiteboard: “Is the customer’s age above 30? If yes, is their income above $50k? If no, predict churn.” The algorithm automates finding the best questions to ask and in what order — based purely on what maximises class purity at each step.
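You can print the learned flowchart directly. A sketch with scikit-learn's `export_text` on the built-in iris dataset:

```python
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# The learned if-else flowchart, as plain text
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```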
Ensemble Methods — The Interview Workhorses
Random Forest
Random Forest trains many deep decision trees independently on different random subsets of data and features, then averages their predictions.
Why does this work? Each tree makes different errors because it sees different data and features. Averaging many imperfect trees cancels out individual mistakes, reducing variance without increasing bias.
Key ideas:
- Bagging — each tree is trained on a bootstrap sample (roughly 63% of data, sampled with replacement)
- Random feature subsets — at each split, only a random subset of features is considered. This is what decorrelates the trees and makes RF better than plain bagging.
- Out-of-bag (OOB) error — the ~37% of samples not used in each tree act as a free validation set, no cross-validation needed.
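The "free validation set" idea is one flag away in scikit-learn. A sketch on synthetic data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# oob_score=True scores each tree on the ~37% of rows it never saw during training
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

print(rf.oob_score_)                # free validation accuracy — no cross-validation needed
print(rf.feature_importances_[:3])  # feature importance also comes for free
```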
Tip (When to reach for Random Forest)
Tabular data, high-dimensional features, quick reliable baseline, and situations where you need feature importance without extra work. Robust to outliers and noisy features.
Gradient Boosting (XGBoost, LightGBM, CatBoost)
Where Random Forest builds trees in parallel, Gradient Boosting builds them sequentially. Each new tree corrects the mistakes of the current ensemble by fitting the residual errors (the negative gradient of the loss).
F_m(x) = F_{m−1}(x) + η · h_m(x)

where h_m is the new tree correcting the residuals, and η is the learning rate that controls how much weight each new tree gets.
| Library | Key innovation |
|---|---|
| XGBoost | Regularization in the tree objective; handles missing values; column subsampling |
| LightGBM | Leaf-wise growth (faster); histogram-based splitting (memory efficient) |
| CatBoost | Built-in categorical encoding; ordered boosting prevents target leakage |
```python
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=1000,         # Set high — early stopping will find the real number
    learning_rate=0.05,        # Lower = better generalization, slower training
    max_depth=4,               # Keep shallow for boosting (3–6 is typical)
    subsample=0.8,             # Fraction of rows per tree
    colsample_bytree=0.8,      # Fraction of features per tree
    early_stopping_rounds=50,  # Stop if val metric doesn't improve for 50 rounds
    eval_metric='auc',
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=100)
```

Important (Always use early stopping with boosting)
Setting early_stopping_rounds is the single most impactful practice when using gradient boosting. It automatically finds the optimal number of trees by monitoring held-out performance, preventing overfitting without manually tuning n_estimators.
Model Evaluation — Choosing the Right Metric
The choice of metric is as important as the choice of model. Using the wrong one can lead you to ship a model that looks great on paper but fails in production.
Classification Metrics
| Metric | What it measures | Use when |
|---|---|---|
| Accuracy | (TP + TN) / all predictions | Classes are balanced — misleading otherwise |
| Precision | TP / (TP + FP) | False positives are costly (spam filter, fraud alert) |
| Recall | TP / (TP + FN) | False negatives are costly (cancer screening, safety) |
| F1 Score | Harmonic mean of precision and recall | Need balance between precision and recall |
| ROC-AUC | Area under ROC curve | Threshold-invariant; good general-purpose metric |
| PR-AUC | Area under Precision-Recall curve | Heavily imbalanced datasets |
Regression Metrics
| Metric | Formula | Use when |
|---|---|---|
| MAE | mean of \|y − ŷ\| | Outliers present; interpretable in original units |
| MSE | mean of (y − ŷ)² | Want to heavily penalize large errors |
| RMSE | √MSE | Same units as target — easier to communicate |
| MAPE | mean of \|(y − ŷ) / y\| | Percentage error — intuitive for business audiences |
| R² | 1 − SS_res / SS_tot | Proportion of variance explained (0 to 1) |
Important (Always ask this before choosing a metric)
What is the cost of a false positive versus a false negative in your business context? A fraud detection model where missing fraud costs far more than a false alert should optimise for recall, not accuracy. This question signals business maturity to interviewers.
Python for Data Science
Pandas — Core Operations
```python
import pandas as pd
import numpy as np

# Load and inspect
df = pd.read_csv('data.csv')
print(df.shape, df.dtypes, df.describe(), df.isnull().sum())

# Filtering
df[df['age'] > 30]
df.query('age > 30 and salary > 50000')

# GroupBy — the most tested operation in DS interviews
df.groupby('segment')['revenue'].agg(['mean', 'sum', 'count']).reset_index()

# Apply a custom function row-by-row
df['new_col'] = df['col'].apply(lambda x: x ** 2 if x > 0 else 0)

# Pivot table
df.pivot_table(values='sales', index='region', columns='quarter', aggfunc='sum')

# Merge (like SQL JOIN)
pd.merge(df1, df2, on='customer_id', how='left')

# Handle missing values
df['col'].fillna(df['col'].median())  # impute with median
df.dropna(subset=['critical_col'])    # drop rows where this column is missing
```

Scikit-learn — The Interview Pipeline
Always wrap preprocessing and model steps in a Pipeline. This is not just a style preference — it prevents data leakage, the single most common mistake in ML interviews and production systems.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

numeric_features = ['age', 'income']
categorical_features = ['region', 'segment']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])

pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
print('ROC-AUC:', roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1]))
```

Important (Why Pipeline prevents data leakage)
If you fit a StandardScaler on the full dataset before the train/test split, the scaler has seen the test set’s mean and variance — your model has indirectly peeked at unseen data. Wrapping everything in Pipeline guarantees that any fitting step (scaling, encoding) only ever sees training data, keeping the test set truly held-out.
Handling Class Imbalance
When one class is far rarer than another (e.g., 1% fraud, 99% legitimate), accuracy becomes meaningless — a model that always predicts “not fraud” is 99% accurate and completely useless. Four strategies in order of how to try them:
```python
# 1. Class weights — easiest, try this first
RandomForestClassifier(class_weight='balanced')

# 2. SMOTE — creates synthetic minority samples to balance the dataset
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# 3. Threshold tuning — 0.5 is an arbitrary default
# Lower threshold = more recall (catches more positives, more false alarms)
y_pred = (model.predict_proba(X_test)[:, 1] > 0.3).astype(int)

# 4. Evaluate with PR-AUC, not ROC-AUC
from sklearn.metrics import average_precision_score
pr_auc = average_precision_score(y_test, model.predict_proba(X_test)[:, 1])
```

SQL — Window Functions & Advanced Patterns
SQL is tested at every senior data science interview. Knowing window functions is what separates junior from senior answers.
Window Functions
A window function computes a value across a set of rows related to the current row without collapsing the result into one row — unlike GROUP BY. The OVER() clause defines which rows to look at.
```sql
-- General syntax
function() OVER (PARTITION BY col ORDER BY col ROWS BETWEEN ... AND ...)
```

Ranking
```sql
SELECT *,
  ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC) AS row_num,
  RANK()       OVER (PARTITION BY region ORDER BY sales DESC) AS rank,
  DENSE_RANK() OVER (PARTITION BY region ORDER BY sales DESC) AS dense_rank
FROM sales_table;
```

| Function | Tie behaviour | Example output |
|---|---|---|
| `ROW_NUMBER()` | Always unique | 1, 2, 3, 4 |
| `RANK()` | Skips after tie | 1, 2, 2, 4 |
| `DENSE_RANK()` | No skip after tie | 1, 2, 2, 3 |
Running Totals and Moving Averages
```sql
SELECT
  date,
  sales,
  SUM(sales) OVER (ORDER BY date) AS running_total,
  AVG(sales) OVER (
    ORDER BY date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS moving_avg_7day
FROM daily_sales;
```

LAG / LEAD — Compare Rows to Previous or Next
```sql
SELECT
  date,
  revenue,
  LAG(revenue, 1)  OVER (ORDER BY date) AS prev_day_revenue,
  LEAD(revenue, 1) OVER (ORDER BY date) AS next_day_revenue,
  revenue - LAG(revenue, 1) OVER (ORDER BY date) AS day_over_day_change
FROM daily_revenue;
```

NTILE — Percentile Buckets
```sql
SELECT
  customer_id,
  lifetime_value,
  NTILE(4)  OVER (ORDER BY lifetime_value DESC) AS ltv_quartile,
  NTILE(10) OVER (ORDER BY lifetime_value DESC) AS ltv_decile
FROM customers;
```

CTEs and Complex Patterns
```sql
-- Top 3 customers by revenue per region
WITH customer_revenue AS (
  SELECT customer_id, region, SUM(order_amount) AS total_revenue
  FROM orders
  WHERE order_date >= '2024-01-01'
  GROUP BY customer_id, region
),
ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_revenue DESC) AS rn
  FROM customer_revenue
)
SELECT * FROM ranked WHERE rn <= 3;

-- Month-over-month growth rate
WITH monthly AS (
  SELECT DATE_TRUNC('month', order_date) AS month, SUM(amount) AS revenue
  FROM orders
  GROUP BY 1
)
SELECT
  month,
  revenue,
  LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
  ROUND(
    100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
    / NULLIF(LAG(revenue) OVER (ORDER BY month), 0),
    2
  ) AS mom_growth_pct
FROM monthly;
```

Classic Interview SQL Problems
```sql
-- Second highest salary
SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1;

-- Customers who bought in January but not February
SELECT DISTINCT customer_id FROM orders WHERE MONTH(order_date) = 1
EXCEPT
SELECT DISTINCT customer_id FROM orders WHERE MONTH(order_date) = 2;

-- First date cumulative sales crossed 1 million
WITH running AS (
  SELECT date, SUM(sales) OVER (ORDER BY date) AS cum_sales
  FROM daily_sales
)
SELECT MIN(date) FROM running WHERE cum_sales >= 1000000;

-- 7-day retention: users who were active 7 days after their first activity
WITH first_seen AS (
  SELECT user_id, MIN(activity_date) AS cohort_date
  FROM user_activity
  GROUP BY user_id
)
SELECT f.user_id
FROM first_seen f
JOIN user_activity a
  ON f.user_id = a.user_id
 AND a.activity_date = f.cohort_date + INTERVAL '7 days';
```

A/B Testing & Causal Inference
A/B Test Design — End to End
Before running an experiment, answer six questions. Skipping any of them leads to results you cannot trust.
- Primary metric — one metric only. Testing many metrics simultaneously inflates the false positive rate.
- Minimum Detectable Effect (MDE) — what is the smallest improvement that is meaningful to the business?
- Required sample size — calculated from significance level, power, MDE, and baseline conversion rate.
- Randomization unit — user, session, or request? The unit must match the level at which the metric is measured.
- Duration — run until the target sample size is reached, and for at least one full business cycle (e.g., one full week to capture weekday/weekend behaviour).
- Guardrail metrics — metrics that must not worsen (e.g., latency, revenue per user) even if the primary metric improves.
Sample Size Calculation
```python
from scipy import stats
import numpy as np

def sample_size(baseline_rate, mde, alpha=0.05, power=0.80):
    """
    baseline_rate : current conversion rate, e.g. 0.10 for 10%
    mde           : minimum detectable effect, e.g. 0.02 to detect a 2pp lift
    alpha         : significance level (false positive rate)
    power         : 1 - beta (probability of detecting a real effect)
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde
    pooled = (p1 + p2) / 2

    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-tailed
    z_beta = stats.norm.ppf(power)

    n = (
        z_alpha * np.sqrt(2 * pooled * (1 - pooled))
        + z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2 / mde ** 2

    return int(np.ceil(n))

# 10% baseline, want to detect a 2 percentage point lift
print(sample_size(0.10, 0.02))  # → ~3,843 per group
```

Common A/B Testing Pitfalls
Note (The peeking problem)
Checking results before the planned sample size is reached inflates the Type I error rate dramatically. If you peek five times during an experiment, your effective false positive rate is closer to 22% than 5%. Commit to a fixed sample design, or use sequential testing methods (SPRT) that are designed for continuous monitoring.
Note (Network effects and SUTVA violation)
If users in control and treatment interact with each other (social features, marketplaces, referral programs), the treatment bleeds into the control group and the experiment is contaminated. Use cluster randomization — assign entire geographies, cohorts, or friend groups to one arm — instead of individual-level randomization.
Note (Multiple comparisons and p-hacking)
Testing 20 metrics at α = 0.05 expects one false positive by chance alone. Apply the Bonferroni correction: use α/20 = 0.0025 as your threshold. Better still — pre-register your primary metric before the experiment starts.
Note (Simpson's Paradox)
Aggregated results can reverse when broken into subgroups. A treatment that looks harmful overall can be beneficial for every individual subgroup, and vice versa. Always segment your analysis after seeing the overall result.
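The classic kidney-stone numbers make the reversal concrete. A sketch with pandas (treatment A wins in both subgroups yet loses overall, because A was given mostly to the harder large-stone cases):

```python
import pandas as pd

df = pd.DataFrame({
    'treatment':  ['A', 'A', 'B', 'B'],
    'stone_size': ['small', 'large', 'small', 'large'],
    'successes':  [81, 192, 234, 55],
    'patients':   [87, 263, 270, 80],
})

# Aggregated view: B looks better overall (~83% vs ~78%)
overall = df.groupby('treatment')[['successes', 'patients']].sum()
overall_rate = overall['successes'] / overall['patients']
print(overall_rate)

# Segmented view: A is better within EVERY stone size
by_group = df.set_index(['treatment', 'stone_size'])
rate = by_group['successes'] / by_group['patients']
print(rate)
```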
Causal Inference — When You Cannot Randomize
A/B tests require the ability to randomize. In practice, randomization is often impossible: a policy was already rolled out, an ethical concern prevents a control group, or the treatment is something historical. Quasi-experimental methods let you estimate causal effects from observational data.
Difference-in-Differences (DiD)
When to use: treatment was rolled out to some groups (cities, cohorts, stores) but not others, and you have data from before and after the rollout.
The key assumption is parallel trends: without the treatment, the treated and control groups would have evolved at the same rate. You cannot prove this, but you can make it plausible.
```python
import statsmodels.formula.api as smf

# The coefficient on treated:post is the DiD estimate
result = smf.ols(
    'outcome ~ treated + post + treated:post + controls', data=df
).fit()
print(result.summary())
```

Tip (Validating parallel trends)
Plot treated and control group outcomes over time before the treatment date. If they track each other closely, the assumption is plausible. Run a placebo test: apply the DiD method to a pre-treatment period where you know the effect should be zero. A significant result there signals a problem.
Regression Discontinuity Design (RDD)
When to use: treatment is assigned based on a hard threshold on a continuous variable — credit score ≥ 700 gets a loan, test score ≥ 70 earns a scholarship.
Intuition: people just above and just below the threshold are essentially identical except for which side they fell on. Comparing their outcomes in a narrow window around the cutoff gives a clean causal estimate.
- Sharp RDD — everyone above the threshold gets treatment (hard cutoff)
- Fuzzy RDD — treatment probability jumps at the threshold (use IV to estimate)
- McCrary test — verify that no one is gaming their score to cross the cutoff (check for a discontinuity in the density of the running variable at the threshold)
Instrumental Variables (IV)
When to use: your variable of interest X is correlated with unobserved confounders (endogeneity problem), making OLS estimates biased.
A valid instrument Z must satisfy two conditions:
- Relevance — Z affects X (testable: first-stage F-statistic should be > 10)
- Exclusion restriction — Z affects Y only through X, not directly (untestable, requires theory)
Two-Stage Least Squares (2SLS):
- Stage 1: regress X on Z → get X̂ (the part of X explained only by Z)
- Stage 2: regress Y on X̂ → the coefficient is the causal estimate
Classic examples: draft lottery as an instrument for military service; distance to college as an instrument for years of education.
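The two stages can be sketched manually with NumPy on synthetic data (in practice use a dedicated library, which also corrects the standard errors; all coefficients here are invented, with a true causal effect of 2.0):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)                      # instrument
u = rng.normal(size=n)                      # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)        # X is endogenous: driven by both Z and u
y = 2.0 * x + 3.0 * u + rng.normal(size=n)  # true causal effect of X on Y is 2.0

# Naive OLS slope is biased upward — u pushes both x and y in the same direction
ols = np.polyfit(x, y, 1)[0]

# Stage 1: regress X on Z, keep the fitted values (the part of X explained by Z)
b1, b0 = np.polyfit(z, x, 1)
x_hat = b1 * z + b0

# Stage 2: regress Y on the fitted values — the slope is the causal estimate
iv = np.polyfit(x_hat, y, 1)[0]

print(ols)  # noticeably above 2.0 (confounded)
print(iv)   # close to 2.0
```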
Caveat (IV estimates LATE, not ATE)
IV estimates the Local Average Treatment Effect — the causal effect for compliers (people whose treatment status was actually changed by the instrument). This may not generalize to the broader population.
Forecasting, Anomaly Detection & Price Elasticity
Time Series Forecasting
Any time series can be decomposed into four components:
| Component | Description | Example |
|---|---|---|
| Trend | Long-term direction | Revenue growing year over year |
| Seasonality | Periodic repeating pattern | Holiday sales spike every December |
| Cyclical | Multi-year economic cycles | Revenue declining during a recession |
| Noise | Unpredictable random variation | Day-to-day fluctuations |
ARIMA
ARIMA is the classical statistical approach for univariate time series. The three parameters control what information the model uses:
| Parameter | What it adds | How to choose |
|---|---|---|
| p (AR order) | Past values of Y | Look at the Partial Autocorrelation Function (PACF) |
| d (differencing) | Remove trend to make series stationary | Augmented Dickey-Fuller test — difference until the series is stationary |
| q (MA order) | Past forecast errors | Look at the Autocorrelation Function (ACF) |
For seasonal data: SARIMA(p, d, q)(P, D, Q)ₘ, where m is the seasonal period (e.g., 12 for monthly data with annual seasonality).
Prophet
Prophet is Meta’s open-source library designed for business time series with strong seasonality, holidays, and missing data. It fits an additive model:

y(t) = g(t) + s(t) + h(t) + εₜ   (trend + seasonality + holiday effects + noise)
```python
from prophet import Prophet
import pandas as pd

# Prophet requires columns named 'ds' (date) and 'y' (value)
df = pd.read_csv('sales.csv').rename(columns={'date': 'ds', 'sales': 'y'})

model = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=True,
    daily_seasonality=False,
    changepoint_prior_scale=0.05,  # Flexibility of the trend — higher = more flexible
)
model.add_country_holidays(country_name='US')
model.fit(df)

future = model.make_future_dataframe(periods=90)  # 90-day forecast
forecast = model.predict(future)

model.plot(forecast)             # Full forecast with uncertainty bands
model.plot_components(forecast)  # Decomposition: trend + each seasonality component
```

Tip (ARIMA vs Prophet)
Price Elasticity of Demand
Price elasticity measures how sensitive demand is to a change in price:

E = (% change in quantity demanded) / (% change in price)
| Elasticity | Meaning | Typical examples |
|---|---|---|
| \|E\| > 1 | Elastic — demand drops significantly with a price increase | Luxury goods, commodities, price-sensitive categories |
| \|E\| < 1 | Inelastic — demand barely changes | Necessities, utilities, strong brand loyalty |
| \|E\| = 1 | Unit elastic — revenue stays constant | Rare in practice |
Estimate elasticity with a log-log regression — the coefficient on log price is the elasticity directly:
```python
import numpy as np
from sklearn.linear_model import LinearRegression

df['log_sales'] = np.log(df['sales'])
df['log_price'] = np.log(df['price'])

X = df[['log_price', 'log_income', 'log_competitor_price']]
y = df['log_sales']

model = LinearRegression().fit(X, y)
elasticity = model.coef_[0]  # Typically negative: higher price → lower demand

print(f'Price elasticity: {elasticity:.3f}')

# Optimal price (Lerner condition — valid only when |E| > 1)
marginal_cost = 10
optimal_price = marginal_cost * elasticity / (elasticity + 1)
print(f'Optimal price: ${optimal_price:.2f}')
```

Anomaly Detection
| Method | How it works | Use when |
|---|---|---|
| Z-score | Flag points more than 3σ from the mean | Univariate, stationary, normally distributed |
| IQR method | Flag points outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] | Skewed data, robust to outliers |
| Isolation Forest | Anomalies are isolated in fewer tree splits | Multivariate, high-dimensional |
| Local Outlier Factor (LOF) | Compares local density to neighbours | Varying cluster densities |
| Autoencoder | High reconstruction error flags anomalies | Complex high-dimensional patterns |
| Prophet residuals | Fit a forecast, flag large residuals | Time series anomaly detection |
| CUSUM / EWMA | Sequential monitoring of cumulative deviations | Real-time production monitoring |
```python
from sklearn.ensemble import IsolationForest

model = IsolationForest(contamination=0.05, random_state=42)  # 5% expected anomaly rate
preds = model.fit_predict(X)  # -1 = anomaly, 1 = normal
anomalies = X[preds == -1]
```