Workshop Details
Building Counterfactual Scenario Methods for Feature Evaluations
From Potential Outcomes to Causal Forests — A Hands-On Workshop in Retail Analytics
1111 110th Ave NE, Bellevue, WA 98004
    import pandas as pd
    import numpy as np

    # Load UCI Online Retail dataset
    df = pd.read_excel('Online_Retail.xlsx')

    # Step 1: Remove cancellations and invalid entries
    df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]
    df = df[df['CustomerID'].notna()]
    df = df[df['UnitPrice'] > 0]
    df = df[df['Quantity'] > 0]

    # Step 2: Compute TotalSpend per transaction line
    df['TotalSpend'] = df['Quantity'] * df['UnitPrice']

    # Step 3: Compute RFM features at customer level
    reference_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)
    rfm = df.groupby('CustomerID').agg(
        Recency=('InvoiceDate', lambda x: (reference_date - x.max()).days),
        Frequency=('InvoiceNo', 'nunique'),
        Monetary=('TotalSpend', 'sum')
    ).reset_index()

    # Step 4: Define treatment (discount exposure)
    product_median_price = df.groupby('StockCode')['UnitPrice'].transform('median')
    df['IsDiscounted'] = (df['UnitPrice'] < product_median_price).astype(int)

    # Customer-level treatment: majority of purchases at discount
    customer_treatment = df.groupby('CustomerID')['IsDiscounted'].mean()
    customer_treatment = (customer_treatment > 0.5).astype(int).reset_index()
    customer_treatment.columns = ['CustomerID', 'Treatment']

    # Merge treatment with RFM features
    analysis_df = rfm.merge(customer_treatment, on='CustomerID')

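The treatment rule in Step 4 can be easier to follow on a handful of rows. The sketch below applies the same two steps (price below the product's median, then a majority of discounted lines per customer) to a made-up frame; the `toy` data and its values are hypothetical and are not drawn from the UCI dataset.

```python
import pandas as pd

# Hypothetical toy transactions: two products, three customers.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3, 3],
    "StockCode": ["A", "B", "A", "B", "A", "B"],
    "UnitPrice": [8.0, 4.0, 10.0, 5.0, 12.0, 5.0],
})

# A line counts as discounted when its price is below that product's median.
median_price = toy.groupby("StockCode")["UnitPrice"].transform("median")
toy["IsDiscounted"] = (toy["UnitPrice"] < median_price).astype(int)

# A customer is treated when more than half of their lines are discounted.
discount_share = toy.groupby("CustomerID")["IsDiscounted"].mean()
treatment = (discount_share > 0.5).astype(int)

print(toy)
print(treatment)
```

Because the comparison is strict, a customer who always pays exactly the median price is never flagged as discounted, which is worth keeping in mind for products whose price rarely changes.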
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import NearestNeighbors
    import numpy as np

    # Step 1: Estimate propensity scores
    # Monetary is the outcome below, so it is left out of the covariates
    X = analysis_df[['Recency', 'Frequency']].values
    T = analysis_df['Treatment'].values
    Y = analysis_df['Monetary'].values

    ps_model = LogisticRegression(max_iter=1000)
    ps_model.fit(X, T)
    ps_scores = ps_model.predict_proba(X)[:, 1]

    # Step 2: Propensity Score Matching (nearest neighbor, with replacement)
    treated_idx = np.where(T == 1)[0]
    control_idx = np.where(T == 0)[0]

    nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
    nn.fit(ps_scores[control_idx].reshape(-1, 1))
    distances, indices = nn.kneighbors(ps_scores[treated_idx].reshape(-1, 1))
    matched_control_idx = control_idx[indices.flatten()]

    # ATT via PSM
    att_psm = Y[treated_idx].mean() - Y[matched_control_idx].mean()

    # Step 3: IPW Estimation
    ate_ipw = (np.sum(T * Y / ps_scores) - np.sum((1 - T) * Y / (1 - ps_scores))) / len(Y)

    print(f"ATT (PSM): {att_psm:.2f}")
    print(f"ATE (IPW): {ate_ipw:.2f}")

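The IPW estimate above divides by raw propensity scores, so a few scores near 0 or 1 can dominate the result. One common mitigation, sketched here as an illustration rather than as the workshop's own method, is to clip the scores and normalize the weights (the Hájek form), then check covariate balance with a standardized mean difference. The helper names `hajek_ipw_ate` and `standardized_mean_diff`, the clipping bounds, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def hajek_ipw_ate(T, Y, ps, clip=(0.05, 0.95)):
    """Normalized (Hajek) IPW estimate of the ATE with clipped propensity scores."""
    T = np.asarray(T, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Clipping bounds the weights at 1 / clip[0] and 1 / (1 - clip[1]).
    ps = np.clip(np.asarray(ps, dtype=float), clip[0], clip[1])
    w_treated = T / ps
    w_control = (1.0 - T) / (1.0 - ps)
    # Dividing by the sum of weights (not n) makes each arm a weighted mean.
    mean_treated = np.sum(w_treated * Y) / np.sum(w_treated)
    mean_control = np.sum(w_control * Y) / np.sum(w_control)
    return mean_treated - mean_control


def standardized_mean_diff(x, T, weights=None):
    """Absolute standardized mean difference of covariate x between arms."""
    x = np.asarray(x, dtype=float)
    T = np.asarray(T)
    w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)
    treated, control = T == 1, T == 0
    mean_t = np.average(x[treated], weights=w[treated])
    mean_c = np.average(x[control], weights=w[control])
    # Unweighted pooled SD keeps the scale fixed before and after weighting.
    pooled_sd = np.sqrt((x[treated].var(ddof=1) + x[control].var(ddof=1)) / 2.0)
    return abs(mean_t - mean_c) / pooled_sd


# Synthetic check with a known effect of 2.0 (all numbers are made up).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
true_ps = 1.0 / (1.0 + np.exp(-1.5 * x))
T = rng.binomial(1, true_ps)
Y = 2.0 * T + 3.0 * x + rng.normal(size=n)

naive = Y[T == 1].mean() - Y[T == 0].mean()
ate_hajek = hajek_ipw_ate(T, Y, true_ps)

ps_clipped = np.clip(true_ps, 0.05, 0.95)
ipw_weights = T / ps_clipped + (1 - T) / (1 - ps_clipped)
smd_raw = standardized_mean_diff(x, T)
smd_weighted = standardized_mean_diff(x, T, weights=ipw_weights)

print(f"Naive difference: {naive:.2f}")
print(f"Hajek IPW ATE:    {ate_hajek:.2f}")
print(f"SMD before / after weighting: {smd_raw:.2f} / {smd_weighted:.2f}")
```

Clipping trades a small bias for lower variance, so the bounds are a tuning choice. On real data the same helpers could be called with `ps_scores` in place of `true_ps`.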
    library(MatchIt)
    library(sandwich)
    library(lmtest)

    # Propensity score matching
    m.out <- matchit(Treatment ~ Recency + Frequency + Monetary,
                     data = analysis_df,
                     method = "nearest",
                     distance = "glm",
                     ratio = 1)

    summary(m.out)  # Balance diagnostics

    # Extract matched data and estimate ATT
    m.data <- match.data(m.out)
    fit <- lm(Outcome ~ Treatment, data = m.data, weights = weights)
    coeftest(fit, vcov. = vcovCL, cluster = ~subclass)

    from econml.dml import LinearDML
    from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

    # Define variables
    Y = analysis_df['Monetary'].values                 # Outcome
    T = analysis_df['Treatment'].values                # Treatment
    X = analysis_df[['Recency', 'Frequency']].values   # Effect modifiers
    # No extra controls: Monetary is the outcome and must not be used as a control,
    # and Recency/Frequency already enter the nuisance models through X
    W = None

    # Initialize DML with flexible ML models for nuisance functions
    dml = LinearDML(
        model_y=GradientBoostingRegressor(n_estimators=100, max_depth=3),
        model_t=GradientBoostingClassifier(n_estimators=100, max_depth=3),
        discrete_treatment=True,  # Binary treatment, so model_t is a classifier
        cv=5,                     # 5-fold cross-fitting
        random_state=42
    )

    # Fit the model
    dml.fit(Y, T, X=X, W=W)

    # Get ATE and confidence interval
    ate = dml.ate(X)
    ci = dml.ate_interval(X, alpha=0.05)
    print(f"ATE: {ate:.2f}, 95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")

    # Get heterogeneous effects
    cate = dml.effect(X)

    import numpy as np
    from econml.dml import CausalForestDML
    from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

    # Initialize Causal Forest
    cf = CausalForestDML(
        model_y=RandomForestRegressor(n_estimators=100, min_samples_leaf=10),
        model_t=RandomForestClassifier(n_estimators=100, min_samples_leaf=10),
        discrete_treatment=True,  # Binary treatment, so model_t is a classifier
        n_estimators=2000,
        min_samples_leaf=5,
        random_state=42
    )

    # Fit on retail data (Y, T, X, W as defined for the DML model)
    cf.fit(Y, T, X=X, W=W)

    # Estimate individual treatment effects
    cate_estimates = cf.effect(X)
    cate_intervals = cf.effect_interval(X, alpha=0.05)

    # Identify high-impact customer segments
    analysis_df['CATE'] = cate_estimates
    high_impact = analysis_df[analysis_df['CATE'] > np.percentile(cate_estimates, 75)]
    low_impact = analysis_df[analysis_df['CATE'] < np.percentile(cate_estimates, 25)]

    print(f"High-impact segment mean CATE: {high_impact['CATE'].mean():.2f}")
    print(f"Low-impact segment mean CATE: {low_impact['CATE'].mean():.2f}")

    library(grf)

    # Train causal forest
    cf <- causal_forest(
      X = as.matrix(analysis_df[, c("Recency", "Frequency", "Monetary")]),
      Y = analysis_df$Outcome,
      W = analysis_df$Treatment,
      num.trees = 2000,
      honesty = TRUE,
      sample.fraction = 0.5
    )

    # Estimate CATE
    cate <- predict(cf, estimate.variance = TRUE)

    # Variable importance for heterogeneity drivers
    varimp <- variable_importance(cf)

    import dowhy
    from dowhy import CausalModel

    # DoWhy 4-step workflow: Model → Identify → Estimate → Refute
    model = CausalModel(
        data=analysis_df,
        treatment='Treatment',
        outcome='Monetary',
        common_causes=['Recency', 'Frequency']
    )

    # Identify causal effect
    identified = model.identify_effect()

    # Estimate using IPW
    estimate = model.estimate_effect(identified,
                                     method_name="backdoor.propensity_score_weighting")

    # Refutation: placebo treatment
    refute_placebo = model.refute_estimate(identified, estimate,
                                           method_name="placebo_treatment_refuter")

    # Refutation: random common cause
    refute_random = model.refute_estimate(identified, estimate,
                                          method_name="random_common_cause")

    print(f"Estimated effect: {estimate.value:.2f}")
    print(f"Placebo refutation p-value: {refute_placebo.refutation_result['p_value']:.4f}")

| Method | Best Use Case | Key Strength | Limitation |
|---|---|---|---|
| PSM | Promotion impact on similar customers | Intuitive, transparent matching | Requires strong overlap |
| IPW | Population-level ATE estimation | Uses full sample, no data discarded | Sensitive to extreme weights |
| DML | Complex confounding with ML flexibility | Root-n consistent, model-agnostic | Requires correct functional form for θ |
| Causal Forests | Heterogeneous effects across segments | Non-parametric CATE, honest inference | Computationally intensive |

| Lesson | Core Method | Key Formula | Retail Application |
|---|---|---|---|
| 1 | Potential Outcomes | ATE = E[Y(1) – Y(0)] | Framing causal questions |
| 2 | Data Engineering | RFM features, treatment construction | Dataset preparation |
| 3 | PSM & IPW | e(X) = P(W=1 \| X); τ̂_IPW | Promotion impact |
| 4 | Double ML | θ̌₀ with cross-fitting | Pricing & marketing ROI |
| 5 | Causal Forests | τ̂(x) = (1/B) Σ τ̂ᵇ(x) | Customer segmentation |
| 6 | Integration | Refutation + ROI | Business deployment |
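
Lesson 1's formula, ATE = E[Y(1) - Y(0)], involves two potential outcomes per customer, of which real data reveals only one. The simulation below is a minimal sketch of why that matters: it generates both outcomes, then compares a confounded assignment with a randomized one. Every number in it (the `loyalty` covariate, the baseline spend, the effect size of about 5) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical standardized customer covariate.
loyalty = rng.normal(size=n)

# Both potential outcomes exist only in simulation; real data shows one.
y0 = 50 + 10 * loyalty + rng.normal(scale=5, size=n)  # spend without discount
y1 = y0 + 5 + 2 * loyalty                              # spend with discount

true_ate = np.mean(y1 - y0)

# Confounded assignment: loyal customers are more likely to get the discount.
p_treat = 1 / (1 + np.exp(-loyalty))
t_conf = rng.binomial(1, p_treat)
y_conf = np.where(t_conf == 1, y1, y0)
naive_conf = y_conf[t_conf == 1].mean() - y_conf[t_conf == 0].mean()

# Randomized assignment: treatment is independent of loyalty.
t_rand = rng.binomial(1, 0.5, size=n)
y_rand = np.where(t_rand == 1, y1, y0)
naive_rand = y_rand[t_rand == 1].mean() - y_rand[t_rand == 0].mean()

print(f"True ATE:                    {true_ate:.2f}")
print(f"Difference in means (conf.): {naive_conf:.2f}")
print(f"Difference in means (rand.): {naive_rand:.2f}")
```

Under confounding the simple difference in means mixes the discount effect with the spending gap between loyal and less loyal customers, which is the gap the matching, weighting, and DML lessons aim to close.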
Senior Statistician, Amazon
With over nine years in data science and extensive expertise in decision analytics and Python, Dharmateja Priyadarshi Uddandarao specializes in developing causal scientific models to evaluate the economic impact of high-value actions and products. His work focuses on designing causal experiments and forecasting financial scenarios. A prolific author, he contributes to publications such as The Data Scientist, AI Journ, Analytics Vidhya, ACM, Silicon Valley Journal, and HackerNoon, covering causal inference, econometrics, and applied statistical methodology. Dharmateja also contributes to the field through mentoring and committee roles in professional organizations and conferences.