Data Discovery Workshop Series – Building Counterfactual Scenario Methods for Feature Evaluations

Workshop Details

From Potential Outcomes to Causal Forests — A Hands-On Workshop in Retail Analytics

Instructor: Dharmateja Priyadarshi Uddandarao
Date: May 9, 2026
In-Person Venue: Bellevue Library Meeting Hall 2, 1111 110th Ave NE, Bellevue, WA 98004
Duration: 75 minutes
Audience: Data Scientists, Statisticians & Analysts
Registrations: 4 confirmed attendees
Causal Inference · Counterfactual Methods · Propensity Score Matching · Double Machine Learning · Causal Forests · Retail Analytics · Python & R

Workshop Content

A Hands-On Workshop: From Potential Outcomes to Causal Forests in Retail Analytics

6 Lessons · UCI Online Retail Dataset · Data Scientists & Statisticians · Python & R
Dataset: 541,909 transactions
Methods: 4 causal inference methods
Applications: 5+ retail use cases
Capstone: end-to-end project included
Lesson 1: Foundations of Causal Inference and the Potential Outcomes Framework
Understanding the Neyman-Rubin Causal Model & Identification Assumptions
Learning Objectives
  • Distinguish between correlation and causation in retail data contexts
  • Articulate the fundamental problem of causal inference
  • Define the Neyman-Rubin potential outcomes framework
  • Identify the three critical assumptions for causal identification in observational data
Mathematical Foundations
The Neyman-Rubin Causal Model (NRCM) assumes that each unit has a well-defined potential outcome under each treatment condition. Let Di ∈ {0, 1} indicate whether customer i is treated. For each customer i, we define two potential outcomes: Yi(1) under treatment and Yi(0) under control.
Switching Equation
Yi = Di · Yi(1) + (1 – Di) · Yi(0)
Average Treatment Effect (ATE)
ATE = E[Y(1) – Y(0)]
Average Treatment Effect on the Treated (ATT)
ATT = E[Y(1) – Y(0) | D = 1]
Conditional Average Treatment Effect (CATE)
τ(x) = E[Y(1) – Y(0) | X = x]
Three Critical Assumptions
1. Consistency (SUTVA)
The observed outcome under a given treatment corresponds exactly to the potential outcome defined for that treatment, with no interference between units.
2. Ignorability (Unconfoundedness)
{Y(1), Y(0)} ⊥ D | X — treatment assignment is independent of potential outcomes conditional on observed covariates.
3. Overlap (Positivity)
0 < P(D=1|X) < 1 for all X: every unit has a non-zero probability of being treated and a non-zero probability of being untreated.
Retail Application
Business Question: “Does offering a discount causally increase customer lifetime revenue?” Here, Yi(1) is total spending if customer i receives a discount, and Yi(0) is their spending without. We never observe both — this is the counterfactual gap.
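The counterfactual gap can be made concrete with simulated data, where both potential outcomes are known by construction. The sketch below is illustrative only: the confounder `engagement`, the effect size of 5, and all coefficients are assumptions chosen for the demonstration, not values from the UCI dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical confounder: prior engagement drives both discount exposure and spend
engagement = rng.normal(size=n)

# Potential outcomes: every simulated customer gains exactly 5 from the discount
y0 = 50 + 10 * engagement + rng.normal(size=n)
y1 = y0 + 5

# Engaged customers are more likely to receive the discount (confounding)
p_treat = 1 / (1 + np.exp(-engagement))
d = rng.binomial(1, p_treat)

# Switching equation: only one potential outcome is observed per customer
y = d * y1 + (1 - d) * y0

true_ate = np.mean(y1 - y0)
naive_diff = y[d == 1].mean() - y[d == 0].mean()

print(f"True ATE:            {true_ate:.2f}")
print(f"Naive mean contrast: {naive_diff:.2f}")
```

Because treated customers are more engaged to begin with, the naive contrast mixes the discount effect with the pre-existing spending gap and overstates the true ATE in this setup.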
📝 Exercises & Assignments
  1. Identify three potential causal questions from the UCI Online Retail dataset
  2. Specify the treatment, outcome, and potential confounders for each question
  3. Discuss which of the three assumptions might be violated in each scenario
  4. Write a memo explaining why a simple mean comparison does not yield a causal estimate
Lesson 2: Data Preparation and Treatment Construction
Preprocessing the UCI Online Retail Dataset for Causal Analysis
Learning Objectives
  • Preprocess retail transaction data for causal analysis
  • Engineer RFM (Recency, Frequency, Monetary) features as confounders
  • Operationalize treatment definitions from observational data
  • Assess initial covariate balance between treatment and control groups
Dataset Overview
UCI Online Retail Dataset: 541,909 transactions from a UK-based e-commerce retailer (Dec 2010 – Dec 2011). Contains 8 variables: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country.
Implementation — Python
Python
import pandas as pd
import numpy as np

# Load UCI Online Retail dataset
df = pd.read_excel('Online_Retail.xlsx')

# Step 1: Remove cancellations and invalid entries
# (.copy() avoids pandas' SettingWithCopyWarning when columns are added later)
valid = (
    ~(df['InvoiceNo'].astype(str).str.startswith('C'))
    & df['CustomerID'].notna()
    & (df['UnitPrice'] > 0)
    & (df['Quantity'] > 0)
)
df = df[valid].copy()

# Step 2: Compute TotalSpend per transaction line
df['TotalSpend'] = df['Quantity'] * df['UnitPrice']

# Step 3: Compute RFM features at customer level
reference_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)
rfm = df.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda x: (reference_date - x.max()).days),
    Frequency=('InvoiceNo', 'nunique'),
    Monetary=('TotalSpend', 'sum')
).reset_index()

# Step 4: Define treatment — discount exposure
product_median_price = df.groupby('StockCode')['UnitPrice'].transform('median')
df['IsDiscounted'] = (df['UnitPrice'] < product_median_price).astype(int)

# Customer-level treatment: majority of purchases at discount
customer_treatment = df.groupby('CustomerID')['IsDiscounted'].mean()
customer_treatment = (customer_treatment > 0.5).astype(int).reset_index()
customer_treatment.columns = ['CustomerID', 'Treatment']

# Merge treatment with RFM features
analysis_df = rfm.merge(customer_treatment, on='CustomerID')
Covariate Balance Assessment
Standardized Mean Difference (SMD)
SMD = (X̄treated – X̄control) / √((s²treated + s²control) / 2)

Target: |SMD| < 0.1 after matching or weighting
Tip: Using spend-based stratification (80% middle-range, 10% each upper/lower bounds) can improve covariate balance and reduce confounding.
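The SMD formula above translates directly into a few lines of pandas. The following is a minimal sketch, assuming a customer-level table with a binary treatment column; the helper names `standardized_mean_difference` and `balance_table` and the toy values are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

def standardized_mean_difference(x_treated, x_control):
    """SMD using the pooled standard deviation from the formula above."""
    x_treated = np.asarray(x_treated, dtype=float)
    x_control = np.asarray(x_control, dtype=float)
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

def balance_table(df, treatment_col, covariates, threshold=0.1):
    """One row per covariate with its SMD and a flag for |SMD| < threshold."""
    treated = df[df[treatment_col] == 1]
    control = df[df[treatment_col] == 0]
    smd = {c: standardized_mean_difference(treated[c], control[c])
           for c in covariates}
    out = pd.DataFrame({"SMD": pd.Series(smd)})
    out["Balanced"] = out["SMD"].abs() < threshold
    return out

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Treatment": [1, 1, 1, 0, 0, 0],
    "Recency":   [10, 20, 30, 12, 18, 30],
    "Frequency": [5, 7, 9, 1, 2, 3],
})
table = balance_table(toy, "Treatment", ["Recency", "Frequency"])
print(table)
```

In the toy data the two groups share the same mean Recency, so that covariate is flagged as balanced, while Frequency differs sharply and is not. On the workshop data the same call would be `balance_table(analysis_df, "Treatment", ["Recency", "Frequency"])`.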
📝 Exercises & Assignments
  1. Implement the full preprocessing pipeline on the UCI dataset
  2. Create two alternative treatment definitions: bulk purchase (Quantity ≥ 12) and cross-country comparison
  3. Compute and visualize the distribution of confounders across treatment groups
  4. Document any overlap violations you observe
Lesson 3: Propensity Score Matching & Inverse Probability Weighting
Classical Methods for Causal Estimation in Observational Studies
Learning Objectives
  • Estimate propensity scores using logistic regression
  • Implement nearest-neighbor propensity score matching
  • Construct IPW estimators for ATE and ATT
  • Diagnose covariate balance after matching/weighting
  • Interpret results in the context of retail promotion effectiveness
Mathematical Foundations
Propensity Score
e(X) = P(D = 1 | X) = Pr(Treatment | Covariates)
ATT via PSM
τ̂ATT = (1/N₁) Σ_{i ∈ treated} [Yi – Ŷi(0)]

where N₁ is the number of treated units and Ŷi(0) is the average outcome of control units matched to treated unit i
IPW Estimator for ATE
τ̂IPW = (1/n) Σi [ (Di · Yi) / e(Xi) – ((1 – Di) · Yi) / (1 – e(Xi)) ]
Implementation — Python
Python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
import numpy as np

# Step 1: Estimate propensity scores
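# Covariates exclude Monetary because it is the outcome below;
# adjusting for the outcome itself would bias the estimate.
# T is the treatment indicator (D in the formulas above).
X = analysis_df[['Recency', 'Frequency']].values
T = analysis_df['Treatment'].values
Y = analysis_df['Monetary'].values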

ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(X, T)
ps_scores = ps_model.predict_proba(X)[:, 1]

# Step 2: Propensity Score Matching (nearest neighbor)
treated_idx = np.where(T == 1)[0]
control_idx = np.where(T == 0)[0]

nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(ps_scores[control_idx].reshape(-1, 1))
distances, indices = nn.kneighbors(ps_scores[treated_idx].reshape(-1, 1))
matched_control_idx = control_idx[indices.flatten()]

# ATT via PSM
att_psm = Y[treated_idx].mean() - Y[matched_control_idx].mean()

# Step 3: IPW Estimation
ate_ipw = (np.sum(T * Y / ps_scores) - np.sum((1-T) * Y / (1-ps_scores))) / len(Y)
print(f"ATT (PSM): {att_psm:.2f}")
print(f"ATE (IPW): {ate_ipw:.2f}")
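The plain IPW formula can become unstable when propensity scores approach 0 or 1, and the listing above reports only the ATE. A minimal sketch of a normalized (Hájek) variant with score clipping, covering both ATE and ATT, might look like the following. The function name `ipw_estimates`, the clipping threshold of 0.01, and the simulated data are illustrative assumptions and not part of the workshop pipeline.

```python
import numpy as np

def ipw_estimates(y, d, e, clip=0.01):
    """Normalized (Hajek) IPW estimates of the ATE and ATT.

    y: outcomes, d: binary treatment (0/1), e: estimated propensity scores.
    `clip` bounds the scores away from 0 and 1 (0.01 is an arbitrary choice).
    """
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    e = np.clip(np.asarray(e, dtype=float), clip, 1 - clip)

    # ATE weights: 1/e for treated units, 1/(1-e) for controls
    w1 = d / e
    w0 = (1 - d) / (1 - e)
    ate = np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

    # ATT weights: 1 for treated units, e/(1-e) for controls
    w0_att = (1 - d) * e / (1 - e)
    att = y[d == 1].mean() - np.sum(w0_att * y) / np.sum(w0_att)
    return ate, att

# Simulated check (hypothetical data): the true effect is 5 for every unit
rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)
e_true = 1 / (1 + np.exp(-x))
d = rng.binomial(1, e_true)
y = 50 + 10 * x + 5 * d + rng.normal(size=n)

ate, att = ipw_estimates(y, d, e_true)
print(f"ATE (normalized IPW): {ate:.2f}")
print(f"ATT (normalized IPW): {att:.2f}")
```

Normalizing by the sum of weights, rather than by n, keeps a handful of extreme weights from dominating the estimate. On the workshop data the call would take `Y`, `T`, and `ps_scores` from the listing above.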
Implementation — R (MatchIt)
R
library(MatchIt)
library(sandwich)
library(lmtest)

# Propensity score matching
# (Monetary is the outcome, so it is left out of the matching covariates)
m.out <- matchit(Treatment ~ Recency + Frequency,
                 data = analysis_df, method = "nearest",
                 distance = "glm", ratio = 1)
summary(m.out)  # Balance diagnostics

# Extract matched data and estimate ATT with cluster-robust standard errors
m.data <- match.data(m.out)
fit <- lm(Monetary ~ Treatment, data = m.data, weights = weights)
coeftest(fit, vcov. = vcovCL, cluster = ~subclass)
Business Interpretation
The PSM estimate answers: “Among customers exposed to discounts, how much more did they spend than they would have spent without discounts?” PSM approximates this counterfactual by matching discount-exposed customers with unexposed customers who are comparable on observable characteristics. This reduces self-selection bias from observed covariates, although bias from unobserved confounders can remain.
📝 Exercises & Assignments
  1. Implement PSM with different caliper widths (0.05, 0.1, 0.2) and compare results
  2. Create Love plots showing covariate balance before and after matching
  3. Implement stabilized IPW weights to address extreme propensity scores
  4. Compare ATT estimates from PSM versus IPW on the same treatment definition
Lesson 4: Double Machine Learning for Causal Feature Evaluation
Root-n Consistent Estimation with Neyman Orthogonality & Cross-Fitting
Learning Objectives
  • Understand the partially linear model and its role in causal inference
  • Implement the DML procedure with cross-fitting
  • Apply Neyman orthogonality to achieve root-n consistent estimation
  • Use EconML’s LinearDML for retail feature evaluation
  • Interpret DML estimates for pricing and marketing decisions
Mathematical Foundations
The Partially Linear Model
Yi = θ₀ · Di + g₀(Xi) + ζi
Di = m₀(Xi) + Vi

where θ₀ = causal parameter, g₀(X) = outcome nuisance, m₀(X) = treatment nuisance
Orthogonalized (Debiased) Estimator
θ̌₀ = [1/n Σᵢ V̂ᵢ · Dᵢ]⁻¹ · [1/n Σᵢ V̂ᵢ · (Yᵢ – ĝ₀(Xᵢ))]

where V̂ᵢ = Dᵢ – m̂(Xᵢ) are the residualized treatment values
Neyman Orthogonality
The derivative of the estimating equation with respect to the nuisance parameters equals zero at their true values, so small (first-order) errors in nuisance estimation do not bias the causal parameter.
Cross-Fitting
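Cross-fitting splits the sample into K folds: the nuisance functions are fit on K – 1 folds and evaluated on the held-out fold, so each unit's residuals come from models that never saw that unit. This removes the overfitting bias that arises when the same observations are used to estimate both the nuisance functions and the causal parameter.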
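The mechanics of cross-fitting are easier to see when done by hand. The sketch below is a simplified illustration on simulated data, not EconML's implementation. It uses the partialling-out (residual-on-residual) form, which residualizes Y on X directly and so differs slightly from the estimator written above; the data-generating process, the true value of 2.0, and the random forest settings are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(42)
n, theta_true = 2000, 2.0

# Simulated data from a partially linear model (hypothetical DGP)
X = rng.uniform(-2, 2, size=(n, 2))
V = rng.normal(size=n)
D = np.sin(X[:, 0]) + V                                  # D = m0(X) + V
Y = theta_true * D + X[:, 1] ** 2 + rng.normal(size=n)   # Y = theta0*D + g0(X) + noise

folds = KFold(n_splits=5, shuffle=True, random_state=0)

def make_rf():
    return RandomForestRegressor(n_estimators=100, min_samples_leaf=20,
                                 random_state=0)

# Out-of-fold predictions: each unit is predicted by a model trained
# on the other four folds only
m_hat = cross_val_predict(make_rf(), X, D, cv=folds)   # estimate of E[D | X]
l_hat = cross_val_predict(make_rf(), X, Y, cv=folds)   # estimate of E[Y | X]

# Residualize treatment and outcome, then regress residual on residual
V_hat = D - m_hat
Y_res = Y - l_hat
theta_hat = np.sum(V_hat * Y_res) / np.sum(V_hat ** 2)

# Sandwich standard error for the residual-on-residual estimate
psi = V_hat * (Y_res - theta_hat * V_hat)
se = np.sqrt(np.mean(psi ** 2) / np.mean(V_hat ** 2) ** 2 / n)

print(f"theta_hat = {theta_hat:.3f} (true value {theta_true}), SE = {se:.3f}")
```

The treatment here is continuous for simplicity; with a binary treatment the treatment model would typically be a classifier predicting the propensity score.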
Implementation — Python (EconML)
Python
from econml.dml import LinearDML
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

# Define variables
Y = analysis_df['Monetary'].values          # Outcome
T = analysis_df['Treatment'].values          # Treatment
X = analysis_df[['Recency', 'Frequency']].values  # Effect modifiers (also used as controls)
W = None  # No extra controls; Monetary is excluded because it is the outcome

# Initialize DML with flexible ML models for nuisance functions
dml = LinearDML(
    model_y=GradientBoostingRegressor(n_estimators=100, max_depth=3),
    model_t=GradientBoostingClassifier(n_estimators=100, max_depth=3),
    discrete_treatment=True,  # Binary treatment, so model_t is a classifier
    cv=5,  # 5-fold cross-fitting
    random_state=42
)

# Fit the model
dml.fit(Y, T, X=X, W=W)

# Get ATE and confidence interval
ate = dml.ate(X)
ci = dml.ate_interval(X, alpha=0.05)
print(f"ATE: {ate:.2f}, 95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")

# Get heterogeneous effects
cate = dml.effect(X)
Business Context
Real-World Application (Lowe’s): Retailers like Lowe’s have applied Double Machine Learning with multiple confounders to obtain the correct sign of causal effects of business features such as marketing spend and inventory levels on sales, enabling executives to answer “what if” questions about resource allocation.
📝 Exercises & Assignments
  1. Implement DML with different ML models (Random Forests, Neural Networks, LASSO)
  2. Compare DML estimates with naive OLS regression estimates
  3. Perform a manual cross-fitting procedure to understand the mechanics
  4. Apply DML to estimate the causal effect of bulk purchasing on customer retention
Lesson 5: Causal Forests and Heterogeneous Treatment Effects
Individual-Level Treatment Effects with Honest Estimation
Learning Objectives
  • Understand the causal forest algorithm and honest estimation
  • Estimate individual-level treatment effects (CATE)
  • Identify customer segments with differential treatment responses
  • Implement causal forests using EconML and R’s grf package
  • Develop CATE-based targeting strategies for retail personalization
Mathematical Foundations
Causal Forest Estimator
τ̂(x) = (1/B) Σ_{b=1}^{B} τ̂_b(x)

where τ̂_b(x) = treatment effect estimate from tree b, B = number of trees
Local Estimation with Doubly-Robust Scores
τ̂(x) = [Σᵢ αᵢ(x) · (Yᵢ – μ̂(Xᵢ)) · (Dᵢ – ê(Xᵢ))] / [Σᵢ αᵢ(x) · (Dᵢ – ê(Xᵢ))²]

where αᵢ(x) = forest-derived kernel weights, μ̂(X) = estimated conditional mean outcome E[Y | X], ê(X) = estimated propensity score, and Dᵢ = treatment indicator
Honest Estimation
One sample constructs the partition (splitting criterion maximizes treatment effect heterogeneity) and another estimates effects within each leaf — preventing overfitting.
CATE-Based Targeting
Rank customers by estimated treatment effects rather than predicted baseline outcomes. Customers with high predicted outcomes may not be causally affected and thus represent inefficient targets.
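A small simulation can illustrate why ranking by treatment effect and ranking by baseline outcome lead to different targeting lists. This is a sketch under strong assumptions: the spend distribution, the rule that bigger spenders respond less, and the use of the true (oracle) CATE in place of an estimate are all hypothetical choices made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

# Hypothetical customers: baseline spend and true uplift are simulated, and
# the biggest spenders are assumed to respond least to the promotion
baseline = rng.gamma(shape=2.0, scale=50.0, size=n)
true_cate = np.clip(20 - 0.1 * baseline, 0, None) + rng.normal(0, 1, size=n)

def incremental_revenue(scores, cate, top_frac=0.2):
    """Total true uplift captured when treating the top `top_frac` by `scores`."""
    k = int(len(scores) * top_frac)
    chosen = np.argsort(scores)[::-1][:k]   # indices of the k highest scores
    return cate[chosen].sum()

by_baseline = incremental_revenue(baseline, true_cate)
by_cate = incremental_revenue(true_cate, true_cate)

print(f"Uplift when targeting top 20% by baseline spend: {by_baseline:,.0f}")
print(f"Uplift when targeting top 20% by CATE:           {by_cate:,.0f}")
```

In practice the true CATE is unknown, so the ranking would use estimates such as those from a causal forest, and the gain over baseline targeting depends on how accurate those estimates are.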
Implementation — Python (EconML)
Python
from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

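# Initialize Causal Forest
cf = CausalForestDML(
    model_y=RandomForestRegressor(n_estimators=100, min_samples_leaf=10),
    model_t=RandomForestClassifier(n_estimators=100, min_samples_leaf=10),
    discrete_treatment=True,  # Binary treatment, so model_t is a classifier
    n_estimators=2000,
    min_samples_leaf=5,
    random_state=42
)

# Fit on retail data (Y, T, X as defined in Lesson 4). X serves as both
# effect modifiers and controls, so no separate W is passed.
cf.fit(Y, T, X=X)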

# Estimate individual treatment effects
cate_estimates = cf.effect(X)
cate_intervals = cf.effect_interval(X, alpha=0.05)

# Identify high-impact customer segments
analysis_df['CATE'] = cate_estimates
high_impact = analysis_df[analysis_df['CATE'] > np.percentile(cate_estimates, 75)]
low_impact = analysis_df[analysis_df['CATE'] < np.percentile(cate_estimates, 25)]

print(f"High-impact segment mean CATE: {high_impact['CATE'].mean():.2f}")
print(f"Low-impact segment mean CATE: {low_impact['CATE'].mean():.2f}")
Implementation — R (grf)
R
library(grf)

# Train causal forest (covariates exclude Monetary, which is the outcome)
cf <- causal_forest(
  X = as.matrix(analysis_df[, c("Recency", "Frequency")]),
  Y = analysis_df$Monetary,
  W = analysis_df$Treatment,
  num.trees = 2000,
  honesty = TRUE,
  sample.fraction = 0.5
)

# Estimate CATE
cate <- predict(cf, estimate.variance = TRUE)

# Variable importance for heterogeneity drivers
varimp <- variable_importance(cf)
📝 Exercises & Assignments
  1. Train a causal forest and visualize the distribution of CATE estimates
  2. Identify the top 3 covariates driving treatment effect heterogeneity
  3. Create customer segments based on CATE quartiles and profile each segment
  4. Compare CATE estimates from Causal Forests with LinearDML
Lesson 6: Business Applications, Robustness, and Deployment
From Causal Estimates to ROI: Refutation, Decision Frameworks & Capstone
Learning Objectives
  • Apply counterfactual methods to measure real-world retail impacts
  • Conduct robustness checks and refutation tests using DoWhy
  • Translate causal estimates into ROI and business decision frameworks
  • Complete a capstone project integrating all methods from the workshop
Retail Business Applications
🏷️ Promotion Effectiveness
Uplift modeling estimates the Individual Treatment Effect of promotions — identifying which customers are causally affected vs. those who would purchase regardless.
💰 Pricing Strategy
Causal forecasting models the price-demand relationship so retailers can set profit-optimal prices using elasticity as a counterfactual concept.
🛒 Product Recommendations
Causal inference in recommender systems addresses confounding biases and enables counterfactual offline policy evaluation.
⭐ Loyalty Programs
PSM isolates the causal spending uplift of loyalty membership by matching members with comparable non-members on observable characteristics.
Robustness Framework — DoWhy
Python
import dowhy
from dowhy import CausalModel

# DoWhy 4-step workflow: Model → Identify → Estimate → Refute
model = CausalModel(
    data=analysis_df,
    treatment='Treatment',
    outcome='Monetary',
    common_causes=['Recency', 'Frequency']
)

# Identify causal effect
identified = model.identify_effect()

# Estimate using IPW
estimate = model.estimate_effect(identified,
    method_name="backdoor.propensity_score_weighting")

# Refutation: placebo treatment
refute_placebo = model.refute_estimate(identified, estimate,
    method_name="placebo_treatment_refuter")

# Refutation: random common cause
refute_random = model.refute_estimate(identified, estimate,
    method_name="random_common_cause")

print(f"Estimated effect: {estimate.value:.2f}")
print(f"Placebo refutation p-value: {refute_placebo.refutation_result['p_value']:.4f}")
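Conceptually, a placebo-treatment check replaces the real treatment with a variable that cannot have caused the outcome and re-runs the estimator; a sound estimator should then report an effect near zero. The sketch below illustrates that idea with a plain difference in means on simulated randomized data. It does not reproduce DoWhy's internals, and the effect size of 3 and the 200 permutations are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Simulated randomized data with a true effect of 3 (hypothetical)
d = rng.binomial(1, 0.5, size=n)
y = 10 + 3 * d + rng.normal(size=n)

def diff_in_means(y, d):
    return y[d == 1].mean() - y[d == 0].mean()

observed = diff_in_means(y, d)

# Placebo: shuffle the treatment labels, which breaks any link to the outcome
placebo_effects = np.array([
    diff_in_means(y, rng.permutation(d)) for _ in range(200)
])

# Share of placebo runs with an effect at least as large as the observed one
p_value = np.mean(np.abs(placebo_effects) >= abs(observed))

print(f"Observed effect:     {observed:.2f}")
print(f"Mean placebo effect: {placebo_effects.mean():.2f}")
print(f"Permutation p-value: {p_value:.3f}")
```

If the placebo effects were centered far from zero, that would suggest the estimator is picking up something other than the treatment.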
From Causal Estimates to ROI
ROI Pipeline
1. Estimate causal effect (τ̂) using appropriate method
2. Total Incremental Revenue = τ̂ × N_affected, where N_affected is the number of affected customers
3. Net Benefit = Total Incremental Revenue − Implementation Cost
4. ROI = (Incremental Revenue − Cost) / Cost × 100%
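The four steps above amount to simple arithmetic once a causal estimate is in hand. A minimal sketch follows; the function name `campaign_roi` and every input value (the effect estimate, its interval bounds, the customer count, and the cost) are hypothetical placeholders rather than results from the workshop data.

```python
def campaign_roi(tau_hat, n_affected, cost):
    """Return (incremental revenue, net benefit, ROI in percent)."""
    incremental_revenue = tau_hat * n_affected
    net_benefit = incremental_revenue - cost
    roi_pct = net_benefit / cost * 100
    return incremental_revenue, net_benefit, roi_pct

# Hypothetical inputs: per-customer effect of 4.0, 10,000 customers, cost of 25,000
revenue, net, roi = campaign_roi(tau_hat=4.0, n_affected=10_000, cost=25_000)
print(f"Incremental revenue: {revenue:,.0f}")
print(f"Net benefit:         {net:,.0f}")
print(f"ROI:                 {roi:.1f}%")

# Carrying a (hypothetical) confidence interval for tau through the same
# pipeline shows how uncertain the ROI is
roi_low = campaign_roi(2.5, 10_000, 25_000)[2]
roi_high = campaign_roi(5.5, 10_000, 25_000)[2]
print(f"ROI range over the interval: {roi_low:.1f}% to {roi_high:.1f}%")
```

Reporting the ROI at both ends of the interval, rather than at the point estimate alone, gives stakeholders a clearer picture of the downside risk.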
Method Comparison
| Method | Best Use Case | Key Strength | Limitation |
| --- | --- | --- | --- |
| PSM | Promotion impact on similar customers | Intuitive, transparent matching | Requires strong overlap |
| IPW | Population-level ATE estimation | Uses full sample, no data discarded | Sensitive to extreme weights |
| DML | Complex confounding with ML flexibility | Root-n consistent, model-agnostic | Requires correct functional form for θ |
| Causal Forests | Heterogeneous effects across segments | Non-parametric CATE, honest inference | Computationally intensive |
Capstone Project
End-to-End Counterfactual Feature Evaluation: Using the UCI Online Retail dataset, participants must define a business question, preprocess data, construct treatment/control groups, apply ≥3 methods (PSM, IPW, DML, Causal Forests), conduct DoWhy refutation, translate findings into ROI estimates, and present to simulated stakeholders.
📝 Exercises & Assignments
  1. Implement the full DoWhy refutation pipeline for your preferred treatment scenario
  2. Compute the ROI of a hypothetical discount campaign using your causal estimates
  3. Write a one-page executive summary translating CATE estimates into a targeting strategy
  4. Compare results across all four methods to assess sensitivity of conclusions

Workshop Summary & Technology Stack

| Lesson | Core Method | Key Formula | Retail Application |
| --- | --- | --- | --- |
| 1 | Potential Outcomes | ATE = E[Y(1) – Y(0)] | Framing causal questions |
| 2 | Data Engineering | RFM features, treatment construction | Dataset preparation |
| 3 | PSM & IPW | e(X) = P(D=1 ∣ X); τ̂IPW | Promotion impact |
| 4 | Double ML | θ̌₀ with cross-fitting | Pricing & marketing ROI |
| 5 | Causal Forests | τ̂(x) = (1/B) Σ_b τ̂_b(x) | Customer segmentation |
| 6 | Integration | Refutation + ROI | Business deployment |
Recommended Technology Stack
Python: scikit-learn, EconML, DoWhy, pandas, numpy
R: grf, MatchIt, sandwich

Facilitator: Dharmateja Priyadarshi Uddandarao

Senior Statistician, Amazon

With over 9 years in data science and extensive expertise in decision analytics and Python, Dharmateja Priyadarshi Uddandarao specializes in developing causal models to evaluate the economic impact of high-value actions and products. His work focuses on designing causal experiments and forecasting financial scenarios. A prolific author, he contributes to publications such as The Data Scientist, AI Journ, Analytics Vidhya, ACM, Silicon Valley Journal, and HackerNoon, covering causal inference, econometrics, and applied statistical methodology. He also contributes to the field through mentoring and committee roles in professional organizations and conferences.
