Data Discovery Workshop Series – Building Counterfactual Scenario Methods for Feature Evaluations

Registrations closed

Workshop Details

From Potential Outcomes to Causal Forests — A Hands-On Workshop in Retail Analytics

Instructor

Dharmateja Priyadarshi Uddandarao

Date

May 9, 2026

In-Person Venue

Bellevue Library Meeting Hall 2
1111 110th Ave NE, Bellevue, WA 98004

Duration

75 Minutes

Audience

Data Scientists, Statisticians & Analysts

Registrations

14 confirmed attendees

Causal Inference · Counterfactual Methods · Propensity Score Matching · Double Machine Learning · Causal Forests · Retail Analytics · Python & R

Virtual participation link

Workshop Content

6 Lessons · UCI Online Retail Dataset · Data Scientists & Statisticians · Python & R

📊 Dataset: 541,909 transactions
🎯 Methods: Causal inference
📈 Applications: Retail use cases
🧪 Capstone: End-to-end project included

Foundations of Causal Inference and the Potential Outcomes Framework

Understanding the Neyman-Rubin Causal Model & Identification Assumptions

Learning Objectives

Distinguish between correlation and causation in retail data contexts
Articulate the fundamental problem of causal inference
Define the Neyman-Rubin potential outcomes framework
Identify the three critical assumptions for causal identification in observational data

Mathematical Foundations

The Neyman-Rubin Causal Model (NRCM) assumes each unit has a set of possible outcomes with respect to different treatments. For each customer i, we define two potential outcomes: Y_i(1) under treatment and Y_i(0) under control.

Switching Equation

Y_i = D_i · Y_i(1) + (1 – D_i) · Y_i(0)

Average Treatment Effect (ATE)

ATE = E[Y(1) – Y(0)]

Average Treatment Effect on the Treated (ATT)

ATT = E[Y(1) – Y(0) | D = 1]

Conditional Average Treatment Effect (CATE)

τ(x) = E[Y(1) – Y(0) | X = x]

Three Critical Assumptions

1. Consistency (SUTVA)

The observed outcome under a given treatment corresponds exactly to the potential outcome defined for that treatment, with no interference between units.

2. Ignorability (Unconfoundedness)

{Y(1), Y(0)} ⊥ D | X — treatment assignment is independent of potential outcomes conditional on observed covariates.

3. Overlap (Positivity)

0 < P(D=1|X) < 1 for all X — every unit has a non-zero probability of receiving any treatment.

Retail Application

Business Question: “Does offering a discount causally increase customer lifetime revenue?” Here, Y_i(1) is total spending if customer i receives a discount, and Y_i(0) is their spending without. We never observe both — this is the counterfactual gap.
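
The counterfactual gap can be made concrete with a small simulation. The sketch below is illustrative only: the data are synthetic, and the names (`loyalty`, `y0`, `y1`) and the effect size of 5 are assumptions, not values from the UCI dataset. Because a simulation exposes both potential outcomes, the true ATE can be compared with the naive difference in means that an analyst would compute from observed data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical confounder: loyal customers spend more AND are more likely to get a discount.
loyalty = rng.normal(size=n)

# Potential outcomes for every customer (both are visible only in a simulation).
y0 = 50 + 10 * loyalty + rng.normal(scale=5, size=n)  # spend without a discount
y1 = y0 + 5                                           # spend with a discount (true effect = 5)

# Treatment assignment depends on the confounder, so it is not randomized.
p_discount = 1 / (1 + np.exp(-loyalty))
d = rng.binomial(1, p_discount)

# Switching equation: only one potential outcome is observed per customer.
y_obs = d * y1 + (1 - d) * y0

true_ate = np.mean(y1 - y0)
naive_diff = y_obs[d == 1].mean() - y_obs[d == 0].mean()

print(f"True ATE:              {true_ate:.2f}")
print(f"Naive mean difference: {naive_diff:.2f}")
```

In this setup the naive comparison overstates the effect, because discounted customers were already higher spenders before any discount was offered.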

📝 Exercises & Assignments

Identify three potential causal questions from the UCI Online Retail dataset
Specify the treatment, outcome, and potential confounders for each question
Discuss which of the three assumptions might be violated in each scenario
Write a memo explaining why a simple mean comparison does not yield a causal estimate

Data Preparation and Treatment Construction

Preprocessing the UCI Online Retail Dataset for Causal Analysis

Learning Objectives

Preprocess retail transaction data for causal analysis
Engineer RFM (Recency, Frequency, Monetary) features as confounders
Operationalize treatment definitions from observational data
Assess initial covariate balance between treatment and control groups

Dataset Overview

UCI Online Retail Dataset: 541,909 transactions from a UK-based e-commerce retailer (Dec 2010 – Dec 2011). Contains 8 variables: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country.

Implementation — Python

Python

import pandas as pd
import numpy as np

# Load UCI Online Retail dataset
df = pd.read_excel('Online_Retail.xlsx')

# Step 1: Remove cancellations and invalid entries
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]
df = df[df['CustomerID'].notna()]
df = df[df['UnitPrice'] > 0]
df = df[df['Quantity'] > 0]

# Step 2: Compute TotalSpend per transaction line
df['TotalSpend'] = df['Quantity'] * df['UnitPrice']

# Step 3: Compute RFM features at customer level
reference_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)
rfm = df.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda x: (reference_date - x.max()).days),
    Frequency=('InvoiceNo', 'nunique'),
    Monetary=('TotalSpend', 'sum')
).reset_index()

# Step 4: Define treatment — discount exposure
product_median_price = df.groupby('StockCode')['UnitPrice'].transform('median')
df['IsDiscounted'] = (df['UnitPrice'] < product_median_price).astype(int)

# Customer-level treatment: majority of purchases at discount
customer_treatment = df.groupby('CustomerID')['IsDiscounted'].mean()
customer_treatment = (customer_treatment > 0.5).astype(int).reset_index()
customer_treatment.columns = ['CustomerID', 'Treatment']

# Merge treatment with RFM features
analysis_df = rfm.merge(customer_treatment, on='CustomerID')
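
The treatment rule above can be checked by hand on a tiny example. The sketch below uses a hypothetical six-line extract (the values are made up; only the column names follow the UCI dataset) and applies the same two steps: flag lines priced below the product median, then treat customers whose discounted share exceeds one half.

```python
import pandas as pd

# Hypothetical six-line extract with the same column names as the UCI dataset.
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 1, 2],
    'StockCode':  ['A', 'A', 'A', 'B', 'B', 'B'],
    'UnitPrice':  [10.0, 8.0, 10.0, 5.0, 4.0, 5.0],
})

# A line counts as discounted when it is priced below the product's median price.
median_price = toy.groupby('StockCode')['UnitPrice'].transform('median')
toy['IsDiscounted'] = (toy['UnitPrice'] < median_price).astype(int)

# A customer is treated when more than half of their lines are discounted.
share_discounted = toy.groupby('CustomerID')['IsDiscounted'].mean()
treatment = (share_discounted > 0.5).astype(int)

print(toy)
print(treatment)
```

Note that this rule is a proxy: it flags below-median prices rather than recorded promotions, so any price variation within a product (for example, a permanent price change) is also counted as a discount.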

Covariate Balance Assessment

Standardized Mean Difference (SMD)

SMD = (X̄_treated – X̄_control) / √((s²_treated + s²_control) / 2)

Target: |SMD| < 0.1 after matching or weighting
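
A minimal sketch of the SMD calculation follows. The helper name `standardized_mean_difference` and the toy recency values are assumptions for illustration; the denominator uses sample variances (`ddof=1`), which is one common convention.

```python
import numpy as np

def standardized_mean_difference(x_treated, x_control):
    """SMD using the pooled standard deviation of the two groups."""
    x_treated = np.asarray(x_treated, dtype=float)
    x_control = np.asarray(x_control, dtype=float)
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

# Toy recency values (days) for two hypothetical groups.
recency_treated = [10, 20, 30, 40]
recency_control = [20, 30, 40, 50]

smd = standardized_mean_difference(recency_treated, recency_control)
print(f"SMD for Recency: {smd:.3f}")
print("Balanced" if abs(smd) < 0.1 else "Imbalanced")
```

The same function can be applied to each RFM column before and after matching to see whether the adjustment moved the groups closer together.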

Tip: Stratifying customers by spend (for example, treating the middle 80% and the top and bottom 10% as separate strata) can improve covariate balance within each stratum and reduce confounding.

📝 Exercises & Assignments

Implement the full preprocessing pipeline on the UCI dataset
Create two alternative treatment definitions: bulk purchase (Quantity ≥ 12) and cross-country comparison
Compute and visualize the distribution of confounders across treatment groups
Document any overlap violations you observe

Propensity Score Matching & Inverse Probability Weighting

Classical Methods for Causal Estimation in Observational Studies

Learning Objectives

Estimate propensity scores using logistic regression
Implement nearest-neighbor propensity score matching
Construct IPW estimators for ATE and ATT
Diagnose covariate balance after matching/weighting
Interpret results in the context of retail promotion effectiveness

Mathematical Foundations

Propensity Score

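e(X) = P(W = 1 | X), the probability of treatment given covariates (W is the binary treatment indicator, written D in the potential-outcomes definitions)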

ATT via PSM

τ̂_ATT = (1/N₁) Σ_i∈treated [Y_i – Ŷ_i(0)]

where Ŷ_i(0) is the average outcome of control units matched to treated unit i
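
The matching step in this formula can be sketched with NumPy alone. The function `match_with_caliper` is a hypothetical helper, the propensity scores and outcomes are made-up toy values, and the caliper here is applied on the raw propensity-score scale, which is an assumption (other scales are also used in practice). Matching is 1-nearest-neighbor with replacement.

```python
import numpy as np

def match_with_caliper(ps_treated, ps_control, caliper):
    """1-NN matching on the propensity score, with replacement.

    Returns (treated_positions, control_positions) for the pairs whose
    propensity-score distance is within the caliper.
    """
    ps_treated = np.asarray(ps_treated, dtype=float)
    ps_control = np.asarray(ps_control, dtype=float)
    # Distance from every treated unit to every control unit.
    dist = np.abs(ps_treated[:, None] - ps_control[None, :])
    nearest = dist.argmin(axis=1)
    keep = dist[np.arange(len(ps_treated)), nearest] <= caliper
    return np.flatnonzero(keep), nearest[keep]

# Toy inputs (hypothetical propensity scores and outcomes).
ps_t = np.array([0.30, 0.55, 0.90])
y_t = np.array([120.0, 150.0, 300.0])
ps_c = np.array([0.28, 0.52, 0.70])
y_c = np.array([100.0, 140.0, 160.0])

t_pos, c_pos = match_with_caliper(ps_t, ps_c, caliper=0.1)
att = np.mean(y_t[t_pos] - y_c[c_pos])
print(f"Matched {len(t_pos)} of {len(ps_t)} treated units, ATT = {att:.2f}")
```

The third treated unit has no control within the caliper and is dropped, which shows the trade-off: a tighter caliper improves match quality but changes the population the ATT describes.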

IPW Estimator for ATE

τ̂_IPW = (1/n) Σ_i [ (W_i · Y_i) / e(X_i) – ((1-W_i) · Y_i) / (1-e(X_i)) ]
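
Because the estimator divides by the propensity score, scores near 0 or 1 produce very large weights. The sketch below shows one hedged way to handle this: clip the scores and, optionally, normalize the weights within each arm (the Hajek form). The function `ipw_ate`, the clipping bounds, and the synthetic data with a known effect of 2.0 are all assumptions for illustration.

```python
import numpy as np

def ipw_ate(y, w, e, clip=(0.01, 0.99), normalize=True):
    """IPW estimate of the ATE from outcomes y, treatment w, and propensity e."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    e = np.clip(np.asarray(e, dtype=float), clip[0], clip[1])  # guard against extreme weights
    wt_treated = w / e
    wt_control = (1 - w) / (1 - e)
    if normalize:
        # Hajek form: weights are rescaled to sum to one within each arm.
        treated_mean = np.sum(wt_treated * y) / np.sum(wt_treated)
        control_mean = np.sum(wt_control * y) / np.sum(wt_control)
        return treated_mean - control_mean
    # Horvitz-Thompson form: the (1/n) sum in the formula above.
    return np.mean(wt_treated * y - wt_control * y)

# Synthetic check with a known effect of 2.0 and a known propensity score.
rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)
e_true = 1 / (1 + np.exp(-x))
w = rng.binomial(1, e_true)
y = 2.0 * w + 3.0 * x + rng.normal(size=n)

ate_ht = ipw_ate(y, w, e_true, normalize=False)
ate_hajek = ipw_ate(y, w, e_true)
naive = y[w == 1].mean() - y[w == 0].mean()

print(f"Horvitz-Thompson IPW: {ate_ht:.3f}")
print(f"Normalized IPW:       {ate_hajek:.3f}")
print(f"Naive difference:     {naive:.3f}")
```

Both weighted estimates should land near the simulated effect, while the naive difference absorbs the confounding through `x`.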

Implementation — Python

Python

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
import numpy as np

# Step 1: Estimate propensity scores
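# Covariates exclude Monetary: it is the outcome below, and using the outcome
# as a covariate would leak it into the propensity model
X = analysis_df[['Recency', 'Frequency']].values
T = analysis_df['Treatment'].values
Y = analysis_df['Monetary'].values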

ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(X, T)
ps_scores = ps_model.predict_proba(X)[:, 1]

# Step 2: Propensity Score Matching (nearest neighbor)
treated_idx = np.where(T == 1)[0]
control_idx = np.where(T == 0)[0]

nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(ps_scores[control_idx].reshape(-1, 1))
distances, indices = nn.kneighbors(ps_scores[treated_idx].reshape(-1, 1))
matched_control_idx = control_idx[indices.flatten()]

# ATT via PSM
att_psm = Y[treated_idx].mean() - Y[matched_control_idx].mean()

# Step 3: IPW Estimation
ate_ipw = (np.sum(T * Y / ps_scores) - np.sum((1-T) * Y / (1-ps_scores))) / len(Y)
print(f"ATT (PSM): {att_psm:.2f}")
print(f"ATE (IPW): {ate_ipw:.2f}")

Implementation — R (MatchIt)

library(MatchIt)
library(sandwich)
library(lmtest)

# Propensity score matching
m.out <- matchit(Treatment ~ Recency + Frequency + Monetary,
                 data = analysis_df, method = "nearest",
                 distance = "glm", ratio = 1)
summary(m.out)  # Balance diagnostics

# Extract matched data and estimate ATT
# (Outcome is assumed to be a column of analysis_df, distinct from the matching covariates)
m.data <- match.data(m.out)
fit <- lm(Outcome ~ Treatment, data = m.data, weights = weights)
coeftest(fit, vcov. = vcovCL, cluster = ~subclass)

Business Interpretation

The PSM estimate answers: “Among customers exposed to discounts, how much more did they spend compared to what they would have spent without discounts?” PSM estimates this spending uplift by matching discount-exposed customers with comparable unexposed customers on observable characteristics, which removes self-selection bias only to the extent that selection is driven by those observed characteristics.

📝 Exercises & Assignments

Implement PSM with different caliper widths (0.05, 0.1, 0.2) and compare results
Create Love plots showing covariate balance before and after matching
Implement stabilized IPW weights to address extreme propensity scores
Compare ATT estimates from PSM versus IPW on the same treatment definition

Double Machine Learning for Causal Feature Evaluation

Root-n Consistent Estimation with Neyman Orthogonality & Cross-Fitting

Learning Objectives

Understand the partially linear model and its role in causal inference
Implement the DML procedure with cross-fitting
Apply Neyman orthogonality to achieve root-n consistent estimation
Use EconML’s LinearDML for retail feature evaluation
Interpret DML estimates for pricing and marketing decisions

Mathematical Foundations

The Partially Linear Model

Y_i = θ₀ · D_i + g₀(X_i) + ζ_i
D_i = m₀(X_i) + V_i

where θ₀ = causal parameter, g₀(X) = outcome nuisance, m₀(X) = treatment nuisance

Orthogonalized (Debiased) Estimator

θ̌₀ = [1/n Σᵢ V̂ᵢ · Dᵢ]⁻¹ · [1/n Σᵢ V̂ᵢ · (Yᵢ – ĝ₀(Xᵢ))]

where V̂ᵢ = Dᵢ – m̂(Xᵢ) are the residualized treatment values

Neyman Orthogonality

The derivative of the expected score (the moment condition that defines θ₀) with respect to the nuisance functions is zero at their true values, so small errors in estimating g₀ and m₀ affect the estimate of θ₀ only at second order.
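
For the partially linear model above, this property can be checked directly. The derivation below is a sketch: it uses the score whose sample root is the orthogonalized estimator, assumes the standard conditions E[V | X] = 0 and E[ζ | X, D] = 0, and introduces the direction function h and the shorthand η = (g, m) as notation for illustration.

```latex
\begin{aligned}
\psi(\theta, \eta) &= \bigl(Y - \theta D - g(X)\bigr)\bigl(D - m(X)\bigr),
  \qquad \eta = (g, m) \\[4pt]
\partial_g \, \mathbb{E}\bigl[\psi(\theta_0, \eta_0)\bigr][h]
  &= -\,\mathbb{E}\bigl[h(X)\,\bigl(D - m_0(X)\bigr)\bigr]
   = -\,\mathbb{E}\bigl[h(X)\,\mathbb{E}[V \mid X]\bigr] = 0 \\[4pt]
\partial_m \, \mathbb{E}\bigl[\psi(\theta_0, \eta_0)\bigr][h]
  &= -\,\mathbb{E}\bigl[\bigl(Y - \theta_0 D - g_0(X)\bigr)\,h(X)\bigr]
   = -\,\mathbb{E}\bigl[h(X)\,\mathbb{E}[\zeta \mid X]\bigr] = 0
\end{aligned}
```

By contrast, the score (Y − θD − g(X))·D, which does not residualize the treatment, has derivative −E[h(X)·m₀(X)] with respect to g, which is generally non-zero. That is why errors in ĝ pass through to the estimate at first order when the treatment is not residualized.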

Cross-Fitting

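Sample splitting ensures that the nuisance functions are estimated on data independent of the observations used to estimate the causal parameter, which avoids overfitting bias. Cross-fitting rotates the roles of the folds so that every observation still contributes to the final estimate.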
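
The mechanics can be sketched without any causal library. The example below is a simplified illustration under stated assumptions: the data are synthetic with a true θ of 2.0, the nuisance models are cubic-polynomial least-squares fits (`poly_features` and `fit_predict` are hypothetical helpers standing in for any ML learner), and the final step uses the partialling-out variant, which residualizes both Y and D on X and then regresses residual on residual.

```python
import numpy as np

def poly_features(x):
    """Cubic polynomial basis for a 2-column covariate matrix (illustrative nuisance model)."""
    x1, x2 = x[:, 0], x[:, 1]
    return np.column_stack([np.ones(len(x)), x1, x2, x1**2, x2**2, x1 * x2, x1**3, x2**3])

def fit_predict(x_train, target_train, x_test):
    """Least-squares fit on the training folds, prediction on the held-out fold."""
    coef, *_ = np.linalg.lstsq(poly_features(x_train), target_train, rcond=None)
    return poly_features(x_test) @ coef

def cross_fit_dml(y, d, x, n_folds=5, seed=0):
    """Partialling-out DML estimate of theta with K-fold cross-fitting."""
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    y_res, d_res = np.empty(n), np.empty(n)
    for k in range(n_folds):
        test, train = folds == k, folds != k
        # Nuisance functions are learned only on data outside the held-out fold.
        y_res[test] = y[test] - fit_predict(x[train], y[train], x[test])
        d_res[test] = d[test] - fit_predict(x[train], d[train], x[test])
    # Residual-on-residual regression gives the causal parameter.
    return np.sum(d_res * y_res) / np.sum(d_res**2)

# Synthetic partially linear model with true theta = 2.0.
rng = np.random.default_rng(42)
n = 4000
x = rng.normal(size=(n, 2))
d = 0.5 * x[:, 0] + x[:, 1] ** 2 + rng.normal(size=n)               # D = m0(X) + V
y = 2.0 * d + np.sin(x[:, 0]) + 3.0 * x[:, 1] ** 2 + rng.normal(size=n)  # Y = theta0*D + g0(X) + zeta

theta_hat = cross_fit_dml(y, d, x)
naive = np.polyfit(d, y, 1)[0]  # slope of Y on D, ignoring confounding through X
print(f"Cross-fitted DML estimate:  {theta_hat:.3f}")
print(f"Naive regression of Y on D: {naive:.3f}")
```

Swapping `fit_predict` for a gradient-boosting or random-forest learner leaves the fold logic unchanged, which is the point of the exercise: the cross-fitting loop is independent of the nuisance model.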

Implementation — Python (EconML)

Python

from econml.dml import LinearDML
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

# Define variables
Y = analysis_df['Monetary'].values          # Outcome
T = analysis_df['Treatment'].values          # Treatment
X = analysis_df[['Recency', 'Frequency']].values  # Effect modifiers
# Controls: none beyond X. Monetary is the outcome, so it must not be a control
W = None

# Initialize DML with flexible ML models for nuisance functions
dml = LinearDML(
    model_y=GradientBoostingRegressor(n_estimators=100, max_depth=3),
    model_t=GradientBoostingClassifier(n_estimators=100, max_depth=3),
    discrete_treatment=True,  # Treatment is binary, so model_t is a classifier
    cv=5,  # 5-fold cross-fitting
    random_state=42
)

# Fit the model
dml.fit(Y, T, X=X, W=W)

# Get ATE and confidence interval
ate = dml.ate(X)
ci = dml.ate_interval(X, alpha=0.05)
print(f"ATE: {ate:.2f}, 95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")

# Get heterogeneous effects
cate = dml.effect(X)

Business Context

Real-World Application (Lowe’s): Retailers like Lowe’s have applied Double Machine Learning with multiple confounders to obtain the correct sign of causal effects of business features such as marketing spend and inventory levels on sales, enabling executives to answer “what if” questions about resource allocation.

📝 Exercises & Assignments

Implement DML with different ML models (Random Forests, Neural Networks, LASSO)
Compare DML estimates with naive OLS regression estimates
Perform a manual cross-fitting procedure to understand the mechanics
Apply DML to estimate the causal effect of bulk purchasing on customer retention

Causal Forests and Heterogeneous Treatment Effects

Individual-Level Treatment Effects with Honest Estimation

Learning Objectives

Understand the causal forest algorithm and honest estimation
Estimate individual-level treatment effects (CATE)
Identify customer segments with differential treatment responses
Implement causal forests using EconML and R’s grf package
Develop CATE-based targeting strategies for retail personalization

Mathematical Foundations

Causal Forest Estimator

τ̂(x) = (1/B) Σ_b=1^B τ̂^b(x)

where τ̂^b(x) = treatment effect estimate from tree b, B = number of trees

Local Estimation with Residualized Outcome and Treatment

τ̂(x) = [Σᵢ αᵢ(x) · (Yᵢ – μ̂(Xᵢ)) · (Dᵢ – π̂(Xᵢ))] / [Σᵢ αᵢ(x) · (Dᵢ – π̂(Xᵢ))²]

where αᵢ(x) = forest-derived kernel weights, μ̂(X) = estimated conditional mean outcome E[Y | X], π̂(X) = estimated propensity score, and D = the treatment indicator

Honest Estimation

One sample constructs the partition (splitting criterion maximizes treatment effect heterogeneity) and another estimates effects within each leaf — preventing overfitting.

CATE-Based Targeting

Rank customers by estimated treatment effects rather than predicted baseline outcomes. Customers with high predicted outcomes may not be causally affected and thus represent inefficient targets.
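
The difference between the two ranking rules can be illustrated with synthetic scores. Everything in the sketch below is an assumption made for illustration: the scores are simulated rather than estimated, and the simulation deliberately makes high spenders respond less to the promotion. In practice `baseline_spend` and `cate` would come from fitted models.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

# Hypothetical customer scores; in practice these would come from fitted models.
baseline_spend = rng.gamma(shape=2.0, scale=50.0, size=n)  # predicted spend without a promotion
cate = np.clip(20 - 0.1 * baseline_spend, 0, None) + rng.normal(scale=1.0, size=n)

budget = n // 4  # the promotion budget covers 25% of customers

# Strategy 1: target the customers with the highest predicted spend.
top_by_baseline = np.argsort(baseline_spend)[-budget:]
# Strategy 2: target the customers with the highest estimated treatment effect.
top_by_cate = np.argsort(cate)[-budget:]

uplift_baseline = cate[top_by_baseline].sum()
uplift_cate = cate[top_by_cate].sum()

print(f"Incremental spend, targeting by baseline: {uplift_baseline:,.0f}")
print(f"Incremental spend, targeting by CATE:     {uplift_cate:,.0f}")
```

Under the same budget, ranking by CATE can never capture less estimated uplift than ranking by baseline. How large the gap is depends on how strongly baseline spend and responsiveness diverge in the real customer base.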

Implementation — Python (EconML)

Python

from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# Initialize Causal Forest
cf = CausalForestDML(
    model_y=RandomForestRegressor(n_estimators=100, min_samples_leaf=10),
    model_t=RandomForestClassifier(n_estimators=100, min_samples_leaf=10),
    discrete_treatment=True,  # Treatment is binary, so model_t is a classifier
    n_estimators=2000,
    min_samples_leaf=5,
    random_state=42
)

# Fit on retail data
cf.fit(Y, T, X=X, W=W)

# Estimate individual treatment effects
cate_estimates = cf.effect(X)
cate_intervals = cf.effect_interval(X, alpha=0.05)

# Identify high-impact customer segments
analysis_df['CATE'] = cate_estimates
high_impact = analysis_df[analysis_df['CATE'] > np.percentile(cate_estimates, 75)]
low_impact = analysis_df[analysis_df['CATE'] < np.percentile(cate_estimates, 25)]

print(f"High-impact segment mean CATE: {high_impact['CATE'].mean():.2f}")
print(f"Low-impact segment mean CATE: {low_impact['CATE'].mean():.2f}")

Implementation — R (grf)

library(grf)

# Train causal forest
cf <- causal_forest(
  X = as.matrix(analysis_df[, c("Recency", "Frequency", "Monetary")]),
  Y = analysis_df$Outcome,
  W = analysis_df$Treatment,
  num.trees = 2000,
  honesty = TRUE,
  sample.fraction = 0.5
)

# Estimate CATE
cate <- predict(cf, estimate.variance = TRUE)

# Variable importance for heterogeneity drivers
varimp <- variable_importance(cf)

📝 Exercises & Assignments

Train a causal forest and visualize the distribution of CATE estimates
Identify the top 3 covariates driving treatment effect heterogeneity
Create customer segments based on CATE quartiles and profile each segment
Compare CATE estimates from Causal Forests with LinearDML

Business Applications, Robustness, and Deployment

From Causal Estimates to ROI: Refutation, Decision Frameworks & Capstone

Learning Objectives

Apply counterfactual methods to measure real-world retail impacts
Conduct robustness checks and refutation tests using DoWhy
Translate causal estimates into ROI and business decision frameworks
Complete a capstone project integrating all methods from the workshop

Retail Business Applications

🏷️ Promotion Effectiveness

Uplift modeling estimates the Individual Treatment Effect of promotions — identifying which customers are causally affected vs. those who would purchase regardless.

💰 Pricing Strategy

Causal forecasting models the price-demand relationship so retailers can set profit-optimal prices using elasticity as a counterfactual concept.

🛒 Product Recommendations

Causal inference in recommender systems addresses confounding biases and enables counterfactual offline policy evaluation.

⭐ Loyalty Programs

PSM isolates the causal spending uplift of loyalty membership by matching members with comparable non-members on observable characteristics.

Robustness Framework — DoWhy

Python

import dowhy
from dowhy import CausalModel

# DoWhy 4-step workflow: Model → Identify → Estimate → Refute
model = CausalModel(
    data=analysis_df,
    treatment='Treatment',
    outcome='Monetary',
    common_causes=['Recency', 'Frequency']
)

# Identify causal effect
identified = model.identify_effect()

# Estimate using IPW
estimate = model.estimate_effect(identified,
    method_name="backdoor.propensity_score_weighting")

# Refutation: placebo treatment
refute_placebo = model.refute_estimate(identified, estimate,
    method_name="placebo_treatment_refuter")

# Refutation: random common cause
refute_random = model.refute_estimate(identified, estimate,
    method_name="random_common_cause")

print(f"Estimated effect: {estimate.value:.2f}")
print(f"Placebo refutation p-value: {refute_placebo.refutation_result['p_value']:.4f}")
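
The logic behind a placebo test can be sketched without DoWhy. The example below is a simplified stand-in, not DoWhy's implementation: it uses synthetic data with a true effect of 2.0, a regression-adjusted estimator (`adjusted_effect` is a hypothetical helper), and a permuted treatment as the placebo. If the estimator is sound, replacing the real treatment with a shuffled one should drive the estimated effect toward zero.

```python
import numpy as np

def adjusted_effect(y, t, x):
    """Coefficient on t from an OLS regression of y on [1, t, x]."""
    design = np.column_stack([np.ones(len(y)), t, x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(size=n)                      # observed confounder
t = rng.binomial(1, 1 / (1 + np.exp(-x)))   # treatment depends on x
y = 2.0 * t + 1.5 * x + rng.normal(size=n)  # true effect = 2.0

estimate = adjusted_effect(y, t, x)

# Placebo test: shuffle the treatment so it cannot cause the outcome,
# then re-estimate. A sound estimator should return effects near zero.
placebo_effects = np.array([
    adjusted_effect(y, rng.permutation(t), x) for _ in range(200)
])
share_as_large = np.mean(np.abs(placebo_effects) >= abs(estimate))

print(f"Estimated effect:    {estimate:.3f}")
print(f"Mean placebo effect: {placebo_effects.mean():.3f}")
print(f"Share of placebo runs at least as large as the estimate: {share_as_large:.3f}")
```

A placebo effect that stays far from zero would suggest the estimate reflects something other than the treatment, such as a modeling artifact or leakage.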

From Causal Estimates to ROI

ROI Pipeline

1. Estimate causal effect (τ̂) using appropriate method
2. Total Incremental Revenue = τ̂ × N_{affected customers}
3. Net Benefit = Total Incremental Revenue − Implementation Cost
4. ROI = (Incremental Revenue − Cost) / Cost × 100%
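
The four steps above reduce to a few lines of arithmetic. In the sketch below, the function name `campaign_roi` and all input values are hypothetical; `tau_hat` stands for whichever causal estimate is chosen in step 1.

```python
def campaign_roi(tau_hat, n_affected, cost):
    """Translate a per-customer causal effect into campaign ROI (in percent)."""
    incremental_revenue = tau_hat * n_affected
    net_benefit = incremental_revenue - cost
    roi_pct = net_benefit / cost * 100
    return incremental_revenue, net_benefit, roi_pct

# Hypothetical inputs: uplift of 5.00 per customer, 1,000 customers, campaign cost of 2,000.
revenue, net, roi = campaign_roi(tau_hat=5.0, n_affected=1000, cost=2000.0)
print(f"Incremental revenue: {revenue:,.2f}")
print(f"Net benefit:         {net:,.2f}")
print(f"ROI:                 {roi:.1f}%")
```

Running the same function with the lower and upper bounds of the confidence interval for τ̂ gives a range for ROI rather than a single number, which is usually the more honest figure to present.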

Method Comparison

Method	Best Use Case	Key Strength	Limitation
PSM	Promotion impact on similar customers	Intuitive, transparent matching	Requires strong overlap
IPW	Population-level ATE estimation	Uses full sample, no data discarded	Sensitive to extreme weights
DML	Complex confounding with ML flexibility	Root-n consistent, model-agnostic	Requires correct functional form for θ
Causal Forests	Heterogeneous effects across segments	Non-parametric CATE, honest inference	Computationally intensive

Capstone Project

End-to-End Counterfactual Feature Evaluation: Using the UCI Online Retail dataset, participants must define a business question, preprocess data, construct treatment/control groups, apply ≥3 methods (PSM, IPW, DML, Causal Forests), conduct DoWhy refutation, translate findings into ROI estimates, and present to simulated stakeholders.

📝 Exercises & Assignments

Implement the full DoWhy refutation pipeline for your preferred treatment scenario
Compute the ROI of a hypothetical discount campaign using your causal estimates
Write a one-page executive summary translating CATE estimates into a targeting strategy
Compare results across all four methods to assess sensitivity of conclusions

Workshop Summary & Technology Stack

Lesson	Core Method	Key Formula	Retail Application
1	Potential Outcomes	ATE = E[Y(1) – Y(0)]	Framing causal questions
2	Data Engineering	RFM features, treatment construction	Dataset preparation
3	PSM & IPW	e(X) = P(W=1|X); τ̂_IPW	Promotion impact
4	Double ML	θ̌₀ with cross-fitting	Pricing & marketing ROI
5	Causal Forests	τ̂(x) = (1/B) Σ τ̂ᵇ(x)	Customer segmentation
6	Integration	Refutation + ROI	Business deployment

Recommended Technology Stack

Python: scikit-learn, EconML, DoWhy, pandas, numpy
R: grf, MatchIt, sandwich

Facilitator: Dharmateja Priyadarshi Uddandarao

Senior Statistician, Amazon

With over 9 years of experience in data science and extensive expertise in decision analytics and Python, Dharmateja Priyadarshi Uddandarao specializes in developing causal scientific models to evaluate the economic impact of high-value actions and products. His work focuses on designing causal experiments and forecasting financial scenarios. A prolific author, he contributes to publications such as The Data Scientist, AI Journ, Analytics Vidhya, ACM, Silicon Valley Journal, and HackerNoon, covering causal inference, econometrics, and applied statistical methodology. Dharmateja also contributes to the field through mentoring and committee roles in professional organizations and conferences.
