Project 1: Benchmarking Estimators for Conditional Average Treatment Effect (CATE)

Motivation

Understanding when different CATE estimators succeed or fail is core to both scientific discovery and policy learning. This project builds intuition about the assumptions and structural properties that estimators rely on (ignorability, overlap, smoothness, and the complexity of the outcome and propensity models) and how these interact with data-generating processes (DGPs).

Task Description

Build a rigorous benchmark to compare state-of-the-art CATE methods, including doubly robust learners (DR-learner), double machine learning (DML), meta-learners (X-/T-/S-/DA-learners), Bayesian Additive Regression Trees (BART), deep-learning approaches (DragonNet, TarNet, RANet), and amortized methods such as CausalPFN and Do-PFN, on families of DGPs explicitly constructed to favor each algorithm. Since there is no one-size-fits-all approach, analyze which conditions or assumptions lead each estimator to perform better.
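
As a concrete starting point, here is a minimal sketch of a benchmarking harness, assuming NumPy and scikit-learn are available. It fits two hand-rolled meta-learners (T- and S-learner) on a toy DGP with a known CATE and scores them by PEHE; the `toy_dgp`, `t_learner`, and `s_learner` helpers are illustrative stand-ins, not reference implementations of the methods listed above, and the more elaborate estimators (DR-learner, DML, BART, DragonNet, etc.) would plug into the same loop.

```python
# Minimal benchmarking sketch: hand-rolled T- and S-learners evaluated by PEHE
# on a toy synthetic DGP with known CATE. Illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def toy_dgp(n=2000, d=5):
    """Toy DGP with a known CATE so that PEHE can be computed exactly."""
    X = rng.normal(size=(n, d))
    e = 1 / (1 + np.exp(-X[:, 0]))           # propensity depends on X_0
    T = rng.binomial(1, e)
    tau = np.where(X[:, 1] > 0, 2.0, -1.0)   # piecewise (non-smooth) CATE
    mu0 = X[:, 0] + np.sin(X[:, 2])          # baseline outcome surface
    Y = mu0 + tau * T + rng.normal(scale=0.5, size=n)
    return X, T, Y, tau

def t_learner(X, T, Y):
    # Fit separate outcome models for treated and control units.
    m0 = RandomForestRegressor().fit(X[T == 0], Y[T == 0])
    m1 = RandomForestRegressor().fit(X[T == 1], Y[T == 1])
    return lambda Xte: m1.predict(Xte) - m0.predict(Xte)

def s_learner(X, T, Y):
    # Fit one outcome model with the treatment as an extra feature.
    m = RandomForestRegressor().fit(np.column_stack([X, T]), Y)
    return lambda Xte: (m.predict(np.column_stack([Xte, np.ones(len(Xte))]))
                        - m.predict(np.column_stack([Xte, np.zeros(len(Xte))])))

X, T, Y, _ = toy_dgp()
Xte, _, _, tau_te = toy_dgp(n=1000)
for name, fit in [("T-learner", t_learner), ("S-learner", s_learner)]:
    cate_hat = fit(X, T, Y)(Xte)
    pehe = np.sqrt(np.mean((cate_hat - tau_te) ** 2))
    print(f"{name}: PEHE = {pehe:.3f}")
```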

A Few Ideas on Synthetic Benchmarks

Below are a few ideas on which aspects of the DGPs to vary when generating synthetic benchmarks so that the resulting suite is diverse; a minimal sketch of one such configurable DGP appears after the list. Feel free to use your own ideas.

  1. High-dimensional linear outcome or propensity functions with approximate sparsity
  2. Low-overlap scenarios: regions of covariate space with very few treated or very few control samples
  3. Non-smooth piecewise CATE functions
  4. Strong nonlinear outcomes/propensities with covariate shift between train and test data
  5. Simple propensity functions with complex outcome functions
  6. Complex propensity functions with simple outcome functions
  7. Different function complexities for the potential outcomes $Y_0$ and $Y_1$
  8. Homoskedastic additive noise vs. heteroskedastic noise
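
To make some of these axes concrete, the sketch below (assuming only NumPy) generates data from a configurable DGP with knobs for overlap strength, CATE smoothness, and noise heteroskedasticity. The name `make_dgp`, the knobs, and the functional forms are illustrative assumptions, not a fixed benchmark specification; the remaining axes (high-dimensional sparsity, covariate shift, differing outcome complexity) would be added as further parameters.

```python
# Sketch of a configurable DGP covering some of the axes above: overlap strength,
# piecewise vs. smooth CATE, and homoskedastic vs. heteroskedastic noise.
import numpy as np

def make_dgp(n=5000, d=10, overlap=1.0, piecewise_cate=False,
             heteroskedastic=False, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))

    # Propensity: smaller `overlap` pushes scores toward 0/1 (low-overlap regime).
    logits = X[:, 0] / max(overlap, 1e-3)
    e = 1 / (1 + np.exp(-logits))
    T = rng.binomial(1, e)

    # CATE: either a non-smooth piecewise function or a smooth one.
    if piecewise_cate:
        tau = np.where(X[:, 1] > 0.5, 3.0, np.where(X[:, 1] < -0.5, -2.0, 0.0))
    else:
        tau = np.tanh(X[:, 1]) + 0.5 * X[:, 2]

    # Potential-outcome surfaces and noise (constant vs. covariate-dependent scale).
    mu0 = X[:, 0] + 0.5 * X[:, 3]
    mu1 = mu0 + tau
    scale = 0.5 * (1 + np.abs(X[:, 0])) if heteroskedastic else 0.5
    Y0 = mu0 + rng.normal(scale=scale, size=n)
    Y1 = mu1 + rng.normal(scale=scale, size=n)
    Y = np.where(T == 1, Y1, Y0)
    return dict(X=X, T=T, Y=Y, tau=tau, propensity=e)

# Example: a low-overlap, piecewise-CATE, heteroskedastic-noise dataset.
data = make_dgp(overlap=0.2, piecewise_cate=True, heteroskedastic=True)
```

Returning the true `tau` and propensity alongside the observed data makes it straightforward to compute oracle metrics such as PEHE and to diagnose where each estimator's assumptions are violated.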