Code for "Language Models as Causal Effect Generators" (https://arxiv.org/pdf/2411.08019) implementing sequence-driven structural causal models (SD-SCMs). An SD-SCM allows for interventional and counterfactual data generation with a user-defined DAG and LLM-defined structural equations.
See confounder_collider.ipynb for example usage of the functions in sdscm.py to generate two SD-SCMs over the same set of variables (one with a confounder, another with a collider).
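To make the idea of a user-defined DAG with model-defined structural equations concrete, here is a minimal sketch that does not use the API of sdscm.py. The two DAGs, the variable names C, T and Y, and the function stub_prob (a stand-in for a language model's conditional probability) are all assumptions made for illustration. Counterfactuals are obtained here by reusing the same exogenous noise under different interventions, which is the standard SCM construction; how sdscm.py does this is documented in the notebook.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical DAGs over the same variables, written as child -> set of parents.
CONFOUNDER_DAG = {"C": set(), "T": {"C"}, "Y": {"C", "T"}}  # C -> T, C -> Y, T -> Y
COLLIDER_DAG = {"T": set(), "Y": {"T"}, "C": {"T", "Y"}}    # T -> Y, T -> C, Y -> C


def stub_prob(variable, parent_values):
    """Stand-in for the LLM: P(variable = 1 | parent values). Illustrative only."""
    return min(0.9, 0.2 + 0.3 * sum(parent_values.values()))


def draw_noise(dag, rng):
    """One exogenous uniform draw per variable, held fixed for a given unit."""
    return {var: rng.random() for var in dag}


def evaluate(dag, prob_fn, noise, interventions=None):
    """Compute each variable in topological order; do(var=value) ignores parents."""
    interventions = interventions or {}
    values = {}
    for var in TopologicalSorter(dag).static_order():
        if var in interventions:
            values[var] = interventions[var]
        else:
            parent_values = {p: values[p] for p in dag[var]}
            values[var] = int(noise[var] < prob_fn(var, parent_values))
    return values


rng = random.Random(0)
noise = draw_noise(CONFOUNDER_DAG, rng)
factual = evaluate(CONFOUNDER_DAG, stub_prob, noise)             # observational draw
treated = evaluate(CONFOUNDER_DAG, stub_prob, noise, {"T": 1})   # counterfactual under do(T=1)
control = evaluate(CONFOUNDER_DAG, stub_prob, noise, {"T": 0})   # counterfactual under do(T=0)
print(factual, treated, control)
```

Passing COLLIDER_DAG instead of CONFOUNDER_DAG changes only which variables each structural equation conditions on; the sampling loop is unchanged.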
The data folder contains 2000 example datasets for benchmarking treatment effect estimation algorithms (1000 from GPT-2, 1000 from Llama-3-8b), all generated from the breast cancer SD-SCM family described next.
This SD-SCM family is defined over 14 variables in order to explore the effect of a tumor’s PD-L1 expression levels on different breast cancer therapy plans.
The file bcancer_generation.ipynb demonstrates data generation using the breast cancer SD-SCM family. The notebook benchmark.ipynb replicates all effect estimation methods tested in the paper's example benchmark.
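As a rough illustration of why data with known counterfactuals is useful for benchmarking, the sketch below builds a toy dataset in which both potential outcomes are known, then compares a naive difference in means and a confounder-stratified estimate against the true average treatment effect. The data-generating process, column names and estimators are assumptions chosen for brevity; they are not the breast cancer SD-SCM or the methods used in benchmark.ipynb.

```python
import random
from statistics import mean

rng = random.Random(0)

# Toy data with one binary confounder c; both potential outcomes are recorded.
units = []
for _ in range(20000):
    c = int(rng.random() < 0.5)
    t = int(rng.random() < (0.8 if c else 0.2))  # treatment depends on c
    noise = rng.random()                         # shared by both potential outcomes
    y0 = int(noise < 0.2 + 0.4 * c)              # outcome under t = 0
    y1 = int(noise < 0.4 + 0.4 * c)              # outcome under t = 1
    units.append({"c": c, "t": t, "y0": y0, "y1": y1, "y": y1 if t else y0})


def diff_in_means(rows):
    """Mean observed outcome of treated rows minus that of control rows."""
    treated = [r["y"] for r in rows if r["t"] == 1]
    control = [r["y"] for r in rows if r["t"] == 0]
    return mean(treated) - mean(control)


# Ground truth is available only because both potential outcomes were generated.
true_ate = mean(r["y1"] - r["y0"] for r in units)

# Naive estimate ignores the confounder.
naive_ate = diff_in_means(units)

# Stratified estimate: difference in means within each level of c, weighted by stratum size.
adjusted_ate = sum(
    diff_in_means(stratum) * len(stratum) / len(units)
    for stratum in ([r for r in units if r["c"] == level] for level in (0, 1))
)

print(f"true={true_ate:.3f} naive={naive_ate:.3f} adjusted={adjusted_ate:.3f}")
```

The error of each estimate against true_ate is the kind of quantity a benchmark can report once ground-truth effects are available.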
- confounder_collider.ipynb: example usage of the functions in sdscm.py to generate two simple SD-SCMs
- bcancer_generation.ipynb: example generation of a breast cancer SD-SCM using the config file breast_cancer_config.json
- data/cancer_example/: 2000 example datasets for benchmarking treatment effect estimation algorithms (1000 from GPT-2, 1000 from Llama-3-8b) based on the breast cancer SD-SCM family
- benchmark.ipynb: replication of all effect estimation methods tested in the paper's example benchmark
- bcancer_plots.ipynb: some plots of the generated breast cancer datasets
Requirements:
catenets, econml, matplotlib, networkx, numpy, pandas, plotnine, rpy2, scikit-learn, seaborn, torch, tqdm, transformers
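A quick way to see which of these packages are missing from the current environment is sketched below. The helper missing_packages is hypothetical and not part of this repository; it only checks whether each import name can be found and does not verify versions. Note that scikit-learn is imported as sklearn, and that rpy2 additionally needs a working R installation, which this check does not cover.

```python
from importlib.util import find_spec

# Package name as listed above -> top-level import name (only scikit-learn differs).
REQUIREMENTS = {
    "catenets": "catenets",
    "econml": "econml",
    "matplotlib": "matplotlib",
    "networkx": "networkx",
    "numpy": "numpy",
    "pandas": "pandas",
    "plotnine": "plotnine",
    "rpy2": "rpy2",
    "scikit-learn": "sklearn",
    "seaborn": "seaborn",
    "torch": "torch",
    "tqdm": "tqdm",
    "transformers": "transformers",
}


def missing_packages(requirements=REQUIREMENTS):
    """Return the sorted package names whose import name cannot be located."""
    return sorted(
        package for package, module in requirements.items() if find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_packages()
    print("missing:", ", ".join(missing) if missing else "none")
```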
@article{bynumcho2024sdscm,
title = {Language Models as Causal Effect Generators},
author = {Bynum, Lucius EJ and Cho, Kyunghyun},
year = {2024},
eprint = {2411.08019},
journal = {arXiv Preprint arXiv:2411.08019},
}
