Interpretable Regulatory Genomics

Interpretable Machine Learning on ENCODE Regulatory Data to Discover Hidden Switches in Non-Coding DNA

This project applies interpretable machine learning to ENCODE-annotated candidate cis-regulatory elements (cCREs) to identify sequence-level patterns ("regulatory switches") that distinguish active regulatory regions from non-regulatory non-coding DNA. The work progresses through four phases: classical baselines, convolutional neural networks, pre-trained foundation model embeddings, and multi-omics extensibility.

All experiments are sized to run on accessible Kaggle-level compute (T4 GPU).

Note

Phases 1–3 are complete. Solidification work for cross-cell-type generalization and documentation deliverables is ongoing.

Documentation

All specifications are compiled into PDF documents in docs/:

01-Project Charter — Problem statement, vision, goals, non-goals, and constraints.
02-Product Requirements Document — User personas, use cases, and acceptance criteria.
03-Technical Design Document — System architecture, data flow, and pipeline layers.
04-Research Design Document — Scientific hypotheses, evaluation matrix, and experiments.
05-Dataset Strategy — Data acquisition, labelling, leakage prevention, and QC checks.
06-Modeling Roadmap — Phase-wise modeling plan and baseline configurations.
07-Compute Feasibility Memo — Compute budget and resource constraints.
08-Experiment Tracking & MLOps — Notebook hygiene, config structures, and reproducibility.
09-Risk Register — Scientific, modeling, and timeline risks with mitigations.
10-Roadmap & Milestones — 12-month delivery roadmap and check-off criteria.
11-Glossary & Project Memory — Definitions, locked decisions, and scope limitations.
12-Contributor Onboarding Brief — Setup checklist and coding guidelines.

Discussions

Discussion	Topic	Phase
#1	Classical Interpretable Baselines & SHAP Explainability	Phase 1
#16	Phase 1 Solidification & Baseline Verification	Phase 1
#4	Deep Learning Baselines, Convolutional Filters & Attribution Maps	Phase 2
#18	Multi-class Deep Learning & Cross-Cell-Type Generalization	Phase 2
#5	Pre-trained Foundation Models & Embeddings	Phase 3
#11	Interpretable CNN Discovers Major Regulatory Switches in Non-Coding DNA	Cross-Phase
#12	Extensibility into Multi-Omics or Cell-Type-Specific Prediction	Phase 4

Notebooks

Notebook	Phase	Description
phase1_classical_baseline.ipynb	Phase 1	Logistic Regression, Random Forest, and XGBoost on k-mer features with SHAP explainability
phase1_classical_baseline_ablation.ipynb	Phase 1	Feature ablation (E2), k-mer resolution sweep (E3), and negative set sensitivity (E7)
phase2_deep_learning.ipynb	Phase 2	CNN training, convolutional filter analysis, saliency maps, and integrated gradients
phase2_deep_learning_multiclass_generalization.ipynb	Phase 2	3-class element-type classification and K562 to GM12878 zero-shot evaluation
phase3_pretrained_embeddings.ipynb	Phase 3	Nucleotide Transformer (500M) embeddings, UMAP projections, and embedding-based classification

All notebooks are designed to run on Kaggle with T4 GPU acceleration. They can also be run locally with the appropriate data files (see src/data/download.py).
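
The two sequence encodings named in this README, k-mer features for the classical models and one-hot input for the CNNs, can be sketched as follows. This is a minimal illustration using only the Python standard library, not the notebooks' actual code: the function names `kmer_vocabulary`, `kmer_frequencies`, and `one_hot_encode`, the frequency normalisation, and the handling of ambiguous bases such as N are all assumptions.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def kmer_vocabulary(k: int) -> list[str]:
    """Return all 4**k DNA k-mers in lexicographic order."""
    return ["".join(p) for p in product(BASES, repeat=k)]


def kmer_frequencies(sequence: str, k: int = 4) -> list[float]:
    """Return a fixed-length vector of normalised k-mer frequencies.

    Assumption: windows containing anything other than A, C, G, T
    (for example N) are skipped and do not contribute to any count.
    """
    seq = sequence.upper()
    vocab = kmer_vocabulary(k)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    valid_total = sum(counts[kmer] for kmer in vocab)
    if valid_total == 0:
        return [0.0] * len(vocab)
    return [counts[kmer] / valid_total for kmer in vocab]


def one_hot_encode(sequence: str) -> list[list[float]]:
    """Encode DNA as a len(sequence) x 4 matrix with columns A, C, G, T.

    Assumption: ambiguous bases are encoded as an all-zero row.
    """
    column = {base: i for i, base in enumerate(BASES)}
    matrix = []
    for base in sequence.upper():
        row = [0.0] * len(BASES)
        if base in column:
            row[column[base]] = 1.0
        matrix.append(row)
    return matrix


features = kmer_frequencies("ACGTACGTNNACGT", k=4)
encoded = one_hot_encode("ACGTN")
print(len(features))  # 256
print(len(encoded), len(encoded[0]))  # 5 4
```

With k=4 the k-mer vector always has 4^4 = 256 entries regardless of region length, which suits tabular models such as logistic regression or gradient-boosted trees. The one-hot matrix instead preserves base positions, which convolutional filters need in order to pick up motif-like patterns.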

Project Phases

Phase	Description	Best Model	Test Metric	Status
Phase 1	Classical Interpretable Baselines	XGBoost (k=4 k-mers)	AUROC 0.8830	Complete
Phase 2	Deep Learning (CNNs)	AttentionCNN (one-hot)	AUROC 0.8604	Complete
Phase 2+	Multi-class & Cross-Cell-Type	AttentionCNN (3-class)	Accuracy 81.25%	Complete
Phase 3	Pre-trained Foundation Models	Nucleotide Transformer (500M)	AUROC 0.9176	Complete
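
Most results in the table above are reported as test AUROC. As a reminder of what that number measures, the sketch below computes AUROC directly from its pairwise definition: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as half. The function name `auroc` and the toy labels and scores are illustrative assumptions, and the quadratic loop is for clarity only; it is not how the notebooks compute the metric.

```python
def auroc(labels, scores):
    """AUROC as P(score of a random positive > score of a random negative).

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = regulatory element, 0 = background sequence.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(auroc(toy_labels, toy_scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the AUROC figures in the table are comparable across phases even though the models differ. The accuracy reported for the 3-class model is a different metric and should not be compared with them directly.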

Quickstart

git clone https://github.com/PxA-Labs/interpretable-regulatory-genomics.git
cd interpretable-regulatory-genomics
pip install -r requirements.txt
pytest tests/ -v

License

MIT License — see LICENSE.

Contributing

See CONTRIBUTING.md for guidelines.
