Interpretable Machine Learning on ENCODE Regulatory Data to Discover Hidden Switches in Non-Coding DNA
This project applies interpretable machine learning to ENCODE-annotated candidate cis-regulatory elements (cCREs) to identify sequence-level patterns — "regulatory switches" — that distinguish active regulatory regions from non-regulatory non-coding DNA. The work progresses through classical baselines, convolutional neural networks, pre-trained foundation model embeddings, and multi-omics extensibility.
The pipeline is optimized to run on accessible, Kaggle-level compute (a single T4 GPU).
Note
Phases 1–3 are complete. Solidification work for cross-cell-type generalization and documentation deliverables is ongoing.
All specifications are compiled as PDF documents inside docs/:
- 01-Project Charter — Problem statement, vision, goals, non-goals, and constraints.
- 02-Product Requirements Document — User personas, use cases, and acceptance criteria.
- 03-Technical Design Document — System architecture, data flow, and pipeline layers.
- 04-Research Design Document — Scientific hypotheses, evaluation matrix, and experiments.
- 05-Dataset Strategy — Data acquisition, labelling, leakage prevention, and QC checks.
- 06-Modeling Roadmap — Phase-wise modeling plan and baseline configurations.
- 07-Compute Feasibility Memo — Compute budget and resource constraints.
- 08-Experiment Tracking & MLOps — Notebook hygiene, config structures, and reproducibility.
- 09-Risk Register — Scientific, modeling, and timeline risks with mitigations.
- 10-Roadmap & Milestones — 12-month delivery roadmap and check-off criteria.
- 11-Glossary & Project Memory — Definitions, locked decisions, and scope limitations.
- 12-Contributor Onboarding Brief — Setup checklist and coding guidelines.
| Discussion | Topic | Phase |
|---|---|---|
| #1 | Classical Interpretable Baselines & SHAP Explainability | Phase 1 |
| #16 | Phase 1 Solidification & Baseline Verification | Phase 1 |
| #4 | Deep Learning Baselines, Convolutional Filters & Attribution Maps | Phase 2 |
| #18 | Multi-class Deep Learning & Cross-Cell-Type Generalization | Phase 2 |
| #5 | Pre-trained Foundation Models & Embeddings | Phase 3 |
| #11 | Interpretable CNN Discovers Major Regulatory Switches in Non-Coding DNA | Cross-Phase |
| #12 | Extensibility into Multi-Omics or Cell-Type-Specific Prediction | Phase 4 |
| Notebook | Phase | Description |
|---|---|---|
| phase1_classical_baseline.ipynb | Phase 1 | Logistic Regression, Random Forest, and XGBoost on k-mer features with SHAP explainability |
| phase1_classical_baseline_ablation.ipynb | Phase 1 | Feature ablation (E2), k-mer resolution sweep (E3), and negative set sensitivity (E7) |
| phase2_deep_learning.ipynb | Phase 2 | CNN training, convolutional filter analysis, saliency maps, and integrated gradients |
| phase2_deep_learning_multiclass_generalization.ipynb | Phase 2 | 3-class element-type classification and K562 to GM12878 zero-shot evaluation |
| phase3_pretrained_embeddings.ipynb | Phase 3 | Nucleotide Transformer (500M) embeddings, UMAP projections, and embedding-based classification |
All notebooks are designed to run on Kaggle with T4 GPU acceleration. They can also be run locally with the appropriate data files (see src/data/download.py).
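The notebooks above rely on two sequence representations: k-mer count features for the classical models and one-hot encoding for the CNNs. The following is a minimal, standard-library sketch of both, not the project's actual featurization code. The function names `kmer_counts` and `one_hot`, the A/C/G/T channel order, and the handling of ambiguous bases such as `N` are assumptions made for illustration.

```python
from __future__ import annotations

from itertools import product

ALPHABET = "ACGT"  # assumed channel / k-mer ordering


def kmer_counts(seq: str, k: int = 4) -> list[int]:
    """Count overlapping k-mers in seq.

    Returns a vector of length 4**k in lexicographic k-mer order.
    Windows containing a non-ACGT base (e.g. N) are skipped.
    """
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:
            counts[pos] += 1
    return counts


def one_hot(seq: str) -> list[list[int]]:
    """Encode seq as len(seq) rows of 4 channels (A, C, G, T).

    Bases outside the alphabet are encoded as an all-zero row.
    """
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        channel = ALPHABET.find(base)
        if channel >= 0:
            row[channel] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    features = kmer_counts("ACGTACGTNACGT", k=4)
    print(len(features), sum(features))  # vector length, number of valid windows
    print(one_hot("ACGN"))
```

With k=4 the feature vector has 256 entries, which keeps tree-based models and SHAP attributions tractable. Whether the notebooks normalize the counts or collapse reverse complements is not stated here, so the sketch does neither.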
| Phase | Description | Best Model | Test AUROC (accuracy where noted) | Status |
|---|---|---|---|---|
| Phase 1 | Classical Interpretable Baselines | XGBoost (k=4 k-mers) | 0.8830 | Complete |
| Phase 2 | Deep Learning (CNNs) | AttentionCNN (one-hot) | 0.8604 | Complete |
| Phase 2+ | Multi-class & Cross-Cell-Type | AttentionCNN (3-class) | 81.25% accuracy | Complete |
| Phase 3 | Pre-trained Foundation Models | Nucleotide Transformer (500M) | 0.9176 | Complete |
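For readers less familiar with the headline metric, AUROC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition. It is an illustration only: the function name `auroc` is hypothetical, and the project's own evaluation code may well use a library routine such as scikit-learn's `roc_auc_score` instead.

```python
from __future__ import annotations


def auroc(labels: list[int], scores: list[float]) -> float:
    """Pairwise (Mann-Whitney) AUROC for binary labels in {0, 1}.

    Runs in O(n_pos * n_neg) time, which is fine for illustration
    but slow for large test sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    y_true = [0, 0, 1, 1]
    y_score = [0.1, 0.4, 0.35, 0.8]
    print(auroc(y_true, y_score))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the AUROC rows of the table. The 3-class row reports accuracy instead and is not directly comparable.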
git clone https://github.com/PxA-Labs/interpretable-regulatory-genomics.git
cd interpretable-regulatory-genomics
pip install -r requirements.txt
pytest tests/ -v
MIT License. See LICENSE.
See CONTRIBUTING.md for guidelines.