Interpretable Regulatory Genomics

Interpretable Machine Learning on ENCODE Regulatory Data to Discover Hidden Switches in Non-Coding DNA

This project applies interpretable machine learning to ENCODE-annotated candidate cis-regulatory elements (cCREs) to identify sequence-level patterns ("regulatory switches") that distinguish active regulatory regions from non-regulatory non-coding DNA. The work progresses through classical baselines, convolutional neural networks, pre-trained foundation model embeddings, and multi-omics extensibility.

Optimized to run on accessible Kaggle-level compute infrastructure (T4 GPU).

Note

Phases 1–3 are complete. Solidification work for cross-cell-type generalization and documentation deliverables is ongoing.


Documentation

All specifications are compiled into PDF documents in the docs/ directory.


Discussions

| Discussion | Topic | Phase |
|---|---|---|
| #1 | Classical Interpretable Baselines & SHAP Explainability | Phase 1 |
| #16 | Phase 1 Solidification & Baseline Verification | Phase 1 |
| #4 | Deep Learning Baselines, Convolutional Filters & Attribution Maps | Phase 2 |
| #18 | Multi-class Deep Learning & Cross-Cell-Type Generalization | Phase 2 |
| #5 | Pre-trained Foundation Models & Embeddings | Phase 3 |
| #11 | Interpretable CNN Discovers Major Regulatory Switches in Non-Coding DNA | Cross-Phase |
| #12 | Extensibility into Multi-Omics or Cell-Type-Specific Prediction | Phase 4 |

Notebooks

| Notebook | Phase | Description |
|---|---|---|
| phase1_classical_baseline.ipynb | Phase 1 | Logistic Regression, Random Forest, and XGBoost on k-mer features with SHAP explainability |
| phase1_classical_baseline_ablation.ipynb | Phase 1 | Feature ablation (E2), k-mer resolution sweep (E3), and negative set sensitivity (E7) |
| phase2_deep_learning.ipynb | Phase 2 | CNN training, convolutional filter analysis, saliency maps, and integrated gradients |
| phase2_deep_learning_multiclass_generalization.ipynb | Phase 2 | 3-class element-type classification and K562 to GM12878 zero-shot evaluation |
| phase3_pretrained_embeddings.ipynb | Phase 3 | Nucleotide Transformer (500M) embeddings, UMAP projections, and embedding-based classification |

All notebooks are designed to run on Kaggle with T4 GPU acceleration. They can also be run locally with the appropriate data files (see src/data/download.py).
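The notebooks above mention two sequence representations: k-mer features for the classical models and one-hot input for the CNNs. The following is a minimal sketch of both, not the repository's actual code. The function names `kmer_features` and `one_hot` are hypothetical, and the choices to normalize counts to frequencies and to encode unknown bases (such as `N`) as all zeros are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_features(seq: str, k: int = 4) -> list[float]:
    """Return k-mer frequencies in lexicographic k-mer order (hypothetical helper).

    Assumption: counts are normalized to frequencies; windows containing
    bases outside ACGT (e.g. N) are ignored.
    """
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed vocabulary of 4**k k-mers so every sequence maps to the same columns.
    vocab = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[kmer] for kmer in vocab)
    return [counts[kmer] / total if total else 0.0 for kmer in vocab]


def one_hot(seq: str) -> list[list[int]]:
    """Encode each base as a 4-element indicator over A, C, G, T (hypothetical helper).

    Assumption: unknown bases become an all-zero row.
    """
    return [[int(base == b) for b in ALPHABET] for base in seq.upper()]


if __name__ == "__main__":
    feats = kmer_features("ACGTACGT", k=2)
    print(len(feats))          # 16 columns for k=2
    print(one_hot("ACN"))
```

With k=4 the feature vector has 256 columns, which keeps tree-based models and SHAP attributions tractable on modest hardware.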


Project Phases

| Phase | Description | Best Model | Test Metric (AUROC unless noted) | Status |
|---|---|---|---|---|
| Phase 1 | Classical Interpretable Baselines | XGBoost (k=4 k-mers) | 0.8830 | Complete |
| Phase 2 | Deep Learning (CNNs) | AttentionCNN (one-hot) | 0.8604 | Complete |
| Phase 2+ | Multi-class & Cross-Cell-Type | AttentionCNN (3-class) | 81.25% accuracy | Complete |
| Phase 3 | Pre-trained Foundation Models | Nucleotide Transformer (500M) | 0.9176 | Complete |
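For readers unfamiliar with the AUROC metric reported above, here is a small dependency-free sketch of how it can be computed from binary labels and model scores. It uses the pairwise (Mann-Whitney) definition: the fraction of positive/negative pairs in which the positive example scores higher, with ties counted as half. The function name `auroc` is hypothetical, and the notebooks may well rely on a library routine instead.

```python
def auroc(labels: list[int], scores: list[float]) -> float:
    """Pairwise AUROC for binary labels (1 = positive, 0 = negative).

    Illustrative O(P * N) version; a rank-based implementation would be
    preferable for large test sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    # Each positive/negative pair contributes 1 if ranked correctly, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of regulatory from non-regulatory sequences.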

Quickstart

git clone https://github.com/PxA-Labs/interpretable-regulatory-genomics.git
cd interpretable-regulatory-genomics
pip install -r requirements.txt
pytest tests/ -v

License

MIT License — see LICENSE.

Contributing

See CONTRIBUTING.md for guidelines.

About

Interpretable machine learning system for discovering hidden regulatory switches in non-coding DNA using ENCODE genomic data. Optimized for resource-constrained, reproducible research.
