ragdx is a Python workbench for RAG evaluation, diagnosis, optimization, and monitoring.
It sits above an existing RAG application as a quality and optimization control plane rather than trying to replace your runtime framework, retriever stack, or orchestration layer.
- normalizes evaluation signals from external tools into one `EvaluationResult`
- diagnoses likely failure sources using rules, an explicit causal graph, and optional LLM reasoning
- generates staged optimization plans across corpus, retrieval, generation, orchestration, and joint layers
- executes optimization sessions in `simulate`, `prepare_only`, or `execute` mode
- persists runs, sessions, traces, feedback, and learned causal priors in a local file store
- provides both a CLI and a Streamlit dashboard for inspection and reporting
ragdx experiment runs the complete optimization pipeline against any
supported corpus and writes a JSON bundle that the bundled Streamlit
dashboards render directly:
corpus -> resolve / synthesise questions -> Bayesian RAG-config search
-> DSPy before / after at the winner config -> composite scoring
-> JSON bundle for the dashboard
Note: the Bayesian search is ragdx's own implementation in
ragdx.optim.bayes_search (sklearn GP + Expected Improvement). It is
inspired by AutoRAG's design but does NOT shell out to AutoRAG itself.
The AutoRAGAdapter class is a separate, lighter-weight surface that
only renders an AutoRAG-style YAML config you can hand to AutoRAG
externally.
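For orientation, Expected Improvement scores a candidate configuration from the surrogate model's predicted mean and standard deviation. The sketch below is a standalone, standard-library illustration of that acquisition formula for a maximization objective. It is not the code in ragdx.optim.bayes_search; the function name and the `xi` exploration margin are assumptions made for the example.

```python
import math


def expected_improvement(mu: float, sigma: float, best: float, xi: float = 0.0) -> float:
    """Expected Improvement for maximization, given a Gaussian prediction.

    mu, sigma: predicted mean and standard deviation at a candidate config.
    best: best objective value observed so far.
    xi: optional exploration margin (hypothetical knob for this sketch).
    """
    improvement = mu - best - xi
    if sigma <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return improvement * cdf + sigma * pdf


# Two candidates with the same predicted mean, slightly below the incumbent:
# the more uncertain one receives the higher acquisition value.
print(expected_improvement(mu=0.70, sigma=0.02, best=0.72))
print(expected_improvement(mu=0.70, sigma=0.10, best=0.72))
```

The second call scores higher because a wider predictive distribution leaves more probability mass above the current best, which is what drives exploration in this style of search.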
Required arguments:
| flag | values |
|---|---|
| corpus (positional) | HuggingFace dataset name (org/dataset), .pdf path, or .jsonl corpus path |
| `--has-gt` / `--no-gt` | whether the corpus carries ground-truth answers |
| `--mode` | with_gt, no_gt, both, or auto. with_gt / both require `--has-gt` |
| `--questions PATH` | optional JSONL with {question, ground_truth, contexts?}; required when `--has-gt` and corpus is a PDF or JSONL |
| `--n-questions`, `--bo-trials`, `--bo-init` | search budget and dataset size |
| `--output-dir` | where result.json is written; defaults to .ragdx_experiment |
| `--api-key`, `--api-base`, `--model` | LLM routing; falls back to ZHIPU_API_KEY / OPENAI_API_KEY env vars |
| `--save-run` | also persist the run in the local RunStore so it appears in ragdx runs |
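The constraints between these flags can be read as simple validation rules. The sketch below restates them in Python purely as an illustration: `check_experiment_args` is a hypothetical helper, not part of ragdx, and the CLI's real validation and error messages may differ.

```python
from typing import List, Optional


def check_experiment_args(corpus: str, has_gt: bool, mode: str,
                          questions: Optional[str] = None) -> List[str]:
    """Return a list of problems; an empty list means the combination looks valid."""
    problems = []
    if mode not in {"with_gt", "no_gt", "both", "auto"}:
        problems.append(f"unknown mode: {mode}")
    # with_gt / both require --has-gt.
    if mode in {"with_gt", "both"} and not has_gt:
        problems.append(f"--mode {mode} requires --has-gt")
    # A PDF or JSONL corpus with ground truth needs a --questions file.
    is_local_file = corpus.lower().endswith((".pdf", ".jsonl"))
    if has_gt and is_local_file and questions is None:
        problems.append("--questions is required with --has-gt for PDF/JSONL corpora")
    return problems


print(check_experiment_args("explodinggradients/amnesty_qa", True, "both"))
print(check_experiment_args("report.pdf", False, "with_gt"))
```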
export ZHIPU_API_KEY=<your-key>
ragdx experiment explodinggradients/amnesty_qa \
--has-gt --mode both \
--n-questions 5 --bo-trials 8 \
--output-dir .ragdx_optimize_demo
# Render the bundle in the generic dashboard
ragdx experiment-dashboard --bundle .ragdx_optimize_demo/result.json
ragdx experiment <path/to/your.pdf> \
--no-gt \
--n-questions 5 --bo-trials 8 \
--output-dir .ragdx_pdf_no_gt_demo
ragdx experiment-dashboard --bundle .ragdx_pdf_no_gt_demo/result.json
The CLI is a thin wrapper around ragdx.run_experiment. The same call
in Python:
from ragdx import run_experiment
result = run_experiment(
corpus="docs/asmpt-esg-report.pdf",
has_gt=False,
n_questions=5,
n_bo_trials=8,
api_key="<your-key>",
output_dir=".ragdx_pdf_no_gt_demo",
)
print(result.bundle["bayes_search"]["no_gt"]["best_params"])
print(result.output_path)  # .ragdx_pdf_no_gt_demo/result.json
The bundle uses a stable schema_version: 1 layout (meta, questions,
data_diagnostics, objectives, bayes_search, dspy_a_b, extras,
each a mode-keyed dict), so the dashboard never branches on single-mode
vs. side-by-side runs. Bundles produced by the older demo scripts are
auto-upgraded by ragdx.experiments.migrate_legacy_bundle.
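Because each section is keyed by mode, downstream code can treat single-mode and side-by-side bundles the same way. A minimal sketch, relying only on the `bayes_search` / mode / `best_params` path used in the Python call above: `best_params_by_mode` is a hypothetical helper, and the sample dict (including the `top_k` parameter) is made-up data, not real output.

```python
import json
from typing import Any, Dict


def best_params_by_mode(bundle: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the winning config for every mode present in the bundle."""
    search = bundle.get("bayes_search", {})
    return {mode: section.get("best_params") for mode, section in search.items()}


# Made-up stand-in for the parsed contents of result.json; values are illustrative.
sample = json.loads("""
{
  "bayes_search": {
    "with_gt": {"best_params": {"top_k": 4}},
    "no_gt": {"best_params": {"top_k": 6}}
  }
}
""")
print(best_params_by_mode(sample))
```

The same loop works unchanged when only one mode is present, which is the property the mode-keyed layout is meant to provide.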
run_experiment returns an ExperimentResult (.config, .bundle,
.output_path, .save(path?)). For finer control, the underlying
building blocks are independently importable: ragdx.loaders.load_pdf_chunks,
ragdx.datasets.synthesize_questions, ragdx.optim.bayes_search.BayesianSearch,
ragdx.optim.objectives.default_objective, ragdx.optim.dspy_adapter.DSPyAdapter.
The standalone scripts examples/demo_optimize_gt_modes.py and
examples/demo_pdf_no_gt.py still exist as readable step-by-step
walkthroughs of the same flow.
ragdx is designed to work with, not replace, tools you may already use:
- `Ragas` and `RAGChecker` for evaluation inputs
- `DSPy` and `AutoRAG` for optimization-oriented adapters
- `LangChain` and `LlamaIndex` for runtime execution
flowchart TD
A[Evaluation JSON / Tool Outputs / Traces / Feedback] --> B[Normalization]
B --> C[Diagnosis]
C --> D[Optimization Plan]
D --> E[Optimization Session]
E --> F[Runs / Sessions / Feedback / Causal Priors]
F --> G[CLI / Dashboard / Reports]
Typical workflow:
- prepare a normalized evaluation JSON or normalize external evaluator output
- run diagnosis
- inspect the generated optimization plan
- simulate or execute optimization trials
- save runs and sessions
- review results in the CLI or dashboard
- attach feedback and repeat
Python requirement:
- Python `>=3.10`
Base install:
pip install -e .
Optional extras:
pip install -e ".[openai]"
pip install -e ".[langchain,llamaindex,bo]"
pip install -e ".[all]"
Available extras:
- `openai`, `anthropic`, `azure`, `ollama`: LLM providers
- `ragas`, `ragchecker`: evaluation backends
- `dspy`, `autorag`: optimization adapters
- `langchain`, `llamaindex`: runtime framework adapters
- `experiment`: everything `ragdx experiment` / `ragdx.run_experiment` need to drive the full pipeline (PDF loader, HF datasets, FAISS, sentence-transformers, langchain HF/OpenAI wrappers, ragas, dspy)
- `bo`: heavy Bayesian optimization (Ax / BoTorch / Torch)
- `dev`: pytest, ruff, mypy, coverage, build
- `all`: every optional runtime backend (no `dev`)
ragdx reads its configuration from environment variables. Inspect the resolved configuration at any time with:
ragdx show-config
Key environment variables:
| Variable | Default | Description |
|---|---|---|
| `RAGDX_ROOT` | `.ragdx` | Persistence root directory. |
| `RAGDX_LLM_PROVIDER` | `openai` | One of openai, anthropic, azure, ollama, or a custom provider you register. |
| `RAGDX_LLM_MODEL` | provider default | Override the model identifier. |
| `RAGDX_LLM_TIMEOUT` | `60` | Per-request timeout (seconds). |
| `OPENAI_API_KEY` / `ANTHROPIC_API_KEY` / `AZURE_OPENAI_API_KEY` | unset | Provider credentials. |
| `AZURE_OPENAI_ENDPOINT` / `AZURE_OPENAI_API_VERSION` | unset / `2024-06-01` | Azure deployment endpoint and API version. |
| `OLLAMA_HOST` | `http://127.0.0.1:11434` | Ollama HTTP server. |
| `RAGDX_STRICT_EXECUTE` | `true` | Fail loudly in execute mode when no runner is configured. Set to 0 to allow the simulator fallback. |
| `RAGDX_RUNNER_TIMEOUT_SEC` | unset | Per-trial subprocess timeout. |
| `RAGDX_BO_BACKEND` | `internal` | internal (sklearn GP) or ax (heavy BO backend). |
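To make the defaults concrete, the sketch below resolves a few of these variables the way the table describes. It is an illustration only: `resolve_settings` is a hypothetical function, not ragdx's loader, and the set of strings treated as false is an assumption that goes beyond the documented `0`.

```python
import os
from typing import Dict, Mapping, Union


def resolve_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Union[str, float, bool]]:
    """Apply the documented defaults to a mapping of environment variables."""
    strict_raw = env.get("RAGDX_STRICT_EXECUTE", "true").strip().lower()
    return {
        "root": env.get("RAGDX_ROOT", ".ragdx"),
        "llm_provider": env.get("RAGDX_LLM_PROVIDER", "openai"),
        "llm_timeout": float(env.get("RAGDX_LLM_TIMEOUT", "60")),
        # Assumption: "0", "false", and "no" all disable strict mode.
        "strict_execute": strict_raw not in {"0", "false", "no"},
        "bo_backend": env.get("RAGDX_BO_BACKEND", "internal"),
    }


# An empty mapping yields the defaults; explicit values override them.
print(resolve_settings({}))
print(resolve_settings({"RAGDX_STRICT_EXECUTE": "0", "RAGDX_BO_BACKEND": "ax"}))
```

For the configuration ragdx actually resolved, `ragdx show-config` remains the authoritative check.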
Any callable matching Callable[[str], str] works as an LLM backend.
Register your own:
from ragdx.llm import register_provider
from ragdx.llm.base import LLMProvider

class MyProvider(LLMProvider):
    name = "myprovider"
    default_model = "my-model-v1"

    def complete(self, prompt: str) -> str:
        # my_backend stands for your own client object; it is not provided by ragdx.
        return my_backend.generate(prompt)

def factory(settings):
    return MyProvider(model=settings.model or MyProvider.default_model)

register_provider("myprovider", factory)
# Now `RAGDX_LLM_PROVIDER=myprovider` resolves to your backend.
Diagnose a normalized evaluation file:
ragdx diagnose examples/demo_evaluation.json
Generate a human-readable plan:
ragdx plan examples/demo_evaluation.json --human-readable
Run a simulated optimization session:
ragdx optimize examples/demo_evaluation.json --strategy bayesian --budget 8 --mode simulate
Inspect saved sessions:
ragdx sessions
ragdx monitor-session <SESSION_ID>
ragdx dashboard
Save a run and export a markdown report:
ragdx save examples/demo_evaluation.json --name baseline-demo
ragdx runs
ragdx export-report <RUN_ID> run_report.md
ragdx works from a normalized EvaluationResult structure. Minimal input looks like this:
{
"retrieval": {
"context_precision": 0.68,
"context_recall": 0.72
},
"generation": {
"faithfulness": 0.81,
"response_relevancy": 0.79
},
"e2e": {
"answer_correctness": 0.74,
"citation_accuracy": 0.77
},
"metadata": {
"dataset": "demo"
}
}
The schema also supports richer evidence such as:
- traces and spans
- evaluator-specific scores
- evaluator calibration data
- production or reviewer feedback events
- raw tool outputs
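Before running a diagnosis it can help to see which scores in a normalized file are weakest. The sketch below does that for the minimal input shown earlier, using only the standard library. `weakest_metrics` is a hypothetical helper and plays no part in ragdx's own diagnosis.

```python
import json
from typing import Dict, Tuple

MINIMAL_INPUT = """
{
  "retrieval": {"context_precision": 0.68, "context_recall": 0.72},
  "generation": {"faithfulness": 0.81, "response_relevancy": 0.79},
  "e2e": {"answer_correctness": 0.74, "citation_accuracy": 0.77},
  "metadata": {"dataset": "demo"}
}
"""


def weakest_metrics(evaluation: Dict[str, dict]) -> Dict[str, Tuple[str, float]]:
    """Return the lowest-scoring metric for each metric section that is present."""
    result = {}
    for section in ("retrieval", "generation", "e2e"):
        scores = evaluation.get(section, {})
        if scores:
            name = min(scores, key=scores.get)
            result[section] = (name, scores[name])
    return result


print(weakest_metrics(json.loads(MINIMAL_INPUT)))
```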
Diagnosis combines three layers:
- rule-based analysis over metrics, traces, and feedback
- an explicit weighted causal graph with priors and posteriors
- optional LLM refinement or summary
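As an illustration of how priors and posteriors over candidate causes can interact, the sketch below applies Bayes' rule to a flat set of causes. It is deliberately simplified and is not ragdx's causal graph: the cause names, prior values, and likelihoods are all invented for the example.

```python
from typing import Dict


def posterior(priors: Dict[str, float], likelihoods: Dict[str, float]) -> Dict[str, float]:
    """Bayes' rule: the posterior is proportional to prior times likelihood."""
    unnormalized = {cause: priors[cause] * likelihoods.get(cause, 0.0) for cause in priors}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("evidence has zero probability under every cause")
    return {cause: value / total for cause, value in unnormalized.items()}


# Hypothetical causes and numbers, chosen only to show the update.
priors = {"chunking": 0.5, "retriever_top_k": 0.3, "prompt": 0.2}
# P(observed low context_recall | cause)
likelihoods = {"chunking": 0.6, "retriever_top_k": 0.8, "prompt": 0.1}

for cause, prob in sorted(posterior(priors, likelihoods).items(), key=lambda kv: -kv[1]):
    print(f"{cause}: {prob:.3f}")
```

Persisting the resulting posteriors and reusing them as the next run's priors is one way a learned prior store could work; whether ragdx updates its priors this way is not stated here.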
Planning is explicitly:
- stage-aware
- baseline-relative
- multi-objective
- constraint-aware
Plans can target the following stages:
- `corpus`
- `retrieval`
- `generation`
- `orchestration`
- `joint`
The optimizer distinguishes three concepts that should not be conflated:
- `objective_weights`: trade-off coefficients
- `target_thresholds` and `target_specs`: target regions relative to baseline
- `constraint_bounds`: feasibility limits such as latency, cost, or hallucination ceilings
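The distinction can be shown with a toy scoring routine: bounds reject trials outright, weights rank the feasible ones, and targets mark trials that reach the desired region. This is a sketch under assumptions, not ragdx's optimizer. The function, the metric values, and the convention that bounds are upper limits are hypothetical, and it uses absolute thresholds rather than baseline-relative targets to stay short.

```python
from typing import Dict, Optional


def score_trial(metrics: Dict[str, float],
                objective_weights: Dict[str, float],
                target_thresholds: Dict[str, float],
                constraint_bounds: Dict[str, float]) -> Optional[dict]:
    """Return None for infeasible trials, else the weighted score and targets met."""
    # Constraints: hard feasibility limits (assumed here to be upper bounds).
    for name, upper in constraint_bounds.items():
        if metrics[name] > upper:
            return None
    # Weights: trade-off coefficients used only to rank feasible trials.
    score = sum(weight * metrics[name] for name, weight in objective_weights.items())
    # Targets: regions we want to reach; missing one does not make a trial infeasible.
    targets_met = all(metrics[name] >= goal for name, goal in target_thresholds.items())
    return {"score": score, "targets_met": targets_met}


weights = {"faithfulness": 0.6, "context_recall": 0.4}
targets = {"faithfulness": 0.85}
bounds = {"latency_s": 2.0}

fast = {"faithfulness": 0.90, "context_recall": 0.70, "latency_s": 1.5}
slow = {"faithfulness": 0.95, "context_recall": 0.90, "latency_s": 3.0}
print(score_trial(fast, weights, targets, bounds))
print(score_trial(slow, weights, targets, bounds))
```

The slower trial has better quality metrics but is rejected before scoring, which is why a constraint should not be folded into the weights.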
Main commands:
- `ragdx diagnose`
- `ragdx plan`
- `ragdx optimize`
- `ragdx experiment`: end-to-end run (corpus → Bayesian RAG-config search → DSPy A/B → JSON bundle)
- `ragdx save`
- `ragdx compare`
- `ragdx runs`
- `ragdx sessions`
- `ragdx monitor-session`
- `ragdx normalize-tools`
- `ragdx export-report`
- `ragdx attach-feedback`
- `ragdx feedback-summary`
- `ragdx explain-plan`
- `ragdx show-runner-templates`
- `ragdx dashboard`
Dashboard entry points:
- `ragdx dashboard`
- `ragdx-dashboard`
- `simulate`: validate planning and session orchestration without external runners
- `prepare_only`: emit configs and session artifacts without executing trials
- `execute`: launch external trial runners and ingest their output
Recommended starting point:
- use `simulate` first
- move to `prepare_only` once the plan shape looks correct
- use `execute` only after runner commands and runtime metadata are wired up
LLM features require the openai extra and an API key.
pip install -e ".[openai]"
export OPENAI_API_KEY=your_key
export RAGDX_LLM_MODEL=gpt-4o-mini # or any other OpenAI-compatible model id
Examples:
ragdx diagnose examples/demo_evaluation.json --use-llm
ragdx diagnose examples/demo_evaluation.json --use-both
ragdx plan examples/demo_evaluation.json --use-llm-planner --human-readable
For execute mode, configure runner commands through environment variables.
Supported runner variables:
- `RAGDX_DSPY_RUNNER_CMD`
- `RAGDX_AUTORAG_RUNNER_CMD`
- `RAGDX_LANGCHAIN_RUNNER_CMD`
- `RAGDX_LLAMAINDEX_RUNNER_CMD`
Runner templates can use:
- `{config}`
- `{output}`
- `{workdir}`
- `{trial_id}`
- `{session_id}`
- `{tool}`
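To show what the placeholders amount to, the sketch below fills a template and turns it into an argument list. It assumes `str.format`-style substitution, which matches the brace syntax but is not confirmed to be how ragdx renders runner commands. `render_runner_command` and the paths are hypothetical.

```python
import shlex
from typing import List


def render_runner_command(template: str, **values: str) -> List[str]:
    """Fill the placeholders and split the result into an argv list."""
    # Quote each value so paths containing spaces survive the later split.
    quoted = {key: shlex.quote(value) for key, value in values.items()}
    return shlex.split(template.format(**quoted))


template = "python examples/run_langchain_trial.py --config {config} --output {output}"
argv = render_runner_command(
    template,
    config="work dir/trial_0.json",      # hypothetical paths
    output="work dir/trial_0.out.json",
    workdir="work dir",
    trial_id="trial_0",
    session_id="demo",
    tool="langchain",
)
print(argv)
```

Placeholders that a template does not mention are simply ignored, so one set of values can serve every runner variable.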
Example for LangChain:
export RAGDX_LANGCHAIN_RUNNER_CMD='python examples/run_langchain_trial.py --config {config} --output {output}'
ragdx optimize examples/demo_evaluation_langchain.json --strategy bayesian --budget 6 --mode execute
Example evaluation metadata for a runtime-backed run:
{
"metadata": {
"runtime_framework": "langchain",
"dataset_path": "examples/demo_dataset.jsonl",
"pipeline_module": "examples.langchain_pipeline:create_pipeline"
}
}
By default, ragdx stores state in local hidden folders:
- `.ragdx/runs`
- `.ragdx/optimization/sessions`
- `.ragdx/feedback`
- `.ragdx/causal/priors.json`
This makes local experimentation simple, but it is not a shared metadata service.
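The `pipeline_module` value in the runtime metadata example above uses a module:attribute form. One common way to resolve such a reference is `importlib`, sketched below. Whether ragdx resolves it this way is an assumption, and `load_object` is a hypothetical helper.

```python
import importlib
from typing import Any


def load_object(reference: str) -> Any:
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, sep, attribute = reference.partition(":")
    if not sep or not module_name or not attribute:
        raise ValueError(f"expected 'module:attribute', got {reference!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Demonstrated with a standard-library target so the snippet runs anywhere;
# a real run would pass something like "examples.langchain_pipeline:create_pipeline".
dumps = load_object("json:dumps")
print(dumps({"ok": True}))
```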
from ragdx.core.diagnosis import RAGDiagnosisEngine
from ragdx.optim.planner import OptimizationPlanner
from ragdx.schemas.models import EvaluationResult
result = EvaluationResult(
retrieval={"context_recall": 0.72, "context_precision": 0.68},
generation={"faithfulness": 0.81, "response_relevancy": 0.79},
e2e={"answer_correctness": 0.74, "citation_accuracy": 0.77},
)
report = RAGDiagnosisEngine().diagnose(result)
plan = OptimizationPlanner().build_plan(report, result=result, strategy="bayesian", budget=8)
print(report.summary)
print(plan.objective_metric)
The detailed documentation lives under docs:
- Overview
- Architecture
- Data Models
- Workflows
- CLI and Dashboard
- Configuration
- Diagnosis and Optimization
- Runtime Integrations
- Extension Guide
- Examples
- Limitations and Roadmap
Suggested reading order for new users:
src/ragdx/core evaluation, normalization, comparison, diagnosis
src/ragdx/engines rule-based and LLM diagnosis, evaluator adapters
src/ragdx/optim planner, executor, BO adapter, runtime adapters
src/ragdx/schemas Pydantic models
src/ragdx/storage runs, sessions, feedback, reports
src/ragdx/ui Streamlit dashboard
src/ragdx/utils reporting and plan explanation helpers
examples/ example evaluations, pipelines, and trial runners
tests/ test suite
docs/ detailed markdown documentation
Current boundaries to be aware of:
- the default store is local and file-based
- `execute` mode still depends on your runtime environment and runner scripts
- diagnosis quality depends on evaluator quality and audit quality
- LLM reasoning is a structured aid, not ground truth
- heavy Bayesian optimization backends are optional and depend on extra packages
PYTHONPATH=src pytest -q