A RAG pipeline for querying Building Information Modeling (BIM) data — specifically IFC files. It started as an experiment to prove that standard chunked-text RAG is the wrong approach for spatial data, and turned into a full agentic system with a graph database at its core.
The short version: ask it "What HVAC equipment is on Level 2?" and it will go find the exact answer using Neo4j, cross-check it, and fall back to deterministic IFC AST traversal if anything looks wrong.
IFC files encode buildings as a strict spatial hierarchy: Project → Site → Building → Floor → Element. When you chunk that file and embed it into a vector database the way standard RAG does, you destroy that hierarchy. The LLM ends up retrieving chunks that mention "Level 2" somewhere in the text, not chunks that are actually from Level 2. In practice this means the model confidently describes elements from the wrong floor.
I call this spatial blindness — the retriever finds semantically similar text, not spatially correct elements.
The fix isn't to write better prompts. The fix is to use the right data structure for the job. A floor containment relationship in a building is a graph edge, not something you should infer from cosine similarity.
The pipeline is a LangGraph state machine with five nodes:
flowchart LR
Q([Query]) --> E[Extract Floor\n& Intent]
E -->|floor known| G[(Neo4j\nGraph Query)]
E -->|no floor| H[BM25 +\nVector Hybrid]
G --> Gen[Generate]
H --> Gen
Gen --> Ev{Evaluate\nSpatial Match?}
Ev -->|pass| Done([Answer + GUIDs\nhighlighted in 3D])
Ev -->|fail| AST[IFC AST\nTraversal]
AST --> Gen
style G fill:#1e3a5f,stroke:#3b82f6,color:#93c5fd
style AST fill:#3b1e1e,stroke:#ef4444,color:#fca5a5
style Done fill:#1a3a2a,stroke:#22c55e,color:#86efac
Node 1 — extract_spatial_constraints The LLM extracts the floor name and query type (inventory vs. specific element). This is a fast, cheap call — it just classifies the intent and sets routing flags.
Node 2a — graph_query (primary path)
When a floor is known, we go to Neo4j first. The IFC hierarchy is stored as a proper graph (Storey -[:CONTAINS]-> Element), so retrieving everything on Level 2 is a single Cypher query, not a semantic similarity search. For MEP/equipment queries it narrows to equipment-type nodes only. For inventory queries it returns everything.
Node 2b — retrieve_hybrid (no-floor fallback) When there's no spatial constraint, Neo4j doesn't help. So we run BM25 (lexical) + ChromaDB (vector) retrieval and fuse the results with Reciprocal Rank Fusion. This handles cross-floor questions like "how many walls does the building have total?"
Node 3 — generate Standard generation node. The context it receives is already spatially filtered before it gets here — we're not asking the LLM to filter spatial data, just to synthesize a readable answer from it.
Node 4 — evaluate An LLM-as-judge call that checks one specific thing: does the answer respect the floor constraint? Not quality, not completeness — just spatial correctness. If the answer mentions elements from a different floor, it fails.
Node 5 — spatial_ast_retrieval (self-healing fallback)
If evaluation fails, we throw out the retrieval results entirely and use IfcOpenShell to parse the IFC file directly. This traverses the actual AST, finds the exact IfcBuildingStorey node, and extracts only the elements directly contained in it. This is deterministic — it gives the same answer every time regardless of how the embeddings are set up. Once AST has run, we never loop again.
The system tracks which retrieval path produced the final answer via retrieval_source in the state:
| Source | Meaning |
|---|---|
graph |
Neo4j answered it — fastest, most precise |
dense |
Hybrid BM25 + vector — used when no floor constraint |
ast |
IFC AST traversal — self-healing fallback, always correct |
graph_unavailable |
Neo4j was down, fell back to hybrid then AST |
- Orchestration: LangGraph + LangChain
- Graph database: Neo4j 5 (Community Edition) — stores IFC hierarchy as a property graph
- Vector search: ChromaDB (local persistent) — nomic-embed-text via Ollama for embeddings
- Lexical search: Rank-BM25 — serialized index, loaded once at startup
- LLM: Two-tier on Groq free tier —
llama-3.1-8b-instant(128K ctx) for fast classification,llama-3.3-70b-versatile(128K ctx) for generation. Cerebras supported as a zero-rate-limit alternative. - Semantic cache: Redis — SHA-256 keyed on normalized query text, 1 hour TTL
- API: FastAPI with Server-Sent Events — streams node completions to the UI in real time
- BIM parsing: IfcOpenShell — used for both indexing and AST fallback
- Frontend: Next.js 14 (App Router)
Prerequisites: Docker, Python 3.11+, Node 18+, Ollama running locally, a free Groq API key.
# 1. Clone and install
git clone https://github.com/yourusername/bim-graph.git
cd bim-graph
pip install -e ".[dev]"
# 2. Pull the embedding model
ollama pull nomic-embed-text
# 3. Start infrastructure
docker compose up redis neo4j -d
# 4. Set your API key
cp .env.example .env
# edit .env and set GROQ_API_KEY=your_key_here
# 5. Drop your IFC files into data/
# Free sample files: https://github.com/buildingSMART/Sample-Test-Files
# 6. Index the IFC files (builds ChromaDB + BM25 index + loads Neo4j)
# Run from src/ directory
python -m indexer.spatial_indexer # ChromaDB + BM25 in one pass
python -m graph_db.loader # IFC hierarchy → Neo4j
# 7. Start the API
uvicorn api.main:app --reload --port 8000
# 8. Start the UI
cd ui && npm install && npm run devOpen http://localhost:3000.
30 queries across 4 IFC models, scored against a deterministic IfcOpenShell oracle (GUID-level P/R/F1).
| Metric | Value |
|---|---|
| Avg F1 | 0.816 |
| Avg Precision | 0.839 |
| Avg Recall | 0.828 |
| Graph hit rate | 86.7% |
| Self-heal rate | 13.3% |
| Avg latency | 7.6s |
Per category:
| Category | N | Avg F1 |
|---|---|---|
| cross_floor | 3 | 1.000 |
| architectural | 7 | 0.844 |
| adversarial | 6 | 0.833 |
| mep | 8 | 0.812 |
| inventory | 6 | 0.677 |
The remaining failures are concentrated in the inventory category (large element lists where the LLM truncates) and one cross-floor adversarial case where the answer phrasing doesn't match the floor-name keyword scorer. The MEP self-heal cases are PCERT files that route through AST instead of graph and still score well.
The benchmark scores all queries against the IFC oracle (GUID-level precision/recall/F1). LLM responses are cached to SQLite so reruns cost zero API tokens.
# default (Groq, 5s inter-query sleep to respect rate limits)
python -m benchmark.run_benchmark
# Cerebras — no rate limits, ~2s/query, recommended
python -m benchmark.run_benchmark --provider cerebras
# resume an interrupted run (checkpoint is saved after each query)
python -m benchmark.run_benchmark --provider cerebras
# force a fresh run ignoring the checkpoint
python -m benchmark.run_benchmark --provider cerebras --resetResults are written to data/benchmark_results.json. The oracle is deterministic — it uses IfcOpenShell to get the exact element GUIDs on each floor, then checks how many appear in the generated answer. The LLM cache (data/llm_cache.db) persists across runs; delete it to force fresh API calls.
pytest tests/ -v30 tests, all mocked — no running Neo4j, Redis, or Ollama required. The unit tests cover routing logic, graph query dispatch, GUID extraction, cache roundtrip, and context formatting. The test suite runs in under 5 seconds.
Negation queries don't work. "Show me everything that is NOT on Level 1" — the floor extractor pulls "Level 1" as the constraint and retrieves Level 1 elements. This would need a separate query intent classifier.
Multi-floor single queries are partially handled. "What's on floors 1 and 2?" extracts one floor, queries that, and misses the other. The benchmark includes this as an adversarial test case.
The evaluate node is LLM-judged. It's good but not perfect — occasionally passes answers that have minor floor confusion or fails correct answers that don't explicitly mention the floor. A GUID-based evaluator (checking retrieved GUIDs against the oracle) would be more reliable but requires the generate node to always output GUIDs, which makes the answers less readable.
Embeddings are cold-started. The Ollama nomic-embed-text model loads on first request. If you query immediately after startup there's a 3-5 second delay before the first real response.
src/
agent/
graph.py — LangGraph state machine and routing logic
nodes.py — all five node functions
state.py — TypedDict state schema
api/
main.py — FastAPI endpoints + SSE streaming
benchmark/
ifc_oracle.py — ground truth GUID extraction from IFC AST
run_benchmark.py — runs all 25 queries, scores with P/R/F1
query_set.json — 25 test queries across 5 categories
cache/
redis_cache.py — get/set with SHA-256 key + fakeredis fallback
config.py — pydantic-settings, all config from .env
graph_db/
loader.py — IFC → Neo4j ingestion (MERGE idempotent)
queries.py — Cypher query library
indexer/
chroma_indexer.py — spatial chunking → ChromaDB
bm25_index.py — BM25 index builder
spatial_indexer.py — IFC → spatial chunk extraction
parser/
ifc_parser.py — raw IfcOpenShell traversal
ui/ — Next.js frontend
tests/
unit/ — 17 mocked unit tests
The biggest design decision I'd revisit is the evaluate node. Right now it's an LLM call on every query, which adds ~800ms of latency and occasionally produces wrong verdicts. A better design would skip LLM evaluation for graph-sourced results entirely — if Neo4j says an element is on Level 2, it is on Level 2, you don't need an LLM to second-guess it. The evaluator should only run when the retrieval source is dense or after AST (where spatial proof is embedded in the context, not guaranteed).
The other thing: the AST fallback is currently triggered by evaluation failure, not by confidence. A more robust approach would be to run a lightweight confidence check on the graph result count — if Neo4j returns 0 elements, skip generation and go straight to AST rather than generating a "no elements found" answer and then evaluating it.
This came out of work on digital twins for large construction projects, where querying "what assets are on this floor" from a chunked IFC file was producing consistently wrong answers. The spatial hierarchy in IFC is explicit and machine-readable — it seemed wrong to throw it away and reconstruct it from semantic similarity. BIM-Graph is an attempt to keep that hierarchy intact through the entire query pipeline.