Production-ready Retrieval-Augmented Generation (RAG) system for figure skating knowledge.
WarmGPT converts unstructured skating discussions (Reddit, GoldenSkate, Wiki) into structured knowledge and delivers grounded answers using a retrieval-first pipeline.
Stack - FAISS (vector search) - SentenceTransformers (embeddings) - Intent routing (rule-based) - Modular LLM layer - FastAPI backend - Railway deployment
Frontend (Next.js / Vercel) consumes this API via HTTP.
User → Intent Router → FAISS Retrieval → Prompt Builder → LLM → Response
Design principles: - Retrieval-first (no blind LLM calls) - Deterministic search layer - Modular model abstraction - Clear separation of frontend and backend
- Embeddings stored in FAISS
- Top-k similarity search
- Metadata stored separately
- Greeting
- Social message
- Knowledge lookup Reduces unnecessary LLM calls.
- Called only after retrieval
- Provider abstracted and replaceable
The best bug-free way is to run from our website: Click here
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn backend.server:app --reloadTest:
curl -X POST http://localhost:8000/chat -H "Content-Type: application/json" -d '{"message":"hi","history":[]}'- Push to GitHub
- Connect repo to Railway
- Set environment variable: HF_API_TOKEN=your_token_here
- Start command: uvicorn backend.server:app --host 0.0.0.0 --port 8080
- Designed full RAG architecture
- Built knowledge distillation pipeline
- Implemented FAISS retrieval + intent routing
- Structured prompt builder + LLM abstraction
- Deployed production backend (Railway)
- Functions:
detect_transform_mode(message),classify_query_intent(...) - Purpose: allow user to control how the answer is generated, not just what is asked
Behavior:
- frontend injects
Mode: simplify / deeper / drills / diagnose - backend detects mode and routes execution accordingly
- distinguishes:
- transformation tasks (simplify, deeper)
- normal RAG queries (drills, diagnose)
Execution:
- runs before
answer_question(...)in/chat - branches into:
- simplify → rewrite path (no RAG)
- deeper → hybrid path (RAG + merge)
- others → unchanged RAG pipeline, just a new regular query
If triggered:
- simplify:
- rewrites last assistant answer only
- deeper:
- combines last answer + new retrieval, then merges
- drills / diagnose:
- mapped to
how_to/diagnosisintent
- mapped to
Effect:
- introduces a lightweight “agent-like” routing layer
- improves consistency (no unnecessary re-retrieval)
- enables controlled depth expansion without breaking RAG
- Function:
needs_clarification(query, history) - Purpose: decide whether the system has enough information to answer
Behavior:
- detects vague or underspecified queries
- checks for missing skill level in recommendation questions
- uses history to avoid unnecessary clarification
Execution:
- runs after intent classification
- only triggers when
intent == "default"
If triggered:
- stops pipeline and asks a follow-up question
- Function:
build_skater_state(query) - Purpose: extract structured user information
Includes:
- skill level (beginner / intermediate / advanced)
- jump signals (axel, double, etc.)
- body info (height, weight → categories)
- experience type (adult vs standard)
- goal (e.g., equipment recommendation)
Usage:
- used later in prompt to condition answers
- provides implicit context when user input is incomplete
- Function:
classify_query_intent(query, history) - Purpose: decide how to process the query
Current intents:
how_to→ actionable improvementcomparison→ compare optionsdiagnosis→ infer causes from symptomsexperience_lookup→ general explanationdefault→ fallback
Role:
- executed before clarification
- controls downstream behavior and response type
- Function:
build_answer_plan(query, intent, state, history, clarify) - Purpose: determine how the answer should be constructed
Behavior:
- assigns:
- mode (coaching / diagnosis / explanation / comparison / standard / clarification)
- depth (short / medium / detailed)
- structure (e.g., causes → what to try → drills)
- context usage flags (use_context, avoid_repetition)
Execution:
- runs after intent and clarification
- output is passed into
answer_question(...)
Role:
- controls response style independently of retrieval
- ensures different query types produce different answer structures
- Function:
choose_k(query, intent, state, history) - Purpose: adjust retrieval depth based on context
Behavior:
- increases k for short or vague queries
- decreases k when strong skill signals exist
- decreases k when prior context is present
- varies k by intent
Execution:
- runs before retrieval
- determines number of documents retrieved
- Location:
answer_question()
Behavior:
- if initial retrieval is weak:
- generate fallback query
- perform second retrieval
- if still weak:
- trigger clarification
Flow:
- retrieve → weak → retry → (success → answer) / (fail → clarify)
- Location:
/chatendpoint inserver.py
Behavior:
- after generating the first answer:
- evaluate answer quality (length, vagueness, retrieval strength)
- if answer is weak:
- trigger a second pass through full RAG pipeline
- if second pass improves retrieval:
- replace original answer
- otherwise:
- keep original answer
Flow:
- retrieve → answer → judge →
(strong → return) / (weak → retry → compare → return best)
- Location:
/chatinserver.py+ helpers inagent.py
Behavior:
- after final answer (post self-repair):
- decide if a follow-up is useful based on:
- intent (how_to / diagnosis / comparison)
- missing user context (e.g., skill level)
- weak or repaired retrieval
- decide if a follow-up is useful based on:
- if triggered:
- build a structured prompt (query + answer + intent + state + recent history)
- call LLM to generate one short, specific follow-up question
- append to the answer
- otherwise:
- return answer as-is
Constraints:
- at most one question
- short (≤ ~20 words)
- context-aware and skating-specific
- no generic phrasing
- does not modify original answer
Flow:
- retrieve → answer → repair →
followup_decision → (no → return) / (yes → LLM_generate → append → return)
-
Purpose: decide whether the system has enough information to generate a useful answer before running full RAG.
-
Main functions:
semantic_clarification_check()- thin semantic LLM controller
needs_clarification()- backend clarification gatekeeper
-
v2 improvements:
- moved from keyword-based clarification
"recommend" in query
- to semantic clarification understanding
thinking about upgrading my boots
- moved from keyword-based clarification
-
Trigger examples:
- missing skating level for equipment recommendations
- ambiguous short domain terms (
loop) - vague references (
which one?)
-
Usually does NOT clarify:
- technique/how-to questions
- factual questions
- short replies to previous clarification
-
Convergence logic:
- uses:
MAX_CLARIFICATIONSforce_answer
- prevents endless clarification loops
- forces answering after enough clarification
- uses:
-
Conversation-state awareness:
- detects when user is replying to a clarification question
- activates:
force_answer = True
-
Philosophy:
- clarify only when answer would become:
- misleading
- useless
- badly personalized
- prefer useful answers over excessive questioning
- clarify only when answer would become:
-
Purpose: build a lightweight answer-generation strategy before LLM generation.
-
Main functions:
build_answer_plan()
-
Responsibilities:
- determine:
- response structure
- coaching depth
- diagnosis framing
- practical-action emphasis
- tone softening
- determine:
-
v2 improvements:
- moved from static prompt behavior
- to dynamic answer planning
- answer structure now adapts to:
- intent
- secondary intents
- user state
- topic
-
Example plan signals:
"what_to_try""likely_causes""equipment_note""confidence_softening"
-
Philosophy:
- generation should not use:
- one-prompt-fits-all
- build an answer strategy first
- then let the LLM execute the plan
- generation should not use:
-
Purpose: build the complete immutable retrieval profile before retrieval.
-
Main functions:
build_retrieval_profile()maybe_merge_history()
-
Responsibilities:
- semantic query construction
- terminology normalization
- history-aware merging
- intent-aware enrichment
- topic-aware enrichment
-
v2 improvements:
- semantic query mutation now happens ONLY here
- removed:
- hidden fallback rewriting
- repair-time query rewriting
- scattered intent injection
-
Philosophy:
- retrieval semantics should freeze before retrieval
- avoid hidden query mutation later in pipeline
-
Purpose: evaluate retrieval quality before generation and recover weak retrieval mechanically.
-
Main functions:
evaluate_retrieval_quality()apply_fallback_if_needed()
-
Retrieval checks:
- top score
- score ambiguity
- document count
- retrieval confidence
-
v2 improvements:
- fallback separated completely from repair
- fallback now modifies:
- retrieval strategy only
- fallback does NOT modify:
- semantic query meaning
-
Example fallback actions:
- increase retrieval pool
- use alternate prebuilt query variants
- broaden retrieval depth
-
Philosophy:
- weak retrieval is a retrieval problem
- recover retrieval before generation
-
Purpose: evaluate generated answers and repair weak answers after generation.
-
Main functions:
evaluate_answer_quality()repair_answer()
-
Evaluation checks:
- vague language
- weak structure
- missing practical guidance
- unsupported reasoning
- missing semantic focus coverage from the original query
-
Focus term system:
- semantic focus terms are extracted upstream by the intent controller
- repair evaluation checks whether generated answers remain anchored to important query concepts
- examples:
- adult
- camel spin
- blade change
- outside edge
- Edea Chorus
- prevents:
- generic but fluent answers
- semantic drift
- incomplete comparisons
-
v2 improvements:
- removed:
- repair-time retrieval reruns
- repair-time query rewriting
- added:
- semantic focus-term coverage checks
- repair now modifies:
- generation behavior only
- removed:
-
Philosophy:
- repair should improve answer quality
- repair should preserve semantic alignment with the original user request
- repair should not secretly rerun retrieval
-
Purpose: build a stable retrieval query before retrieval execution.
-
Main functions:
build_retrieval_profile()maybe_merge_history()is_self_contained_query()is_incomplete_continuation()
-
Responsibilities:
- query normalization
- intent-aware enrichment
- topic-aware enrichment
- conversational context reconstruction
-
v2 improvements:
- moved from blind history concatenation
- to semantic merge gating
- semantic query mutation now happens ONLY here
-
Merge philosophy:
- merge only when current query depends on previous context
- avoid unrelated conversational contamination
-
Usually merges:
- short incomplete replies
Jacksononly on the right footthat edge
- context-dependent continuations
- short incomplete replies
-
Usually does NOT merge:
- self-contained descriptive questions
- clear topic pivots
- fully specified equipment/technique questions
-
Previous issue:
- semantically related but unrelated skating topics could merge
loop turnlutz scratching
- caused retrieval noise and ambiguous retrieval scores
- semantically related but unrelated skating topics could merge
-
Current approach:
- first checks:
- semantic completeness
- then checks:
- conversational dependency
- embeddings now act only as:
- soft merge fallback
- first checks:
-
Philosophy:
- retrieval should reconstruct conversational intent
- not blindly concatenate recent messages
- prioritize retrieval precision over excessive memory inheritance
-
Purpose: decide whether the conversation should continue after the main answer, and generate a targeted follow-up only when useful.
-
Main functions:
build_followup_decision()- backend continuation-state controller
build_followup_prompt()- strategy-aware follow-up generation prompt
-
v3 improvements:
- moved from:
- simple intent-based follow-up triggering
- to:
- state-aware continuation reasoning
- moved from:
-
Uses system state from:
- retrieval evaluation
- fallback recovery
- repair results
- clarification state
- retrieval merge state
- confidence estimation
-
Follow-up reasons:
retrieval_ambiguityrepair_recoverycontext_continuationmissing_user_stateprogression_coaching
-
Trigger examples:
- ambiguous skating symptoms
- repaired but still uncertain answers
- context-dependent continuation queries
- missing useful skating state
- medium-confidence diagnosis/coaching answers
-
Usually does NOT follow up:
- high-confidence resolved answers
- clarification-active conversations
- short factual answers
- fully resolved retrieval situations
-
Clarification interaction:
- clarification and follow-up are mutually exclusive
- prevents stacked questioning behavior
- avoids conversational instability
-
Conversation-state awareness:
- uses:
used_history_merge- retrieval confidence
- fallback traces
- repair traces
- detects unresolved conversational state
- uses:
-
Philosophy:
- follow-up should resolve remaining uncertainty
- continuation should be targeted, not generic
- avoid:
- engagement fluff
- repetitive questions
- unnecessary continuation pressure
-
Purpose: build retrieval behavior before RAG retrieval.
-
Main functions:
build_retrieval_strategy()
-
Responsibilities:
- determine:
- retrieval breadth
- exploration level
- retrieval precision
- retrieval
k
- determine:
-
v2 improvements:
- moved from:
- flat intent →
k
- flat intent →
- to:
- semantic retrieval strategy
- moved from:
-
Uses:
- primary intent
- secondary intents
- topic
- skater state
- prior context
-
Strategy signals:
primary_diagnosisdiagnosis_with_equipmentcomparison_with_equipment_precisionstrong_skill_signalspecific_topic
-
Philosophy:
- retrieval behavior should depend on:
- semantic uncertainty
- topic specificity
- mixed intent composition
- not only:
- one flat intent
- retrieval behavior should depend on:
- Supabase auth + persistent sessions
- Persistent skater profiles
- Profile-aware generation
- Semantic profile evolution
- Human-confirmed memory mutation
- Async backend persistence
Long-term skater identity.
Per-message inferred skating context.
Key decision: persistent identity and transient conversation context are separated.
User Message
↓
LLM Semantic Detection
↓
Strict Validation
↓
Frontend Confirmation
↓
Async Backend Persistence
"1fl" → 1F
"double sal" → 2S
"ax-el" → 1A
Key decision: persistent storage remains canonicalized.
Frontend:
- UI
- interaction
Backend:
- semantic authority
- validation
- persistence
- Added silent async recurring-topic memory for authenticated users
- Introduced lightweight LLM-based topic extraction with normalized skating-domain tags
- Built persistent semantic memory layer using Supabase (
user_topic_memory) - Separated long-term skating interests from temporary mechanics/symptoms
- Added soft recurring-topic prompt injection (
[RECENT USER FOCUS]) for lightweight personalization - Ensured all topic extraction + DB writes run asynchronously after response generation
- Kept semantic memory invisible to frontend and non-blocking to Q&A latency
- Cleaned architecture by consolidating active user state into
profilestable
- Discovered semantic drift caused by parallel agent interactions after introducing jump detection and profile-update agents
- Found that clarification follow-up replies were sometimes incorrectly treated as standalone intents
- Observed retrieval and answer planning drifting away from the original unresolved question during clarification flows
- Identified weak clarification-attachment logic and lack of conversational anchor preservation as core causes
- Added explicit clarification state tracking instead of heuristic question-based inference
- Added semantic clarification attachment resolution using LLM reasoning
- Introduced conversational anchor preservation before downstream retrieval/planning/state extraction
- Switched downstream routing to use resolved conversational query instead of raw latest user message
- Prevented jump/profile semantic agents from hijacking clarification follow-up turns into unrelated semantic topics
- Established lightweight orchestration guidance across parallel semantic agents to stabilize multi-agent conversational behavior
- Added state-aware sharpening reasoning: WarmGPT now evaluates whether logged skating hours realistically support blade dullness, explains possible mismatches (under-logging, bad sharpening, mounting issues), and softly encourages better session tracking for more accurate equipment awareness.