1.0.0 -- 2026-04-20
First public release. All core components validated end-to-end on 15 years of
market data (2011-2026) and 1002 passing tests.
Added
- Walk-forward benchmark with two-cycle retrain (126 days) + rebalance (15
days), volatility targeting at 10% annualised, transaction costs (5bps
slippage + 10bps commission), and 16-metric portfolio vs SPY report.
10-year OOS champion: Sharpe=0.766, CAGR=13.68%, MaxDD=-21.94%, Beta=0.566,
Alpha=+2.57%. - CPCV-OOS parameter tuning with Deflated Sharpe Ratio (Bailey & Lopez de
Prado 2014). 982 configurations tested across three tiers; champion config
committed inrun_benchmark.py. - PatchTST forecaster (NeuralForecast) with multi-quantile loss (5
quantiles), CDF interpolation for continuousprob_up, CDF rearrangement
for monotonic quantiles, and NaN/Inf guards in the predict path. - LangGraph 4-agent debate (Technical Analyst, Fundamentalist, Bear,
Portfolio Manager) with Pydantic-structured outputs and streaming per-node
callbacks. Gemini (default, free tier) or Anthropic Claude via
LLM_PROVIDER=anthropic. - Financial RAG with sentence-transformers embeddings (
all-MiniLM-L6-v2)
and ChromaDB vector store; P95 retrieval latency = 101ms on 172 articles. - HRP allocator with Ward linkage, Ledoit-Wolf covariance shrinkage,
sum-preserving confidence tilt, and waterfilling weight optimiser. - DecisionEngine with three-tier weight model (BUY=HRP, HOLD=HRP*conf,
SELL=0); cash implicit; fallback topredictions.parquetif agents fail. - Streamlit dashboard with four tabs: Benchmark, Performance, War Room
(live + replay modes), Microstructure (fan chart). - Data pipeline with thread-safe
yf.Ticker().history()ingestion for
52 US large caps + SPY, 15 years OHLCV (~199k rows), news ingestion with
Google News RSS fallback. .claude/agents/with 5 development subagents (architect,
quant-reviewer, security-data, test-writer, docs-writer) encoding project
rules and historical incidents.- Docker Compose for PostgreSQL + ChromaDB.
- CI/CD via GitHub Actions (
ci.yml+lint.yml). - Methodology notebook (
notebooks/methodology_and_results.ipynb)
executing end-to-end in 6.6s with 9 sections covering problem framing,
data, CPCV-OOS, DSR, HRP, walk-forward, regime analysis, the max_weight
lesson, and limitations.
Fixed
- Thread-safety bug in
yf.download()(pre-session 37): 22 of 52 tickers
were receiving data from alphabetically-adjacent tickers due to a yfinance
thread-safety defect. Migrated toyf.Ticker().history(); 15 years of data
re-ingested. - Look-ahead bias in early Sharpe estimates: the infamous Sharpe ~2.7 from
sessions 1-35 was an artefact of look-ahead in rolling windows and fill
strategies. Corrected to the current Sharpe=0.766 with full CPCV-OOS and
out-of-sample walk-forward. - Geometric
rfconversion acrossbenchmark_metrics,cpcv,
cpcv_oos, andwalk_forward(was arithmetic, biased Sharpe upward). - Transaction costs silently dropping from
port_retin walk-forward
(session 36). prob_updiscretisation: replaced step function with CDF interpolation
across 5 quantiles.- Fan chart sort order: was alphabetical, now sorted by quantile level.
- Volatility targeting applied ex-ante (pre-allocation) instead of ex-post
(post-allocation) to avoid leverage lag. - HRP
tiltsum-preserving: previously naive multiplication could drift
the sum of weights away from 1. walk_forwardinitial state: portfolio starts 100% cash (institutional
convention) rather than fully invested.- Bankruptcy safeguard in walk-forward (guards against NaV<=0 chains).
- Dashboard (
app.py:1396):decision.get('weight')fallback to
suggested_weightfor live debate mode (weight key only exists after HRP
merge in replay mode). - Streamlit deprecations: 17
use_container_width=Truereplaced with
width='stretch'(deadline 2025-12-31).
Known limitations
- Backtest-production gap: the production
make decidepipeline uses the
LangGraph debate, which is not part of the walk-forward validation (cost
of ~26k LLM calls per rebalance is infeasible). See
docs/design_gap_backtest_vs_production.md. - Stochastic debate:
make decideis non-deterministic (temperature=0.2
analysts + 0.1 PM).decisions.jsonis a single-run snapshot; borderline
tickers flip between BUY/HOLD in reruns. TheMIN_CONFIDENCE_FOR_ACTION=0.3
gate is the deciding boundary. - RAG coverage depends on news availability: with 172 articles across 52
tickers, grounding coverage is ~47% on tickers with real debate. Meta and
Tesla sparse in Google News RSS; citations will vary with the backfill.
Methodological stance
- The project favours honest reporting over marketing numbers. MaxDD,
Beta, Alpha, Tracking Error are published alongside Sharpe; the
backtest-production gap is documented in the README; the session 40
regression (max_weight=0.10 reverted at Sharpe=0.462) is preserved as
evidence of anti-overfitting discipline. max_weightis treated as a risk constraint (min(6%, 2/N)) not an
optimisation parameter; fine-tuning via CPCV-OOS is valid only for timing
parameters (rebalance_every,target_vol).