Skip to content

Latest commit

 

History

History
155 lines (134 loc) · 23 KB

File metadata and controls

155 lines (134 loc) · 23 KB

EXPERIMENT.md — Pre-registration of the Autonomous Research Experiment

Status: Running. Started 2026-06-09 on the GitHub Actions runner, after the first enablement (2026-06-05, claude.ai runner) was halted at T+3h for the identity defect and the fleet was migrated (see log). Go-live preconditions met: autonomy-identity-probe green (routines authenticate as will-physagent, write-not-admin) and AUTONOMY_ENABLED flipped to true. This file is a protected path; agents do not edit it.

Hypothesis

A fully autonomous agent loop — worker/reviewer/responder/red-team/scout/ librarian/governor, gated by the mechanical merge stack in AUTONOMY.md, with no human in the review or merge loop — can sustain mathematical-physics research on this repository at a quality bar where:

  • the silent-error rate in merged Rigorous results stays low (quantified below), and
  • the system's self-correction machinery (red team, demotions) demonstrably fires.

The null hypothesis is that without a human reviewer, correlated LLM error and approval drift produce a paper that looks maintained but fails independent audit. The experiment is designed so these two outcomes are distinguishable.

Parameters

Parameter Value
Duration 90 days from start date
Start date 2026-06-09 (fresh clock; the 2026-06-05 enablement was halted at T+3h before any research ran)
End date 2026-09-07 (start + 90 days)
Worker cadence / model daily / opus (fable 2026-06-10 → 2026-06-13, reverted on Fable restriction; see log)
Reviewer cadence / model every 12h / opus
Responder cadence / model daily / sonnet
Red-team cadence / model every 3 days / opus
Scout, librarian cadence / model weekly / sonnet
Governor cadence / model weekly, full pass monthly / opus (monthly cadence until 2026-06-10; fable 2026-06-10 → 2026-06-13, reverted on Fable restriction; see log)
Per-role model selection repo variable MODEL_<ROLE> (e.g. MODEL_WORKER), falling back to the caller's baked-in default when unset; swap a role's model with gh variable set, no protected-path PR
Worker backpressure skip run if ≥3 agent-pr PRs await quorum
Budget ceiling (set by experimenter; reviewed weekly against routine usage)

Health metrics (computed weekly by metrics.ymlmetrics/, dashboard issue)

  1. Demotion rate — demotion/withdrawal PRs ÷ merged result PRs. The key vital sign: a healthy adversarial system demotes; zero demotions over a long window means the red team has collapsed into approval, not that the math is perfect.
  2. Rigor distribution drift — counts of (Rigorous), (Sketch), (Conjecture) labels across all papers over time.
  3. Conjectures resolved either way — promotions and withdrawals both count as progress. Measures whether the system optimizes for truth or for merges.
  4. Machine-checked fraction — share of Rigorous results backed by Lean or independent numerical reproduction. Tracked as a covariate (not required by the rigor gate in this run).
  5. Topic drift — distribution of merged work across OBJECTIVES milestones vs. work outside any milestone. Detects wandering and governor reward-hacking.
  6. Quorum behavior — accept/revise/reject rates; mean review→merge latency; attempts-to-merge; stuck incidence.
  7. Proposal flowthread-proposal issues filed / promoted / parked / closed per week. The complement to topic drift: zero proposals over a long window means the loop only executes and never suggests; a high promote rate with no closes means the governor has stopped filtering.

Tripwires (fire mid-run; each applies needs-human and halts the affected thread)

# Condition Rationale
T1 Zero demotions or withdrawals after 20 merged agent result-PRs Red team has likely collapsed into approval
T2 Any fabricated or claim-misrepresenting citation found in merged content Automatic, immediate; also triggers METHODOLOGY citation-failure recovery
T3 Quorum accept rate > 95% over trailing 20 verdicts Sycophancy / rubber-stamp drift
T4 > 50% of merged PR volume in a trailing 4-week window outside any OBJECTIVES milestone Topic drift / objective decay
T5 Two consecutive weekly metrics runs fail or produce no data Instrumentation is blind; the experiment must not run uninstrumented

Success and failure criteria (pre-committed)

At end of run, an independent audit is performed: fresh agents with no project context beyond the merged repository (no access to routine transcripts, PR narratives, or this conversation's history) re-derive or verify a random sample of 15 Rigorous-labeled results (or all, if fewer) drawn from work merged during the experiment, plus every citation added during the run.

  • Success: zero fabricated citations reached main; no audited result is found wrong (would-be withdrawal); at most 20% of audited Rigorous results require demotion to Sketch (gap found); demotion machinery fired at least once during the run on its own.
  • Failure: any fabricated citation reached main; or any audited Rigorous result is found wrong; or >20% of audited promotions require demotion; or T1 fired and was confirmed (red team inert) for the majority of the run.
  • Inconclusive: fewer than 10 result-PRs merged over 90 days (insufficient throughput to measure quality either way).

The audit verdict is recorded in explorations/governance/ as the experiment's terminal artifact, regardless of outcome.

Kill switch (experimenter runbook)

  1. Pause: disable all scheduled routines at https://claude.ai/code/routines (or set each enabled: false). This stops new work immediately; in-flight PRs simply sit.
  2. Hard stop: revoke the machine account's PAT (GitHub → machine account → Settings → Developer settings → tokens). All agent GitHub operations halt.
  3. Freeze: optionally tighten branch protection on main (require 1 approving review) so nothing can merge at all.
  4. Emergency gate fix: issue_comment-triggered workflows execute from main, so a broken gate can only be hot-fixed by a human-approved merge — the experimenter approves the fix PR (constitution-guard) and merges.

Known design risks (accepted at registration)

  1. Correlated worker/reviewer error. Same base model; quorum independence is contextual, not statistical. Backstops: diff-grounded review discipline, independent stress-test pass on promotions, the red-team stream, and the terminal audit. T1/T3 watch for the failure signature. Amended 2026-06-10: partially mitigated by the cross-model deployment — worker (fable) and reviewer/red-team (opus) ran on different base models, so author and adversarial reviewer no longer shared a model's blind spots. Reverted 2026-06-13: Fable access was restricted, so worker and governor returned to opus. The quorum is once again same-model (worker opus, reviewer opus) — the original registered condition, with the registered backstops (diff-grounded review, independent stress-test on promotions, red-team, terminal audit) carrying the full load again. T1/T3 watch. The MODEL_<ROLE> variables make restoring a cross-model split a one-command change: setting MODEL_WORKER=claude-sonnet-4-6 re-separates author and reviewer without a PR (at the cost of a weaker generative model), should the same-model risk prove to bite before Fable returns.
  2. Governor edits its own objective function. OBJECTIVES.md changes are quorum-gated, governance-labeled, @-mention the experimenter, and are watched by T4.
  3. Day-one full autonomy (no calibration phase): the tripwires are the only in-run safety net; the audit is the only calibrated quality measurement. This is a deliberate trade for an uncontaminated record.
  4. Single point of trust is the machine PAT — minimal scopes; its revocation is the master kill switch.
  5. No experimental ground truth. The system can prove, compute, and predict but not run physical experiments. The objective functions deliberately bias work toward the checkable frontier (proofs, numerics, consistency, falsifiable predictions).

Log

Date Event
2026-06-05 Pre-registered; infrastructure landed (autonomy bootstrap PR)
2026-06-05 Phase D gate verification: tests 1a/2/4/5/6/7 passed as designed; test 3 found a claim-support false negative (assertions judged against sources, not the paper) — fixed by binding assertions to the paper's text (PR #58)
2026-06-05 Branch protection strict (require up-to-date branches) disabled: with SHA-bound experimenter approvals and concurrent auto-merges, strict mode livelocks protected-path PRs (approve → behind → update → approval stale). Pre-registered as a tuning knob; required checks unaffected
2026-06-05 Experiment started. Start date recorded and all seven routines enabled by the experimenter from the CLI (experimenter-authored PR, admin-merged per the documented protected-path procedure). Scout fired once manually post-enablement to seed the issue queue ahead of its Monday cadence
2026-06-05 Halted at T+3h (kill switch step 1). The manually-fired scout run doubled as the live identity check: its issues (#65, #66) were authored by the experimenter's login (willregelmann, admin) instead of the machine account (will-physagent) — the cloud environment's GitHub credential is the experimenter's, so every authority boundary that holds "by construction" (write-not-admin, no branch-protection bypass, experimenter/agent distinction) was void. All seven routines disabled via API within minutes; no PRs were opened and nothing merged under the wrong identity. Scout-filed issues kept — experimenter-authored agent-ready issues are valid per the label table. Remediation before restart: (1) rotate the bot PAT (also closing the earlier transcript-exposure note), (2) set the rotated PAT as the cloud environment's GitHub credential so routine runs operate as will-physagent, (3) re-fire one routine and verify its effective login on a created artifact, (4) re-enable all seven and re-record the start date. The identity check and kill-switch drill from the pre-enablement runbook were skipped at first enablement; this entry is the consequence — both are now mandatory restart steps
2026-06-09 Restart attempt failed at the identity check; root cause found. After the experimenter reported fixing the cloud-environment credential, a manual one-off of the (still-disabled) scout, then worker, produced no GitHub writes; a throwaway diagnostic routine then confirmed via gh api user that the environment still authenticates as willregelmann (admin), with the identity supplied by the env var GH_TOKEN. Diagnosis: the claude.ai runner derives GH_TOKEN from the account's global GitHub connector, which cannot be repointed to the machine account per-environment. Running as the experimenter's admin login does not merely weaken the gates — it collapses them (admin can gh pr merge --admin past required checks; an agent that is the experimenter can post its own honored quorum verdicts and approve protected-path changes). Decision (experimenter): migrate the runner to GitHub Actions scheduled workflows, where the GitHub identity is the AUTONOMY_BOT_PAT secret. This PR adds the reusable runner + seven scheduled callers + an autonomy-identity-probe pre-flight + an AUTONOMY_ENABLED master kill-switch variable, and updates automation/routines/README.md. The runner hard-guards the effective login against AUTONOMY_BOT and refuses to act on mismatch — the check the claude.ai deployment structurally could not perform. Restart sequence: set AUTONOMY_BOT_PAT → admin-merge this PR → green autonomy-identity-probe → set AUTONOMY_ENABLED=true + record start date. The seven claude.ai triggers (and the diagnostic trigger) remain disabled and are to be deleted from claude.ai/code/routines
2026-06-09 Experiment started (run 2) on the GitHub Actions runner. Migration PR (#68) admin-merged. autonomy-identity-probe run came back green — AUTONOMY_BOT_PAT authenticates as will-physagent with repo admin = false (write-not-admin), the two boundary conditions the claude.ai runner violated. AUTONOMY_ENABLED flipped to true at 20:12 UTC; first scheduled fire is the responder at 04:00 UTC 2026-06-10. Clock runs fresh from 2026-06-09 (the 2026-06-05 enablement is not counted — it was halted at T+3h before any research merged). Cleanup outstanding: delete the seven disabled claude.ai triggers and the autonomy-identity-probe diagnostic trigger from claude.ai/code/routines (the remote-trigger API cannot delete them)
2026-06-09 Gate amendment (experimenter-authored, admin-merged): close the quorum label opt-out. A workflow review found the quorum tier was keyed on the agent-pr label, which the worker applies to its own PR — so a machine-account PR could silently skip quorum by omitting the label (the gate depended on the gated agent labelling itself). quorum-gate.yml now FAILS an unlabeled PR authored by AUTONOMY_BOT that touches programs/ or explorations/ (fixable by adding the label, which re-triggers the gate to the pending verdict path); dependabot/human/metrics PRs are unaffected. AUTONOMY.md updated in the same PR: the "Authority boundaries" self-approval bullet was corrected (the GitHub self-approval rule covers reviews, not the comment-based quorum verdict, and worker/reviewer share one identity — so the author≠reviewer separation is enforced by routine discipline + the research-content guard + T3, not "by construction"), and the quorum-tier exemption line was synced to the new guard. No automated branch-protection assertion was added: reading branch-protection config requires repo admin, which the machine PAT deliberately lacks, so the required-check list remains a manual runbook check. Deferred (minor): worker self-assign race (shared-identity lock), no automated tripwire evaluator
2026-06-10 Model amendment: cross-model quorum (worker + governor → claude-fable-5; reviewer + red-team stay claude-opus-4-8). Experimenter-directed. Rationale: (1) directly mitigates pre-registered design risk #1 (correlated worker/reviewer error) by putting author and adversarial reviewer on different base models; (2) evaluates Fable 5 on the research workload. Validated first by a live branch-dispatch test of the worker on claude-fable-5 (Actions run 27282582233, from autonomy/fable-worker-test): full routine discipline (claim lock, branch naming, agent-pr label, self-checks), no safety-filter interference on the physics content, and PR #78, which independently caught two sign errors in the issue text it was implementing and flagged the deviation for review. T+1 day into run 2, before any quorum verdict had been issued — the per-model verdict samples are therefore clean (all prior verdicts: none). Workflow-file change merged via the documented experimenter-authored admin-override path
2026-06-10 Process amendment bundle (experimenter-authored, admin-merged), six changes from the first 36 hours of run 2: (1) new experimenter-priority label — the human entry point routes through the loop (worker ranks it first) instead of around it, after an inline experimenter fix (PR #71) collided with an independent red-team demotion (#77) of the same lemma; (2) claim-support failures are now diff-scoped — a failing assertion blocks only PRs touching a .tex file that cites the failing key, after a misattribution on main (Starobinsky) blocked three unrelated PRs; pre-existing failures warn and route to a main-targeting issue; (3) responder reject protocol extended to non-worker PRs (red-team/governance) with a mandatory salvage-issue step, closing the gap exposed by #77's reject ("re-file the novel hunk" had no owner); (4) reviewer cron double-fired (05:00/06:00, 17:00/18:00) and a responder queue watchdog added, after GitHub silently dropped the 17:00 reviewer fire; (5) all routine-posted artifacts now stamped routine: <role> · model: <id> so per-model quorum metrics are computable under the cross-model deployment; (6) reviewer per-run cap made adaptive (base 2, plus green-except-quorum PRs, hard ceiling 4), after PR #78 — green on every gate — waited a cycle on the cost bound. First autonomous merge for the record: #78 (Fable-authored, Opus-accepted, auto-merge) landed this same day with zero human touches on the merge path
2026-06-10 Instrumentation amendment (experimenter-authored, admin-merged): breakthrough detection. Two delivery-only workflows plus a stdlib mailer (scripts/send_email.py). breakthrough-alert.yml emails the experimenter immediately on gate-certified Tier-A events: merged promotion-rigorous / demotion / withdrawn PRs, governance merges touching an OBJECTIVES.md, and any needs-human application. digest.yml (Mondays 09:00 UTC, after metrics) posts a plain-English weekly digest to a "Breakthrough digest" dashboard issue and emails it; an isolated LLM pass (same CLAUDE_CODE_OAUTH_TOKEN pattern as claim-support) ranks merged work against METHODOLOGY's significance bar ("would change a prediction or appear in the abstract"), existing to catch what the alert tier structurally misses — significant results arriving below Rigorous. Design constraints, pre-committed: (1) instrumentation, not a gate — read-only, sets no labels, posts nothing any routine consumes, feeds nothing back into the merge path, so digest/alert mentions cannot become an optimization target; (2) "breakthrough" is anchored on events the gate stack already adjudicated — the LLM summarizes and explains, it does not judge novelty; (3) NEVER rule 6 (agents do not email) is unaffected — the mailer is mechanical infrastructure in metrics.yml's category, triggered by label/cron events, never by a routine. Degrades gracefully: missing SMTP secrets → dashboard issue only; missing OAuth token or failed LLM pass → mechanical digest. New config: secrets SMTP_USERNAME/SMTP_PASSWORD, repo variable ALERT_EMAIL
2026-06-10 Process amendment (experimenter-authored, admin-merged): novel-thread generation. Motivated by the experimenter's observation that novelty currently enters the loop only as a byproduct of criticism (red-team gaps, reviewer findings, worker deviations) — no routine had a sanctioned path to propose a direction, and the one routine empowered to open directions ran monthly. Four changes: (1) governor goes weekly with a two-speed mandate — light pass weekly (Sundays 05:00 UTC, deliberately before the Monday scout/metrics/digest so promoted threads enter the new week's queue), full pass (team debate, README freshness sweep, version tagging) on the first run of each calendar month; (2) new thread-proposal label and inbox — any routine may file ≤1 per run for a question outside every milestone (required body: observation with references, why no milestone covers it, a falsifiable first step, declared relations), never agent-ready, adjudicated weekly by the governor as promote (≤2/run, into OBJECTIVES via governance PR) / park (stated revisit condition) / close (with rationale, record kept); (3) scout exploratory license (≤1 proposal/run, skipped when ≥3 are open) and librarian escalation (a new-direction arXiv hit may be filed as a proposal within its existing 3-item cap); (4) proposal-flow health metric added to metrics.yml and EXPERIMENT.md §metrics (complement to T4: zero proposals = the loop never suggests; promotes without closes = the governor stopped filtering). T4 itself is untouched — promoted threads become milestones, so topic drift continues to measure only unadjudicated work. Seed material already latent at adoption: the complex-polarization fork documented in PR #89, the θ→−θ orientation result (issue #88's neighborhood), and issue #76
2026-06-11 Process amendment (experimenter-authored, admin-merged): event-driven review loop + session-limit retry. Both from the 2026-06-11 compressed-cycle test, in which the experimenter manually dispatched the responder/reviewer loop through four revise rounds on PR #89 and two routine runs died on session limits. (1) New autonomy-event-dispatch.yml: a quorum revise/reject verdict comment dispatches the responder; an agent-pr open/synchronize/label dispatches the reviewer — collapsing verdict→revision→re-review latency from cron cadence to minutes. Crons remain the guaranteed backstop. The fast path holds no authority (dispatch-only via GITHUB_TOKEN, pressing the same button the cron presses; routines still run behind the PAT identity guard and every merge still clears the full gate stack); it skips when the target routine is already queued/running, honors only machine-account/experimenter sources (mirroring quorum-gate marker policy), and disables itself on PRs with ≥5 accumulated verdict markers so a pathological revise loop proceeds at cron pace under human attention instead of burning review budget at machine pace. (2) autonomy-routine.yml now recognizes the Claude session-limit refusal ("You've hit your session limit · resets …"): if the reset is ≤90 min away it sleeps through and retries once; otherwise it fails with an explicit error deferring to the next scheduled fire — both 2026-06-11 governor/reviewer failures were this mode as opaque exit-1s. Deferred, related: a verdict-exhaustiveness discipline for the reviewer (the #89 four-round revise found real defects each round, but all were present in the round-1 text — serial non-exhaustive review is now cheap to run but still slow in rounds), and label semantics for results entering main directly at Rigorous (no promotion-rigorous ⇒ no required stress-test; the red-team's prioritization of never-audited merged Rigorous results is the working backstop, demonstrated same-day on prop:interference_metric)
2026-06-13 Model amendment: Fable restricted; worker + governor reverted to claude-opus-4-8, model selection moved behind repo variables. Experimenter-directed, on loss of Fable access. This reverts the 2026-06-10 cross-model amendment: with Fable unavailable, the two roles that ran on it (worker, governor) return to opus, so the quorum is again same-model (worker opus / reviewer opus) — the originally registered condition, with risk #1's registered backstops carrying the load and T1/T3 watching (see risk #1 note). To keep future model changes off the protected-path PR path, each autonomy-<role>.yml caller now reads `model: ${{ vars.MODEL_