This repository is not code hosted on GitHub. It is an application running on GitHub: GitHub Actions is its compute, the label system is its state machine, issues and pull requests are its database, branch protection is its authorization layer, and the merged repository tree is its production state. The "users" of the application are seven autonomous agent routines and one human experimenter; its workload is a theoretical-physics research program.
This document is descriptive, not normative. The normative documents keep
their own vocabulary: AUTONOMY.md is the constitution
(policy), EXPERIMENT.md is the pre-registration
(spec + changelog + incident log), METHODOLOGY.md is
the schema for research content. Where this document and those disagree,
those win.
| Application concept | Implementation |
|---|---|
| Compute / runtime | GitHub Actions runners executing headless Claude Code sessions (autonomy-routine.yml) |
| Processes | Seven routines — durable roles (automation/routines/), ephemeral invocations (one Actions run each) |
| Scheduler | cron triggers per role (autonomy-<role>.yml), with deliberate redundancy (reviewer double-fire) |
| Event bus | GitHub events: a quorum verdict comment dispatches the responder; an agent-PR push dispatches the reviewer (autonomy-event-dispatch.yml). Cron remains the guaranteed backstop |
| State machine | The label system: agent-ready → claim (assignee lock) → agent-pr → per-SHA verdict → merged / stuck / needs-human. Transitions table in AUTONOMY.md |
| Database | Issues, PRs, and the repository tree. Durable, queryable (gh is the query language), transactional (a merge is a commit, in both senses) |
| Schema | METHODOLOGY's rigor system: every result typed Rigorous / Sketch / Conjecture; demotions are migrations; withdrawn records are tombstones (never deleted) |
| Authorization | Branch protection (required checks, no direct pushes) + machine-account PAT scopes (write, not admin; no workflow scope) + the four-tier merge-gate stack |
| Service account | The machine account (repo variable AUTONOMY_BOT), asserted at runtime by an identity guard that refuses to act under any other login |
| Policy engine | The gate stack: deterministic tier (tests, paper builds, citation existence) → semantic tier (LLM claim-support evaluator) → quorum tier (adversarial verdict markers, per-SHA) → constitutional tier (protected paths require experimenter approval) |
| Feature flag / kill switch | Repo variable AUTONOMY_ENABLED; one flip stops the fleet. Full kill-switch runbook in EXPERIMENT.md |
| Configuration | Repo variables (identity, switches) and secrets (PAT, model credentials, SMTP) |
| Observability | Metrics dashboard issue (weekly snapshots → metrics/), Breakthrough digest issue (weekly plain-English digest), Tier-A alert emails, Actions logs as traces |
| Release process | Workflow and constitution changes are deploys: experimenter-authored PRs, admin-merged, each recorded in the EXPERIMENT.md log |
| Incident log / postmortems | The same log — the 2026-06-05 identity halt, gate amendments, and instrumentation fixes are its entries |
| Integration test | The pre-registered terminal audit: fresh agents with no project context re-verify a sample of merged Rigorous results and every citation added during the run |
The deepest part of the mapping: the research content is the application state, and the methodology is its schema. A rigor label is a type annotation. A demotion PR is a schema-checked migration. The red team is a fuzzer over production state. The audit is an integration test run against production data by a clean client.
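The type-annotation analogy can be made concrete with a minimal Python sketch. The three rigor levels are the ones named above; the `Result` record, the `demote` and `withdraw` helpers, the numeric ordering of the levels, and the rule that a withdrawn record cannot be migrated are assumptions made for illustration, not the repository's actual code.

```python
from dataclasses import dataclass, replace
from enum import IntEnum


class Rigor(IntEnum):
    """Rigor levels named in the text; the numeric ordering is an assumption."""
    CONJECTURE = 1
    SKETCH = 2
    RIGOROUS = 3


@dataclass(frozen=True)
class Result:
    """Hypothetical record for one research result."""
    name: str
    rigor: Rigor
    withdrawn: bool = False  # tombstone flag: the record stays, marked withdrawn


def demote(result: Result, new_rigor: Rigor) -> Result:
    """A demotion as a checked migration: the level may only go down."""
    if result.withdrawn:
        raise ValueError(f"{result.name} is withdrawn and cannot be migrated")
    if new_rigor >= result.rigor:
        raise ValueError("a demotion must lower the rigor level")
    return replace(result, rigor=new_rigor)


def withdraw(result: Result) -> Result:
    """Withdrawal leaves a tombstone instead of deleting the record."""
    return replace(result, withdrawn=True)
```

Under these assumptions, `demote(Result("lemma-1", Rigor.RIGOROUS), Rigor.SKETCH)` returns a new record typed Sketch, while an attempt to "demote" to the same or a higher level raises, which is the sense in which the migration is schema-checked. The name `lemma-1` is a placeholder.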
The core request path, annotated with the GitHub primitive that implements each step:
- Issue filed (`agent-ready`): by the scout from an OBJECTIVES milestone, the governor, or the experimenter (`experimenter-priority` jumps the queue). Primitive: issue + label.
- Claim: the worker assigns the machine account before any work; a 7-day-stale assignment is reclaimable. Primitive: issue assignee as a lock.
- Branch + PR: `program/issue-N-description`; the PR body carries the four self-checks and rigor labels, and the PR is labeled `agent-pr`. Primitive: branch, PR, label.
- Deterministic tier: pytest (×2), pdflatex per paper, citation existence against Crossref/arXiv. Primitive: required status checks.
- Semantic tier: `claim-support`, an isolated LLM evaluator that checks that load-bearing citations support the claims attached to them, diff-scoped so pre-existing defects of `main` warn rather than block. Primitive: required check wrapping an LLM call.
- Quorum tier: the reviewer posts a machine-readable verdict marker bound to the head SHA (`accept`/`revise`/`reject`); a push voids it. Rigor promotions additionally require an independent stress-test marker. Primitive: comment markers + a gate that parses them.
- Revision loop: a `revise` verdict event-dispatches the responder; its push event-dispatches the reviewer. Round-trip latency is minutes, with a ≥5-verdict circuit breaker that drops a long loop back to cron pace. Primitive: `issue_comment`/`pull_request` events.
- Merge: GitHub auto-merge fires mechanically when all required checks pass. No agent and no human presses the button. Primitive: auto-merge + branch protection.
- Post-merge audit: the red team prioritizes never-audited Rigorous results; its product is demotion PRs, which re-enter this same pipeline. Primitive: the loop, applied to itself.
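The quorum tier and the revision-loop circuit breaker can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the gate's actual code: the HTML-comment marker syntax, the function names, and the reading of "≥5 verdicts" as a per-PR count are all hypothetical, since this document does not spell out the real marker format.

```python
import re
from typing import Optional

# Hypothetical marker format; the real marker syntax is not given in this document.
MARKER = re.compile(
    r"<!--\s*verdict:\s*(accept|revise|reject)\s+sha=([0-9a-f]{40})\s*-->"
)


def latest_verdict(comments: list[str], head_sha: str) -> Optional[str]:
    """Return the most recent verdict bound to head_sha, or None if none applies.

    Markers bound to any other SHA are ignored, which is how a push voids
    an earlier verdict: the head moves and the old marker no longer matches.
    """
    verdict = None
    for body in comments:  # assumed to be in chronological order
        for kind, sha in MARKER.findall(body):
            if sha == head_sha:
                verdict = kind
    return verdict


def quorum_passes(comments: list[str], head_sha: str) -> bool:
    """The gate passes only on an accept bound to the current head."""
    return latest_verdict(comments, head_sha) == "accept"


def use_fast_path(verdict_count: int, limit: int = 5) -> bool:
    """Circuit breaker: at `limit` or more verdicts, fall back to cron pace."""
    return verdict_count < limit
```

Binding the marker to the SHA means the gate never has to detect pushes explicitly: a new head simply has no matching marker, so the check fails until the reviewer posts a fresh verdict.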
Every agent operation authenticates as the machine account, whose PAT has
write access but not admin and deliberately lacks the workflow scope —
agents structurally cannot author changes to the gates that judge them. The
runner asserts the effective login at startup and refuses to act on mismatch
(the check whose absence caused the 2026-06-05 identity incident). The
experimenter holds the complementary powers: kill switch, budget,
constitutional approvals, and the admin override that is the only deploy path
for gate workflows. The event-dispatch fast path holds no authority at all —
it presses, via the default workflow token, the same dispatch button the
scheduler presses.
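The startup identity assertion amounts to a single comparison made before any other work. The sketch below assumes the caller has already obtained the effective login and the value of the `AUTONOMY_BOT` repo variable; the function and exception names are hypothetical, and the actual guard in the runner may differ.

```python
class IdentityMismatch(RuntimeError):
    """Raised when the session is not running as the machine account."""


def assert_identity(effective_login: str, expected_login: str) -> None:
    """Refuse to proceed unless the effective login is the expected machine account.

    An empty expected login is treated as a failure too, so a missing
    configuration value cannot silently disable the guard.
    """
    if not expected_login.strip():
        raise IdentityMismatch("expected login is not configured; refusing to act")
    # GitHub logins are case-insensitive, so compare in lowercase.
    if effective_login.strip().lower() != expected_login.strip().lower():
        raise IdentityMismatch(
            f"running as {effective_login!r}, expected {expected_login!r}; "
            "refusing to act"
        )
```

One way to supply `effective_login` in a workflow step is the output of `gh api user --jq .login`, passed alongside the `AUTONOMY_BOT` variable. Failing closed on an unset variable is a design choice of this sketch rather than something the document states.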
Observed failures and the mechanisms that absorb them — each of these has actually fired at least once:
- Dropped cron fires (GitHub silently skips scheduled runs): reviewer double-fire, a responder watchdog that re-dispatches a missed review cycle, and the event-driven fast path, which removes the dependency on cron for the hot loop.
- Model-budget exhaustion (session limits kill a routine mid-run): the runner detects the limit message, sleeps through resets ≤90 minutes away and retries once, otherwise fails loudly and defers to the next fire.
- Blind instrumentation (metrics that "succeed" while reporting nothing): tripwire T5 halts the experiment if weekly metrics fail or go empty twice running; the 2026-06-12 label-count fix is the case study in why.
- Reviewer-loop churn (serial non-exhaustive revise rounds): the ≥5-verdict circuit breaker bounds the fast path; verdict-exhaustiveness discipline is a logged open amendment.
- Runaway or compromised agents: the gate stack is independent of agent cooperation (an unlabeled machine-account research PR fails rather than skips quorum), protected paths require human approval, and `needs-human` is a one-way halt only the experimenter clears.
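Two of the absorbers above reduce to small decision rules, sketched here in Python. The 90-minute threshold, the single retry, and the "twice running" condition come from the text; the function names, the representation of a failed snapshot as `None` and an empty one as `{}`, and the treatment of an already-passed reset as a zero wait are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_WAIT = timedelta(minutes=90)  # threshold stated in the text


def plan_after_limit(
    now: datetime, reset_at: datetime, already_retried: bool
) -> Optional[timedelta]:
    """Return how long to sleep before the single retry, or None to fail loudly."""
    if already_retried:
        return None  # only one retry is allowed per run
    wait = max(reset_at - now, timedelta(0))  # a reset in the past means no wait
    if wait <= MAX_WAIT:
        return wait
    return None  # reset too far away: defer to the next scheduled fire


def t5_tripped(weekly_snapshots: list[Optional[dict]]) -> bool:
    """True when the two most recent weekly snapshots both failed or were empty."""
    recent = weekly_snapshots[-2:]
    return len(recent) == 2 and all(not snap for snap in recent)
```

Returning `None` rather than raising keeps the policy a pure function; the caller decides how to fail loudly, for example by exiting non-zero so the Actions run is marked failed.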
- Live queue: open `agent-pr` PRs and `agent-ready` issues
- Vital signs: Metrics dashboard; demotion rate is the one to watch (a healthy adversarial system demotes)
- Plain English: Breakthrough digest, weekly
- Traces: the Actions tab, where every agent invocation is a public log
- Changelog & incidents: the log table in `EXPERIMENT.md`