subconscious-systems/deep-swe

DeepSWE is a benchmark for measuring frontier coding agents on original, long-horizon software engineering tasks drawn from active open-source repositories. The benchmark includes 113 tasks across TypeScript, Go, Python, JavaScript, and Rust, with isolated environments and program-based verifiers.

Task format

DeepSWE tasks use the Harbor task format:

task.toml         Metadata: repository, base commit, language, prebuilt image, resource limits
instruction.md    The prompt the agent sees
environment/      Dockerfile that reproduces the prebuilt image (fallback if the image is unavailable)
tests/            Verifier: test.sh (entry point) + test.patch (test additions, applied at grading time)
solution/         Reference solution (held out from the agent; for human and AI reviewers)

The verifier exercises the behavior the prompt describes. It accepts any solution whose observable behavior is correct, regardless of internal symbol names or structure. The reference patch in solution/ is never used at grading time; it exists so reviewers can spot-check correctness offline.
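The layout listed above can be sanity-checked before a run. The following is a minimal sketch, not part of Harbor or Pier: the helper names (`missing_entries`, `make_demo_task`) are hypothetical, and it assumes only the file and directory names given in the listing.

```python
import tempfile
from pathlib import Path

# File and directory names come from the layout listing; the checker is illustrative.
REQUIRED_FILES = [
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "tests/test.sh",
    "tests/test.patch",
]
REQUIRED_DIRS = ["solution"]


def missing_entries(task_dir: Path) -> list[str]:
    """Return the expected files and directories that are absent from task_dir."""
    missing = [rel for rel in REQUIRED_FILES if not (task_dir / rel).is_file()]
    missing += [rel for rel in REQUIRED_DIRS if not (task_dir / rel).is_dir()]
    return missing


def make_demo_task(root: Path) -> Path:
    """Build a throwaway task directory with empty placeholder files (hypothetical helper)."""
    task = root / "demo-task"
    for rel in REQUIRED_FILES:
        path = task / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    for rel in REQUIRED_DIRS:
        (task / rel).mkdir(parents=True, exist_ok=True)
    return task


with tempfile.TemporaryDirectory() as tmp:
    task = make_demo_task(Path(tmp))
    complete = missing_entries(task)        # nothing missing yet
    (task / "tests" / "test.patch").unlink()
    incomplete = missing_entries(task)      # the removed patch is reported

print("complete task:", complete)
print("after removing test.patch:", incomplete)
```

Pointing `missing_entries` at a real directory under tasks/ would report which of the listed entries that task lacks.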

Quickstart

Use Pier to run the benchmark:

git clone https://github.com/datacurve-ai/deep-swe
cd deep-swe
cp .env.example .env   # fill in BASETEN_*
# >= 0.2.1 (environment.mounts support) is git-only; PyPI stops at 0.2.0
uv tool install "datacurve-pier @ git+https://github.com/datacurve-ai/pier"

# Dev smoke test (1 task, 30 min timeout, 600 step cap, 1 worker)
./mini-swe-agent/run_dev.sh

# Full 113-task eval (leaderboard-comparable; verification on; 4 workers)
./scripts/pier-run.sh run -c mini-swe-agent/full.yaml -y

# Full eval on the DGX Spark (8 workers; aarch64 — see note below)
./scripts/prebuild.sh   # one-time: task images + agent overlays, natively for arm64
./mini-swe-agent/run_spark.sh

# One-task end-to-end wiring check (prebuild → trial → verify telemetry/cost)
./mini-swe-agent/spark-smoke.sh

# Single task
pier run -c mini-swe-agent/dev.yaml --env-file .env -y -p tasks/<task-id>

Job configs and run wrappers live in mini-swe-agent/:

Config                            Wrapper                        What
dev.yaml                          run_dev.sh                     1-task smoke / iteration
full.yaml                         (none)                         full 113-task eval (laptop, 4 workers)
full-spark.yaml                   run_spark.sh                   full eval, DGX Spark (8 workers)
full-spark-optimistic.yaml        run_spark_optimistic.sh        22 tasks with >=50% avg pass rate, 1 attempt
dev-kimi.yaml                     run_dev_kimi.sh                dev against tim-kimi2.6
full-spark-kimi.yaml              run_spark_kimi.sh              spark eval against tim-kimi2.6
full-spark-optimistic-kimi.yaml   run_spark_optimistic_kimi.sh   optimistic subset against tim-kimi2.6

DEEP_SWE_ROOT (used by the pricing/ and scripts/ bind mounts) is derived automatically by scripts/pier-run.sh from the checkout location. Do not set it in .env: a stale value from another machine is stripped and overridden.

aarch64 / DGX Spark

The prebuilt task images referenced in task.toml ([environment].docker_image) are amd64-only. Rosetta is macOS-only, so on Linux arm64 the only x86 emulation is QEMU, which is too slow and flaky for the eval. Instead, mini-swe-agent/full-spark.yaml sets force_build: true so that Pier builds each task natively from environment/Dockerfile (the mars-base base image is multi-arch). Run ./scripts/prebuild.sh once before the eval to warm the Docker layer cache (set PREBUILD_JOBS=8 to parallelize); failures are summarized with per-task logs.

Two Dockerfiles carry arm64-specific fixes: cliffy-config-file-parsing (arch-specific Deno binary) and eicrud-keyset-pagination-cursor (MongoDB ships no arm64 server packages for Debian, so the Dockerfile uses the Ubuntu repo on arm64). The native arm64 environments are rebuilt locally and are not the leaderboard's prebuilt amd64 images. Tasks and verifiers are identical, but strict leaderboard comparability requires amd64.
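The platform reasoning in this section can be expressed as a small host check. This is a sketch under stated assumptions rather than anything Pier does itself: `needs_native_build` is a hypothetical helper, and it treats only Linux hosts reporting `aarch64` or `arm64` as needing native builds.

```python
import platform

# Machine strings that commonly denote 64-bit Arm hosts
# (Linux usually reports "aarch64", macOS reports "arm64").
ARM64_NAMES = {"aarch64", "arm64"}


def needs_native_build(system: str, machine: str) -> bool:
    """True when amd64-only images would have to run under QEMU on this host.

    Hypothetical helper: macOS is excluded because Rosetta is available there,
    and amd64 hosts can run the prebuilt images directly.
    """
    return system == "Linux" and machine.lower() in ARM64_NAMES


host_system = platform.system()
host_machine = platform.machine()
verdict = needs_native_build(host_system, host_machine)
print(f"{host_system}/{host_machine}: native build recommended = {verdict}")
```

On a host where this prints `True`, the force_build: true config and the one-time ./scripts/prebuild.sh step described here are the relevant path.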

Token pricing

All job configs bind-mount pricing/ and register pricing/model-registry.json (combined, one entry per model slug) via LITELLM_MODEL_REGISTRY_PATH at trial start, so pricing or model changes never require an image rebuild. Current rates:

Model                          Input (cache miss)   Input (cache hit)   Output
subconscious/tim-qwen3.6-27b   $0.30 / 1M           $0.15 / 1M          $3.00 / 1M
subconscious/tim-kimi2.6       $0.50 / 1M           $0.15 / 1M          $3.00 / 1M

Costs appear in jobs/<job-dir>/result.json as stats.cost_usd and per-trial agent_result.cost_usd when the run completes. Per-call token counts are also logged (llm_timing.jsonl), so any past run can be re-priced under a different curve without rerunning: python3 scripts/reprice.py jobs/<job-dir> pricing/<curve>.json.
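To make the re-pricing arithmetic concrete, here is a minimal sketch that applies the per-million-token rates from the table to per-call token counts. It is not the code in scripts/reprice.py: the record keys (`miss`, `hit`, `out`) and the sample counts are made up for illustration, and the real llm_timing.jsonl field names may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per 1M tokens, as listed in the pricing table."""
    input_miss: float
    input_hit: float
    output: float


RATES = {
    "subconscious/tim-qwen3.6-27b": Rates(0.30, 0.15, 3.00),
    "subconscious/tim-kimi2.6": Rates(0.50, 0.15, 3.00),
}


def call_cost_usd(model: str, miss: int, hit: int, out: int) -> float:
    """Cost of one LLM call: each token class is billed at its own rate."""
    r = RATES[model]
    return (miss * r.input_miss + hit * r.input_hit + out * r.output) / 1_000_000


# Hypothetical per-call token counts; the key names are assumptions.
calls = [
    {"miss": 12_000, "hit": 48_000, "out": 1_500},
    {"miss": 2_000, "hit": 60_000, "out": 900},
]

# Re-price the same calls under both models without rerunning anything.
totals = {
    model: sum(call_cost_usd(model, **call) for call in calls)
    for model in RATES
}
for model, total in totals.items():
    print(f"{model}: ${total:.4f}")
```

Because only token counts are needed, the same records can be priced under any rate table, which is the property the reprice step relies on.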

Dev vs full limits

Setting             Dev (mini-swe-agent/dev.yaml)      Full (mini-swe-agent/full.yaml)
Concurrent trials   1                                  4
Agent timeout       30 min (override_timeout_sec)      90 min (task.toml [agent] timeout_sec)
Step limit          600                                Unlimited until timeout (mini-swe default)
Tasks               Pinned in YAML (edit task_names)   All 113 under tasks/

Pause, resume, and recover

  1. Pause: Press Ctrl+C once in the terminal running pier run. Let Pier shut down gracefully before force-killing the process or closing the laptop.
  2. What is saved: jobs/<timestamp>/config.json, result.json, lock.json, and per-trial directories with result.json for finished trials.
  3. Resume (uses the frozen config.json saved in the job directory—not the current YAML on disk):
pier job resume -p jobs/<job-dir>

Or via the wrapper, which derives DEEP_SWE_ROOT automatically: ./scripts/pier-run.sh job resume -p jobs/<job-dir>

  • Completed trials are skipped; only remaining trials run.
  • By default, pier job resume removes trial dirs that ended with CancelledError so those tasks are retried.
  • Trial folders without result.json but with partial artifacts are cleaned up and restarted from scratch.
  • Resume does not continue a single agent mid-trajectory—only at trial granularity.
  • If resume fails with a lock.json mismatch, Pier detected a change in job inputs (for example DEEP_SWE_ROOT or other .env values) since the job was created; restore the same environment or start a new job directory.
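The trial-granularity rules above can be previewed by inspecting a job directory. The sketch below is an illustration, not Pier's implementation: `split_trials` is a hypothetical helper that classifies trial directories solely by whether result.json exists. It does not model the CancelledError case, since that would require knowing the result.json schema.

```python
import tempfile
from pathlib import Path


def split_trials(job_dir: Path) -> tuple[list[str], list[str]]:
    """Split trial directories into (finished, to_restart) by presence of result.json."""
    finished, to_restart = [], []
    # Job-level files such as config.json are skipped; only directories are trials.
    for trial in sorted(p for p in job_dir.iterdir() if p.is_dir()):
        if (trial / "result.json").is_file():
            finished.append(trial.name)
        else:
            to_restart.append(trial.name)
    return finished, to_restart


with tempfile.TemporaryDirectory() as tmp:
    job = Path(tmp) / "demo-job"
    (job / "trial-a").mkdir(parents=True)
    (job / "trial-b").mkdir()
    (job / "config.json").write_text("{}")                  # job-level file
    (job / "trial-a" / "result.json").write_text("{}")      # finished trial
    (job / "trial-b" / "agent.log").write_text("partial")   # hypothetical partial artifact
    finished, to_restart = split_trials(job)

print("skipped on resume:", finished)
print("restarted from scratch:", to_restart)
```

Run against a real jobs/<job-dir>, a helper like this gives a rough estimate of how much work a resume would redo.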

Inspect progress and costs with pier view jobs.

What is Pier

Pier is a Harbor-compatible framework for sandboxed coding-agent evals. It began as a fork of Harbor to support CLI agents in air-gapped tasks: Harbor blocks all outbound traffic in allow_internet = false tasks, including dependency installs and LLM API calls. Pier adds per-agent network allowlists, giving agents only the network access they need while keeping the task environment isolated.

Pier also adds more complete trajectory metadata, a better trajectory viewer, and pier critique run for analyzing agent trajectories. All leaderboard scores were produced with Pier running mini-swe-agent on Modal.

Agents and models

mini-swe-agent is model-agnostic. Pier also drives claude-code, codex, gemini-cli, and opencode directly. Pass --env modal to run in parallel sandboxes on Modal.

Subsets and single tasks

Deterministic random subset of the 113-task corpus:

pier run -p deep-swe/tasks --agent mini-swe-agent --n-tasks 10 --sample-seed 0
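A seeded sample is what makes a subset like this reproducible. The sketch below shows the general technique with Python's standard library. It is not Pier's sampling code, so the task IDs it selects will not match those chosen by --sample-seed, and the task names are placeholders.

```python
import random


def sample_tasks(task_ids: list[str], n: int, seed: int) -> list[str]:
    """Pick n task IDs reproducibly for a given seed.

    Sorting first makes the result independent of the order in which
    the IDs were discovered (for example, directory listing order).
    """
    rng = random.Random(seed)  # private generator; global random state is untouched
    return rng.sample(sorted(task_ids), n)


# Placeholder IDs standing in for the 113 task directories.
all_tasks = [f"task-{i:03d}" for i in range(113)]

first = sample_tasks(all_tasks, 10, seed=0)
second = sample_tasks(list(reversed(all_tasks)), 10, seed=0)

print(first)
print("same subset on both runs:", first == second)
```

Using a dedicated `random.Random(seed)` instance avoids interference from other code that draws from the module-level generator.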

Single task:

pier run -p deep-swe/tasks/<task-id> --agent mini-swe-agent
