DeepSWE is a benchmark for measuring frontier coding agents on original, long-horizon software engineering tasks drawn from active open-source repositories. The benchmark includes 113 tasks across TypeScript, Go, Python, JavaScript, and Rust, with isolated environments and program-based verifiers.
DeepSWE tasks use the Harbor task format:

| Path | Contents |
|---|---|
| `task.toml` | Metadata: repository, base commit, language, prebuilt image, resource limits |
| `instruction.md` | The prompt the agent sees |
| `environment/` | Dockerfile that reproduces the prebuilt image (fallback if the image is unavailable) |
| `tests/` | Verifier: `test.sh` (entry point) + `test.patch` (test additions, applied at grading time) |
| `solution/` | Reference solution (held out from the agent; for human and AI reviewers) |

The verifier exercises the behavior the prompt describes. It accepts any solution whose observable behavior is correct, regardless of internal symbol names or structure.
The reference patch in solution/ is never used at grading time; it exists so reviewers can spot-check correctness offline.
Use Pier to run the benchmark:

    git clone https://github.com/datacurve-ai/deep-swe
    cd deep-swe
    cp .env.example .env   # fill in BASETEN_*

    # >= 0.2.1 (environment.mounts support) is git-only; PyPI stops at 0.2.0
    uv tool install "datacurve-pier @ git+https://github.com/datacurve-ai/pier"

    # Dev smoke test (1 task, 30 min timeout, 600 step cap, 1 worker)
    ./mini-swe-agent/run_dev.sh

    # Full 113-task eval (leaderboard-comparable; verification on; 4 workers)
    ./scripts/pier-run.sh run -c mini-swe-agent/full.yaml -y

    # Full eval on the DGX Spark (8 workers; aarch64, see note below)
    ./scripts/prebuild.sh   # one-time: task images + agent overlays, natively for arm64
    ./mini-swe-agent/run_spark.sh

    # One-task end-to-end wiring check (prebuild → trial → verify telemetry/cost)
    ./mini-swe-agent/spark-smoke.sh

    # Single task
    pier run -c mini-swe-agent/dev.yaml --env-file .env -y -p tasks/<task-id>

Job configs and run wrappers live in `mini-swe-agent/`:
| Config | Wrapper | What |
|---|---|---|
| `dev.yaml` | `run_dev.sh` | 1-task smoke / iteration |
| `full.yaml` | (none) | full 113-task eval (laptop, 4 workers) |
| `full-spark.yaml` | `run_spark.sh` | full eval, DGX Spark (8 workers) |
| `full-spark-optimistic.yaml` | `run_spark_optimistic.sh` | 22 tasks with >=50% avg pass rate, 1 attempt |
| `dev-kimi.yaml` | `run_dev_kimi.sh` | dev against tim-kimi2.6 |
| `full-spark-kimi.yaml` | `run_spark_kimi.sh` | spark eval against tim-kimi2.6 |
| `full-spark-optimistic-kimi.yaml` | `run_spark_optimistic_kimi.sh` | optimistic subset against tim-kimi2.6 |

`DEEP_SWE_ROOT` (used by the `pricing/` and `scripts/` bind mounts) is derived automatically by `scripts/pier-run.sh` from the checkout location. Do not set it in `.env`; a stale value from another machine is stripped and overridden.
The prebuilt task images in `task.toml` (`[environment].docker_image`) are amd64-only. Rosetta is macOS-only, so on Linux arm64 the only x86 emulation is QEMU, which is too slow and flaky for the eval. Instead, `mini-swe-agent/full-spark.yaml` sets `force_build: true` so Pier builds each task natively from `environment/Dockerfile` (the mars-base base image is multi-arch). Run `./scripts/prebuild.sh` once before the eval to warm the Docker layer cache (set `PREBUILD_JOBS=8` to parallelize); failures are summarized with per-task logs. Two Dockerfiles carry arm64-specific fixes: cliffy-config-file-parsing (arch-specific Deno binary) and eicrud-keyset-pagination-cursor (MongoDB ships no arm64 server packages for Debian, so it uses the Ubuntu repo on arm64). The native arm64 environments are rebuilt from the Dockerfiles and are not the leaderboard's prebuilt amd64 images. Tasks and verifiers are identical, but strict leaderboard comparability requires amd64.
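The platform reasoning above can be sketched as a small helper that suggests which full-eval config fits the host. This is only an illustration: `suggest_config` is a hypothetical function, not part of Pier or this repository, and it assumes that any host other than Linux arm64 can use the prebuilt amd64 images (natively, or via Rosetta on macOS).

```python
import platform
from typing import Optional

# Names that platform.machine() commonly reports for 64-bit ARM hosts.
ARM64_NAMES = {"aarch64", "arm64"}


def needs_native_build(machine: str, system: str) -> bool:
    """True when amd64-only images would have to run under QEMU emulation."""
    return system == "Linux" and machine.lower() in ARM64_NAMES


def suggest_config(machine: Optional[str] = None, system: Optional[str] = None) -> str:
    """Hypothetical helper: choose a full-eval config for the given host."""
    machine = machine or platform.machine()
    system = system or platform.system()
    if needs_native_build(machine, system):
        # Linux arm64 (for example the DGX Spark): build from environment/Dockerfile.
        return "mini-swe-agent/full-spark.yaml"
    # Assumption: every other host can run the prebuilt amd64 images.
    return "mini-swe-agent/full.yaml"


print(suggest_config("aarch64", "Linux"))
print(suggest_config())  # decision for the current host
```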
All job configs bind-mount pricing/ and register pricing/model-registry.json (combined, one entry per model slug) via LITELLM_MODEL_REGISTRY_PATH at trial start — pricing/model changes never require an image rebuild:
| Model | Input (cache miss) | Input (cache hit) | Output |
|---|---|---|---|
| `subconscious/tim-qwen3.6-27b` | $0.30 / 1M | $0.15 / 1M | $3.00 / 1M |
| `subconscious/tim-kimi2.6` | $0.50 / 1M | $0.15 / 1M | $3.00 / 1M |

Costs appear in `jobs/<job-dir>/result.json` as `stats.cost_usd` and per-trial `agent_result.cost_usd` when the run completes. Per-call token counts are also logged (`llm_timing.jsonl`), so any past run can be re-priced under a different curve without rerunning: `python3 scripts/reprice.py jobs/<job-dir> pricing/<curve>.json`.
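To make the re-pricing arithmetic concrete, the sketch below applies the two curves from the pricing table to the same set of per-call token counts. It is not the implementation of `scripts/reprice.py`; the record field names (`miss`, `hit`, `out`) and the token counts are made up for illustration, and the real `llm_timing.jsonl` schema may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Curve:
    """Prices in USD per 1M tokens."""
    input_miss: float
    input_hit: float
    output: float


# Rates copied from the pricing table above.
CURVES = {
    "subconscious/tim-qwen3.6-27b": Curve(0.30, 0.15, 3.00),
    "subconscious/tim-kimi2.6": Curve(0.50, 0.15, 3.00),
}

# Hypothetical per-call token counts; field names are assumptions.
calls = [
    {"miss": 200_000, "hit": 800_000, "out": 10_000},
    {"miss": 100_000, "hit": 900_000, "out": 20_000},
]


def call_cost(curve: Curve, miss: int, hit: int, out: int) -> float:
    """Cost of one LLM call in USD under the given curve."""
    total = miss * curve.input_miss + hit * curve.input_hit + out * curve.output
    return total / 1_000_000


def price_run(records: list, curve: Curve) -> float:
    """Sum the per-call costs for a whole run."""
    return sum(call_cost(curve, r["miss"], r["hit"], r["out"]) for r in records)


qwen_total = price_run(calls, CURVES["subconscious/tim-qwen3.6-27b"])
kimi_total = price_run(calls, CURVES["subconscious/tim-kimi2.6"])
print(f"qwen curve: ${qwen_total:.3f}  kimi curve: ${kimi_total:.3f}")
```

Because only the token counts are needed, the same records can be priced under any number of curves after the fact.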
| | Dev (`mini-swe-agent/dev.yaml`) | Full (`mini-swe-agent/full.yaml`) |
|---|---|---|
| Concurrent trials | 1 | 4 |
| Agent timeout | 30 min (`override_timeout_sec`) | 90 min (`task.toml` `[agent] timeout_sec`) |
| Step limit | 600 | Unlimited until timeout (mini-swe default) |
| Tasks | Pinned in YAML (edit `task_names`) | All 113 under `tasks/` |
- Pause: Press `Ctrl+C` once in the terminal running `pier run`. Let Pier shut down gracefully before force-killing the process or closing the laptop.
- What is saved: `jobs/<timestamp>/config.json`, `result.json`, `lock.json`, and per-trial directories with `result.json` for finished trials.
- Resume (uses the frozen `config.json` saved in the job directory, not the current YAML on disk):

      pier job resume -p jobs/<job-dir>

  Or with DEEP_SWE_ROOT set: `./scripts/pier-run.sh job resume -p jobs/<job-dir>`
- Completed trials are skipped; only remaining trials run.
- By default, `pier job resume` removes trial dirs that ended with `CancelledError` so those tasks are retried.
- Trial folders without `result.json` but with partial artifacts are cleaned up and restarted from scratch.
- Resume does not continue a single agent mid-trajectory, only at trial granularity.
- If resume fails with a `lock.json` mismatch, Pier detected a change in job inputs (for example `DEEP_SWE_ROOT` or other `.env` values) since the job was created; restore the same environment or start a new job directory.
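The resume rules above can be approximated with a short script that predicts which trial directories would run again. This is a sketch of the described behavior, not Pier's code: the `exception_type` key is a hypothetical stand-in for wherever `result.json` actually records a `CancelledError`, and the directory names are invented.

```python
import json
import tempfile
from pathlib import Path


def trials_to_rerun(job_dir: Path) -> list[str]:
    """Return names of trial directories that a resume would run again."""
    rerun = []
    for trial in sorted(p for p in job_dir.iterdir() if p.is_dir()):
        result_file = trial / "result.json"
        if not result_file.exists():
            # No result.json: partial artifacts, restarted from scratch.
            rerun.append(trial.name)
            continue
        result = json.loads(result_file.read_text())
        # Hypothetical field name; the real result.json layout may differ.
        if result.get("exception_type") == "CancelledError":
            rerun.append(trial.name)
    return rerun


# Build a throwaway job directory with one finished, one cancelled,
# and one partial trial, then classify them.
with tempfile.TemporaryDirectory() as tmp:
    job = Path(tmp)
    layout = {
        "trial-a": {"exception_type": None},              # finished
        "trial-b": {"exception_type": "CancelledError"},  # cancelled
        "trial-c": None,                                  # no result.json
    }
    for name, payload in layout.items():
        (job / name).mkdir()
        if payload is not None:
            (job / name / "result.json").write_text(json.dumps(payload))
    (job / "config.json").write_text("{}")  # job-level files are ignored
    demo_rerun = trials_to_rerun(job)

print(demo_rerun)
```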
Inspect progress and costs with pier view jobs.
Pier is a Harbor-compatible framework for sandboxed coding-agent evals. It began as a fork of Harbor to support CLI agents in air-gapped tasks: Harbor blocks all outbound traffic in allow_internet = false tasks, including dependency installs and LLM API calls. Pier adds per-agent network allowlists, giving agents only the network access they need while keeping the task environment isolated.
Pier also adds more complete trajectory metadata, a better trajectory viewer, and pier critique run for analyzing agent trajectories. All leaderboard scores were produced with Pier running mini-swe-agent on Modal.
mini-swe-agent is model-agnostic. Pier also drives claude-code, codex, gemini-cli, and opencode directly. Pass --env modal to run in parallel sandboxes on Modal.
Deterministic random subset of the 113-task corpus:

    pier run -p deep-swe/tasks --agent mini-swe-agent --n-tasks 10 --sample-seed 0

Single task:

    pier run -p deep-swe/tasks/<task-id> --agent mini-swe-agent
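For intuition about why a seed makes the subset reproducible, here is a minimal sketch of seeded sampling. It does not claim to be Pier's sampling algorithm, so it will not necessarily select the same tasks as `--sample-seed 0`; the task names are placeholders.

```python
import random


def sample_tasks(task_names, n, seed):
    """Pick n tasks reproducibly: same names + same seed -> same subset."""
    rng = random.Random(seed)  # private generator, independent of global state
    # Sort first so the result does not depend on directory listing order.
    return rng.sample(sorted(task_names), n)


# Placeholder names standing in for the 113 task directories.
tasks = [f"task-{i:03d}" for i in range(113)]

first = sample_tasks(tasks, 10, seed=0)
second = sample_tasks(list(reversed(tasks)), 10, seed=0)
print(first == second)
```

Sorting before sampling is the design choice that matters here: without it, two machines that list the task directory in different orders could draw different subsets from the same seed.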