DeepSWE

DeepSWE is a benchmark for measuring frontier coding agents on original, long-horizon software engineering tasks drawn from active open-source repositories. The benchmark includes 113 tasks across TypeScript, Go, Python, JavaScript, and Rust, with isolated environments and program-based verifiers.

Task format

DeepSWE tasks use the Harbor task format:

task.toml         Metadata: repository, base commit, language, prebuilt image, resource limits
instruction.md    The prompt the agent sees
environment/      Dockerfile that reproduces the prebuilt image (fallback if the image is unavailable)
tests/            Verifier: test.sh (entry point) + test.patch (test additions, applied at grading time)
solution/         Reference solution (held out from the agent; for human and AI reviewers)

The verifier exercises the behavior the prompt describes. It accepts any solution whose observable behavior is correct, regardless of internal symbol names or structure. The reference patch in solution/ is never used at grading time; it exists so reviewers can spot-check correctness offline.
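As a quick sanity check on the layout above, the sketch below lists which expected entries are missing from a task directory. It is a hypothetical helper, not part of Harbor or Pier: the function name `missing_task_entries` and the constant `EXPECTED_ENTRIES` are made up here, and the entry list is simply read off the layout shown above.

```python
from pathlib import Path

# Entries read off the task layout documented above (hypothetical constant).
EXPECTED_ENTRIES = (
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "tests/test.sh",
    "tests/test.patch",
    "solution",
)


def missing_task_entries(task_dir):
    """Return the expected entries that do not exist under task_dir."""
    root = Path(task_dir)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


if __name__ == "__main__":
    import sys

    # Usage: python check_task.py tasks/<task-id>
    for entry in missing_task_entries(sys.argv[1]):
        print(f"missing: {entry}")
```

An empty result only means the files exist; it says nothing about whether their contents are valid.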

Quickstart

Use Pier to run the benchmark:

git clone https://github.com/datacurve-ai/deep-swe
cd deep-swe
cp .env.example .env   # fill in BASETEN_*
# Pier >= 0.2.1 (needed for environment.mounts support) is only installable from git; PyPI stops at 0.2.0
uv tool install "datacurve-pier @ git+https://github.com/datacurve-ai/pier"

# Dev smoke test (1 task, 30 min timeout, 600 step cap, 1 worker)
./mini-swe-agent/run_dev.sh

# Full 113-task eval (leaderboard-comparable; verification on; 4 workers)
./scripts/pier-run.sh run -c mini-swe-agent/full.yaml -y

# Full eval on the DGX Spark (8 workers; aarch64 — see note below)
./scripts/prebuild.sh   # one-time: task images + agent overlays, natively for arm64
./mini-swe-agent/run_spark.sh

# One-task end-to-end wiring check (prebuild → trial → verify telemetry/cost)
./mini-swe-agent/spark-smoke.sh

# Single task
pier run -c mini-swe-agent/dev.yaml --env-file .env -y -p tasks/<task-id>

Job configs and run wrappers live in mini-swe-agent/:

Config	Wrapper	What
`dev.yaml`	`run_dev.sh`	1-task smoke / iteration
`full.yaml`	—	full 113-task eval (laptop, 4 workers)
`full-spark.yaml`	`run_spark.sh`	full eval, DGX Spark (8 workers)
`full-spark-optimistic.yaml`	`run_spark_optimistic.sh`	22 tasks with >=50% avg pass rate, 1 attempt
`dev-kimi.yaml`	`run_dev_kimi.sh`	dev against `tim-kimi2.6`
`full-spark-kimi.yaml`	`run_spark_kimi.sh`	spark eval against `tim-kimi2.6`
`full-spark-optimistic-kimi.yaml`	`run_spark_optimistic_kimi.sh`	optimistic subset against `tim-kimi2.6`

DEEP_SWE_ROOT (used by the pricing/ and scripts/ bind mounts) is derived automatically by scripts/pier-run.sh from the checkout location. Do not set it in .env: a stale value from another machine is stripped and overridden.

aarch64 / DGX Spark

The prebuilt task images referenced in task.toml ([environment].docker_image) are amd64-only. Rosetta is macOS-only, so on Linux arm64 the only x86 emulation is QEMU, which is too slow and flaky for the eval. Instead, mini-swe-agent/full-spark.yaml sets force_build: true so that Pier builds each task natively from environment/Dockerfile (the mars-base base image is multi-arch).

Run ./scripts/prebuild.sh once before the eval to warm the Docker layer cache (set PREBUILD_JOBS=8 to parallelize); failures are summarized with per-task logs. Two Dockerfiles carry arm64-specific fixes: cliffy-config-file-parsing (arch-specific Deno binary) and eicrud-keyset-pagination-cursor (MongoDB ships no arm64 server packages for Debian, so it uses the Ubuntu repo on arm64).

The native arm64 environments are rebuilt from the Dockerfiles rather than pulled as the leaderboard's prebuilt amd64 images. Tasks and verifiers are identical, but strict leaderboard comparability requires amd64.
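The choice between the two full-eval configs can be sketched as a small platform check. This is an illustration only: `pick_full_config` is a hypothetical helper that does not exist in the repository, and it assumes that any Linux arm64 host should use the Spark config (which also raises the worker count to 8), while every other host, including Apple Silicon Macs with Rosetta, uses full.yaml.

```python
import platform


def pick_full_config(system, machine):
    """Pick a full-eval config path from OS and CPU architecture names.

    Assumption: Linux on arm64 has no usable x86 emulation, so it needs the
    config that sets force_build: true. Everything else uses full.yaml.
    """
    is_arm64 = machine.lower() in {"aarch64", "arm64"}
    if system == "Linux" and is_arm64:
        return "mini-swe-agent/full-spark.yaml"
    return "mini-swe-agent/full.yaml"


if __name__ == "__main__":
    # platform.system() returns e.g. "Linux" or "Darwin";
    # platform.machine() returns e.g. "x86_64", "aarch64", or "arm64".
    print(pick_full_config(platform.system(), platform.machine()))
```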

Token pricing

All job configs bind-mount pricing/ and register pricing/model-registry.json (a combined registry with one entry per model slug) via LITELLM_MODEL_REGISTRY_PATH at trial start, so changes to pricing or models never require an image rebuild:

Model	Input (cache miss)	Input (cache hit)	Output
`subconscious/tim-qwen3.6-27b`	$0.30 / 1M	$0.15 / 1M	$3.00 / 1M
`subconscious/tim-kimi2.6`	$0.50 / 1M	$0.15 / 1M	$3.00 / 1M

Costs appear in jobs/<job-dir>/result.json as stats.cost_usd and per-trial agent_result.cost_usd when the run completes. Per-call token counts are also logged (llm_timing.jsonl), so any past run can be re-priced under a different curve without rerunning: python3 scripts/reprice.py jobs/<job-dir> pricing/<curve>.json.
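To make the arithmetic behind these rates concrete, here is a minimal sketch that prices a single call from its token counts. The rates are copied from the table above; everything else is an assumption for illustration: the function name `call_cost_usd`, the dictionary keys, and the split into three token counts are hypothetical and do not reflect the actual schema of llm_timing.jsonl, pricing/model-registry.json, or the logic in scripts/reprice.py.

```python
# USD per 1M tokens, copied from the pricing table above.
# The key names are hypothetical, chosen for this sketch only.
PRICES_PER_MILLION = {
    "subconscious/tim-qwen3.6-27b": {"input_miss": 0.30, "input_hit": 0.15, "output": 3.00},
    "subconscious/tim-kimi2.6": {"input_miss": 0.50, "input_hit": 0.15, "output": 3.00},
}


def call_cost_usd(model, input_miss_tokens, input_hit_tokens, output_tokens):
    """Return the cost in USD of one call, given its token counts."""
    price = PRICES_PER_MILLION[model]
    total_per_million = (
        input_miss_tokens * price["input_miss"]
        + input_hit_tokens * price["input_hit"]
        + output_tokens * price["output"]
    )
    return total_per_million / 1_000_000


if __name__ == "__main__":
    # 1M uncached input + 2M cached input + 100k output tokens on tim-qwen3.6-27b
    cost = call_cost_usd("subconscious/tim-qwen3.6-27b", 1_000_000, 2_000_000, 100_000)
    print(f"${cost:.2f}")
```

In this example each of the three terms contributes about $0.30, for a total of about $0.90. Summing the same function over logged calls is, in spirit, what re-pricing a past run under a different curve amounts to.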

Dev vs full limits

Limit	Dev (`mini-swe-agent/dev.yaml`)	Full (`mini-swe-agent/full.yaml`)
Concurrent trials	1	4
Agent timeout	30 min (`override_timeout_sec`)	90 min (`task.toml` `[agent] timeout_sec`)
Step limit	600	Unlimited until timeout (mini-swe default)
Tasks	Pinned in YAML (edit `task_names`)	All 113 under `tasks/`

Pause, resume, and recover

Pause: Press Ctrl+C once in the terminal running pier run. Let Pier shut down gracefully before force-killing the process or closing the laptop.
What is saved: jobs/<timestamp>/config.json, result.json, lock.json, and per-trial directories with result.json for finished trials.
Resume (uses the frozen config.json saved in the job directory—not the current YAML on disk):

pier job resume -p jobs/<job-dir>

Or, through the wrapper, which sets DEEP_SWE_ROOT: ./scripts/pier-run.sh job resume -p jobs/<job-dir>

Completed trials are skipped; only remaining trials run.
By default, pier job resume removes trial dirs that ended with CancelledError so those tasks are retried.
Trial folders without result.json but with partial artifacts are cleaned up and restarted from scratch.
Resume works at trial granularity only; it does not continue a single agent mid-trajectory.
If resume fails with a lock.json mismatch, Pier detected a change in job inputs (for example DEEP_SWE_ROOT or other .env values) since the job was created; restore the same environment or start a new job directory.
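Before resuming, it can help to preview which trials would be skipped and which would run again. The sketch below applies only the result.json rule described above and is not Pier's implementation: `split_trials` is a hypothetical helper, it assumes trial directories sit directly inside the job directory, and it does not inspect result.json for CancelledError, so cancelled trials that Pier would retry may show up here as finished.

```python
from pathlib import Path


def split_trials(job_dir):
    """Return (finished, rerun) lists of trial directory names.

    A trial counts as finished if its directory contains result.json.
    """
    finished, rerun = [], []
    for entry in sorted(Path(job_dir).iterdir()):
        if not entry.is_dir():
            continue  # skip job-level files such as config.json and lock.json
        if (entry / "result.json").is_file():
            finished.append(entry.name)
        else:
            rerun.append(entry.name)
    return finished, rerun


if __name__ == "__main__":
    import sys

    # Usage: python split_trials.py jobs/<job-dir>
    done, pending = split_trials(sys.argv[1])
    print(f"{len(done)} finished, {len(pending)} to run again")
```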

Inspect progress and costs with pier view jobs.

What is Pier

Pier is a Harbor-compatible framework for sandboxed coding-agent evals. It began as a fork of Harbor to support CLI agents in air-gapped tasks: Harbor blocks all outbound traffic in allow_internet = false tasks, including dependency installs and LLM API calls. Pier adds per-agent network allowlists, giving agents only the network access they need while keeping the task environment isolated.

Pier also adds more complete trajectory metadata, a better trajectory viewer, and pier critique run for analyzing agent trajectories. All leaderboard scores were produced with Pier running mini-swe-agent on Modal.

Agents and models

mini-swe-agent is model-agnostic. Pier also drives claude-code, codex, gemini-cli, and opencode directly. Pass --env modal to run in parallel sandboxes on Modal.

Subsets and single tasks

Deterministic random subset of the 113-task corpus:

pier run -p deep-swe/tasks --agent mini-swe-agent --n-tasks 10 --sample-seed 0
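What "deterministic" means here can be illustrated with a seeded sampler: the same seed and task list always yield the same subset. This is a generic sketch of the idea, not Pier's sampling code; `sample_tasks` is a hypothetical function, and the subset it picks for a given seed will not necessarily match what `--sample-seed` selects.

```python
import random


def sample_tasks(task_names, n_tasks, seed):
    """Return a reproducible random subset of task names.

    Names are sorted first so the result does not depend on input order.
    """
    rng = random.Random(seed)  # private generator; global random state is untouched
    return rng.sample(sorted(task_names), n_tasks)


if __name__ == "__main__":
    names = [f"task-{i:03d}" for i in range(113)]  # placeholder names
    first = sample_tasks(names, 10, seed=0)
    second = sample_tasks(names, 10, seed=0)
    print(first == second)
```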

Single task:

pier run -p deep-swe/tasks/<task-id> --agent mini-swe-agent
