Legacy name: auto_search_rubric / autosr. autosr remains supported as a legacy compatibility shim.
Reward Harness is a Harness engineering project for reward model optimization. It started from the rubric-search idea in Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training, then grew into a broader system for searching, versioning, serving, tracing, and prototyping closed-loop reward model workflows.
This repository keeps iterative as a baseline and uses evolutionary as the default search mode.
It now covers the path from automated rubric search to deployable RM artifacts, an RM server for online scoring,
and RL experiment registry/lineage tooling that lets external trainers consume the reward service while preserving
traceability.
Reward Harness is intentionally paused at the current personal open-source boundary. Stages A-D are the implemented and tested core: long-running rubric search, reproducible run records, deployable RM artifacts, RM server scoring, and RL training manifest/lineage management. Later stages require real RL/RM training resources that this project does not currently assume.
Stage E is therefore provided as prototype design documentation rather than active implementation work: docs/design-docs/03-stage-e-classifier-rm.md. It captures the intended next step: using RL training samples to build denoised score/preference datasets and hand them to an external classifier RM trainer.
The main idea is not just "search a better rubric". Reward Harness treats the reward model lifecycle as an engineering harness:
- automatically search and iterate rubric reward models;
- export the selected rubric into a versioned, deployable RM artifact;
- deploy that artifact behind an RM server for online scoring;
- manage RL training metadata, results, evaluation reports, comparisons, and lineage around that RM server;
- prototype a Stage E data plane where RL samples are repeatedly scored, denoised, converted into preference data, and used by an external classifier RM trainer.
- Unified runtime configuration with typed enums (reward_harness.types) and layered config dataclasses (reward_harness.config)
- Composition-root factory (ComponentFactory) for backend-aware dependency wiring
- Canonical domain models in reward_harness.data_models with legacy compatibility re-export in autosr.models
- Search extensibility:
  - Parent selection: rank, tournament, top_k
  - Adaptive mutation: fixed, success_feedback, exploration_decay, diversity_driven
  - Iteration scope: global_batch (dataset-level) and prompt_local (prompt-level independent evolution)
- LLM architecture split into transport config (reward_harness.llm_config) and runtime config (reward_harness.config)
- Deployable RM artifacts with validated schema and embedded runtime snapshot for server startup
- Deployment tracking via reward_harness.rm.deploy manifests with per-target previous_artifact_id resolution
- RM Server MVP (reward_harness.rm.server) exposing /healthz, /score, and /batch_score with closed-loop LLM scoring
- RL experiment registry and lineage tooling for external trainer manifests, results, eval reports, comparisons, and regression checks
- Stage E prototype design for classifier RM distillation from RL samples without requiring this repository to own GPU training
- Reproducibility outputs:
  - run_manifest embedded in output JSON
  - archived manifest and replay script under <output_parent>/run_records/
  - RM deployment records under artifacts/rm_deployments/
  - optional RM server request logs under artifacts/rm_server_logs/
- reward_harness/cli.py (legacy compatible: autosr/cli.py)
  - Parses CLI args only
  - Builds RuntimeConfig
  - Delegates runtime wiring to ComponentFactory
- reward_harness/factory.py (legacy compatible: autosr/factory.py)
  - Single composition root for backend selection and component assembly
  - Auto-resolves rank-based judge when all candidates provide metadata.rank
- reward_harness/config.py (legacy compatible: autosr/config.py)
  - Runtime-level configuration:
    - RuntimeConfig
    - LLMBackendConfig
    - SearchAlgorithmConfig
    - ObjectiveConfig (compat alias: ObjectiveFunctionConfig)
    - InitializerStrategyConfig, ContentExtractionConfig, VerifierConfig
- reward_harness/llm_config.py (legacy compatible: autosr/llm_config.py)
  - Low-level LLM transport/model config (LLMConfig, RoleModelConfig)
- reward_harness/types.py (legacy compatible: autosr/types.py)
  - Shared enums:
    - BackendType, SearchMode, EvolutionIterationScope, SelectionStrategy
    - AdaptiveMutationSchedule, InitializerStrategy, ExtractionStrategy, LLMRole
- reward_harness/data_models.py: canonical domain entities (Rubric, Criterion, PromptExample, ...)
- reward_harness/models.py: canonical models (legacy compatible via autosr/models.py)
- reward_harness/exceptions.py: shared LLM exceptions (LLMCallError, LLMParseError)
- reward_harness/io_utils.py: dataset/rubric I/O and run-record persistence
- reward_harness/run_records/use_cases.py: run manifest + reproducible shell script generation
- reward_harness/search/config.py: IterativeConfig, EvolutionaryConfig, SearchResult
- reward_harness/search/iterative.py: iterative baseline implementation
- reward_harness/search/evolutionary.py: evolutionary algorithm implementation
- reward_harness/search/strategies.py: reusable search helpers
- reward_harness/search/selection_strategies.py: parent selection policies
- reward_harness/search/adaptive_mutation.py: mutation scheduler and diversity metrics
- reward_harness/search/use_cases.py: searcher entrypoint exports
- reward_harness/llm_components/base.py: request/retry base + prompt rendering fallback
- reward_harness/llm_components/parsers.py: response normalization/validation
- reward_harness/llm_components/use_cases.py: initializer/proposer/verifier/judge implementations
- reward_harness/llm_components/factory.py: legacy helper kept for compatibility
- reward_harness/content_extraction/strategies.py: tag/regex/identity extraction
- reward_harness/content_extraction/use_cases.py: extraction-decorated verifier
- reward_harness/prompts/loader.py + reward_harness/prompts/constants.py: file templates and constant fallback
- reward_harness/rl/: experiment registry, lineage tracking, comparison, regression detection, and external RL training-run reference scaffolding
  - data_models.py, registry.py, lineage.py, validation.py, io.py
  - cli/: record_manifest, record_eval, record_result, show_lineage
  - verl/: prepare_training_run, run_verl_training, finalize_training_run, reward_client
- docs/design-docs/03-stage-e-classifier-rm.md: prototype design for the future classifier RM data plane
- reward_harness/rm/data_models.py: deployable RM artifact schema and deploy manifest schema
- reward_harness/rm/use_cases.py: artifact export and deployment-record use cases
- reward_harness/rm/export.py: CLI for exporting search output into a deployable RM artifact
- reward_harness/rm/deploy.py: CLI for recording per-environment deployment manifests
- reward_harness/rm/server.py: FastAPI RM server that loads the artifact runtime snapshot and serves scoring APIs
- reward_harness/: core package (legacy compatible: autosr/)
- reward_harness/rm/: RM artifact/export/deploy/server modules
- reward_harness/rl/: RL experiment lineage and external training-run reference modules
- prompts/: prompt templates (supports locale folders such as prompts/zh/ and prompts/en/)
- tests/: unittest test suite
- scripts/: unit/integration/formal run scripts
- examples/: demo datasets and examples
- artifacts/: default output directory for search outputs, RM artifacts, deployment manifests, and server logs
Requires Python >=3.11 and uv.
uv sync
Run commands with uv run:
uv run python -m reward_harness.cli --help
# Legacy compatibility: uv run python -m autosr.cli --help
uv sync installs both the search stack and RM server dependencies (fastapi, uvicorn).
Default (evolutionary):
uv run python -m reward_harness.cli \
--dataset examples/demo_dataset.json \
--mode evolutionary \
--output artifacts/best_rubrics.json
Iterative baseline:
uv run python -m reward_harness.cli \
--dataset examples/single_case.json \
--mode iterative \
--output artifacts/best_rubrics_iterative.json
Evolutionary with custom strategy and prompt locale:
uv run python -m reward_harness.cli \
--dataset examples/single_case_with_rank.json \
--mode evolutionary \
--output artifacts/best_rubrics_rank.json \
--selection-strategy top_k \
--adaptive-mutation diversity_driven \
--prompt-language zh
End-to-end RM flow:
# 1) Search for the best rubric
uv run python -m reward_harness.cli \
--dataset examples/demo_dataset.json \
--mode evolutionary \
--output artifacts/best_rubrics.json
# 2) Export a deployable RM artifact
uv run python -m reward_harness.rm.export \
--search-output artifacts/best_rubrics.json \
--out-artifact artifacts/rm_artifacts/rm_v1.json
# 3) Record deployment metadata
uv run python -m reward_harness.rm.deploy \
--artifact artifacts/rm_artifacts/rm_v1.json \
--deployment-target dev
# 4) Start the RM server
export LLM_API_KEY="..."
uv run python -m reward_harness.rm.server \
--artifact artifacts/rm_artifacts/rm_v1.json \
--host 0.0.0.0 \
--port 8080 \
--request-log-path artifacts/rm_server_logs/requests.jsonl
--backend {auto,mock,llm}:
- auto (default): resolves to llm when an API key is set, otherwise mock
- llm: requires an API key (LLM_API_KEY by default, configurable via --api-key-env)
- mock: local heuristic components only
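The resolution rule above can be sketched in a few lines. This is a minimal illustration rather than the repository's code: resolve_backend and its parameters are hypothetical names, and only the auto/llm/mock values and the LLM_API_KEY default come from the list above.

```python
import os
from typing import Mapping, Optional


def resolve_backend(
    requested: str = "auto",
    api_key_env: str = "LLM_API_KEY",
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the concrete backend ("llm" or "mock") for a requested mode.

    Hypothetical helper: it mirrors the documented rule, not the project's code.
    """
    env = os.environ if environ is None else environ
    has_key = bool(env.get(api_key_env, "").strip())

    if requested == "auto":
        # auto: use the LLM backend only when a key is available.
        return "llm" if has_key else "mock"
    if requested == "llm":
        if not has_key:
            raise RuntimeError(f"backend 'llm' requires {api_key_env} to be set")
        return "llm"
    if requested == "mock":
        return "mock"
    raise ValueError(f"unknown backend: {requested!r}")
```

Passing environ explicitly keeps the rule testable without touching the real process environment.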
Default endpoint/model:
- --base-url https://openrouter.ai/api/v1
- --model-default stepfun/step-3.5-flash:free
Role-specific model override is supported:
- --model-initializer
- --model-proposer
- --model-verifier
- --model-judge
Prompt locale loading order:
- prompts/<language>/ (when --prompt-language is set)
- prompts/
- built-in constants in code
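The loading order above is a plain fallback chain. The sketch below shows one way to express it; load_prompt, BUILTIN_PROMPTS, the "verifier" template name, and the .txt extension are all assumptions made for illustration and may not match reward_harness/prompts/loader.py.

```python
from pathlib import Path
from typing import Optional

# Hypothetical built-in fallback; the real constants live in the project code.
BUILTIN_PROMPTS = {"verifier": "Grade the response against each criterion."}


def load_prompt(name: str, prompts_dir: Path, language: Optional[str] = None) -> str:
    """Return the first template found, following the documented order."""
    candidates = []
    if language:
        # 1) locale folder, only when a language is requested
        candidates.append(prompts_dir / language / f"{name}.txt")
    # 2) default prompts folder
    candidates.append(prompts_dir / f"{name}.txt")
    for path in candidates:
        if path.is_file():
            return path.read_text(encoding="utf-8")
    # 3) built-in constant (raises KeyError for unknown names)
    return BUILTIN_PROMPTS[name]
```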
Formal LLM-backed flow:
export LLM_API_KEY="..."
./scripts/run_formal_search.sh \
examples/call_summary_dataset_with_rank_single.json \
evolutionary \
artifacts/best_rubrics_formal_call_summary.json
Note:
- scripts/run_formal_search.sh now defaults to --evolution-iteration-scope prompt_local
- Override with environment variable if needed:
EVOLUTION_ITERATION_SCOPE=global_batch ./scripts/run_formal_search.sh
Objective:
score = TailAcc - lambda_var * TailVar + mu_diverse * DiverseTailAcc
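As a small worked example, the objective is a linear combination of three terms. The function below only combines values that are already computed; how TailAcc, TailVar, and DiverseTailAcc are measured is not shown, and the numbers are made up for illustration.

```python
def objective_score(
    tail_acc: float,
    tail_var: float,
    diverse_tail_acc: float,
    lambda_var: float,
    mu_diverse: float,
) -> float:
    """Combine the three terms exactly as in the formula above."""
    return tail_acc - lambda_var * tail_var + mu_diverse * diverse_tail_acc


# Illustrative inputs only (not measured results).
example = objective_score(0.80, 0.10, 0.60, lambda_var=0.5, mu_diverse=0.25)
```

With these inputs the score is 0.80 - 0.5 * 0.10 + 0.25 * 0.60 = 0.90: the variance term costs 0.05 and the diversity term adds 0.15.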
Common flags:
- --generations, --population-size, --mutations-per-round, --batch-size
- --mutation-parent-count (number of parent rubrics used for mutation each generation)
- --tail-fraction, --lambda-var, --mu-diverse
- --pair-confidence-prior (pairwise confidence shrinkage; set 0 to disable)
- --selection-strategy {rank,tournament,top_k}
- --adaptive-mutation {fixed,success_feedback,exploration_decay,diversity_driven}
- --evolution-iteration-scope {global_batch,prompt_local}
- --stop-when-distinguished / --no-stop-when-distinguished (prompt-local early stop)
- --distinguish-margin (override top-margin threshold; default uses objective tie tolerance)
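To make the --selection-strategy choices more concrete, here is a sketch of two of them in their generic textbook form. These are not taken from reward_harness/search/selection_strategies.py; the function names, the tournament_size parameter, and the exact tie-breaking are assumptions.

```python
import random
from typing import Optional, Sequence


def select_top_k(scores: Sequence[float], k: int) -> list[int]:
    """Deterministically pick the indices of the k highest-scoring rubrics."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def select_tournament(
    scores: Sequence[float],
    k: int,
    tournament_size: int = 2,
    rng: Optional[random.Random] = None,
) -> list[int]:
    """Pick k parents; each is the best of a small random sample."""
    rng = rng or random.Random()
    size = min(tournament_size, len(scores))
    winners = []
    for _ in range(k):
        contenders = rng.sample(range(len(scores)), size)
        winners.append(max(contenders, key=lambda i: scores[i]))
    return winners
```

In this generic form, top_k always exploits the current best rubrics, while tournament selection keeps some randomness so weaker rubrics occasionally become parents.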
Verifier grading scale:
- Supports continuous criterion scores in 0-5 (preferred) and 0-1 (compatible).
- Final rubric score is normalized to 0-1 before objective computation.
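The normalization step can be illustrated as follows. This sketch assumes the rubric score is an unweighted mean of criterion scores and that the grading scale is passed in explicitly; the project may weight criteria or detect the scale differently, and normalize_rubric_score is a hypothetical name.

```python
from typing import Sequence


def normalize_rubric_score(
    criterion_scores: Sequence[float], scale_max: float = 5.0
) -> float:
    """Map criterion scores on a 0..scale_max scale to a single 0-1 score.

    Assumption: unweighted mean; out-of-range values are clipped to the scale.
    """
    if not criterion_scores:
        raise ValueError("at least one criterion score is required")
    clipped = [min(max(score, 0.0), scale_max) for score in criterion_scores]
    return sum(clipped) / (len(clipped) * scale_max)
```

For example, criterion scores of 5, 2.5, and 0 on the 0-5 scale give 7.5 / 15 = 0.5, and scores of 1.0 and 0.5 on the 0-1 scale (scale_max=1.0) give 0.75.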
Iteration behavior:
- global_batch:
  - original dataset-level generations
  - each generation evolves only the selected hard prompts (batch_size)
- prompt_local:
  - each prompt evolves independently for up to --generations generations
  - no cross-prompt batching dependency
  - can stop early per prompt when top candidates are already distinguished
  - supports step-wise checkpoint/resume at prompt/generation boundaries
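The prompt-local early stop can be pictured with a small check on the candidate scores. The sketch below reads "distinguished" as "the best candidate leads the runner-up by more than a margin", which is an interpretation of --distinguish-margin rather than the confirmed rule; both function names are hypothetical.

```python
from typing import Sequence


def is_distinguished(candidate_scores: Sequence[float], margin: float) -> bool:
    """True when the best score leads the runner-up by more than `margin`."""
    if len(candidate_scores) < 2:
        return False  # nothing to compare against
    best, runner_up = sorted(candidate_scores, reverse=True)[:2]
    return (best - runner_up) > margin


def generations_used(score_history: Sequence[Sequence[float]], margin: float) -> int:
    """Count the generations a prompt consumes before the early stop triggers."""
    for generation, scores in enumerate(score_history, start=1):
        if is_distinguished(scores, margin):
            return generation
    return len(score_history)
```

Because each prompt is checked on its own, an easy prompt can stop after a generation or two while a hard prompt uses its full budget.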
Input JSON must contain top-level prompts:
{
  "prompts": [
    {
      "prompt_id": "p1",
      "prompt": "Write ...",
      "candidates": [
        {
          "candidate_id": "c1",
          "text": "response text",
          "source": "strong",
          "metadata": { "quality": 0.91, "rank": 1 }
        }
      ]
    }
  ]
}
Notes:
- prompt_id and prompt are required
- each prompt must provide at least 2 candidates
- metadata.rank is optional (1 is best); if present for all candidates, rank-based judge is auto-selected
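The notes above translate into a short pre-flight check. This is a sketch under stated assumptions, not the loader in reward_harness/io_utils.py: validate_dataset is a hypothetical helper, and "all candidates" is read as every candidate across the whole dataset.

```python
from typing import Any


def validate_dataset(data: dict[str, Any]) -> bool:
    """Validate the input shape; return True when every candidate has metadata.rank."""
    prompts = data.get("prompts")
    if not isinstance(prompts, list):
        raise ValueError("input JSON must contain a top-level 'prompts' list")

    all_ranked = True
    for item in prompts:
        for key in ("prompt_id", "prompt"):
            if not item.get(key):
                raise ValueError(f"missing required field: {key}")
        candidates = item.get("candidates", [])
        if len(candidates) < 2:
            raise ValueError(f"prompt {item['prompt_id']} needs at least 2 candidates")
        for candidate in candidates:
            if "rank" not in candidate.get("metadata", {}):
                all_ranked = False
    return all_ranked
```

A True result corresponds to the case where the rank-based judge would be auto-selected; False means at least one candidate lacks a rank.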
Main output JSON (--output) includes:
- best_rubrics (array; each item may include best_candidate_id and candidate_scores)
- best_objective_scores
- best_scores (legacy alias of best_objective_scores)
- optional run_manifest
Per-run reproducibility files are written to:
- <output_parent>/run_records/<output_stem>_<run_id>.manifest.json
- <output_parent>/run_records/<output_stem>_<run_id>.reproduce.sh
RM artifact and deployment outputs:
- artifacts/rm_artifacts/*.json: deployable RM artifacts exported from search results
- artifacts/rm_deployments/*.json: deployment records with deployment_target, deployed_by, and previous_artifact_id
- artifacts/rm_server_logs/requests.jsonl: request logs emitted by reward_harness.rm.server
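The previous_artifact_id field in deployment records is resolved per target. One plausible way to do that is sketched below; it assumes records are ordered oldest to newest and that each record also carries an artifact_id field, which is a hypothetical name not confirmed by the list above.

```python
from typing import Optional, Sequence


def resolve_previous_artifact_id(
    records: Sequence[dict], deployment_target: str
) -> Optional[str]:
    """Return the artifact most recently deployed to `deployment_target`, if any.

    Assumptions: `records` is ordered oldest to newest, and "artifact_id" is a
    hypothetical field name used only for this illustration.
    """
    for record in reversed(records):
        if record["deployment_target"] == deployment_target:
            return record["artifact_id"]
    return None  # first deployment to this target
```

Scoping the lookup to one target means a deployment to dev does not change what prod considers its previous artifact.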
RM server notes:
- The server requires an artifact with the embedded runtime snapshot produced by reward_harness.rm.export.
- Stable endpoints: GET /healthz, POST /score, POST /batch_score.
Recommended local quality gate before handing off changes:
./scripts/run_tests_unit.sh
uv run python scripts/validate_docs.py
./scripts/run_quality_checks.sh
run_quality_checks.sh enforces ruff lint and the current staged mypy scope. Ruff format remains a visible report rather than a hard gate until the repository is batch-formatted.
Unit tests:
./scripts/run_tests_unit.sh
Integration tests (require an API key):
export LLM_API_KEY="..."
./scripts/run_tests_integration.sh
Aggregate entrypoint:
./scripts/run_tests.sh
Run all tests directly:
uv run python -m unittest discover -s tests -p "test_*.py"
Architecture-focused regression set:
uv run python -m unittest \
tests.test_architecture_refactor \
tests.test_cli_backend_selection \
tests.test_cli_best_candidates \
tests.test_io_utils \
tests.test_search_config_enum_unification \
tests.test_data_models_compat \
tests.test_exceptions_module \
tests.test_evolutionary_decoupling
RL lineage regression set:
uv run python -m unittest \
tests.test_rl_lineage \
tests.test_rl_verl_reference_flow
RM artifact/server regression set:
uv run python -m unittest \
tests.test_rm_artifact \
tests.test_rm_deploy_manifest \
tests.test_rm_server
- Import domain entities from reward_harness.data_models in new code.
- reward_harness.models is the canonical module; autosr.models is a long-term compatibility re-export shim for historical import paths. Keep it working, but do not use it in new code.
- Prefer ComponentFactory(RuntimeConfig(...)) over manual runtime wiring.
- Keep secrets in environment variables only (LLM_API_KEY, optional LLM_BASE_URL, LLM_MODEL).