This guide describes the public reproduction path for OpenGUI-RL. It is intentionally honest about what the repository can reproduce by itself and what requires external benchmark access.
Without external datasets or checkpoints, users can:
- inspect the public README, docs, and summary artifacts,
- run lightweight unit tests,
- inspect the reward schema, candidate representation, and dual-path verifier code,
- use synthetic records in data_examples/ to understand the expected data interface,
- inspect public summary artifacts in artifacts/.
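Before requesting any benchmark access, it can help to see which fields the synthetic records carry. The sketch below assumes the files under data_examples/ are JSON Lines (one JSON object per line). That file format, the `*.jsonl` extension, and the helper names `summarize_fields` and `summarize_directory` are assumptions made for illustration, not something this guide documents.

```python
from collections import Counter
from pathlib import Path
from typing import Dict, Iterable
import json


def summarize_fields(lines: Iterable[str]) -> Counter:
    """Count how often each top-level key appears across JSONL records."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts


def summarize_directory(root: str = "data_examples") -> Dict[str, Counter]:
    """Apply summarize_fields to every *.jsonl file under root (assumed layout)."""
    summary: Dict[str, Counter] = {}
    for path in sorted(Path(root).rglob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            summary[str(path)] = summarize_fields(handle)
    return summary
```

Calling `summarize_directory()` from the repository root would return one key-frequency table per file, which is usually enough to spot optional versus required fields.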
The main reported experiments require:
- Mind2Web access for Stage A and Stage B training / evaluation,
- ScreenSpot-v2 access for held-out point-native and dual-path evaluation,
- VisualWebBench access for supplementary transfer analysis,
- Qwen2.5-VL model access and GPU memory suitable for VLM inference or LoRA training.
The repository does not ship model checkpoints, raw screenshots, or benchmark payloads.
conda create -n opengui-rl python=3.10 -y
conda activate opengui-rl
pip install -e ".[dev]"

Optional local environment:
cp .env.example .env

Fill in tokens only on your machine. Do not commit .env.
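A quick local check can confirm that the tokens are filled in without ever printing their values. This is a minimal sketch that assumes .env holds plain `KEY=VALUE` lines. The helper names and the example key `HF_TOKEN` are hypothetical; check .env.example for the real variable names.

```python
from typing import Dict, Iterable, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and '#' comments are ignored."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: Dict[str, str], required: Iterable[str]) -> List[str]:
    """Return required keys that are absent or empty; values are never returned."""
    return [key for key in required if not env.get(key)]
```

For example, `missing_keys(parse_env(open(".env").read()), ["HF_TOKEN"])` would report an unset token by name only, so the output is safe to paste into an issue.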
Stage A trains a Qwen2.5-VL LoRA policy to emit structured GUI actions. The final project setting uses screenshot + instruction + OCR/DOM-style candidate cues.
python scripts/run_train_sft.py \
  --config configs/train/mind2web_stageA_qwen2_5_vl_3b_sft_hybrid_candidates.yaml

Expected role:
- tests whether a faithful candidate-aware interface can build a strong supervised grounding baseline,
- produces the first-stage policy used by Stage B candidate export,
- depends on external Mind2Web access and local compute.
Stage B exports a small candidate pool per example and labels each candidate with deterministic verifiable reward.
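To make "deterministic verifiable reward" concrete, one plausible form checks a candidate's predicted click point against a ground-truth element box and its action type against the labeled action. The sketch below is written under that assumption only. The `Candidate` fields, the 0.75 / 0.25 weights, and `label_pool` are hypothetical and do not reproduce the repository's reward schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class Candidate:
    action_type: str                 # e.g. "click" or "type" (illustrative)
    point: Tuple[float, float]       # predicted (x, y) in pixels
    reward: float = 0.0              # filled in by label_pool


def point_in_box(point: Tuple[float, float], box: Box) -> bool:
    """True when the point lies inside the box, borders included."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def verifiable_reward(cand: Candidate, gold_action: str, gold_box: Box) -> float:
    """Deterministic score: same inputs always give the same reward."""
    hit = 1.0 if point_in_box(cand.point, gold_box) else 0.0
    type_ok = 1.0 if cand.action_type == gold_action else 0.0
    return 0.75 * hit + 0.25 * type_ok  # weights are illustrative


def label_pool(pool: List[Candidate], gold_action: str, gold_box: Box) -> List[Candidate]:
    """Attach a reward to every candidate in one example's pool."""
    for cand in pool:
        cand.reward = verifiable_reward(cand, gold_action, gold_box)
    return pool
```

Because the reward depends only on the candidate and the annotation, it can be recomputed and audited at any time without a learned judge.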
python scripts/run_generate_candidates.py \
  --config configs/train/mind2web_stageB_candidates_qwen2_5_vl_3b_hybrid_stagea.yaml

The exported pool supports:
- first-choice evaluation,
- oracle best-of-k headroom,
- pairwise preference construction,
- learned reward-based reranking.
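The first three uses can be sketched directly on reward-labeled pools. In the example below each pool is assumed to be a list of per-candidate rewards in generation order, with index 0 as the policy's first choice. The function names, the success threshold, and the pair format are illustrative assumptions rather than the repository's API.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def first_choice_accuracy(pools: Sequence[Sequence[float]], threshold: float = 1.0) -> float:
    """Fraction of examples whose first candidate reaches the threshold."""
    return sum(pool[0] >= threshold for pool in pools) / len(pools)


def oracle_best_of_k(pools: Sequence[Sequence[float]], k: int, threshold: float = 1.0) -> float:
    """Fraction of examples where any of the first k candidates reaches the threshold."""
    return sum(max(pool[:k]) >= threshold for pool in pools) / len(pools)


def preference_pairs(pool: Sequence[float], margin: float = 0.0) -> List[Tuple[int, int]]:
    """Return (winner_index, loser_index) pairs whose reward gap exceeds margin."""
    pairs: List[Tuple[int, int]] = []
    for i, j in combinations(range(len(pool)), 2):
        if pool[i] - pool[j] > margin:
            pairs.append((i, j))
        elif pool[j] - pool[i] > margin:
            pairs.append((j, i))
    return pairs


pools = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
# Headroom is the gap a perfect selector could close over the first choice.
headroom = oracle_best_of_k(pools, k=3) - first_choice_accuracy(pools)
```

On this toy data the first choice is right for one example in four while the oracle over three candidates is right for three in four, so the headroom is 0.5. Ties produce no preference pair, which keeps uninformative comparisons out of reranker training.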
python scripts/run_train_reranker.py \
  --config configs/train/mind2web_stageB_reranker_qwen_hybrid_stagea.yaml

The reranker is intentionally small and auditable. It is used to test whether reward-labeled candidate selection adds value after Stage A, not to hide the grounding problem inside another large model.
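As an illustration of what "small and auditable" can mean, the sketch below trains a linear scorer on preference pairs with a pairwise logistic loss and then reorders a pool by score. It is not the repository's reranker: the feature vectors, hyperparameters, and function names are all hypothetical, and a linear model is swapped in purely because its weights can be read directly.

```python
import math
from typing import List, Sequence, Tuple

# (features of the preferred candidate, features of the rejected candidate)
Pair = Tuple[Sequence[float], Sequence[float]]


def score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear score; every weight is directly inspectable."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(pairs: Sequence[Pair], dim: int, lr: float = 0.1, epochs: int = 200) -> List[float]:
    """Minimize -log(sigmoid(score(good) - score(bad))) with plain SGD."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for good, bad in pairs:
            margin = score(weights, good) - score(weights, bad)
            # 1 - sigmoid(margin): large when the pair is still mis-ordered.
            grad_scale = 1.0 / (1.0 + math.exp(margin))
            for d in range(dim):
                weights[d] += lr * grad_scale * (good[d] - bad[d])
    return weights


def rerank(weights: Sequence[float], pool: Sequence[Sequence[float]]) -> List[int]:
    """Return candidate indices sorted from highest to lowest score."""
    return sorted(range(len(pool)), key=lambda i: score(weights, pool[i]), reverse=True)
```

Comparing the top-ranked candidate's reward against the first-choice reward then shows directly whether selection adds anything beyond the Stage A policy.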
ScreenSpot-v2 point-native:
python scripts/run_eval_screenspot_v2.py \
  --config configs/eval/screenspot_v2_qwen2_5_vl_3b_point_native_decoupled.yaml

ScreenSpot-v2 dual-path verifier:
python scripts/run_eval_dual_path_verifier.py \
  --config configs/eval/screenspot_v2_qwen2_5_vl_3b_dual_path_verifier.yaml

VisualWebBench point-native:
python scripts/run_eval_visualwebbench.py \
  --config configs/eval/visualwebbench_qwen2_5_vl_3b_point_native_decoupled.yaml

VisualWebBench dual-path verifier:
python scripts/run_eval_visualwebbench_dual_path_verifier.py \
  --config configs/eval/visualwebbench_qwen2_5_vl_3b_dual_path_verifier.yaml

When local saved artifacts are available, recompute the summary:
python scripts/run_quantitative_metrics_suite.py

The public release includes a frozen copy of the reported summary in artifacts/metrics/. That copy is for inspection; it is not a substitute for rerunning the benchmark pipeline with external data access.
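After a rerun, it is useful to compare the recomputed summary against the frozen copy metric by metric. The sketch below assumes both summaries can be loaded as nested dictionaries of numbers (for example from JSON). The storage format, the tolerance, and the helper names are assumptions; the actual layout of artifacts/metrics/ is not described here.

```python
import math
from typing import Any, Dict, List


def flatten(metrics: Dict[str, Any], prefix: str = "") -> Dict[str, float]:
    """Flatten nested dicts into dotted names, keeping numeric leaves only."""
    flat: Dict[str, float] = {}
    for key, value in metrics.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[name] = float(value)
    return flat


def diff_metrics(frozen: Dict[str, Any], recomputed: Dict[str, Any], abs_tol: float = 1e-6) -> List[str]:
    """List metrics that are missing on one side or differ beyond abs_tol."""
    a, b = flatten(frozen), flatten(recomputed)
    problems: List[str] = []
    for name in sorted(set(a) | set(b)):
        if name not in a:
            problems.append(f"{name}: missing in frozen copy")
        elif name not in b:
            problems.append(f"{name}: missing in recomputed summary")
        elif not math.isclose(a[name], b[name], rel_tol=0.0, abs_tol=abs_tol):
            problems.append(f"{name}: {a[name]} != {b[name]}")
    return problems
```

An empty list means the rerun matches the frozen numbers within tolerance. Any entry points at a specific metric to investigate, which is more informative than a file-level diff.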
Do not read the reported numbers as "RL always beats supervision." The intended conclusion is narrower:
- representation quality comes first,
- reward-based reranking is useful when candidate-pool headroom remains,
- point-native inference transfers well,
- candidate-aware transfer depends on semantically meaningful candidate protocols.