A test runner for agentskills.io-style AI agent skills
Updated Jun 17, 2026 - TypeScript
Crash-test insurance claim AI agents before production.
Benchmark, evaluate, and optimize skills to ensure reliable performance across all LLMs
Turn feature specs into merged PRs with a self-supervising swarm of coding agents — parallel execution, isolated sandboxes, DAG dependencies. Open-source, self-hostable, model-agnostic (Claude / Gemini / Codex).
Visualize LLM outputs against datasets, manually annotate results, and run automated evaluations to algorithmically optimize prompts.
An implementation of Anthropic's paper and essay "A statistical approach to model evaluations"
Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.
Three runnable enterprise AI systems showing secure RAG, governed agents, AI release reliability, evals, traces, audit logs, and approval gates.
Squeeze your model with pressure prompts to see if its behavior leaks.
Detecting Relational Boundary Erosion in AI systems. A framework for testing whether models maintain honest, calibrated, and appropriate boundaries.
Codex-native autoresearch harness with structured worker/judge turns for optimizing anything you can measure.
Disposable Daytona sandboxes for LLM evals and isolated command execution
End-to-end LLM eval framework for AI products. Provider-agnostic and agent-native, with skills, traces, judge validation, and a human review interface.
A framework for evaluating large language models (LLMs) across a variety of tasks.
Alpha benchmark for repo continuation intelligence
A model-agnostic benchmark for sealed, replayable AI reasoning failures and tamper-evident failure history.
7 Claude Code skills for software architecture review (Python, web, cloud, microservices). Includes A/B benchmarks against unskilled baseline, assertion-graded eval suite, and interactive dashboards.
Evaluates LLM responses and measures their accuracy.