llm-evals

Here are 100 public repositories matching this topic...

darkrishabh / agent-skills-eval

A test runner for agentskills.io-style AI agent skills

cli yaml typescript ai-agents jsonl llm-evaluation llm-evals agent-evals agent-skills openai-compatible agentskills

Updated Jun 17, 2026
TypeScript

samarailly51-pixel / claimpilot-harness

Crash-test insurance claim AI agents before production.

python testing insurance ai-agents prompt-injection llm-evals agent-evaluation

Updated Jun 15, 2026
Python

fastxyz / skill-optimizer

Benchmark, evaluate, and optimize skills to ensure reliable performance across all LLMs

cli benchmark sdk ai mcp evaluation optimizer eval evaluation-framework ai-agent llm llm-eval evals openrouter llm-evaluation-framework tool-calling llm-evals ai-skill

Updated May 28, 2026
TypeScript

kivo360 / OmoiOS

Turn feature specs into merged PRs with a self-supervising swarm of coding agents: parallel execution, isolated sandboxes, DAG dependencies. Open-source, self-hostable, model-agnostic (Claude / Gemini / Codex).

Updated Jun 18, 2026
Python

ALucek / evaluizer

Visualize LLM outputs against datasets, manually annotate results, and run automated evaluations to algorithmically optimize prompts.

llm-optimizer llm-evals prompt-annotation prompt-optimizer

Updated Nov 22, 2025
TypeScript

The-Swarm-Corporation / StatisticalModelEvaluator

An implementation of Anthropic's paper and essay "A statistical approach to model evaluations"

ai ml multiagent agents llms evals llm-evals agent-evals multi-agent-eval

Updated Oct 6, 2025
Python

pyladiesams / eval-llm-based-apps-jan2025

Create an evaluation framework for your LLM based app. Incorporate it into your test suite. Lay the monitoring foundation.

workshop llm llms llmops llm-eval llm-test llm-evaluation-framework llm-evaluation-metrics llm-monitoring llm-testing llm-evals

Updated May 6, 2025
Jupyter Notebook

tiramitree / fde-ai-systems-portfolio

Three runnable enterprise AI systems showing secure RAG, governed agents, AI release reliability, evals, traces, audit logs, and approval gates.

python openai ai-safety human-in-the-loop ai-agents rag enterprise-ai agentic-workflows tool-calling llm-evals forward-deployed-engineering responses-api

Updated Jun 15, 2026
Python

tpertner / squeeze

Squeeze your model with pressure prompts to see if its behavior leaks.

reliability evaluation calibration alignment quality-assurance metamorphic-testing ai-safety trustworthiness hallucinations prompt-engineering llm-eval llm-evals

Updated Mar 1, 2026
Python

SaiTeja-Erukude / agentdog

AgentDog helps developers inspect, test, score, and monitor AI agent runs locally.

testing ai-safety ai-agents tool-use llm-evals agent-evaluation

Updated May 23, 2026
Python

tpertner / confess

Detecting Relational Boundary Erosion in AI systems. A framework for testing whether models maintain honest, calibrated, and appropriate boundaries.

python yaml calibration alignment metamorphic-testing model-evaluation ai-safety red-teaming prompt-injection hallucination-detection llm-evals evaluation-harness

Updated Feb 22, 2026
Python

kevinschaul / llm-evals

Because we should all have our own set of LLM evals.

llm llm-evals

Updated Apr 28, 2026
Python

aelaguiz / codex-autoresearcher

Codex-native autoresearch harness with structured worker/judge turns for optimizing anything you can measure.

python research optimization codex ai-agents llm-evals experiment-runner autoresearch

Updated Mar 21, 2026
Python

spences10 / ralph-town

Disposable Daytona sandboxes for LLM evals and isolated command execution

cli typescript mcp sandbox daytona llm evals llm-evals sandbox-orchestration

Updated Jun 17, 2026
TypeScript

itseffi / ai-product-evals

End-to-end LLM eval framework for AI products. Provider-agnostic and agent-native, with skills, traces, judge validation, and a human review interface.

benchmark tracing gemini openai observability rag anthropic evals openrouter ollama rag-evaluation llm-as-judge ai-products llm-evals

Updated May 24, 2026
JavaScript

LLMSystems / llm-evals

A framework for evaluating large language models (LLMs) across a variety of tasks.

nlp benchmark ai evaluation-framework ai-evaluation llm llm-evaluation llm-as-a-judge g-eval llm-evals

Updated Mar 18, 2026
Python

mverab / Reposcale

Alpha benchmark for repo continuation intelligence

python open-source benchmark evaluation developer-tools ai-agents ai-engineering llm-evals agent-evals llm-benchmark

Updated Apr 10, 2026
Python

iotaverbum-core / sfa-bench

A model-agnostic benchmark for sealed, replayable AI reasoning failures and tamper-evident failure history.

benchmark provenance deterministic reasoning tamper-evident replayability failure-analysis ai-evaluation auditability llm-evals

Updated Jun 16, 2026
Python

keez97 / claude-architecture-skills

7 Claude Code skills for software architecture review (Python, web, cloud, microservices). Includes A/B benchmarks against unskilled baseline, assertion-graded eval suite, and interactive dashboards.

python tdd architecture cloud-infrastructure code-review software-architecture ai-agents fastapi architecture-review prompt-engineering anthropic llm-evals claude-code claude-code-skills llm-benchmarks

Updated May 8, 2026
HTML

Pavansomisetty21 / GEval-Metrics-Analyzing-the-Reliability-of-LLM-Responses

Evaluates LLM responses and measures their accuracy.

llm-evaluation-metrics llm-evals geval

Updated Jul 8, 2025
Python
