ABEvalFlow

An automated, Tekton-orchestrated pipeline on OpenShift for evaluating AI skill submissions. It measures skill efficacy by comparing agent performance with and without the skill (the "gap") and produces statistical reports with pass rates, uplift metrics, and significance tests.

How It Works

1. Submit: Push a skill directory to the submissions repo; a Tekton EventListener triggers the pipeline.
2. Validate: Checks structure, compiles test files, and validates the metadata.yaml schema.
3. Generate: Optional AI-assisted generation of missing test artifacts:
   - Harbor: generates instruction.md and test_outputs.py from SKILL.md
   - ASE: generates evals.json from SKILL.md if not provided
4. Quality Review: AI-powered review of skill/test coherence (advisory, non-blocking).
5. Security Scan: Optional Cisco AI Defense scan for prompt injection and data exfiltration risks.
6. Evaluate: Two evaluation engines are supported:
   - Harbor: Full agent evaluation with container isolation:
     - Scaffold treatment/control container variants
     - Build & push images to OpenShift internal registry
     - Run N=20 attempts per variant
   - ASE: Lightweight LLM-as-judge evaluation using evals.json assertions (no containers).
7. Analyze: Computes pass rates, uplift (gap), and statistical significance (p-value).
8. Publish: Stores reports to MinIO and records results to PostgreSQL.
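
The Analyze step can be illustrated with a minimal sketch. This README does not say which significance test the pipeline uses; the one-sided Fisher's exact test below is an assumption, chosen because it stays exact for small samples such as 20 attempts per variant. The function names and the example counts are hypothetical and are not taken from the repository's scripts.

```python
from math import comb


def pass_rate(passes: int, attempts: int) -> float:
    """Fraction of attempts that passed."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return passes / attempts


def uplift(t_pass: int, t_n: int, c_pass: int, c_n: int) -> float:
    """Gap between treatment (with skill) and control (without skill) pass rates."""
    return pass_rate(t_pass, t_n) - pass_rate(c_pass, c_n)


def fisher_exact_one_sided(t_pass: int, t_n: int, c_pass: int, c_n: int) -> float:
    """P-value for 'treatment passes at least this often' under the null.

    Assumption: a one-sided Fisher's exact test. Under the null hypothesis
    the number of treatment passes follows a hypergeometric distribution,
    so the p-value is the upper tail starting at the observed count.
    """
    total_pass = t_pass + c_pass
    total_fail = (t_n + c_n) - total_pass
    tail = 0
    for k in range(t_pass, min(t_n, total_pass) + 1):
        # Ways to place k passes and (t_n - k) failures in the treatment group.
        tail += comb(total_pass, k) * comb(total_fail, t_n - k)
    return tail / comb(t_n + c_n, t_n)


# Hypothetical counts: 15 of 20 attempts pass with the skill, 8 of 20 without.
gap = uplift(15, 20, 8, 20)
p_value = fisher_exact_one_sided(15, 20, 8, 20)
print(f"uplift={gap:.2f}, p={p_value:.4f}")
```

A positive uplift with a small p-value suggests the skill helps. The threshold the pipeline applies, if any, is not stated here.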

Repository Structure

ABEvalFlow/
├── Docs/                    # ADR, implementation plan, guides
├── pipeline/
│   ├── pipeline.yaml        # Main pipeline definition
│   ├── triggers/            # EventListener, TriggerTemplate, TriggerBinding
│   └── tasks/
│       ├── validate.yaml
│       ├── generate_tests.yaml
│       ├── test-quality-review.yaml
│       ├── security-scan.yaml
│       ├── scaffold.yaml
│       ├── build-push.yaml
│       ├── harbor-eval.yaml
│       ├── analyze-report.yaml
│       └── publish-store.yaml
├── templates/               # Jinja2 templates (Dockerfiles, test.sh, task.toml)
├── scripts/                 # Python scripts invoked by pipeline tasks
├── config/                  # K8s manifests (RBAC, PostgreSQL, LiteLLM)
└── tests/                   # Unit and integration tests

Related Repositories

| Repository | Purpose |
| --- | --- |
| skill-submissions | Submission intake: users push skills here to trigger evaluation |
| skills_eval_corrections | Harbor fork with OpenShift backend |
| cisco-ai-defense/skill-scanner | Security scanner for prompt injection and data exfiltration detection |

Submission Formats

Harbor Format (full agent evaluation)

my-skill-name/
├── instruction.md       # Task description (required)
├── skills/
│   └── SKILL.md         # Skill definition (required)
├── tests/
│   ├── test_outputs.py  # Verification tests (required)
│   └── llm_judge.py     # LLM-based judge (optional)
├── docs/                # Reference documentation (optional)
├── supportive/          # Mock MCPs, data files (optional, <50MB)
└── metadata.yaml        # Name, persona, etc. (required)
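
The layout above can be checked locally before pushing. The sketch below is not the pipeline's Validate task; it only tests for the files marked "required" in the listing and sums the size of supportive/. The function names are hypothetical, and reading "<50MB" as 50 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

# Files marked "required" in the Harbor layout listing.
REQUIRED_HARBOR_FILES = (
    "instruction.md",
    "skills/SKILL.md",
    "tests/test_outputs.py",
    "metadata.yaml",
)

# Assumption: the 50MB limit is interpreted as 50 MiB.
MAX_SUPPORTIVE_BYTES = 50 * 1024 * 1024


def missing_harbor_files(submission_dir) -> list[str]:
    """Return required relative paths that are absent from the submission."""
    root = Path(submission_dir)
    return [rel for rel in REQUIRED_HARBOR_FILES if not (root / rel).is_file()]


def supportive_size_bytes(submission_dir) -> int:
    """Total size of all files under supportive/, or 0 if it does not exist."""
    supportive = Path(submission_dir) / "supportive"
    if not supportive.is_dir():
        return 0
    return sum(p.stat().st_size for p in supportive.rglob("*") if p.is_file())


def check_harbor_submission(submission_dir) -> list[str]:
    """Collect human-readable problems; an empty list means the layout looks fine."""
    problems = [
        f"missing required file: {rel}"
        for rel in missing_harbor_files(submission_dir)
    ]
    if supportive_size_bytes(submission_dir) >= MAX_SUPPORTIVE_BYTES:
        problems.append("supportive/ must be smaller than 50MB")
    return problems
```

Calling `check_harbor_submission("my-skill-name")` returns a list of problems to fix before submitting. Note that the Generate step may produce instruction.md and test_outputs.py when they are absent, so the real pipeline may be more lenient than this check.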

ASE Format (lightweight LLM-as-judge)

my-skill-name/
├── skills/
│   └── SKILL.md         # Skill definition (required)
├── evals/
│   ├── evals.json       # Evaluation prompts and assertions (optional, generated if missing)
│   └── files/           # Test data files (optional)
└── metadata.yaml        # Name, etc. (required)

Trigger with the eval-engine=ase parameter. See the Trigger Guide for details.
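
Because evals.json is optional in the ASE layout, it can help to know in advance whether the Generate step would have to create it. The sketch below is a hypothetical local check, not part of the pipeline; the function name and the shape of the returned dictionary are assumptions.

```python
from pathlib import Path

# Files marked "required" in the ASE layout listing.
REQUIRED_ASE_FILES = ("skills/SKILL.md", "metadata.yaml")


def plan_ase_submission(submission_dir) -> dict:
    """Report missing required files and whether evals.json would need generating."""
    root = Path(submission_dir)
    return {
        "missing_required": [
            rel for rel in REQUIRED_ASE_FILES if not (root / rel).is_file()
        ],
        # evals.json is optional; per the listing it is generated if missing.
        "generate_evals": not (root / "evals" / "evals.json").is_file(),
    }
```

For a directory that contains only skills/SKILL.md and metadata.yaml, the result has an empty `missing_required` list and `generate_evals` set to `True`.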

LLM Access

The pipeline is LLM-agnostic. Three modes are supported:

| Mode | Proxy Required? |
| --- | --- |
| Direct API key (Anthropic, OpenAI, etc.) | No |
| opencode + self-hosted model (vLLM, Ollama) | No |
| Google Vertex AI + LiteLLM proxy | Yes |

Prerequisites

- OpenShift cluster with Pipelines operator (Tekton)
- Container registry (Quay.io) with push credentials
- Harbor fork with OpenShift backend
- LLM access (one of the three modes above)
- Python 3.11+

Documentation

- Trigger Guide: How to submit skills for evaluation
- ADR: Skill Evaluation Pipeline

License

Apache License 2.0
