Automated Tekton-orchestrated pipeline on OpenShift for evaluating AI skill submissions. Measures skill efficacy by comparing agent performance with and without skills (the "gap"), producing statistical reports with pass rates, uplift metrics, and significance tests.
- Submit — Push a skill directory to the submissions repo; a Tekton EventListener triggers the pipeline.
- Validate — Checks structure, compiles test files, validates `metadata.yaml` schema.
- Generate — AI-assisted generation of missing test artifacts (optional):
  - Harbor: generates `instruction.md` and `test_outputs.py` from `SKILL.md`
  - ASE: generates `evals.json` from `SKILL.md` if not provided
- Quality Review — AI-powered review of skill/test coherence (advisory, non-blocking).
- Security Scan — Optional Cisco AI Defense scan for prompt injection and data exfiltration risks.
- Evaluate — Two evaluation engines supported:
  - Harbor — Full agent evaluation with container isolation:
    - Scaffold treatment/control container variants
    - Build & push images to OpenShift internal registry
    - Run N=20 attempts per variant
  - ASE — Lightweight LLM-as-judge evaluation using `evals.json` assertions (no containers).
- Analyze — Computes pass rates, uplift (gap), statistical significance (p-value).
- Publish — Stores reports to MinIO, records results to PostgreSQL.
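The Analyze step can be illustrated with a minimal sketch. The source does not say which significance test the pipeline uses; this example assumes a one-sided Fisher's exact test, which suits small samples such as 20 attempts per variant. The function names and the pass counts are hypothetical and chosen only for illustration.

```python
from math import comb


def fisher_exact_one_sided(pass_t: int, n_t: int, pass_c: int, n_c: int) -> float:
    """P-value for 'treatment passes at least this often' under equal pass rates.

    Assumption: one-sided Fisher's exact test; the real pipeline may use another test.
    """
    total_pass = pass_t + pass_c
    denom = comb(n_t + n_c, total_pass)
    upper = min(n_t, total_pass)
    # Sum hypergeometric probabilities for every outcome at least as
    # favourable to the treatment variant as the one observed.
    numer = sum(
        comb(n_t, k) * comb(n_c, total_pass - k)
        for k in range(pass_t, upper + 1)
    )
    return numer / denom


def summarize(pass_t: int, n_t: int, pass_c: int, n_c: int) -> dict:
    """Pass rates, uplift (the "gap"), and p-value for one skill."""
    rate_t = pass_t / n_t  # with the skill (treatment)
    rate_c = pass_c / n_c  # without the skill (control)
    return {
        "treatment_pass_rate": rate_t,
        "control_pass_rate": rate_c,
        "uplift": rate_t - rate_c,
        "p_value": fisher_exact_one_sided(pass_t, n_t, pass_c, n_c),
    }


# Illustrative counts only: 15/20 passes with the skill, 8/20 without.
report = summarize(pass_t=15, n_t=20, pass_c=8, n_c=20)
for key, value in report.items():
    print(f"{key}: {value:.3f}")
```

An exact test avoids the normal approximation, which is unreliable at N=20, at the cost of being somewhat conservative.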
ABEvalFlow/
├── Docs/                          # ADR, implementation plan, guides
├── pipeline/
│   ├── pipeline.yaml              # Main pipeline definition
│   ├── triggers/                  # EventListener, TriggerTemplate, TriggerBinding
│   └── tasks/
│       ├── validate.yaml
│       ├── generate_tests.yaml
│       ├── test-quality-review.yaml
│       ├── security-scan.yaml
│       ├── scaffold.yaml
│       ├── build-push.yaml
│       ├── harbor-eval.yaml
│       ├── analyze-report.yaml
│       └── publish-store.yaml
├── templates/                     # Jinja2 templates (Dockerfiles, test.sh, task.toml)
├── scripts/                       # Python scripts invoked by pipeline tasks
├── config/                        # K8s manifests (RBAC, PostgreSQL, LiteLLM)
└── tests/                         # Unit and integration tests
| Repository | Purpose |
|---|---|
| skill-submissions | Submission intake — users push skills here to trigger evaluation |
| skills_eval_corrections | Harbor fork with OpenShift backend |
| cisco-ai-defense/skill-scanner | Security scanner for prompt injection and data exfiltration detection |
Harbor submissions use this layout:
my-skill-name/
├── instruction.md                 # Task description (required)
├── skills/
│   └── SKILL.md                   # Skill definition (required)
├── tests/
│   ├── test_outputs.py            # Verification tests (required)
│   └── llm_judge.py               # LLM-based judge (optional)
├── docs/                          # Reference documentation (optional)
├── supportive/                    # Mock MCPs, data files (optional, <50MB)
└── metadata.yaml                  # Name, persona, etc. (required)
ASE submissions use a lighter layout:
my-skill-name/
├── skills/
│   └── SKILL.md                   # Skill definition (required)
├── evals/
│   ├── evals.json                 # Evaluation prompts and assertions (optional, generated if missing)
│   └── files/                     # Test data files (optional)
└── metadata.yaml                  # Name, etc. (required)
To evaluate a submission in this format, trigger the pipeline with the `eval-engine=ase` parameter. See the Trigger Guide for details.
The pipeline is LLM-agnostic. Three modes are supported:
| Mode | Proxy Required? |
|---|---|
| Direct API key (Anthropic, OpenAI, etc.) | No |
| opencode + self-hosted model (vLLM, Ollama) | No |
| Google Vertex AI + LiteLLM proxy | Yes |
- OpenShift cluster with Pipelines operator (Tekton)
- Container registry (Quay.io) with push credentials
- Harbor fork with OpenShift backend
- LLM access (one of the three modes above)
- Python 3.11+
- Trigger Guide — How to submit skills for evaluation
- ADR: Skill Evaluation Pipeline
Apache License 2.0