Merged
4 changes: 4 additions & 0 deletions .codex/verify.commands
@@ -0,0 +1,4 @@
# codex-os-managed
pnpm install
pnpm run build
cargo test --manifest-path src-tauri/Cargo.toml
18 changes: 18 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,18 @@
name: Test (Rust)
on:
  push:
    branches: [main, 'feat/**']
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - uses: Swatinem/rust-cache@v2
        with:
          workspaces: src-tauri
      - run: cd src-tauri && cargo clippy -- -D warnings
      - run: cd src-tauri && cargo nextest run || cargo test
43 changes: 43 additions & 0 deletions AGENTS.md
@@ -0,0 +1,43 @@
# ModelColosseum Codex Playbook

## Communication Contract

Follow the global Codex communication contract. Keep updates short, beginner-friendly, and focused on what changed, what passed, and what still needs attention.

## Project Goal

ModelColosseum is a local-first Tauri 2 desktop app for evaluating Ollama models through arenas, benchmarks, sparring, scorecards, and a SQLite-backed leaderboard.

## First Read

- `README.md`
- `CLAUDE.md`
- `src-tauri/Cargo.toml`
- `.codex/verify.commands`

## Core Rules

- Keep all model calls local to Ollama unless the user explicitly changes the product contract.
- Do not add telemetry, cloud sync, or remote judging.
- Keep SQLite as the source of truth under the app data path.
- Keep Rust responsible for Ollama communication, scoring, Elo calculations, database writes, and streaming events.
- Frontend should stay presentational/stateful; avoid duplicating scoring or persistence rules in React.
- Do not assume Ollama is running; health check and fail gracefully.
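
The last rule above (health check, then fail gracefully) could be sketched roughly as follows. This is a minimal, std-only illustration and not the repository's actual client: `OllamaStatus` and `probe_ollama` are hypothetical names, and a bare TCP connect only shows that something is listening on the port. A fuller check would presumably issue an HTTP request through the project's `reqwest` client.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Outcome of a reachability probe (hypothetical type, not from the repo).
#[derive(Debug, PartialEq)]
pub enum OllamaStatus {
    Reachable,
    /// Carries a message that is safe to show in the UI.
    Unreachable(String),
}

/// Try a TCP connect with a short timeout instead of assuming the server is up.
/// Assumption: a successful connect is treated as "healthy enough" for this sketch.
pub fn probe_ollama(addr: SocketAddr, timeout: Duration) -> OllamaStatus {
    match TcpStream::connect_timeout(&addr, timeout) {
        Ok(_) => OllamaStatus::Reachable,
        Err(e) => OllamaStatus::Unreachable(format!("Ollama not reachable at {addr}: {e}")),
    }
}

fn main() {
    // Default Ollama port, as used elsewhere in this playbook (localhost:11434).
    let addr = SocketAddr::from(([127, 0, 0, 1], 11434));
    match probe_ollama(addr, Duration::from_millis(500)) {
        OllamaStatus::Reachable => println!("Ollama is up"),
        OllamaStatus::Unreachable(msg) => println!("{msg}"),
    }
}
```

Returning a status value rather than panicking lets the caller decide how to degrade, for example by disabling arena controls until the server is back.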

## Codex App Usage

- Use Codex App Projects for repo-scoped implementation, debugging, and verification.
- Use Worktrees for debate engine, benchmark runner, auto-judge, Elo, database migration, Ollama streaming, import/export, or Tauri capability changes.
- Use file search before editing because behavior spans Rust engines, SQLite schema, prompt templates, Tauri commands/events, and React mode views.
- Use app-window or browser evidence when arena, benchmark, sparring, leaderboard, settings, or export UI changes.
- Use artifacts when benchmark results, scorecards, or comparison reports need reusable review.

## Verification

Use `.codex/verify.commands` as the canonical local gate. Current session note: Rust tests pass, while frontend build is blocked until `esbuild` is approved through pnpm build approval.

## Done Criteria

- The relevant verifier commands have been run, or the exact blocker is recorded.
- Scoring, Elo, benchmark, and database changes have focused tests or fixture evidence.
- UI changes have app-window or screenshot evidence when visual behavior matters.
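
As an illustration of what "focused tests" for Elo changes might look like, here is a small sketch of the standard Elo update with one easily checked case. The function names and the K-factor of 32 are assumptions made for this example; the repository's Elo module may use different signatures and constants.

```rust
/// Expected score of player A against player B under the standard Elo model.
pub fn expected_score(rating_a: f64, rating_b: f64) -> f64 {
    1.0 / (1.0 + 10f64.powf((rating_b - rating_a) / 400.0))
}

/// Returns the updated `(a, b)` ratings.
/// `score_a` is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
/// `k` is the K-factor (32.0 below is an assumption, not a value from the repo).
pub fn update_elo(rating_a: f64, rating_b: f64, score_a: f64, k: f64) -> (f64, f64) {
    let delta = k * (score_a - expected_score(rating_a, rating_b));
    // The update is zero-sum: whatever A gains, B loses.
    (rating_a + delta, rating_b - delta)
}

fn main() {
    // Equal ratings give an expected score of 0.5, so a win moves 16 points at K = 32.
    let (a, b) = update_elo(1500.0, 1500.0, 1.0, 32.0);
    println!("{a:.1} {b:.1}"); // prints "1516.0 1484.0"
}
```

A fixture like the equal-ratings case is cheap to assert on and catches sign or K-factor regressions early.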
61 changes: 61 additions & 0 deletions CLAUDE.md
@@ -69,3 +69,64 @@ Key modules:
- Do not use class components in React — hooks only
- Do not store any data outside `~/.model-colosseum/` — single source of truth
- Do not assume Ollama is running — always health check first and handle absence gracefully

<!-- portfolio-context:start -->
# Portfolio Context

## What This Project Is

ModelColosseum is an active local project in the /Users/d/Projects portfolio.

## Current State

**v1.0.0 — Feature Complete** (all phases done, audit remediation applied)

- [x] **Phase 0: Foundation** — Tauri 2.0 scaffold, SQLite (13 tables, WAL), Ollama REST client, Elo module
- [x] **Phase 1: Arena Mode** — Debate engine (freestyle/formal/socratic), vote + Elo, leaderboard, history
- [x] **Phase 2: Benchmark** — CRUD suites/prompts, runner with TTFT/TPS metrics, manual + auto-judge scoring, blind comparison, hardware metrics, import/export
- [x] **Phase 3: Sparring Ring** — Human vs AI debates, 3 difficulty levels, 4-phase structure, scorecards, user Elo
- [x] **Phase 4: Polish** — 3 debate formats, topic suggestions, settings page, blind test, animations, skeleton loading, export (Markdown/CSV/JSON)
- [x] **Audit** — Security hardening (configurable Ollama URL, query limit caps, settings key whitelist), accessibility (ARIA attributes), error handling, 67 Rust tests
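
For readers unfamiliar with the TTFT/TPS metrics mentioned in Phase 2, the sketch below shows one plausible way to derive them from stream timings. It is an assumption-laden illustration: `StreamMetrics` and `stream_metrics` are hypothetical names, and tokens per second is measured here over the generation window after the first token, which may differ from how the actual benchmark runner defines it.

```rust
use std::time::Duration;

/// Timing summary for one streamed response (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct StreamMetrics {
    /// Time to first token, measured from request start.
    pub ttft: Duration,
    /// Tokens per second over the window between first token and completion.
    pub tokens_per_second: f64,
}

/// Both timestamps are offsets from the moment the request was sent.
/// Returns `None` when the inputs cannot yield a meaningful rate.
pub fn stream_metrics(
    first_token_at: Duration,
    finished_at: Duration,
    tokens: u32,
) -> Option<StreamMetrics> {
    // `checked_sub` yields None if the stream "finished" before the first token.
    let generation = finished_at.checked_sub(first_token_at)?;
    if tokens == 0 || generation.is_zero() {
        return None;
    }
    Some(StreamMetrics {
        ttft: first_token_at,
        tokens_per_second: f64::from(tokens) / generation.as_secs_f64(),
    })
}

fn main() {
    // 100 tokens, first token at 250 ms, done at 2250 ms: 100 / 2.0 s = 50 tokens/s.
    let m = stream_metrics(Duration::from_millis(250), Duration::from_millis(2250), 100);
    println!("{m:?}");
}
```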

## Stack

- Runtime: Tauri 2.x (Rust backend + webview frontend)
- Frontend: React 19 + TypeScript 5.x strict mode
- Build: Vite 6.x with `@tauri-apps/vite-plugin`
- Styling: Tailwind CSS 4.x (dark theme, gold/amber accents)
- State: Zustand 5.x
- Routing: React Router 7.x
- Charts: Recharts 2.x
- Database: SQLite via `rusqlite` 0.31+ (bundled, WAL mode)
- HTTP: `reqwest` 0.12+ (async streaming)
- Async: `tokio` 1.x
- System info: `sysinfo` 0.31+
- LLM: Ollama REST API (localhost:11434)

## How To Run

- TypeScript strict mode. No `any` types.
- React: Functional components with hooks only. No class components.
- Rust: `clippy` clean. `cargo fmt` on save.
- File naming: `snake_case.rs` for Rust, `PascalCase.tsx` for React components, `camelCase.ts` for utilities
- Git commits: conventional commits (`feat:`, `fix:`, `refactor:`, `chore:`)
- All Tauri commands return `Result<T, String>` — handle errors in Rust, display in frontend
- Database writes wrapped in explicit transactions
- No unwrap() in production Rust code — use ? operator or proper error handling
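
The `Result<T, String>` and no-`unwrap()` conventions above might look like this in practice. The sketch uses plain functions so it compiles without the `tauri` crate; in the app, such a function would carry the `#[tauri::command]` attribute. `parse_round_count`, `total_turns`, and `MAX_ROUNDS` are hypothetical names chosen for illustration.

```rust
/// Hypothetical upper bound on debate rounds; not a value from the repo.
const MAX_ROUNDS: u32 = 10;

/// Parse user input into a round count, converting every failure into a
/// `String` so the frontend can display it directly.
pub fn parse_round_count(raw: &str) -> Result<u32, String> {
    let rounds = raw
        .trim()
        .parse::<u32>()
        .map_err(|e| format!("invalid round count {raw:?}: {e}"))?;
    if rounds == 0 || rounds > MAX_ROUNDS {
        return Err(format!("round count must be between 1 and {MAX_ROUNDS}, got {rounds}"));
    }
    Ok(rounds)
}

/// Errors from the helper propagate with `?` instead of `unwrap()`.
pub fn total_turns(raw_rounds: &str, debaters: u32) -> Result<u32, String> {
    let rounds = parse_round_count(raw_rounds)?;
    rounds
        .checked_mul(debaters)
        .ok_or_else(|| "turn count overflows u32".to_string())
}

fn main() {
    match total_turns(" 3 ", 2) {
        Ok(turns) => println!("total turns: {turns}"),
        Err(msg) => println!("error: {msg}"),
    }
}
```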

## Known Risks

- Do not scaffold the entire project in one session — follow the phased plan strictly
- Do not use Tauri v1 APIs or import paths — this is Tauri 2.x (`@tauri-apps/api` v2)
- Do not use `tauri-plugin-sql` — we use `rusqlite` directly
- Do not use `unwrap()` in Rust production code — use `?` or proper error handling
- Do not make any network calls except to localhost Ollama (no telemetry, no cloud)
- Do not use class components in React — hooks only
- Do not store any data outside `~/.model-colosseum/` — single source of truth
- Do not assume Ollama is running — always health check first and handle absence gracefully

## Next Recommended Move

Use this context plus the README and supporting docs to resume the next active task, then promote the repo beyond minimum-viable by capturing a dedicated handoff, roadmap, or discovery artifact.

<!-- portfolio-context:end -->
2 changes: 2 additions & 0 deletions pnpm-workspace.yaml
@@ -0,0 +1,2 @@
allowBuilds:
  esbuild: true