diff --git a/.codex/verify.commands b/.codex/verify.commands
new file mode 100644
index 0000000..ea68377
--- /dev/null
+++ b/.codex/verify.commands
@@ -0,0 +1,4 @@
+# codex-os-managed
+pnpm install
+pnpm run build
+cargo test --manifest-path src-tauri/Cargo.toml
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
new file mode 100644
index 0000000..c296305
--- /dev/null
+++ b/.github/workflows/test.yml
@@ -0,0 +1,18 @@
+name: Test (Rust)
+on:
+  push:
+    branches: [main, 'feat/**']
+  pull_request:
+    branches: [main]
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: dtolnay/rust-toolchain@stable
+      - uses: Swatinem/rust-cache@v2
+        with:
+          workspaces: src-tauri
+      - run: cd src-tauri && cargo clippy -- -D warnings
+      - run: cd src-tauri && cargo nextest run || cargo test
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000..3cf7515
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,43 @@
+# ModelColosseum Codex Playbook
+
+## Communication Contract
+
+Follow the global Codex communication contract. Keep updates short, beginner-friendly, and focused on what changed, what passed, and what still needs attention.
+
+## Project Goal
+
+ModelColosseum is a local-first Tauri 2 desktop app for evaluating Ollama models through arenas, benchmarks, sparring, scorecards, and a SQLite-backed leaderboard.
+
+## First Read
+
+- `README.md`
+- `CLAUDE.md`
+- `src-tauri/Cargo.toml`
+- `.codex/verify.commands`
+
+## Core Rules
+
+- Keep all model calls local to Ollama unless the user explicitly changes the product contract.
+- Do not add telemetry, cloud sync, or remote judging.
+- Keep SQLite as the source of truth under the app data path.
+- Keep Rust responsible for Ollama communication, scoring, Elo calculations, database writes, and streaming events.
+- Frontend should stay presentational/stateful; avoid duplicating scoring or persistence rules in React.
+- Do not assume Ollama is running; health check and fail gracefully.
+
+## Codex App Usage
+
+- Use Codex App Projects for repo-scoped implementation, debugging, and verification.
+- Use Worktrees for debate engine, benchmark runner, auto-judge, Elo, database migration, Ollama streaming, import/export, or Tauri capability changes.
+- Use file search before editing because behavior spans Rust engines, SQLite schema, prompt templates, Tauri commands/events, and React mode views.
+- Use app-window or browser evidence when arena, benchmark, sparring, leaderboard, settings, or export UI changes.
+- Use artifacts when benchmark results, scorecards, or comparison reports need reusable review.
+
+## Verification
+
+Use `.codex/verify.commands` as the canonical local gate. Current session note: Rust tests pass, while frontend build is blocked until `esbuild` is approved through pnpm build approval.
+
+## Done Criteria
+
+- The relevant verifier commands have been run, or the exact blocker is recorded.
+- Scoring, Elo, benchmark, and database changes have focused tests or fixture evidence.
+- UI changes have app-window or screenshot evidence when visual behavior matters.
diff --git a/CLAUDE.md b/CLAUDE.md
index 8975dbe..7326eb6 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -69,3 +69,64 @@ Key modules:
 - Do not use class components in React — hooks only
 - Do not store any data outside `~/.model-colosseum/` — single source of truth
 - Do not assume Ollama is running — always health check first and handle absence gracefully
+
+
+# Portfolio Context
+
+## What This Project Is
+
+ModelColosseum is an active local project in the /Users/d/Projects portfolio.
+
+## Current State
+
+**v1.0.0 — Feature Complete** (all phases done, audit remediation applied)
+
+- [x] **Phase 0: Foundation** — Tauri 2.0 scaffold, SQLite (13 tables, WAL), Ollama REST client, Elo module
+- [x] **Phase 1: Arena Mode** — Debate engine (freestyle/formal/socratic), vote + Elo, leaderboard, history
+- [x] **Phase 2: Benchmark** — CRUD suites/prompts, runner with TTFT/TPS metrics, manual + auto-judge scoring, blind comparison, hardware metrics, import/export
+- [x] **Phase 3: Sparring Ring** — Human vs AI debates, 3 difficulty levels, 4-phase structure, scorecards, user Elo
+- [x] **Phase 4: Polish** — 3 debate formats, topic suggestions, settings page, blind test, animations, skeleton loading, export (Markdown/CSV/JSON)
+- [x] **Audit** — Security hardening (configurable Ollama URL, query limit caps, settings key whitelist), accessibility (ARIA attributes), error handling, 67 Rust tests
+
+## Stack
+
+- Runtime: Tauri 2.x (Rust backend + webview frontend)
+- Frontend: React 19 + TypeScript 5.x strict mode
+- Build: Vite 6.x with `@tauri-apps/vite-plugin`
+- Styling: Tailwind CSS 4.x (dark theme, gold/amber accents)
+- State: Zustand 5.x
+- Routing: React Router 7.x
+- Charts: Recharts 2.x
+- Database: SQLite via `rusqlite` 0.31+ (bundled, WAL mode)
+- HTTP: `reqwest` 0.12+ (async streaming)
+- Async: `tokio` 1.x
+- System info: `sysinfo` 0.31+
+- LLM: Ollama REST API (localhost:11434)
+
+## How To Run
+
+- TypeScript strict mode. No `any` types.
+- React: Functional components with hooks only. No class components.
+- Rust: `clippy` clean. `cargo fmt` on save.
+- File naming: `snake_case.rs` for Rust, `PascalCase.tsx` for React components, `camelCase.ts` for utilities
+- Git commits: conventional commits (`feat:`, `fix:`, `refactor:`, `chore:`)
+- All Tauri commands return `Result` — handle errors in Rust, display in frontend
+- Database writes wrapped in explicit transactions
+- No unwrap() in production Rust code — use ? operator or proper error handling
+
+## Known Risks
+
+- Do not scaffold the entire project in one session — follow the phased plan strictly
+- Do not use Tauri v1 APIs or import paths — this is Tauri 2.x (`@tauri-apps/api` v2)
+- Do not use `tauri-plugin-sql` — we use `rusqlite` directly
+- Do not use `unwrap()` in Rust production code — use `?` or proper error handling
+- Do not make any network calls except to localhost Ollama (no telemetry, no cloud)
+- Do not use class components in React — hooks only
+- Do not store any data outside `~/.model-colosseum/` — single source of truth
+- Do not assume Ollama is running — always health check first and handle absence gracefully
+
+## Next Recommended Move
+
+Use this context plus the README and supporting docs to resume the next active task, then promote the repo beyond minimum-viable by capturing a dedicated handoff, roadmap, or discovery artifact.
+
+
diff --git a/pnpm-workspace.yaml b/pnpm-workspace.yaml
new file mode 100644
index 0000000..5ed0b5a
--- /dev/null
+++ b/pnpm-workspace.yaml
@@ -0,0 +1,2 @@
+allowBuilds:
+  esbuild: true
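
Review note: AGENTS.md keeps Elo calculations in Rust and asks for focused tests on scoring changes. As a reference point for such tests, here is a minimal sketch of the textbook Elo update. It is not taken from this repository: the function names `expected_score` and `update_elo`, the 400-point scale, and the K-factor passed in by the caller are all assumptions, and the project's Elo module may differ.

```rust
/// Expected score of player A against player B under the standard Elo model.
/// Hypothetical helper for illustration; not the project's actual API.
fn expected_score(rating_a: f64, rating_b: f64) -> f64 {
    1.0 / (1.0 + 10f64.powf((rating_b - rating_a) / 400.0))
}

/// Returns the updated (A, B) ratings after one match.
/// `score_a` is 1.0 for an A win, 0.5 for a draw, 0.0 for an A loss.
/// `k` is the K-factor; its value here is an assumption left to the caller.
fn update_elo(rating_a: f64, rating_b: f64, score_a: f64, k: f64) -> (f64, f64) {
    let expected_a = expected_score(rating_a, rating_b);
    // Whatever A gains, B loses, so the rating pool stays constant.
    let delta = k * (score_a - expected_a);
    (rating_a + delta, rating_b - delta)
}
```

With two models both rated 1500 and an assumed K of 32, a win gives `update_elo(1500.0, 1500.0, 1.0, 32.0)` a result of `(1516.0, 1484.0)`. The zero-sum property makes a convenient invariant for a fixture test.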