diff --git a/CHANGELOG.md b/CHANGELOG.md
index 2699da9e..991f0c26 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,34 @@
 
 All notable changes to this project will be documented in this file. See [commit-and-tag-version](https://github.com/absolute-version/commit-and-tag-version) for commit guidelines.
 
+## [3.1.5](https://github.com/optave/codegraph/compare/v3.1.4...v3.1.5) (2026-03-16)
+
+**Phase 3 architectural refactoring completes.** This release finishes the remaining two Phase 3 roadmap tasks — domain directory grouping (3.15) and CLI composability (3.16) — bringing Phase 3 to 14 of 14 tasks complete. The `src/` directory is now reorganized into `domain/`, `features/`, and `presentation/` layers. A new `openGraph()` helper eliminates DB-open/close boilerplate across CLI commands, and a universal output formatter adds `--table` and `--csv` output to all commands. Several post-reorganization bugs are fixed: complexity/CFG/dataflow analysis restored after the move, MCP server imports corrected, worktree boundary escapes prevented, CJS `require()` support added, and LIKE wildcard injection in queries patched.
+
+### Features
+
+* **cli:** `openGraph()` helper and universal output formatter with `--table` and `--csv` output formats — eliminates per-command DB boilerplate and format-switching logic ([#461](https://github.com/optave/codegraph/pull/461))
+
+### Bug Fixes
+
+* **builder:** restore complexity/CFG/dataflow analysis that silently stopped running after src/ reorganization ([#469](https://github.com/optave/codegraph/pull/469))
+* **db:** prevent `findDbPath` from escaping git worktree boundary — stops codegraph from accidentally using a parent repo's database ([#457](https://github.com/optave/codegraph/pull/457))
+* **mcp:** update MCP server import path after src/ reorganization ([#466](https://github.com/optave/codegraph/pull/466))
+* **api:** add CJS `require()` support to package exports — fixes `ERR_REQUIRE_ESM` for CommonJS consumers ([#472](https://github.com/optave/codegraph/pull/472))
+* **db:** escape LIKE wildcards in `NodeQuery.fileFilter` and `nameLike` — prevents filenames containing `%` or `_` from matching unrelated rows ([#446](https://github.com/optave/codegraph/pull/446))
+
+### Refactors
+
+* **architecture:** reorganize `src/` into `domain/`, `features/`, `presentation/` layers — completes Phase 3.15 domain directory grouping ([#456](https://github.com/optave/codegraph/pull/456))
+* **architecture:** move remaining flat `src/` files into subdirectories ([#458](https://github.com/optave/codegraph/pull/458))
+* **architecture:** resolve three post-reorganization issues (circular imports, barrel exports, path corrections) ([#459](https://github.com/optave/codegraph/pull/459))
+* **queries:** deduplicate BFS impact traversal and centralize config loading ([#463](https://github.com/optave/codegraph/pull/463))
+* **tests:** migrate integration tests to InMemoryRepository for faster execution ([#462](https://github.com/optave/codegraph/pull/462))
+
+### Tests
+
+* **db:** add `findRepoRoot` and `findDbPath` ceiling boundary tests ([#475](https://github.com/optave/codegraph/pull/475))
+
 ## [3.1.4](https://github.com/optave/codegraph/compare/v3.1.3...v3.1.4) (2026-03-16)
 
 **Phase 3 architectural refactoring reaches near-completion.** This release delivers 11 of 14 roadmap tasks in Phase 3 (Vertical Slice Architecture), restructuring the codebase from a flat collection of large files into a modular subsystem layout. The 3,395-line `queries.js` is decomposed into `src/analysis/` and `src/shared/` modules. The MCP tool registry becomes composable. CLI commands are self-contained modules under `src/commands/`. A domain error hierarchy replaces ad-hoc throws. The build pipeline is decomposed into named stages. The embedder is extracted into `src/embeddings/` with pluggable stores and search strategies. A unified graph model (`src/graph/`) consolidates four parallel graph representations. Nodes gain qualified names, hierarchical scoping, and visibility metadata. An `InMemoryRepository` enables fast unit testing without SQLite. The presentation layer (`src/presentation/`) separates all output formatting from domain logic. `better-sqlite3` is bumped to 12.8.0.
diff --git a/README.md b/README.md
index 391946e9..5861d19e 100644
--- a/README.md
+++ b/README.md
@@ -477,6 +477,8 @@ codegraph registry remove # Unregister
 | `-f, --file ` | Scope to a specific file (`fn`, `context`, `where`) |
 | `--mode ` | Search mode: `hybrid` (default), `semantic`, or `keyword` (`search`) |
 | `--ndjson` | Output as newline-delimited JSON (one object per line) |
+| `--table` | Output as auto-column aligned table |
+| `--csv` | Output as CSV (RFC 4180, nested objects flattened) |
 | `--limit ` | Limit number of results |
 | `--offset ` | Skip first N results (pagination) |
 | `--rrf-k ` | RRF smoothing constant for multi-query search (default 60) |
@@ -775,7 +777,7 @@ See **[ROADMAP.md](docs/roadmap/ROADMAP.md)** for the full development roadmap a
 1. ~~**Rust Core**~~ — **Complete** (v1.3.0) — native tree-sitter parsing via napi-rs, parallel multi-core parsing, incremental re-parsing, import resolution & cycle detection in Rust
 2. ~~**Foundation Hardening**~~ — **Complete** (v1.4.0) — parser registry, 12-tool MCP server with multi-repo support, test coverage 62%→75%, `apiKeyCommand` secret resolution, global repo registry
 3. ~~**Deep Analysis**~~ — **Complete** (v3.0.0) — dataflow analysis (flows_to, returns, mutates), intraprocedural CFG for all 11 languages, stored AST nodes, expanded node/edge types (parameter, property, constant, contains, parameter_of, receiver), GraphML/GraphSON/Neo4j CSV export, interactive HTML viewer, CLI consolidation, stable JSON schema
-4. **Architectural Refactoring** — **In Progress** (v3.1.4) — unified AST analysis, composable MCP, domain errors, builder pipeline, embedder subsystem, graph model, qualified names, presentation layer, InMemoryRepository (11/14 tasks complete)
+4. ~~**Architectural Refactoring**~~ — **Complete** (v3.1.5) — unified AST analysis, composable MCP, domain errors, builder pipeline, embedder subsystem, graph model, qualified names, presentation layer, InMemoryRepository, domain directory grouping, CLI composability
 5. **Natural Language Queries** — `codegraph ask` command, conversational sessions
 6. **Expanded Language Support** — 8 new languages (12 → 20)
 7. **GitHub Integration & CI** — reusable GitHub Action, PR review, SARIF output
diff --git a/docs/roadmap/ROADMAP.md b/docs/roadmap/ROADMAP.md
index 3f0c2abe..f4297e85 100644
--- a/docs/roadmap/ROADMAP.md
+++ b/docs/roadmap/ROADMAP.md
@@ -1,6 +1,6 @@
 # Codegraph Roadmap
 
-> **Current version:** 3.1.4 | **Status:** Active development | **Updated:** March 2026
+> **Current version:** 3.1.5 | **Status:** Active development | **Updated:** March 2026
 
 Codegraph is a strong local-first code graph CLI. This roadmap describes planned improvements across eleven phases -- closing gaps with commercial code intelligence platforms while preserving codegraph's core strengths: fully local, open source, zero cloud dependency by default.
 
@@ -16,7 +16,7 @@ Codegraph is a strong local-first code graph CLI. This roadmap describes planned
 | [**2**](#phase-2--foundation-hardening) | Foundation Hardening | Parser registry, complete MCP, test coverage, enhanced config, multi-repo MCP | **Complete** (v1.5.0) |
 | [**2.5**](#phase-25--analysis-expansion) | Analysis Expansion | Complexity metrics, community detection, flow tracing, co-change, manifesto, boundary rules, check, triage, audit, batch, hybrid search | **Complete** (v2.7.0) |
 | [**2.7**](#phase-27--deep-analysis--graph-enrichment) | Deep Analysis & Graph Enrichment | Dataflow analysis, intraprocedural CFG, AST node storage, expanded node/edge types, extractors refactoring, CLI consolidation, interactive viewer, exports command, normalizeSymbol | **Complete** (v3.0.0) |
-| [**3**](#phase-3--architectural-refactoring) | Architectural Refactoring (Vertical Slice) | Unified AST analysis framework, command/query separation, repository pattern, queries.js decomposition, composable MCP, CLI commands, domain errors, builder pipeline, presentation layer, domain grouping, curated API, unified graph model, qualified names, CLI composability | **In Progress** (v3.1.4) |
+| [**3**](#phase-3--architectural-refactoring) | Architectural Refactoring (Vertical Slice) | Unified AST analysis framework, command/query separation, repository pattern, queries.js decomposition, composable MCP, CLI commands, domain errors, builder pipeline, presentation layer, domain grouping, curated API, unified graph model, qualified names, CLI composability | **Complete** (v3.1.5) |
 | [**4**](#phase-4--native-analysis-acceleration) | Native Analysis Acceleration | Move JS-only build phases (AST nodes, CFG, dataflow, insert nodes, structure, roles, complexity) to Rust; fix incremental rebuild data loss on native; sub-100ms 1-file rebuilds | Planned |
 | [**5**](#phase-5--typescript-migration) | TypeScript Migration | Project setup, core type definitions, leaf -> core -> orchestration module migration, test migration, supply-chain security, CI coverage gates | Planned |
 | [**6**](#phase-6--runtime--extensibility) | Runtime & Extensibility | Event-driven pipeline, unified engine strategy, subgraph export filtering, transitive confidence, query caching, configuration profiles, pagination, plugin system, DX & onboarding | Planned |
@@ -556,14 +556,12 @@ Plus updated enums on existing tools (edge_kinds, symbol kinds).
 
 ---
 
-## Phase 3 -- Architectural Refactoring 🔄
+## Phase 3 -- Architectural Refactoring ✅
 
-> **Status:** In Progress -- started in v3.1.1
+> **Status:** Complete -- started in v3.1.1, finished in v3.1.5
 
 **Goal:** Restructure the codebase for modularity, testability, and long-term maintainability. These are internal improvements -- no new user-facing features, but they make every subsequent phase easier to build and maintain.
 
-> Reference: [generated/architecture.md](../../generated/architecture.md) -- full analysis with code examples and rationale.
-
 **Architecture pattern: Vertical Slice Architecture.** Each CLI command is a natural vertical slice — thin command entry point → domain logic → data access → formatted output. This avoids the overhead of layered patterns (Hexagonal, Clean Architecture) that would create abstractions with only one implementation, while giving clear boundaries and independent testability per feature. The target end-state directory structure:
 
 ```
@@ -960,34 +958,34 @@ src/
 
 **Affected files:** `src/viewer.js`, `src/export.js`, `src/sequence.js`, `src/infrastructure/result-formatter.js`
 
-### 3.15 -- Domain Directory Grouping
+### 3.15 -- Domain Directory Grouping ✅
 
-Once 3.2-3.4 are complete and analysis modules are standalone, group them under `src/domain/` by feature area. This is a move-only refactor — no logic changes, just directory organization to match the vertical slice target structure.
+**Completed:** `src/` reorganized into `domain/`, `features/`, and `presentation/` layers ([#456](https://github.com/optave/codegraph/pull/456), [#458](https://github.com/optave/codegraph/pull/458)). Three post-reorganization issues (circular imports, barrel exports, path corrections) resolved in [#459](https://github.com/optave/codegraph/pull/459). MCP server import path fixed in [#466](https://github.com/optave/codegraph/pull/466). Complexity/CFG/dataflow analysis restored after the move in [#469](https://github.com/optave/codegraph/pull/469).
 
 ```
 src/domain/
-  graph/       # builder.js, resolve.js, cycles.js, watcher.js
+  graph/       # builder.js, resolve.js, cycles.js, watcher.js, journal.js, change-journal.js
   analysis/    # symbol-lookup.js, impact.js, dependencies.js, module-map.js,
-               # context.js, exports.js, roles.js (from 3.4 decomposition)
-  search/      # embedder.js subsystem (from 3.10)
+               # context.js, exports.js, roles.js
+  search/      # embedder subsystem (models, generator, stores, search strategies)
+  parser.js    # tree-sitter WASM wrapper + LANGUAGE_REGISTRY
+  queries.js   # Query functions (symbol search, file deps, impact analysis)
 ```
 
-- 🔲 Move builder pipeline modules to `domain/graph/`
-- 🔲 Move decomposed query modules (from 3.4) to `domain/analysis/`
-- 🔲 Move embedder subsystem (from 3.10) to `domain/search/`
-- 🔲 Update all import paths across codebase
-- 🔲 Update `package.json` exports map (from 3.7)
+- ✅ Move builder pipeline modules to `domain/graph/` ([#456](https://github.com/optave/codegraph/pull/456))
+- ✅ Move decomposed query modules (from 3.4) to `domain/analysis/` ([#456](https://github.com/optave/codegraph/pull/456))
+- ✅ Move embedder subsystem (from 3.10) to `domain/search/` ([#456](https://github.com/optave/codegraph/pull/456))
+- ✅ Move remaining flat files (`features/`, `presentation/`, `infrastructure/`, `shared/`) into subdirectories ([#458](https://github.com/optave/codegraph/pull/458))
+- ✅ Update all import paths across codebase ([#456](https://github.com/optave/codegraph/pull/456), [#458](https://github.com/optave/codegraph/pull/458), [#459](https://github.com/optave/codegraph/pull/459))
 
 **Prerequisite:** 3.2, 3.4, 3.9, 3.10 should be complete before this step — it organizes the results of those decompositions.
 
-### 3.16 -- CLI Composability
-
-Practical cleanup to make the CLI surface match the internal composability that `*Data()` functions and MCP already provide. Not a philosophical overhaul -- just eliminating duplication and making the human CLI path as clean as the programmatic one.
+### 3.16 -- CLI Composability ✅
 
-**Context:** The internal architecture is already well-layered -- pure `*Data()` functions, read/write separation, NDJSON support. The 3.6 refactor split the former 1,525-line `cli.js` into `src/cli/` with 40 command modules and an 8-line thin wrapper, but individual commands still repeat DB open/close boilerplate, and output formatting is scattered across command files. MCP and `batch_query` already solve in-process composition for AI agents; these items fix the equivalent gaps on the CLI side.
+**Completed:** `openGraph(opts)` helper eliminates DB-open/close boilerplate across CLI commands. `resolveQueryOpts(opts)` extracts the 5 repeated option fields into one call, refactoring 20 command files. Universal output formatter extended with `--table` (auto-column aligned) and `--csv` (RFC 4180 with nested object flattening) output formats ([#461](https://github.com/optave/codegraph/pull/461)).
 
-- **Extract shared `openGraph()` helper.** The thin dispatcher is done (3.6), but each of the 40 `commands/*.js` files still inlines its own DB-open / config-load / cleanup sequence. A single `openGraph(opts)` helper returning `{ db, rootDir, config }` with engine selection, config loading, and cleanup eliminates ~200 lines of per-command duplication.
-- **Universal output formatter.** Complete the existing `result-formatter.js` into a full presentation layer that handles `--json`, `--ndjson`, `--table`, `--csv` for any data function. Commands produce data; the formatter renders. Eliminates per-command format-switching logic.
+- ✅ **`openGraph()` helper** — single helper returning `{ db, rootDir, config }` with engine selection, config loading, and cleanup ([#461](https://github.com/optave/codegraph/pull/461))
+- ✅ **Universal output formatter** — `outputResult()` extended with `--table` and `--csv` formats; `resolveQueryOpts()` extracts repeated option fields ([#461](https://github.com/optave/codegraph/pull/461))
 
 **Affected files:** `src/cli/commands/*.js`, `src/cli/shared/`, `src/presentation/result-formatter.js`
 
diff --git a/generated/architecture.md b/generated/architecture.md
deleted file mode 100644
index c803fb07..00000000
--- a/generated/architecture.md
+++ /dev/null
@@ -1,457 +0,0 @@
-# Codegraph Architectural Audit — Revised Analysis
-
-> **Scope:** Unconstrained redesign proposals. No consideration for migration effort or backwards compatibility. What would the ideal architecture look like?
->
-> **Revision context:** The original audit (Feb 22, 2026) analyzed v1.4.0 with ~12 source modules totaling ~5K lines. The first revision (Mar 2, 2026) covered v2.6.0 with 35 modules totaling 17,830 lines. Since then, a rapid expansion added 6 new modules (cfg, ast, dataflow, viewer, extractors refactor, CLI consolidation), 4 new DB tables, 3 new node kinds, 3 new edge kinds, and 9 new MCP tools — all in a single day. The codebase now stands at 50 source modules totaling 26,277 lines. This revision re-evaluates every recommendation against the actual codebase as it stands today.
-
----
-
-## What Changed Since the Last Revision (Mar 2 → Mar 3, 2026)
-
-| Metric | Mar 2 (v2.6.0) | Mar 3 (post-PRs) | Delta |
-|--------|----------------|-------------------|-------|
-| Source modules | 35 | 50 (37 core + 11 extractors + 2 new) | +15 |
-| Total source lines | 17,830 | 26,277 | +47% |
-| `queries.js` | 3,110 lines | 3,395 lines | +285 |
-| `mcp.js` | 1,212 lines | 1,370 lines | +158 |
-| `cli.js` | 1,285 lines | 1,557 lines | +272 |
-| `builder.js` | 1,173 lines | 1,355 lines | +182 |
-| `cfg.js` | -- | 1,451 lines | New |
-| `dataflow.js` | -- | 1,187 lines | New |
-| `viewer.js` | -- | 948 lines | New |
-| `ast.js` | -- | 392 lines | New |
-| `db.js` | 317 lines | 392 lines | +75 |
-| `export.js` | 681 lines | 681 lines | unchanged |
-| DB tables | 9 | 13 | +4 |
-| DB migrations | v9 | v13 | +4 |
-| MCP tools | 25 | 34 | +9 |
-| CLI commands | 45 | 47 | +2 (net: +7 added, -5 consolidated) |
-| `index.js` exports | 120+ | 140+ (32 export lines) | +20 |
-| Test files | 59 | 70 | +11 |
-| Node kinds | 10 | 13 | +3 (parameter, property, constant) |
-| Edge kinds | 6 | 9 | +3 (contains, parameter_of, receiver) |
-| Extractor modules | 0 (inline in parser.js) | 11 files, 3,023 lines | New directory |
-
-**Key patterns observed in this burst:**
-
-1. **The dual-function anti-pattern was replicated 4 more times** (cfg.js, ast.js, dataflow.js, viewer.js) — each with its own `*Data()` / `*()` pair, DB opening, SQL, formatting. The pattern count went from 15 to 19 modules.
-
-2. **CFG introduced a third analysis engine pattern** alongside complexity and dataflow: language-specific rule maps keyed by AST node type, applied during a tree walk. Three modules now independently implement "per-language AST rules + engine walker" with no shared framework.
-
-3. **The extractors refactoring (PR #270) is the first genuine structural decomposition** — parser.js extractors split into `src/extractors/` with one file per language. This is the pattern the rest of the codebase should follow.
This is the pattern the rest of the codebase should follow. - -4. **Scope and parent hierarchy finally arrived** — `parent_id` column on `nodes`, `contains`/`parameter_of` edges, `children` query. This partially addresses the qualified names gap (item #11 in the previous revision). - -5. **CLI consolidation (PR #280) removed 5 commands** — the first time the project actively reduced surface area. `hotspots` merged into `triage`, `manifesto` into `check`, `explain` into `audit --quick`, `batch-query` into `batch where`, `query --path` into standalone `path`. - ---- - -## 1. The Dual-Function Anti-Pattern — Now 19 Modules Deep - -**Previous state:** 15 modules with `*Data()` / `*()` pairs. - -**Current state:** 19 modules. Four new additions: - -``` -cfg.js -> cfgData() / cfg() -ast.js -> astQueryData() / astQuery() -dataflow.js -> dataflowData() / dataflow(), dataflowPathData(), dataflowImpactData() -viewer.js -> prepareGraphData() / generatePlotHTML() -``` - -Plus queries.js grew two more pairs: `childrenData()` / `children()`, `exportsData()` / `fileExports()`. - -**Reinforced assessment:** Each new module independently handles DB opening, SQL execution, result shaping, pagination, CLI formatting, JSON output, and `--no-tests` filtering. The `cfg.js` module at 1,451 lines is the most extreme example — it contains CFG construction rules for 9 languages, a build phase, a query function, DOT/Mermaid formatters, and a CLI printer all in one file. - -**The ideal architecture is unchanged** — Command + Query separation with shared `CommandRunner` lifecycle. But the urgency increased: at the current rate of ~4 new dual-function modules per development sprint, the pattern will reach 25+ modules before any refactoring can happen. - ---- - -## 2. The Database Layer — 13 Tables Across 25+ Modules - -**Previous state:** 9 tables, SQL scattered across 20+ modules. - -**Current state:** 13 tables, SQL scattered across **25+ modules**. 
New tables: - -| Table | Migration | Module | Purpose | -|-------|-----------|--------|---------| -| `dataflow` | v10 | `dataflow.js` | flows_to, returns, mutates edges with confidence | -| `nodes.parent_id` | v11 | `builder.js` | Parent-child node hierarchy | -| `cfg_blocks` | v12 | `cfg.js` | Basic blocks per function | -| `cfg_edges` | v12 | `cfg.js` | Control flow edges between blocks | -| `ast_nodes` | v13 | `ast.js` | Stored queryable AST nodes (call, new, string, regex, throw, await) | - -Each new module follows the same pattern: import `openDb()`, write raw SQL with inline string construction, create its own prepared statements. `cfg.js` alone has ~20 SQL statements. - -**The repository pattern is now even more critical.** With 13 tables, the migration system in `db.js` is getting complex (392 lines, up from 317). The ideal decomposition into `db/connection.js`, `db/migrations.js`, `db/repository.js` is unchanged but higher urgency. - ---- - -## 3. queries.js at 3,395 Lines — Still Growing - -**Previous state:** 3,110 lines. - -**Current state:** 3,395 lines — gained 285 lines. New additions: -- `childrenData()` — query child symbols (parameters, properties, constants) -- `exportsData()` — per-symbol consumer analysis for file exports -- `CORE_SYMBOL_KINDS` (10) / `EXTENDED_SYMBOL_KINDS` (3) / `EVERY_SYMBOL_KIND` (13) — tiered kind constants -- `CORE_EDGE_KINDS` (6) / `STRUCTURAL_EDGE_KINDS` (3) / `EVERY_EDGE_KIND` (9) — tiered edge constants -- `normalizeSymbol()` — stable 7-field JSON shape for all queries - -**Positive development:** The constant hierarchy (`CORE_` / `EXTENDED_` / `EVERY_`) is well-designed and provides clean backward compatibility (`ALL_SYMBOL_KINDS = CORE_SYMBOL_KINDS`). The `normalizeSymbol()` utility enforces consistent output. These are **the right abstractions** — they just need to live in dedicated files (`shared/constants.js`, `shared/normalize.js`) rather than accumulating in the megafile. 
-
-**The decomposition plan from the previous revision still applies.** Add `shared/constants.js` for the kind/edge/role constants and `shared/normalize.js` for `normalizeSymbol` + `isTestFile` + `kindIcon`.
-
----
-
-## 4. MCP at 1,370 Lines with 34 Tools
-
-**Previous state:** 1,212 lines, 25 tools.
-
-**Current state:** 1,370 lines, 34 tools. Nine new tools added:
-
-| Tool | Source module |
-|------|-------------|
-| `cfg` | cfg.js |
-| `ast_query` | ast.js |
-| `dataflow` | dataflow.js |
-| `dataflow_path` | dataflow.js |
-| `dataflow_impact` | dataflow.js |
-| `file_exports` | queries.js |
-| `symbol_children` | queries.js |
-| `fn_impact` (extended kinds enum) | queries.js |
-| Various updated enums | (edge_kinds, symbol kinds) |
-
-**Positive development:** The MCP tools were **not** consolidated alongside the CLI (PR #280 removed 5 CLI commands but kept all MCP tools for backward compatibility). This is the right call for an MCP API — clients may depend on specific tool names.
-
-**The composable tool registry pattern is now more urgent.** At 34 tools in a single file, each addition requires coordinating the tool definition, the dispatch handler, and the import — three touch points. The one-file-per-tool registry pattern proposed in the previous revision would make each of the 34 tools independently maintainable.
-
----
-
-## 5. CLI at 1,557 Lines with 47 Commands — Consolidation Started
-
-**Previous state:** 1,285 lines, 45 commands.
-
-**Current state:** 1,557 lines, 47 commands. Net change: +7 new commands (cfg, ast, dataflow, dataflow-path, dataflow-impact, children, path), -5 consolidated commands (hotspots, manifesto, explain, batch-query, query --path).
-
-**Positive development:** PR #280 is the first CLI surface area reduction — 5 commands consolidated into existing ones. This is the right direction. `check` now subsumes `manifesto`, `triage` subsumes `hotspots`, `audit --quick` subsumes `explain`, `batch where` subsumes `batch-query`.
-
-**But the file still grew** because 7 new commands were added in parallel. The inline Commander.js pattern means each new command adds 20-40 lines of `.command().description().option().action()` boilerplate. The command object pattern from the previous revision would keep the entry point lean regardless of command count.
-
----
-
-## 6. cfg.js at 1,451 Lines — A New Monolith
-
-**Not in previous revision** — this module didn't exist.
-
-**Current state:** 1,451 lines containing:
-- `makeCfgRules(overrides)` — factory for language-specific CFG construction rules
-- `CFG_RULES` Map — rules for all 9 supported languages (JS/TS, Python, Go, Rust, Java, C#, PHP, Ruby)
-- `buildFunctionCFG(functionNode, langId)` — CFG construction from AST (basic blocks + control flow edges)
-- `buildCFGData(db, fileSymbols, rootDir)` — build-phase integration (write cfg_blocks/cfg_edges to DB)
-- `cfgData(name, customDbPath, opts)` — query function
-- `cfgToDOT()` / `cfgToMermaid()` — graph export formatters
-- `cfg(name, customDbPath, opts)` — CLI printer
-
-**Problem:** This is a miniature version of the `complexity.js` monolith. It has the same structure: per-language rules map + engine walker + DB integration + query + formatting. The two modules share the same fundamental pattern but implement it independently.
-
-**Connection to complexity.js:** `cfg.js` imports `findFunctionNode()` from `complexity.js` — confirming that these two AST-analysis modules have shared concerns but no shared framework.
-
-**Ideal architecture — unified AST analysis framework:**
-
-```
-src/
-  ast-analysis/
-    engine.js            # Shared AST walk with visitor pattern
-    rules/
-      complexity/        # Cognitive/cyclomatic/Halstead rules per language
-        javascript.js
-        python.js
-        ...
-      cfg/               # Basic-block construction rules per language
-        javascript.js
-        python.js
-        ...
-    metrics.js           # Halstead, MI computation (from complexity.js)
-    cfg-builder.js       # Basic-block + edge construction (from cfg.js)
-```
-
-Both complexity and CFG analysis walk the same AST trees with language-specific rules. A shared visitor-pattern engine would eliminate the parallel rule-map implementations and allow future AST analyses (e.g., dead code detection, mutation analysis) to plug in without creating yet another 1K+ line module.
-
----
-
-## 7. dataflow.js at 1,187 Lines — JS/TS Only, Language Hardcoding
-
-**Not in previous revision** — this module was just introduced (#254).
-
-**Current state:** 1,187 lines implementing define-use chain extraction with three edge types:
-- `flows_to` — parameter/variable flow between functions
-- `returns` — call return value assignment tracking
-- `mutates` — parameter-derived mutation detection
-
-**Design qualities:**
-- Confidence scoring (1.0 param, 0.9 call return, 0.8 destructured) — good, but undocumented
-- Transaction-based DB writes — correct pattern
-- Lazy parser initialization — efficient
-
-**Architectural concerns:**
-1. **Language hardcoding** — Lines 517-524 and 573-580 hardcode `javascript`/`typescript`/`tsx` checks. Not extensible via registry.
-2. **Scope stack mutation** during tree walk — fragile for malformed AST
-3. **No cycle detection** in dataflow BFS paths — can revisit nodes
-4. **Statement-level mutation detection** misses inline mutations
-5. **Follows the same monolith pattern** — extraction + DB write + query + CLI format all in one file
-
-**Ideal:** Dataflow extraction should integrate with the AST analysis framework proposed above. The define-use chain walk is fundamentally the same visitor pattern as complexity and CFG — it just collects different data.
-
----
-
-## 8. Extractors Refactoring — The Right Pattern, Applied Once
-
-**Previous state:** parser.js at 404 lines with inline extractors.
-
-**Current state:** `src/extractors/` directory with 11 files totaling 3,023 lines:
-
-| File | Lines | Language |
-|------|-------|----------|
-| `javascript.js` | 892 | JS/TS/TSX |
-| `csharp.js` | 311 | C# |
-| `php.js` | 322 | PHP |
-| `java.js` | 290 | Java |
-| `rust.js` | 295 | Rust |
-| `ruby.js` | 277 | Ruby |
-| `go.js` | 237 | Go |
-| `python.js` | 284 | Python |
-| `hcl.js` | 95 | Terraform/HCL |
-| `helpers.js` | 11 | Shared utilities |
-| `index.js` | 9 | Barrel export |
-
-**This is the correct decomposition pattern.** Each language has its own file. A shared helpers module provides `nodeEndLine()` and `findChild()`. The barrel export keeps the public API clean. All extractors return a consistent structure: `{ definitions, calls, imports, classes, exports }`.
-
-**This pattern should be replicated for:**
-- `complexity.js` → `src/complexity/rules/{language}.js` (same per-language rule pattern)
-- `cfg.js` → `src/cfg/rules/{language}.js` (same per-language rule pattern)
-- `dataflow.js` → `src/dataflow/extractors/{language}.js` (when more languages are supported)
-
-The extractors refactoring proved the pattern works. Now apply it consistently.
-
----
-
-## 9. ast.js — Stored Queryable AST Nodes
-
-**Not in previous revision** — new module from PR #279.
-
-**Current state:** 392 lines. Stores selected AST nodes during build for later querying:
-- Node kinds: `call`, `new`, `string`, `regex`, `throw`, `await`
-- Pattern matching via SQL GLOB with auto-wrapping
-- Parent resolution via narrowest enclosing definition
-
-**Architectural assessment:** This is a well-scoped module. At 392 lines it's appropriately sized. It follows the dual-function pattern (`astQueryData()` / `astQuery()`) but is otherwise clean.
-
-**The main concern** is that AST node extraction during build overlaps with what `dataflow.js` and `cfg.js` also do — all three walk the AST. With the unified AST analysis framework proposed in item #6, a single AST walk could populate all three subsystems in one pass.
-
----
-
-## 10. viewer.js at 948 Lines — Self-Contained but Bloated
-
-**Not in previous revision** — new module from PR #268.
-
-**Current state:** 948 lines generating self-contained interactive HTML with vis-network. Features: layout switching, physics toggle, search, color/size/cluster overlays, drill-down, detail panel, community detection.
-
-**Architectural assessment:**
-- Embeds ALL node/edge data as JSON in the HTML — scales poorly for large graphs
-- Client-side filtering only — no server-side optimization
-- Hardcoded thresholds (fanIn >= 10, MI < 40) not derived from distribution
-- Tight vis-network coupling — custom clustering logic deeply integrated
-- Good: configuration cascading via `.plotDotCfg` with deep merge
-
-**This module is isolated** — it has minimal impact on the rest of the architecture. The main risk is HTML size growth for large codebases.
-
----
-
-## 11. Qualified Names + Hierarchical Scoping — Partially Addressed
-
-**Previous state:** Flat node model with no scope or parent information.
-
-**Current state:** Partially addressed via PR #270:
-- `parent_id` column added to `nodes` table (migration v11)
-- `contains` edges track parent-child relationships
-- `parameter_of` edges link parameters to functions
-- `childrenData()` query returns child symbols
-- Extended kinds (`parameter`, `property`, `constant`) model sub-declarations
-
-**What's still missing:**
-- `qualified_name` column (e.g., `DateHelper.format`)
-- `scope` column (e.g., `DateHelper`)
-- `visibility` column (`public`/`private`/`protected`)
-- The `parent_id` FK only goes one level — deeply nested scopes (namespace > class > method > closure) aren't fully represented
-
-**Revised priority:** Medium → Low-Medium. The `parent_id` + `contains` edges solve the 80% case (class methods, interface members, struct fields). The remaining 20% (qualified names, deep nesting) is a polish item.
-
----
-
-## 12. builder.js at 1,355 Lines — Pipeline Now Has 7+ Opt-In Stages
-
-**Previous state:** 1,173 lines with complexity as the only opt-in stage.
-
-**Current state:** 1,355 lines. The build pipeline now has 4 opt-in stages:
-
-```
-Core pipeline (always):
-  collectFiles → detectChanges → parseFiles → insertNodes →
-  resolveImports → buildCallEdges → buildClassEdges →
-  resolveBarrels → insertEdges → buildStructure → classifyRoles
-
-Opt-in stages:
-  --complexity → computeComplexity()
-  --dataflow → buildDataflowEdges() (dynamic import)
-  --cfg → buildCFGData() (dynamic import)
-  AST nodes → extractASTNodes() (always, post-parse)
-```
-
-**Positive development:** The opt-in stages use dynamic imports — `dataflow.js` and `cfg.js` are only loaded when their flags are passed. This keeps default builds fast.
-
-**The pipeline architecture from the previous revision is even more relevant now.** Seven core stages + 4 opt-in stages = 11 total. Each should be independently testable with the pipeline runner handling transactions, logging, progress, and statistics.
-
----
-
-## 13. Export Formats — 6 Formats, Well-Contained
-
-**Previous state:** DOT, Mermaid, JSON.
-
-**Current state:** DOT, Mermaid, JSON, GraphML, GraphSON, Neo4j CSV. Export.js at 681 lines (unchanged — the new formats were already counted in the previous revision).
-
-**Assessment:** Well-contained. The export module adds formats without affecting other modules. No architectural concerns.
-
----
-
-## 14. Constants Hierarchy — A Good Foundation
-
-**Not in previous revision** — introduced across PRs #267, #270, #279.
-
-**Current state:** Three-tiered constants in `queries.js`:
-
-```js
-// Symbol kinds
-CORE_SYMBOL_KINDS = ['function', 'method', 'class', 'interface', 'type',
-  'struct', 'enum', 'trait', 'record', 'module']
-EXTENDED_SYMBOL_KINDS = ['parameter', 'property', 'constant']
-EVERY_SYMBOL_KIND = [...CORE_SYMBOL_KINDS, ...EXTENDED_SYMBOL_KINDS]
-ALL_SYMBOL_KINDS = CORE_SYMBOL_KINDS // backward compat alias
-
-// Edge kinds
-CORE_EDGE_KINDS = ['imports', 'imports-type', 'reexports', 'calls', 'extends', 'implements']
-STRUCTURAL_EDGE_KINDS = ['parameter_of', 'receiver']
-EVERY_EDGE_KIND = [...CORE_EDGE_KINDS, ...STRUCTURAL_EDGE_KINDS]
-
-// AST node kinds (in ast.js)
-AST_NODE_KINDS = ['call', 'new', 'string', 'regex', 'throw', 'await']
-```
-
-**This is well-designed.** The tiered approach lets older code use `ALL_SYMBOL_KINDS` (10 core kinds) while new code can opt into `EVERY_SYMBOL_KIND` (13 kinds). The `contains` edge is stored in the `edges` table but excluded from coupling metrics via the `STRUCTURAL_EDGE_KINDS` distinction.
-
-**One concern:** These constants are scattered across multiple files (`queries.js`, `ast.js`). They should all live in a single `shared/constants.js` as proposed in item #3.
-
----
-
-## Updated Priority Summary
-
-### Items That Improved Since Last Revision
-
-| # | Item | What improved |
-|---|------|--------------|
-| 9 | Parser plugin system (was #20) | Extractors split into `src/extractors/` — **done** |
-| 11 | Qualified names (was #12) | `parent_id`, `contains` edges, `parameter_of` — **partially done** |
-| 5 | CLI surface area (was #5) | 5 commands consolidated in PR #280 — **started** |
-| 3 | Constants organization (was part of #3) | Tiered `CORE_`/`EXTENDED_`/`EVERY_` hierarchy — **started** |
-| -- | normalizeSymbol (new) | Stable JSON schema utility — **done** |
-
-### Items That Worsened Since Last Revision
-
-| # | Item | What worsened |
-|---|------|--------------|
-| 1 | Dual-function pattern | 15 → 19 modules |
-| 2 | Repository pattern | 9 → 13 tables, 20 → 25+ modules with raw SQL |
-| 3 | queries.js size | 3,110 → 3,395 lines |
-| 4 | MCP monolith | 25 → 34 tools in one file |
-| 5 | CLI size | 1,285 → 1,557 lines (despite consolidation) |
-| 6 | Public API | 120+ → 140+ exports |
-| 8 | AST analysis duplication | 1 module (complexity) → 3 modules (+ cfg, dataflow) with parallel rule engines |
-
----
-
-## Revised Summary — Priority Ordering by Architectural Impact
-
-| # | Change | Impact | Category | Previous # |
-|---|--------|--------|----------|------------|
-| **1** | **Command/Query separation — eliminate dual-function pattern across 19 modules** | **Critical** | Separation of concerns | #1 (15→19 modules) |
-| **2** | **Repository pattern for data access — raw SQL in 25+ modules, 13 tables** | **Critical** | Testability, maintainability | #2 (9→13 tables) |
-| **3** | **Decompose queries.js (3,395 lines) into analysis modules + shared constants** | **Critical** | Modularity | #3 (3,110→3,395) |
-| **4** | **Unified AST analysis framework — complexity + CFG + dataflow share no infrastructure** | **Critical** | Code duplication | New (3 modules, ~4.8K lines, parallel rule engines) |
-| **5** | **Composable MCP tool registry (34 tools in 1,370 lines)** | **High** | Extensibility | #4 (25→34 tools) |
-| **6** | **CLI command objects (47 commands in 1,557 lines)** | **High** | Maintainability | #5 (45→47 commands, consolidation started) |
-| **7** | **Curated public API surface (140+ to ~35 exports)** | **High** | API stability | #6 (120→140+ exports) |
-| **8** | **Domain error hierarchy (50 modules, inconsistent handling)** | **High** | Reliability | #7 (35→50 modules) |
-| **9** | **Builder pipeline architecture (1,355 lines, 11 stages, 4 opt-in)** | **High** | Testability, reuse | #9 (1,173→1,355, +2 opt-in stages) |
-| **10** | **Embedder subsystem (1,113 lines, 3 search engines)** | **Medium-High** | Extensibility | #10 (unchanged) |
-| **11** | **Unified graph model for structure/cochange/communities/viewer** | **Medium-High** | Cohesion | #11 (viewer now also builds its own graph) |
-| **12** | **Pagination standardization (SQL-level + command runner)** | **Medium** | Consistency | #13 (unchanged) |
-| **13** | **Testing pyramid with InMemoryRepository** | **Medium** | Quality | #14 (59→70 test files, same DB coupling) |
-| **14** | **Event-driven pipeline for streaming** | **Medium** | Scalability, UX | #15 (unchanged) |
-| **15** | **Qualified names (remaining: qualified_name, scope, visibility columns)** | **Low-Medium** | Data model | #12 (partially addressed by parent_id) |
-| **16** | **Query result caching (34 MCP tools)** | **Low-Medium** | Performance | #16 (25→34 tools) |
-| **17** | **Unified engine interface (Strategy)** | **Low-Medium** | Abstraction | #17 (unchanged) |
-| **18** | **Subgraph export with filtering** | **Low-Medium** | Usability | #18 (unchanged) |
-| **19** | **Transitive import-aware confidence** | **Low** | Accuracy | #19 (unchanged) |
-| **20** | **Config profiles for monorepos** | **Low** | Feature | #21 (unchanged) |
-
-### Items Resolved / Downgraded
-
-| Previous # | Item | Status |
-|------------|------|--------|
-| #20 | Parser plugin system | **Resolved** — extractors split into `src/extractors/` |
-| #8 | Decompose complexity.js (standalone) | **Subsumed** by new #4 (unified AST analysis framework) |
-
----
-
-## New Architectural Concern: Three Independent AST Rule Engines
-
-The most significant architectural development since the last revision is the emergence of **three independent AST analysis modules** that share the same fundamental pattern but no infrastructure:
-
-| Module | Lines | Languages | Pattern |
-|--------|-------|-----------|---------|
-| `complexity.js` | 2,163 | 8 | Per-language rules map → AST walk → collect metrics |
-| `cfg.js` | 1,451 | 9 | Per-language rules map → AST walk → build basic blocks |
-| `dataflow.js` | 1,187 | 1 (JS/TS) | Scope stack → AST walk → collect flows |
-
-Total: **4,801 lines** of parallel AST walking implementations. All three:
-- Walk function-level ASTs from tree-sitter parse trees
-- Use language-specific rule maps keyed by AST node type
-- Build intermediate data structures during the walk
-- Write results to dedicated DB tables
-- Provide query functions + CLI formatters
-
-Additionally, `ast.js` (392 lines) does a fourth AST walk to extract stored nodes.
-
-**The extractors refactoring showed the path:** split per-language rules into files, share the engine. `cfg.js` already took a step in this direction with `makeCfgRules(overrides)` — a factory function for language-specific CFG rules with defaults. Apply this pattern to all four AST analysis passes:
-
-```
-src/
-  ast-analysis/
-    visitor.js # Shared AST visitor with hook points
-    rules/
-      complexity/{lang}.js # Cognitive/cyclomatic rules
-      cfg/{lang}.js # Basic-block rules
-      dataflow/{lang}.js # Define-use chain rules
-      ast-store/{lang}.js # Node extraction rules
-    engine.js # Single-pass or multi-pass orchestrator
-```
-
-A single AST walk with pluggable visitors would:
-1. Eliminate 3 redundant tree traversals per function
-2. Share language-specific node type mappings
-3. Allow new analyses to plug in without creating another 1K+ line module
-4. Enable the 4 opt-in build stages to share a single parse pass
-
----
-
-*Revised 2026-03-03. Cold architectural analysis — no implementation constraints applied.*
diff --git a/package-lock.json b/package-lock.json
index e03070d2..d2745cc7 100644
--- a/package-lock.json
+++ b/package-lock.json
@@ -1,12 +1,12 @@
 {
   "name": "@optave/codegraph",
-  "version": "3.1.4",
+  "version": "3.1.5",
   "lockfileVersion": 3,
   "requires": true,
   "packages": {
     "": {
       "name": "@optave/codegraph",
-      "version": "3.1.4",
+      "version": "3.1.5",
       "license": "Apache-2.0",
       "dependencies": {
         "better-sqlite3": "^12.6.2",
diff --git a/package.json b/package.json
index 88fd7b91..8c2ce563 100644
--- a/package.json
+++ b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "@optave/codegraph",
-  "version": "3.1.4",
+  "version": "3.1.5",
   "description": "Local code graph CLI — parse codebases with tree-sitter, build dependency graphs, query them",
   "type": "module",
   "main": "src/index.js",