Rust MCP runtime for enterprise knowledge ingestion, latent ontology discovery, and downstream impact analysis.
The system is not primarily a code-understanding tool. Code is only one source class among many.
The broader goal is to ingest and correlate:
- program code from one or more git repositories
- architecture documents
- project plans
- meeting notes
- presentations
- diagrams and process maps
- database schemas and records
- content from silos such as SharePoint
and turn that material into a queryable semantic substrate that helps agents and humans:
- discover hidden dependencies
- distinguish overloaded terms that mean different things in different domains
- understand downstream impacts of changes
- surface missing or contradictory assumptions
- support architectural and operational decision-making
Organizations already contain a latent ontology.
It is scattered across:
- source code
- naming conventions
- schemas
- slide decks
- architecture artifacts
- meeting language
- operational records
- file structures
The same term can mean different things in different areas. The same dependency can appear as:
- a code import
- a business rule in a meeting note
- a process handoff in a diagram
- a column dependency in a database schema
- an ownership boundary in an architecture document
The system should not assume those ideas are already precise. It must extract evidence, preserve provenance, infer candidate semantics, and progressively build a usable ontology of the enterprise.
The runtime has four jobs:
- Ingest heterogeneous artifacts from many systems.
- Convert them into evidence-bearing semantic objects.
- Correlate those objects across silos to discover shared or conflicting meaning.
- Expose the resulting knowledge through MCP so agents can explore impacts and dependencies safely.
It is not:
- just a code graph
- just a vector database
- just a document search tool
- just an RDF store
- just an agent shell
It is a semantic runtime that combines all of those as supporting capabilities.
```
+----------------------------------+
| Source systems / silos |
|----------------------------------|
| git repos |
| SharePoint / file stores |
| wiki / docs / notes |
| presentations / PDFs |
| diagrams / process models |
| DB schemas / operational records |
+----------------+-----------------+
|
connectors / extractors
|
+-----------------v------------------+
| Normalized artifact layer |
|------------------------------------|
| Artifact |
| Anchor / span / section |
| metadata / source identity |
| timestamps / provenance |
+-----------------+------------------+
|
extraction / interpretation
|
+----------------------v----------------------+
| Evidence and claims |
|---------------------------------------------|
| entities |
| candidate concepts |
| relations |
| schema observations |
| process steps |
| confidence + provenance |
+----------------------+----------------------+
|
resolution / induction / disambiguation
|
+------------------------------v-------------------------------+
| Enterprise semantic runtime |
|---------------------------------------------------------------|
| ontology candidates |
| contextual namespaces |
| resolved entities |
| hidden dependency graph |
| impact paths |
| agent-facing discovery surface |
+---------------------+-------------------+----------------------+
| |
| |
+-------------v----+ +--------v------------------+
| Rhai runtime | | Python logical executor |
|------------------| |---------------------------|
| mapping policies | | connector interop |
| enrichment rules | | document/diagram tooling |
| disambiguation | | LLM discovery workflows |
| impact heuristics| | schema curation |
+-------------+----+ +------------+--------------+
| |
+-----------+----------+
|
validated Rust host boundary
|
+-----------------------+------------------------+
| |
+---------v----------+ +------------v------------+
| Neumann live store | | Git semantic ledger |
|--------------------| |-------------------------|
| facts | | blob/tree/commit IDs |
| graph edges | | snapshots / replay |
| embeddings | | manifests / provenance |
+--------------------+ +-------------------------+
|
+---------v---------+
| MCP / decision |
| support surface |
+-------------------+
```
The runtime should separate at least four layers.
1. Artifact layer: what was actually found.
Examples:
- file in SharePoint
- git blob
- schema DDL
- slide deck
- BPMN diagram
- note page
2. Evidence layer: what was extracted from the source.
Examples:
- a quoted statement
- a table definition
- a process step
- a system name
- a role or owner
- a dependency claim
This layer must preserve provenance and confidence.
3. Interpretation layer: what the system believes these artifacts mean.
Examples:
- candidate concepts
- candidate equivalences
- competing definitions
- contextualized "standing data" interpretations
- inferred relationships
This is where ambiguity is modeled, not hidden.
4. Impact layer: what changes imply.
Examples:
- if system A changes, which documents, processes, teams, schemas, and applications are affected
- which concepts depend on a field or policy that is defined only informally
- where terminology drift suggests hidden coupling or decision risk
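A minimal Rust sketch of that separation; all type and field names here are illustrative, not the actual `domain` crate API:

```rust
// Illustrative sketch of the four-layer separation; hypothetical shapes.

/// Layer 1: what was actually found, with source identity intact.
struct Artifact {
    source_system: String, // e.g. "sharepoint", "git"
    uri: String,           // stable address back into the silo
    content_hash: String,  // identity for provenance and replay
}

/// Layer 2: what was extracted, never detached from its artifact.
struct Observation {
    artifact_uri: String,  // provenance trail back to layer 1
    anchor: String,        // span, slide number, table cell, ...
    kind: ObservationKind,
    confidence: f32,       // 0.0..=1.0
}

enum ObservationKind {
    QuotedStatement(String),
    TableDefinition(String),
    ProcessStep(String),
    DependencyClaim { from: String, to: String },
}

/// Layer 3: what the system believes it means; ambiguity stays visible.
struct ConceptCandidate {
    term: String,
    namespace: String,                 // business context, not global
    supporting: Vec<Observation>,
    competing_definitions: Vec<String>,
}

/// Layer 4: what changes imply, derived from layers 1-3.
struct ImpactPath {
    origin: String,          // e.g. a schema field
    affected: Vec<String>,   // documents, processes, teams, schemas, ...
    weakest_confidence: f32, // an impact is only as strong as its evidence
}
```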
A term like `standing data` is not globally meaningful on its own.
The system must be able to say:
- `standing data` in market operations
- `standing data` in enterprise architecture
- `standing data` in records or governance
Those may overlap, conflict, or only partially align.
So the runtime must model:
- local meaning
- namespace / business context
- source provenance
- temporal validity
- confidence
- equivalence or non-equivalence with other concepts
The ontology is therefore not just a fixed taxonomy. It is a living, evidence-backed semantic model of the organization.
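A rough Rust sketch of what that modeling implies (hypothetical shapes, not the shipped API):

```rust
// Hypothetical shapes for contextual meaning; not the actual crate API.

/// One evidenced reading of a term inside one business context.
struct ContextualConcept {
    term: String,               // e.g. "standing data"
    namespace: String,          // e.g. "market-operations"
    definition: String,         // the local meaning, as evidenced
    source_artifact: String,    // provenance back to the defining artifact
    valid_from: Option<String>, // temporal validity (ISO 8601)
    confidence: f32,            // 0.0..=1.0
}

/// Equivalence is asserted between readings, never assumed globally.
enum Alignment {
    Equivalent,  // same meaning despite different contexts
    Overlapping, // partially aligned
    Conflicting, // same words, incompatible meanings
}

/// An evidence-backed edge between two contextual readings.
struct AlignmentClaim {
    left: ContextualConcept,
    right: ContextualConcept,
    alignment: Alignment,
    evidence: Vec<String>, // observation IDs supporting the claim
}
```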
Rust remains the authority for:
- host contracts and type safety
- storage and retrieval
- MCP transport and serving
- validation boundaries
- execution limits
- provenance and replay
- orchestration and policy enforcement
Rhai is the embedded runtime for configurable semantic behavior inside the host process.
Its role is to support:
- source-specific mapping rules
- concept disambiguation policies
- enrichment and derived fields
- relation emission
- naming and routing logic
- impact heuristics
- MCP-facing semantic projections
Rhai is not the source of truth for the host contract. It runs behind a fixed Rust adapter.
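A minimal sketch of that boundary using the `rhai` crate; the registered helper and the rule body are invented for illustration:

```rust
use rhai::{Engine, Scope};

// The host owns the contract: scripts see only what is registered
// here, and the script's output comes back as a typed Rust value.
fn run_disambiguation_rule(
    term: &str,
    namespace: &str,
) -> Result<String, Box<rhai::EvalAltResult>> {
    let mut engine = Engine::new();

    // Expose a narrow, host-controlled helper; Rhai cannot reach
    // storage or the MCP surface directly.
    engine.register_fn("normalize", |s: &str| s.trim().to_lowercase());

    let mut scope = Scope::new();
    scope.push("term", term.to_string());
    scope.push("namespace", namespace.to_string());

    // An invented mapping rule; real rule bundles ship as modules.
    let script = r#"
        let t = normalize(term);
        if namespace == "market-operations" && t == "standing data" {
            return "reference-data";
        }
        t
    "#;

    engine.eval_with_scope::<String>(&mut scope, script)
}
```

The point is the shape, not the rule: the script runs behind a fixed adapter, and the host validates whatever comes out before it touches the store.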
Python remains in scope, but as a logical executor rather than the core runtime object system.
Its role is to support:
- interoperability with external LLM/agent ecosystems
- document-specific and diagram-specific tooling
- complex extractors not worth reimplementing in Rust
- ontology discovery workflows
- curation of Rhai-facing schemas and rule bundles
- offline or batch enrichment pipelines
Python should remain outside the critical in-process object runtime boundary.
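Seen from the Rust side, that boundary might look like the following sketch (trait and type names are invented):

```rust
// Illustrative executor boundary; not the actual crate interface.

/// A request the host dispatches to a logical executor (e.g. Python).
struct ExtractionRequest {
    artifact_uri: String,
    extractor: String, // e.g. "diagram-bpmn", "slide-deck"
    payload: Vec<u8>,
}

/// What comes back: candidate evidence only, never direct store writes.
struct EvidenceBundleDraft {
    observations: Vec<String>, // serialized observations, pending validation
}

trait LogicalExecutor {
    fn execute(&self, req: ExtractionRequest) -> Result<EvidenceBundleDraft, String>;
}

/// The host gates every draft before persistence, keeping Python
/// outside the in-process object runtime boundary.
fn ingest_via_executor(
    exec: &dyn LogicalExecutor,
    req: ExtractionRequest,
) -> Result<EvidenceBundleDraft, String> {
    let draft = exec.execute(req)?;
    if draft.observations.iter().any(|o| o.is_empty()) {
        return Err("rejected: empty observation from executor".into());
    }
    Ok(draft)
}
```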
```
source artifact
|
+-- code / schema / note / slide / diagram / record
|
v
extractor pipeline
|
+-- native Rust extractor
+-- Python executor
+-- MCP-forwarded specialist worker
|
v
normalized evidence bundle
|
+-- anchors
+-- observations
+-- candidate entities
+-- candidate relations
+-- confidence + provenance
|
v
ontology interpretation
|
+-- validate against host schema
+-- apply Rhai mapping / enrichment
+-- resolve contextual namespaces
+-- generate candidate ontology objects
|
v
correlation and dependency graph
|
+-- resolved entities
+-- conflicts / overlaps
+-- hidden dependencies
+-- impact paths
|
v
persist to Neumann + snapshot to Git
|
v
serve through MCP for agent exploration
```
The system should evolve toward generic enterprise abstractions rather than code-specific ones.
Likely core objects:
- `SourceSystem`
- `Artifact`
- `Anchor`
- `Observation`
- `Claim`
- `Concept`
- `Entity`
- `Relation`
- `ContextNamespace`
- `EvidenceBundle`
- `ImpactPath`
- `DecisionSupportView`
Code symbols, database tables, process steps, and architecture components then become specializations or projections of those more general objects.
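One hypothetical way to express that projection (not the `domain` crate's actual shapes):

```rust
// Hypothetical projection: code-specific shapes are views over
// generic enterprise objects, not separate models.

struct Entity {
    id: String,
    kind: EntityKind,
    namespace: String,
}

enum EntityKind {
    CodeSymbol { language: String, path: String },
    DatabaseTable { schema: String },
    ProcessStep { process: String },
    ArchitectureComponent,
}

/// A code graph is then just a filtered projection of the general model.
fn code_symbols(entities: &[Entity]) -> impl Iterator<Item = &Entity> {
    entities
        .iter()
        .filter(|e| matches!(e.kind, EntityKind::CodeSymbol { .. }))
}
```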
Current crates:
```
crates/
  classifier/
  cli/
  codegraph/
  domain/
  dsl/
  forward-mcp/
  handlers/
  indexer/
  intake/
  mcp-server/
  naming/
  orchestrator/
  provider-api/
  provider-local/
  provider-openai/
  provider-test/
  retrieval/
  semantic-runtime/
  storage-neumann/
  tomllm/
```
Today the implemented path is still biased toward local repo and code/document ingestion, because that is the most mature slice so far.
Already present:
- `mcp-server` with stdio and HTTP JSON-RPC transport
- `phase2d` daemon entrypoint in `crates/cli`
- config-driven `phase2d` bootstrap via `phase2d.toml`
- `forward-mcp` transport for stdio and HTTP delegation
- `provider-local` for managed local OpenAI-compatible serving
- `indexer` watcher runtime with `watchexec`
- `storage-neumann` as the live semantic store
- ontology resources ingested at startup
- `agent.run` MCP tool backed by a bounded internal executor
- DSL foundations for rule-driven ingestion
- enterprise semantic domain objects such as `Artifact`, `Observation`, `Claim`, `Concept`, `Entity`, `Relation`, `ContextNamespace`, and `EvidenceBundle`
- Rhai-based semantic correlation runtime over evidence bundles
- directory-scoped staging source configs via `.promptexecution.toml`
- staged artifact ingestion with inherited source tags and ontology references
- watcher/indexer bridging that persists source metadata into the semantic store
- configured external executor dispatch over the MCP forward boundary
- semantic enrichment persistence for extracted fields, claims, relations, and notes
- Acme Corp demo corpus with cross-document fixtures and CI smoke coverage
Current startup examples:
```
cargo run -p cli --bin phase2d -- stdio --config examples/acme-corp/phase2d.toml
cargo run -p cli --bin phase2d -- http --config examples/acme-corp/phase2d.toml --addr 127.0.0.1:3000
cargo run -p cli --bin phase2d -- http --config examples/acme-corp/phase2d.toml --watch examples/acme-corp/repo
```

The next architectural step is not just "better code indexing".
It is:
- generalized connectors for multiple silos
- an artifact/evidence model that is not code-centric
- contextual ontology induction
- correlation across repositories, documents, schemas, and process artifacts
- decision-support views over discovered dependencies
The current runtime bundle now describes:
- source connectors
- extractor registrations
- Neumann/store config
- watch roots and polling scopes
- ontology registries
- schema registries
- Rhai modules and packages
- Python executor registrations
- MCP forward targets
- impact-view projections
The next configuration step is broadening that bundle beyond local staged directories into richer connector definitions and impact-view projections without recompiling the binary.
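As a sketch, the bundle could deserialize into something like this; section and field names here are assumptions, and the real `phase2d.toml` schema governs:

```rust
use serde::Deserialize;

// Rough shape of the runtime bundle described above; illustrative only.
#[derive(Deserialize)]
struct RuntimeBundle {
    sources: Vec<SourceConnector>, // source connectors
    extractors: Vec<String>,       // extractor registrations
    store: StoreConfig,            // Neumann/store config
    watch_roots: Vec<String>,      // watch roots and polling scopes
    ontologies: Vec<String>,       // ontology registries
    schemas: Vec<String>,          // schema registries
    rhai_modules: Vec<String>,     // Rhai modules and packages
    python_executors: Vec<String>, // Python executor registrations
    mcp_forward: Vec<String>,      // MCP forward targets
    impact_views: Vec<String>,     // impact-view projections
}

#[derive(Deserialize)]
struct SourceConnector {
    kind: String, // "git", "sharepoint", "filesystem", ...
    root: String,
    tags: Vec<String>,
}

#[derive(Deserialize)]
struct StoreConfig {
    path: String,
}
```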
The repo now carries a stable demo spine under `examples/acme-corp/`:
- themed source roots for repo and document ingestion
- `.promptexecution.toml` staging configs per source root
- `phase2d.toml` for end-to-end daemon startup
- a Python curator executor for cross-document classification hints
- Rhai modules for latent-dependency enrichment
- a GitHub Actions smoke workflow that exercises the demo as part of CI
This keeps new runtime features anchored to one continuous corpus instead of adding isolated test fixtures with no narrative continuity.
The system should ultimately help answer questions like:
- What else changes if this schema field changes?
- Which documents and processes depend on this concept, even if they use different words?
- Where do two teams mean different things by the same term?
- Which decisions rely on informal or weakly evidenced assumptions?
- Which systems are coupled only through undocumented process or data dependencies?
That is the actual downstream value of the runtime.
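For the first of those questions, the answer reduces to reachability over the hidden dependency graph. A self-contained sketch with an invented fixture (the graph representation is assumed, not the actual store API):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Breadth-first walk of the dependency graph: everything reachable
/// from a changed node is a candidate downstream impact.
fn downstream_impacts<'a>(
    edges: &'a HashMap<&'a str, Vec<&'a str>>, // node -> dependents
    changed: &'a str,
) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::from([changed]);
    let mut affected = Vec::new();
    while let Some(node) = queue.pop_front() {
        for &dep in edges.get(node).into_iter().flatten() {
            if seen.insert(dep) {
                affected.push(dep);
                queue.push_back(dep);
            }
        }
    }
    affected
}

fn main() {
    // Invented fixture: a schema field feeding a report and a process doc.
    let edges = HashMap::from([
        ("schema.customer_id", vec!["billing-report", "onboarding-process.docx"]),
        ("billing-report", vec!["finance-review-meeting"]),
    ]);
    println!("{:?}", downstream_impacts(&edges, "schema.customer_id"));
}
```

In the real runtime the edges carry provenance and confidence, so the projection can rank impacts rather than just enumerate them.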
Guiding principles:
- Preserve provenance. Never lose the trail back to the source artifact.
- Model ambiguity explicitly. Do not force early false precision.
- Separate evidence from interpretation.
- Keep the Rust host contract stable.
- Use Rhai for embedded semantic behavior, not host contract mutation.
- Keep Python at an executor boundary for interop and discovery.
- Keep Git as the immutable semantic ledger and Neumann as the live semantic store.
- Expose ontology and impact discovery through MCP so agents learn instead of guessing.
This runtime is not trying to:
- reduce the enterprise ontology to code structure alone
- assume every source can be made semantically precise immediately
- make Python the in-process ontology runtime
- let scripts bypass host validation
- replace human curation where ambiguity is real
This branch captures the documentation pivot first.
The implementation already has a working Phase 2 runtime skeleton. The next real work is to widen that skeleton from "repo/code intelligence" into "enterprise artifact ingestion and latent ontology discovery".