
feat(cimcheck): add CIM/CGMES SPARQL and SHACL validation toolchain #10

Open
spah-soptim wants to merge 59 commits into main from cimcheck

spah-soptim commented May 26, 2026

Summary

This PR introduces CIMCheck — a static analysis toolchain for SPARQL queries and SHACL shapes written against CIM/CGMES schemas. It ships as three Maven modules plus a VS Code extension:

  • cimcheck/core — validation engine: SPARQL algebra visitor, semantic domain/range checks, SHACL shape analyzer, RDFS schema index, and a high-level SparqlValidationApi façade
  • cimcheck/lsp — Language Server Protocol server with debounced diagnostics, hover documentation, go-to-definition, and CIM term completion
  • cimcheck/cli — batch validation CLI (cimcheck) with text and JSON output, configurable strictness, and auto-discovery of .cgmes/validation.json
  • cimcheck/vscode — VS Code extension wrapping the LSP; ships the fat JAR or accepts an explicit serverJar setting

Additionally:

  • GitHub Actions CI (lint, build+test, TypeScript type-check, VSIX build) and release workflows
  • ENTSO-E application profiles library added as a git submodule for integration tests
  • Pre-2020 CGMES profile encoding fix in cimxml (rdfs:Literal blank-node style)

What gets validated

| Input | Checks |
| --- | --- |
| SPARQL query / update | Syntax, unknown class/property, domain/range mismatch, implied type, path chain compatibility, unconfigured named graph |
| SHACL shapes (Turtle) | sh:targetClass, sh:class, sh:path existence; sh:nodeKind/range compatibility; sh:datatype/sh:class vs range; sh:minCount/sh:maxCount contradiction; all embedded sh:select/sh:ask/sh:construct SPARQL fragments |

Strictness is configurable (permissive / default / strict / pedantic) per-file via .cgmes/validation.json or via CLI flag.

Notable design decisions

  • Auto-detection: validateSparql() tries query parse first, falls back to update, then attempts ;-separated multi-query splitting — no caller configuration required.
  • Profile scoping: three overloads (all profiles, explicit profile list, named-graph→profile map) cover the full range of CGMES deployment patterns.
  • $PATH substitution: embedded SPARQL constraints that reference $PATH / ?PATH have the enclosing sh:path URI substituted before static analysis, avoiding false UNSUPPORTED_DYNAMIC_PROPERTY warnings.
  • Exempt vocabulary: RDF/RDFS/OWL/XSD/SHACL terms are never validated against the CIM index.
  • Tolerant schema loading: unparseable profile files are skipped with a warning rather than aborting; the LSP surfaces skipped files in the VS Code message area.
  • sh:alternativePath alias suppression: an unknown alternative in a cross-version compatibility path (cim:Foo | <cim100#Foo> | <ucaiug#Foo>) is suppressed when a sibling with the same local name is known, avoiding noise on multi-namespace shapes.
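
The auto-detection order described above (query, then update, then ;-separated splitting) could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `AutoDetectSketch`, `Kind`, and the two parser predicates are hypothetical stand-ins for the real parsers, and the naive split on `;` is an assumption that a real splitter would have to refine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Sketch of a parse-fallback chain; the parser predicates are hypothetical stand-ins. */
class AutoDetectSketch {
    enum Kind { QUERY, UPDATE, MULTI_QUERY, UNPARSEABLE }

    static Kind detect(String text, Predicate<String> isQuery, Predicate<String> isUpdate) {
        if (isQuery.test(text)) return Kind.QUERY;    // 1. try query syntax first
        if (isUpdate.test(text)) return Kind.UPDATE;  // 2. fall back to update syntax
        // 3. Naive split on ';' (assumption: a real splitter must skip ';' inside
        //    predicate-object lists, string literals, and comments).
        List<String> parts = new ArrayList<>();
        for (String p : text.split(";")) {
            if (!p.trim().isEmpty()) parts.add(p.trim());
        }
        if (parts.size() > 1 && parts.stream().allMatch(isQuery)) return Kind.MULTI_QUERY;
        return Kind.UNPARSEABLE;
    }

    public static void main(String[] args) {
        // Toy predicates standing in for real SPARQL parsers.
        Predicate<String> isQuery = s -> s.matches("(?is)\\s*(SELECT|ASK)\\b[^;]*");
        Predicate<String> isUpdate = s -> s.matches("(?is)\\s*(INSERT|DELETE)\\b.*");
        System.out.println(detect("ASK { ?s ?p ?o }", isQuery, isUpdate));
        System.out.println(detect("ASK { ?s ?p ?o } ; SELECT * { ?s ?p ?o }", isQuery, isUpdate));
    }
}
```

The point of the ordering is that callers pass plain text and never have to say which kind of SPARQL document it is.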

Test coverage

Unit and integration tests cover algebra traversal, semantic validation, squiggle position mapping, SHACL constraint checks, $this binding, SPARQL UPDATE, source locator, and full CGMES 2.4 / 3.0 profile integration.

…L shape analyzer

rdf:type and other standard W3C terms appearing in SHACL sh:path expressions
(e.g. sequence paths like (cim:Prop rdf:type)) were incorrectly flagged as
UNKNOWN_PROPERTY because ShaclShapeAnalyzer lacked the same exempt-namespace
guard that AlgebraAnalysisVisitor already applies for SPARQL queries.
…es for embedded SPARQL

Two improvements to embedded SPARQL validation in SHACL shapes:

1. Prefix fallback: ShaclSparqlExtractor.resolvePrefixes() now seeds the prefix
   map from the graph's own @prefix declarations before applying sh:declare entries.
   This fixes SYNTAX_ERROR ("Unresolved prefixed name") for SHACL files that use
   sh:prefixes <X> without a corresponding sh:declare block on node X — the cim:
   prefix is still available from the Turtle file header.

2. $PATH substitution: EmbeddedSparql gains a shPaths field populated with the
   simple-URI sh:path values of enclosing sh:PropertyShape nodes. SparqlValidationApi
   replaces $PATH / ?PATH variable predicates with the concrete URI before analysis,
   eliminating the UNSUPPORTED_DYNAMIC_PROPERTY warning for the SHACL-standard
   $PATH pattern.
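
As a rough illustration of the substitution in item 2, the sketch below swaps the path variable for a concrete IRI before analysis. `PathSubstitutionSketch` and `substitutePath` are hypothetical names, and the plain token replacement is a simplifying assumption: it would also rewrite a longer variable such as `$PATHS`, which a real implementation would need to guard against.

```java
/** Sketch: replace the SHACL path variable with the enclosing sh:path IRI. */
class PathSubstitutionSketch {
    static String substitutePath(String sparql, String shPathUri) {
        String iri = "<" + shPathUri + ">";
        // String.replace is literal, so '$' and '?' need no regex escaping.
        return sparql.replace("$PATH", iri).replace("?PATH", iri);
    }

    public static void main(String[] args) {
        String constraint = "SELECT $this WHERE { $this $PATH ?value }";
        System.out.println(
            substitutePath(constraint, "http://iec.ch/TC57/CIM100#IdentifiedObject.name"));
    }
}
```

After substitution the predicate is a fixed IRI, so the usual property existence and domain/range checks can run on it.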
…prefix fallback

Reflects three improvements shipped in the preceding commits:

- sh:path validation: standard vocabulary terms (rdf:type, rdfs:*, owl:*, etc.)
  are now exempt from UNKNOWN_PROPERTY checks; updated the SHACL structural
  checks table and the existing known-limitation comment.

- Dynamic predicates: added an exception note to the "Dynamic predicates and
  classes" section explaining that SHACL $PATH variables are resolved via the
  enclosing sh:PropertyShape.sh:path before analysis. Updated the SHACL API
  description to mention $PATH substitution alongside $this typing.

- Prefix fallback: clarified in the Pass 2 description that graph-level Turtle
  @prefix declarations are used as a fallback when sh:prefixes target nodes
  carry no sh:declare blocks.
Intermediate variables in SHACL embedded constraints are transient
bindings, not entities the author is expected to annotate with
rdf:type. Filtering QUERY_IMPLIED_TYPE (INFO) from embedded results
eliminates noise when validating ENTSO-E CGMES 3.0 SHACL files where
every SELECT uses pattern variables alongside domain-having properties.
…ePath

ENTSO-E CGMES SHACL files use sh:alternativePath to list the same
property under multiple CIM namespace URIs (e.g. cim16, CIM100,
ucaiug.io) for cross-version compatibility. Checking each alternative
independently flagged the aliases as UNKNOWN_PROPERTY even when one
variant was valid for the loaded profiles.

An unknown alternative is now silently suppressed when at least one
sibling in the same sh:alternativePath group is a known property with
the same local name. Alternatives whose local names differ from every
known sibling are still flagged, preserving detection of genuine typos.
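
A minimal sketch of the suppression rule, assuming alternatives and known properties are plain IRI strings and that the local name is whatever follows the last `#` or `/`. `AlternativePathAliasSketch` and its methods are hypothetical names, not the analyzer's actual API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: flag only those unknown alternatives that have no known same-named sibling. */
class AlternativePathAliasSketch {
    static String localName(String iri) {
        int cut = Math.max(iri.lastIndexOf('#'), iri.lastIndexOf('/'));
        return iri.substring(cut + 1);
    }

    static List<String> unknownToFlag(List<String> alternatives, Set<String> knownProperties) {
        // Local names of the alternatives that the loaded profiles do know.
        Set<String> knownLocals = new HashSet<>();
        for (String alt : alternatives) {
            if (knownProperties.contains(alt)) knownLocals.add(localName(alt));
        }
        List<String> flagged = new ArrayList<>();
        for (String alt : alternatives) {
            boolean unknown = !knownProperties.contains(alt);
            boolean aliasOfKnown = knownLocals.contains(localName(alt));
            if (unknown && !aliasOfKnown) flagged.add(alt);  // a likely genuine typo
        }
        return flagged;
    }

    public static void main(String[] args) {
        List<String> alts = List.of(
            "http://iec.ch/TC57/CIM100#ACLineSegment.r",
            "http://iec.ch/TC57/2013/CIM-schema-cim16#ACLineSegment.r",
            "http://iec.ch/TC57/CIM100#ACLineSegment.rr");
        Set<String> known = Set.of("http://iec.ch/TC57/CIM100#ACLineSegment.r");
        System.out.println(unknownToFlag(alts, known));
    }
}
```

In the example, the cim16 variant is suppressed because its local name matches a known sibling, while the misspelled `ACLineSegment.rr` is still reported.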
…tVocabulary

Dead code removed:
- SparqlValidationCode.TERM_EXISTS_IN_OTHER_PROFILE (never emitted)
- GraphReference.Source.UPDATE_TEMPLATE (never used)
- SparqlQueryValidator.intersect() (orphaned private method)

Correctness fixes:
- SemanticChecks.anySubclassMatch: removed incorrect reverse-direction check
  that caused false-positive PATH_CHAIN_INCOMPATIBLE suppressions
- SparqlValidationApi: use String.replace() instead of replaceAll() for $PATH / ?PATH
  substitution to avoid regex back-reference interpretation of URIs with '$'
- SparqlValidationApi: merge profileDeps/updateProfileDeps into single
  collectProfileDeps to eliminate duplicated logic
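
To illustrate why the replace()/replaceAll() distinction matters, the sketch below feeds both methods a replacement IRI containing `$`. The class name and the example IRI are made up for the demonstration; the behaviour shown is that of the standard `java.lang.String` methods.

```java
/** Sketch: literal replace vs. regex replaceAll when the replacement contains '$'. */
class LiteralReplaceSketch {
    static String withReplace(String sparql, String iri) {
        return sparql.replace("$PATH", iri);        // target and replacement are literal
    }

    static String withReplaceAll(String sparql, String iri) {
        return sparql.replaceAll("\\$PATH", iri);   // '$' in the replacement is a group reference
    }

    public static void main(String[] args) {
        String iri = "<urn:example:a$b>";           // hypothetical IRI containing '$'
        System.out.println(withReplace("?s $PATH ?o", iri));
        try {
            withReplaceAll("?s $PATH ?o", iri);
        } catch (IllegalArgumentException e) {
            System.out.println("replaceAll failed: " + e.getMessage());
        }
    }
}
```

If replaceAll() had to be kept for some reason, wrapping the replacement in `java.util.regex.Matcher.quoteReplacement` would be the usual alternative.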

API additions:
- ShaclValidationResult.isValid(StrictnessLevel): overload for caller-controlled
  strictness filtering; zero-arg isValid() now delegates to it

Refactoring:
- Extract ExemptVocabulary shared class from duplicate EXEMPT_NAMESPACES lists
  in AlgebraAnalysisVisitor and ShaclShapeAnalyzer
LSP fixes:
- SparqlWorkspaceService.didChangeWatchedFiles: guard against null params.getChanges()
  (LSP spec allows omitting the array; this caused NPE on some clients)
- SparqlTextDocumentService: remove pending.remove(uri) at start of validateSparql/
  validateShacl — races with newer scheduled tasks and removes their cancellation entry
- SparqlTextDocumentService.shutdown: use shutdownNow() so pending debounce tasks are
  not waited on during LSP server shutdown
- SparqlTextDocumentService.convertSparqlAnnotation: use tokenLengthInSource() for
  term-based paths instead of delegating to DiagnosticConverter (which uses full URI
  length + 2, wrong for prefixed-name tokens in source)
- SparqlTextDocumentService.turtleParseErrorDiagnostic: guard e.getMessage() null
  (some Jena parse exceptions have no message; rendered as "null" in UI)
- SchemaManager.shutdown: awaitTermination(2s) before shutdownNow to allow in-flight
  schema loads to complete gracefully
- DefinitionIndex.findSymbols: collect all matches then sort-and-cap at MAX_SYMBOLS
  so properties are not starved when many classes match the query
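
The DefinitionIndex.findSymbols change can be pictured as "collect everything, rank, then cap". The sketch below is an illustration under assumptions: the ranking (shorter names first, then alphabetical) is invented for the example, and `SymbolSearchSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Sketch: gather all matches first, then sort and cap, so no kind is starved. */
class SymbolSearchSketch {
    static List<String> findSymbols(String query, List<String> classes,
                                    List<String> properties, int maxSymbols) {
        String q = query.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String c : classes) {
            if (c.toLowerCase(Locale.ROOT).contains(q)) matches.add(c);
        }
        for (String p : properties) {
            if (p.toLowerCase(Locale.ROOT).contains(q)) matches.add(p);
        }
        // Assumed ranking: shorter names first, ties broken alphabetically.
        matches.sort(Comparator.comparingInt(String::length)
                               .thenComparing(Comparator.naturalOrder()));
        return new ArrayList<>(matches.subList(0, Math.min(maxSymbols, matches.size())));
    }

    public static void main(String[] args) {
        List<String> classes = List.of("Terminal", "TerminalAction", "TerminalConstraint");
        List<String> properties = List.of("Terminal.id");
        System.out.println(findSymbols("term", classes, properties, 2));
    }
}
```

Capping while iterating over classes first would have returned only class names here; capping after the sort lets the property through.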

CLI fixes:
- SchemaLoader: pass FileVisitOption.FOLLOW_LINKS to Files.walk so symlinked schema
  directories are traversed correctly
- ValidateCommand: read stdin only once; cache in stdinText so '-' passed multiple
  times returns the same content rather than empty on the second read
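
A minimal sketch of the read-once behaviour described for ValidateCommand, assuming the stream is drained on first use and the text is kept in a field. `StdinCacheSketch` is a hypothetical name; the real command presumably wraps `System.in`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Sketch: drain an input stream once and serve the cached text on later calls. */
class StdinCacheSketch {
    private final InputStream in;
    private String cached;  // null until the first read

    StdinCacheSketch(InputStream in) {
        this.in = in;
    }

    String read() {
        if (cached == null) {
            try {
                cached = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return cached;
    }

    public static void main(String[] args) {
        StdinCacheSketch stdin = new StdinCacheSketch(System.in);
        // Both calls return the same text, as if '-' were passed twice.
        System.out.println(stdin.read().length() + " " + stdin.read().length());
    }
}
```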
VS Code:
- buildClient: pass context to allow traceOutputChannel to be registered in
  subscriptions, preventing leak on extension deactivation
- Remove redundant { dispose: () => client?.stop() } subscription — deactivate()
  already calls client.stop(), so this caused double-stop on VS Code shutdown
- JDWP debug address changed from 5005 to 127.0.0.1:5005 to bind only to
  localhost and prevent remote debug access

CI:
- Rename "Setup Node.js 18" step labels to "Node.js 20" to match the actual
  node-version: "20" that was already in use
spah-soptim added the enhancement label May 26, 2026
- SchemaManager: isolate on-loaded callbacks in per-callback try/catch so
  a failing callback cannot clear a successfully-loaded schema
- SparqlValidationApi: remove getGraphDependencies(String, Collection<VersionIri>)
  overload that silently ignored its profiles parameter; remove the
  corresponding test that only documented the dead-parameter behaviour
- SparqlTextDocumentService: replace wildcard java.util.concurrent.* import
  with explicit imports (CompletableFuture, ConcurrentHashMap, Executors,
  ScheduledExecutorService, ScheduledFuture, TimeUnit)
- ShaclSparqlExtractor: fix stray brace left by iterator-close refactor
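
The per-callback isolation in the SchemaManager item can be sketched as below. This assumes callbacks are plain `Runnable`s and that failures are collected rather than rethrown; `CallbackIsolationSketch` and `fireLoaded` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: run every on-loaded callback even when an earlier one throws. */
class CallbackIsolationSketch {
    static List<RuntimeException> fireLoaded(List<Runnable> callbacks) {
        List<RuntimeException> failures = new ArrayList<>();
        for (Runnable callback : callbacks) {
            try {
                callback.run();
            } catch (RuntimeException e) {
                // Record the failure and continue; the loaded schema stays in place.
                failures.add(e);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<Runnable> callbacks = new ArrayList<>();
        callbacks.add(() -> System.out.println("first callback ran"));
        callbacks.add(() -> { throw new IllegalStateException("boom"); });
        callbacks.add(() -> System.out.println("third callback ran"));
        System.out.println("failures: " + fireLoaded(callbacks).size());
    }
}
```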
abba-soptim self-assigned this May 26, 2026
spah-soptim changed the title from "feat: cimcheck" to "feat(cimcheck): add CIM/CGMES SPARQL and SHACL validation toolchain" May 26, 2026
spah-soptim marked this pull request as ready for review May 26, 2026 08:47
spah-soptim requested a review from abba-soptim May 26, 2026 08:47
Resolve three named-graph problems:

1. Relative graph names (<EQ>, <TP>) were resolved against the JVM
   working directory by Jena, producing file:// URIs that never
   matched any configured profile. SparqlQueryAnalyzer now passes
   a stable urn:x-cimcheck:base/ base to QueryFactory/UpdateFactory
   so relative refs always resolve predictably.

2. namedGraphs was dead config in the LSP: SchemaManager now builds
   a NamedGraphProfileScope from the config after each schema load
   and exposes it via namedGraphScope(). SparqlTextDocumentService
   uses the scope when present, falling back to AllProfilesScope
   (no GRAPH_NOT_CONFIGURED warnings) when not configured.

3. namedGraphs value type changed from string to array of strings
   (Map<String,String> -> Map<String,List<String>>) so a single
   graph can be mapped to multiple profiles. JSON schema updated
   accordingly.

Relative config keys (e.g. "EQ") are matched against the
urn:x-cimcheck:base/ base so users can write the same short name in
both the query and the config without needing full URIs.
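
The short-name matching can be sketched as follows. This is a simplification, not Jena's IRI resolver: a name without a scheme is simply appended to the fixed base. `GraphNameSketch` is a hypothetical name, and the profile identifiers in the example config are placeholders.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Sketch: normalise graph names against a fixed base, then look up their profiles. */
class GraphNameSketch {
    static final String BASE = "urn:x-cimcheck:base/";
    private static final Pattern HAS_SCHEME = Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]*:.*");

    // Simplification: plain concatenation instead of full IRI reference resolution.
    static String resolve(String name) {
        return HAS_SCHEME.matcher(name).matches() ? name : BASE + name;
    }

    static List<String> profilesFor(String graphInQuery, Map<String, List<String>> namedGraphs) {
        String wanted = resolve(graphInQuery);
        for (Map.Entry<String, List<String>> entry : namedGraphs.entrySet()) {
            if (resolve(entry.getKey()).equals(wanted)) return entry.getValue();
        }
        return Collections.emptyList();  // graph is not configured
    }

    public static void main(String[] args) {
        Map<String, List<String>> namedGraphs = new LinkedHashMap<>();
        namedGraphs.put("EQ", List.of("profile-a", "profile-b"));  // placeholder profile ids
        System.out.println(resolve("EQ"));
        System.out.println(profilesFor("EQ", namedGraphs));
        System.out.println(profilesFor("TP", namedGraphs));
    }
}
```

Because both the query side and the config side go through the same normalisation, the result no longer depends on the JVM working directory.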
- Update namedGraphs examples from string values to array of strings
- Document short relative graph names ("EQ", "TP") matching FROM NAMED <EQ>
- Clarify default behaviour: no namedGraphs = all profiles, no warnings
- Add named graph scope summary table in core README
- Update VS Code README namedGraphs description and example
Common prefixes (rdf, rdfs, owl, xsd, sh, cim, md) are now automatically
prepended to any SPARQL query or update that does not already declare them,
so users no longer need to repeat PREFIX lines in every file.

Custom prefixes can be supplied via "prefixes" in .cgmes/validation.json;
an explicit object replaces the built-in set entirely, and {} disables all
injection. Line numbers in annotations and Jena error messages are adjusted
back to the original query coordinates after injection.
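
One way to picture the injection and the line-number adjustment, as a sketch under assumptions: missing prefixes are prepended one per line, and a reported line is mapped back by subtracting the number of injected lines. `PrefixInjectionSketch` and `Injected` are hypothetical names, and the regex-based "already declared" test is a simplification.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Sketch: prepend undeclared default prefixes and map line numbers back. */
class PrefixInjectionSketch {
    static final class Injected {
        final String text;
        final int addedLines;

        Injected(String text, int addedLines) {
            this.text = text;
            this.addedLines = addedLines;
        }

        // Translate a line number reported on the injected text to the original query.
        int originalLine(int reportedLine) {
            return reportedLine - addedLines;
        }
    }

    static Injected inject(String query, Map<String, String> defaults) {
        StringBuilder header = new StringBuilder();
        int added = 0;
        for (Map.Entry<String, String> e : defaults.entrySet()) {
            Pattern declared = Pattern.compile(
                "(?i)\\bPREFIX\\s+" + Pattern.quote(e.getKey()) + "\\s*:");
            if (!declared.matcher(query).find()) {
                header.append("PREFIX ").append(e.getKey())
                      .append(": <").append(e.getValue()).append(">\n");
                added++;
            }
        }
        return new Injected(header + query, added);
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
        defaults.put("cim", "http://iec.ch/TC57/CIM100#");
        Injected result = inject("SELECT * WHERE { ?s a cim:Terminal }", defaults);
        System.out.println(result.text);
        System.out.println("line 3 maps back to line " + result.originalLine(3));
    }
}
```

Passing an empty map adds nothing, which mirrors the `{}` setting that disables injection.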
Instead of hardcoding cim: -> http://iec.ch/TC57/CIM100# (CGMES 3.0 only),
scan all class and property IRIs in the loaded schema index, tally their
namespace prefixes, and use the dominant one as the cim: default prefix.

A namespace wins when it has >= 10 terms and > 2x as many as the next
candidate — so mixed-version schemas (2.4.15 + 3.0 loaded together) produce
no winner and leave cim: to the user's explicit prefixes config.

The detection runs in DefaultPrefixes.withDetectedCimPrefix() and is called
from the single-arg SparqlValidationApi constructor, SchemaManager, and
ValidateCommand whenever the user has not supplied an explicit prefixes map.
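
The winning rule above (at least 10 terms and more than twice the runner-up) can be sketched like this. It assumes term IRIs are available as strings and that the namespace is everything up to the last `#` or `/`; `DominantNamespaceSketch` is a hypothetical name and not the DefaultPrefixes implementation.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Sketch: pick a namespace only when it clearly dominates the loaded schema. */
class DominantNamespaceSketch {
    static final int MIN_TERMS = 10;

    static Optional<String> dominantNamespace(Collection<String> termIris) {
        Map<String, Integer> counts = new HashMap<>();
        for (String iri : termIris) {
            int cut = Math.max(iri.lastIndexOf('#'), iri.lastIndexOf('/'));
            if (cut >= 0) counts.merge(iri.substring(0, cut + 1), 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        int runnerUp = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int n = e.getValue();
            if (n > bestCount) {
                runnerUp = bestCount;
                bestCount = n;
                best = e.getKey();
            } else if (n > runnerUp) {
                runnerUp = n;
            }
        }
        // Winner needs >= 10 terms and strictly more than 2x the next candidate.
        boolean wins = best != null && bestCount >= MIN_TERMS && bestCount > 2 * runnerUp;
        return wins ? Optional.of(best) : Optional.empty();
    }

    public static void main(String[] args) {
        List<String> iris = new ArrayList<>();
        for (int i = 0; i < 12; i++) iris.add("http://iec.ch/TC57/CIM100#Class" + i);
        for (int i = 0; i < 3; i++) iris.add("http://iec.ch/TC57/2013/CIM-schema-cim16#Class" + i);
        System.out.println(dominantNamespace(iris));
    }
}
```

With 12 terms against 3 the CIM100 namespace wins; with 12 against 8 no namespace wins and the cim: prefix is left to explicit configuration.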
2 participants