
feat: extract deterministic_turtle into standalone PyPI package (rdflib-stable-turtle) #6

@jdsika


Context

Upstream linkml maintainers (cmungall, sneakers-the-rat) closed linkml/linkml#3295 with clear guidance: the WL-based serialization logic should live in a separate RDF library, not inside linkml. Meanwhile, linkml/linkml#1943 shows upstream converging on pyoxigraph RDFC-1.0 for canonicalization — but not yet addressing diff stability (small schema change → small output diff).

Our fork's deterministic_turtle() + _wl_signatures() solve both problems (canonicalization and diff stability) via a hybrid pipeline. Extracting them into a standalone package:

  1. Directly addresses cmungall's request: "The implementation should live elsewhere, where others can take advantage of it"
  2. Positions the diff-stability layer as the answer when upstream discovers RDFC-1.0's cascading-renumbering problem
  3. Makes our fork's --deterministic flag a thin wrapper around an import

What the package should contain

Core API

from rdflib_stable_turtle import deterministic_turtle, wl_signatures

# Full pipeline: RDFC-1.0 → WL hashing → idiomatic rdflib Turtle
ttl: str = deterministic_turtle(graph)

# Lower-level: just the WL signatures (for custom pipelines)
signatures: dict[str, str] = wl_signatures(quads)

Implementation (extract from generator.py)

  • _wl_signatures() → public wl_signatures()
  • deterministic_turtle() → public, same name
  • The _to_rdflib() helper and prefix-filtering logic

What stays in linkml

  • deterministic_json() — JSON-specific, no RDF dependency
  • Collection sorting (owl:oneOf, sh:in, sh:ignoredProperties) — generator-specific fixes
  • --deterministic CLI flag — imports from the package
  • _deterministic_context_json() — JSON-LD context-specific ordering

Design considerations

1. RDFC-1.0 round-trip optimization

The current pipeline runs 3 serialize/parse cycles (rdflib→NT→pyoxigraph→WL→rdflib→Turtle). Consider:

  • Computing WL directly on rdflib Graph for the common no-collision case
  • Only invoking pyoxigraph RDFC-1.0 for collision tiebreaking (automorphic nodes)
  • This would eliminate the N-Triples→pyoxigraph→rdflib round-trip overhead
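
A minimal sketch of the first bullet, computing WL-style signatures without any serialize/parse step. It assumes triples are plain (s, p, o) string tuples with blank nodes written as `_:`-prefixed labels; the name `wl_signatures_sketch`, the fixed round count, and the 12-hex-character truncation are illustrative assumptions, not the code in generator.py.

```python
import hashlib
from collections import defaultdict

Triple = tuple[str, str, str]


def _is_bnode(term: str) -> bool:
    return term.startswith("_:")


def _digest(text: str) -> str:
    # 12 hex characters keep 48 bits of the SHA-256 digest.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def wl_signatures_sketch(triples: list[Triple], rounds: int = 4) -> dict[str, str]:
    """Hypothetical helper: signatures that do not depend on blank-node labels."""
    bnodes = {term for s, _, o in triples for term in (s, o) if _is_bnode(term)}
    # Every blank node starts with the same colour.
    sig = {b: _digest("bnode") for b in bnodes}
    for _ in range(rounds):
        neighbourhood: dict[str, list[str]] = defaultdict(list)
        for s, p, o in triples:
            # IRIs and literals are their own colour; blank nodes use the previous round.
            s_colour = sig[s] if _is_bnode(s) else s
            o_colour = sig[o] if _is_bnode(o) else o
            if _is_bnode(s):
                neighbourhood[s].append(f"out|{p}|{o_colour}")
            if _is_bnode(o):
                neighbourhood[o].append(f"in|{p}|{s_colour}")
        # New colour = hash of old colour plus the sorted multiset of neighbour colours.
        sig = {
            b: _digest(sig[b] + "\n" + "\n".join(sorted(neighbourhood[b])))
            for b in bnodes
        }
    return sig
```

Because a signature depends only on a node's neighbourhood, adding an unrelated blank node leaves existing signatures unchanged. Nodes that end up with identical signatures (automorphic nodes or hash collisions) would still need the RDFC-1.0 tiebreak.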

2. pyoxigraph as optional dependency

  • pyoxigraph >= 0.4.0 (for Dataset.canonicalize())
  • Import lazily; if it is missing, raise ImportError with install instructions
  • Test suite should skip gracefully when pyoxigraph is absent
  • Consider an extra for it: pip install rdflib-stable-turtle[fast]
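
One way the lazy import could look, as a sketch. The helper name `require_optional` is hypothetical; the extra name `fast` is taken from the bullet above.

```python
import importlib
from types import ModuleType


def require_optional(
    module_name: str, extra: str, package: str = "rdflib-stable-turtle"
) -> ModuleType:
    """Import an optional dependency on first use, or fail with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature. "
            f"Install it with: pip install '{package}[{extra}]'"
        ) from exc
```

The function that needs canonicalization would call `require_optional("pyoxigraph", "fast")` at its top, so importing the package itself never fails. On the test side, `pytest.importorskip("pyoxigraph")` skips a test module when the dependency is absent.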

3. Diff stability properties to test and document

  • Adding one blank node must NOT renumber existing nodes
  • Removing a blank node must NOT renumber unrelated nodes
  • Modifying a literal must only affect the containing subject block
  • Benchmark: single-description-change diff ≤ 20 lines
  • Benchmark: signal-to-noise ratio ≥ 5x vs non-deterministic (currently 13-344x)
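
The two benchmarks could be measured with the standard library alone. This is a sketch under assumptions: "diff lines" is taken to mean added plus removed lines, and the ratio is taken to mean baseline diff size divided by stable diff size; the issue does not define either, and both function names are hypothetical.

```python
import difflib


def changed_line_count(before: str, after: str) -> int:
    """Count added plus removed lines between two serializations."""
    a, b = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return sum(
        (i2 - i1) + (j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def noise_ratio(baseline_changed: int, stable_changed: int) -> float:
    """How many times larger the baseline diff is than the stable diff."""
    return baseline_changed / max(stable_changed, 1)
```

A test for the first benchmark would serialize a schema before and after a single description change and assert `changed_line_count(old_ttl, new_ttl) <= 20`.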

4. Collection sorting is NOT part of this package

Collection sorting changes graph structure (different rdf:first/rdf:rest triples). This is a generator-level concern, not a serialization concern. The package should serialize faithfully.
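
To illustrate why this is a graph change rather than a serialization choice, the sketch below expands a list into rdf:first/rdf:rest triples using plain tuples. The helper `rdf_list_triples` and the `_:l` node labels are hypothetical.

```python
def rdf_list_triples(items: list[str], prefix: str = "_:l") -> set[tuple[str, str, str]]:
    """Expand an ordered collection into its rdf:first / rdf:rest triples."""
    triples = set()
    for i, item in enumerate(items):
        node = f"{prefix}{i}"
        rest = f"{prefix}{i + 1}" if i + 1 < len(items) else "rdf:nil"
        triples.add((node, "rdf:first", item))
        triples.add((node, "rdf:rest", rest))
    return triples


original = rdf_list_triples(["ex:b", "ex:a"])
reordered = rdf_list_triples(sorted(["ex:b", "ex:a"]))
# Same members, but the rdf:first triples differ, so these are different graphs.
differing = original ^ reordered
```

Here `differing` holds the four rdf:first triples that changed, which is why a faithful serializer cannot reorder collections on its own.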

5. Collision handling

  • 48-bit truncated SHA-256: ~0.002% collision probability at 100K blank nodes (birthday bound: n² / 2⁴⁹ ≈ 1.8 × 10⁻⁵)
  • Counter-based tiebreaker via sorted(bnode_ids) from RDFC-1.0 canonical ordering
  • WL is provably complete for tree-structured BNodes (LinkML's case)
  • Known failures (Cai-Fürer-Immerman symmetric graphs) don't occur in LinkML output
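
The collision figure and the tiebreaker can be checked with a short sketch. `collision_probability` uses the standard birthday approximation; `assign_labels` is a hypothetical reading of the counter-based tiebreaker, assuming the dictionary keys are the RDFC-1.0 canonical ids and the `_1`, `_2` suffix format is illustrative only.

```python
import math


def collision_probability(n_items: int, bits: int) -> float:
    """Birthday approximation: chance of at least one collision among n_items hashes."""
    space = 2 ** bits
    return -math.expm1(-n_items * (n_items - 1) / (2 * space))


def assign_labels(signatures: dict[str, str]) -> dict[str, str]:
    """Map canonical ids to labels; repeated signatures get a counter suffix."""
    seen: dict[str, int] = {}
    labels: dict[str, str] = {}
    for bnode_id in sorted(signatures):  # canonical ids give a deterministic order
        sig = signatures[bnode_id]
        count = seen.get(sig, 0)
        labels[bnode_id] = sig if count == 0 else f"{sig}_{count}"
        seen[sig] = count + 1
    return labels


p = collision_probability(100_000, 48)  # roughly 1.8e-5, i.e. about 0.002%
```

Since the suffix depends on the canonical order, a tiebreak inside one colliding group can still shift when RDFC-1.0 ids change, but that is confined to the group instead of cascading through the whole file.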

6. Package metadata

  • Name: rdflib-stable-turtle (follows rdflib-* convention)
  • License: Apache-2.0
  • Deps: rdflib >= 7.0.0, optional pyoxigraph >= 0.4.0
  • Python: >= 3.10

7. Upstream strategy

References
