fix: sanitize workspace label in Memgraph/Neo4j to prevent Cypher injection #2713

Closed
furkankoykiran wants to merge 5 commits into HKUDS:main from furkankoykiran:fix/cypher-injection-workspace-label

Conversation

@furkankoykiran
Contributor

Summary

Fixes #2698 — Cypher injection via unsanitized LIGHTRAG-WORKSPACE header in Memgraph/Neo4j.

Problem

The _get_workspace_label() method in both MemgraphStorage and Neo4JStorage returned self.workspace directly without sanitization. Since the workspace value can originate from the LIGHTRAG-WORKSPACE HTTP header, an attacker could inject arbitrary Cypher queries by manipulating this header.

The PostgreSQL implementation (PGGraphStorage._get_workspace_graph_name()) already had proper sanitization using re.sub(r"[^a-zA-Z0-9_]", "_", ...).
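
To make the problem concrete, here is a minimal sketch of how a workspace value could break out of a backtick-quoted label, and what the PostgreSQL-style regex does to the same input. The query template, the payload, and the helper names `build_query` and `sanitize_like_postgres` are illustrative assumptions, not code from the repository.

```python
import re


def build_query(label: str) -> str:
    # Hypothetical template: the label is interpolated directly into a
    # backtick-quoted identifier position of a Cypher query.
    return f"MATCH (n:`{label}`) RETURN n"


def sanitize_like_postgres(workspace: str) -> str:
    # Regex pattern quoted in the PR description; the "base" fallback for
    # an empty result also follows the description.
    safe = re.sub(r"[^a-zA-Z0-9_]", "_", workspace.strip())
    return safe or "base"


# Illustrative payload: the backtick closes the quoted label early.
payload = "x`) DETACH DELETE n //"

unsafe_query = build_query(payload)
safe_query = build_query(sanitize_like_postgres(payload))

print(unsafe_query)
print(safe_query)
```

In the unsanitized query the text after the first backtick is parsed as Cypher rather than as part of the label; after sanitization the whole payload stays inside the quoted identifier.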

Changes

lightrag/kg/memgraph_impl.py

  • Added import re
  • Rewrote _get_workspace_label() to sanitize workspace using the same regex pattern as PostgreSQL: every character outside [a-zA-Z0-9_] is replaced with "_", and the label falls back to "base" if the result is empty

lightrag/kg/neo4j_impl.py

  • Rewrote _get_workspace_label() with identical sanitization (re was already imported)

tests/test_workspace_sanitization.py (NEW)

  • 20 offline unit tests covering normal inputs, special character replacement, Cypher injection payloads, empty/whitespace fallback, and edge cases
  • All 20 tests pass ✅

Test Results

tests/test_workspace_sanitization.py  20 passed in 0.04s

…ection

Apply regex sanitization to _get_workspace_label() in both MemgraphStorage
and Neo4JStorage, matching the pattern already used in PGGraphStorage.

This prevents Cypher injection via the LIGHTRAG-WORKSPACE HTTP header by
stripping all characters except alphanumeric and underscore.

Closes HKUDS#2698
@danielaskdd
Collaborator

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ac79a0b7be

Comment thread lightrag/kg/neo4j_impl.py Outdated
Strips all characters except alphanumeric and underscore to prevent
Cypher injection via the LIGHTRAG-WORKSPACE header.
"""
safe_workspace = re.sub(r"[^a-zA-Z0-9_]", "_", self.workspace.strip())

P1 Badge Keep Neo4j workspace labels one-to-one after sanitization

Replacing every non-[a-zA-Z0-9_] character with _ in _get_workspace_label() is non-injective, so distinct workspace headers like team-a and team_a now map to the same graph label; because the API currently accepts arbitrary LIGHTRAG-WORKSPACE values (only .strip() is applied in lightrag/api/lightrag_server.py), this can merge tenant data, allow cross-workspace reads/writes, and hide previously indexed data that used unsanitized labels before this change.
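
The collision described above can be reproduced with a few lines. This is a sketch that reuses the regex from the diff; the `sanitize` helper and the third workspace name are assumptions added for illustration.

```python
import re


def sanitize(workspace: str) -> str:
    # Same substitution as the reviewed line in _get_workspace_label().
    return re.sub(r"[^a-zA-Z0-9_]", "_", workspace.strip())


# "team-a" and "team_a" come from the review comment; "team.a" is an
# extra illustrative value.
workspaces = ["team-a", "team_a", "team.a"]
labels = {ws: sanitize(ws) for ws in workspaces}
distinct_labels = set(labels.values())

print(labels)
print(len(distinct_labels))
```

Three different header values end up sharing a single label, which is why the reviewer treats the mapping as a tenant-isolation problem rather than a cosmetic one.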

Comment thread lightrag/kg/memgraph_impl.py Outdated
Strips all characters except alphanumeric and underscore to prevent
Cypher injection via the LIGHTRAG-WORKSPACE header.
"""
safe_workspace = re.sub(r"[^a-zA-Z0-9_]", "_", self.workspace.strip())

P1 Badge Keep Memgraph workspace labels one-to-one after sanitization

This sanitization has the same collision problem in Memgraph: different workspace names collapse to the same label (for example, project-a and project_a), which breaks workspace isolation and can expose one workspace’s graph to another when headers are user-controlled; it also makes existing nodes/edges under old unsanitized labels effectively unreachable after upgrade.

Copilot AI review requested due to automatic review settings March 5, 2026 20:16
@furkankoykiran
Contributor Author

@codex review

@chatgpt-codex-connector

To use Codex here, create a Codex account and connect to github.

Contributor

Copilot AI left a comment

Pull request overview

Mitigates Cypher injection risk by sanitizing workspace-derived labels used by Memgraph/Neo4j backends and adding regression tests targeting malicious LIGHTRAG-WORKSPACE inputs.

Changes:

  • Sanitize _get_workspace_label() output in Neo4j and Memgraph implementations to neutralize backtick-based injection.
  • Add a new offline pytest suite covering workspace sanitization edge cases and injection payloads.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File | Description
lightrag/kg/memgraph_impl.py | Sanitizes workspace label before it is used in Cypher.
lightrag/kg/neo4j_impl.py | Sanitizes workspace label before it is used in Cypher.
tests/test_workspace_sanitization.py | Adds offline tests intended to prevent regressions in workspace sanitization.

Comment thread tests/test_workspace_sanitization.py Outdated
Comment thread lightrag/kg/memgraph_impl.py
Comment thread lightrag/kg/neo4j_impl.py Outdated
Comment thread tests/test_workspace_sanitization.py
@furkankoykiran
Contributor Author

furkankoykiran commented Mar 6, 2026

PR Update Summary 🚀

I've completed a round of refinements to address the feedback and improve the implementation:

  1. Refined Workspace Isolation: Replaced the previous sanitization (which could cause collisions, e.g., 'team-a' vs 'team_a') with a backtick-escaping approach. This ensures a 1-to-1 mapping for all workspace names while staying secure against Cypher injection.
  2. Addressed Review Feedback:
    • Updated docstrings in Neo4j and Memgraph to clarify identifier safety when used with backticks.
    • Removed unused imports.
  3. Improved Test Robustness: Overhauled 'tests/test_workspace_sanitization.py' to validate the logic directly. The updated tests cover various scenarios and verify the 1-to-1 mapping.
  4. Formatting & Quality: Ensured the code complies with the project's pre-commit hooks.
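
A minimal sketch of the backtick-escaping idea from item 1, assuming the escape is done by doubling each backtick (which matches the doubled-backtick output mentioned in the later code review). The helper names `escape_label` and `quoted_label` are hypothetical and are not taken from the PR.

```python
def escape_label(workspace: str) -> str:
    # Double every backtick so the value cannot terminate a `...` quoted
    # identifier. All other characters are kept, so distinct stripped
    # inputs always produce distinct outputs.
    return workspace.strip().replace("`", "``")


def quoted_label(workspace: str) -> str:
    # The escaped value is only safe when wrapped in backticks.
    return f"`{escape_label(workspace)}`"


print(quoted_label("team-a"))
print(quoted_label("team_a"))
print(quoted_label("foo`bar"))
```

Unlike the regex replacement, "team-a" and "team_a" stay distinct here, but the escaped value is only meaningful inside a backtick-quoted identifier position.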

All tests for workspace sanitization are passing locally. Ready for review.

cc @danielaskdd @HKUDS/maintainers

@danielaskdd
Collaborator

Code review

Found 3 issues:

  1. Escaped label used in Python string equality comparisons against driver-returned (unescaped) labels: the database stores foo`bar but _get_workspace_label() returns foo``bar, so the filter if label != workspace_label never excludes the workspace label for workspaces containing backticks, meaning the workspace label is never stripped from node_dict["labels"].

if "labels" in node_dict:
    node_dict["labels"] = [
        label
        for label in node_dict["labels"]
        if label != workspace_label
    ]
# logger.debug(f"Neo4j query node {query} return: {node_dict}")
return node_dict

Same issue in memgraph:

if "labels" in node_dict:
    node_dict["labels"] = [
        label
        for label in node_dict["labels"]
        if label != workspace_label
    ]
return node_dict
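
The mismatch in both snippets can be shown without a database. In this sketch the node dictionary, the "Person" label, and the helper `strip_workspace_label` are illustrative assumptions; the point is only that the Python-side comparison needs the raw label, while the escaped form belongs in the query text.

```python
raw_label = "foo`bar"  # what the driver would return in labels
escaped_label = raw_label.replace("`", "``")  # form used between backticks

# Hypothetical node as returned from a query.
node_dict = {"labels": [raw_label, "Person"]}


def strip_workspace_label(labels: list[str], workspace_label: str) -> list[str]:
    # Same filter as in the reviewed code.
    return [label for label in labels if label != workspace_label]


with_escaped = strip_workspace_label(node_dict["labels"], escaped_label)
with_raw = strip_workspace_label(node_dict["labels"], raw_label)

print(with_escaped)
print(with_raw)
```

Comparing against the escaped value leaves the workspace label in place, while comparing against the raw value removes it. One possible design is to keep two accessors, one for the raw label and one for the query-safe form.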

  2. Memgraph index creation uses workspace_label without backtick wrapping: Neo4j correctly wraps it (`{workspace_label}`), but Memgraph emits CREATE INDEX ON :{workspace_label}(entity_id) directly. If the workspace contains a backtick, _get_workspace_label() returns doubled backticks (e.g., foo``bar) which produces invalid Cypher at this unquoted position, causing initialize() to fail.

# Create index for base nodes on entity_id if it doesn't exist
try:
    workspace_label = self._get_workspace_label()
    await session.run(
        f"""CREATE INDEX ON :{workspace_label}(entity_id)"""
    )
    logger.info(
        f"[{self.workspace}] Created index on :{workspace_label}(entity_id) in Memgraph."
    )
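
The difference between the unquoted and quoted statement text can be sketched as plain string building. The helper `index_statement` and its `quote` flag are hypothetical, and the sketch assumes, as the review comment implies, that Memgraph accepts a backtick-quoted label in this position.

```python
def escape_label(workspace: str) -> str:
    # Assumed escaping: double each backtick.
    return workspace.replace("`", "``")


def index_statement(workspace: str, quote: bool = True) -> str:
    label = escape_label(workspace)
    # Wrap the escaped label in backticks unless quoting is disabled.
    target = f"`{label}`" if quote else label
    return f"CREATE INDEX ON :{target}(entity_id)"


unquoted = index_statement("foo`bar", quote=False)
quoted = index_statement("foo`bar")

print(unquoted)
print(quoted)
```

The unquoted form places the doubled backticks in a bare identifier position, which is the case the reviewer flags as invalid.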

  3. workspace_label used as a Cypher string literal instead of a backtick-quoted identifier: the query WHERE '{workspace_label}' IN labels(n) embeds the escaped value (e.g., foo``bar) as a plain string, but Cypher string literals do not interpret doubled backticks as escapes, so it will never match the actual stored label (foo`bar), silently returning no graph results for any workspace name containing a backtick.

MATCH (start:`{workspace_label}`)
WHERE start.entity_id = $entity_id
MATCH path = (start)-[*BFS 0..{max_depth}]-(end:`{workspace_label}`)
WHERE ALL(n IN nodes(path) WHERE '{workspace_label}' IN labels(n))
WITH collect(DISTINCT end) + start AS all_nodes_unlimited
WITH
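
One way to avoid the string-literal mismatch is to pass the raw label as a query parameter for the labels(n) check and keep the escaped form only in identifier positions. The sketch below is an assumption about how that could look, not the fix that was eventually merged; `build_bfs_query` is a hypothetical helper, and the query is shortened to end in RETURN path instead of the original continuation.

```python
def build_bfs_query(workspace: str, max_depth: int) -> tuple[str, dict]:
    # Escaped form for backtick-quoted identifier positions.
    escaped = workspace.replace("`", "``")
    query = (
        f"MATCH (start:`{escaped}`)\n"
        "WHERE start.entity_id = $entity_id\n"
        f"MATCH path = (start)-[*BFS 0..{int(max_depth)}]-(end:`{escaped}`)\n"
        # The raw label is compared as a parameter value, so no escaping
        # rules for identifiers or string literals apply to it.
        "WHERE ALL(n IN nodes(path) WHERE $workspace_label IN labels(n))\n"
        "RETURN path"
    )
    params = {"workspace_label": workspace}
    return query, params


query, params = build_bfs_query("foo`bar", 3)
print(query)
print(params)
```

The caller would still need to supply entity_id alongside workspace_label when running the query.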

🤖 Generated with Claude Code

@danielaskdd
Collaborator

PR #2792 resolved this issue.

Development

Successfully merging this pull request may close these issues.

Security: Cypher injection via unsanitized LIGHTRAG-WORKSPACE header in Memgraph/Neo4j

3 participants