
perf(opensearch): Remove refresh="wait_for" from OpenSearch storage backends#2786

Merged
danielaskdd merged 6 commits into HKUDS:main from LantaoJin:pr/opensearch_optimization
Mar 17, 2026

Conversation

@LantaoJin
Contributor

@LantaoJin LantaoJin commented Mar 16, 2026

Description

Every individual upsert, delete, and delete_by_query call in the OpenSearch storage backends used refresh="wait_for", which forces OpenSearch to block until a segment refresh completes before returning. With ~1,900 storage operations during a typical ainsert, this added roughly 1,400s of unnecessary waiting (1,668s total vs. 265s for the default file-based storage).

Remove refresh="wait_for" from all individual write operations across the OpenSearch storage classes, relying instead on OpenSearch's translog for read-after-write consistency and on index_done_callback() for a batch-level refresh. This eliminates a ~0.9s penalty per operation, yielding a 7.1x improvement in total storage time during document indexing.
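The change itself is mechanical. As a minimal sketch (function names are illustrative, assuming an opensearch-py-style client that takes keyword arguments), it amounts to dropping the per-call refresh kwarg and deferring refresh to the batch callback:

```python
# Illustrative sketch of the change (not the repository's actual code):
# the shape of the kwargs passed to an opensearch-py-style client, before
# and after removing the per-operation refresh.

def upsert_kwargs_before(index: str, doc_id: str, doc: dict) -> dict:
    # Old behavior: every single write blocked (~0.9s) until a segment refresh.
    return {"index": index, "id": doc_id, "body": doc, "refresh": "wait_for"}

def upsert_kwargs_after(index: str, doc_id: str, doc: dict) -> dict:
    # New behavior: the acknowledged write lands in the translog, so ID-based
    # reads (get/mget) already see it; search visibility is deferred to one
    # explicit refresh in index_done_callback().
    return {"index": index, "id": doc_id, "body": doc}
```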

Change details

delete_by_query calls (4 total) — all now have conflicts="proceed":

┌───────────┬──────────────────────────────────────────┬───────────────────────────────┐
│ Location  │                  Method                  │               Status          │
├───────────┼──────────────────────────────────────────┼───────────────────────────────┤
│ Line 1456 │ GraphStorage.delete_node()               │ conflicts="proceed"           │
├───────────┼──────────────────────────────────────────┼───────────────────────────────┤
│ Line 1489 │ GraphStorage.remove_nodes()              │ conflicts="proceed"           │
├───────────┼──────────────────────────────────────────┼───────────────────────────────┤
│ Line 1535 │ GraphStorage.remove_edges()              │ conflicts="proceed"           │
├───────────┼──────────────────────────────────────────┼───────────────────────────────┤
│ Line 2649 │ VectorDBStorage.delete_entity_relation() │ conflicts="proceed"           │
└───────────┴──────────────────────────────────────────┴───────────────────────────────┘
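For reference, the four sites above all share the same request shape; a hedged sketch (index and field names here are hypothetical, not taken from the repository):

```python
# Illustrative kwargs for the delete_by_query sites listed above.

def remove_edges_kwargs(source_id: str) -> dict:
    return {
        "index": "edges",
        "body": {"query": {"term": {"source_id": source_id}}},
        # Skip documents that changed or disappeared between the search phase
        # and the delete phase, instead of failing the whole request with
        # version_conflict_engine_exception.
        "conflicts": "proceed",
    }
```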

refresh="wait_for" — intentionally retained in 2 places:

┌──────────┬───────────────────────────┬───────────────────────────────────────────────┐
│ Location │          Method           │                    Reason                     │
├──────────┼───────────────────────────┼───────────────────────────────────────────────┤
│ Line 586 │ DocStatusStorage.upsert() │ Downstream get_docs_by_status uses search API │
├──────────┼───────────────────────────┼───────────────────────────────────────────────┤
│ Line 851 │ DocStatusStorage.delete() │ Downstream readers use search API             │
└──────────┴───────────────────────────┴───────────────────────────────────────────────┘

Performance profiling (without LLM cache, remote embedding model) shows a 7.1x speedup (1,668s → 234.5s):

┌────────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│    Category    │ Default    │ OpenSearch │ This PR    │ MongoDB    │ Postgres   │
├────────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│ VECTOR_STORAGE │ 263.7s     │ 749.0s     │ 213.0s     │ 236.9s     │ 233.4s     │
├────────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│ GRAPH_STORAGE  │ 0.52s      │ 481.7s     │ 8.2s       │ 7.9s       │ 10.3s      │
├────────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│ KV_STORAGE     │ 1.08s      │ 434.9s     │ 10.3s      │ 0.9s       │ 6.6s       │
├────────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│ DOC_STATUS     │ 0.06s      │ 2.5s       │ 3.0s       │ 0.07s      │ 0.3s       │
├────────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│ GRAND TOTAL    │ 265.4s     │ 1,668.1s   │ 234.5s     │ 245.8s     │ 250.7s     │
├────────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│ E2E DURATION   │ 70s        │ 605s       │ 300s       │ 397s       │ 358s       │
└────────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘

Profiling (without LLM cache)

Before

================================================================================
STORAGE PROFILING REPORT
================================================================================

  [DOC_STATUS_STORAGE]  total: 2.5121s  calls: 8
    upsert                                      2.3110s  (3 calls)
    get_docs_by_status                          0.1890s  (3 calls)
    index_done_callback                         0.0076s  (1 calls)
    filter_keys                                 0.0046s  (1 calls)

  [GRAPH_STORAGE]  total: 481.7155s  calls: 1516
    upsert_node                               252.1001s  (266 calls)
    upsert_edge                               223.6067s  (244 calls)
    get_node                                    3.4759s  (747 calls)
    has_edge                                    2.0792s  (244 calls)
    get_edges_batch                             0.3045s  (4 calls)
    index_done_callback                         0.0491s  (1 calls)
    node_degrees_batch                          0.0470s  (2 calls)
    get_nodes_edges_batch                       0.0332s  (2 calls)
    get_nodes_batch                             0.0197s  (4 calls)
    edge_degrees_batch                          0.0000s  (2 calls)

  [KV_STORAGE]  total: 434.8673s  calls: 1886
    upsert                                    429.1647s  (651 calls)
    get_by_id                                   5.6020s  (1221 calls)
    index_done_callback                         0.0800s  (12 calls)
    get_by_ids                                  0.0206s  (2 calls)

  [VECTOR_STORAGE]  total: 749.0035s  calls: 765
    upsert                                    620.2629s  (511 calls)
    delete                                    127.7973s  (244 calls)
    query                                       0.6355s  (5 calls)
    get_vectors_by_ids                          0.2081s  (2 calls)
    index_done_callback                         0.0996s  (3 calls)

  ────────────────────────────────────────────────────────────
  CATEGORY SUMMARY                                TIME  % OF TOTAL
  ────────────────────────────────────────────────────────────
  VECTOR_STORAGE                             749.0035s       44.9%
  GRAPH_STORAGE                              481.7155s       28.9%
  KV_STORAGE                                 434.8673s       26.1%
  DOC_STATUS_STORAGE                           2.5121s        0.2%
  ────────────────────────────────────────────────────────────
  GRAND TOTAL                               1668.0984s
================================================================================

After

================================================================================
STORAGE PROFILING REPORT
================================================================================

  [DOC_STATUS_STORAGE]  total: 3.0017s  calls: 8
    upsert                                      2.9492s  (3 calls)
    get_docs_by_status                          0.0457s  (3 calls)
    index_done_callback                         0.0048s  (1 calls)
    filter_keys                                 0.0021s  (1 calls)

  [GRAPH_STORAGE]  total: 8.3042s  calls: 1439
    upsert_edge                                 2.8464s  (236 calls)
    upsert_node                                 2.1988s  (242 calls)
    get_node                                    1.6704s  (710 calls)
    has_edge                                    1.3135s  (236 calls)
    get_edges_batch                             0.2187s  (4 calls)
    get_nodes_edges_batch                       0.0168s  (2 calls)
    node_degrees_batch                          0.0157s  (2 calls)
    get_nodes_batch                             0.0128s  (4 calls)
    index_done_callback                         0.0111s  (1 calls)
    edge_degrees_batch                          0.0000s  (2 calls)

  [KV_STORAGE]  total: 10.3602s  calls: 1810
    upsert                                      6.4499s  (618 calls)
    get_by_id                                   3.7595s  (1178 calls)
    index_done_callback                         0.1366s  (12 calls)
    get_by_ids                                  0.0142s  (2 calls)

  [VECTOR_STORAGE]  total: 212.8049s  calls: 725
    upsert                                    210.6248s  (479 calls)
    delete                                      1.2301s  (236 calls)
    query                                       0.6297s  (5 calls)
    get_vectors_by_ids                          0.1785s  (2 calls)
    index_done_callback                         0.1417s  (3 calls)

  ────────────────────────────────────────────────────────────
  CATEGORY SUMMARY                                TIME  % OF TOTAL
  ────────────────────────────────────────────────────────────
  VECTOR_STORAGE                             212.8049s       90.8%
  KV_STORAGE                                  10.3602s        4.4%
  GRAPH_STORAGE                                8.3042s        3.5%
  DOC_STATUS_STORAGE                           3.0017s        1.3%
  ────────────────────────────────────────────────────────────
  GRAND TOTAL                                234.4710s
================================================================================

Related Issues

#2785

Checklist

  • Changes tested locally
  • Code reviewed
  • Documentation updated (if necessary)
  • Unit tests added (if applicable)




@danielaskdd
Collaborator

Please keep refresh="wait_for" in OpenSearchDocStatusStorage.delete().

DocStatus is different from the graph/entity ID-read paths: its important readers are search-based (get_docs_by_status, get_docs_by_track_id, get_docs_paginated, get_status_counts), so an acknowledged bulk delete is not sufficient to guarantee immediate visibility to subsequent status queries.

Without refresh, a just-deleted doc status can still show up briefly in scheduler/API/UI reads until the next refresh cycle. The main document-deletion flow eventually calls _insert_done(), but not every caller of doc_status.delete() has that guarantee; for example, _validate_and_fix_document_consistency() deletes inconsistent status rows directly.

So this method should keep its own write-then-search consistency contract, for the same reason DocStatusStorage.upsert() still keeps refresh="wait_for".
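The distinction between ID-based and search-based readers above can be illustrated with a self-contained toy model of OpenSearch visibility (purely illustrative; real near-real-time semantics are more involved):

```python
# Toy model: ID-based GETs are realtime via the translog, while search only
# sees documents that have been refreshed into segments.

class ToyIndex:
    def __init__(self):
        self.translog = {}   # realtime view: every acknowledged write
        self.segments = {}   # search view: only refreshed writes

    def upsert(self, doc_id, doc):
        self.translog[doc_id] = doc       # acknowledged immediately

    def get(self, doc_id):
        return self.translog.get(doc_id)  # ID read: sees unrefreshed writes

    def search(self, pred):
        return [d for d in self.segments.values() if pred(d)]  # refresh-bound

    def refresh(self):
        self.segments = dict(self.translog)

idx = ToyIndex()
idx.upsert("doc-1", {"status": "processed"})
assert idx.get("doc-1") is not None                            # visible now
assert idx.search(lambda d: d["status"] == "processed") == []  # not yet
idx.refresh()
assert len(idx.search(lambda d: d["status"] == "processed")) == 1
```

This is why dropping per-op refresh is safe for get/mget paths but not for DocStatus, whose readers all go through search.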

@danielaskdd
Collaborator

Please add short comments at the KV / graph / vector storage mutation sites explaining why removing per-operation refresh does not break consistency for the current call paths, and explicitly warn future maintainers to re-evaluate if they introduce immediate search-based reads after mutation.

Suggested scope of the rationale:

  • KV storage: immediate correctness-sensitive reads are ID-based (mget) or happen only after batch-level index_done_callback().
  • Vector storage: same idea; immediate lookups are ID-based, while search visibility is expected after explicit batch refresh.
  • Graph storage: please scope this comment narrowly. The “safe without per-op refresh” argument applies to node / ID-based realtime reads (mget / exists) and keyed-lock serialization, but it does not automatically apply to edge paths that still read via search. Please avoid a blanket comment that says graph storage is generally safe without refresh.
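A hypothetical example of the kind of mutation-site comment being requested (class, method, and wording are illustrative, not the repository's actual code):

```python
# Sketch of a KV-storage mutation site with the requested rationale comment.

class KVStorageSketch:
    def __init__(self, client):
        self.client = client

    def upsert(self, index: str, doc_id: str, doc: dict) -> None:
        # No per-operation refresh here: all correctness-sensitive reads on
        # this index are ID-based (get/mget), which OpenSearch serves
        # realtime from the translog. Search visibility is only required
        # after the batch-level index_done_callback(), which performs one
        # explicit refresh.
        # WARNING for future maintainers: re-evaluate this if you introduce
        # a search-based read immediately after a mutation.
        self.client.index(index=index, id=doc_id, body=doc)
```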

@LantaoJin
Contributor Author

LantaoJin commented Mar 17, 2026

So this method should keep its own write-then-search consistency contract, for the same reason DocStatusStorage.upsert() still keeps refresh="wait_for".

Good catch. I'll restore refresh="wait_for" for DocStatusStorage.delete().

@danielaskdd
Collaborator

Removing refresh from the graph delete paths introduced a real regression for node deletion

remove_edges() and remove_nodes() both use delete_by_query on the edges index. In the document-deletion flow, we can first delete relationships in step 6 and then later call remove_nodes() for affected entities. Without an intervening refresh, the later search-based edge lookup / delete_by_query can still see stale edge hits from the pre-delete search view, and then fail with version_conflict_engine_exception ... but no document was found when OpenSearch executes the delete phase.

The reported failure shape matches that exactly: total > 0, deleted = 0, version_conflicts = N, with per-doc causes saying the edge document was not found.

Relevant call chain:

  • remove_edges() deletes edges first
  • later get_nodes_edges_batch() / remove_nodes() still operate on the edges index
  • both rely on search-based visibility, not realtime ID reads

So for graph delete paths, removing refresh is not just a performance optimization. At minimum, please restore refresh on the edge delete_by_query operations (delete_node, remove_nodes, remove_edges), or otherwise make these deletes explicitly tolerant of stale search views (for example by using conflicts="proceed" and auditing the remaining read-after-delete assumptions).
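To make the reported failure shape concrete, here is a self-contained toy model (illustrative, not real client behavior) in which delete_by_query matches documents against a stale, unrefreshed search view and then cannot find them at delete time:

```python
# Toy reproduction of "total > 0, deleted = 0, version_conflicts = N".

def delete_by_query(translog, segments, pred, conflicts="abort"):
    hits = [k for k, d in segments.items() if pred(d)]  # search phase (stale)
    result = {"total": len(hits), "deleted": 0, "version_conflicts": 0}
    for key in hits:
        if key in translog:
            del translog[key]
            result["deleted"] += 1
        elif conflicts == "proceed":
            result["version_conflicts"] += 1  # tolerated; request succeeds
        else:
            raise RuntimeError("version_conflict_engine_exception")
    return result

# Edge e1 was already deleted (translog empty) but no refresh ran, so the
# search view still contains it.
stale_view = {"e1": {"source_id": "A"}}
result = delete_by_query({}, stale_view, lambda d: d["source_id"] == "A",
                         conflicts="proceed")
assert result == {"total": 1, "deleted": 0, "version_conflicts": 1}
```

With the default conflicts="abort", the same stale hit would fail the whole request, which matches the regression described above.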

@danielaskdd
Collaborator

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

@danielaskdd danielaskdd merged commit e3b4cec into HKUDS:main Mar 17, 2026
3 checks passed
@LantaoJin LantaoJin deleted the pr/opensearch_optimization branch March 20, 2026 08:45