perf(postgres): optimize KV storage upsert using executemany #2742

Merged
danielaskdd merged 2 commits into HKUDS:main from wkpark:perf-postgres-batch on Mar 18, 2026

Conversation

@wkpark (Contributor) commented Mar 3, 2026

Description

This PR optimizes document ingestion performance for Postgres-based storage by introducing batch processing for Key-Value (KV) storage operations. By replacing the per-row INSERT calls in a loop with executemany and adding a robust sub-batching mechanism, it addresses a performance bottleneck where even small documents (e.g., <300 kB) took several minutes to process.

Changes

  • PostgreSQLDB.executemany Implementation: Added a dedicated method to the PostgreSQLDB class to support asyncpg's batch execution, ensuring efficient data transmission to Postgres.
  • Refactored PGKVStorage.upsert: Updated the upsert logic for all namespaces (TEXT_CHUNKS, FULL_DOCS, LLM_CACHE, etc.) to collect data into tuples and execute them in a single batch.
  • Sub-batching (Chunking) Logic: Introduced a mechanism to split large data sets into smaller sub-batches (defined by _max_batch_size). This prevents memory spikes and database transaction timeouts when processing massive documents.
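
For reference, a minimal sketch of what the first and third items could look like, assuming an asyncpg connection pool; the table and column names (kv_store, content) and the default batch size are illustrative, not the project's actual schema or settings:

from typing import Any, Iterable

import asyncpg


class PostgreSQLDB:
    def __init__(self, pool: asyncpg.Pool):
        self.pool = pool

    async def executemany(self, sql: str, rows: Iterable[tuple[Any, ...]]) -> None:
        # One prepared statement, many parameter tuples, sent in a single call.
        async with self.pool.acquire() as conn:
            async with conn.transaction():
                await conn.executemany(sql, rows)


class PGKVStorage:
    def __init__(self, db: PostgreSQLDB, max_batch_size: int = 32):
        self.db = db
        self._max_batch_size = max_batch_size

    async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
        upsert_sql = (
            "INSERT INTO kv_store (id, content) VALUES ($1, $2) "
            "ON CONFLICT (id) DO UPDATE SET content = EXCLUDED.content"
        )
        # Collect all rows into parameter tuples ...
        batch_values = [(key, value["content"]) for key, value in data.items()]
        # ... then write them in sub-batches to bound memory use and transaction size.
        for i in range(0, len(batch_values), self._max_batch_size):
            await self.db.executemany(
                upsert_sql, batch_values[i : i + self._max_batch_size]
            )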

Checklist

  • Changes tested locally
  • Code reviewed
  • Documentation updated (if necessary)
  • Unit tests added (if applicable)

Additional Notes

The sub-batching unit is currently synchronized with the embedding_batch_num (default 10-50), which provides a balanced load for both the LLM embedding phase and the subsequent database write phase.
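
As a rough, hypothetical illustration of how that sizing plays out (assuming a sub-batch size of 32 drawn from embedding_batch_num):

import math

embedding_batch_num = 32   # assumed value within the default 10-50 range
rows = 1_000               # e.g. chunks produced for one document
round_trips = math.ceil(rows / embedding_batch_num)
print(round_trips)         # 32 executemany calls instead of 1,000 individual INSERTs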

This PR and PR summary were created with the help of Gemini, with some additional modifications.

wkpark added 2 commits March 3, 2026 03:48
- Implement PostgreSQLDB.executemany to support batch SQL execution
- Refactor PGKVStorage.upsert to use batch processing for all namespaces
- Chunk large data sets into smaller sub-batches using _max_batch_size
@danielaskdd (Collaborator) commented

@codex review

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4d3ae4af13

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +2328 to +2330
for i in range(0, len(batch_values), self._max_batch_size):
    sub_batch = batch_values[i : i + self._max_batch_size]
    await self.db.executemany(upsert_sql, sub_batch)

P2: Stream sub-batches instead of buffering full KV payload

PGKVStorage.upsert now accumulates every row into batch_values and only starts writing inside the sub-batch loop, so large ingestions hold a second full copy of the payload in memory before the first DB write. In the document path (text_chunks.upsert(inserting_chunks)) this can significantly increase peak RSS and cause OOMs on large inputs, whereas the previous per-row implementation never buffered all rows at once. The sub-batching logic should flush incrementally as each sub-batch fills rather than after full accumulation.

Useful? React with 👍 / 👎.
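
For illustration, a hedged sketch of the incremental-flush variant the comment describes; the table/column names and the executemany wrapper are assumptions, not the actual code:

from typing import Any


async def upsert_streaming(
    db: Any,                          # hypothetical wrapper exposing executemany(sql, rows)
    data: dict[str, dict[str, Any]],
    max_batch_size: int = 32,
) -> None:
    upsert_sql = (
        "INSERT INTO kv_store (id, content) VALUES ($1, $2) "
        "ON CONFLICT (id) DO UPDATE SET content = EXCLUDED.content"
    )
    pending: list[tuple[str, Any]] = []
    for key, value in data.items():
        pending.append((key, value["content"]))
        if len(pending) >= max_batch_size:    # flush as soon as a sub-batch fills
            await db.executemany(upsert_sql, pending)
            pending = []
    if pending:                               # flush the final partial sub-batch
        await db.executemany(upsert_sql, pending)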

@danielaskdd merged commit 531a278 into HKUDS:main on Mar 18, 2026
4 checks passed