perf(postgres): optimize KV storage upsert using executemany #2742
danielaskdd merged 2 commits into HKUDS:main
Conversation
- Implement PostgreSQLDB.executemany to support batch SQL execution
- Refactor PGKVStorage.upsert to use batch processing for all namespaces
- Chunk large data sets into smaller sub-batches using _max_batch_size
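The commit summary above can be sketched roughly as follows. This is an illustrative stand-in for the PR's actual `PGKVStorage.upsert` internals: `batched_upsert`, `MAX_BATCH_SIZE`, and the loosely typed `db` parameter are hypothetical names, not code from the repository.

```python
# Hypothetical sketch of the batched upsert path: rows are collected as
# tuples, then written in sub-batches so no single executemany call
# exceeds MAX_BATCH_SIZE rows (the PR uses self._max_batch_size).
from typing import Any, Sequence

MAX_BATCH_SIZE = 500  # stand-in for self._max_batch_size


async def batched_upsert(db: Any, upsert_sql: str,
                         batch_values: Sequence[tuple]) -> None:
    """Execute upsert_sql against batch_values in fixed-size chunks."""
    for i in range(0, len(batch_values), MAX_BATCH_SIZE):
        sub_batch = batch_values[i : i + MAX_BATCH_SIZE]
        await db.executemany(upsert_sql, sub_batch)
```

Compared with one `INSERT` per row, each `executemany` call sends a whole sub-batch in a single prepared-statement execution, which is where the ingestion speedup comes from.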
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4d3ae4af13
```python
for i in range(0, len(batch_values), self._max_batch_size):
    sub_batch = batch_values[i : i + self._max_batch_size]
    await self.db.executemany(upsert_sql, sub_batch)
```
Stream sub-batches instead of buffering full KV payload
PGKVStorage.upsert now accumulates every row into batch_values and only starts writing inside the sub-batch loop, which means large ingestions hold a second full copy of the payload in memory before the first DB write. In the document path (text_chunks.upsert(inserting_chunks)), this can significantly increase peak RSS and cause OOMs on large inputs, whereas the previous per-row implementation never buffered all rows at once. The sub-batching logic should flush incrementally as each chunk fills, rather than after full accumulation.
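One way to act on this suggestion is to flush each sub-batch as soon as it fills, so peak memory is bounded by one sub-batch rather than the whole payload. The sketch below is hypothetical: `streaming_upsert` and its parameters are invented for illustration and are not part of the PR.

```python
# Illustrative incremental variant: consume rows from any iterable and
# flush a bounded buffer whenever it reaches max_batch_size, instead of
# materializing the full list of rows before the first DB write.
from typing import Any, Iterable


async def streaming_upsert(db: Any, upsert_sql: str,
                           rows: Iterable[tuple],
                           max_batch_size: int = 500) -> None:
    buffer: list[tuple] = []
    for row in rows:
        buffer.append(row)
        if len(buffer) >= max_batch_size:
            await db.executemany(upsert_sql, buffer)
            buffer.clear()
    if buffer:  # flush the final partial batch
        await db.executemany(upsert_sql, buffer)
```

Because `rows` can be a generator, the caller never needs to hold a second full copy of the KV payload in memory, which addresses the peak-RSS concern raised in the review.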
Description
This PR optimizes document ingestion performance for Postgres-based storage by introducing batch processing
for Key-Value (KV) storage operations. By replacing individual INSERT calls in a loop with executemany and adding a robust sub-batching mechanism,
it addresses performance bottlenecks where even small documents (e.g., <300 kB) took several minutes to process.
Changes
- PostgreSQLDB.executemany implementation: Added a dedicated method to the PostgreSQLDB class to support asyncpg's batch execution, ensuring efficient data transmission to Postgres.
- PGKVStorage.upsert: Updated the upsert logic for all namespaces (TEXT_CHUNKS, FULL_DOCS, LLM_CACHE, etc.) to collect data into tuples and execute them in a single batch.
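A minimal sketch of what such an `executemany` wrapper might look like on top of an asyncpg connection pool. The class shape and `pool` attribute are assumptions for illustration, not the PR's actual code; the type is kept loose so the snippet stays self-contained.

```python
# Hypothetical wrapper around asyncpg's Connection.executemany. The pool
# is assumed to behave like asyncpg.Pool (acquire() usable as an async
# context manager, connections exposing transaction() and executemany()).
from typing import Any


class PostgreSQLDB:
    def __init__(self, pool: Any):
        self.pool = pool

    async def executemany(self, sql: str, data: list[tuple]) -> None:
        # asyncpg prepares the statement once and streams every row,
        # avoiding one network round-trip per INSERT.
        async with self.pool.acquire() as conn:
            async with conn.transaction():
                await conn.executemany(sql, data)
```

Running the batch inside a single transaction keeps the sub-batch atomic; a failure rolls back the whole sub-batch rather than leaving it partially written.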
Additional Notes
The sub-batching unit is currently synchronized with the embedding_batch_num (default 10-50), which provides a balanced load for both the LLM embedding phase and the subsequent database write phase.
This PR and PR summary were created with the help of Gemini, with some additional modifications.