fix: Handle Cosmos DB replication lag during concurrent batch creation #392
Merged
Roopan-Microsoft merged 10 commits into dev on Apr 6, 2026
fix: merging from dev to main
…g and concurrency - Added asyncio support and a lock mechanism in DatabaseFactory to ensure thread safety. - Implemented retry logic with backoff for reading existing batches in CosmosDBClient. - Updated batch service to handle existing batch entries more efficiently.
Pull request overview
This PR improves reliability and concurrency behavior in the CosmosDB-backed backend by hardening database initialization, handling Cosmos replication lag during concurrent batch creation, and reducing redundant reads during file upload/batch updates. It also optimizes several Cosmos queries by explicitly scoping them with partition keys.
Changes:
- Made `DatabaseFactory.get_database()` concurrency-safe and singleton-like via `asyncio.Lock` + double-checked locking.
- Updated Cosmos batch creation to handle 409 conflicts with retry/backoff reads, and added partition key scoping to key query paths.
- Streamlined file upload / batch status update flow by reusing an existing batch document and avoiding an extra `get_file` read.
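The double-checked locking mentioned in the first change can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `FakeDatabase` and its `initialize()` method are hypothetical stand-ins for the real Cosmos DB client, and the attribute names `_instance` and `_lock` are assumptions.

```python
import asyncio
from typing import Optional


class FakeDatabase:
    """Hypothetical stand-in for the real Cosmos DB client."""

    instances_created = 0

    async def initialize(self) -> None:
        # Simulate slow setup so that concurrent callers overlap.
        await asyncio.sleep(0.01)
        FakeDatabase.instances_created += 1


class DatabaseFactory:
    _instance: Optional[FakeDatabase] = None
    _lock: Optional[asyncio.Lock] = None

    @classmethod
    async def get_database(cls) -> FakeDatabase:
        # First check (no lock): the common case once initialization is done.
        if cls._instance is not None:
            return cls._instance
        # Create the lock lazily; there is no await between the check and
        # the assignment, so only one lock is ever created per event loop.
        if cls._lock is None:
            cls._lock = asyncio.Lock()
        async with cls._lock:
            # Second check (under the lock): another coroutine may have
            # finished initialization while this one was waiting.
            if cls._instance is None:
                db = FakeDatabase()
                await db.initialize()
                cls._instance = db
        return cls._instance
```

Note that `asyncio.Lock` serializes coroutines on a single event loop; it does not protect against access from multiple OS threads.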
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/backend/common/services/batch_service.py | Avoids a redundant get_file read after add_file, and passes existing_batch to reduce batch refetches. |
| src/backend/common/database/database_factory.py | Adds an async lock and caches the initialized database client to prevent concurrent initialization races. |
| src/backend/common/database/cosmosdb.py | Adds conflict/replication-lag handling in create_batch, adds partition-key-scoped queries, and allows update_batch_entry to skip refetch when given an existing batch. |
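The conflict and replication-lag handling described for `create_batch` might look something like the sketch below. It assumes the flow from the PR description: on a 409 conflict, read the existing batch with exponential backoff, verify ownership, and fail clearly if the read never succeeds. `ConflictError`, `create_or_read_batch`, the `create`/`read` callables, and the default retry counts and delays are all hypothetical names and values chosen for illustration; the real code uses the Cosmos SDK directly.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional


class ConflictError(Exception):
    """Hypothetical stand-in for the SDK's 409 conflict exception."""


async def create_or_read_batch(
    create: Callable[[], Awaitable[Dict]],
    read: Callable[[], Awaitable[Optional[Dict]]],
    user_id: str,
    max_attempts: int = 4,
    base_delay: float = 0.05,
) -> Dict:
    try:
        return await create()
    except ConflictError:
        # Another request already created the batch; fall through and
        # read it, tolerating a replica that has not caught up yet.
        pass

    for attempt in range(max_attempts):
        batch = await read()
        if batch is not None:
            # Never hand back a batch owned by someone else.
            if batch.get("user_id") != user_id:
                raise PermissionError("Batch belongs to a different user")
            return batch
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ...
            await asyncio.sleep(base_delay * (2 ** attempt))

    raise RuntimeError("Batch exists but could not be read after retries")
```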
Comments suppressed due to low confidence (1)
src/backend/common/database/cosmosdb.py:371
`update_batch_entry` now accepts `existing_batch` from callers, but it doesn't validate that the provided document matches the `batch_id`/`user_id` arguments before doing a full-document `replace_item`. A mismatched or stale dict could accidentally overwrite the wrong batch or clobber fields. Suggest validating that `existing_batch["batch_id"]` (and `user_id`) matches, or ignoring/refetching when it doesn't.
async def update_batch_entry(
    self, batch_id: str, user_id: str, status: ProcessStatus, file_count: int,
    existing_batch: Optional[Dict] = None
):
    """Update batch status. If existing_batch is provided, skip the re-fetch."""
    try:
        batch = existing_batch
        if batch is None:
            batch = await self.get_batch(user_id, batch_id)
        if not batch:
            raise ValueError("Batch not found")
        if isinstance(status, ProcessStatus):
            batch["status"] = status.value
        else:
            batch["status"] = status
        batch["updated_at"] = datetime.utcnow().isoformat()
        batch["file_count"] = file_count
        await self.batch_container.replace_item(item=batch_id, body=batch)
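One way to act on the suggestion above is a small guard that decides whether the caller-supplied document can be trusted. This is only a sketch: the helper name `usable_existing_batch` is hypothetical, and it assumes the batch document stores its identifiers under the `batch_id` and `user_id` keys, as the review comment implies.

```python
from typing import Dict, Optional


def usable_existing_batch(
    existing_batch: Optional[Dict], batch_id: str, user_id: str
) -> Optional[Dict]:
    """Return existing_batch only if it matches the requested identifiers.

    A return value of None signals the caller to fall back to a fresh read.
    """
    if existing_batch is None:
        return None
    if existing_batch.get("batch_id") != batch_id:
        return None
    if existing_batch.get("user_id") != user_id:
        return None
    return existing_batch
```

With a guard like this, `update_batch_entry` could assign `batch = usable_existing_batch(existing_batch, batch_id, user_id)` and keep its existing `if batch is None` refetch path unchanged.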
…tch for existing batch records
…existing batch records
…le record handling in BatchService
…g navigation after 2 minutes
…f forcing navigation after 2 minutes" This reverts commit 79d5964.
Roopan-Microsoft
approved these changes
Apr 6, 2026
🎉 This PR is included in version 1.7.0 🎉
The release is available on GitHub release.
Your semantic-release bot 📦🚀
Purpose
This pull request introduces several improvements to the backend's CosmosDB integration, focusing on more robust handling of batch creation conflicts, improved concurrency control for database initialization, and enhanced query consistency by specifying partition keys. It also updates related tests to reflect these changes and improves frontend behavior for long-running batch operations.
Backend: CosmosDB robustness and consistency
- Updated `create_batch` to handle replication lag after a conflict (409) by retrying reads with exponential backoff and verifying user ownership before returning an existing batch. Raises a clear error if the batch cannot be read after retries.
- Updated queries (`get_batch`, `get_file`, `get_batch_from_id`) to explicitly specify the `partition_key` parameter, improving query consistency and performance.
- Updated `update_batch_entry` to accept an optional `existing_batch` parameter, allowing callers to skip redundant database fetches if they already have the batch data.
- Updated `upload_file_to_batch` in the batch service to use the new `existing_batch` parameter, reducing unnecessary database queries.
Backend: Initialization and concurrency
- Added locking to `DatabaseFactory` to ensure concurrency-safe, singleton database initialization, preventing race conditions when multiple coroutines attempt to initialize the database simultaneously.
Frontend: Batch status fallback improvement
Testing: Updated mocks and assertions
Does this introduce a breaking change?
Golden Path Validation
Deployment Validation