
fix: Handle Cosmos DB replication lag during concurrent batch creation #392

Merged
Roopan-Microsoft merged 10 commits into dev from psl-pk-cosavmfix
Apr 6, 2026

@Pavan-Microsoft Pavan-Microsoft commented Apr 3, 2026

Purpose

This pull request introduces several improvements to the backend's CosmosDB integration, focusing on more robust handling of batch creation conflicts, improved concurrency control for database initialization, and enhanced query consistency by specifying partition keys. It also updates related tests to reflect these changes and improves frontend behavior for long-running batch operations.

Backend: CosmosDB robustness and consistency

  • Improved create_batch to handle replication lag after a conflict (409) by retrying reads with exponential backoff and verifying user ownership before returning an existing batch. Raises a clear error if the batch cannot be read after retries.
  • Updated all relevant query methods (get_batch, get_file, get_batch_from_id) to explicitly specify the partition_key parameter, improving query consistency and performance. [1] [2] [3]
  • Enhanced update_batch_entry to accept an optional existing_batch parameter, allowing callers to skip redundant database fetches if they already have the batch data.
  • Updated upload_file_to_batch in the batch service to use the new existing_batch parameter, reducing unnecessary database queries. [1] [2]
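
The retry-after-conflict behavior described for create_batch can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `read_after_conflict`, `BatchConflictError`, the `read_batch` callable, the attempt count, and the delay values are all hypothetical names and defaults chosen for the example.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional


class BatchConflictError(Exception):
    """Hypothetical error: the conflicting batch never became readable."""


async def read_after_conflict(
    read_batch: Callable[[], Awaitable[Optional[Dict[str, Any]]]],
    user_id: str,
    max_attempts: int = 4,
    base_delay: float = 0.1,
) -> Dict[str, Any]:
    """After a 409 on create, retry the read with exponential backoff.

    `read_batch` is an assumed callable that returns the batch document,
    or None while the write has not yet replicated to the read region.
    """
    for attempt in range(max_attempts):
        batch = await read_batch()
        if batch is not None:
            # Only return the existing batch to the user who owns it.
            if batch.get("user_id") != user_id:
                raise PermissionError("Batch belongs to a different user")
            return batch
        if attempt < max_attempts - 1:
            # Delays grow as base_delay * 1, 2, 4, ...
            await asyncio.sleep(base_delay * (2 ** attempt))
    raise BatchConflictError(
        f"Batch exists (409) but was not readable after {max_attempts} attempts"
    )
```

In a real client, `read_batch` would wrap the partition-scoped read of the batch that caused the conflict; the ownership check runs before the document is returned so a conflicting ID created by another user is never exposed.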

Backend: Initialization and concurrency

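  • Added an async lock to DatabaseFactory to ensure coroutine-safe, singleton database initialization, preventing race conditions when multiple coroutines attempt to initialize the database simultaneously. (An asyncio lock serializes coroutines on one event loop; it does not protect against access from multiple OS threads.) [1] [2]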
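
As a rough sketch of the locking pattern described above (not the actual DatabaseFactory code), the following shows lock-guarded lazy initialization with a second check inside the lock. The class name `LazyAsyncSingleton` and the `factory` callable are assumptions made for this example.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


class LazyAsyncSingleton:
    """Initialise a shared resource at most once across concurrent coroutines."""

    def __init__(self, factory: Callable[[], Awaitable[Any]]) -> None:
        # `factory` is a hypothetical async callable that builds the client.
        self._factory = factory
        self._instance: Optional[Any] = None
        self._lock = asyncio.Lock()

    async def get(self) -> Any:
        # Fast path: skip the lock once initialisation has finished.
        if self._instance is not None:
            return self._instance
        async with self._lock:
            # Second check: another coroutine may have finished
            # initialising while this one was waiting for the lock.
            if self._instance is None:
                self._instance = await self._factory()
            return self._instance
```

Without the second check, every coroutine that queued on the lock during the first initialization would run the factory again after acquiring it.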

Frontend: Batch status fallback improvement

  • Changed the frontend's fallback behavior on the modernization page: after 2+ minutes without completion, it now checks the batch status instead of forcing navigation, providing a smoother user experience.

Testing: Updated mocks and assertions

  • Updated tests to mock the new retry logic and partition key usage, ensuring that new behaviors (like retries and partitioned queries) are properly verified. [1] [2] [3] [4] [5] [6]

Does this introduce a breaking change?

  • Yes
  • No

Golden Path Validation

  • I have tested the primary workflows (the "golden path") to ensure they function correctly without errors.

Deployment Validation

  • I have validated the deployment process successfully and all services are running as expected with this change.

Roopan-Microsoft and others added 3 commits March 30, 2026 11:52
fix: merging from dev to main
…g and concurrency

- Added asyncio support and a lock mechanism in DatabaseFactory to ensure thread safety.
- Implemented retry logic with backoff for reading existing batches in CosmosDBClient.
- Updated batch service to handle existing batch entries more efficiently.

Copilot AI left a comment


Pull request overview

This PR improves reliability and concurrency behavior in the CosmosDB-backed backend by hardening database initialization, handling Cosmos replication lag during concurrent batch creation, and reducing redundant reads during file upload/batch updates. It also optimizes several Cosmos queries by explicitly scoping them with partition keys.

Changes:

  • Made DatabaseFactory.get_database() concurrency-safe and singleton-like via asyncio.Lock + double-checked locking.
  • Updated Cosmos batch creation to handle 409 conflicts with retry/backoff reads, and added partition key scoping to key query paths.
  • Streamlined file upload / batch status update flow by reusing an existing batch document and avoiding an extra get_file read.
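
Partition key scoping, as mentioned in the change list, might look like the sketch below. It assumes a container object whose `query_items` accepts `query`, `parameters`, and `partition_key` and returns an async iterable, as the async azure-cosmos container client does. The function name `get_batch_scoped`, the field names in the query, and the choice of `batch_id` as the partition key value are assumptions for illustration; the PR does not state which field the containers are partitioned on.

```python
from typing import Any, Dict, Optional


async def get_batch_scoped(
    container: Any, user_id: str, batch_id: str
) -> Optional[Dict[str, Any]]:
    """Return the first matching batch, querying a single partition."""
    query = "SELECT * FROM c WHERE c.batch_id = @batch_id AND c.user_id = @user_id"
    parameters = [
        {"name": "@batch_id", "value": batch_id},
        {"name": "@user_id", "value": user_id},
    ]
    # Assumption: the container is partitioned on batch_id. Passing
    # partition_key restricts the query to that one logical partition
    # instead of fanning out across all partitions.
    async for item in container.query_items(
        query=query, parameters=parameters, partition_key=batch_id
    ):
        return item
    return None
```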

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/backend/common/services/batch_service.py | Avoids a redundant get_file read after add_file, and passes existing_batch to reduce batch refetches. |
| src/backend/common/database/database_factory.py | Adds an async lock and caches the initialized database client to prevent concurrent initialization races. |
| src/backend/common/database/cosmosdb.py | Adds conflict/replication-lag handling in create_batch, adds partition-key-scoped queries, and allows update_batch_entry to skip refetch when given an existing batch. |
Comments suppressed due to low confidence (1)

src/backend/common/database/cosmosdb.py:371

  • update_batch_entry now accepts existing_batch from callers, but it doesn't validate that the provided document matches the batch_id/user_id arguments before doing a full-document replace_item. A mismatched/stale dict could accidentally overwrite the wrong batch or clobber fields. Suggest validating existing_batch["batch_id"] (and user_id) matches, or ignoring/refetching when it doesn't.
    async def update_batch_entry(
        self, batch_id: str, user_id: str, status: ProcessStatus, file_count: int,
        existing_batch: Optional[Dict] = None
    ):
        """Update batch status. If existing_batch is provided, skip the re-fetch."""
        try:
            batch = existing_batch
            if batch is None:
                batch = await self.get_batch(user_id, batch_id)
            if not batch:
                raise ValueError("Batch not found")

            if isinstance(status, ProcessStatus):
                batch["status"] = status.value
            else:
                batch["status"] = status

            batch["updated_at"] = datetime.utcnow().isoformat()
            batch["file_count"] = file_count

            await self.batch_container.replace_item(item=batch_id, body=batch)
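
One way to act on the suggestion in this comment is a small guard that accepts a caller-supplied document only when it matches the requested IDs. The sketch below is illustrative and not part of the PR; the helper name `usable_existing_batch` is hypothetical, and the `batch_id` and `user_id` field names follow the review comment.

```python
from typing import Any, Dict, Optional


def usable_existing_batch(
    existing_batch: Optional[Dict[str, Any]], batch_id: str, user_id: str
) -> Optional[Dict[str, Any]]:
    """Return existing_batch only if it matches both IDs, else None.

    A None result signals the caller to fall back to a fresh fetch
    rather than replacing a document with mismatched data.
    """
    if not existing_batch:
        return None
    if existing_batch.get("batch_id") != batch_id:
        return None
    if existing_batch.get("user_id") != user_id:
        return None
    return existing_batch
```

Inside update_batch_entry, the result of such a guard could replace the direct `batch = existing_batch` assignment, so that a mismatched dict falls through to the existing `get_batch` path. This checks identity only; it does not detect a stale copy of the correct document.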


…f forcing navigation after 2 minutes"

This reverts commit 79d5964.
@Roopan-Microsoft Roopan-Microsoft merged commit 4b70934 into dev Apr 6, 2026
8 checks passed

github-actions bot commented Apr 8, 2026

🎉 This PR is included in version 1.7.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
