fix(synapse-core): retry upload session creation on transient network failures #725
TippyFlitsUK wants to merge 5 commits into FilOzone:master
… failures The POST /pdp/piece/uploads request that creates an upload session is a lightweight stateless call that should respond in under a second. When it encounters a transient network error (fetch failed, connection reset), it currently waits the full 5-minute MAX_RETRY_TIME before failing the entire upload. Replace the 5-minute single-attempt timeout with a 30-second timeout and 2 retries. Each retry creates a fresh session UUID; abandoned sessions are cleaned up by Curio automatically. The streaming upload and finalize steps are unchanged.
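To put the numbers in the commit message side by side, here is a rough worst-case time budget for a single session-creation call before and after the change. It is only a sketch: `worstCaseMs` and `Budget` are hypothetical names, and the exponential backoff factor of 2 with no jitter is an assumption, since the PR only states a 30-second timeout, 2 retries, and a 2-second minimum delay.

```typescript
// Hypothetical helper: upper bound on the time spent before giving up,
// assuming every attempt runs into its timeout.
interface Budget {
  timeoutMs: number // per-attempt timeout
  retries: number // extra attempts after the first one
  minDelayMs: number // delay before the first retry
  factor: number // ASSUMPTION: exponential backoff multiplier, no jitter
}

function worstCaseMs(budget: Budget): number {
  const attempts = budget.retries + 1
  let total = attempts * budget.timeoutMs
  for (let i = 0; i < budget.retries; i++) {
    // Delay before retry i grows by `factor` each time.
    total += budget.minDelayMs * Math.pow(budget.factor, i)
  }
  return total
}

// Before: one attempt with the 5-minute MAX_RETRY_TIME timeout.
const before = worstCaseMs({ timeoutMs: 300_000, retries: 0, minDelayMs: 0, factor: 2 })

// After: 30-second timeout, 2 retries, 2-second minimum delay.
const after = worstCaseMs({ timeoutMs: 30_000, retries: 2, minDelayMs: 2_000, factor: 2 })

console.log({ before, after })
```

Under these assumptions the worst case drops from 300 seconds to 96 seconds, and a single transient failure that recovers on the first retry costs a little over 30 seconds instead of the full 5 minutes.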
…re tests Tests that simulate upload session creation failures now return HTTP 500 instead of HttpResponse.error() (network error). This correctly tests server-side failures without triggering the new network error retry logic.
@TippyFlitsUK can you show us the full error trace for this, or point us to a betterstack log entry for it? We can't find a timeout instance, just some bad SP responses.
Here's a concrete example from production (Better Stack, Infra Prod / Job ID:
Timeline:
Full error:
These are not bad SP responses. The SP has no logs during the failure window -- the

Both SPs (Mongo2Stor and infrafolio) are otherwise high-performing with 99%+ success rates and have confirmed no infrastructure changes. This pattern has been gradually increasing over the last 72 hours and I haven't been able to pinpoint a root cause.

This PR is admittedly speculative -- would a retry with a shorter timeout on session creation be a reasonable approach here? I'd welcome your thoughts and any other ideas on what might be causing this.
need to fix the POST retry in iso-web first
I guess the main thing we have to go on here is the fact that we have no evidence of a connection attempt on the SP side. Also annoying that error I'm going to do something about cause capturing for dealbot here: FilOzone/dealbot#444 and make Synapse also do it with its custom serialisation here: #727 |
Pull request overview
Updates Synapse Core’s streaming upload flow to make upload session creation (POST /pdp/piece/uploads) resilient to transient network failures by reducing the per-attempt timeout and adding limited retries, while leaving the streaming PUT and finalization behavior unchanged.
Changes:
- Change upload session creation timeout from 5 minutes to 30 seconds.
- Add 2 retries for session creation on non-HTTP (transient/network) errors.
- Update Synapse SDK tests to mock HTTP 500 responses (instead of network errors) for failure cases around session creation.
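
The retry behaviour listed above can be sketched independently of the real HTTP client. This is a minimal illustration, not the code in this PR: `withRetry`, `RetryPolicy`, and `HttpStatusError` are hypothetical names, the fixed delay between attempts is an assumption, and the per-attempt timeout only takes effect if the attempt honours the `AbortSignal` it is given.

```typescript
// Hypothetical stand-in for an error raised when the server answered with a
// non-2xx status (as opposed to the request never completing).
class HttpStatusError extends Error {
  readonly status: number
  constructor(status: number) {
    super(`HTTP ${status}`)
    this.name = 'HttpStatusError'
    this.status = status
  }
  // Name-based check, loosely mirroring the `HttpError.is(...)` style in the diff.
  static is(error: unknown): error is HttpStatusError {
    return error instanceof Error && error.name === 'HttpStatusError'
  }
}

interface RetryPolicy {
  retries: number // extra attempts after the first one
  timeoutMs: number // per-attempt timeout
  minDelayMs: number // ASSUMPTION: fixed delay before each retry
}

type Attempt<T> = (signal: AbortSignal) => Promise<T>

async function withRetry<T>(attempt: Attempt<T>, policy: RetryPolicy): Promise<T> {
  let lastError: unknown
  for (let i = 0; i <= policy.retries; i++) {
    const controller = new AbortController()
    // Abort this attempt once the per-attempt timeout elapses.
    const timer = setTimeout(() => controller.abort(), policy.timeoutMs)
    try {
      return await attempt(controller.signal)
    } catch (error) {
      lastError = error
      // The server responded, so this is not a transient network failure:
      // surface it immediately instead of retrying.
      if (HttpStatusError.is(error)) throw error
      if (i < policy.retries) {
        await new Promise((resolve) => setTimeout(resolve, policy.minDelayMs))
      }
    } finally {
      clearTimeout(timer)
    }
  }
  throw lastError
}
```

With a policy of `{ retries: 2, timeoutMs: 30_000, minDelayMs: 2_000 }`, an attempt that throws a network-style error is tried up to three times, while an attempt that throws `HttpStatusError` fails on the first try. That distinction is why the updated tests mock an HTTP 500 rather than a network error.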
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| packages/synapse-core/src/sp/upload-streaming.ts | Adds a shorter timeout and retry policy for upload session creation only. |
| packages/synapse-sdk/src/test/synapse.test.ts | Adjusts upload failure mock to return HTTP 500 instead of a network error. |
| packages/synapse-sdk/src/test/storage.test.ts | Adjusts upload-related mocks to return HTTP 500 for session creation failure scenarios. |

      const createResponse = await request.post(new URL('pdp/piece/uploads', options.serviceURL), {
    -   timeout: RETRY_CONSTANTS.MAX_RETRY_TIME,
    +   timeout: 30_000,
        signal: options.signal,
    +   retry: {
    +     retries: 2,
    +     methods: ['post'],
    +     minTimeout: 2_000,
    +     shouldRetry: (ctx) => !HttpError.is(ctx.error),
    +   },
      })


        Mocks.PING(),
        Mocks.pdp.postPieceHandler(testPieceCID, mockUuid, pdpOptions),
    -   http.put('https://pdp.example.com/pdp/piece/upload/:uuid', async () => {
    -     return HttpResponse.error()
    +   http.post('https://pdp.example.com/pdp/piece/uploads', async () => {
    +     return HttpResponse.text('Internal Server Error', { status: 500 })
        })
      )

Summary
POST /pdp/piece/uploads (session creation) with 30-second timeout and 2 retries

Problem
Dealbot monitoring shows intermittent StoreError: fetch failed on upload session creation. The request hangs for the full 5-minute MAX_RETRY_TIME timeout, then fails. The SP never receives the request (no server-side log), and retrieval to the same endpoint succeeds seconds later.

This is a stateless POST that returns a UUID -- it should respond in under a second. A single transient network failure currently burns 5 minutes and fails the entire deal.

Fix
30-second timeout with 2 retries on the session creation POST only. Each retry creates a fresh session UUID; abandoned sessions are cleaned up by Curio automatically. The streaming upload (timeout: false) and finalize steps (MAX_RETRY_TIME) are unchanged -- high-latency providers are unaffected.

Test plan