fix(synapse-core): retry upload session creation on transient network failures #725

Open
TippyFlitsUK wants to merge 5 commits into FilOzone:master from TippyFlitsUK:fix/upload-session-retry
Conversation

@TippyFlitsUK
Contributor

Summary

  • Replace 5-minute single-attempt timeout on POST /pdp/piece/uploads (session creation) with 30-second timeout and 2 retries
  • Streaming upload and finalize steps unchanged

Problem

Dealbot monitoring shows intermittent StoreError: fetch failed on upload session creation. The request hangs for the full 5-minute MAX_RETRY_TIME timeout then fails. The SP never receives the request (no server-side log), and retrieval to the same endpoint succeeds seconds later.

This is a stateless POST that returns a UUID -- it should respond in under a second. A single transient network failure currently burns 5 minutes and fails the entire deal.

Fix

30-second timeout with 2 retries on the session creation POST only. Each retry creates a fresh session UUID; abandoned sessions are cleaned up by Curio automatically. The streaming upload (timeout: false) and finalize steps (MAX_RETRY_TIME) are unchanged -- high-latency providers are unaffected.
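
As an illustration of the approach described above, here is a minimal TypeScript sketch of a bounded per-attempt timeout combined with a small retry count. It is not the synapse-core code: `withTimeoutAndRetry`, `RetryOptions`, and their fields are hypothetical names, and the fixed pause between attempts is an assumption.

```typescript
// Hypothetical helper for illustration; not the synapse-core implementation.
interface RetryOptions {
  retries: number // extra attempts after the first one
  timeoutMs: number // per-attempt timeout
  minDelayMs: number // pause before each retry (assumed fixed here)
  shouldRetry: (error: unknown) => boolean
}

async function withTimeoutAndRetry<T>(
  attempt: (signal: AbortSignal) => Promise<T>,
  options: RetryOptions
): Promise<T> {
  let lastError: unknown
  for (let i = 0; i <= options.retries; i++) {
    // Each attempt gets its own abort signal; the attempt must honor it
    // (for example by passing it to fetch) for the timeout to take effect.
    const controller = new AbortController()
    const timer = setTimeout(() => controller.abort(), options.timeoutMs)
    try {
      return await attempt(controller.signal)
    } catch (error) {
      lastError = error
    } finally {
      clearTimeout(timer)
    }
    // Stop when attempts are exhausted or the error is not retryable.
    if (i === options.retries || !options.shouldRetry(lastError)) break
    await new Promise((resolve) => setTimeout(resolve, options.minDelayMs))
  }
  throw lastError
}
```

A session-creation call could then be wrapped as `withTimeoutAndRetry((signal) => fetch(url, { method: 'POST', signal }), { retries: 2, timeoutMs: 30_000, minDelayMs: 2_000, shouldRetry })`, where `url` and `shouldRetry` are placeholders. Because every attempt issues a fresh POST, this only suits calls where a duplicate request is harmless, which is the property the description relies on for abandoned sessions.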

Test plan

  • Build passes
  • Existing upload tests pass
  • Verify against dealbot staging with providers that intermittently fail

… failures

The POST /pdp/piece/uploads request that creates an upload session is a
lightweight stateless call that should respond in under a second. When it
encounters a transient network error (fetch failed, connection reset), it
currently waits the full 5-minute MAX_RETRY_TIME before failing the entire
upload.

Replace the 5-minute single-attempt timeout with a 30-second timeout and
2 retries. Each retry creates a fresh session UUID; abandoned sessions are
cleaned up by Curio automatically. The streaming upload and finalize steps
are unchanged.
@github-project-automation github-project-automation bot moved this to 📌 Triage in FOC Apr 9, 2026
TippyFlits added 4 commits April 9, 2026 17:27
…re tests

Tests that simulate upload session creation failures now return HTTP 500
instead of HttpResponse.error() (network error). This correctly tests
server-side failures without triggering the new network error retry logic.
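
The distinction this commit relies on can be sketched as follows: a failure that carries an HTTP status means the server answered and should not be retried, while anything else (such as `fetch failed`) is treated as a transport-level failure. `HttpStatusError` is a hypothetical stand-in for the `HttpError` class used in the diff, and `shouldRetrySessionCreation` is an illustrative name, not an existing function.

```typescript
// Hypothetical stand-in for the HttpError class referenced in the diff.
class HttpStatusError extends Error {
  readonly status: number

  constructor(status: number) {
    super(`HTTP ${status}`)
    this.name = 'HttpStatusError'
    this.status = status
  }

  // Type guard based on the error name rather than the prototype chain.
  static is(error: unknown): error is HttpStatusError {
    return error instanceof Error && error.name === 'HttpStatusError'
  }
}

// Retry only when no HTTP response was received at all.
function shouldRetrySessionCreation(error: unknown): boolean {
  return !HttpStatusError.is(error)
}
```

Under this rule a mocked HTTP 500 exercises the non-retry path, whereas a mocked network error would trigger the retries and change what the existing tests observe.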
@rvagg
Collaborator

rvagg commented Apr 9, 2026

Dealbot monitoring shows intermittent `StoreError: fetch failed` on upload session creation

@TippyFlitsUK can you show us the full error trace for this, or point us to a betterstack log entry for it? We can't find a timeout instance, just some bad SP responses.

@TippyFlitsUK
Contributor Author

Here's a concrete example from production (Better Stack, Infra Prod / t468215.infra_prod, source ID 1678395).

Job ID: e9ee594e-64e7-472d-96b3-77d077888706
Provider: infrafolio-mainnet-pdp (ID 7)

Timeline:

10:37:42.175  deal_preprocessing_started
10:37:43.028  deal_preprocessing_completed  (853ms)
              ← 5 min 11 sec — no logs →
10:42:54.378  deal_creation_failed
10:42:54.811  deal_job_failed

Full error:

StoreError: Failed to store on primary provider 7 (https://mainnet-pdp.infrafolio.com)

Details: StorageContext store failed: Failed to store piece on service provider - fetch failed
    at StorageManager.upload (synapse-sdk/dist/src/storage/manager.js:56:19)
    at async uploadToSynapse (filecoin-pin/dist/core/upload/synapse.js:173:27)
    at async executeUpload (filecoin-pin/dist/core/upload/index.js:212:26)
    at async DealService.createDeal (deal.service.js:189:34)
    ...
    at async resolveWithinSeconds (pg-boss/dist/tools.js:38:18)

These are not bad SP responses. The SP has no logs during the failure window -- the POST /pdp/piece/uploads request never arrived. Curio logs show normal background operations throughout. A retrieval test to the same endpoint succeeded 22 seconds later. Every deal failure for IDs 5 and 7 over the last 48 hours shows the same fetch failed pattern -- none are HTTP errors from the SP.

Both SPs (Mongo2Stor and infrafolio) are otherwise high-performing with 99%+ success rates and have confirmed no infrastructure changes. This pattern has been gradually increasing over the last 72 hours and I haven't been able to pinpoint a root cause. This PR is admittedly speculative -- would a retry with a shorter timeout on session creation be a reasonable approach here? I'd welcome your thoughts and any other ideas on what might be causing this.

@hugomrdias
Member

need to fix the POST retry in iso-web first

@rvagg
Collaborator

rvagg commented Apr 10, 2026

I guess the main thing we have to go on here is the fact that we have no evidence of a connection attempt on the SP side. Also annoying that error cause isn't being captured in those log traces, I wonder if there's anything we can do about that.

I'm going to do something about cause capturing for dealbot here: FilOzone/dealbot#444 and make Synapse also do it with its custom serialisation here: #727
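
To illustrate what capturing the cause could look like, here is a minimal sketch that walks an error's `cause` chain and emits one line per level. In Node.js a `fetch failed` TypeError typically carries the underlying socket error on its `cause` property, which is the detail missing from the trace above. `describeErrorChain` is a hypothetical helper, not the code from either linked change.

```typescript
// Hypothetical helper: flatten an error and its nested causes into log lines.
function describeErrorChain(error: unknown, maxDepth = 5): string[] {
  const lines: string[] = []
  let current: unknown = error
  for (let depth = 0; depth < maxDepth && current != null; depth++) {
    if (!(current instanceof Error)) {
      // Non-Error causes (strings, plain objects) end the chain.
      lines.push(String(current))
      break
    }
    const fields = current as Error & { code?: unknown; cause?: unknown }
    const summary = `${current.name}: ${current.message}`
    // Node.js system errors expose a string code such as ECONNRESET.
    lines.push(typeof fields.code === 'string' ? `${summary} [${fields.code}]` : summary)
    current = fields.cause
  }
  return lines
}
```

Joining the returned lines with a separator such as `" <- "` gives a single log field that keeps the root cause next to the top-level message.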

Comment thread packages/synapse-core/src/sp/upload-streaming.ts
@BigLep BigLep moved this from 📌 Triage to 🐱 Todo in FOC Apr 12, 2026
@BigLep BigLep moved this from 🐱 Todo to ⌨️ In Progress in FOC Apr 12, 2026
@BigLep BigLep requested a review from Copilot April 12, 2026 21:09
Contributor

Copilot AI left a comment

Pull request overview

Updates Synapse Core’s streaming upload flow to make upload session creation (POST /pdp/piece/uploads) resilient to transient network failures by reducing the per-attempt timeout and adding limited retries, while leaving the streaming PUT and finalization behavior unchanged.

Changes:

  • Change upload session creation timeout from 5 minutes to 30 seconds.
  • Add 2 retries for session creation on non-HTTP (transient/network) errors.
  • Update Synapse SDK tests to mock HTTP 500 responses (instead of network errors) for failure cases around session creation.
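
For a rough sense of the new upper bound on session creation, the sketch below adds up the per-attempt timeouts and the pauses between attempts. The exponential backoff factor of 2 starting at `minTimeout` is an assumption about the retry library, not something stated in this PR, and `worstCaseBudgetMs` is a hypothetical helper.

```typescript
// Worst case: every attempt runs to its timeout, plus the waits in between.
// The backoff factor is an assumption; a real library may also add jitter.
function worstCaseBudgetMs(
  timeoutMs: number,
  retries: number,
  minDelayMs: number,
  factor = 2
): number {
  let total = timeoutMs * (retries + 1)
  for (let i = 0; i < retries; i++) {
    total += minDelayMs * Math.pow(factor, i)
  }
  return total
}

// Values from this PR: 30 s timeout, 2 retries, 2 s minimum delay.
const newBudgetMs = worstCaseBudgetMs(30_000, 2, 2_000)
console.log(newBudgetMs) // 96000, versus 300000 for the previous single 5-minute attempt
```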

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File | Description
--- | ---
packages/synapse-core/src/sp/upload-streaming.ts | Adds a shorter timeout and retry policy for upload session creation only.
packages/synapse-sdk/src/test/synapse.test.ts | Adjusts upload failure mock to return HTTP 500 instead of a network error.
packages/synapse-sdk/src/test/storage.test.ts | Adjusts upload-related mocks to return HTTP 500 for session creation failure scenarios.

retries: 2,
methods: ['post'],
minTimeout: 2_000,
shouldRetry: (ctx) => !HttpError.is(ctx.error),
Comment on lines 49 to 58
  const createResponse = await request.post(new URL('pdp/piece/uploads', options.serviceURL), {
-   timeout: RETRY_CONSTANTS.MAX_RETRY_TIME,
+   timeout: 30_000,
    signal: options.signal,
+   retry: {
+     retries: 2,
+     methods: ['post'],
+     minTimeout: 2_000,
+     shouldRetry: (ctx) => !HttpError.is(ctx.error),
+   },
  })
Comment on lines 1139 to 1143
    Mocks.PING(),
    Mocks.pdp.postPieceHandler(testPieceCID, mockUuid, pdpOptions),
-   http.put('https://pdp.example.com/pdp/piece/upload/:uuid', async () => {
-     return HttpResponse.error()
+   http.post('https://pdp.example.com/pdp/piece/uploads', async () => {
+     return HttpResponse.text('Internal Server Error', { status: 500 })
    })
  )
Projects

Status: ⌨️ In Progress

5 participants