fix(synapse-core): retry upload session creation on transient network failures by TippyFlitsUK · Pull Request #725 · FilOzone/synapse-sdk

TippyFlitsUK · 2026-04-09T15:46:22Z

Summary

Replace 5-minute single-attempt timeout on POST /pdp/piece/uploads (session creation) with 30-second timeout and 2 retries
Streaming upload and finalize steps unchanged

Problem

Dealbot monitoring shows intermittent StoreError: fetch failed on upload session creation. The request hangs for the full 5-minute MAX_RETRY_TIME timeout then fails. The SP never receives the request (no server-side log), and retrieval to the same endpoint succeeds seconds later.

This is a stateless POST that returns a UUID -- it should respond in under a second. A single transient network failure currently burns 5 minutes and fails the entire deal.

Fix

30-second timeout with 2 retries on the session creation POST only. Each retry creates a fresh session UUID; abandoned sessions are cleaned up by Curio automatically. The streaming upload (timeout: false) and finalize steps (MAX_RETRY_TIME) are unchanged -- high-latency providers are unaffected.
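
The approach described above can be sketched roughly as follows. This is an illustration only, not the iso-web or synapse-core implementation: `withRetry`, `HttpStatusError`, `RetryOptions`, `createUploadSession`, and the fixed `delayMs` pause are hypothetical names and assumptions. The actual change relies on the request library's own `retry` option.

```typescript
// Hypothetical error type marking "the server answered with an HTTP error".
class HttpStatusError extends Error {
  readonly status: number
  constructor(status: number) {
    super(`HTTP ${status}`)
    this.name = 'HttpStatusError'
    this.status = status
  }
  static is(error: unknown): error is HttpStatusError {
    return error instanceof Error && error.name === 'HttpStatusError'
  }
}

interface RetryOptions {
  retries: number // extra attempts after the first one
  timeoutMs: number // deadline for each individual attempt
  delayMs: number // fixed pause between attempts (assumption: no backoff)
}

// Runs `attempt` up to `retries + 1` times. Each attempt gets its own
// AbortSignal that fires after `timeoutMs`; the attempt must honor it.
async function withRetry<T>(
  attempt: (signal: AbortSignal) => Promise<T>,
  { retries, timeoutMs, delayMs }: RetryOptions
): Promise<T> {
  let lastError: unknown
  for (let i = 0; i <= retries; i++) {
    const controller = new AbortController()
    const timer = setTimeout(() => controller.abort(), timeoutMs)
    try {
      return await attempt(controller.signal)
    } catch (error) {
      // The server responded, so this is not a transient network failure.
      if (HttpStatusError.is(error)) throw error
      lastError = error
    } finally {
      clearTimeout(timer)
    }
    if (i < retries) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs))
    }
  }
  throw lastError
}

// Usage sketch: session creation with a 30 s per-attempt timeout and 2 retries.
async function createUploadSession(serviceURL: string): Promise<Response> {
  return withRetry(
    async (signal) => {
      const res = await fetch(new URL('pdp/piece/uploads', serviceURL), { method: 'POST', signal })
      if (!res.ok) throw new HttpStatusError(res.status)
      return res
    },
    { retries: 2, timeoutMs: 30_000, delayMs: 2_000 }
  )
}
```

Because every attempt issues a new POST, each one would receive its own session UUID, which matches the note above that abandoned sessions are left for Curio to clean up.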

Test plan

Build passes
Existing upload tests pass
Verify against dealbot staging with providers that intermittently fail

… failures The POST /pdp/piece/uploads request that creates an upload session is a lightweight stateless call that should respond in under a second. When it encounters a transient network error (fetch failed, connection reset), it currently waits the full 5-minute MAX_RETRY_TIME before failing the entire upload. Replace the 5-minute single-attempt timeout with a 30-second timeout and 2 retries. Each retry creates a fresh session UUID; abandoned sessions are cleaned up by Curio automatically. The streaming upload and finalize steps are unchanged.

…sion retries

…re tests Tests that simulate upload session creation failures now return HTTP 500 instead of HttpResponse.error() (network error). This correctly tests server-side failures without triggering the new network error retry logic.

rvagg · 2026-04-09T22:13:15Z

Dealbot monitoring shows intermittent `StoreError: fetch failed` on upload session creation

@TippyFlitsUK can you show us the full error trace for this, or point us to a betterstack log entry for it? We can't find a timeout instance, just some bad SP responses.

TippyFlitsUK · 2026-04-09T22:54:42Z

Here's a concrete example from production (Better Stack, Infra Prod / t468215.infra_prod, source ID 1678395).

Job ID: e9ee594e-64e7-472d-96b3-77d077888706
Provider: infrafolio-mainnet-pdp (ID 7)

Timeline:

10:37:42.175  deal_preprocessing_started
10:37:43.028  deal_preprocessing_completed  (853ms)
              ← 5 min 11 sec — no logs →
10:42:54.378  deal_creation_failed
10:42:54.811  deal_job_failed

Full error:

StoreError: Failed to store on primary provider 7 (https://mainnet-pdp.infrafolio.com)

Details: StorageContext store failed: Failed to store piece on service provider - fetch failed
    at StorageManager.upload (synapse-sdk/dist/src/storage/manager.js:56:19)
    at async uploadToSynapse (filecoin-pin/dist/core/upload/synapse.js:173:27)
    at async executeUpload (filecoin-pin/dist/core/upload/index.js:212:26)
    at async DealService.createDeal (deal.service.js:189:34)
    ...
    at async resolveWithinSeconds (pg-boss/dist/tools.js:38:18)

These are not bad SP responses. The SP has no logs during the failure window -- the POST /pdp/piece/uploads request never arrived. Curio logs show normal background operations throughout. A retrieval test to the same endpoint succeeded 22 seconds later. Every deal failure for IDs 5 and 7 over the last 48 hours shows the same fetch failed pattern -- none are HTTP errors from the SP.

Both SPs (Mongo2Stor and infrafolio) are otherwise high-performing with 99%+ success rates and have confirmed no infrastructure changes. This pattern has been gradually increasing over the last 72 hours and I haven't been able to pinpoint a root cause. This PR is admittedly speculative -- would a retry with a shorter timeout on session creation be a reasonable approach here? I'd welcome your thoughts and any other ideas on what might be causing this.

hugomrdias · 2026-04-10T14:46:47Z

need to fix the POST retry in iso-web first

rvagg · 2026-04-10T15:35:31Z

I guess the main thing we have to go on here is the fact that we have no evidence of a connection attempt on the SP side. Also annoying that error cause isn't being captured in those log traces, I wonder if there's anything we can do about that.

I'm going to do something about cause capturing for dealbot here: FilOzone/dealbot#444 and make Synapse also do it with its custom serialisation here: #727
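
To illustrate why the cause matters: in Node.js, a failed `fetch` typically surfaces as `TypeError: fetch failed` with the underlying socket or DNS error attached as `cause`, so a logger that prints only `message` loses the useful part. A minimal sketch of walking the cause chain follows. `describeError` is a hypothetical helper, not the code in FilOzone/dealbot#444 or #727.

```typescript
// Flattens an error and its chain of `cause` values into one log-friendly line.
// `maxDepth` guards against cyclic or very long chains.
function describeError(error: unknown, maxDepth = 5): string {
  const parts: string[] = []
  let current: unknown = error
  for (let depth = 0; current != null && depth < maxDepth; depth++) {
    if (current instanceof Error) {
      parts.push(`${current.name}: ${current.message}`)
      // `cause` is read through a cast so this compiles on older lib targets.
      current = (current as { cause?: unknown }).cause
    } else {
      // Non-Error causes (strings, plain objects) end the chain.
      parts.push(String(current))
      break
    }
  }
  return parts.join(' <- caused by: ')
}

// Example: a network failure wrapped the way Node's fetch usually reports it.
const inner = new Error('connect ETIMEDOUT')
const outer = Object.assign(new TypeError('fetch failed'), { cause: inner })
console.log(describeError(outer))
```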

Copilot

Pull request overview

Updates Synapse Core’s streaming upload flow to make upload session creation (POST /pdp/piece/uploads) resilient to transient network failures by reducing the per-attempt timeout and adding limited retries, while leaving the streaming PUT and finalization behavior unchanged.

Changes:

Change upload session creation timeout from 5 minutes to 30 seconds.
Add 2 retries for session creation on non-HTTP (transient/network) errors.
Update Synapse SDK tests to mock HTTP 500 responses (instead of network errors) for failure cases around session creation.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
packages/synapse-core/src/sp/upload-streaming.ts	Adds a shorter timeout and retry policy for upload session creation only.
packages/synapse-sdk/src/test/synapse.test.ts	Adjusts upload failure mock to return HTTP 500 instead of a network error.
packages/synapse-sdk/src/test/storage.test.ts	Adjusts upload-related mocks to return HTTP 500 for session creation failure scenarios.

  const createResponse = await request.post(new URL('pdp/piece/uploads', options.serviceURL), {
-    timeout: RETRY_CONSTANTS.MAX_RETRY_TIME,
+    timeout: 30_000,
    signal: options.signal,
+    retry: {
+      retries: 2,
+      methods: ['post'],
+      minTimeout: 2_000,
+      shouldRetry: (ctx) => !HttpError.is(ctx.error),
+    },
  })
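
As a rough check on the time budget implied by this hunk: assuming the 30-second timeout applies per attempt and each pause between attempts is at least `minTimeout` (the library's actual backoff schedule is not shown here and may be longer), the worst case drops from 5 minutes to about a minute and a half. `worstCaseMs` and `RetryBudget` are hypothetical names used only for this arithmetic.

```typescript
interface RetryBudget {
  retries: number // extra attempts after the first
  timeoutMs: number // assumed per-attempt timeout
  minDelayMs: number // assumed minimum pause between attempts
}

// Lower bound on the time spent if every attempt runs into its timeout.
function worstCaseMs({ retries, timeoutMs, minDelayMs }: RetryBudget): number {
  const attempts = retries + 1
  return attempts * timeoutMs + retries * minDelayMs
}

const budgetBefore = worstCaseMs({ retries: 0, timeoutMs: 5 * 60_000, minDelayMs: 0 })
const budgetAfter = worstCaseMs({ retries: 2, timeoutMs: 30_000, minDelayMs: 2_000 })
console.log(budgetBefore, budgetAfter)
```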


        Mocks.PING(),
-        Mocks.pdp.postPieceHandler(testPieceCID, mockUuid, pdpOptions),
-        http.put('https://pdp.example.com/pdp/piece/upload/:uuid', async () => {
-          return HttpResponse.error()
+        http.post('https://pdp.example.com/pdp/piece/uploads', async () => {
+          return HttpResponse.text('Internal Server Error', { status: 500 })
        })
      )


TippyFlitsUK requested review from hugomrdias and rvagg as code owners April 9, 2026 15:46

github-project-automation bot added this to FOC Apr 9, 2026

github-project-automation bot moved this to 📌 Triage in FOC Apr 9, 2026

TippyFlits added 4 commits April 9, 2026 17:27

fix(synapse-core): only retry network errors, not HTTP errors

ef04658

test: increase timeout for batch error test to account for upload session retries

88ba685

test: increase timeouts for tests that trigger upload session retries

81c9fcc

hugomrdias reviewed Apr 10, 2026

Comment thread packages/synapse-core/src/sp/upload-streaming.ts

BigLep moved this from 📌 Triage to 🐱 Todo in FOC Apr 12, 2026

BigLep moved this from 🐱 Todo to ⌨️ In Progress in FOC Apr 12, 2026

BigLep requested a review from Copilot April 12, 2026 21:09

BigLep assigned TippyFlitsUK Apr 12, 2026

Copilot started reviewing on behalf of BigLep April 12, 2026 21:10

Copilot AI reviewed Apr 12, 2026

hugomrdias mentioned this pull request Apr 16, 2026

Retry sp http requests on network error #738

Open

TippyFlitsUK wants to merge 5 commits into FilOzone:master from TippyFlitsUK:fix/upload-session-retry
