What content needs to be created or modified?
The Dapr Scheduler persistence documentation (kubernetes-persisting-scheduler.md) gives sizing guidance (64 GiB recommended) but does not address two production-critical properties of the underlying storage:
- Performance characteristics (IOPS / latency). The Scheduler's embedded etcd is sensitive to disk latency. Standard-tier storage classes can produce "leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk" warnings and degraded reminder dispatch behaviour under sustained write load.
- Zone affinity. Many cloud default StorageClasses pin PVs to a single availability zone (e.g. Azure StandardSSD_LRS with volumeBindingMode: WaitForFirstConsumer). When a node upgrade or zonal disruption requires the Scheduler pod to reschedule to a different zone, the PV cannot follow, blocking Scheduler recovery until the original zone is available again.
Operators who follow the existing guidance with cloud defaults can end up with a Scheduler that is prone to disk-latency-driven instability and cannot recover cleanly during node upgrades or DR / zone-failure events.
Describe the solution you'd like
In the Storage class section of the Scheduler persistence page, add explicit production guidance:
- Recommend premium / SSD-backed storage classes for production deployments (to give etcd the IOPS and latency profile it expects).
- Recommend storage classes that support multi-zone failover (zone-redundant or regional persistent disks) so Scheduler PVCs are not locked to a single AZ.
- Briefly explain the failure modes when the above are not followed: slow-disk heartbeat warnings and inability to reschedule across zones during cluster upgrades / zone events.
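As an illustration of the kind of storage class these recommendations point to, here is a minimal sketch for AKS. It is only an example under stated assumptions: the class name is hypothetical, and Premium_ZRS is one premium, zone-redundant disk SKU, available only in some regions. Other clouds have their own equivalents (for example regional persistent disks).

```yaml
# Hypothetical StorageClass sketch for the Scheduler's embedded etcd volumes.
# Assumes an AKS cluster with the Azure Disk CSI driver; adapt for other clouds.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: dapr-scheduler-premium-zrs      # illustrative name, not from the Dapr docs
provisioner: disk.csi.azure.com         # Azure Disk CSI driver
parameters:
  skuName: Premium_ZRS                  # premium SSD, replicated across zones in the region
volumeBindingMode: WaitForFirstConsumer # bind only once the Scheduler pod is scheduled
```

A class like this would then be selected when installing Dapr. The existing persistence page is understood to expose this through the Helm value dapr_scheduler.cluster.storageClassName, but that name should be confirmed against the chart version in use.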
Where should the new material be placed?
daprdocs/content/en/operations/hosting/kubernetes/kubernetes-persisting-scheduler.md, in the Storage class subsection following the existing 64 GiB sizing note. This keeps all storage-related guidance grouped together.
The associated pull request from dapr/dapr, dapr/components-contrib, or other Dapr code repos
Docs-only change — implementing PR: #5179
Additional context
This guidance reflects real-world operational signal from production deployments:
- Scheduler etcd logging recurring "slow disk" heartbeat warnings under sustained reminder/job load on Standard-tier storage classes.
- Zone-pinned scheduler PVCs blocking recovery during rolling node upgrades, leaving the Scheduler unable to reschedule until the original zone became available again — extending downtime windows by hours in observed cases.
- DR exercises failing because the scheduler PV could not be relocated to a healthy zone.