[BUG]: Spark Cosmos connector has quadratic behaviour on planning based on partitions/continuation tokens #49023

@joeyp-openai

Description

Describe the bug
The Cosmos Spark change feed connector can spend excessive driver time planning a microbatch when a single streaming query reads a container with many feed ranges / continuation tokens.

The hot path appears to be ChangeFeedMicroBatchStream.planInputPartitions, which extracts a per-partition starting ChangeFeedState from the same start offset. Today, that repeatedly copies and sorts the full continuation-token list for each planned partition.

If P is the number of planned Spark input partitions / feed ranges and T is the number of continuation tokens in the starting offset, planning cost behaves roughly like:

O(P * (T log T + T))

For large containers, this can dominate or stall microbatch planning before executors start consuming data. In some examples, we were seeing this take >15min on jobs for >30k continuation tokens / feed ranges.

Exception or Stack Trace
No exception is thrown. The symptom is very long driver-side microbatch planning / offset planning time.

Observed stack / hot path from investigation:

  • ChangeFeedMicroBatchStream.planInputPartitions
  • per partition: SparkBridgeImplementationInternal.extractChangeFeedStateForRange(...)
  • ChangeFeedState.extractForEffectiveRange(...)
  • ChangeFeedState.extractContinuationTokens(...)
  • repeated copy/sort/scan of the same continuation-token list
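The shape of that repeated work can be sketched with a hypothetical, simplified model (the names below are illustrative stand-ins, not the connector's real API):

```scala
object PlanningCostSketch {
  var sortCalls = 0 // instrumentation for this sketch only

  // Stand-in for the per-partition overlap scan over the sorted token list.
  private def extractForRange(sortedTokens: Seq[Int], partition: Int): Int =
    sortedTokens.count(_ % (partition + 1) == 0)

  // Current shape: the full token list of size T is copied and sorted once per
  // planned partition, so the driver does roughly P * (T log T + T) work.
  def planQuadratic(partitions: Range, tokens: Seq[Int]): Seq[Int] =
    partitions.map { p =>
      sortCalls += 1
      extractForRange(tokens.sorted, p)
    }

  // Proposed shape: snapshot/sort the start tokens once per planning pass and
  // reuse the sorted view for every per-partition extraction call.
  def planWithSnapshot(partitions: Range, tokens: Seq[Int]): Seq[Int] = {
    sortCalls += 1
    val sorted = tokens.sorted
    partitions.map(p => extractForRange(sorted, p))
  }
}
```

With P and T both around 30k, the first shape performs ~30k sorts of a ~30k-element list per microbatch, which is consistent with the multi-minute planning times observed.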

To Reproduce
Steps to reproduce the behavior:

  1. Use a Cosmos container with a large number of physical feed ranges / continuation tokens.
  2. Run a Spark Structured Streaming change feed query over the full container with one checkpoint / one streaming query.
  3. Use a starting offset/checkpoint that contains many composite continuation tokens.
  4. Trigger a microbatch and observe driver-side planning time in planInputPartitions.
  5. Compare planning time with an implementation that snapshots/sorts the start continuation tokens once per planning pass and reuses that sorted view for all per-partition extraction calls.

Code Snippet
Representative Spark read:

val df = spark.readStream
  .format("cosmos.oltp.changeFeed")
  .options(cosmosConfig)
  .option("spark.cosmos.changeFeed.mode", "Incremental")
  .option("spark.cosmos.changeFeed.startFrom", "Beginning")
  .load()

df.writeStream
  .format("delta")
  .option("checkpointLocation", checkpointPath)
  .start(outputPath)

The issue is most visible when the checkpoint/start offset contains many continuation tokens and Spark plans many feed-range partitions.

Expected behavior
The connector should avoid repeatedly sorting/parsing the same start-offset continuation tokens for each planned partition. For example, at this particular call site (extractForEffectiveRange) and the equivalent ones across the version implementations:

val parsedStartChangeFeedState = SparkBridgeImplementationInternal.parseChangeFeedState(start.changeFeedState)
end
  .inputPartitions
  .get
  .map(partition => {
    val index = partitionIndexMap.asScala.getOrElseUpdate(partition.feedRange, partitionIndex.incrementAndGet())
    partition
      .withContinuationState(
        SparkBridgeImplementationInternal
          .extractChangeFeedStateForRange(parsedStartChangeFeedState, partition.feedRange),
        clearEndLsn = false)
      .withIndex(index)
  })

If we were to:

  • Sort the continuation tokens ahead of time
  • Use binary search when looking for overlapping feed ranges

It would drastically reduce the overhead. I acknowledge that for malformed or legacy overlapping ranges, we may want to fall back to the full sorted scan.
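A minimal sketch of that approach, assuming token ranges are non-overlapping once sorted by their range minimum (the normal case); TokenRange and the helper names here are hypothetical simplifications, not the connector's types:

```scala
// Hypothetical, simplified stand-in for a composite continuation token and
// the feed range it covers: [min, max), compared lexicographically.
final case class TokenRange(min: String, max: String, token: String)

object TokenLookup {
  // Sort once per planning pass: O(T log T) total, instead of once per partition.
  def sortTokens(tokens: Seq[TokenRange]): IndexedSeq[TokenRange] =
    tokens.sortBy(_.min).toIndexedSeq

  // Binary search for the first token whose range can overlap [rangeMin, rangeMax),
  // then scan forward only while tokens still overlap: O(log T + K) per partition,
  // where K is the number of overlapping tokens. Assumes non-overlapping token
  // ranges (so the maxes are sorted too); malformed or legacy overlapping ranges
  // would need the full-scan fallback noted above.
  def overlapping(sorted: IndexedSeq[TokenRange],
                  rangeMin: String,
                  rangeMax: String): Seq[TokenRange] = {
    var lo = 0
    var hi = sorted.length
    while (lo < hi) { // find the first index whose max is strictly greater than rangeMin
      val mid = (lo + hi) / 2
      if (sorted(mid).max.compareTo(rangeMin) <= 0) lo = mid + 1 else hi = mid
    }
    sorted.drop(lo).takeWhile(_.min.compareTo(rangeMax) < 0)
  }
}
```

Sorting once and binary-searching per partition brings planning from roughly P * (T log T + T) down to T log T + P * (log T + K).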

Screenshots
N/A

Setup (please complete the following information):

  • OS: Linux
  • IDE: N/A
  • Library/Libraries: com.azure.cosmos.spark:azure-cosmos-spark_3-5_2-12 / Cosmos Spark change feed connector
  • Java version: 17
  • App Server/Environment: Apache Spark Structured Streaming
  • Frameworks: Spark 3.x

Additional context
This bottleneck occurs during driver-side microbatch planning, before Spark executors even begin to read rows.

Increasing spark.cosmos.changeFeed.itemCountPerTriggerHint may amortize fixed per-trigger overhead over more rows, but it does not remove the repeated continuation-token planning work.

The issue is more severe for unsharded/full-container streaming queries because both the planned partition count and continuation-token count can be large. Sharded consumers may avoid the worst case because each query/checkpoint owns a smaller subset of feed ranges.

Information Checklist

  • Bug Description Added
  • Repro Steps Added
  • Setup information Added

Labels

  • Client — This issue points to a problem in the data-plane of the library.
  • Cosmos
  • Service Attention — Workflow: This issue is the responsibility of the Azure service team.
  • customer-reported — Issues that are reported by GitHub users external to the Azure organization.
  • question — The issue doesn't require a change to the product in order to be resolved.
