Add RFC to introduce Bolt backend for native engine #59

Open: Weixin-Xu wants to merge 8 commits into prestodb:main from Weixin-Xu:introduce_bolt
Weixin-Xu commented Apr 14, 2026

Summary

Introduce Bolt as an additional backend for the Presto native execution engine.

The initial implementation provides a Bolt-based native worker that implements the Presto worker protocol and integrates with the existing Presto coordinator.

To support the Bolt backend build and dependency requirements, a Conan-based dependency flow is introduced for this worker module. Standardizing dependency management across all native backends is out of scope for this RFC.

@frankobe @ZacBlanco

beinan (Member) commented Apr 16, 2026

Looking forward to having bolt in native presto workers!

Improve RFC with more implementation specifics
Comment thread RFC-0024-bolt-backend.md

The current code includes dedicated Bolt converters such as:

* `PrestoToBoltQueryPlan`
Member

> 1. The Bolt worker deserializes those fragments using the existing Presto protocol model

How will future divergence in the protocol with Velox be handled? What happens if Velox requires a protocol change that isn’t compatible with Bolt, and vice versa?

Author

It depends on where the change originates.

If it’s a change to the Presto protocol, then it becomes a contract change, and we need to ensure both Bolt and Velox can handle it correctly (ideally in a backward-compatible way).
If it’s a Bolt- or Velox-specific interface change, then it should be handled within their respective translation/converter layers.

yingsu00 commented May 12, 2026

> 1. The Bolt worker deserializes those fragments using the existing Presto protocol model
>
> How will future divergence in the protocol with Velox be handled? What happens if Velox requires a protocol change that isn’t compatible with Bolt, and vice versa?

I have the same question. It's actually not just about the Presto protocol but a very general concern. I think the right way to handle this kind of concern is to support versioning on common interfaces like the Presto SPI, the Presto communication protocol, the Velox interfaces, etc. In the past, people have been very cautious when changes needed to be made to the Presto SPI, but changes were made very frequently and freely on the Velox side, for example to the connector interfaces, the DWIO interfaces, and the Presto protocol. This caused lots of rebase conflicts in our internal repo. I hope ByteDance Bolt can do a better job on this in the future.

So I see this as an opportunity to start cleaning things up, and maybe we can start working on versioning support for the protocol in the Presto and Bolt repos first.
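
A minimal sketch of what protocol versioning could look like, assuming each side advertises an inclusive range of protocol versions it can speak and the highest common version wins. `VersionRange` and `negotiate` are hypothetical names used for illustration only; they are not existing Presto, Velox, or Bolt APIs, and the version numbers are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VersionRange:
    """Inclusive range of protocol versions one side supports (hypothetical)."""
    min_version: int
    max_version: int


def negotiate(coordinator: VersionRange, worker: VersionRange) -> Optional[int]:
    """Return the highest protocol version both sides support, or None."""
    low = max(coordinator.min_version, worker.min_version)
    high = min(coordinator.max_version, worker.max_version)
    return high if low <= high else None


# Made-up numbers: the coordinator speaks v3..v5.
coordinator = VersionRange(3, 5)
lagging_worker = VersionRange(2, 4)   # overlaps with the coordinator on v3..v4
stale_worker = VersionRange(1, 2)     # no overlap at all

print(negotiate(coordinator, lagging_worker))  # 4
print(negotiate(coordinator, stale_worker))    # None
```

With something like this in place, an incompatible worker could be rejected when it registers rather than failing later on an unknown field.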

Comment thread RFC-0024-bolt-backend.md
* `PrestoToBoltExpr`
* `PrestoToBoltConnector`
* `PrestoToBoltSplit`
* `BoltPlanConversion` and `BoltPlanValidator`
Contributor

Author

Bolt execution aims to cover all side-car callbacks and extend them where needed.

Comment thread RFC-0024-bolt-backend.md

### 5. CI Plan

CI for Bolt should be split into a few clear lanes:
Contributor

We have a fairly exhaustive set of tests in the presto-native-tests module: https://github.com/prestodb/presto/tree/master/presto-native-tests. Please ensure these are covered as well.

Author

End-to-end tests will use the same test module wherever possible.

jja725 self-requested a review May 8, 2026 23:16
Comment thread RFC-0024-bolt-backend.md
* `PrestoToBoltSplit`
* `BoltPlanConversion` and `BoltPlanValidator`

This is intentionally backend-local. The initial implementation does not try to share plan conversion logic with the Velox backend.

Can we have the function-coverage delta: which Velox functions are not yet in Bolt, and which Bolt functions don't match Velox semantics?

Author

Function coverage will be reflected in the bolt/bolt-execution unit tests and will match Presto semantics.

Comment thread RFC-0024-bolt-backend.md Outdated
Comment thread RFC-0024-bolt-backend.md
* worker server implementation
* task execution logic
* operators
* plan, expression, connector, and split conversion

The coordinator's planner produces one plan, and the RFC names BoltPlanValidator as the Bolt-side analog of getVeloxPlanValidator(). But validation is downstream of plan emission — by the time the worker rejects a plan, the query has already been sent. It might be better to have a coordinator-side capability description so the planner can avoid emitting plans the deployed backend can't run. The RFC should at least call out that this gap exists and how it's bridged for the homogeneous-pool case.
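
A rough sketch of the coordinator-side capability check suggested here, assuming the deployed backend publishes the plan node types and functions it can run. `BackendCapabilities` and `unsupported_features` are hypothetical names, and the capability set below is illustrative; it does not describe Bolt's real coverage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendCapabilities:
    """What a deployed backend declares it can run (hypothetical structure)."""
    name: str
    plan_nodes: frozenset
    functions: frozenset


def unsupported_features(plan_nodes, functions, caps):
    """List everything the plan needs that the backend does not declare."""
    missing = [f"node:{n}" for n in plan_nodes if n not in caps.plan_nodes]
    missing += [f"function:{f}" for f in functions if f not in caps.functions]
    return sorted(missing)


# Illustrative capability set, not a real backend's coverage.
caps = BackendCapabilities(
    name="example-backend",
    plan_nodes=frozenset({"TableScan", "Filter", "Project", "Aggregation"}),
    functions=frozenset({"abs", "concat", "sum"}),
)

# The planner would run this before emitting the plan to any worker.
gaps = unsupported_features(["TableScan", "Window"], ["sum", "regexp_like"], caps)
print(gaps)  # ['function:regexp_like', 'node:Window']
```

An empty result would mean the plan can be sent; a non-empty one lets the coordinator fail fast or pick a different plan shape before anything reaches a worker.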

Comment thread RFC-0024-bolt-backend.md

The initial implementation keeps the existing Velox-based worker unchanged and adds a sibling module, `presto-bolt-execution`, that implements the same Presto worker protocol against Bolt. The coordinator, query protocol, and external worker model remain unchanged.

The current code does not turn `presto-native-execution` into a generic shared framework. Instead, it adds a Bolt-specific worker tree and extracts only a small set of reusable helpers from `presto-native-execution`. Build enablement is also separate in the initial implementation: Velox and Bolt are built from different module directories and produce different worker binaries from different build roots.

Where will the small set of reusable helpers reside?

Author

We could add a top-level presto-native-common-helper module to provide a unified abstraction layer for reusable native integration helpers.

Comment thread RFC-0024-bolt-backend.md
## Summary

This RFC introduces Bolt as an additional backend for Presto's native worker implementation.


Presto currently has other native modules at root level:

  • presto-native-sidecar-plugin
  • presto-native-tests

What's the plan for presto-bolt-execution to work with them? Would presto-native-tests be used to cover both Velox and Bolt?

Author

Before the PR is finalized, presto-bolt-execution will be validated through presto-native-tests. This should help ensure Bolt stays aligned with existing Presto behaviors and test coverage.

For presto-native-sidecar-plugin, we have not tried integrating with it yet. This will be part of our next-step investigation and integration plan.

Comment thread RFC-0024-bolt-backend.md

The initial implementation keeps the existing Velox-based worker unchanged and adds a sibling module, `presto-bolt-execution`, that implements the same Presto worker protocol against Bolt. The coordinator, query protocol, and external worker model remain unchanged.

The current code does not turn `presto-native-execution` into a generic shared framework. Instead, it adds a Bolt-specific worker tree and extracts only a small set of reusable helpers from `presto-native-execution`. Build enablement is also separate in the initial implementation: Velox and Bolt are built from different module directories and produce different worker binaries from different build roots.

Given that Presto already has multiple top-level native modules, and it's unclear whether the native sidecar and native tests can work with Bolt, maybe we can consider adding a common top-level folder to host the common helpers?

jaystarshot (Member) commented May 13, 2026

LGTM (with protocol versioning etc)

rschlussel (Contributor) left a comment

I have some concerns about the overall idea. Mainly, it boils down to this adding significant complexity and overhead without a benefit that is clear to me. What does Presto get out of having Bolt as a backend?
Some of the challenges are:

  1. As mentioned by some other reviewers, it makes any protocol change much harder if we need to worry about two backends.
  2. There's a huge risk of correctness issues and behavior differences between the two backends. The Java -> Prestissimo migration had to contend with many correctness issues, and that was meant to be a one-way migration. Even with a test suite, there will be corner cases where the two backends differ, and we will be dealing with that risk indefinitely.
  3. Maintenance. Once we add this support we will need to maintain it, even if the people originating it move on to other things. Is that something we want to take on as a project? Again, it comes down to my not being clear on what the advantages of having this backend are.

amitkdutta commented May 15, 2026

I share the concern that Rebecca raised. While I appreciate the potential value this could bring, I'd like to highlight the maintenance burden that comes with adding a nearly identical native execution path.

Today, whenever we advance Velox as a submodule, we frequently need to make coordinated changes across both Velox and Presto to keep things working smoothly (e.g., #27390, #27271, #27716). Adding a third execution backend would compound this: a change in Velox that requires adaptation in Presto could just as easily break Bolt-based execution, and vice versa. This can quickly lead to a situation where even a straightforward change requires juggling three separate repositories.

This is manageable today because Velox operates as a single leaf node. However, with multiple leaf nodes as this proposal envisions, the complexity grows significantly. The protocols, refactorings, and code contracts between Presto and Velox are continuously evolving, and maintaining a third integration point without a clear, distinct benefit to the broader community is something I think we should consider carefully before committing to.

ZacBlanco (Contributor) commented May 21, 2026

Thanks for your comments Rebecca and Amit. Sorry for the late reply. We're
taking your concerns seriously and are discussing amongst ourselves how we can
address them. We can appreciate them and understand your hesitation given your
investment in Presto. We can take some steps in the RFC to address them. We're
also open to having some more discussions in future working group or TSC
meetings to hash things out.

To address your points:

What does Presto get out of Bolt?

Presto was originally sold as "SQL on Everything" -- granted, that referred to
the connector interface at the time. But extending that pluggability to the
execution side aligns with the same spirit: giving users choice in how their
queries run, not just where their data lives. Multiple backends also signal
project health: they show investment from multiple stakeholders and give users
meaningful performance tradeoffs to evaluate.

Aside from project health, Bolt gives Presto users another set of performance
tradeoffs. No single native engine is going to be optimal for every workload,
hardware profile, memory model, operator mix, or deployment environment. A Bolt
backend lets users evaluate a different native execution engine for
memory-sensitive workloads, expression-heavy queries, shuffle-heavy pipelines,
or hardware/deployment environments where Bolt may perform better, while keeping
the same Presto coordinator, SQL surface, and operational model. It could expand
the set of workloads where Presto can be competitive without forcing existing
Velox deployments to change.

A second backend also forces the contract between the coordinator and worker to
become a real, explicit API rather than an implicit contract validated by only
one implementation. This is a good thing, as the current protocol is hacky and
not formally defined in a manner that is maintainable or extensible. Introducing
Bolt gives us the opportunity (and necessity) to write clearer protocol
definitions.

Additionally, this type of change is not an isolated case. Apache Gluten is a
useful precedent: it was designed from the start to use multiple backends and
has buy-in from large swaths of the Spark community. Doing the same for Presto
does not feel like an otherworldly idea.

Now to address the specific concerns:

1. Protocol changes become harder

This is a tough problem, and I have a few ideas. I'll lay out what I have
thought about so far, but this might warrant a separate offline discussion.

I do agree that adding Bolt could make protocol changes more difficult, but I
think it's a price you pay if we want more investment into Presto. I think there
are opportunities already where we can make investments in the code architecture
to reduce friction in these areas. I'll try to lay out some of my thoughts about
how to address this issue:

First, I want to state that I feel "Presto Protocol" is a loaded term, as it
combines a few different things:

  1. "Core" protocol
     • These are structures that are at the engine level. Think task and plan
       fragment definitions, operators, etc. They are slow changing and should not
       present too much issue with maintenance burden. Please correct me otherwise.
  2. "Connector" protocol
     • As the name implies here, this would be definitions for connectors.
       Most commonly this would mean the definitions for splits and table handles.
       Today, the contract is very loose and implicit and they are updated
       frequently. The source of truth lies in Java code parsing, which is error
       prone and not very extensible.

Right now it is my understanding that most protocol updates are happening on the
connector side, not really in the engine. So we would need to make it a priority
to find a way to make backend-specific extensions to connector-level protocols
possible, or even to create separate protocols.

As to how we might adjust protocol itself to reduce friction, if the native worker
protocol is intended to be a Presto API, then we should probably aim to make the
protocol more explicitly defined with clear extension points for
backend-specific optimizations. The RFC can be updated to make this point.

There is similar precedent in the Substrait project for adding engine-specific
extensions, where core primitives can be extended so engines can expose their
own optimizations without changing the shared contract. This way, when Velox or
Bolt needs a backend-specific feature, it can evolve through its own extension
path instead of forcing a breaking change to the core protocol.

Currently the core and connector protocols are implicitly defined by Java
classes, both inside presto-main and presto-main-base and within connectors. It
might be better to instead start defining the shared serialized structures
between coordinator and worker in some protobuf or Thrift IDL whose bindings can
be generated for multiple languages, at least Java and C++. This applies to both
the "Core" and "Connector" parts of the protocol. It would help remove a lot of
code and make the contract between coordinator and worker clearer.

Then, for parts of the protocol which might need backend-specific extensions we
can add high-level "optimization" fields to the splits, table handles, etc. This
might look something like an "Any" protobuf field.

Then at runtime during split and table handle generation, we can add in
interface points for the backend to inspect the generated splits and
table handles to tack on any backend-specific extension information.

Here is a rough example of what I'm thinking:

Connector structs could be defined in proto:

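syntax = "proto3";

import "google/protobuf/any.proto";

// Backend-specific payload attached to a shared structure.
message BackendExtension {
  string identifier = 1;              // names the owner of the extension
  google.protobuf.Any extension = 2;  // opaque to everyone but that owner
}

message ConnectorSplit {
  optional bytes connector_split = 1; // serialized connector-specific split
  repeated BackendExtension extensions = 2;
}

message HiveConnectorSplit {
 ...
}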

With a protocol definition like this, presto-native-execution (assuming it stays
Velox-specific) can define some new protos for the extension fields. We could
introduce something to the coordinator like a "BackendOptimizer", where the
backend-specific code can hook in to add these "optimizations".
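
To make the hook idea more concrete, here is a rough sketch using plain dataclasses as stand-ins for the generated protobuf bindings. Everything in it is an assumption made for illustration: the `BackendOptimizer` interface shape, the `extend_split` method name, the `ExampleVeloxOptimizer` class, and the payload bytes are all invented, and `bytes` stands in for a protobuf `Any`.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass
class BackendExtension:
    identifier: str
    extension: bytes  # stand-in for a protobuf Any payload


@dataclass
class ConnectorSplit:
    connector_split: bytes
    extensions: List[BackendExtension] = field(default_factory=list)


class BackendOptimizer(Protocol):
    """Hypothetical coordinator-side hook implemented by each backend."""

    def extend_split(self, split: ConnectorSplit) -> None: ...


class ExampleVeloxOptimizer:
    """Invented example; the payload has no real meaning."""

    def extend_split(self, split: ConnectorSplit) -> None:
        split.extensions.append(BackendExtension("velox", b"prefetch=true"))


def generate_split(raw: bytes, optimizers) -> ConnectorSplit:
    """Build the shared split, then let each backend tack on its extension."""
    split = ConnectorSplit(connector_split=raw)
    for optimizer in optimizers:
        optimizer.extend_split(split)
    return split


def extensions_for(split: ConnectorSplit, identifier: str) -> List[bytes]:
    """A worker reads only the extensions addressed to it."""
    return [e.extension for e in split.extensions if e.identifier == identifier]


split = generate_split(b"hive-split-bytes", [ExampleVeloxOptimizer()])
print(extensions_for(split, "velox"))  # [b'prefetch=true']
print(extensions_for(split, "bolt"))   # []
```

The shared `connector_split` bytes are untouched by the hook, so a backend that ignores extensions still sees the same split.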

On a separate line of thought, it might be simpler to just introduce
backend-specific connectors so that the changes to a connector specific to Velox
can live in one place and those specific to Bolt can live in another. This would
allow connectors to live and move on independently between backends. Today, most
connectors are still Java-only and there isn't a clear contract of when a
backend (e.g. Java, Velox, Bolt) even supports a connector. Connectors don't
have to live entirely independently: shared modules like presto-hive-common can
do some heavy lifting to prevent code duplication. Then, backends can have their
own modules which override connector behavior with their own specific changes.

In this case the code structure might look something like:

presto
├── presto-java-execution (java connectors, unsupported by other backends)
│   ├── presto-kafka
│   ├── presto-mysql
│   ├── presto-hive (original java connector)
│   └── presto-postgresql
├── presto-hive-common (shared hive connector code)
├── presto-bolt-execution
│   └── presto-hive (bolt-specific hive connector)
└── presto-native-execution
    └── presto-hive (velox-specific hive connector)

Then it would also be clearer which backends support which connectors, because
today there isn't a clear distinction of when Velox (or Java) supports a
connector.
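
As a small illustration of how that distinction could become checkable, the sketch below keeps a registry of which connectors each backend ships, mirroring the module layout in the tree. The registry, the function names, and the catalog names are hypothetical; nothing like this exists in Presto today.

```python
# Hypothetical registry derived from the proposed module layout.
BACKEND_CONNECTORS = {
    "java": {"kafka", "mysql", "hive", "postgresql"},
    "velox": {"hive"},
    "bolt": {"hive"},
}


def backends_supporting(connector):
    """Which backends ship an implementation of this connector."""
    return sorted(b for b, cs in BACKEND_CONNECTORS.items() if connector in cs)


def unsupported_catalogs(backend, catalogs):
    """Return the catalogs whose connector the given backend does not provide."""
    supported = BACKEND_CONNECTORS[backend]
    return {name: conn for name, conn in catalogs.items() if conn not in supported}


print(backends_supporting("hive"))   # ['bolt', 'java', 'velox']
print(backends_supporting("kafka"))  # ['java']

# Made-up catalog configuration for a cluster running the Bolt backend.
catalogs = {"warehouse": "hive", "events": "kafka"}
print(unsupported_catalogs("bolt", catalogs))  # {'events': 'kafka'}
```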

With either of these approaches, updates to the connector for a specific backend
can still occur, but with less friction and burden for developers targeting a
particular execution environment.

2. Correctness differences between backends

This is a real concern, and I do not think Bolt should mean silently accepting
incorrect results. For the functionality a backend claims to support, it should
be held to Presto semantics and validated by shared correctness tests.

The practical distinction is between "supported" and "unsupported" behavior.
If Bolt does not support a function, type, operator, or semantic edge case, that
should be declared through capabilities and either rejected, planned away, or
eventually routed to a fallback path. That is a stronger contract than relying
on one implementation's behavior as the de facto definition of correctness.

Gluten is again a useful precedent: it supports multiple native backends while
using a shared contract, backend capabilities, and backend-specific validation.
The goal should not be that every backend supports every feature on day one; the
goal should be that supported features are correct, unsupported features are
explicit, and the test matrix makes the difference visible.

So my view is: common workloads and supported features should return the same
results, and the RFC should define the correctness/conformance bar clearly. But
we should not require every optional backend to immediately cover 100% of the
existing Java/Velox surface before it can provide value to users.

3. Long-term maintenance

We agree that Bolt should not become an unbounded maintenance burden. The best
way to avoid that is to make the native worker protocol stable and explicit.
Changes to the shared API surface should be responsible for ensuring supported
backends do not break, but the design should make sweeping shared protocol
changes uncommon. Backend-specific behavior should usually be added through
backend-specific extensions (or connectors), so a Velox-specific or
Bolt-specific feature does not force churn across every environment.

The conformance bar should also be explicit. Bolt should pass the shared native
tests and agreed benchmark/query suites such as TPC-DS for the feature set it
claims to support. If a protocol change breaks Bolt, that should be visible in
CI and addressed as part of the API change. If Bolt adds a new backend-specific
optimization, it should be able to evolve independently from Velox while still
maintaining compatibility with the shared Presto protocol.
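
One way to picture that conformance bar is a shared test matrix in which every case is tagged with the feature it exercises, and each backend is judged only on the features it claims. This is a sketch under assumptions: `Case`, `run_conformance`, the feature tags, and the canned results are all invented, and `execute` stands in for actually running a query on a backend.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class Case:
    name: str
    feature: str   # the capability this case exercises
    expected: Any  # the result required by Presto semantics


def run_conformance(
    cases: List[Case],
    supported: FrozenSet[str],
    execute: Callable[[str], Any],
) -> Dict[str, str]:
    """Classify each case as pass, fail, or unsupported for one backend."""
    report = {}
    for case in cases:
        if case.feature not in supported:
            report[case.name] = "unsupported"  # reported, never silently skipped
        elif execute(case.name) == case.expected:
            report[case.name] = "pass"
        else:
            report[case.name] = "fail"
    return report


cases = [
    Case("sum_ints", "aggregation", 6),
    Case("lower_ascii", "string", "abc"),
    Case("rank_rows", "window", [1, 2]),
]

# Canned results standing in for a backend; lower_ascii is deliberately wrong.
fake_results = {"sum_ints": 6, "lower_ascii": "ABC"}
report = run_conformance(cases, frozenset({"aggregation", "string"}), fake_results.get)
print(report)
# {'sum_ints': 'pass', 'lower_ascii': 'fail', 'rank_rows': 'unsupported'}
```

A CI lane could then fail on any "fail" entry while publishing the "unsupported" entries as the visible coverage gap.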

I would also point again to Gluten: we are in the midst of contributing Bolt as
a backend there as well, and the Gluten community has agreed to accept it under
a model where a common backend contract coexists with backend-specific
extensions. Our commitment to Bolt in Gluten should also be a signal that Bolt
is not a one-off experiment, and that we are committed to maintaining these
integrations.

4. Coordinated Changes

Also, to address Amit's example changes: most of them seem to be changes that
Velox updates required in the native worker's C++ code rather than in the Java
code. For these cases, the native worker should be structured so that any code
in the Presto repository that must change when Velox is bumped is isolated to
the Velox-specific parts of the worker. Updating Velox should not require
updating or maintaining a secondary backend if the code architecture is
designed well.

I think our work in the RFC prototype to refactor common parts of the native
server but keep the engine-specific components in their respective native
backend modules is a step in the right direction. I don't think every problem is
solved by the existing prototype, but we can iterate to improve on the native
worker's design as we find more issues.

So I would summarize the value this way: Bolt is worth adding not because every
Presto deployment needs Bolt, but because it makes Presto's native execution
layer more useful and more durable if we invest in stabilizing Presto's protocol
APIs. It gives users another set of performance tradeoffs while preserving the
same coordinator, SQL surface, connector model, and operational model, and it
pushes the coordinator-worker contract to become an explicit, versioned,
extensible API rather than a Velox-shaped implementation detail. If the RFC
commits to shared conformance tests, explicit capabilities, CI coverage, and
backend-specific extension points, then Bolt can evolve alongside Velox without
turning every Velox or connector change into a cross-backend rewrite. That is
the value I think Presto gets: more choice for users, a healthier native
execution ecosystem, and a stronger long-term backend contract.

yingsu00 commented

I think the current discussion should be framed around a more fundamental question:

Should Presto own a backend-neutral execution contract, or should Presto’s native execution path remain implicitly shaped around one backend implementation?

Supporting a backend-neutral architecture does not mean adding one backend becomes everyone’s maintenance responsibility. I do not think anyone is asking existing maintainers to own Bolt internals or even the Bolt-Presto contracts. If ByteDance contributes and supports the Bolt backend, then ByteDance should own the Bolt-specific maintenance work.

The shared responsibility should be limited to what Presto itself needs to own anyway:

  • stable execution semantics
  • clear coordinator-worker contracts
  • explicit protocol boundaries
  • shared conformance tests

That is not Bolt-specific overhead. That is an architectural cleanup.

I believe this direction is also consistent with how Velox itself has been presented publicly. The Velox paper describes Velox as providing

“reusable, extensible, high-performance, and dialect-agnostic data processing components for building execution engines.”

It also says Velox is intended to help make data systems “more modular and interoperable” and to support the “one size does not fit all” principle.

The Velox website similarly describes Velox as

composable C++ execution library with reusable components for different analytical workloads.

The newer Axiom direction makes this even more explicit. Axiom is described as:

A C++ library for building fully composable, high-performance query engines, built on top of Velox.

and:

Think of it as Lego for query processing — the pieces are compatible, but don't restrict how you put them together.

https://velox-lib.io/blog/

That is a composable execution infrastructure philosophy. It is not a “single backend forever” philosophy.

So I find it difficult to reconcile the public positioning of Velox/Axiom as reusable, composable, engine-neutral infrastructure with the idea that Presto should not allow another serious community-maintained backend.

This is also how mature infrastructure ecosystems usually evolve:

  • Linux supports multiple filesystems.
  • LLVM supports many compiler frontends and targets.
  • MySQL supports multiple storage engines.
  • The JVM supports multiple garbage collectors.
  • Arrow is used by many independent compute engines.

In all these cases, the ecosystem became stronger by defining stable contracts and allowing multiple implementations to compete, specialize, and evolve. Nobody expects one implementation to satisfy every workload forever.

Query execution should be no different.

No single backend can address every community’s needs:

  • different workload profiles
  • different connector ecosystems
  • different shuffle architectures
  • different deployment environments
  • different hardware targets
  • different organizational priorities

That is exactly why backend-neutral contracts matter.

I also want to respectfully push back on the framing of:

“What does Presto get out of Bolt?”

Open-source ecosystems are healthiest when multiple serious community participants are able to contribute meaningful technical directions, especially when they are willing to own the implementation and maintenance cost themselves.

Bolt already brings concrete value:

  • additional connector investment, including Paimon, a widely used table format focused on high-throughput streaming workloads
  • active work toward cleaner connector abstractions
  • pressure toward cleaner external contracts with Presto
  • more modular execution boundaries
  • another execution engine with better performance on some workloads

That is not just “another copy of Velox.”

It is an opportunity to make Presto’s native execution architecture cleaner, more modular, and more community-driven.

On the specific concerns:

  1. “Velox is a leaf dependency”

I do not think this is true anymore in practice.

Velox changes often require coordinated updates in Presto native execution code, connector behavior, protocol structures, tests, and sometimes planner assumptions. That means Velox is not merely a leaf library. It is already shaping Presto’s native execution architecture.

The problem is not that Bolt creates coupling.

The problem is that existing coupling is currently hidden because Velox is the only native backend.

Bolt exposes that coupling and forces us to formalize the boundary.

  2. “Adding Bolt makes maintenance messier and creates burden”

Only if we design the architecture incorrectly.

Bolt-specific compatibility should be owned by the Bolt maintainers. Velox-specific compatibility should be owned by the Velox backend maintainers. Shared Presto contracts should be explicit, stable, and versioned.

The examples of Velox advancement PRs requiring coordinated changes mostly show that the current Presto-Velox boundary is too tightly coupled. They do not prove that another backend is inherently wrong.

A better architecture should make:

  • Velox changes mostly Velox-local
  • Bolt changes mostly Bolt-local
  • shared protocol/API changes explicit and intentional

If adding Bolt reveals that some interfaces are too implementation-specific or too unstable, that is useful architectural feedback.

And importantly, supporting Bolt does not automatically mean existing maintainers now need to spend substantial time adapting Bolt whenever Velox changes. ByteDance is fully capable of owning Bolt-side adaptation work, just like many organizations already maintain their own integrations, forks, connectors, deployment layers, and infrastructure extensions on top of Presto and Velox today.

In summary, the deeper question here is whether Presto wants to remain a truly open execution platform, or whether native execution should effectively become tied to one backend’s assumptions and priorities.

I believe Presto will be healthier if its execution semantics and contracts belong to Presto itself.

9 participants