Add RFC to introduce Bolt backend for native engine #59
Looking forward to having Bolt in native Presto workers!
Improve RFC with more implementation specifics
The current code includes dedicated Bolt converters such as:

* `PrestoToBoltQueryPlan`
- The Bolt worker deserializes those fragments using the existing Presto protocol model
How will future divergence in the protocol with Velox be handled? What happens if Velox requires a protocol change that isn’t compatible with Bolt, and vice versa?
It depends on where the change originates.
If it’s a change to the Presto protocol, then it becomes a contract change, and we need to ensure both Bolt and Velox can handle it correctly (ideally in a backward-compatible way).
If it’s a Bolt- or Velox-specific interface change, then it should be handled within their respective translation/converter layers.
- The Bolt worker deserializes those fragments using the existing Presto protocol model
How will future divergence in the protocol with Velox be handled? What happens if Velox requires a protocol change that isn’t compatible with Bolt, and vice versa?
I have the same question too. It's actually not just about the Presto protocol but a very general concern overall. I think the right way to handle this kind of concern is to support versioning on common interfaces like the Presto SPI, the Presto communication protocol, Velox interfaces, etc. In the past, people have been very cautious when changes needed to be made to the Presto SPI, but changes were made very frequently and freely on the Velox side, such as to the connector interfaces, DWIO interfaces, and Presto protocol. This caused lots of rebase conflicts in our internal repo. I hope ByteDance Bolt can do a better job on this in the future.
So I see this as an opportunity to start cleaning things up, and maybe we can start working on versioning support on protocol in Presto and Bolt repos first.
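A minimal sketch of what protocol versioning between a coordinator and heterogeneous workers could look like. Nothing here exists in Presto, Velox, or Bolt today: `ProtocolRange` and `negotiate` are hypothetical names, and the version numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProtocolRange:
    """Inclusive range of protocol versions a peer can speak (hypothetical)."""
    min_version: int
    max_version: int


def negotiate(coordinator: ProtocolRange, worker: ProtocolRange) -> Optional[int]:
    """Return the highest version both sides support, or None if the ranges are disjoint."""
    low = max(coordinator.min_version, worker.min_version)
    high = min(coordinator.max_version, worker.max_version)
    return high if low <= high else None


# Made-up version ranges: each backend advertises what it can deserialize.
coordinator = ProtocolRange(3, 5)
velox_worker = ProtocolRange(4, 5)
bolt_worker = ProtocolRange(2, 4)

print(negotiate(coordinator, velox_worker))  # 5
print(negotiate(coordinator, bolt_worker))   # 4
```

With something along these lines, a backend that lags behind a protocol change would fail negotiation explicitly instead of failing during fragment deserialization.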
* `PrestoToBoltExpr`
* `PrestoToBoltConnector`
* `PrestoToBoltSplit`
* `BoltPlanConversion` and `BoltPlanValidator`
Does all the side-car code (the callbacks here: https://github.com/prestodb/presto/blob/master/presto-native-execution/presto_cpp/main/PrestoServer.cpp#L1851) remain the same?
Bolt execution aims to cover all side-car callbacks and extend them where needed.
### 5. CI Plan

CI for Bolt should be split into a few clear lanes:
We have a fairly exhaustive set of tests in the presto-native-tests module: https://github.com/prestodb/presto/tree/master/presto-native-tests. Please ensure these are covered as well.
End-to-end tests will use the same test module wherever possible.
This is intentionally backend-local. The initial implementation does not try to share plan conversion logic with the Velox backend.
Can we have the function-coverage delta: which Velox functions are not yet in Bolt, and which Bolt functions don't match Velox semantics?
Function coverage will be reflected in the bolt/bolt-execution unit tests and will match Presto semantics.
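One way the requested delta could be produced, as a rough sketch: compare two function registries by name, then run shared test cases through the functions both sides have. The registries below are toy stand-ins named `backend_a` and `backend_b`; they do not describe the actual function sets or behavior of Velox or Bolt, and `coverage_delta` and `semantic_mismatches` are hypothetical helpers.

```python
import math
from typing import Any, Callable, Dict, List, Tuple

Registry = Dict[str, Callable[..., Any]]


def coverage_delta(reference: Registry, candidate: Registry) -> Dict[str, List[str]]:
    """Names present on one side only."""
    return {
        "missing_in_candidate": sorted(reference.keys() - candidate.keys()),
        "extra_in_candidate": sorted(candidate.keys() - reference.keys()),
    }


def semantic_mismatches(
    reference: Registry,
    candidate: Registry,
    cases: Dict[str, List[Tuple[Any, ...]]],
) -> List[Tuple[str, Tuple[Any, ...], Any, Any]]:
    """Run shared cases through functions both sides implement and collect differing results."""
    mismatches = []
    for name in sorted(reference.keys() & candidate.keys()):
        for args in cases.get(name, []):
            expected = reference[name](*args)
            actual = candidate[name](*args)
            if expected != actual:
                mismatches.append((name, args, expected, actual))
    return mismatches


# Toy registries: "round" deliberately differs (half-up vs. Python's half-to-even).
backend_a: Registry = {"abs": abs, "round": lambda x: math.floor(x + 0.5), "sqrt": math.sqrt}
backend_b: Registry = {"abs": abs, "round": round, "cbrt_toy": lambda x: x ** (1 / 3)}
cases = {"abs": [(-3,)], "round": [(1.5,), (2.5,)]}

print(coverage_delta(backend_a, backend_b))
print(semantic_mismatches(backend_a, backend_b, cases))
```

The name-level delta answers "which functions are missing", while the case-driven comparison surfaces functions that exist on both sides but disagree on specific inputs.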
* worker server implementation
* task execution logic
* operators
* plan, expression, connector, and split conversion
The coordinator's planner produces one plan, and the RFC names BoltPlanValidator as the Bolt-side analog of getVeloxPlanValidator(). But validation is downstream of plan emission: by the time the worker rejects a plan, the query has already been sent. It might be better to have a coordinator-side capability description so the planner can avoid emitting plans the deployed backend can't run. The RFC should at least call out that this gap exists and how it's bridged for the homogeneous-pool case.
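A rough sketch of the coordinator-side capability check suggested here, assuming a homogeneous worker pool that advertises one capability set. `BackendCapabilities`, `PlanRequirements`, and `unsupported` are hypothetical names, and the operator and function lists are illustrative, not Bolt's real coverage.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class BackendCapabilities:
    """What the deployed backend says it can run (hypothetical description)."""
    name: str
    operators: FrozenSet[str]
    functions: FrozenSet[str]


@dataclass(frozen=True)
class PlanRequirements:
    """What a candidate plan needs from the worker."""
    operators: FrozenSet[str]
    functions: FrozenSet[str]


def unsupported(plan: PlanRequirements, backend: BackendCapabilities) -> List[str]:
    """List the plan features the backend does not advertise; empty means safe to emit."""
    gaps = [f"operator:{op}" for op in sorted(plan.operators - backend.operators)]
    gaps += [f"function:{fn}" for fn in sorted(plan.functions - backend.functions)]
    return gaps


# Illustrative capability set only.
backend = BackendCapabilities(
    name="bolt",
    operators=frozenset({"TableScan", "Filter", "HashJoin"}),
    functions=frozenset({"abs", "concat"}),
)
plan = PlanRequirements(
    operators=frozenset({"TableScan", "Window"}),
    functions=frozenset({"abs"}),
)

print(unsupported(plan, backend))  # ['operator:Window']
```

A planner could consult such a check before emitting fragments, leaving the worker-side validator as a second line of defense rather than the only one.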
The initial implementation keeps the existing Velox-based worker unchanged and adds a sibling module, `presto-bolt-execution`, that implements the same Presto worker protocol against Bolt. The coordinator, query protocol, and external worker model remain unchanged.

The current code does not turn `presto-native-execution` into a generic shared framework. Instead, it adds a Bolt-specific worker tree and extracts only a small set of reusable helpers from `presto-native-execution`. Build enablement is also separate in the initial implementation: Velox and Bolt are built from different module directories and produce different worker binaries from different build roots.
Where will the small set of reusable helpers reside?
We could add a top-level presto-native-common-helper module to provide a unified abstraction layer for reusable native integration helpers.
## Summary

This RFC introduces Bolt as an additional backend for Presto's native worker implementation.
Presto currently has other native modules at root level:
- presto-native-sidecar-plugin
- presto-native-tests
What's the plan for presto-bolt-execution to work with them? Would presto-native-tests be used to cover both Velox and Bolt?
Before the PR is finalized, presto-bolt-execution will be validated through presto-native-tests. This should help ensure Bolt stays aligned with existing Presto behaviors and test coverage.
For presto-native-sidecar-plugin, we have not tried integrating with it yet. This will be part of our next-step investigation and integration plan.
Given that Presto already has multiple top-level native modules, and it's unclear whether the native sidecar and native tests can work with Bolt, maybe we can consider adding a common top-level folder to host the common helpers?
|
LGTM (with protocol versioning etc) |
rschlussel
left a comment
I have some concerns about the overall idea. Mainly, it boils down to this adding significant complexity/overhead, and the benefit isn't clear to me. What does Presto get out of having Bolt as a backend?
Some of the challenges are:
- As mentioned by some other reviewers, it makes any protocol changes much harder if we need to worry about two backends.
- There's a huge risk of correctness issues/behavior differences between the two backends. The Java -> Prestissimo migration had to contend with so many correctness issues, and that was meant to be a one-way migration. Even with a test suite, there will definitely be corner cases with correctness differences between the two, and we will be dealing with that risk indefinitely.
- maintenance. Once we add this support we will need to maintain it, even if the people originating it move on to other things. Is that something we want to take on as a project? Again, it comes down to that I'm not clear on what the advantages are of having this backend.
|
I share the concern that Rebecca raised. While I appreciate the potential value this could bring, I'd like to highlight the maintenance burden that comes with adding a nearly identical native execution path. Today, whenever we advance Velox as a submodule, we frequently need to make coordinated changes across both Velox and Presto to keep things working smoothly (e.g., #27390, #27271, #27716). Adding a third execution backend would compound this: a change in Velox that requires adaptation in Presto could just as easily break Bolt-based execution, and vice versa. This can quickly lead to a situation where even a straightforward change requires juggling three separate repositories. This is manageable today because Velox operates as a single leaf node. However, with multiple leaf nodes as this proposal envisions, the complexity grows significantly. The protocols, refactorings, and code contracts between Presto and Velox are continuously evolving, and maintaining a third integration point without a clear, distinct benefit to the broader community is something I think we should consider carefully before committing to.
|
Thanks for your comments Rebecca and Amit. Sorry for the late reply. We're
To address your points:
What does Presto get out of Bolt?
Presto was originally sold as "SQL on Everything" -- granted, that referred to
Aside from project health, Bolt gives Presto users another set of performance
A second backend also forces the contract between the coordinator and worker to
Additionally, this type of change is not an isolated case. Apache Gluten is a
Now to address the specific concerns:
1. Protocol changes become harder
This is a tough problem, and I have a few ideas. I'll lay out what I have
I do agree that adding bolt could make protocol changes more difficult, but I
First want to state that I feel "Presto Protocol" is a loaded term as it
Right now it is my understanding that most protocol updates are happening on the
As to how we might adjust protocol itself to reduce friction, if the native worker
There is similar precedent in the Substrait
Currently the core and connector protocols are implicitly defined with Java
Then, for parts of the protocol which might need backend-specific extensions we
Then at runtime during split and table handle generation, we can add in
Here is a rough example of what I'm thinking:
Connector structs could be defined in proto:
message BackendExtension {
  string identifier;
  Any extension;
}
message ConnectorSplit {
  optional bytes connector_split;
  repeated BackendExtension extensions;
}
message HiveConnectorSplit {
  ...
}
With a protocol definition like this, presto-native-execution (assuming it stays
On a separate line of thought, it might be simpler to just introduce
In this case the code structure might look something like:
Then it would also be clearer which backends support which connectors, because
By using either of these approaches, we can still make it such that updates to
2. Correctness differences between backends
This is a real concern, and I do not think Bolt should mean silently accepting
The practical distinction is between "supported" and "unsupported" behavior.
Gluten is again a useful precedent: it supports multiple native backends while
So my view is: common workloads and supported features should return the same
3. Long-term maintenance
We agree that Bolt should not become an unbounded maintenance burden. The best
The conformance bar should also be explicit. Bolt should pass the shared native
I would also point again to Gluten: we are in the midst of contributing Bolt as
4. Coordinated Changes
Also, to address Amit's example changes, most of them seem to be changes
I think our work in the RFC prototype to refactor common parts of the native
So I would summarize the value this way: Bolt is worth adding not because every
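To make the `BackendExtension` idea from the rough proto example in this comment concrete, here is a small sketch of how a worker might pick out its own extension and ignore the rest. The dataclasses mirror the message names above, `bytes` stands in for the `Any` payload, and `extension_for` plus the `"velox"` and `"bolt"` identifiers are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BackendExtension:
    identifier: str   # which backend this payload is meant for
    extension: bytes  # stand-in for the proto `Any` payload


@dataclass
class ConnectorSplit:
    connector_split: Optional[bytes] = None
    extensions: List[BackendExtension] = field(default_factory=list)


def extension_for(split: ConnectorSplit, backend_id: str) -> Optional[bytes]:
    """Return the payload addressed to backend_id, or None if the split carries none."""
    for ext in split.extensions:
        if ext.identifier == backend_id:
            return ext.extension
    return None


split = ConnectorSplit(
    connector_split=b"common-split-bytes",
    extensions=[
        BackendExtension("velox", b"velox-only-options"),
        BackendExtension("bolt", b"bolt-only-options"),
    ],
)

print(extension_for(split, "bolt"))     # b'bolt-only-options'
print(extension_for(split, "unknown"))  # None
```

Under this assumption, a backend that does not recognize an extension simply skips it, so adding a backend-specific field would not require touching the shared part of the split.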
|
I think the current discussion should be framed around a more fundamental question: Should Presto own a backend-neutral execution contract, or should Presto’s native execution path remain implicitly shaped around one backend implementation? Supporting a backend-neutral architecture does not mean adding one backend becomes everyone’s maintenance responsibility. I do not think anyone is asking existing maintainers to own Bolt internals or even the Bolt-Presto contracts. If ByteDance contributes and supports the Bolt backend, then ByteDance should own the Bolt-specific maintenance work. The shared responsibility should be limited to what Presto itself needs to own anyway:
That is not Bolt-specific overhead. That is an architectural cleanup. I believe this direction is also consistent with how Velox itself has been presented publicly. The Velox paper describes Velox as providing
It also says Velox is intended to help make data systems “more modular and interoperable” and to support the “one size does not fit all” principle. The Velox website similarly describes Velox as
The newer Axiom direction makes this even more explicit. Axiom is described as:
and:
That is a composable execution infrastructure philosophy. It is not a “single backend forever” philosophy. So I find it difficult to reconcile the public positioning of Velox/Axiom as reusable, composable, engine-neutral infrastructure with the idea that Presto should not allow another serious community-maintained backend. This is also how mature infrastructure ecosystems usually evolve:
In all these cases, the ecosystem became stronger by defining stable contracts and allowing multiple implementations to compete, specialize, and evolve. Nobody expects one implementation to satisfy every workload forever. Query execution should be no different. No single backend can address every community’s needs:
That is exactly why backend-neutral contracts matter. I also want to respectfully push back on the framing of:
Open-source ecosystems are healthiest when multiple serious community participants are able to contribute meaningful technical directions, especially when they are willing to own the implementation and maintenance cost themselves. Bolt already brings concrete value:
That is not just “another copy of Velox.” It is an opportunity to make Presto’s native execution architecture cleaner, more modular, and more community-driven. On the specific concerns:
I do not think this is true anymore in practice. Velox changes often require coordinated updates in Presto native execution code, connector behavior, protocol structures, tests, and sometimes planner assumptions. That means Velox is not merely a leaf library. It is already shaping Presto’s native execution architecture. The problem is not that Bolt creates coupling. The problem is that existing coupling is currently hidden because Velox is the only native backend. Bolt exposes that coupling and forces us to formalize the boundary.
Only if we design the architecture incorrectly. Bolt-specific compatibility should be owned by the Bolt maintainers. Velox-specific compatibility should be owned by the Velox backend maintainers. Shared Presto contracts should be explicit, stable, and versioned. The examples of Velox advancement PRs requiring coordinated changes mostly show that the current Presto-Velox boundary is too tightly coupled. They do not prove that another backend is inherently wrong. A better architecture should make:
If adding Bolt reveals that some interfaces are too implementation-specific or too unstable, that is useful architectural feedback. And importantly, supporting Bolt does not automatically mean existing maintainers now need to spend substantial time adapting Bolt whenever Velox changes. ByteDance is fully capable of owning Bolt-side adaptation work, just like many organizations already maintain their own integrations, forks, connectors, deployment layers, and infrastructure extensions on top of Presto and Velox today. In summary, the deeper question here is whether Presto wants to remain a truly open execution platform, or whether native execution should effectively become tied to one backend’s assumptions and priorities. I believe Presto will be healthier if its execution semantics and contracts belong to Presto itself. |
Summary
Introduce Bolt as an additional backend for the Presto native execution engine.
The initial implementation provides a Bolt-based native worker that implements the Presto worker protocol and integrates with the existing Presto coordinator.
To support the Bolt backend build and dependency requirements, a Conan-based dependency flow is introduced for this worker module. Standardizing dependency management across all native backends is out of scope for this RFC.
@frankobe @ZacBlanco