|
| 1 | +% Licensed to the Apache Software Foundation (ASF) under one |
| 2 | + |
| 3 | +% or more contributor license agreements. See the NOTICE file |
| 4 | + |
| 5 | +% distributed with this work for additional information |
| 6 | + |
| 7 | +% regarding copyright ownership. The ASF licenses this file |
| 8 | + |
| 9 | +% to you under the Apache License, Version 2.0 (the |
| 10 | + |
| 11 | +% "License"); you may not use this file except in compliance |
| 12 | + |
| 13 | +% with the License. You may obtain a copy of the License at |
| 14 | + |
| 15 | +% http://www.apache.org/licenses/LICENSE-2.0 |
| 16 | + |
| 17 | +% Unless required by applicable law or agreed to in writing, |
| 18 | + |
| 19 | +% software distributed under the License is distributed on an |
| 20 | + |
| 21 | +% "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 22 | + |
| 23 | +% KIND, either express or implied. See the License for the |
| 24 | + |
| 25 | +% specific language governing permissions and limitations |
| 26 | + |
| 27 | +% under the License. |
| 28 | + |
| 29 | +(ffi)= |
| 30 | + |
| 31 | +# Python Extensions |
| 32 | + |
| 33 | +The DataFusion in Python project is designed to allow users to extend its functionality in a few core |
| 34 | +areas. Many users want to ship their extensions as a separate Python package that integrates easily |
| 35 | +with this project. This page describes some of the challenges these integrations pose and the |
| 36 | +approach our project uses to address them. |
| 37 | + |
| 38 | +## The Primary Issue |
| 39 | + |
| 40 | +Suppose you wish to use DataFusion and you have a custom data source that can produce tables that |
| 41 | +can then be queried against, similar to how you can register a {ref}`CSV <io_csv>` or |
| 42 | +{ref}`Parquet <io_parquet>` file. In DataFusion terminology, you likely want to implement a |
| 43 | +{ref}`Custom Table Provider <io_custom_table_provider>`. In an effort to make your data source |
| 44 | +as performant as possible and to utilize the features of DataFusion, you may decide to write |
| 45 | +your source in Rust and then expose it through [PyO3](https://pyo3.rs) as a Python library. |
| 46 | + |
| 47 | +At first glance, it may appear the best way to do this is to add the `datafusion-python` |
| 48 | +crate as a dependency, provide a `PyTable`, and then to register it with the |
| 49 | +`SessionContext`. Unfortunately, this will not work. |
| 50 | + |
| 51 | +When you produce your code as a Python library and it needs to interact with the DataFusion |
| 52 | +library, at the lowest level they communicate through an Application Binary Interface (ABI). |
| 53 | +The acronym sounds similar to API (Application Programming Interface), but it is distinctly |
| 54 | +different. |
| 55 | + |
| 56 | +The ABI sets the standard for how these libraries can share data and functions between each |
| 57 | +other. One of the key differences between Rust and other programming languages is that Rust |
| 58 | +does not have a stable ABI. What this means in practice is that if you compile a Rust library |
| 59 | +with one version of the `rustc` compiler and I compile another library to interface with it |
| 60 | +but I use a different version of the compiler, there is no guarantee the interface will be |
| 61 | +the same. |
| 62 | + |
| 63 | +In practice, this means that a Python library built with `datafusion-python` as a Rust |
| 64 | +dependency will generally **not** be compatible with the DataFusion Python package, even |
| 65 | +if they reference the same version of `datafusion-python`. If you attempt to do this, it may |
| 66 | +work on your local computer if you have built both packages with the same optimizations. |
| 67 | +This can sometimes lead to a false expectation that the code will work, but it frequently |
| 68 | +breaks the moment you try to use your package against the released packages. |
| 69 | + |
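One visible part of an ABI is how a struct is laid out in memory. The following is a minimal sketch, using only the standard library, that contrasts the default Rust representation with `#[repr(C)]`. The struct and function names are hypothetical and chosen only for illustration; the ABI also covers other things, such as calling conventions, that this sketch does not show.

```rust
use std::mem::{align_of, size_of};

/// Default (Rust) representation: the compiler is free to reorder fields, and
/// the resulting layout is not guaranteed to match between compiler versions.
#[allow(dead_code)]
struct RustLayout {
    flag: u8,
    value: u32,
    tag: u8,
}

/// C representation: fields are placed in declaration order using C padding
/// rules, so separately compiled libraries can agree on the layout.
#[allow(dead_code)]
#[repr(C)]
struct CLayout {
    flag: u8,
    value: u32,
    tag: u8,
}

/// Returns (size, alignment) in bytes of the `#[repr(C)]` struct.
/// Assumes a target where `u32` is 4-byte aligned, which gives
/// 1 + 3 (padding) + 4 + 1 + 3 (padding) = 12 bytes.
fn c_layout() -> (usize, usize) {
    (size_of::<CLayout>(), align_of::<CLayout>())
}

/// Returns the size of the default-representation struct. The exact value is
/// an implementation detail of the compiler and should not be relied upon.
fn rust_layout_size() -> usize {
    size_of::<RustLayout>()
}
```

Only the `#[repr(C)]` layout is something two independently built libraries can depend on, which is why FFI-oriented structs use it.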
| 70 | +You can find more information about the Rust ABI in the |
| 71 | +[Rust Reference](https://doc.rust-lang.org/reference/abi.html). |
| 72 | + |
| 73 | +## The FFI Approach |
| 74 | + |
| 75 | +Rust supports interacting with other programming languages through its Foreign Function |
| 76 | +Interface (FFI). The advantage of using the FFI is that it enables you to write data structures |
| 77 | +and functions that have a stable ABI. This allows you to use Rust code with C, Python, and |
| 78 | +other languages. In fact, the [PyO3](https://pyo3.rs) library uses the FFI to share data |
| 79 | +and functions between Python and Rust. |
| 80 | + |
| 81 | +The approach we are taking in the DataFusion in Python project is to incrementally expose |
| 82 | +more portions of the DataFusion project via FFI interfaces. This allows users to write Rust |
| 83 | +code that does **not** require the `datafusion-python` crate as a dependency, expose their |
| 84 | +code in Python via PyO3, and have it interact with the DataFusion Python package. |
| 85 | + |
| 86 | +Early adopters of this approach include [delta-rs](https://delta-io.github.io/delta-rs/), |
| 87 | +which has adapted its Table Provider for use in `datafusion-python` with only a few lines |
| 88 | +of code. Also, the DataFusion Python project uses the existing definitions from the |
| 89 | +[Apache Arrow C Stream Interface](https://arrow.apache.org/docs/format/CStreamInterface.html) |
| 90 | +to support importing **and** exporting tables. Any Python package that supports reading |
| 91 | +the Arrow C Stream interface can work with DataFusion Python out of the box! You can read |
| 92 | +more about working with Arrow sources in the {ref}`Data Sources <user_guide_data_sources>` |
| 93 | +page. |
| 94 | + |
| 95 | +To learn more about the Foreign Function Interface in Rust, the |
| 96 | +[Rustonomicon](https://doc.rust-lang.org/nomicon/ffi.html) is a good resource. |
| 97 | + |
| 98 | +## Inspiration from Arrow |
| 99 | + |
| 100 | +DataFusion is built upon [Apache Arrow](https://arrow.apache.org/). The canonical Python |
| 101 | +Arrow implementation, [pyarrow](https://arrow.apache.org/docs/python/index.html), provides |
| 102 | +an excellent way to share Arrow data between Python projects without performing any copy |
| 103 | +operations on the data. It does this by using a well-defined set of interfaces. You can |
| 104 | +find the details in the |
| 105 | +[Arrow C Stream Interface specification](https://arrow.apache.org/docs/format/CStreamInterface.html). The |
| 106 | +[Rust Arrow Implementation](https://github.com/apache/arrow-rs) also supports these |
| 107 | +`C` style definitions via the Foreign Function Interface. |
| 108 | + |
| 109 | +In addition to using these interfaces to transfer Arrow data between libraries, `pyarrow` |
| 110 | +goes one step further to make sharing the interfaces easier in Python. It does this |
| 111 | +by exposing PyCapsules that contain the expected functionality. |
| 112 | + |
| 113 | +You can learn more about PyCapsules from the official |
| 114 | +[Python online documentation](https://docs.python.org/3/c-api/capsule.html). PyCapsules |
| 115 | +have excellent support in PyO3 already. The |
| 116 | +[PyO3 online documentation](https://pyo3.rs/main/doc/pyo3/types/struct.pycapsule) is a good source |
| 117 | +for more details on using PyCapsules in Rust. |
| 118 | + |
| 119 | +Two lessons we leverage from the Arrow project in DataFusion Python are: |
| 120 | + |
| 121 | +- We reuse the existing Arrow FFI functionality wherever possible. |
| 122 | +- We expose PyCapsules that contain an FFI-stable struct. |
| 123 | + |
| 124 | +## Implementation Details |
| 125 | + |
| 126 | +The bulk of the code necessary to perform our FFI operations is in the upstream |
| 127 | +[DataFusion](https://datafusion.apache.org/) core repository. You can review the code and |
| 128 | +documentation in the [datafusion-ffi] crate. |
| 129 | + |
| 130 | +Our FFI implementation is narrowly focused on sharing data and functions with Rust-backed |
| 131 | +libraries. This allows us to use the [abi_stable crate](https://crates.io/crates/abi_stable). |
| 132 | +This is an excellent crate that allows for easy conversion between Rust native types |
| 133 | +and FFI-safe alternatives. For example, if you need to pass a `Vec<String>` via FFI, |
| 134 | +you can simply convert it to an `RVec<RString>` in an intuitive manner. It also supports |
| 135 | +types like `RResult` and `ROption` that do not have an obvious translation to a |
| 136 | +C equivalent. |
| 137 | + |
| 138 | +The [datafusion-ffi] crate has been designed to make it easy to convert from DataFusion |
| 139 | +traits into their FFI counterparts. For example, if you have defined a custom |
| 140 | +[TableProvider](https://docs.rs/datafusion/45.0.0/datafusion/catalog/trait.TableProvider.html) |
| 141 | +and you want to create a sharable FFI counterpart, you could write: |
| 142 | + |
| 143 | +```rust |
| 144 | +let my_provider = MyTableProvider::default(); |
| 145 | +let ffi_provider = FFI_TableProvider::new(Arc::new(my_provider), false, None); |
| 146 | +``` |
| 147 | + |
| 148 | +(ffi_pyclass_mutability)= |
| 149 | + |
| 150 | +## PyO3 class mutability guidelines |
| 151 | + |
| 152 | +PyO3 bindings should present immutable wrappers whenever a struct stores shared or |
| 153 | +interior-mutable state. In practice this means that any `#[pyclass]` containing an |
| 154 | +`Arc<RwLock<_>>` or similar synchronized primitive must opt into `#[pyclass(frozen)]` |
| 155 | +unless there is a compelling reason not to. |
| 156 | + |
| 157 | +The execution context illustrates the preferred pattern. `PySessionContext` in |
| 158 | +{file}`src/context.rs` stays frozen even though it shares mutable state internally via |
| 159 | +`SessionContext`. This ensures PyO3 tracks borrows correctly while Python-facing APIs |
| 160 | +clone the inner `SessionContext` or return new wrappers instead of mutating the |
| 161 | +existing instance in place: |
| 162 | + |
| 163 | +```rust |
| 164 | +#[pyclass(from_py_object, frozen, name = "SessionContext", module = "datafusion", subclass)] |
| 165 | +#[derive(Clone)] |
| 166 | +pub struct PySessionContext { |
| 167 | + pub ctx: SessionContext, |
| 168 | +} |
| 169 | +``` |
| 170 | + |
| 171 | +Occasionally a type must remain mutable, for example when PyO3 attribute setters need to |
| 172 | +update fields directly. In these rare cases, add an inline justification so reviewers and |
| 173 | +future contributors understand why `frozen` is unsafe to enable. `DataTypeMap` in |
| 174 | +{file}`src/common/data_type.rs` includes such a comment because PyO3 still needs to track |
| 175 | +field updates: |
| 176 | + |
| 177 | +```rust |
| 178 | +// TODO: This looks like this needs pyo3 tracking so leaving unfrozen for now |
| 179 | +#[derive(Debug, Clone)] |
| 180 | +#[pyclass(from_py_object, name = "DataTypeMap", module = "datafusion.common", subclass)] |
| 181 | +pub struct DataTypeMap { |
| 182 | + #[pyo3(get, set)] |
| 183 | + pub arrow_type: PyDataType, |
| 184 | + #[pyo3(get, set)] |
| 185 | + pub python_type: PythonType, |
| 186 | + #[pyo3(get, set)] |
| 187 | + pub sql_type: SqlType, |
| 188 | +} |
| 189 | +``` |
| 190 | + |
| 191 | +When reviewers encounter a mutable `#[pyclass]` without a comment, they should request |
| 192 | +an explanation or ask that `frozen` be added. Keeping these wrappers frozen by default |
| 193 | +helps avoid subtle bugs stemming from PyO3's interior mutability tracking. |
| 194 | + |
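The idea behind the frozen pattern can be sketched with the standard library alone. The example below is an illustration under stated assumptions: `SharedCatalog` is a hypothetical type, not part of `datafusion-python`, and the PyO3 attributes are omitted so the snippet has no external dependencies. In a real binding the struct would carry `#[pyclass(frozen)]`. Every method takes `&self`, and all mutation goes through the lock.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical wrapper around shared, interior-mutable state.
/// Cloning it copies the `Arc`, so every clone sees the same data.
#[derive(Clone, Default)]
pub struct SharedCatalog {
    tables: Arc<RwLock<Vec<String>>>,
}

impl SharedCatalog {
    /// Takes `&self` rather than `&mut self`: the `RwLock` provides the
    /// synchronization, so the wrapper itself never needs a mutable borrow.
    pub fn register_table(&self, name: &str) {
        self.tables
            .write()
            .expect("catalog lock poisoned")
            .push(name.to_owned());
    }

    /// Returns a snapshot of the registered table names.
    pub fn table_names(&self) -> Vec<String> {
        self.tables.read().expect("catalog lock poisoned").clone()
    }
}
```

Because no method needs `&mut self`, a wrapper like this has no reason to opt out of `frozen`.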
| 195 | +If you are interfacing with a library that provides an `FFI_TableProvider` like the one |
| 196 | +created in the Implementation Details section and you need to turn it back into a |
| 197 | +`TableProvider`, you can convert it into a `ForeignTableProvider`, which implements the `TableProvider` trait. |
| 198 | + |
| 199 | +```rust |
| 200 | +let foreign_provider: ForeignTableProvider = ffi_provider.into(); |
| 201 | +``` |
| 202 | + |
| 203 | +If you review the code in [datafusion-ffi] you will find that each of the traits we share |
| 204 | +across the boundary has two portions, one with a `FFI_` prefix and one with a `Foreign` |
| 205 | +prefix. This is used to distinguish which side of the FFI boundary that struct is |
| 206 | +designed to be used on. The structures with the `FFI_` prefix are to be used on the |
| 207 | +**provider** of the structure. In the example we're showing, this means the code that has |
| 208 | +written the underlying `TableProvider` implementation to access your custom data source. |
| 209 | +The structures with the `Foreign` prefix are to be used by the receiver. In this case, |
| 210 | +it is the `datafusion-python` library. |
| 211 | + |
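The two-sided pattern can be sketched with the standard library alone. The code below is a simplified illustration, not the actual definitions from [datafusion-ffi]: `RowSource`, `FFI_RowSource`, `ForeignRowSource`, and `InMemoryRows` are hypothetical names, and the real crate shares far richer functionality. It assumes the provider side owns the data and hands out a `#[repr(C)]` struct of function pointers, while the receiver side wraps that struct and implements the original trait again.

```rust
use std::ffi::c_void;

/// Hypothetical trait standing in for something like `TableProvider`.
pub trait RowSource {
    fn row_count(&self) -> u64;
}

/// Example implementation that lives in the provider's library.
pub struct InMemoryRows {
    rows: u64,
}

impl RowSource for InMemoryRows {
    fn row_count(&self) -> u64 {
        self.rows
    }
}

/// Provider side: holds only a raw pointer and `extern "C"` function
/// pointers, so its layout does not depend on the compiler version.
#[repr(C)]
#[allow(non_camel_case_types)]
pub struct FFI_RowSource {
    row_count: unsafe extern "C" fn(source: &FFI_RowSource) -> u64,
    release: unsafe extern "C" fn(source: &mut FFI_RowSource),
    private_data: *mut c_void,
}

unsafe extern "C" fn row_count_wrapper(source: &FFI_RowSource) -> u64 {
    // SAFETY: `private_data` was created from a `Box<Box<dyn RowSource>>` in `new`.
    let inner = unsafe { &*(source.private_data as *const Box<dyn RowSource>) };
    inner.row_count()
}

unsafe extern "C" fn release_wrapper(source: &mut FFI_RowSource) {
    if !source.private_data.is_null() {
        // SAFETY: reclaims the allocation made in `new`; the pointer is
        // nulled afterwards so the data is freed at most once.
        drop(unsafe { Box::from_raw(source.private_data as *mut Box<dyn RowSource>) });
        source.private_data = std::ptr::null_mut();
    }
}

impl FFI_RowSource {
    pub fn new(inner: Box<dyn RowSource>) -> Self {
        Self {
            row_count: row_count_wrapper,
            release: release_wrapper,
            private_data: Box::into_raw(Box::new(inner)) as *mut c_void,
        }
    }
}

impl Drop for FFI_RowSource {
    fn drop(&mut self) {
        // The provider's own `release` frees the data, whichever side drops it.
        let release = self.release;
        unsafe { release(self) }
    }
}

/// Receiver side: wraps the FFI struct and implements the trait again.
pub struct ForeignRowSource(FFI_RowSource);

impl From<FFI_RowSource> for ForeignRowSource {
    fn from(ffi: FFI_RowSource) -> Self {
        Self(ffi)
    }
}

impl RowSource for ForeignRowSource {
    fn row_count(&self) -> u64 {
        // SAFETY: calls back into the provider through its function pointer.
        unsafe { (self.0.row_count)(&self.0) }
    }
}
```

The receiver never touches `private_data` directly. It only calls the function pointers supplied by the provider, so the two libraries need to agree on the `#[repr(C)]` struct and nothing else.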
| 212 | +In order to share these FFI structures, we need to wrap them in some kind of Python object |
| 213 | +that can be passed from one package to another. As described in the section above |
| 214 | +on our inspiration from Arrow, we use `PyCapsule`. We can create a `PyCapsule` |
| 215 | +for our provider as follows: |
| 216 | + |
| 217 | +```rust |
| 218 | +let name = CString::new("datafusion_table_provider")?; |
| 219 | +let my_capsule = PyCapsule::new_bound(py, ffi_provider, Some(name))?; |
| 220 | +``` |
| 221 | + |
| 222 | +On the receiving side, turn this `PyCapsule` object back into an `FFI_TableProvider`, which |
| 223 | +can then be converted into a `ForeignTableProvider`. The associated code is: |
| 224 | + |
| 225 | +```rust |
| 226 | +let capsule = capsule.cast::<PyCapsule>()?; |
| 227 | +let data: NonNull<FFI_TableProvider> = capsule |
| 228 | + .pointer_checked(Some(name))? |
| 229 | + .cast(); |
| 230 | +let ffi_provider = unsafe { data.as_ref() }; |
| 231 | +``` |
| 232 | + |
| 233 | +By convention, the `datafusion-python` library expects a Python object that provides a |
| 234 | +`TableProvider` PyCapsule to make this capsule accessible through a method named |
| 235 | +`__datafusion_table_provider__`. You can see a complete working example of how to |
| 236 | +share a `TableProvider` from one Python library to DataFusion Python in the |
| 237 | +[repository examples folder](https://github.com/apache/datafusion-python/tree/main/examples/datafusion-ffi-example). |
| 238 | + |
| 239 | +This section has been written using `TableProvider` as an example. It is the first |
| 240 | +extension that has been written using this approach and the most thoroughly implemented. |
| 241 | +As we continue to expose more of the DataFusion features, we intend to follow this same |
| 242 | +design pattern. |
| 243 | + |
| 244 | +## Alternative Approach |
| 245 | + |
| 246 | +Suppose you need to expose some other feature of DataFusion and you cannot wait |
| 247 | +for the upstream repository to implement the FFI approach we describe. In this case |
| 248 | +you might decide to depend directly on the `datafusion-python` crate instead. |
| 249 | + |
| 250 | +As we discussed, this is not guaranteed to work across different compiler versions and |
| 251 | +optimization levels. If you wish to go down this route, there are two approaches we |
| 252 | +have identified you can use. |
| 253 | + |
| 254 | +1. Re-export all of `datafusion-python` yourself with your extensions built in. |
| 255 | +2. Carefully synchronize your software releases with the `datafusion-python` CI build |
| 256 | + system so that your libraries use the exact same compiler, features, and |
| 257 | + optimization level. |
| 258 | + |
| 259 | +We currently do not recommend either of these approaches as they are difficult to |
| 260 | +maintain over a long period. Additionally, they require a tight version coupling |
| 261 | +between libraries. |
| 262 | + |
| 263 | +## Status of Work |
| 264 | + |
| 265 | +At the time of this writing, the FFI features are under active development. To see |
| 266 | +the latest status, we recommend reviewing the code in the [datafusion-ffi] crate. |
| 267 | + |
| 268 | +[datafusion-ffi]: https://crates.io/crates/datafusion-ffi |