Commit a400ec1

timsaucer and claude committed
docs: convert restructuredText sources to MyST markdown
Phase 2 of the documentation-site refresh. Run `rst2myst convert` over every human-authored .rst file under docs/source/ and remove the originals.

The result:

- 33 .rst files become 33 .md files (user guide, contributor guide, index, links).
- Headings, paragraphs, hyperlinks, code blocks, admonitions, and toctree directives all map cleanly to MyST syntax.
- Cross-reference anchors round-trip through MyST as `(label)=` blocks. The converter kebab-cased the labels (e.g. `(io-csv)=`), but every `{ref}` target in the corpus still uses the underscore form from the original RST (`{ref}\`CSV <io_csv>\``) and so do the Python docstrings that AutoAPI pulls in. Rewrite the anchors back to the underscore form so the existing references resolve.
- 86 `{eval-rst}` blocks remain; they all wrap `.. ipython::` directives, which have no first-class MyST equivalent. They render identically and don't block the build.

conf.py changes:

- Enable `colon_fence` and `deflist` MyST extensions (rst-to-myst emits these on a few files, particularly execution-metrics.md).
- Keep `.rst` in `source_suffix` even though no human-authored RST remains: sphinx-autoapi generates RST under autoapi/ at build time and Sphinx needs the suffix registered to parse it.

AGENTS.md: update the two .rst paths called out under "Aggregate and Window Function Documentation" to point at the .md equivalents.

Verified by building locally: `build succeeded`, no warnings, all internal cross-references resolve, the ipython examples on the landing page and basics page still execute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 99e10de commit a400ec1

63 files changed

Lines changed: 4256 additions & 3888 deletions

AGENTS.md

Lines changed: 2 additions & 2 deletions
@@ -84,9 +84,9 @@ Every Python function must include a docstring with usage examples.
8484
When adding or updating an aggregate or window function, ensure the corresponding
8585
site documentation is kept in sync:
8686

87-
- **Aggregations**: `docs/source/user-guide/common-operations/aggregations.rst`
87+
- **Aggregations**: `docs/source/user-guide/common-operations/aggregations.md`
8888
add new aggregate functions to the "Aggregate Functions" list and include usage
8989
examples if appropriate.
90-
- **Window functions**: `docs/source/user-guide/common-operations/windows.rst`
90+
- **Window functions**: `docs/source/user-guide/common-operations/windows.md`
9191
add new window functions to the "Available Functions" list and include usage
9292
examples if appropriate.

docs/source/conf.py

Lines changed: 10 additions & 2 deletions
@@ -53,6 +53,10 @@
5353
"autoapi.extension",
5454
]
5555

56+
# NOTE: .rst stays alongside .md because sphinx-autoapi generates RST
57+
# under autoapi/ and Sphinx needs the suffix to parse it. The human-
58+
# authored docs are all MyST .md now; the .rst entry is only for the
59+
# autoapi build artifacts.
5660
source_suffix = {
5761
".rst": "restructuredtext",
5862
".md": "markdown",
@@ -171,5 +175,9 @@ def setup(sphinx) -> None:
171175
# tell myst_parser to auto-generate anchor links for headers h1, h2, h3
172176
myst_heading_anchors = 3
173177

174-
# enable nice rendering of checkboxes for the task lists
175-
myst_enable_extensions = ["tasklist"]
178+
# MyST extensions:
179+
# - tasklist: GitHub-style `- [x]` checkboxes
180+
# - colon_fence: `:::{directive}` blocks (needed by execution-metrics.md
181+
# after the RST -> MyST conversion)
182+
# - deflist: definition lists (used in a couple of converted pages)
183+
myst_enable_extensions = ["tasklist", "colon_fence", "deflist"]
Lines changed: 268 additions & 0 deletions
@@ -0,0 +1,268 @@
1+
% Licensed to the Apache Software Foundation (ASF) under one
2+
3+
% or more contributor license agreements. See the NOTICE file
4+
5+
% distributed with this work for additional information
6+
7+
% regarding copyright ownership. The ASF licenses this file
8+
9+
% to you under the Apache License, Version 2.0 (the
10+
11+
% "License"); you may not use this file except in compliance
12+
13+
% with the License. You may obtain a copy of the License at
14+
15+
% http://www.apache.org/licenses/LICENSE-2.0
16+
17+
% Unless required by applicable law or agreed to in writing,
18+
19+
% software distributed under the License is distributed on an
20+
21+
% "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
22+
23+
% KIND, either express or implied. See the License for the
24+
25+
% specific language governing permissions and limitations
26+
27+
% under the License.
28+
29+
(ffi)=
30+
31+
# Python Extensions
32+
33+
The DataFusion in Python project is designed to allow users to extend its functionality in a few core
34+
areas. Ideally many users would like to package their extensions as a Python package and easily
35+
integrate that package with this project. This page serves to describe some of the challenges we face
36+
when doing these integrations and the approach our project uses.
37+
38+
## The Primary Issue
39+
40+
Suppose you wish to use DataFusion and you have a custom data source that can produce tables that
41+
can then be queried against, similar to how you can register a {ref}`CSV <io_csv>` or
42+
{ref}`Parquet <io_parquet>` file. In DataFusion terminology, you likely want to implement a
43+
{ref}`Custom Table Provider <io_custom_table_provider>`. In an effort to make your data source
44+
as performant as possible and to utilize the features of DataFusion, you may decide to write
45+
your source in Rust and then expose it through [PyO3](https://pyo3.rs) as a Python library.
46+
47+
At first glance, it may appear the best way to do this is to add the `datafusion-python`
48+
crate as a dependency, provide a `PyTable`, and then to register it with the
49+
`SessionContext`. Unfortunately, this will not work.
50+
51+
When you produce your code as a Python library and it needs to interact with the DataFusion
52+
library, at the lowest level they communicate through an Application Binary Interface (ABI).
53+
The acronym sounds similar to API (Application Programming Interface), but it is distinctly
54+
different.
55+
56+
The ABI sets the standard for how these libraries can share data and functions between each
57+
other. One of the key differences between Rust and other programming languages is that Rust
58+
does not have a stable ABI. What this means in practice is that if you compile a Rust library
59+
with one version of the `rustc` compiler and I compile another library to interface with it
60+
but I use a different version of the compiler, there is no guarantee the interface will be
61+
the same.
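
One facet of this can be sketched with the standard library alone. The example below is only an illustration under stated assumptions: the struct names `RustLayout` and `CLayout` and the helper `layout_sizes` are hypothetical and do not come from DataFusion. It contrasts the default Rust representation, whose field order and padding are left to the compiler, with `#[repr(C)]`, whose layout follows the platform's C rules.

```rust
use std::mem::size_of;

/// Hypothetical struct using the default Rust representation. The compiler
/// is free to reorder these fields, so its layout is not something another
/// separately compiled library can rely on.
#[allow(dead_code)]
pub struct RustLayout {
    a: u8,
    b: u32,
    c: u8,
}

/// Hypothetical struct with the same fields, but laid out in declaration
/// order using C padding rules. This layout is part of a stable contract.
#[repr(C)]
#[allow(dead_code)]
pub struct CLayout {
    a: u8,
    b: u32,
    c: u8,
}

/// Returns (size of `RustLayout`, size of `CLayout`) in bytes.
/// On mainstream targets where `u32` is 4-byte aligned, the `repr(C)` size
/// is 12: one byte, three bytes of padding, four bytes, one byte, three
/// bytes of trailing padding. The first value is whatever the compiler chose.
pub fn layout_sizes() -> (usize, usize) {
    (size_of::<RustLayout>(), size_of::<CLayout>())
}
```

Data layout is only part of the story; calling conventions and the layout of standard library types are similarly unspecified, which is why matching crate versions alone is not enough.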
62+
63+
In practice, this means that a Python library built with `datafusion-python` as a Rust
64+
dependency will generally **not** be compatible with the DataFusion Python package, even
65+
if they reference the same version of `datafusion-python`. If you attempt to do this, it may
66+
work on your local computer if you have built both packages with the same optimizations.
67+
This can sometimes lead to a false expectation that the code will work, but it frequently
68+
breaks the moment you try to use your package against the released packages.
69+
70+
You can find more information about the Rust ABI in their
71+
[online documentation](https://doc.rust-lang.org/reference/abi.html).
72+
73+
## The FFI Approach
74+
75+
Rust supports interacting with other programming languages through its Foreign Function
76+
Interface (FFI). The advantage of using the FFI is that it enables you to write data structures
77+
and functions that have a stable ABI. This allows you to use Rust code with C, Python, and
78+
other languages. In fact, the [PyO3](https://pyo3.rs) library uses the FFI to share data
79+
and functions between Python and Rust.
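
As a minimal sketch of what a stable calling convention looks like in practice, the following uses only the standard library. The names `BinaryOp`, `ffi_add`, and `apply` are hypothetical and chosen for illustration; they are not part of PyO3 or DataFusion.

```rust
/// Hypothetical callback type. `extern "C"` selects the platform's C calling
/// convention, which stays fixed across `rustc` versions.
pub type BinaryOp = extern "C" fn(i64, i64) -> i64;

/// Hypothetical function that can be handed across a library boundary
/// as a `BinaryOp` function pointer.
pub extern "C" fn ffi_add(a: i64, b: i64) -> i64 {
    a.wrapping_add(b)
}

/// Stand-in for the receiving library: it only knows the function pointer
/// type, not how or by which compiler the callee was built.
pub fn apply(op: BinaryOp, a: i64, b: i64) -> i64 {
    op(a, b)
}
```

Calling `apply(ffi_add, 2, 3)` goes through the C calling convention even though both sides happen to be Rust here.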
80+
81+
The approach we are taking in the DataFusion in Python project is to incrementally expose
82+
more portions of the DataFusion project via FFI interfaces. This allows users to write Rust
83+
code that does **not** require the `datafusion-python` crate as a dependency, expose their
84+
code in Python via PyO3, and have it interact with the DataFusion Python package.
85+
86+
Early adopters of this approach include [delta-rs](https://delta-io.github.io/delta-rs/)
87+
which has adapted its Table Provider for use in `datafusion-python` with only a few lines
88+
of code. Also, the DataFusion Python project uses the existing definitions from
89+
[Apache Arrow CStream Interface](https://arrow.apache.org/docs/format/CStreamInterface.html)
90+
to support importing **and** exporting tables. Any Python package that supports reading
91+
the Arrow C Stream interface can work with DataFusion Python out of the box! You can read
92+
more about working with Arrow sources in the {ref}`Data Sources <user_guide_data_sources>`
93+
page.
94+
95+
To learn more about the Foreign Function Interface in Rust, the
96+
[Rustonomicon](https://doc.rust-lang.org/nomicon/ffi.html) is a good resource.
97+
98+
## Inspiration from Arrow
99+
100+
DataFusion is built upon [Apache Arrow](https://arrow.apache.org/). The canonical Python
101+
Arrow implementation, [pyarrow](https://arrow.apache.org/docs/python/index.html), provides
102+
an excellent way to share Arrow data between Python projects without performing any copy
103+
operations on the data. They do this by using a well defined set of interfaces. You can
104+
find the details about their stream interface
105+
[here](https://arrow.apache.org/docs/format/CStreamInterface.html). The
106+
[Rust Arrow Implementation](https://github.com/apache/arrow-rs) also supports these
107+
`C` style definitions via the Foreign Function Interface.
108+
109+
In addition to using these interfaces to transfer Arrow data between libraries, `pyarrow`
110+
goes one step further to make sharing the interfaces easier in Python. They do this
111+
by exposing PyCapsules that contain the expected functionality.
112+
113+
You can learn more about PyCapsules from the official
114+
[Python online documentation](https://docs.python.org/3/c-api/capsule.html). PyCapsules
115+
have excellent support in PyO3 already. The
116+
[PyO3 online documentation](https://pyo3.rs/main/doc/pyo3/types/struct.pycapsule) is a good source
117+
for more details on using PyCapsules in Rust.
118+
119+
Two lessons we leverage from the Arrow project in DataFusion Python are:
120+
121+
- We reuse the existing Arrow FFI functionality wherever possible.
122+
- We expose PyCapsules that contain an FFI-stable struct.
123+
124+
## Implementation Details
125+
126+
The bulk of the code necessary to perform our FFI operations is in the upstream
127+
[DataFusion](https://datafusion.apache.org/) core repository. You can review the code and
128+
documentation in the [datafusion-ffi] crate.
129+
130+
Our FFI implementation is narrowly focused on sharing data and functions with Rust-backed
131+
libraries. This allows us to use the [abi_stable crate](https://crates.io/crates/abi_stable).
132+
This is an excellent crate that allows for easy conversion between Rust native types
133+
and FFI-safe alternatives. For example, if you needed to pass a `Vec<String>` via FFI,
134+
you can simply convert it to an `RVec<RString>` in an intuitive manner. It also supports
135+
features like `RResult` and `ROption` that do not have an obvious translation to a
136+
C equivalent.
137+
138+
The [datafusion-ffi] crate has been designed to make it easy to convert from DataFusion
139+
traits into their FFI counterparts. For example, if you have defined a custom
140+
[TableProvider](https://docs.rs/datafusion/45.0.0/datafusion/catalog/trait.TableProvider.html)
141+
and you want to create a sharable FFI counterpart, you could write:
142+
143+
```rust
144+
let my_provider = MyTableProvider::default();
145+
let ffi_provider = FFI_TableProvider::new(Arc::new(my_provider), false, None);
146+
```
147+
148+
(ffi_pyclass_mutability)=
149+
150+
## PyO3 class mutability guidelines
151+
152+
PyO3 bindings should present immutable wrappers whenever a struct stores shared or
153+
interior-mutable state. In practice this means that any `#[pyclass]` containing an
154+
`Arc<RwLock<_>>` or similar synchronized primitive must opt into `#[pyclass(frozen)]`
155+
unless there is a compelling reason not to.
156+
157+
The execution context illustrates the preferred pattern. `PySessionContext` in
158+
{file}`src/context.rs` stays frozen even though it shares mutable state internally via
159+
`SessionContext`. This ensures PyO3 tracks borrows correctly while Python-facing APIs
160+
clone the inner `SessionContext` or return new wrappers instead of mutating the
161+
existing instance in place:
162+
163+
```rust
164+
#[pyclass(from_py_object, frozen, name = "SessionContext", module = "datafusion", subclass)]
165+
#[derive(Clone)]
166+
pub struct PySessionContext {
167+
pub ctx: SessionContext,
168+
}
169+
```
170+
171+
Occasionally a type must remain mutable—for example when PyO3 attribute setters need to
172+
update fields directly. In these rare cases add an inline justification so reviewers and
173+
future contributors understand why `frozen` is unsafe to enable. `DataTypeMap` in
174+
{file}`src/common/data_type.rs` includes such a comment because PyO3 still needs to track
175+
field updates:
176+
177+
```rust
178+
// TODO: This looks like this needs pyo3 tracking so leaving unfrozen for now
179+
#[derive(Debug, Clone)]
180+
#[pyclass(from_py_object, name = "DataTypeMap", module = "datafusion.common", subclass)]
181+
pub struct DataTypeMap {
182+
#[pyo3(get, set)]
183+
pub arrow_type: PyDataType,
184+
#[pyo3(get, set)]
185+
pub python_type: PythonType,
186+
#[pyo3(get, set)]
187+
pub sql_type: SqlType,
188+
}
189+
```
190+
191+
When reviewers encounter a mutable `#[pyclass]` without a comment, they should request
192+
an explanation or ask that `frozen` be added. Keeping these wrappers frozen by default
193+
helps avoid subtle bugs stemming from PyO3's interior mutability tracking.
194+
195+
If you were interfacing with a library that provided the above `FFI_TableProvider` and
196+
you needed to turn it back into a `TableProvider`, you can convert it into a
197+
`ForeignTableProvider`, which implements the `TableProvider` trait.
198+
199+
```rust
200+
let foreign_provider: ForeignTableProvider = ffi_provider.into();
201+
```
202+
203+
If you review the code in [datafusion-ffi] you will find that each of the traits we share
204+
across the boundary has two portions, one with a `FFI_` prefix and one with a `Foreign`
205+
prefix. This is used to distinguish which side of the FFI boundary that struct is
206+
designed to be used on. The structures with the `FFI_` prefix are to be used on the
207+
**provider** of the structure. In the example we're showing, this means the code that has
208+
written the underlying `TableProvider` implementation to access your custom data source.
209+
The structures with the `Foreign` prefix are to be used by the receiver. In this case,
210+
it is the `datafusion-python` library.
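
The provider/receiver split can be sketched with the standard library alone. This is a simplified illustration, not the actual [datafusion-ffi] code: the trait `RowCounter` stands in for `TableProvider`, and `FfiRowCounter`, `ForeignRowCounter`, and `FixedRows` are hypothetical names.

```rust
use std::ffi::c_void;

/// Stand-in for a DataFusion trait such as `TableProvider` (hypothetical).
pub trait RowCounter {
    fn row_count(&self) -> u64;
}

/// Provider side: an FFI-safe struct holding C-ABI function pointers and an
/// opaque pointer to the real implementation.
#[repr(C)]
pub struct FfiRowCounter {
    row_count: unsafe extern "C" fn(this: &FfiRowCounter) -> u64,
    release: unsafe extern "C" fn(this: &mut FfiRowCounter),
    private_data: *mut c_void,
}

unsafe extern "C" fn row_count_thunk<T: RowCounter>(this: &FfiRowCounter) -> u64 {
    // The provider knows the concrete type behind `private_data`.
    let inner = unsafe { &*(this.private_data as *const T) };
    inner.row_count()
}

unsafe extern "C" fn release_thunk<T: RowCounter>(this: &mut FfiRowCounter) {
    if !this.private_data.is_null() {
        // Reclaim the box in the library that allocated it.
        drop(unsafe { Box::from_raw(this.private_data as *mut T) });
        this.private_data = std::ptr::null_mut();
    }
}

impl FfiRowCounter {
    pub fn new<T: RowCounter + 'static>(inner: T) -> Self {
        Self {
            row_count: row_count_thunk::<T>,
            release: release_thunk::<T>,
            private_data: Box::into_raw(Box::new(inner)) as *mut c_void,
        }
    }
}

impl Drop for FfiRowCounter {
    fn drop(&mut self) {
        let release = self.release;
        unsafe { release(self) }
    }
}

/// Receiver side: wraps the FFI struct and implements the trait again by
/// calling through the function pointers, never touching `private_data`.
pub struct ForeignRowCounter(FfiRowCounter);

impl From<FfiRowCounter> for ForeignRowCounter {
    fn from(ffi: FfiRowCounter) -> Self {
        Self(ffi)
    }
}

impl RowCounter for ForeignRowCounter {
    fn row_count(&self) -> u64 {
        unsafe { (self.0.row_count)(&self.0) }
    }
}

/// Hypothetical provider implementation used to exercise the sketch.
pub struct FixedRows(pub u64);

impl RowCounter for FixedRows {
    fn row_count(&self) -> u64 {
        self.0
    }
}
```

A provider would build `FfiRowCounter::new(FixedRows(42))` and hand it across the boundary; the receiver converts it with `.into()` and then uses it through the `RowCounter` trait as if it were a local implementation.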
211+
212+
In order to share these FFI structures, we need to wrap them in some kind of Python object
213+
that can be used to interface from one package to another. As described in the above
214+
section on our inspiration from Arrow, we use `PyCapsule`. We can create a `PyCapsule`
215+
for our provider thusly:
216+
217+
```rust
218+
let name = CString::new("datafusion_table_provider")?;
219+
let my_capsule = PyCapsule::new_bound(py, provider, Some(name))?;
220+
```
221+
222+
On the receiving side, turn this `PyCapsule` object into the `FFI_TableProvider`, which
223+
can then be turned into a `ForeignTableProvider`. The associated code is:
224+
225+
```rust
226+
let capsule = capsule.cast::<PyCapsule>()?;
227+
let data: NonNull<FFI_TableProvider> = capsule
228+
.pointer_checked(Some(name))?
229+
.cast();
230+
let codec = unsafe { data.as_ref() };
231+
```
232+
233+
By convention the `datafusion-python` library expects a Python object that has a
234+
`TableProvider` PyCapsule to have this capsule accessible by calling a function named
235+
`__datafusion_table_provider__`. You can see a complete working example of how to
236+
share a `TableProvider` from one Python library to DataFusion Python in the
237+
[repository examples folder](https://github.com/apache/datafusion-python/tree/main/examples/datafusion-ffi-example).
238+
239+
This section has been written using `TableProvider` as an example. It is the first
240+
extension that has been written using this approach and the most thoroughly implemented.
241+
As we continue to expose more of the DataFusion features, we intend to follow this same
242+
design pattern.
243+
244+
## Alternative Approach
245+
246+
Suppose you needed to expose some other features of DataFusion and you could not wait
247+
for the upstream repository to implement the FFI approach we describe. In this case
248+
you might decide to depend directly on the `datafusion-python` crate instead.
249+
250+
As we discussed, this is not guaranteed to work across different compiler versions and
251+
optimization levels. If you wish to go down this route, there are two approaches we
252+
have identified you can use.
253+
254+
1. Re-export all of `datafusion-python` yourself with your extensions built in.
255+
2. Carefully synchronize your software releases with the `datafusion-python` CI build
256+
system so that your libraries use the exact same compiler, features, and
257+
optimization level.
258+
259+
We currently do not recommend either of these approaches as they are difficult to
260+
maintain over a long period. Additionally, they require a tight version coupling
261+
between libraries.
262+
263+
## Status of Work
264+
265+
At the time of this writing, the FFI features are under active development. To see
266+
the latest status, we recommend reviewing the code in the [datafusion-ffi] crate.
267+
268+
[datafusion-ffi]: https://crates.io/crates/datafusion-ffi
