The Arrow IPC stream in question contains a nested Struct array named regions. Some of its fields, e.g. regional_key, are explicitly typed as uint64 in the source API schema (described at the bottom of https://md.umwelt.info/swagger-ui/ under Dataset/regions/RegionalKey).
Across different record batches in the stream, the R arrow package inconsistently casts this nested uint64 field:
In chunks with smaller values, it is converted into an R integer.
In chunks with larger values, it is converted into an R double to preserve precision.
Because the type is not unified across chunks for nested fields, the resulting R data frame ends up with conflicting types from row to row within the list-column. This causes standard tidyverse tools such as tidyr::unnest_wider() to fail with a loss-of-precision conversion error.
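The mismatch can be made visible by tabulating the R storage type of the nested field for each row. The snippet below is only a minimal sketch on a toy stand-in for the converted list-column: the shape (one data frame per row, each with a regional_key column) is an assumption, and field_types is a hypothetical name.

```r
# Toy stand-in for the converted 'regions' list-column (assumed shape:
# one data frame per row, each with a 'regional_key' column).
regions <- list(
  data.frame(regional_key = 1L),    # small value, stored as integer
  data.frame(regional_key = 2^40)   # large value, stored as double
)

# Record the storage type of the nested field for every row.
field_types <- vapply(
  regions,
  function(el) typeof(el[["regional_key"]]),
  character(1)
)

# More than one distinct type means the column is internally inconsistent.
print(table(field_types))
```

On the real data, substituting df$regions from the reproducible example for the toy list should show the same mix of integer and double, provided the elements have the assumed shape.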
In contrast, Python's pyarrow handling of the same IPC stream correctly unifies the entire nested regional_key field into a consistent float type across all records.
According to https://arrow.apache.org/docs/r/articles/data_types.html there is the option to use
options(arrow.int64_downcast = FALSE)
but this seems to work only at the top level and not for nested structures.
Minimal Reproducible Example
library(arrow)
library(httr2)
library(tidyr)
# 1. Fetch the exact IPC stream that contains the heterogeneous nested structures
req <- httr2::request("https://md.umwelt.info/search/all?format=arrow_ipc") |>
  httr2::req_url_query(
    query = "type:'/Daten und Messstellen/Wasser/Flüsse' AND measuring_station:true",
    language = "de"
  )
resp <- httr2::req_perform(req)
raw_bytes <- httr2::resp_body_raw(resp)
# 2. Parse the stream into an R data frame via Arrow
df <- as.data.frame(arrow::read_ipc_stream(raw_bytes))
# 3. Attempt to unnest the 'regions' struct column
# This fails because 'regional_key' inside 'regions' alternates between integer and double across rows
df |> unnest_wider(any_of(c("regions")), names_sep = "_")
# Throws:
# Error in `unnest_wider()`:
# ! Can't convert from `..1` <double> to <integer> due to loss of precision.
# • Locations: 2
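Until the conversion is consistent, one possible workaround is to coerce the nested field to double on the R side before unnesting. The following is a hedged sketch in base R, not an arrow or tidyr API: coerce_field_to_double is a hypothetical helper, and it assumes each element of the list-column is a data frame (or list) with a regional_key component. Note that uint64 values above 2^53 cannot be represented exactly as double.

```r
# Hypothetical helper (not part of arrow or tidyr): coerce one named field
# of every element in a list-column to double.
coerce_field_to_double <- function(col, field) {
  lapply(unclass(col), function(el) {
    # Leave NULL elements and elements without the field untouched.
    if (!is.null(el) && field %in% names(el)) {
      el[[field]] <- as.double(el[[field]])
    }
    el
  })
}

# Toy stand-in for the converted 'regions' list-column (assumed shape).
regions_mixed <- list(
  data.frame(regional_key = 1L),    # integer
  data.frame(regional_key = 2^40)   # double
)

# After coercion every row stores 'regional_key' as double.
regions_fixed <- coerce_field_to_double(regions_mixed, "regional_key")
```

With the data from the example above, this would be applied as df$regions <- coerce_field_to_double(df$regions, "regional_key") before calling unnest_wider(). This has not been verified against the live endpoint, and the helper returns a plain list, so any class attributes on the original list-column are dropped.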
Expected behavior
If a uint64 field inside the regions struct requires upcasting to an R double in any chunk, the entire nested column should be converted to double uniformly across all record batches, so that the field has a single consistent type after conversion to a data frame.
Environment
OS: Ubuntu 26.04 LTS
R Version: 4.5.2
arrow R Package Version: 24.0.0
Component(s)
R