Summary
Implement the pandas ExtensionArray hook _reduce for:
- ArkoudaStringArray
- ArkoudaCategoricalArray

The goal is to align with pandas behavior and avoid fallbacks that
materialize data as NumPy/object arrays.
pandas uses _reduce to implement reductions like min, max, any,
all, and sometimes sum/prod depending on dtype. Without a correct
_reduce, pandas may:
- raise TypeError/NotImplementedError
- fall back to object arrays
- compute reductions client-side
- lose dtype semantics (especially around missing values)
Background / Why
pandas reduction operations on Series/Index often route through the
ExtensionArray API:
- Series.min/max/any/all
- Index.min/max
- internal reductions used by groupby/joins/algorithms
For Arkouda-backed arrays, we want:
- correct pandas-compatible semantics
- correct missing-value handling (skipna)
- server-side execution where possible
- predictable error behavior for unsupported reductions
pandas _reduce Contract (high-level)
Signature (pandas-private, may vary by version):
def _reduce(self, name: str, *, skipna: bool = True, keepdims: bool = False, **kwargs):
    ...
Behavior:
- name selects the reduction (e.g., "min", "max", "any", "all", "sum", "prod")
- returns a scalar (or a length-1 1D array if keepdims=True)
- skipna controls missing-value handling
- should raise for unsupported reductions/dtypes, consistently with pandas
This ticket should follow the contract pandas expects for the version(s)
Arkouda supports.
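As a rough illustration of the contract above, here is a minimal sketch of name-based dispatch with a keepdims wrapper. ToyReducible is a hypothetical list-backed stand-in, not Arkouda or pandas code; the error message wording and the use of a plain list for the keepdims result are assumptions made for the example. Missing-value handling is deliberately left out here.

```python
from typing import Any, Callable, Dict, List


class ToyReducible:
    """Hypothetical list-backed container standing in for an ExtensionArray."""

    # Reductions this toy dtype supports; any other name raises TypeError.
    _REDUCTIONS: Dict[str, Callable[[List[Any]], Any]] = {"min": min, "max": max}

    def __init__(self, values: List[Any]) -> None:
        self._values = list(values)

    def _reduce(self, name: str, *, skipna: bool = True,
                keepdims: bool = False, **kwargs: Any) -> Any:
        func = self._REDUCTIONS.get(name)
        if func is None:
            # Unsupported reduction: fail loudly instead of falling back.
            raise TypeError(
                f"'{type(self).__name__}' does not support reduction '{name}'"
            )
        # skipna is accepted but ignored in this sketch (no missing values).
        result = func(self._values)
        # A real implementation would return a length-1 array of the same
        # dtype; a list is used here purely as a stand-in.
        return [result] if keepdims else result
```

For example, ToyReducible(["pear", "apple", "fig"])._reduce("min") picks the lexicographically smallest string, while _reduce("sum") raises TypeError.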
Expected Semantics
Strings
Typical pandas expectations for string reductions:
- min / max: lexicographic min/max over non-missing values
- any / all: typically not meaningful for strings; pandas may raise TypeError (confirm baseline behavior and match)
- sum / prod: not supported; should raise TypeError/NotImplementedError
Missing value handling:
- skipna=True:
  - ignore missing values
  - if all values missing → result is missing (often pd.NA)
- skipna=False:
  - if any missing present → result is missing
Edge cases:
- empty array: match pandas (often raises or returns missing, depending on the op)
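The skipna rules above can be sketched in plain Python. This is a toy model only: reduce_strings is a hypothetical helper, None stands in for the missing value (pd.NA), and returning missing for an empty input is an assumption that still needs to be checked against the pandas baseline.

```python
from typing import List, Optional


def reduce_strings(values: List[Optional[str]], name: str,
                   skipna: bool = True) -> Optional[str]:
    """Toy min/max over strings; None stands in for pd.NA."""
    if name not in ("min", "max"):
        # sum/prod/any/all are treated as unsupported in this sketch.
        raise TypeError(f"Cannot perform reduction '{name}' on strings")

    present = [v for v in values if v is not None]
    has_missing = len(present) < len(values)

    if has_missing and not skipna:
        return None  # skipna=False: any missing value propagates
    if not present:
        return None  # all missing (or empty, by assumption): result is missing

    # Python compares str values lexicographically by code point.
    return min(present) if name == "min" else max(present)
```

With values ["b", None, "a"], the default skipna=True gives "a" for min, while skipna=False gives the missing marker.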
Categoricals
Categorical reductions in pandas are constrained:
- min / max: supported if categories are ordered; behavior for unordered categoricals is to be confirmed (pandas is expected to raise TypeError). Confirm the pandas baseline and match exactly
- any / all: likely unsupported (confirm and match)
- sum / prod: unsupported
Missing value handling:
- same skipna behavior as above (ignore vs propagate)
Metadata:
- If a reduction returns a category value, return the scalar category label (not the code), consistent with pandas.
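To make the label-versus-code point concrete, the following is a small sketch of an ordered categorical min/max computed on integer codes and mapped back to a label. reduce_categorical is a hypothetical helper; it assumes the pandas convention that code -1 marks a missing value, uses None in place of pd.NA, and assumes unordered input raises TypeError (to be confirmed against the baseline).

```python
from typing import List, Optional


def reduce_categorical(codes: List[int], categories: List[str], ordered: bool,
                       name: str, skipna: bool = True) -> Optional[str]:
    """Toy ordered-categorical min/max; code -1 means missing."""
    if name not in ("min", "max"):
        raise TypeError(f"Cannot perform reduction '{name}' on a categorical")
    if not ordered:
        # Assumption: unordered categoricals have no meaningful min/max.
        raise TypeError(f"Categorical is not ordered for operation '{name}'")

    present = [c for c in codes if c != -1]
    has_missing = len(present) < len(codes)

    if (has_missing and not skipna) or not present:
        return None  # missing propagates, or nothing left to reduce

    # Reduce over codes (category order), then map back to the label.
    code = min(present) if name == "min" else max(present)
    return categories[code]
```

With categories ["low", "mid", "high"] and codes [2, 0, -1, 1], max returns "high" because it follows category order; a lexicographic comparison of the labels would have picked "mid" instead.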
Scope
In Scope
- Implement _reduce for both arrays with a signature compatible with pandas usage
- Support at minimum:
  - Strings: min, max
  - Categoricals: min, max (with ordered semantics matching pandas)
- Correctly implement skipna
- Implement keepdims behavior if pandas calls it (return length-1 array or scalar)
- Add unit tests comparing Arkouda dtype reductions to pandas baselines
Out of Scope
- Full support for every reduction name if pandas doesn't require it
for these dtypes
- Groupby reductions (pandas orchestrates, but may call _reduce on chunks)
- Performance tuning beyond avoiding obvious fallbacks