API: how to treat numpy string dtype (and aliases) as dtype argument in constructors?

There is one remaining TODO item in the string dtype tracker issue for the actual implementation (https://github.com/pandas-dev/pandas/issues/54792):

> Ensure dtype=str/ astype(str) works as an alias when the future mode is enabled (and any other alias which will currently convert the input to strings, like "U"?)

The Python `str` is already an alias for `"str"` / `pd.StringDtype(na_value=np.nan)`, so the remaining question is: what to do with the other aliases that currently "kind of" give you strings?

Summary of the current behaviour when using numpy's fixed-width unicode string dtype (`np.dtype("U")`, and its aliases like `np.str_`, `"U"`, `"unicode"`, `"str_"`, ...):

- `Series(.., dtype="U")` constructor: converts all data to strings while preserving missing values as-is, but then returns the result as object dtype
  - here, we end up calling `lib.ensure_string_array` with the default of `skipna=True` 
- `DataFrame(.., dtype="U")` constructor seems to be consistent with Series
- `.astype("U")` method: converts all data to strings, _including_ stringifying the missing values, and then returns the result as object dtype
  - here, we end up calling `lib.ensure_string_array` (in `_astype_nansafe`) with `skipna=False`, causing those missing values to be stringified
  - (this is the behaviour we originally also had for `astype(str)`, but that is something we fixed for 3.0; before that, `str` was essentially also an alias of the numpy dtype)
- when passing an actual numpy array of this dtype to the Series constructor, we actually _do_ convert it to our `"str"` dtype and no longer return it as an object-dtype Series
- `pd.array(..., dtype="U")` constructor: converts all data to strings, including stringifying the missing values. This essentially just defers to the `np.array(..)` behaviour for that dtype, and also returns it as a NumpyEA with the numpy "U" dtype
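
To make the `skipna` distinction concrete, here is a minimal pure-Python sketch of the two behaviours. The `stringify` helper is hypothetical and is not the pandas implementation; it only mimics, under that assumption, what `skipna=True` versus `skipna=False` means for missing values in the list above.

```python
import math


def stringify(values, skipna):
    """Hypothetical helper (not pandas code): convert values to str.

    skipna=True leaves None/NaN untouched; skipna=False stringifies them too.
    """

    def is_missing(v):
        # Treat None and float NaN as missing for the purpose of this sketch.
        return v is None or (isinstance(v, float) and math.isnan(v))

    return [v if (skipna and is_missing(v)) else str(v) for v in values]


data = [1, None, float("nan")]

# Roughly what the constructor path does: missing values are preserved as-is.
constructor_like = stringify(data, skipna=True)

# Roughly what the astype path does: missing values become 'None' / 'nan'.
astype_like = stringify(data, skipna=False)
```

The only difference between the two code paths is whether missing values are passed through or stringified; in both cases the current result is returned as object dtype.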

Code examples illustrating the above:

```python
# first, with the actual string dtype 
# -> converts non-strings to string, preserves missing values (coerced to NaN)
>>> pd.Series([1, None, np.nan], dtype=str).values
<ArrowStringArray>
['1', nan, nan]
Length: 3, dtype: str

# using np.dtype("U") (or any of its aliases) 
# -> converts non-strings, preserves missing values as is -> but returns it as object dtype
>>> pd.Series([1, None, np.nan], dtype="U").values
array(['1', None, nan], dtype=object)

# on the other hand, when passing a numpy string _array_, we actually do convert to our string dtype
>>> pd.Series(np.array([1, None, np.nan], dtype=np.str_)).values
<ArrowStringArray>
['1', 'None', 'nan']  # conversion of None/NaN to strings was already done by `np.array(..)`
Length: 3, dtype: str

# with astype, also the missing values get stringified, but again returned as object dtype
>>> ser = pd.Series([1, None, np.nan], dtype=object)
>>> ser.astype("U").values
array(['1', 'None', 'nan'], dtype=object)

# with pd.array(..) -> actually defers to `np.array(..)` behaviour
>>> pd.array([1, None, np.nan], dtype="U")
<NumpyExtensionArray>
['1', 'None', 'nan']
Length: 3, dtype: str128
```
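
For reference, a small numpy-only sketch of the behaviour that `pd.array(..., dtype="U")` defers to. It uses only plain numpy calls and shows where `str128` in the repr above comes from: the longest stringified element (`'None'`) has 4 characters at 4 bytes each.

```python
import numpy as np

# numpy stringifies every element, including None and NaN.
arr = np.array([1, None, np.nan], dtype="U")

values = arr.tolist()
kind = arr.dtype.kind            # "U" = fixed-width unicode
bits = arr.dtype.itemsize * 8    # 4 characters * 4 bytes * 8 bits

# "U" and np.str_ resolve to the same numpy dtype.
same_dtype = np.dtype("U") == np.dtype(np.str_)
```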



API: how to treat numpy string dtype (and aliases) as dtype argument in constructors? #64560
