
API: how to treat numpy string dtype (and aliases) as dtype argument in constructors? #64560

@jorisvandenbossche


There is one remaining TODO item in the string dtype tracker issue for the actual implementation (#54792):

Ensure dtype=str/ astype(str) works as an alias when the future mode is enabled (and any other alias which will currently convert the input to strings, like "U"?)

The Python str is already an alias for "str" / pd.StringDtype(na_value=np.nan), so the remaining question is what to do with the other aliases that currently "kind of" give you strings.

Summary of the current behaviour when using numpy's fixed-width unicode string dtype (np.dtype("U"), and its aliases like np.str_, "U", "unicode", "str_", ...):

  • Series(.., dtype="U") constructor: converts all data to strings while preserving missing values as-is, but then returns the result as object dtype
    • here, we end up calling lib.ensure_string_array with the default of skipna=True
  • DataFrame(.., dtype="U") constructor seems to be consistent with Series
  • .astype("U") method: converts all data to strings, including stringifying the missing values, and then returns the result as object dtype
    • here, we end up calling lib.ensure_string_array (in _astype_nansafe) with skipna=False, causing those missing values to be stringified
    • (this is the behaviour we originally also had for astype(str), but we fixed that for 3.0; before that, it was essentially also an alias of the numpy dtype)
  • when passing an actual numpy array of this dtype to the Series constructor, we do convert it to our "str" dtype and no longer return it as an object-dtype Series
  • pd.array(..., dtype="U") constructor: converts all data to strings, including stringifying the missing values. This essentially just defers to the np.array(..) behaviour for that dtype, and also returns it as a NumpyEA with the numpy "U" dtype

Code examples illustrating the above:

# first, with the actual string dtype 
# -> converts non-strings to string, preserves missing values (coerced to NaN)
>>> pd.Series([1, None, np.nan], dtype=str).values
<ArrowStringArray>
['1', nan, nan]
Length: 3, dtype: str

# using np.dtype("U") (or any of its aliases) 
# -> converts non-strings, preserves missing values as is -> but returns it as object dtype
>>> pd.Series([1, None, np.nan], dtype="U").values
array(['1', None, nan], dtype=object)

# on the other hand, when passing a numpy string _array_, we actually do convert to our string dtype
>>> pd.Series(np.array([1, None, np.nan], dtype=np.str_)).values
<ArrowStringArray>
['1', 'None', 'nan']  # conversion of None/NaN to strings was already done by `np.array(..)`
Length: 3, dtype: str

# with astype, also the missing values get stringified, but again returned as object dtype
>>> ser = pd.Series([1, None, np.nan], dtype=object)
>>> ser.astype("U").values
array(['1', 'None', 'nan'], dtype=object)

# with pd.array(..) -> actually defers to `np.array(..)` behaviour
>>> pd.array([1, None, np.nan], dtype="U")
<NumpyExtensionArray>
['1', 'None', 'nan']
Length: 3, dtype: str128
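
The constructor and astype results above differ only in whether missing values are skipped before stringifying. As a rough illustration, here is a pure-Python sketch of that skipna distinction. The helpers is_missing and stringify are hypothetical names made up for this example; this is not pandas' lib.ensure_string_array, and it only mirrors the outcomes reported above.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (a simplification of pandas' checks)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def stringify(values, skipna=True):
    """Hypothetical stand-in for the skipna behaviour described in this issue.

    skipna=True  -> missing values are passed through untouched
    skipna=False -> missing values are converted with str() like everything else
    """
    return [v if (skipna and is_missing(v)) else str(v) for v in values]


data = [1, None, float("nan")]

# Mirrors the Series(.., dtype="U") constructor path: missing values kept as-is
constructor_like = stringify(data, skipna=True)

# Mirrors the .astype("U") path: missing values become 'None' / 'nan'
astype_like = stringify(data, skipna=False)
```

Under these assumptions, constructor_like keeps None and NaN as they are, while astype_like turns them into the strings 'None' and 'nan', which matches the two object-dtype results shown above.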
