There is one remaining TODO item in the string dtype tracker issue for the actual implementation (#54792):
Ensure dtype=str / astype(str) works as an alias when the future mode is enabled (and any other alias which will currently convert the input to strings, like "U"?)
The Python str is already an alias for "str" / pd.StringDtype(na_value=np.nan), so the remaining question is: what to do with the other aliases that currently "kind of" give you strings?
Summary of the current behaviour when using numpy's fixed-width unicode string dtype (np.dtype("U"), and its aliases like np.str_, "U", "unicode", "str_", ...):
- Series(.., dtype="U") constructor: converts all data to strings while preserving missing values as-is, but then returns the result as object dtype
  - here, we end up calling lib.ensure_string_array with the default of skipna=True
- DataFrame(.., dtype="U") constructor: seems to be consistent with Series
- .astype("U") method: converts all data to strings, including stringifying the missing values, and then returns the result as object dtype
  - here, we end up calling lib.ensure_string_array (in _astype_nansafe) with skipna=False, causing those missing values to be stringified
  - (this is the behaviour we originally also had for astype(str), but that is something we fixed for 3.0; before that, it was essentially also an alias of the numpy dtype)
  - when passing an actual numpy array of this dtype to the Series constructor, we do convert it to our "str" dtype and no longer return it as an object-dtype Series
- pd.array(..., dtype="U") constructor: converts all data to strings, including stringifying the missing values. This essentially just defers to the np.array(..) behaviour for that dtype, and returns the result as a NumpyEA with the numpy "U" dtype
Code examples illustrating the above:
# first, with the actual string dtype
# -> converts non-strings to string, preserves missing values (coerced to NaN)
>>> pd.Series([1, None, np.nan], dtype=str).values
<ArrowStringArray>
['1', nan, nan]
Length: 3, dtype: str
# using np.dtype("U") (or any of its aliases)
# -> converts non-strings, preserves missing values as is -> but returns it as object dtype
>>> pd.Series([1, None, np.nan], dtype="U").values
array(['1', None, nan], dtype=object)
# on the other hand, when passing a numpy string _array_, we actually do convert to our string dtype
>>> pd.Series(np.array([1, None, np.nan], dtype=np.str_)).values
<ArrowStringArray>
['1', 'None', 'nan'] # conversion of None/NaN to strings was already done by `np.array(..)`
Length: 3, dtype: str
# with astype, also the missing values get stringified, but again returned as object dtype
>>> ser = pd.Series([1, None, np.nan], dtype=object)
>>> ser.astype("U").values
array(['1', 'None', 'nan'], dtype=object)
# with pd.array(..) -> actually defers to `np.array(..)` behaviour
>>> pd.array([1, None, np.nan], dtype="U")
<NumpyExtensionArray>
['1', 'None', 'nan']
Length: 3, dtype: str128
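The difference between the constructor path and the astype path comes down to the skipna flag described in the summary. As a rough illustration only, here is a pure-Python sketch of that switch. The `stringify` helper is hypothetical and is not how lib.ensure_string_array is implemented; it just mimics the two outcomes (missing values kept as-is vs. stringified) on a plain list.

```python
import math


def stringify(values, skipna):
    """Hypothetical stand-in for the skipna switch (not pandas' actual code).

    skipna=True  -> missing values (None / NaN) are passed through untouched
    skipna=False -> every value, including missing ones, goes through str()
    """
    out = []
    for value in values:
        is_missing = value is None or (
            isinstance(value, float) and math.isnan(value)
        )
        if skipna and is_missing:
            out.append(value)  # keep the missing value as-is
        else:
            out.append(str(value))  # stringify everything else
    return out


data = [1, None, float("nan")]

# Mirrors the Series(.., dtype="U") constructor behaviour described above
constructor_like = stringify(data, skipna=True)

# Mirrors the .astype("U") behaviour described above
astype_like = stringify(data, skipna=False)

print(constructor_like)  # ['1', None, nan]
print(astype_like)       # ['1', 'None', 'nan']
```

Seen this way, the inconsistency between the two entry points is a single flag, which suggests that aligning them is mostly a question of which default we want for these aliases.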