TST (string dtype): resolve xfails for frame methods #60336

Draft · wants to merge 1 commit into base: main
Conversation

Member
@WillAyd WillAyd commented Nov 16, 2024

No description provided.

@WillAyd WillAyd added this to the 2.3 milestone Nov 16, 2024
Member Author
@WillAyd WillAyd left a comment

These are pretty tricky, and I'm not sure I've approached them correctly. Could use some extra input @jorisvandenbossche

@@ -2362,5 +2362,6 @@ def external_values(values: ArrayLike) -> ArrayLike:
values.flags.writeable = False

# TODO(CoW) we should also mark our ExtensionArrays as read-only
Member Author

Have we already had discussions on how to make ExtensionArrays readonly?

dt1 = datetime.datetime(2015, 1, 1, tzinfo=dateutil.tz.tzutc())
dt2 = datetime.datetime(2015, 2, 2, tzinfo=dateutil.tz.tzutc())
df["Time"] = [dt1]
df = DataFrame({"Time": [dt1]})
Member Author

The problem with this test is that an empty DataFrame is created first, which creates an object dtype column; subsequently, assigning the column keeps its dtype as object.

That seems like a more general usage issue that needs to be resolved, although for this test I didn't think it was important to use that construction pattern.
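The two construction patterns being compared can be sketched in a few lines (a minimal illustration with a made-up frame; under the new string dtype, the difference shows up in the dtype of the columns Index, which stays object when built up incrementally):

```python
import datetime

import dateutil.tz
import pandas as pd

dt1 = datetime.datetime(2015, 1, 1, tzinfo=dateutil.tz.tzutc())

# Pattern the test originally used: start from an empty frame, then
# assign the column afterwards.
df_assign = pd.DataFrame()
df_assign["Time"] = [dt1]

# Pattern the test was changed to: construct the frame directly.
df_direct = pd.DataFrame({"Time": [dt1]})

# The column values infer a tz-aware datetime dtype either way.
assert df_direct["Time"].dtype.kind == "M"
assert df_assign["Time"].iloc[0] == dt1
```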

Member

Yeah, see my comment above about this (and opened #60338 about it), but for the tests the above change is indeed fine

@@ -6273,6 +6274,10 @@ class max type
else:
to_insert = ((self.index, None),)

if len(new_obj.columns) == 0 and names:
Member Author

This is a local fix for the problem of appending column names to an empty columns Index, which defaults the column dtype to object. While this fixes the tests, there seems to be a larger issue at play that I'm not sure how to solve.
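The empty-columns case can be reproduced directly (a minimal sketch with made-up data; the freshly created columns Index defaults to object dtype under the legacy default):

```python
import pandas as pd

# A frame with an index but no columns: reset_index has to build a
# brand-new columns Index out of the index names, and that fresh
# Index is not inferred as "str".
df = pd.DataFrame(index=pd.Index(["a", "b"], name="idx"))
out = df.reset_index()

assert list(out.columns) == ["idx"]
assert out["idx"].tolist() == ["a", "b"]
```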

Member

This seems to be a similar issue to the pattern of creating an empty DataFrame and then adding columns, which I also encountered in the tests (for now, I got the tests passing by either ensuring the expected result uses object dtype or ensuring the empty DataFrame starts with an empty columns Index of dtype "str").

I am not sure we should "fix" this issue, as it would also introduce an inconsistency in the expected dtype, but opened #60338 to give this a bit more visibility.

Member

> I am not sure we should "fix" this issue

I think any code changes should perhaps go in PRs separate from the general xfail-resolving PRs, and maybe, to avoid any regressions on 2.3.x, be wrapped in using_string_dtype() if blocks?

Member Author
@WillAyd WillAyd Nov 18, 2024

Thanks for the feedback. I'll get this removed

@simonjayhawkins just to confirm I understand, are you asking to separate out PRs that need to change tests to correct the xfails from PRs that need to change the core implementation?

Member

If the PR title implies that the changes are test-related, then I don't normally expect to see code changes to the core implementation. So yes, I think splitting this PR is wise.

assert item is pd.NA

# For non-NA values, we should match what we get for non-EA str
alt = obj.astype(str)
Member

Maybe repeat the above also with dta.astype("str"), so we test the default string dtype as well

Member Author

Meant to leave a comment on this. It's not visible in the diff, but there is already a tm.assert_frame_equal call a few lines up from this. Is there any value in calling that and then also calling it with a slice?

Member

I meant to repeat the expected = frame_or_series(dta.astype("string")) as expected = frame_or_series(dta.astype("str")) (to test both the NA and the NaN variant).
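A quick sketch of the two variants in question, using a plain object Series for illustration (the astype("str") result is version-dependent, so only the NA-variant is asserted here):

```python
import pandas as pd

s = pd.Series(["foo", None], dtype=object)

# NA-variant: the opt-in StringDtype, where missing values are pd.NA.
na_variant = s.astype("string")
assert na_variant.dtype == "string"
assert na_variant[1] is pd.NA

# NaN-variant: on pandas versions where the new default string dtype
# is enabled, astype("str") produces the NaN-backed variant instead;
# on older versions it falls back to stringifying into object dtype,
# so no assertion is made about it here.
nan_variant = s.astype("str")
```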

expected = Series([np.array(["bar"])])
else:
expected = Series(["bar"])
expected = Series(np.array(["bar"]), dtype=object)
Member

Hmm, shouldn't we expect str dtype here if that is enabled?

Member Author

I could see it either way, and I don't have a strong preference.

Member

Ah, I missed that this is a kind of "reducing" apply, because the applied lambda returns a 0-dim array (a kind of array scalar).

When doing a normal apply preserving the column length, it already infers it as string:

In [25]: result = df.apply(lambda col: np.array(["bar"]))

In [26]: result
Out[26]: 
     0
0  bar

In [27]: result.dtypes
Out[27]: 
0    str
dtype: object

So here it is essentially reducing each column and then creating a Series with the results. Also in this case, I would expect that we infer the dtype?
But it seems this is not specific to strings, because we also get object dtype when doing the same with an integer:

In [31]: result = df.apply(lambda col: np.array(1))

In [32]: result
Out[32]: 
0    1
dtype: object
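The contrast above can be reproduced in a few lines (a minimal sketch with a made-up two-row frame, so the reduction is unambiguous):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({0: ["foo", "bar"]})

# A length-preserving apply returns a full column, so pandas can run
# normal dtype inference on the result.
full = df.apply(lambda col: np.array(["bar", "baz"]))
assert full.shape == df.shape

# A "reducing" apply that returns a 0-dim array stores that 0-dim
# array object itself as the element, which forces object dtype.
res = df.apply(lambda col: np.array(1))
assert res.dtype == object
assert res.iloc[0].ndim == 0  # the element really is a 0-dim ndarray
```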

Member

Ah, and the reason it is object dtype is that we actually store the 0-dim array object in the Series. Continuing with the last example above:

In [33]: result.values
Out[33]: array([array(1)], dtype=object)

So yes, object dtype is correct here, but it's also just a strange test... (I would say that ideally we would "unpack" those 0-dim arrays into actual scalars and then do proper type inference.)
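The "unpacking" idea can be sketched like this (a hypothetical illustration of the suggestion, not the actual pandas internals):

```python
import numpy as np
import pandas as pd

# "Unpacking" 0-dim arrays into real scalars with .item() before
# constructing the Series lets normal type inference kick in.
raw = [np.array(1), np.array(2)]
unpacked = pd.Series([a.item() for a in raw])

assert unpacked.dtype == "int64"
assert unpacked.tolist() == [1, 2]
```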

Member Author

Yeah, I'm not sure. I don't quite understand how this test is useful in practice, so it's hard to form an opinion.

@@ -64,7 +64,6 @@ def test_interpolate_inplace(self, frame_or_series, request):
assert np.shares_memory(orig, obj.values)
assert orig.squeeze()[1] == 1.5

# TODO(infer_string) raise proper TypeError in case of string dtype
Member

This still needs to be done?

alt = obj.astype(str)
assert np.all(alt.iloc[1:] == result.iloc[1:])
else:
assert item is np.nan
Member

item should never be np.nan with the original string dtype?
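For reference, the missing-value sentinel of the original (NA-variant) string dtype can be checked directly:

```python
import pandas as pd

# With the opt-in "string" dtype (the NA-variant), missing values
# come back as pd.NA, never np.nan.
arr = pd.array(["foo", None], dtype="string")

assert arr[1] is pd.NA
assert arr[0] == "foo"
```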

Member Author

Nice catch - I'll take a closer look at why that happens

Member

There should be no need for inference here: the result and expected are both astyped, so I would expect that the using_infer_string fixture is not needed at all. @jorisvandenbossche has asked that you also test with astype("str"), and that would not change any inference. There is a fixture for testing the different string dtypes.

@simonjayhawkins simonjayhawkins added the Strings String extension data type and string data label Nov 18, 2024