TST (string dtype): resolve xfails for frame methods #60336

Draft · wants to merge 1 commit into base: main
Conversation

Member
@WillAyd WillAyd commented Nov 16, 2024

No description provided.

@WillAyd WillAyd added this to the 2.3 milestone Nov 16, 2024
Member Author
@WillAyd WillAyd left a comment

These are pretty tricky, and I'm not sure I've approached them correctly. Could use some extra input @jorisvandenbossche

@@ -2362,5 +2362,6 @@ def external_values(values: ArrayLike) -> ArrayLike:
values.flags.writeable = False

# TODO(CoW) we should also mark our ExtensionArrays as read-only
Member Author

Have we already had discussions on how to make ExtensionArrays readonly?

dt1 = datetime.datetime(2015, 1, 1, tzinfo=dateutil.tz.tzutc())
dt2 = datetime.datetime(2015, 2, 2, tzinfo=dateutil.tz.tzutc())
df["Time"] = [dt1]
df = DataFrame({"Time": [dt1]})
Member Author

The problem with this test is that an empty DataFrame is created first, which creates an object dtype column; subsequently, assigning the column keeps its dtype as object.

That seems like a more general usage issue that needs to be resolved, although for this test I didn't think it was important to use that construction pattern.
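The two construction patterns being compared can be sketched in a few lines (a minimal illustration with a made-up frame; under the new string dtype, the difference shows up in the dtype of the columns Index, which stays object when built up incrementally):

```python
import datetime

import dateutil.tz
import pandas as pd

dt1 = datetime.datetime(2015, 1, 1, tzinfo=dateutil.tz.tzutc())

# Pattern the test originally used: start from an empty frame, then
# assign the column afterwards.
df_assign = pd.DataFrame()
df_assign["Time"] = [dt1]

# Pattern the test was changed to: construct the frame directly.
df_direct = pd.DataFrame({"Time": [dt1]})

# The column values infer a tz-aware datetime dtype either way.
assert df_direct["Time"].dtype.kind == "M"
assert df_assign["Time"].iloc[0] == dt1
```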

Member

Yeah, see my comment above about this (and opened #60338 about it), but for the tests the above change is indeed fine

@@ -6273,6 +6274,10 @@ class max type
else:
to_insert = ((self.index, None),)

if len(new_obj.columns) == 0 and names:
Member Author

This is a local fix for the problem of appending column names to an empty columns Index, which defaults the column dtype to object. While this fixes the tests, there seems to be a larger issue at play that I'm not sure how to solve.
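The empty-columns case can be reproduced directly (a minimal sketch with made-up data; the freshly created columns Index defaults to object dtype under the legacy default):

```python
import pandas as pd

# A frame with an index but no columns: reset_index has to build a
# brand-new columns Index out of the index names, and that fresh
# Index is not inferred as "str".
df = pd.DataFrame(index=pd.Index(["a", "b"], name="idx"))
out = df.reset_index()

assert list(out.columns) == ["idx"]
assert out["idx"].tolist() == ["a", "b"]
```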

Member

This seems to be a similar issue to the pattern of creating an empty DataFrame and then adding columns, which I also encountered in the tests (for now, I got the tests passing by either ensuring the expected result uses object dtype or ensuring the empty DataFrame starts with an empty columns Index of dtype "str").

I am not sure we should "fix" this issue, as it would also introduce an inconsistency in the expected dtype, but opened #60338 to give this a bit more visibility.

Member

> I am not sure we should "fix" this issue

I think any code changes should perhaps go in PRs separate from the general xfail-resolving PRs, and maybe, to avoid any regressions on 2.3.x, be wrapped in using_string_dtype() if blocks?

Member Author
@WillAyd WillAyd Nov 18, 2024

Thanks for the feedback. I'll get this removed

@simonjayhawkins just to confirm I understand, are you asking to separate out PRs that need to change tests to correct the xfails from PRs that need to change the core implementation?

Member

If the PR title implies that the changes are test-related, then I don't normally expect to see code changes to the core implementation. So yes, I think splitting this PR is wise.

assert item is pd.NA

# For non-NA values, we should match what we get for non-EA str
alt = obj.astype(str)
Member

Maybe repeat the above also with dta.astype("str"), so we test the default string dtype as well

Member Author

Meant to leave a comment on this. It's not visible in the diff, but there is already a tm.assert_frame_equal call a few lines up from this. Is there any value in calling that and then also calling it with a slice?

Member

I meant to repeat the expected = frame_or_series(dta.astype("string")) as expected = frame_or_series(dta.astype("str")) (to test both the NA and the NaN variant).
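A quick sketch of the two variants in question, using a plain object Series for illustration (the astype("str") result is version-dependent, so only the NA-variant is asserted here):

```python
import pandas as pd

s = pd.Series(["foo", None], dtype=object)

# NA-variant: the opt-in StringDtype, where missing values are pd.NA.
na_variant = s.astype("string")
assert na_variant.dtype == "string"
assert na_variant[1] is pd.NA

# NaN-variant: on pandas versions where the new default string dtype
# is enabled, astype("str") produces the NaN-backed variant instead;
# on older versions it falls back to stringifying into object dtype,
# so no assertion is made about it here.
nan_variant = s.astype("str")
```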

expected = Series([np.array(["bar"])])
else:
expected = Series(["bar"])
expected = Series(np.array(["bar"]), dtype=object)
Member

Hmm, shouldn't we expect str dtype here if that is enabled?

Member Author

I could see it either way, and I don't have a strong preference.

Member

Ah, I missed that this is a kind of "reducing" apply, because the applied lambda returns a 0-dim array (a kind of array scalar).

When doing a normal apply preserving the column length, it already infers it as string:

In [25]: result = df.apply(lambda col: np.array(["bar"]))

In [26]: result
Out[26]: 
     0
0  bar

In [27]: result.dtypes
Out[27]: 
0    str
dtype: object

So here it is essentially reducing each column and then creating a Series with the results. Also in this case, I would expect that we infer the dtype?
But it seems this is not specific to strings, because we also get object dtype when doing the same with an integer:

In [31]: result = df.apply(lambda col: np.array(1))

In [32]: result
Out[32]: 
0    1
dtype: object
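The contrast above can be reproduced in a few lines (a minimal sketch with a made-up two-row frame, so the reduction is unambiguous):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({0: ["foo", "bar"]})

# A length-preserving apply returns a full column, so pandas can run
# normal dtype inference on the result.
full = df.apply(lambda col: np.array(["bar", "baz"]))
assert full.shape == df.shape

# A "reducing" apply that returns a 0-dim array stores that 0-dim
# array object itself as the element, which forces object dtype.
res = df.apply(lambda col: np.array(1))
assert res.dtype == object
assert res.iloc[0].ndim == 0  # the element really is a 0-dim ndarray
```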

Member

Ah, and the reason it is object dtype is that we actually store the 0-dim array object in the Series. Continuing with the last example above:

In [33]: result.values
Out[33]: array([array(1)], dtype=object)

So yes, object dtype is correct here, but it's also just a strange test... (I would say that ideally we would "unpack" those 0-dim arrays into actual scalars and then do proper type inference.)
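The "unpacking" idea can be sketched like this (a hypothetical illustration of the suggestion, not the actual pandas internals):

```python
import numpy as np
import pandas as pd

# "Unpacking" 0-dim arrays into real scalars with .item() before
# constructing the Series lets normal type inference kick in.
raw = [np.array(1), np.array(2)]
unpacked = pd.Series([a.item() for a in raw])

assert unpacked.dtype == "int64"
assert unpacked.tolist() == [1, 2]
```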

Member Author

Yeah, I'm not sure. I don't quite understand how this test is useful in practice, so it's hard to form an opinion.

@@ -64,7 +64,6 @@ def test_interpolate_inplace(self, frame_or_series, request):
assert np.shares_memory(orig, obj.values)
assert orig.squeeze()[1] == 1.5

# TODO(infer_string) raise proper TypeError in case of string dtype
Member

This still needs to be done?

alt = obj.astype(str)
assert np.all(alt.iloc[1:] == result.iloc[1:])
else:
assert item is np.nan
Member

item should never be np.nan with the original string dtype?
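For reference, the missing-value sentinel of the original (NA-variant) string dtype can be checked directly:

```python
import pandas as pd

# With the opt-in "string" dtype (the NA-variant), missing values
# come back as pd.NA, never np.nan.
arr = pd.array(["foo", None], dtype="string")

assert arr[1] is pd.NA
assert arr[0] == "foo"
```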

Member Author

Nice catch - I'll take a closer look at why that happens

Member

There should be no need for inference here: the result and expected are both astyped, so I would expect that the using_infer_string fixture is not needed at all. @jorisvandenbossche has asked that you also test with astype("str"), and that would not change any inference. There is a fixture for testing the different string dtypes.

@simonjayhawkins simonjayhawkins added the Strings String extension data type and string data label Nov 18, 2024