String dtype: use ObjectEngine for indexing for now correctness over performance #60329
FWIW I noticed the xfails in test_pivot.py are going to require this, as there are tests that work with missing values as column labels.
What's the reason for adding a new engine versus changing the existing StringEngine?
pandas/_libs/index.pyi
Outdated
@@ -54,6 +54,7 @@ class UInt16Engine(IndexEngine): ...
class UInt8Engine(IndexEngine): ...
class ObjectEngine(IndexEngine): ...
class StringEngine(IndexEngine): ...
class StringObjectEngine(ObjectEngine): ...
Hmm, would it be better to call this StrEngine? Or where does the term StringObject come from?
It was meant to be read as "string-objectengine", i.e. essentially just the object engine, but we know that we only use it for strings (and so the _check_type can be specialized).
But I'm not attached to the exact name (although StrEngine might also be confusing, because we currently use this for both str and string dtypes).
I was initially thinking of modifying the StringEngine to be like the masked engine so that it properly handles missing values, but that turned out to be a bit more complicated. So, to have something that works correctly, I decided (for now) to fall back to the ObjectEngine (which we were using before for the string dtype as well). You can see in the first commit that's what I did, but then I realized that the ObjectEngine by itself is not yet enough if we want to stay compatible and allow looking up missing values with None vs np.nan (the object engine is strict about that, but for back compat I would prefer that the
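As a rough illustration of the idea described above (an object-style engine whose type check is specialized for strings and maps NA-like keys onto the stored NaN), here is a minimal pure-Python sketch. It is not the pandas implementation: `ToyObjectEngine`, `ToyStringObjectEngine`, `is_missing` and `NAN` are hypothetical names, and a plain dict stands in for the real hash table.

```python
import math

NAN = math.nan  # single NaN object standing in for the stored missing value


def is_missing(value):
    """True for None and for float NaN (toy stand-in for a null check)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


class ToyObjectEngine:
    """Dict-backed lookup that is strict about the key it receives."""

    def __init__(self, values):
        self._table = {val: pos for pos, val in enumerate(values)}

    def _check_type(self, key):
        return key  # the generic engine passes any hashable key through

    def get_loc(self, key):
        key = self._check_type(key)
        try:
            return self._table[key]
        except KeyError:
            raise KeyError(key) from None


class ToyStringObjectEngine(ToyObjectEngine):
    """Same lookup, but keys are known to be strings or missing values."""

    def _check_type(self, key):
        if isinstance(key, str):
            return key
        if is_missing(key):
            return NAN  # map every supported NA-like key to the stored NaN
        raise KeyError(key)  # anything else can never be in a string index


# Mimic construction-time coercion: every missing value is stored as NAN.
stored = [NAN if is_missing(v) else v for v in ["a", "b", None]]

strict = ToyObjectEngine(stored)
lenient = ToyStringObjectEngine(stored)
```

With this toy setup, `lenient.get_loc(None)` and `lenient.get_loc(float("nan"))` both resolve to the position of the stored NaN, while `strict.get_loc(None)` raises `KeyError` because the strict engine only finds the exact stored object.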
        with pytest.raises(KeyError):
            index.get_loc(nulls_fixture)

    def test_get_loc_missing(self, any_string_dtype, nulls_fixture):
So this test now means that you can use np.nan and pd.NA interchangeably when indexing? If that's correct, I'm not sure I agree that we should be going that far.
The problem is that we are coercing any missing value indicator to NaN upon construction, and so to preserve back compat, I think I prefer we do the same for input to indexing operations.
To express it in terms of get_loc, this works now:
>>> pd.options.future.infer_string = False
>>> pd.Index(["a", "b", None]).get_loc(None)
2
but the same on main with the string dtype enabled fails:
>>> pd.options.future.infer_string = True
>>> pd.Index(["a", "b", None]).get_loc(None)
...
KeyError: None
That is because the None is no longer stored as-is, as it was in the object dtype index, but has been coerced to NaN.
(On main, trying the above with np.nan also fails (see issue #59879), but that's because the StringEngine simply wasn't set up to work with missing values; that is the initial reason I replaced it with the StringObjectEngine.)
The above is with None, but essentially the same happens with any other missing value indicator, like pd.NA. None and np.nan are probably the most important ones, but I would at least prefer that indexing with None keeps working for now (we can always start deprecating it, but I wouldn't do that as a breaking change for 3.0).
FWIW this is also already quite inconsistent depending on the data type .. See #59765 for an overview (e.g. also for datetimelike and categorical, we treat all NA-likes as the same in indexing lookups)
> FWIW this is also already quite inconsistent depending on the data type .. See #59765 for an overview (e.g. also for datetimelike and categorical, we treat all NA-likes as the same in indexing lookups)
Nice - that's a great issue. Thanks for opening it.
> To express it in terms of get_loc, this works now:
Hmm I'm a bit confused by how this relates to all of the missing indicators becoming essentially equal though. On main, this does not work (?):
>>> pd.options.future.infer_string = False
>>> pd.Index(["a", "b", None]).get_loc(np.nan)
KeyError: nan
Definitely understand that there is not an ideal solution here given the inconsistent history, but I don't want to go too far and just start making all of the missing value indicators interchangeable. I think containment logic should land a little closer to equality logic, and in the latter we obviously don't allow this
> On main, this does not work (?):
Yes, that's the first bug that this PR is solving: right now no missing value lookup works, not even for NaN itself (which is what is stored in the array). This is because the StringEngine simply doesn't handle missing values correctly (when building the hash table, it actually converts them to a sentinel string, but then none of the lookup methods take that into account; it's a somewhat incomplete implementation).
So using the ObjectEngine (subclass) fixes that first issue: ensuring NaN can be found.
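To make the failure mode described above concrete, here is a small pure-Python sketch of a table that rewrites missing values to a sentinel string when it is built but does not apply the same rewrite on lookup. This is only an analogy, not the actual StringEngine code: the sentinel value and the function names are made up for illustration.

```python
import math

# Hypothetical sentinel; the real engine's sentinel string is not shown here.
SENTINEL = "<toy-missing-sentinel>"


def _is_missing(value):
    # None, or a float NaN (NaN is the only float that is not equal to itself)
    return value is None or (isinstance(value, float) and value != value)


def build_table(values):
    """Build a value -> position table, storing missing values as SENTINEL."""
    table = {}
    for pos, val in enumerate(values):
        table[SENTINEL if _is_missing(val) else val] = pos
    return table


def lookup_incomplete(table, key):
    """Look the key up as given: the sentinel rewrite is forgotten here."""
    return table[key]


def lookup_complete(table, key):
    """Apply the same sentinel rewrite on lookup as on construction."""
    return table[SENTINEL if _is_missing(key) else key]


table = build_table(["a", "b", math.nan])
```

Here `lookup_incomplete(table, math.nan)` raises `KeyError`, because the table only contains the sentinel string, whereas `lookup_complete(table, math.nan)` finds position 2.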
> I think containment logic should land a little closer to equality logic, and in the latter we obviously don't allow this
Missing values don't compare equal (well, None does, but we specifically didn't choose that as the long-term sentinel moving forward; np.nan and pd.NA don't compare equal), so containment is already a bit of a special case anyway compared to equality when it comes to missing values.
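The gap between equality and containment for missing values can already be seen in plain Python, independent of pandas. The following sketch uses only `math.nan` and `None` (pd.NA is not covered), and the variable names are illustrative.

```python
import math

nan = math.nan             # one shared NaN object
other_nan = float("nan")   # a distinct NaN object

# Equality: NaN never equals anything, including itself; None equals None.
nan_equals_itself = nan == nan
none_equals_none = None == None  # noqa: E711 (deliberate == comparison)

# List containment checks identity first and only then falls back to ==,
# so the very same NaN object is "found" even though it is not equal to itself.
same_object_found = nan in [nan]
distinct_nan_found = other_nan in [nan]

print(nan_equals_itself, none_equals_none)    # False True
print(same_object_found, distinct_nan_found)  # True False
```

So even in the standard library, "is this missing value contained?" is answered by something other than pure equality, which is the sense in which containment is already a special case.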
Fair point on the equality. I guess I'm still hung up on the indexing behavior being the same though.
I've lost track of the nuance a bit, but haven't np.nan and pd.NA always had different indexing behavior? I'm just wary of glossing over that as part of this.
Maybe worth some input from @pandas-dev/pandas-core if anyone else has thoughts
A new StringEngine for indexing was added in #56997, showing some performance improvements compared to the ObjectEngine. However, there are some issues with the handling of missing values; see for example #59879.
The change in this PR switches back to an object-based engine, to have correct/desired behaviour for now, and we can see later if we can optimize this (but short term for 2.3/3.0 I would prioritize correct behaviour).
xref #54792