BUG (string): construction of Series / Index fails from dict keys when "str" dtype is specified explicitly #60343

Open
jorisvandenbossche opened this issue Nov 17, 2024 · 9 comments · May be fixed by #60383

@jorisvandenbossche
Member

When not specifying a dtype (inferring the type), construction of Index or Series from dict keys goes fine:

>>> pd.options.future.infer_string = True
>>> d = {"a": 1, "b": 2}
>>> pd.Index(d.keys())
Index(['a', 'b'], dtype='str')

But if you explicitly specify the dtype, then it fails:

>>> pd.Index(d.keys(), dtype="str")
...

File ~/scipy/repos/pandas/pandas/core/arrays/string_arrow.py:206, in ArrowStringArray._from_sequence(cls, scalars, dtype, copy)
    203     return cls(pc.cast(scalars, pa.large_string()))
    205 # convert non-na-likes to str
--> 206 result = lib.ensure_string_array(scalars, copy=copy)
    207 return cls(pa.array(result, type=pa.large_string(), from_pandas=True))

File lib.pyx:727, in pandas._libs.lib.ensure_string_array()

File lib.pyx:822, in pandas._libs.lib.ensure_string_array()

ValueError: Buffer has wrong number of dimensions (expected 1, got 0)

The reason is that, at that point, we pass the data directly to the dtype's array _from_sequence instead of first pre-processing the data into a numpy array. _from_sequence then calls ensure_string_array directly, which doesn't seem to be able to handle dict keys (although we do call np.asarray(..) inside ensure_string_array, so I'm not entirely sure what is going wrong).
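The "expected 1, got 0" in the error is consistent with how NumPy treats dict views, although nothing in this thread confirms that as the cause. A dict_keys object is iterable but is not a sequence, so np.asarray wraps the whole object in a zero-dimensional object array instead of building a 1-D array of the keys. A minimal sketch, assuming only NumPy (the variable names are illustrative):

```python
import numpy as np

d = {"a": 1, "b": 2}

# dict_keys is not a sequence, so NumPy stores the view itself
# as a single object scalar: the result has zero dimensions.
wrapped = np.asarray(d.keys())
print(wrapped.ndim)

# Materializing the keys into a list first gives the 1-D array
# that a routine expecting a 1-D buffer can work with.
materialized = np.asarray(list(d.keys()), dtype=object)
print(materialized.ndim)
```

If this is indeed what happens, converting the keys to a list before calling the constructor would be a plausible workaround until a fix lands.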

jorisvandenbossche added the Bug, Strings, and Constructors labels on Nov 17, 2024
jorisvandenbossche added this to the 2.3 milestone on Nov 17, 2024
@tasfia8
tasfia8 commented Nov 18, 2024

Hi Joris! If I fix this, I could send you a PR. Would you be able to merge it, or suggest changes so that it can be merged? I have a school assignment with a deadline: I need to work on an open source good first issue where the owner merges my PR at the end. Could you assign this to me and help me with it? I am a 4th year Computer Engineering major.

@tasfia8
tasfia8 commented Nov 18, 2024

Also, would you be able to tell me what files I should look at for this so I can start? Do I fork the main branch?

@KevsterAmp
Contributor

Hi @tasfia8, kindly check the contributing docs for guidance on GitHub issue assignment, the proper format of PRs, etc.: https://pandas.pydata.org/docs/development/contributing.html

I recommend working on an issue labeled "good first issue", since those issues mainly involve simple fixes that are well suited to first-time contributors.

@tasfia8
tasfia8 commented Nov 19, 2024

I have already started working on this. Would you be able to assign it to me? I think I can do it, and I have read the contributing files. Thank you.

@KevsterAmp
Contributor

KevsterAmp commented Nov 19, 2024

@tasfia8 - instructions for issue assignment can be found in the contributing docs

@tasfia8
tasfia8 commented Nov 19, 2024

take

@tasfia8
tasfia8 commented Nov 19, 2024

@jorisvandenbossche
I think I have figured it out, and just wanted to show both you and @KevsterAmp before I make a PR. I will open a PR soon and let you know. I get this output now; is this what you are expecting? I have also added test cases, and all existing test cases pass.
Output:
[Image: Screenshot 2024-11-19 at 3 17 08 AM]

The issue was that dict_keys was passed directly to the StringDtype's _from_sequence method, which could not handle non-array-like inputs like dict_keys. The fix involved updating the handling of dict_keys during the construction of an Index or Series.
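The general idea of handling dict views before they reach _from_sequence can be sketched as follows. This is not the code from the PR, which is not shown in this thread; materialize_dict_view is a hypothetical helper name, and the sketch uses only the standard library:

```python
from collections.abc import ItemsView, KeysView, ValuesView


def materialize_dict_view(data):
    """Hypothetical helper: copy a dict view into a list.

    Any other input is returned unchanged, so existing code paths
    for lists, arrays, and so on are not affected.
    """
    if isinstance(data, (KeysView, ValuesView, ItemsView)):
        return list(data)
    return data


d = {"a": 1, "b": 2}

# d.keys() is a view, not a sequence; the helper turns it into a list.
keys = materialize_dict_view(d.keys())
print(keys)

# Inputs that are already lists pass through as the same object.
already_list = [1, 2]
print(materialize_dict_view(already_list) is already_list)
```

Checking against the abstract view classes rather than type({}.keys()) also covers dict-like mappings whose views subclass or register with them, which is a design choice for this sketch rather than something stated in the thread.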

@jorisvandenbossche
Member Author

@tasfia8 apologies for the slow response. The output you show is indeed the expected behaviour.
I think the easiest will be to open a PR so we can see the code and more easily give feedback. Feel free to mark the PR as "draft" if you are unsure whether it is ready; we can already take a look in that case.

@tasfia8
tasfia8 commented Nov 21, 2024

Done @jorisvandenbossche. Please see #60383.
