Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Copy attrs on join (possibly depending on left, right, etc.) #60351

Open
1 of 3 tasks
rommeswi opened this issue Nov 18, 2024 · 8 comments · May be fixed by #60357
Open
1 of 3 tasks

ENH: Copy attrs on join (possibly depending on left, right, etc.) #60351

rommeswi opened this issue Nov 18, 2024 · 8 comments · May be fixed by #60357
Labels
Enhancement metadata _metadata, .attrs Needs Info Clarification about behavior needed to assess issue Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@rommeswi
Copy link

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

df.join() does not retain the attrs of the dataset. Given that the attrs manual states that "many operations that create new datasets will retain attrs", this seems like an omission.

Feature Description

Join is different from concat because there is a clear dataframe that the operation is on. Therefore, it would seem natural if df.join() would retain the attrs of the initial dataframe.

Alternative Solutions

It would also be possible to make the attrs dependent on "how" but this would only be natural for "left" and "right".

Additional Context

No response

@rommeswi rommeswi added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 18, 2024
@timhoffm
Copy link
Contributor

timhoffm commented Nov 18, 2024

This should be handled the same way as concat: Propagate only if all inputs have the same attrs.

concat is currently a hard-coded special case,

if method == "concat":

but we may want to delegate the attrs combination back to the operation instead, i.e.

if hasattr(other, "__combined_attrs__"):
      self.attrs = other.__combined_attrs__()

Note that other is the "combination object" for there calls, i.e. _MergeOperation, _Concatenator etc, which will have to grow the logic for combining attrs.

Alternatively, one could leave the combination logic in __finalize__ but provide a uniform interface on all "combination objects" to give their inputs. Currently, thats non-uniform _Concatenator.objs, but _MergeOperation.left/_MergeOperation.right.

@rhshadrach
Copy link
Member

Thanks for the report! @timhoffm - can you post a reproducible example.

@rhshadrach rhshadrach added Needs Info Clarification about behavior needed to assess issue Reshaping Concat, Merge/Join, Stack/Unstack, Explode metadata _metadata, .attrs and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 18, 2024
@timhoffm timhoffm linked a pull request Nov 19, 2024 that will close this issue
timhoffm added a commit to timhoffm/pandas that referenced this issue Nov 19, 2024
This uses the same logic as `pd.concat()`: Copy `attrs` only if all
input `attrs` are identical.

I've refactored the handling in __finalize__ from special-casing based on th the method name (previously only "concat") to handling "other" parameters
that have an `input_objs` attribute. This is a more scalable architecture compared to hard-coding method names in __finalize__.

Tests added for `concat()` and `merge()`.

Closes pandas-dev#60351.
@timhoffm
Copy link
Contributor

timhoffm commented Nov 19, 2024

#60357 should fix this. I've choosen the somewhat smaller refactoring and not pushed the combination logic back into the "combination objects". In fact #59141 removed _Concatenator in favor of simple functions. Therefore, I've now choosen the common API to be "provides the inputs via input_objs parameter".

Note that join() is implemented via concat() or merge() depending on the case. I've only added explicit tests for these fundamental operations, not for join(), but could add that if desired.

timhoffm added a commit to timhoffm/pandas that referenced this issue Nov 19, 2024
This uses the same logic as `pd.concat()`: Copy `attrs` only if all
input `attrs` are identical.

I've refactored the handling in __finalize__ from special-casing based on th the method name (previously only "concat") to handling "other" parameters
that have an `input_objs` attribute. This is a more scalable architecture compared to hard-coding method names in __finalize__.

Tests added for `concat()` and `merge()`.

Closes pandas-dev#60351.
timhoffm added a commit to timhoffm/pandas that referenced this issue Nov 19, 2024
This uses the same logic as `pd.concat()`: Copy `attrs` only if all
input `attrs` are identical.

I've refactored the handling in __finalize__ from special-casing based on th the method name (previously only "concat") to handling "other" parameters
that have an `input_objs` attribute. This is a more scalable architecture compared to hard-coding method names in __finalize__.

Tests added for `concat()` and `merge()`.

Closes pandas-dev#60351.
timhoffm added a commit to timhoffm/pandas that referenced this issue Nov 19, 2024
This uses the same logic as `pd.concat()`: Copy `attrs` only if all
input `attrs` are identical.

I've refactored the handling in __finalize__ from special-casing based on th the method name (previously only "concat") to handling "other" parameters
that have an `input_objs` attribute. This is a more scalable architecture compared to hard-coding method names in __finalize__.

Tests added for `concat()` and `merge()`.

Closes pandas-dev#60351.
timhoffm added a commit to timhoffm/pandas that referenced this issue Nov 19, 2024
This uses the same logic as `pd.concat()`: Copy `attrs` only if all
input `attrs` are identical.

I've refactored the handling in __finalize__ from special-casing based on th the method name (previously only "concat") to handling "other" parameters
that have an `input_objs` attribute. This is a more scalable architecture compared to hard-coding method names in __finalize__.

Tests added for `concat()` and `merge()`.

Closes pandas-dev#60351.
@rommeswi
Copy link
Author

This should be handled the same way as concat: Propagate only if all inputs have the same attrs.

I disagree because concat is a symmetric operation operating on a list of dfs but join is an operation that joins other data to a specific dataframe. The properties of the initial dataframe should then be maintained unless pandas is instructed otherwise. It's as if you start dropping rows from your dataframe after a join just because the added data has some NaNs.

@timhoffm
Copy link
Contributor

I recommend to be defensive on all operations that have multiple DataFrames as input. The inputs may have contradicting Information in their attrs and joining them can lead to misinterpretation. Say, df1.attrs['date'] = "1.1.2024"; df2.attrs['date'] = "1.1.2025" then, having one of the dates put on the join is not fair.

Instead, I propose to implement a set_attrs() helper as proposed in #52166 (comment) to enable easy explicit handling of attrs by the user.

@rommeswi
Copy link
Author

Something like set_attrs() saves half a line of boilerplate code. But the main cost is cognitive: users constantly need to chase after the attrs to make sure they are copied. If the manual says that most operations copy them, I would expect any operation to copy those attrs that look like a reasonable default. Otherwise I expect the manual to say that attrs are usually dropped unless pandas is absolutely sure that the user wants to maintain the attrs.

I would even argue that on a concat the most sensible thing to do would be to merge the attrs by field and throw a warning in case of conflicting fields.

@timhoffm
Copy link
Contributor

timhoffm commented Nov 20, 2024

We can possibly improve the docs.

Concerning the behavior, there are two aspects stemming from the experimental state of the feature

  • The intended behavior is not fully defined.
  • The implementation is incomplete.

This has lead to the suggestion to deprecate attrs (#52166). I've argued against the deprecation because I believe attrs are valuable, but the topic is still an experimental and incomplete feature. That's also the reason for the vague "most operations copy" wording.

My personal take (as a non-pandas developer but advocate for attrs) on the intended behavior is to be very defensive and minimal:

  • Operations DataFrame -> DataFrame should copy the attrs
  • Operations [DataFrame, DataFrame, ...] -> DataFrame should copy the attrs only if all inputs are identical. - In the face of ambiguity, refuse the temptation to guess.
  • Copies are always deep to prevent accidental modification of other DataFrames

@rhshadrach
Copy link
Member

I'm supportive of @timhoffm's proposal. Users can also override this behavior via subclassing.

import functools as ft

import pandas as pd

class Foo(pd.DataFrame):
    @property
    def _constructor(self):
        return Foo

    @ft.wraps(DataFrame.join)
    def join(self, other, *args, **kwargs):
        self_attrs = self.attrs
        other_attrs = other.attrs
        result = super().join(other, *args, **kwargs)
        result.attrs = self_attrs | other_attrs
        return result

df1 = Foo({"a": [1, 2, 3], "b": [3, 4, 5]}).set_index("a")
df1.attrs = {"key1": "value1"}
df2 = Foo({"a": [2, 3, 4], "c": [6, 7, 8]}).set_index("a")
df2.attrs = {"key2": "value2"}
result = df1.join(df2, how="inner")
print(result)
#    b  c
# a      
# 2  4  6
# 3  5  7
print(result.attrs)
# {'key1': 'value1', 'key2': 'value2'}

Perhaps that's not helpful in all cases, but hopefully is in many.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement metadata _metadata, .attrs Needs Info Clarification about behavior needed to assess issue Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants