
[40706] Explanation of parity5_plus_5 and potential issues #54

Open
amueller opened this issue Sep 29, 2023 · 12 comments
Labels
Bad Data: Dataset contains bad data that is not marked, such as duplicated rows.
Documentation: Something is wrong or missing in a dataset description.

Comments

@amueller

Hey! So I'm trying to understand parity5_plus_5 and I'm a bit confused.
It has 1124 rows, but the data counts up from 0 to 1023 in binary, so there are 100 duplicate rows. Is that on purpose?
I know OpenML contains specific train/test splits and these might account for the duplication but a lot of people use the datasets without the splits, like me and @SamuelGabriel and @noahho.
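
For reference, a rough sketch of how the duplicates can be counted (assuming the openml Python package and that this is dataset 40706):

import openml

# Fetch the full dataframe (features plus target) and count exact duplicate rows
dataset = openml.datasets.get_dataset(40706)
df, _, _, _ = dataset.get_data()
print(len(df))                # 1124 rows
print(df.duplicated().sum())  # 100, if the counting above is right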

The dataset is marked as "validated" and uploaded by @PGijsbers, so he might know more.
It would also be great to have a dataset description.

Thanks!

ps: is there a way to change the dataset version on the OpenML website right now? I'm not sure I'm looking at the most recent version. cc @joaquinvanschoren

@PGijsbers

I think the "verified" status only means that the file was processed correctly by the server; I believe it corresponds to "active" on the old website. Assuming you are referring to dataset 40706, I uploaded that dataset from the PMLB. Based on their documentation, they don't have explicit train/test splits. The Parity5+5 dataset they have has the same issues and also has no description. Therefore I would assume it was an error on their end. It would probably be good to open an issue on their repository; hopefully they can address the issue (or give an explanation) and also double-check their other datasets.

@joaquinvanschoren Why was "active" changed to "verified"? I think "verified" might give the impression there is some kind of quality control here.

@PGijsbers PGijsbers added the Bad Data label Oct 2, 2023
@PGijsbers PGijsbers changed the title Explanation of parity5_plus_5 and potential issues [40706 ] Explanation of parity5_plus_5 and potential issues Oct 2, 2023
@PGijsbers PGijsbers changed the title [40706 ] Explanation of parity5_plus_5 and potential issues [40706] Explanation of parity5_plus_5 and potential issues Oct 2, 2023
@PGijsbers PGijsbers added the Documentation label Oct 2, 2023
@joaquinvanschoren

I kept getting questions from people about what 'active' means, and I always had to explain that it meant the dataset was verified by some automated tests. If you have a better word for it, I'm happy to change it.

@PGijsbers

If the only statuses are "in processing", "deactivated", or "active", why visibly show "active" (or "verified") on the website at all?
When the dataset is "in processing" or "deactivated", the user should be informed, but for the expected status ("active") I don't think we need to show additional confirmation to the user.

@joaquinvanschoren

It's also a filter option, so I guess we should have an intuitive name for all non-in-preparation, non-deactivated datasets.
For 'de-activated' I think 'deprecated' is a better word.

@amueller
Author

amueller commented Oct 2, 2023

hm maybe "valid"? Though I think active is not so bad. @PGijsbers do you want to follow up with them or should I?

@amueller
Author

amueller commented Oct 2, 2023

Actually, I'm not sure what the dataset is. It clearly counts up in binary, but I'm not sure whether the left-most bit is 2^0 or the right-most bit is 2^0, so there are at least two ways to decode each row into an integer. But I don't see how to get from that integer to the class.
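
For what it's worth, both decodings are easy to write down; a rough sketch, assuming the ten bit columns are named Bit_0 through Bit_9 and data is the dataframe:

bit_cols = [f'Bit_{i}' for i in range(10)]
bits = data[bit_cols].astype(int)

# Candidate 1: the left-most column is the most significant bit (2^9)
left_is_msb = bits.dot([2 ** (9 - i) for i in range(10)])
# Candidate 2: the left-most column is the least significant bit (2^0)
left_is_lsb = bits.dot([2 ** i for i in range(10)])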

@PGijsbers

PGijsbers commented Oct 3, 2023

Based on the name, I would assume it's two 5-bit integers which then get added? But even then I wouldn't know how to construct the class. If you have the time to follow up with PMLB, I would appreciate that a lot :)

@amueller
Author

amueller commented Oct 10, 2023

I have a solution to the dataset, but no explanation:

# class equals the parity (sum mod 2) of these five bit columns
data['class'] == data[['Bit_2', 'Bit_3', 'Bit_4', 'Bit_6', 'Bit_8']].sum(axis=1) % 2

I opened EpistasisLab/pmlb#179

@amueller
Author

amueller commented Oct 10, 2023

I think the "plus" refers to the fact that there are 5 extra bits that are pure noise and should be ignored. The dataset is solvable (as opposed to parity5, which doesn't seem solvable without extra knowledge) because once the model has figured out which columns to ignore, the remaining patterns repeat. So it's a dataset for checking feature selection.
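
A rough sketch of that reading, assuming the column names from the expression above (Bit_2, Bit_3, Bit_4, Bit_6 and Bit_8 relevant, the other five noise):

signal = ['Bit_2', 'Bit_3', 'Bit_4', 'Bit_6', 'Bit_8']
parity = data[signal].astype(int).sum(axis=1) % 2

# The class should be exactly the parity of the five relevant bits
print((data['class'].astype(int) == parity).all())

# Once the noise bits are ignored, each of the 2^5 = 32 remaining patterns maps
# to a single class, so the duplicated patterns are what make the mapping learnable
print(data.groupby(signal)['class'].nunique().max())  # should be 1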

@amueller
Author

The original paper apparently doesn't mention the duplicate rows: EpistasisLab/pmlb#179 (comment)

@PGijsbers

Thanks for getting in touch and letting us know! I guess we can keep this version active with some kind of notice, and make a newer version of this dataset with duplicate rows removed?
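
(The cleanup itself would presumably just be a drop of exact duplicates before re-uploading, e.g. df.drop_duplicates().reset_index(drop=True) on the full dataframe, assuming duplicated rows always carry the same class.)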

@amueller
Author

That was my plan, though it's not gonna be that useful until dataset versions are visible again: openml/openml.org#95
