
[40706] Explanation of parity5_plus_5 and potential issues #54

Open
amueller opened this issue Sep 29, 2023 · 12 comments
Labels
Bad Data: Dataset contains bad data that is not marked, such as duplicated rows.
Documentation: Something is wrong or missing in a dataset description.

Comments

@amueller

Hey! So I'm trying to understand parity5_plus_5 and I'm a bit confused.
It has 1124 rows, but the data counts up from 0 to 1023 in binary, so there are 100 duplicate rows. Is that on purpose?
I know OpenML contains specific train/test splits and these might account for the duplication but a lot of people use the datasets without the splits, like me and @SamuelGabriel and @noahho.
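
For reference, a rough sketch of how the duplicates can be counted (assuming the openml Python package and that this is dataset 40706):

import openml

# Fetch the full dataframe (features plus target) and count exact duplicate rows
dataset = openml.datasets.get_dataset(40706)
df, _, _, _ = dataset.get_data()
print(len(df))                # 1124 rows
print(df.duplicated().sum())  # 100, if the counting above is right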

The dataset is marked as "validated" and uploaded by @PGijsbers, so he might know more.
It would also be great to have a dataset description.

Thanks!

ps: is there a way to change the dataset version on the OpenML website right now? I'm not sure I'm looking at the most recent version. cc @joaquinvanschoren

@PGijsbers

I think the "verified" status only means that the file was processed correctly by the server; I believe it corresponds to "active" on the old website. Assuming you are referring to dataset 40706, I uploaded that dataset from the PMLB. Based on their documentation, they don't have explicit train/test splits. The Parity5+5 dataset they have has the same issues and also has no description. Therefore I would assume it was an error on their end. It would probably be good to open an issue on their repository; hopefully they can address the issue (or give an explanation) and also double-check their other datasets.

@joaquinvanschoren Why was "active" changed to "verified"? I think "verified" might give the impression there is some kind of quality control here.

@PGijsbers PGijsbers added the Bad Data label Oct 2, 2023
@PGijsbers PGijsbers changed the title Explanation of parity5_plus_5 and potential issues [40706 ] Explanation of parity5_plus_5 and potential issues Oct 2, 2023
@PGijsbers PGijsbers changed the title [40706 ] Explanation of parity5_plus_5 and potential issues [40706] Explanation of parity5_plus_5 and potential issues Oct 2, 2023
@PGijsbers PGijsbers added the Documentation label Oct 2, 2023
@joaquinvanschoren

I kept getting questions from people about what 'active' means, and I always had to explain that it meant the dataset was verified by some automated tests. If you have a better word for it, I'm happy to change it.

@PGijsbers

If the only statuses are "in processing", "deactivated", or "active", why visibly show "active" (or "verified") on the website at all?
When the dataset is "in processing" or "deactivated", the user should be informed, but for the expected status ("active") I don't think we need to show additional confirmation to the user.

@joaquinvanschoren

It's also a filter option, so I guess we should have an intuitive name for all non-in-preparation, non-deactivated datasets.
For 'de-activated' I think 'deprecated' is a better word.

@amueller
Author

amueller commented Oct 2, 2023

hm maybe "valid"? Though I think active is not so bad. @PGijsbers do you want to follow up with them or should I?

@amueller
Author

amueller commented Oct 2, 2023

Actually, I'm not sure what the dataset is. It clearly counts up in binary, but I'm not sure whether the left-most bit is 2^0 or the right-most bit is 2^0, so there are at least two ways to decode each row into an integer. But I don't see how to get from that integer to the class.
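
For what it's worth, both decodings are easy to write down; a rough sketch, assuming the ten bit columns are named Bit_0 through Bit_9 and data is the dataframe:

bit_cols = [f'Bit_{i}' for i in range(10)]
bits = data[bit_cols].astype(int)

# Candidate 1: the left-most column is the most significant bit (2^9)
left_is_msb = bits.dot([2 ** (9 - i) for i in range(10)])
# Candidate 2: the left-most column is the least significant bit (2^0)
left_is_lsb = bits.dot([2 ** i for i in range(10)])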

@PGijsbers

PGijsbers commented Oct 3, 2023

Based on the name, I would assume it's two 5-bit integers which then get added? But even then I wouldn't know how to construct the class. If you have the time to follow up with PMLB, I would appreciate that a lot :)

@amueller
Author

amueller commented Oct 10, 2023

I have a solution to the dataset, but no explanation:

# class equals the parity (sum mod 2) of these five bit columns
data['class'] == data[['Bit_2', 'Bit_3', 'Bit_4', 'Bit_6', 'Bit_8']].sum(axis=1) % 2

I opened EpistasisLab/pmlb#179

@amueller
Author

amueller commented Oct 10, 2023

I think the "plus" refers to the fact that there are 5 extra bits that are pure noise and should be ignored. The dataset is solvable (as opposed to parity5, which doesn't seem solvable without extra knowledge) because once the model has figured out which columns to ignore, the remaining patterns repeat. So it's a dataset for checking feature selection.
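
A rough sketch of that reading, assuming the column names from the expression above (Bit_2, Bit_3, Bit_4, Bit_6 and Bit_8 relevant, the other five noise):

signal = ['Bit_2', 'Bit_3', 'Bit_4', 'Bit_6', 'Bit_8']
parity = data[signal].astype(int).sum(axis=1) % 2

# The class should be exactly the parity of the five relevant bits
print((data['class'].astype(int) == parity).all())

# Once the noise bits are ignored, each of the 2^5 = 32 remaining patterns maps
# to a single class, so the duplicated patterns are what make the mapping learnable
print(data.groupby(signal)['class'].nunique().max())  # should be 1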

@amueller
Author

The original paper apparently doesn't mention the duplicate rows: EpistasisLab/pmlb#179 (comment)

@PGijsbers

Thanks for getting in touch and letting us know! I guess we can keep this version active with some kind of notice, and make a newer version of this dataset with duplicate rows removed?
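
(The cleanup itself would presumably just be a drop of exact duplicates before re-uploading, e.g. df.drop_duplicates().reset_index(drop=True) on the full dataframe, assuming duplicated rows always carry the same class.)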

@amueller
Author

That was my plan, though it's not gonna be that useful until dataset versions are visible again: openml/openml.org#95
