[40706] Explanation of parity5_plus_5 and potential issues #54
Comments
I think the "verified" status only means that the file was processed correctly by the server. I believe it corresponds to "active" on the old website. Assuming you are referring to dataset 40706, I uploaded that dataset from the PMLB. Based on their documentation, they don't have explicit train/test splits. The Parity5+5 dataset they have also has the same issues and has no description. Therefore I would assume it was an error on their end. It would probably be good to open an issue on their repository; hopefully they can address the issue (or give an explanation) and also double-check their other datasets. @joaquinvanschoren Why was "active" changed to "verified"? I think "verified" might give the impression there is some kind of quality control here.
I kept getting questions from people about what 'active' means, and I always had to explain that it meant the dataset had been verified by some automated tests. If you have a better word for it, I'm happy to change it.
If the only statuses are "in processing", "deactivated", or "active", why visibly show "active" (or "verified") on the website at all?
It's also a filter option, so I guess we should have an intuitive name for all non-in-preparation, non-deactivated datasets.
Hm, maybe "valid"? Though I think "active" is not so bad. @PGijsbers do you want to follow up with them or should I?
Actually, I'm not sure what the dataset is. It clearly counts binary numbers, but I'm not sure if the left-most bit is 2^0 or if the right-most bit is 2^0. So there are at least two ways to decode it to an integer. But I don't see how to get from that integer to the class.
Based on the name, I would assume it's two 5-bit integers which then get added? But even then I wouldn't know how to construct the class. If you have the time to follow up with PMLB, I would appreciate that a lot :)
I have a solution to the dataset, but no explanation: data['class'] == data[['Bit_2', 'Bit_3', 'Bit_4', 'Bit_6', 'Bit_8']].sum(axis=1) % 2 holds for the whole dataset. I opened EpistasisLab/pmlb#179.
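The rule above says the class is the parity (sum modulo 2) of five of the ten bit columns. Below is a minimal sketch of how such a rule could be recovered by brute force. It runs on a synthetic stand-in table rather than the real download. The zero-based column names Bit_0 to Bit_9 and all helper names are assumptions for illustration, and this is not necessarily how the solution in this thread was found.

```python
from itertools import combinations, product

# Assumption: ten binary columns named Bit_0..Bit_9 (the thread only confirms
# that names such as Bit_2 and Bit_8 exist).
COLUMNS = [f"Bit_{i}" for i in range(10)]
RELEVANT = ("Bit_2", "Bit_3", "Bit_4", "Bit_6", "Bit_8")  # from the comment above


def parity(row, columns):
    """Return the sum of the given columns modulo 2."""
    return sum(row[c] for c in columns) % 2


def make_synthetic_rows():
    """Build all 1024 ten-bit rows and label them with the rule from the thread."""
    rows = []
    for bits in product((0, 1), repeat=len(COLUMNS)):
        row = dict(zip(COLUMNS, bits))
        row["class"] = parity(row, RELEVANT)
        rows.append(row)
    return rows


def find_parity_subsets(rows):
    """Return every non-empty column subset whose parity equals the class on all rows."""
    matches = []
    for size in range(1, len(COLUMNS) + 1):
        for subset in combinations(COLUMNS, size):
            if all(parity(row, subset) == row["class"] for row in rows):
                matches.append(subset)
    return matches


rows = make_synthetic_rows()
print(find_parity_subsets(rows))
```

On the full table of 1024 distinct rows, only one subset can match, because the parities of two different subsets disagree on half of all inputs.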
I think the "plus" refers to the fact that there are 5 extra bits that are just noise and should be ignored. The dataset is solvable (as opposed to parity5, which doesn't seem solvable without extra knowledge) because once the model has figured out which columns to ignore, there are duplicates. So it's a dataset that checks feature selection.
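The point about duplicates can be illustrated on a synthetic table: once the five noise columns are dropped, only 2^5 = 32 distinct inputs remain, so each of them recurs many times. A sketch, assuming the relevant bits sit at zero-based positions 2, 3, 4, 6 and 8 (taken from the column names in the earlier comment; the positional mapping is an assumption):

```python
from collections import Counter
from itertools import product

# Assumption: the relevant bits sit at zero-based positions 2, 3, 4, 6 and 8,
# mirroring the column names Bit_2, Bit_3, Bit_4, Bit_6, Bit_8 from the thread.
RELEVANT_POSITIONS = (2, 3, 4, 6, 8)

# Synthetic stand-in: every 10-bit input exactly once (no real data is loaded).
all_inputs = list(product((0, 1), repeat=10))

# Keep only the relevant bits, i.e. what a model sees after ignoring the noise columns.
projected = Counter(tuple(bits[i] for i in RELEVANT_POSITIONS) for bits in all_inputs)

print(len(all_inputs))          # 1024 full 10-bit rows
print(len(projected))           # 32 distinct rows after dropping the noise bits
print(set(projected.values()))  # {32}: each projected row occurs 32 times
```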
The original paper apparently doesn't mention the duplicate rows: EpistasisLab/pmlb#179 (comment)
Thanks for getting in touch and letting us know! I guess we can keep this version active with some kind of notice, and make a newer version of this dataset with duplicate rows removed?
That was my plan, though it's not gonna be that useful until dataset versions are visible again: openml/openml.org#95
Hey! So I'm trying to understand parity5_plus_5 and I'm a bit confused.
It has 1124 rows, but the data counts up from 0 to 1023 in binary, so there are 100 duplicate rows. Is that on purpose?
I know OpenML contains specific train/test splits, and these might account for the duplication, but a lot of people use the datasets without the splits, like me and @SamuelGabriel and @noahho.
The dataset is marked as "validated" and uploaded by @PGijsbers, so he might know more.
It would also be great to have a dataset description.
Thanks!
PS: is there a way to change the dataset version on the OpenML website right now? I'm not sure I'm looking at the most recent version. cc @joaquinvanschoren
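The arithmetic in the question (1124 rows, 1024 possible 10-bit patterns, hence 100 repeats) can be checked with a small duplicate count. The sketch below uses a synthetic list of rows. Which rows are repeated in the real dataset is not stated in this thread, so the 100 repeated rows here are chosen arbitrarily for illustration, and the helper name is hypothetical.

```python
from collections import Counter
from itertools import product


def count_duplicates(rows):
    """Return (total rows, distinct rows, surplus rows beyond the first copy)."""
    counts = Counter(rows)
    total = sum(counts.values())
    distinct = len(counts)
    return total, distinct, total - distinct


# Synthetic stand-in: all 1024 ten-bit patterns, plus the first 100 repeated once.
# The choice of repeated rows is hypothetical.
unique_rows = list(product((0, 1), repeat=10))
rows = unique_rows + unique_rows[:100]

print(count_duplicates(rows))  # (1124, 1024, 100)

# Removing duplicates while keeping first-seen order:
deduplicated = list(dict.fromkeys(rows))
print(len(deduplicated))  # 1024
```

On a pandas DataFrame, the comparable checks would be `DataFrame.duplicated().sum()` and `DataFrame.drop_duplicates()`.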