Questions about parity5_plus_5 #179
Comments
@ryanurbs do you happen to know the equation for this dataset?
I think the explanation is actually just that there's a subset of 5 bits whose parity is computed and the other bits are ignored. But I'm still confused by the duplication of some rows.
@lacava @amueller I found a published description of the parity5+5 problem here: https://sci2s.ugr.es/keel/pdf/algorithm/congreso/liu-3.pdf You are indeed correct that only 5 of the features are relevant (Bits 2,3,4,6,8) and the other 5 are randomly generated. The underlying predictive pattern in this dataset is that if there is an even number of zeros across those five features, the outcome is 1; otherwise it is 0. I'm not sure why there are redundant rows in this dataset, since the paper above also describes 1024 unique rows. I'm not certain of the exact origins of this particular dataset, so it might not be possible to track down where the extra rows came from, but you could simply remove the redundant rows, depending on what experiment you are looking to run. The name parity5+5 comes from the fact that this dataset is basically the original parity5 problem with 5 irrelevant features added to it.
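For illustration, here is a minimal Python sketch of the rule described above. It is not taken from the paper or from the dataset's source; the function name `parity5_plus_5_label` is hypothetical, and it assumes that "Bits 2,3,4,6,8" are 1-based positions within a 10-bit row, which the thread does not confirm.

```python
from itertools import product

# Assumption: "Bits 2,3,4,6,8" are 1-based positions in a 10-bit row.
RELEVANT_BITS = (2, 3, 4, 6, 8)


def parity5_plus_5_label(row, relevant=RELEVANT_BITS):
    """Return 1 if the relevant bits contain an even number of zeros, else 0."""
    zeros = sum(1 for pos in relevant if row[pos - 1] == 0)
    return 1 if zeros % 2 == 0 else 0


# Enumerate all 2**10 bit patterns, i.e. the 1024 unique rows.
rows = list(product((0, 1), repeat=10))
labels = [parity5_plus_5_label(r) for r in rows]
```

Under this assumption, flipping any of the five remaining bits leaves the label unchanged, which is what makes them irrelevant features.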
@ryanurbs thank you for the explanation. Interesting to know that the published version only has 1024 rows, so this might have been some processing mix-up along the way. Feel free to close. I was asking for
Would it be possible to get a description of the parity5_plus_5 dataset? There are several things that are confusing about it for me. First, there are some duplicate rows, which seems odd. The rows count from 0 to 1023 in binary, and there are 1124 rows in the dataset, meaning there are 100 duplicate rows.
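As a rough sketch of how such a duplicate count can be checked, the snippet below counts repeated rows and drops them using only the Python standard library. The helper names `duplicate_summary` and `drop_duplicates` are hypothetical, and the demo data is synthetic (1024 unique patterns plus 100 repeats), not the actual dataset file.

```python
from collections import Counter
from itertools import product


def duplicate_summary(rows):
    """Return (number of unique rows, number of extra duplicate rows)."""
    counts = Counter(tuple(r) for r in rows)
    extra = sum(c - 1 for c in counts.values())
    return len(counts), extra


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for r in rows:
        key = tuple(r)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Synthetic stand-in for the dataset: all 10-bit patterns plus 100 repeats.
all_patterns = list(product((0, 1), repeat=10))
demo_rows = all_patterns + all_patterns[:100]
n_unique, n_extra = duplicate_summary(demo_rows)
deduplicated = drop_duplicates(demo_rows)
```

On the synthetic data this reports 1024 unique rows and 100 extras, matching the arithmetic 1124 - 1024 = 100.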
Also, I'm not sure I understand the name of the dataset. The equation for the class label seems to be
but I'm not sure what the intuition behind this is or how it relates to the name. I assume there's some simple binary formula behind this, but I don't immediately see it.
Or is it just referring to the fact that the other five bits don't influence the outcome?