Questions about parity5_plus_5 #179

amueller · 2023-10-10T18:31:11Z

Would it be possible to get a description of the parity5_plus_5 dataset? There are several things about it that confuse me.
First, there are some duplicate rows, which seems odd. The rows count from 0 to 1023 in binary, and there are 1124 rows in the dataset, meaning there are 100 duplicate rows.

Also, I'm not sure I understand the name of the dataset. The equation for the class label seems to be

data['class'] == data[['Bit_2', 'Bit_3', 'Bit_4', 'Bit_6', 'Bit_8']].sum(axis=1) % 2

but I'm not sure what the intuition behind this is or how it relates to the name. I assume there's some simple binary formula behind this, but I don't immediately see it.
Or is it just referring to the fact that the other five bits don't influence the outcome?
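A minimal sketch of the two checks described above (counting duplicate rows and testing the label equation), run on a synthetic stand-in rather than the real file. The column names `Bit_1` through `Bit_10` are an assumption; only `Bit_2`, `Bit_3`, `Bit_4`, `Bit_6`, `Bit_8`, and `class` appear in this thread. Which 100 rows are repeated is also made up for illustration.

```python
import itertools

import pandas as pd

# Assumption: features are named Bit_1 .. Bit_10. Only the five names in
# RELEVANT and the 'class' column are confirmed by the thread.
FEATURES = [f"Bit_{i}" for i in range(1, 11)]
RELEVANT = ["Bit_2", "Bit_3", "Bit_4", "Bit_6", "Bit_8"]

# Synthetic stand-in: every 10-bit pattern exactly once (1024 rows),
# labelled with the parity of the five relevant bits.
data = pd.DataFrame(list(itertools.product([0, 1], repeat=10)), columns=FEATURES)
data["class"] = data[RELEVANT].sum(axis=1) % 2

# Arbitrarily repeat the first 100 rows to mimic a 1124-row file.
data = pd.concat([data, data.iloc[:100]], ignore_index=True)

# Check 1: rows that repeat an earlier row.
n_rows = len(data)
n_duplicates = int(data.duplicated().sum())

# Check 2: does the label equal the parity of the relevant bits on every row?
rule_holds = bool((data["class"] == data[RELEVANT].sum(axis=1) % 2).all())

# Dropping the repeats leaves one row per 10-bit pattern.
deduplicated = data.drop_duplicates().reset_index(drop=True)

print(n_rows, n_duplicates, rule_holds, len(deduplicated))
```

On the stand-in, `rule_holds` is true by construction; on the real dataset the same expression is the actual test of the equation.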

lacava · 2023-10-23T13:18:39Z

@ryanurbs do you happen to know the equation for this dataset?

amueller · 2023-10-23T18:36:28Z

I think the explanation is actually just that there's a subset of 5 bits whose parity is computed, and the other bits are ignored. But I'm still confused by the duplication of some rows.

ryanurbs · 2023-10-23T19:20:24Z

@lacava @amueller I'm looking into getting a definitive answer to your question. We received this dataset from a colleague.

ryanurbs · 2023-10-23T20:18:41Z

@lacava @amueller I found a published description of the parity5+5 problem here: https://sci2s.ugr.es/keel/pdf/algorithm/congreso/liu-3.pdf

You are indeed correct that only 5 of the features are relevant (Bits 2, 3, 4, 6, 8) and the other 5 are randomly generated. The underlying predictive pattern in this dataset is that if there is an even number of zeros across those five features, the outcome is 1; otherwise it is 0. The name parity5+5 comes from the fact that this dataset is basically the original parity5 problem with 5 irrelevant features added to it.

I'm not sure why there are extra redundant rows in this dataset, as there should be 1024 unique rows, as the above paper also describes. I'm not certain of the exact origins of this particular dataset, so it might not be possible to track down where the extra rows came from, but you might just remove the redundant rows, depending on what experiment you are looking to run.
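The "even number of zeros" rule and the `sum(...) % 2` equation from earlier in the thread describe the same labels: with five bits, an even count of zeros means an odd count of ones. A small sketch that checks this over all 32 patterns of the relevant bits (the function names are hypothetical, chosen for illustration):

```python
import itertools


def label_even_zeros(bits):
    """Rule as worded above: 1 if the count of zeros is even, else 0."""
    return 1 if bits.count(0) % 2 == 0 else 0


def label_sum_mod_2(bits):
    """Equation from earlier in the thread: sum of the bits modulo 2."""
    return sum(bits) % 2


# All 2**5 = 32 assignments of the five relevant bits.
all_patterns = list(itertools.product([0, 1], repeat=5))

# True only if the two formulations give the same label on every pattern.
agree = all(label_even_zeros(p) == label_sum_mod_2(p) for p in all_patterns)
print(len(all_patterns), agree)
```

The agreement depends on the number of relevant bits being odd; with an even number of bits the two rules would give opposite labels.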

amueller · 2023-10-23T21:07:27Z

@ryanurbs thank you for the explanation. Interesting to know that the published version only has 1024 rows, so this might have been some processing mix-up along the way. Feel free to close. I was asking for openml.org where we might decide to drop the duplicate rows in a new version of the dataset.

amueller mentioned this issue Oct 10, 2023

[40706] Explanation of parity5_plus_5 and potential issues openml/openml-data#54

Open