This repository contains the code for the paper "Fast Algorithms for Denial Constraint Discovery".
Before building the algorithms, make sure to install the following prerequisites:
- Java JDK 1.8 or later
- Maven 3.1.0 or later
- Git
- Boost (only for enumeration with the MMCS algorithm)
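As a quick sanity check before building, the sketch below reports whether the command-line prerequisites listed above are on the PATH. The helper name check_tool is an illustrative assumption, not part of fdcd, and Boost is not checked because it is a library rather than a command.

```shell
#!/bin/sh
# Sketch: report whether each required command-line tool is on the PATH.
# check_tool is a hypothetical helper, not part of the fdcd repository.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

missing=0
for tool in java mvn git; do
    check_tool "$tool" || missing=1
done

[ "$missing" -eq 0 ] || echo "Install the missing tools before building."

# Versions still need to be compared by hand against the list above:
#   java -version
#   mvn -version
#   git --version
```

The script only checks that the tools exist; it does not parse version numbers.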
As the first step, clone this repository:
$ git clone https://github.com/EduardoPena/fdcd.git
$ cd fdcd
Then, build fdcd with the following Maven command:
.../fdcd$ mvn clean install
The command above will create a "fat" jar called discoverDCs.jar and place it into the target folder.
DC enumeration with the MMCS algorithm requires a C++ implementation, found in: MHS generation algorithms. If you want to use it, follow the instructions to build the executable (we use the default name, agdmhs). Then, copy the executable agdmhs into the folder containing the fdcd jar (e.g., target).
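The copy step can be sketched as a small shell function. The function name install_agdmhs and the path in the usage comment are hypothetical; where the agdmhs executable ends up depends on how you built the MHS generation algorithms.

```shell
#!/bin/sh
# Sketch: copy a built agdmhs executable next to the fdcd jar.
# install_agdmhs is a hypothetical helper, not part of the fdcd repository.
#   $1: path to the agdmhs executable you built
#   $2: folder containing discoverDCs.jar (defaults to target)
install_agdmhs() {
    src="$1"
    dest_dir="${2:-target}"

    if [ ! -x "$src" ]; then
        echo "not an executable file: $src" >&2
        return 1
    fi
    if [ ! -f "$dest_dir/discoverDCs.jar" ]; then
        echo "discoverDCs.jar not found in $dest_dir; build fdcd first" >&2
        return 1
    fi

    # Keep the default name agdmhs, as stated above.
    cp "$src" "$dest_dir/agdmhs"
}

# Usage (the source path is a placeholder):
#   install_agdmhs /path/to/agdmhs target
```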
Once you have compiled the code, you can run the discovery, for example:
.../fdcd$ java -jar target/discoverDCs.jar data/tax.csv
The only required parameter is the dataset. See the data/ folder for sample .csv files.
Additionally, you can specify three optional parameters:
-n: number of rows. For example, the following command executes the discovery with the first 10000 rows of the dataset:
.../fdcd$ java -jar target/discoverDCs.jar data/tax.csv -n 10000
-o: output file path. If -o is not specified, the program only shows the number of results. The following command saves the discovered DCs in the taxdcs.out file:
.../fdcd$ java -jar target/discoverDCs.jar data/tax.csv -n 10000 -o taxdcs.out
-e: enumeration method to be used with the ECP algorithm. The following algorithms are available: INCS, EI, HEI, MMCS, HMMCS, MCS (check the paper for technical details). The default is INCS. For example, the following command runs the discovery using the HEI enumeration algorithm:
.../fdcd$ java -jar target/discoverDCs.jar data/tax.csv -n 10000 -o taxdcs.out -e HEI
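To compare enumeration methods on the same input, the options above can be combined in a loop. The following is a minimal sketch, not a script shipped with fdcd: the helper run_or_print, the environment variables, and the per-method output file names are assumptions made for illustration. By default it only prints the commands; set RUN=1 to execute them. MMCS and HMMCS are left out of the loop because MMCS needs the agdmhs executable, and it is assumed here (not confirmed) that HMMCS might as well.

```shell
#!/bin/sh
# Sketch: run the discovery once per enumeration method.
# All variable names and the output naming scheme are illustrative assumptions.
JAR="${JAR:-target/discoverDCs.jar}"
DATASET="${DATASET:-data/tax.csv}"
ROWS="${ROWS:-10000}"

# Print the command by default; execute it only when RUN=1.
run_or_print() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Add MMCS (and HMMCS) here once agdmhs sits next to the jar.
for method in INCS EI HEI MCS; do
    run_or_print java -jar "$JAR" "$DATASET" \
        -n "$ROWS" -o "taxdcs_${method}.out" -e "$method"
done
```

Run it from the fdcd folder so that the relative jar and dataset paths resolve.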
src/: the Java implementation of fdcd
data/: a sample of the datasets used for experiments
This repository contains only sample datasets. The full datasets used in the paper are hosted here.
We compare our algorithms with state-of-the-art algorithms found here. These algorithms are integrated with Metanome, a specialized data profiling platform. We intend to integrate our algorithm into the platform soon.