This project provides a dataset of vulnerabilities in open source projects, as published at the Mining Software Repositories (MSR) 2018 conference.
This README explains how researchers can use this repository to:
- import the existing dataset of vulnerabilities, or
- build the dataset from scratch using the provided source code.
The dataset directory contains the SQL dump of the VulinOSS database. The dump is a self-contained file that includes the database schema, so it can be restored directly in one step. The repository also includes a notebook that demonstrates how the dataset can be queried to extract interesting information.
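As a minimal sketch of the one-step restore, the snippet below pipes a dump file into the standard `mysql` command-line client. It assumes a MySQL-compatible server (suggested by the pymysql requirement, but not stated in this README) and a target database that already exists. The user name, database name, and dump file name are hypothetical placeholders, not names taken from this repository.

```python
import subprocess


def restore_command(user, database):
    """Build the mysql client invocation; -p makes the client prompt for the password."""
    return ["mysql", "-u", user, "-p", database]


def restore_dump(dump_path, user, database):
    """Feed the SQL dump to the mysql client on standard input."""
    with open(dump_path, "rb") as dump:
        subprocess.run(restore_command(user, database), stdin=dump, check=True)


# Hypothetical usage; replace every value with your own:
# restore_dump("dataset/dump.sql", "root", "vulinoss")
```

This is equivalent to running `mysql -u USER -p DATABASE < DUMP_FILE` in a shell.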
The src directory contains the Python scripts and the .csv data files needed to generate the VulinOSS dataset.
The prerequisites for running the analysis are the following:
- Python 3
- (For Windows users) a Unix-like command-line interface like Cygwin or Git Bash is required.
- Perl
- Count Lines of Code (cloc), a tool that counts blank lines, comment lines, and physical lines of source code in many programming languages. The cloc Perl script should be stored at the following path (create the lib directory if it doesn't exist):
  lib/cloc.pl
The following Python modules are also required:
- pymysql
- colorama
- codecs (part of the Python standard library, so no separate installation is needed)
- jupyter (if you want to run the provided notebook)
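The prerequisites above can be checked before running the analysis. The following is a minimal sketch, not part of the repository: it looks for `perl` on the PATH, for `lib/cloc.pl` relative to a base directory, and for the required Python modules. The function name `missing_prerequisites` is hypothetical, and resolving `lib/cloc.pl` against the current working directory is an assumption.

```python
import importlib.util
import shutil
from pathlib import Path

REQUIRED_MODULES = ["pymysql", "colorama", "codecs"]
CLOC_PATH = Path("lib") / "cloc.pl"


def missing_prerequisites(base_dir="."):
    """Return a list of problems found; the list is empty if every check passes."""
    problems = []
    # Perl must be callable from the command line.
    if shutil.which("perl") is None:
        problems.append("perl was not found on PATH")
    # Assumption: lib/cloc.pl is resolved relative to base_dir.
    if not (Path(base_dir) / CLOC_PATH).is_file():
        problems.append(f"{CLOC_PATH} was not found under {base_dir}")
    # find_spec returns None when a top-level module cannot be imported.
    for module in REQUIRED_MODULES:
        if importlib.util.find_spec(module) is None:
            problems.append(f"Python module {module} is not installed")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

The check for jupyter is left out because it is only needed for the notebook.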
To generate the VulinOSS dataset, follow these steps:
- Generate the VulinOSS db schema with the schema_generator.sql script, which is located in the src/data directory.
- Clone the project repositories locally. The repo_downloader.sh script, located in the src/vulinoss directory, automates this process when given highest_cve_rated_oss.csv as input. Note that if you perform this step manually, each local repository directory must be named after a substring of the repository's URL, with the / symbols replaced by _. For example, https://github.com/owncloud/core.git should be stored as owncloud_core.git.
- Execute the Python script that parses the NVD JSON files and stores the matches in the database. It takes the arguments listed below. Note that the database credentials must be changed in the nvd_json_parser.py script.
    usage: nvd_json_parser.py [-h] [-m PROJECT_NAME_MAPPING] [-w]
                              [-cb CONNECT_TO_CODE_BASE]
                              cve_feed_directory oss_list

    positional arguments:
      cve_feed_directory    The directory which contains the JSON feed files
      oss_list              The csv with the most vulnerable open source systems

    optional arguments:
      -h, --help            show this help message and exit
      -m --project_name_mapping PROJECT_NAME_MAPPING
                            The csv file that matches alternative project names
                            to their main names
      -w, --write_to_db     Writes to the database
      -cb --connect_to_code_base CONNECT_TO_CODE_BASE
                            Scans the local repositories for connecting NVD
                            versions to repository snapshots
- Finally, if -cb was used in the previous step, you can retrieve code metrics for every project release by executing the following Python script:
    usage: code_metrics_retriever.py [-h] [-w WRITE_TO_FILE]
                                     oss_list repository_root_directory

    positional arguments:
      oss_list              The csv with the list of the projects to be
                            retrieved from the database and analyzed
      repository_root_directory
                            The root directory that contains the downloaded
                            repositories

    optional arguments:
      -h, --help            show this help message and exit
      -w, --write_to_file WRITE_TO_FILE
                            The output csv file
Note that this step generates SQL INSERT statements and does not store the information directly in the database.
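The steps above can be sketched in Python as follows. This is an illustrative sketch, not code from the repository: the helper names are hypothetical, the directory-naming helper assumes that the URL path after the host is the substring to use, and the command builders assume the scripts are run from the directory that contains them. The value passed to `-cb` is forwarded unchanged, because this README does not specify what it should be.

```python
import sys
from urllib.parse import urlparse


def local_repo_dir(repo_url):
    """Derive a local directory name from a repository URL.

    Assumption: the URL path (everything after the host) is used,
    with "/" replaced by "_".
    """
    return urlparse(repo_url).path.strip("/").replace("/", "_")


def parser_command(cve_feed_directory, oss_list, mapping_csv=None,
                   write_to_db=False, connect_to_code_base=None):
    """Build the argument list for nvd_json_parser.py from its usage text."""
    cmd = [sys.executable, "nvd_json_parser.py"]
    if mapping_csv is not None:
        cmd += ["-m", mapping_csv]
    if write_to_db:
        cmd.append("-w")
    if connect_to_code_base is not None:
        # Passed through as-is; its expected value is not documented here.
        cmd += ["-cb", connect_to_code_base]
    return cmd + [cve_feed_directory, oss_list]


def metrics_command(oss_list, repository_root_directory, output_csv=None):
    """Build the argument list for code_metrics_retriever.py."""
    cmd = [sys.executable, "code_metrics_retriever.py"]
    if output_csv is not None:
        cmd += ["-w", output_csv]
    return cmd + [oss_list, repository_root_directory]


print(local_repo_dir("https://github.com/owncloud/core.git"))  # owncloud_core.git
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.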
This work is licensed under a Creative Commons Attribution 4.0 International License.