VulinOSS - Vulnerabilities in open-source systems

This project provides a dataset of vulnerabilities in open-source projects, as published at the Mining Software Repositories (MSR) 2018 conference.

This README file explains how researchers can use this repository to:

import the existing dataset of vulnerabilities, or
use the provided source code to build the dataset from scratch.

Import the VulinOSS dataset

The dataset directory contains the SQL dump of the VulinOSS database. It is a self-contained file that includes the database schema, so it can be restored directly in one step. The repository also includes a notebook that demonstrates how the dataset can be queried to extract information.
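
As a rough illustration of the one-step restore, the sketch below pipes a dump file into the mysql command-line client. It assumes a MySQL-compatible server (suggested by the pymysql requirement, but not stated explicitly), an installed mysql client, and a target database that already exists. The function names, the database name, and the dump path are hypothetical.

```python
import subprocess


def build_restore_command(database, user, host="localhost"):
    """Build the mysql client invocation that reads SQL from standard input.

    A bare --password makes the client prompt for the password interactively.
    """
    return ["mysql", "--host", host, "--user", user, "--password", database]


def restore_dump(dump_path, database, user, host="localhost"):
    """Feed the SQL dump to the mysql client (assumes the database exists)."""
    with open(dump_path, "rb") as dump:
        subprocess.run(build_restore_command(database, user, host),
                       stdin=dump, check=True)
```

Calling restore_dump with the path of the dump in the dataset directory, a database name, and a user name would then load the schema and the data in a single pass.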

Build the dataset from scratch

The src directory contains the Python scripts and the .csv data files needed to generate the VulinOSS dataset.

The prerequisites for running the analysis are the following:

Python 3
(For Windows users) a Unix-like command-line interface like Cygwin or Git Bash is required.
Perl
Count Lines of Code (cloc), a tool that counts blank lines, comment lines, and physical lines of source code in many programming languages. The Perl script should be stored at lib/cloc.pl (create the lib directory if it doesn't exist).

The following Python modules are also required:

pymysql
colorama
codecs (part of the Python standard library, so no separate installation is needed)
jupyter (if you want to run the provided notebook)
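
The prerequisites above can be checked before starting a long run. The following is a minimal sketch, not part of the repository: the function name and the wording of the returned messages are hypothetical, and the default cloc location is the lib/cloc.pl path mentioned above, resolved relative to the current working directory.

```python
import importlib.util
import os
import shutil
import sys

# Modules listed in this README (jupyter is optional and therefore omitted).
REQUIRED_MODULES = ("pymysql", "colorama", "codecs")


def missing_prerequisites(cloc_path="lib/cloc.pl", modules=REQUIRED_MODULES):
    """Return a list of human-readable names for every missing prerequisite."""
    missing = []
    if sys.version_info < (3,):
        missing.append("Python 3")
    if shutil.which("perl") is None:
        missing.append("perl")
    if not os.path.isfile(cloc_path):
        missing.append(cloc_path)
    for name in modules:
        # find_spec returns None when a top-level module cannot be imported.
        if importlib.util.find_spec(name) is None:
            missing.append("python module: " + name)
    return missing
```

An empty list means that everything the sketch knows about was found; it does not verify versions.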

Generate the dataset and populate the database

To generate the VulinOSS dataset the following steps are required:

Create the VulinOSS database schema with schema_generator.sql, located in the src/data directory.
Clone the projects' repositories locally. The repo_downloader.sh script, located in the src/vulinoss directory, automates this process when given highest_cve_rated_oss.csv as input. Note that if you perform this step manually, the name of the local repository directory must be a substring of the repository's URL, with the / symbols replaced by _. For example, https://github.com/owncloud/core.git should be stored as owncloud_core.git.
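
For manual cloning, the naming rule can be expressed in a few lines. This is a sketch of the rule as described here, not code from the repository: the function name is hypothetical, and it assumes that the relevant substring is the path part of the URL, which matches the owncloud example.

```python
from urllib.parse import urlparse


def local_repo_name(repo_url):
    """Derive the local directory name from a repository URL.

    Takes the path part of the URL (owner/project.git) and replaces
    every / with _.
    """
    path = urlparse(repo_url).path.strip("/")
    return path.replace("/", "_")


print(local_repo_name("https://github.com/owncloud/core.git"))  # owncloud_core.git
```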

Run the Python script that parses the NVD JSON files and stores the matches in the database. It takes the arguments shown in its usage message. Note that the database credentials must be changed in the nvd_json_parser.py script.

  usage: nvd_json_parser.py [-h] [-m PROJECT_NAME_MAPPING] [-w]
                            [-cb CONNECT_TO_CODE_BASE]
                            cve_feed_directory oss_list

  positional arguments:
    cve_feed_directory    The directory which contains the JSON feed files
    oss_list              The csv with the most vulnerable open source systems

  optional arguments:
    -h, --help            show this help message and exit
    -m, --project_name_mapping PROJECT_NAME_MAPPING
                          The csv file that matches alternative project names to
                          their main names
    -w, --write_to_db     Writes to the database
    -cb, --connect_to_code_base CONNECT_TO_CODE_BASE
                          Scans the local repositories for connecting NVD
                          versions to repository snapshots
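
To make the argument order concrete, the sketch below assembles an invocation of nvd_json_parser.py from the flags in the usage message. The helper name and every path passed to it are hypothetical. The help text does not state which value -cb expects, so the sketch passes whatever it is given through unchanged.

```python
import sys


def build_parser_command(cve_feed_directory, oss_list, mapping_csv=None,
                         write_to_db=False, code_base=None):
    """Return the argument list for running nvd_json_parser.py."""
    command = [sys.executable, "nvd_json_parser.py"]
    if mapping_csv is not None:
        command += ["-m", mapping_csv]
    if write_to_db:
        command.append("-w")
    if code_base is not None:
        # The expected value of -cb is an assumption left to the caller.
        command += ["-cb", code_base]
    # Positional arguments come last, in the order shown in the usage message.
    command += [cve_feed_directory, oss_list]
    return command
```

The resulting list can be handed to subprocess.run(command, check=True) from the directory that contains the script.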

Finally, if -cb was used in the previous step, you can retrieve code metrics for every project release by running the following Python script:

  usage: code_metrics_retriever.py [-h] [-w WRITE_TO_FILE]
                           oss_list repository_root_directory

  positional arguments:
    oss_list              The csv with the list of the projects to be retrieved
                          from the database and analyzed
    repository_root_directory
                          The root directory that contains the downloaded
                          repositories

  optional arguments:
    -h, --help            show this help message and exit
    -w, --write_to_file WRITE_TO_FILE
                          The output csv file

Note that this step generates SQL INSERT statements; it does not store the information directly in the database.
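
Since the statements are not applied automatically, one way to load them is sketched below. It assumes that the output file holds one complete INSERT statement per line, which this README does not confirm, and the function name is hypothetical. The function accepts any Python DB-API connection, for example one opened with pymysql.connect.

```python
def apply_insert_statements(connection, lines):
    """Execute every non-empty line as one SQL statement, then commit.

    Assumes one complete statement per line; multi-line statements
    would need a real SQL splitter instead.
    """
    cursor = connection.cursor()
    executed = 0
    for line in lines:
        statement = line.strip()
        if not statement:
            continue
        cursor.execute(statement)
        executed += 1
    connection.commit()
    cursor.close()
    return executed
```

Passing an open file object as lines streams the statements without loading the whole file into memory. The return value is the number of statements executed.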

License

This work is licensed under a Creative Commons Attribution 4.0 International License.
