Scripts to mine GitHub for GraphQL schemas, defined in the Schema Definition Language. Details on the mining and preprocessing as well as on an analysis of thousands of mined schemas can be found in our 2019 ICSOC paper.
In order to use these scripts, perform the following steps:
- Clone this repository and install dependencies using
npm i
. - Create a file named
.env
in the main folder containing the following:GH_USERNAME=<your GitHub username> GH_PASSWORD=<your GitHub password>
The first set of scripts in this repository searches GitHub for GraphQL schema definitions, and downloads metadata as well as the content of such files.
npm run load-metadata <min-size> <interval-size>
: Searches GitHub for GraphQL schema definition files. The argument<min-size>
is the minimum number of bytes a schema file should have. The argument<interval-size>
is the size of the interval in bytes for the crawler to look in. The crawler automatically updates the interval size subsequently to search for all schema definition files that have at least the given minimum size. A reasonable command to start off with isnpm run load-metadata 0 5
. The results will be written into JSON files created in aschemas
folder, which will contain files for different intervals. The data, so far, does not contain the content of the schema definition files yet, only references to the files on GitHub. This step can take multiple hours, as the crawler is designed to adhere to GitHub's API rate limits.npm run load-content <begin-at>
: Load the actual content of the schema files. The argument<begin-at>
can be used to skip a number of files in that folder from processing. This script will add acontent
property to the data, containing the content of the schema definition file.
The second set of scripts in this repository preprocesses previously collected data. The scripts should be run in the outlined order. Note that these scripts may log errors, as, for example, parsing a schema fails - this is expected, and the scripts will still continue.
npm run preprocess
: Checks all files in theschemas
folder for having the same GitHub URL and adds agithubDuplicate
property. Duplication can be the result of the GitHub search being inconsistent. Also adds atype
property that denotes if the content in a found file is a schema (schema
), an operation (query
), a mixture of both (mixed
), or cannot be determined (null
). Finally, also adds aaddContainsQueryType
property that denotes if, given the content is aschema
, aQuery
type is defined in it.npm run refactor
: Refactors the data in theschemas
folder by grouping schemas based on their (shared) repository. The output is written into arefactored
folder.npm run preprocess-refactored
: Attempts to merge partial schemas that belong to the same repository as described in our ICSOC paper. If merging succeeds, a newmerged
entry will be added to the corresponding schema entry, which again contains the content of the merged schema in amergedContent
field. Also, per schema, this script adds avalidSchema
property denoting if the schema is deemed valid as defined in the GraphQL specification. Finally, adds acontentDuplicate
property that denotes if the (merged) schema is contained multiple times across all collected data.
Licensed under the Apache License, Version 2.0.