This is the source code of the backend map generation and the webserver for the Paperscape map. The source code of the browser-based map client, as well as Paperscape data, are also available on GitHub.
For more details and progress on Paperscape please visit the development blog.
This file is organised as follows:
- Map generation using N-body simulation
- Tile and label generation for map
- Webserver for the map client
- Data formats
- About the Paperscape map
- Copyright
For a more detailed explanation of the N-body simulation see the N-body README.
The n-body map generation source code is located in the nbody/ directory.
It is written in C.
The map generator can be run with a gui, which is useful for tuning the map, or without one (headless), which is useful for incremental updates on a server.
The corresponding programs that can be built are nbody-gui and nbody-headless, respectively.
Dependencies: the MySQL C library is required by both nbody-gui and nbody-headless, while nbody-gui also depends on Cairo 2D graphics and Gtk+ 3.
Before building the nbody programs the utility library xiwilib must first be built by running make in the nbody/util/ directory.
To build the nbody program of your choice, run make <nbody-program> in the nbody/ directory.
That is, choose from
make nbody-gui
make nbody-headless
or simply run make to build them all.
Run any of the nbody programs with the --help command-line flag to see a list of command-line options, e.g.
./nbody-gui --help
Both nbody-gui and nbody-headless can read in Json files to set initial configuration settings, category colours and an existing map layout.
Default configuration files are located in the config/ directory, and also contain comments to explain some of the available features.
Running the nbody programs with no command-line options loads the default configuration settings and category colours for the arXiv map, i.e. the following two commands are equivalent:
./nbody-gui
./nbody-gui --settings ../config/arxiv-settings.json --categories ../config/arxiv-categories.json
This will load all available arXiv papers from the database and begin building a new map.
The default behaviour of nbody-headless, on the other hand, is to load an existing layout from the map_data database table, check for new papers, and run a fixed number of iterations.
To load an existing layout from the map_data database table in nbody-gui add the flag --layout-db.
To instead load an existing map layout from a Json file use --layout <filename> in both nbody programs.
Keyboard shortcuts for controlling the map in nbody-gui are printed to the terminal. Here are some useful keyboard shortcuts:
- By default the view is locked and will adjust its zoom as the graph rotates - the graph rotates to eliminate quadtree artifacts in the force calculation. To enable manual panning and zooming, toggle the view lock with V.
- Pressing space pauses or resumes graph updates.
- By default a maximum of 100k papers are shown to speed up draw times. To force a full draw of all papers press f.
- To write the current map layout positions to a Json file press J.
- To draw the current map layout positions to a png image file press w.
The nbody-headless program can write map layout positions to either a Json file or the map_data database table by specifying the flags --write-json or --write-db, respectively.
The nbody-gui program can only write map layout positions to a Json file, which it does on request (as described above).
To load a map layout position Json file into the database you can use the map2db program in the nbody/map2db/ directory.
This Go program must first be compiled with go build.
To load the file out-map.json into the table map_data do
./map2db --map-table map_data out-map.json
The tile generator source code is located in the tiles/
directory and is written in Go.
It has two external dependencies, the Go packages go-cairo and GoMySQL, which must be installed with go install to a location referred to by the environment variable GOPATH (see this introduction to go tool).
Note that these Go packages have been added as git submodules in the gopkg/ directory, which could be used as the installation path.
To download these submodules do
git submodule init
git submodule update
and then install them each separately.
Once these dependencies have been met, the tile generator can be built by running the following command in the tiles/ directory:
go build
This should create the program tiles, named after its parent directory.
To see a full list of command-line arguments run
./tiles --help
The tiles program requires an output directory to be specified, and by default it will load a paper graph from the map_data MySQL table and generate standard tiles and labels for it:
./tiles <output_dir>
The tiles program can read in Json files to set initial configuration settings, category colours and the map layout.
Default configuration files are located in the config/ directory, and also contain comments to explain some of the available features.
Running the tiles program with no command-line options loads the default configuration settings and category colours for the arXiv map, i.e. the above command is equivalent to:
./tiles --settings ../config/arxiv-settings.json --categories ../config/arxiv-categories.json <output_dir>
In addition to normal tiles, which are coloured according to their categories, it is also possible to generate heatmap and grayscale tiles with the flags --hm and --gs, respectively.
By default the heatmap tiles are coloured according to their age spectrum.
An alternative heat parameter can be specified in the configuration file.
The generated tiles are saved in PNG format and can be optimized slightly to reduce disk space with the optitiles script. This script can be run on the chosen output directory:
./optitiles <output_dir>
The generated labels are saved as Json files. A Json file called world_index.json is also generated at the base of the chosen output directory. It describes the dimensions and location paths of the tiles and labels created for the browser-based map client, which reads this file statically. To reduce the size of both world_index.json and the generated labels the gzipjson script can be run on the output directory:
./gzipjson <output_dir>
The web server serves data from a MySQL database containing the following tables (schemas are detailed below):
- meta_data - paper meta data
- pcite - paper reference and citation information
- map_data - paper locations and size in the map
- datebdry - current date boundaries
- misc - miscellaneous information, such as the last download date of arXiv meta data
- userdata* - user login and saved profile information
- sharedata* - shared profile link information
The tables with a * are only used by the My Paperscape project i.e. not the map project. Detailing their schemas is currently beyond the scope of this documentation.
By default the web server serves paper abstracts from a local directory specified by the --meta flag.
Alternatively it can read abstracts from an abstracts table.
The web server is written in Go.
It has two external dependencies, the Go packages GoMySQL and osext, which must be installed with go install to a location referred to by the environment variable GOPATH (see this introduction to go tool).
Note that these Go packages have been added as git submodules in the gopkg/ directory, which could be used as the installation path.
To download these submodules do
git submodule init
git submodule update
and then install them each separately.
Once these dependencies have been met, the web server can be built with the command
go build
This should create the program webserver, named after its parent directory.
To see a full list of command-line arguments run
./webserver --help
The web server can be run using the FastCGI or HTTP protocols using the command-line arguments --fcgi :<port number> or --http :<port number>, respectively.
For example
./webserver --http :8089
The webserver program can read in a Json file to set initial configuration settings.
A default configuration file is located in the config/ directory, and also contains comments to explain some of the available features.
Running the webserver program with no command-line options loads the default arXiv configuration settings, i.e. the above command is equivalent to:
./webserver --settings ../config/arxiv-settings.json --http :8089
The web server is run on the Paperscape server using the run-webserver script.
Access to the MySQL database requires the following environment variables to be set:
Environment variable | Description |
---|---|
PSCP_MYSQL_HOST | Hostname of the MySQL server e.g. localhost |
PSCP_MYSQL_SOCKET | Path to MySQL socket e.g. /var/run/mysqld/mysqld.sock |
PSCP_MYSQL_DB | Name of the database to use |
PSCP_MYSQL_USER | Username |
PSCP_MYSQL_PWD | Password |
If both a socket and hostname are specified, the socket is used.
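The socket-over-hostname rule can be illustrated with a short Go sketch. This is not the webserver's actual connection code: the function name mysqlTarget and the injected getenv parameter are hypothetical, chosen only so the precedence logic can be read and exercised in isolation.

```go
package main

import (
	"fmt"
	"os"
)

// mysqlTarget returns the network type and address implied by the
// PSCP_MYSQL_* variables, preferring the socket when both are set.
// The getenv parameter is an illustrative choice so the rule can be
// tried without touching the real environment.
func mysqlTarget(getenv func(string) string) (network, addr string, err error) {
	if sock := getenv("PSCP_MYSQL_SOCKET"); sock != "" {
		return "unix", sock, nil
	}
	if host := getenv("PSCP_MYSQL_HOST"); host != "" {
		return "tcp", host, nil
	}
	return "", "", fmt.Errorf("neither PSCP_MYSQL_SOCKET nor PSCP_MYSQL_HOST is set")
}

func main() {
	network, addr, err := mysqlTarget(os.Getenv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("would connect to %s via %s as user %q, database %q\n",
		addr, network, os.Getenv("PSCP_MYSQL_USER"), os.Getenv("PSCP_MYSQL_DB"))
}
```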
The meta_data table can be used as input by both the n-body map generator and the tile generator. It is served by the webserver.
Only relevant fields listed; Req. = Required, Opt. = Optional.
Field | Type | Description | nbody | tiles | webserver |
---|---|---|---|---|---|
id | int(10) unsigned | Unique paper identifier | Req. | Req. | Req. |
allcats | varchar(130) | List of categories (comma separated) | Req. | Req. | |
maincat | varchar(8) | Main arXiv category | Opt. | Req. | Req. |
title | varchar(500) | Paper title | Opt. | Opt. | Req. |
authors | text | Paper authors (comma separated) | Opt. | Opt. | Req. |
keywords | text | Paper keywords (comma separated) | Opt. | Opt. | Req. |
arxiv | varchar(16) | unique arXiv identifier | Opt. | ||
publ | varchar(200) | Journal publication information | Req. | ||
inspire | int(8) unsigned | Inspire record number | Opt. |
By default the id field is ordered by publication date (version 1) as follows:
ymdh = (year - 1800) * 10000000
+ (month - 1) * 625000
+ (day - 1) * 15625
unique_id = ymdh + 4*num
If this is not the case, the ids_time_ordered flag should be set to false in the configuration Json file.
In the nbody programs, categories and keywords are used for creating fake links between disconnected graphs. In nbody-gui categories are also used for colouring papers in the gui display.
The pcite table can be used as input by both the n-body map generator and the tile generator. It is served by the webserver.
Req. = Required, Opt. = Optional.
Field | Type | Description | nbody | tiles | webserver |
---|---|---|---|---|---|
id | int(10) unsigned | Unique paper identifier | Req. | Req. | |
refs | blob | Binary blob encoding references | Req. | Req. | |
numRefs | int(10) unsigned | Number of references | Req. | ||
cites | mediumblob | Binary blob encoding citations | Req. | ||
numCites | int(10) unsigned | Number of citations | Req. | ||
dNumCites1 | tinyint(4) | Change in number of citations past 1 day | Opt. | ||
dNumCites5 | tinyint(4) | Change in number of citations past 5 days | Opt. |
For a given paper (A), a reference is a paper (B) that paper (A) refers to in its text, while a citation is a paper (C) that refers to paper (A), i.e. a reverse reference.
The refs field encodes a list of references in binary, with each reference represented by 10 bytes as follows:
Reference fields | Encoding |
---|---|
id of referenced paper (B) | unsigned little-endian 32-bit int -> 4 bytes |
order of (B) in bibliography of (A) | unsigned little-endian 16-bit int -> 2 bytes |
frequency - how often (B) appears in (A) | unsigned little-endian 16-bit int -> 2 bytes |
number of citations of referenced paper (B) | unsigned little-endian 16-bit int -> 2 bytes |
Likewise the cites field has a similar encoding:
Citation fields | Encoding |
---|---|
id of citing paper (C) | unsigned little-endian 32-bit int -> 4 bytes |
order of (A) in bibliography of (C) | unsigned little-endian 16-bit int -> 2 bytes |
frequency - how often (A) appears in (C) | unsigned little-endian 16-bit int -> 2 bytes |
number of citations of citing paper (C) | unsigned little-endian 16-bit int -> 2 bytes |
Note that the above encoding is the default in Paperscape. It is possible to disable the encoding/decoding of the last three 2-byte fields in the configuration file.
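A minimal Go sketch of decoding the default 10-byte records is given below. The type and function names (Ref, decodeRefs) are illustrative and not taken from the Paperscape source, and the sketch assumes the default encoding with all four fields present. The same layout applies to the cites field, with (C) in place of (B).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Ref mirrors one 10-byte record of the refs blob.
// Field names are illustrative.
type Ref struct {
	ID       uint32 // id of referenced paper (B)
	Order    uint16 // order of (B) in the bibliography of (A)
	Freq     uint16 // how often (B) appears in (A)
	NumCites uint16 // number of citations of (B)
}

const refSize = 10 // 4 + 2 + 2 + 2 bytes

// decodeRefs splits the blob into fixed-size little-endian records.
func decodeRefs(blob []byte) ([]Ref, error) {
	if len(blob)%refSize != 0 {
		return nil, fmt.Errorf("blob length %d is not a multiple of %d", len(blob), refSize)
	}
	refs := make([]Ref, 0, len(blob)/refSize)
	for off := 0; off < len(blob); off += refSize {
		rec := blob[off : off+refSize]
		refs = append(refs, Ref{
			ID:       binary.LittleEndian.Uint32(rec[0:4]),
			Order:    binary.LittleEndian.Uint16(rec[4:6]),
			Freq:     binary.LittleEndian.Uint16(rec[6:8]),
			NumCites: binary.LittleEndian.Uint16(rec[8:10]),
		})
	}
	return refs, nil
}

func main() {
	// One record: id=1, order=2, freq=3, numCites=4.
	blob := []byte{0x01, 0x00, 0x00, 0x00, 0x02, 0x00, 0x03, 0x00, 0x04, 0x00}
	refs, err := decodeRefs(blob)
	fmt.Println(refs, err)
}
```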
The map_data table can be created as output by the n-body map generator, and used as input to the tile generator. It is served by the webserver.
Field | Type | Description |
---|---|---|
id | int(10) unsigned | Unique paper identifier |
x | int(11) | X coordinate in map |
y | int(11) | Y coordinate in map |
r | int(11) | Circle radius in map |
The datebdry table has the following fields:
Field | Type | Description | nbody | tiles | webserver |
---|---|---|---|---|---|
daysAgo | int(10) unsigned | Number of days ago (0-31) | Opt. | ||
id | int(10) unsigned | id corresponding to cut-off | Opt. |
The cut-off id does not refer to an actual paper, but is the maximum paper id for that day + 1.
The id for daysAgo = 1 is therefore a lower-bound for all papers of the current (submission) day, and an upper-bound for all papers from the day before.
Note that use of this table assumes that the ids are time ordered.
If this is not the case, the ids_time_ordered flag should be set to false in the configuration Json file.
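The boundary semantics can be sketched in Go as follows. This is an illustration only: the slice bdry is a hypothetical in-memory copy of the datebdry table, with bdry[d] holding the cut-off id for daysAgo = d, and ids are assumed to be time ordered.

```go
package main

import "fmt"

// daysAgo reports how many days ago a paper was submitted, or -1 if
// the id lies outside the range covered by the boundaries.
// bdry[d] is the cut-off id for daysAgo = d, so a paper from d days
// ago satisfies bdry[d+1] <= id < bdry[d].
func daysAgo(id uint32, bdry []uint32) int {
	if len(bdry) == 0 || id >= bdry[0] {
		return -1
	}
	for d := 0; d+1 < len(bdry); d++ {
		if id >= bdry[d+1] {
			return d
		}
	}
	return -1 // older than the oldest boundary
}

func main() {
	// Made-up cut-off ids for daysAgo = 0, 1, 2, 3.
	bdry := []uint32{400, 300, 200, 100}
	for _, id := range []uint32{350, 299, 100, 99} {
		fmt.Println(id, daysAgo(id, bdry))
	}
}
```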
Field | Type | Description | nbody | tiles | webserver |
---|---|---|---|---|---|
field | varchar(16) | Name of misc field | Opt. | ||
value | varchar(4096) | Value of misc field | Opt. |
The misc table is used to access miscellaneous information, such as the last download date of arXiv meta data.
By default paper abstracts are read by the webserver from raw arXiv meta data xml files (example xml file) available from the arXiv OAI.
The root directory of these files can be specified by the --meta <dirname> flag.
If no directory is specified then the server returns "(no abstract)".
The xml files are organized by their arXiv ids into year and month subdirectories, i.e. YYMM.12345.xml is stored as <--meta dir>/YYxx/YYMM/YYMM.12345.xml, and arxiv-cat/YYMM123 as <--meta dir>/YYxx/YYMM/arxiv-cat/YYMM123.xml, etc.
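The directory layout can be sketched in Go as below. The function name metaPath is hypothetical and this is not the webserver's actual lookup code; it assumes the first four characters of the numeric part of the id are YYMM and does no further validation.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// metaPath maps an arXiv id to the location of its xml file under
// metaDir. New-style ids look like "YYMM.12345"; old-style ids look
// like "arxiv-cat/YYMM123" and get an extra category subdirectory.
func metaPath(metaDir, arxivID string) (string, error) {
	cat, name := "", arxivID
	if i := strings.Index(arxivID, "/"); i >= 0 {
		cat, name = arxivID[:i], arxivID[i+1:]
	}
	if len(name) < 4 {
		return "", fmt.Errorf("arXiv id %q is too short", arxivID)
	}
	yymm := name[:4]
	// path.Join skips the empty category element for new-style ids.
	return path.Join(metaDir, yymm[:2]+"xx", yymm, cat, name+".xml"), nil
}

func main() {
	for _, id := range []string{"1204.12345", "hep-th/9901001"} {
		p, err := metaPath("meta", id)
		fmt.Println(p, err)
	}
}
```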
Alternatively, instead of using --meta <dirname>, it is possible to read paper abstracts from a MySQL table if a non-trivial table name is specified in the configuration file.
This table has the following schema:
Field | Type | Description | nbody | tiles | webserver |
---|---|---|---|---|---|
id | int(10) unsigned | Unique paper identifier | Opt. | ||
abstract | mediumtext | Full abstract | Opt. |
This file format can be used as input by the n-body map generator. The following Json format is used:
[
{"id":input-id,"allcats":"input-category,...","refs":[[input-ref-id,input-ref-freq],...]},
...
]
where input-id, input-ref-id and input-ref-freq are integers, and input-category is a string. Reading in keywords, title and author are currently not supported.
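For illustration, a file in this format can be parsed in Go with the standard encoding/json package. The type name InputPaper and the function parsePapers are hypothetical and not part of the Paperscape source; ids are held as uint32 to match the unsigned id columns described earlier.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// InputPaper mirrors one object of the input Json array.
type InputPaper struct {
	ID      uint32      `json:"id"`
	AllCats string      `json:"allcats"` // comma-separated categories
	Refs    [][2]uint32 `json:"refs"`    // each entry is [ref-id, ref-freq]
}

// parsePapers decodes the whole array in one call.
func parsePapers(data []byte) ([]InputPaper, error) {
	var papers []InputPaper
	if err := json.Unmarshal(data, &papers); err != nil {
		return nil, err
	}
	return papers, nil
}

func main() {
	// Made-up sample data in the documented shape.
	data := []byte(`[
		{"id":1,"allcats":"hep-th,gr-qc","refs":[[2,1],[3,4]]},
		{"id":2,"allcats":"hep-th","refs":[]}
	]`)
	papers, err := parsePapers(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range papers {
		fmt.Println(p.ID, p.AllCats, p.Refs)
	}
}
```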
This file format can be created as output by the n-body map generator, and used as input to the tile generator. The following Json format is used:
[
[input-id,input-x,input-y,input-r],
...
]
where input-id, input-x, input-y, and input-r are integers.
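A layout file of this shape can be read in Go as sketched below. The names Placement and parseLayout are hypothetical. One caveat of this approach: when decoding into a fixed-size Go array, encoding/json zero-fills short rows and drops extra elements rather than reporting an error, so a stricter reader would need its own length check.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Placement holds one row of the layout file: [id, x, y, r].
type Placement struct {
	ID      uint32
	X, Y, R int32
}

// parseLayout decodes the array of 4-element integer rows.
func parseLayout(data []byte) ([]Placement, error) {
	var rows [][4]int64
	if err := json.Unmarshal(data, &rows); err != nil {
		return nil, err
	}
	out := make([]Placement, len(rows))
	for i, r := range rows {
		out[i] = Placement{ID: uint32(r[0]), X: int32(r[1]), Y: int32(r[2]), R: int32(r[3])}
	}
	return out, nil
}

func main() {
	// Made-up sample data in the documented shape.
	layout, err := parseLayout([]byte(`[[1,-10,20,5],[2,30,-40,7]]`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range layout {
		fmt.Printf("paper %d at (%d,%d) radius %d\n", p.ID, p.X, p.Y, p.R)
	}
}
```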
Paperscape is an interactive map that visualises the arXiv, an open, online repository for scientific research papers. The map, which can be explored by panning and zooming, currently includes all of the papers from the arXiv and is updated daily.
Each scientific paper is represented in the map by a circle whose size is determined by the number of times that paper has been cited by others. A paper's position in the map is determined by both its citation links (papers that cite it) and its reference links (papers it refers to). These links pull related papers together, whereas papers with no or few links in common push each other away.
In the default colour scheme, where papers are coloured according to their scientific category, coloured "continents" emerge, such as theoretical high energy physics (blue) or astrophysics (pink). At their interface one finds cross-disciplinary fields, such as dark matter and cosmological inflation. Zooming in on a continent reveals substructures representing more specific fields of research. The automatically extracted keywords that appear on top of papers help to identify interesting papers and fields.
Clicking on a paper reveals its meta data, such as title, authors, journal and abstract, as well as a link to the full text. It is also possible to view the references or citations for a paper as a star-like background on the map.
The MIT License (MIT)
Copyright (C) 2011-2017 Damien P. George and Robert Knegjens
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.