This repository contains scripts and instructions for reproducing the experiments in our SOSP'23 paper "UGACHE: A Unified GPU Cache for Embedding-based Deep Learning Systems".
NOTE: For the artifact evaluation committee, please jump directly to the Reproducing the Results section and reproduce the results on the provided server. Other readers should follow the instructions below to set up the environment and reproduce the results on their own servers.
> tree .
├── coll_cache_lib # source code of UGache (coll_cache is the internal name for ugache)
├── datagen # scripts to prepare datasets for GNN and DLR
│ ├── dlr
│ └── gnn
├── docker # Dockerfiles and scripts to pull and build UGache, its dependencies, and other baselines
│ ├── Dockerfile.dlr
│ ├── Dockerfile.gnn
│ ├── setup_docker.dlr.sh
│ ├── setup_docker.gnn.sh
│ └── setup_docker.sh
├── eval # evaluation scripts (per-figure)
│ ├── dlr
│ └── gnn
└── python # source code of UGache
UGache aims to accelerate embedding access on multi-GPU platforms with NVLink support. In this artifact, UGache natively supports, and is evaluated on, the following three platforms:
- Server A with hard-wired NVLink
- CPU: 2 x Intel Xeon Gold 6138 CPUs (20 cores each)
- GPU: 4 x NVIDIA V100 (16GB) GPUs with symmetric fully-connected NVLink topology
- Server B with hard-wired NVLink
- CPU: 2 x Intel Xeon Platinum 8163 CPUs (24 cores each)
- GPU: 8 x NVIDIA V100 (32GB) GPUs with asymmetric NVLink topology, identical to DGX-1
- Server C with NVSwitch
- CPU: 2 x Intel Xeon Gold 6348 CPUs (28 cores each)
- GPU: 8 x NVIDIA A100 (80GB) GPUs, connected via NVSwitch
These platforms are currently hard-coded in UGache. For other, undocumented multi-GPU platforms with NVLink interconnects, we provide preliminary support in this document.
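To check which of these platforms your server resembles, you can inspect the GPU interconnect topology (this check is our suggestion, not part of the artifact scripts):

nvidia-smi topo -m   # the GPU-to-GPU matrix shows NVLink (NV#) vs PCIe (PIX/PHB) links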
We use NVIDIA's Merlin container as the base environment for UGache. Most software dependencies have been prepared inside the image (e.g., PyTorch, TensorFlow, CUDA, cuDNN). Please first ensure that your Docker service supports CUDA; NVIDIA's documentation on running Docker images with CUDA support is a useful reference. Due to dependency conflicts, the GNN and DLR evaluations are conducted in different containers (merlin-pytorch and merlin-tensorflow).
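A quick way to verify that your Docker service can run CUDA workloads (our own sanity check; the image tag is only an example):

docker run --rm --runtime=nvidia nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi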
The docker images can be built using the following command:
cd <ugache-dir>/docker
docker build --pull -t ugache-gnn -f Dockerfile.gnn --build-arg RELEASE=false .
docker build --pull -t ugache-dlr -f Dockerfile.dlr --build-arg RELEASE=false .
Then launch containers using the following command:
docker run --shm-size=200g --ulimit memlock=-1 --ulimit core=0 --runtime=nvidia --privileged=true --cap-add=SYS_ADMIN --cap-add=SYS_NICE --ipc=host --name ugache-ae-gnn -it ugache-gnn bash
docker run --shm-size=200g --ulimit memlock=-1 --ulimit core=0 --runtime=nvidia --privileged=true --cap-add=SYS_ADMIN --cap-add=SYS_NICE --ipc=host --name ugache-ae-dlr -it ugache-dlr bash
Since datasets require a large disk volume, please also bind external storage to the container if your server stores large datasets on a separate device:
docker run --shm-size=200g --ulimit memlock=-1 --ulimit core=0 --runtime=nvidia --privileged=true --cap-add=SYS_ADMIN --cap-add=SYS_NICE --ipc=host -v <extern_host_storage>:/dataset_gnn --name ugache-ae-gnn -it ugache-gnn bash
docker run --shm-size=200g --ulimit memlock=-1 --ulimit core=0 --runtime=nvidia --privileged=true --cap-add=SYS_ADMIN --cap-add=SYS_NICE --ipc=host -v <extern_host_storage>:/dataset_dlr --name ugache-ae-dlr -it ugache-dlr bash
Then run these scripts to pull and build UGache, its dependencies and baselines:
## for GNN:
docker exec -it ugache-ae-gnn bash
bash /tmp/setup_docker.sh
bash /tmp/setup_docker.gnn.sh
## for DLR:
docker exec -it ugache-ae-dlr bash
bash /tmp/setup_docker.sh
bash /tmp/setup_docker.dlr.sh
Note that this pulls a fresh copy of UGache inside the container; the UGache repository on the host is no longer used.
UGache depends on Gurobi to solve MILP problems. Please refer to this link to request a trial license. We recommend the WLS Academic license for academic users, as it can be deployed in multiple containers. Place the license file at /opt/gurobi/gurobi.lic in the container, and verify that the license is properly installed using the following command:
/opt/gurobi-install/linux64/bin/gurobi_cl --license
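If you store the license at a different path, Gurobi also honors the GRB_LICENSE_FILE environment variable (an optional alternative, not required by the artifact):

export GRB_LICENSE_FILE=/path/to/gurobi.lic
/opt/gurobi-install/linux64/bin/gurobi_cl --license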
We provide scripts to prepare the DLR and GNN datasets in the datagen folder.
Due to limited time and disk space, the preprocessing of embeddings is not included.
In this artifact, the embeddings are initialized without loading the correct embedding values from the dataset.
This only affects the numerical correctness of training/inference, and does not affect the computation workflow.
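To make this concrete, here is a minimal, hypothetical numpy illustration (our own, not UGache code) of what random initialization means for the workflow:

import numpy as np

# The embedding table is filled with random values instead of values loaded
# from the dataset: lookups return numerically meaningless embeddings, but the
# access pattern, cache behavior, and compute workflow are unchanged.
num_embeddings, dim = 1_000_000, 128                # hypothetical table shape
table = np.random.uniform(-1, 1, (num_embeddings, dim)).astype(np.float32)
keys = np.random.randint(0, num_embeddings, 8192)   # one lookup batch
batch = table[keys]                                 # same gather as with real embeddings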
By default, GNN datasets will be placed in /datasets_gnn:
tree /datasets_gnn -L 2
/datasets_gnn
├── data-raw # original downloaded dataset
│ ├── com-friendster
│ ├── com-friendster.tar.zst
│ ├── mag240m_kddcup2021
│ ├── mag240m_kddcup2021.zip
│ ├── papers100M-bin
│ └── papers100M-bin.zip
├── gnnlab # converted dataset for GNNLab
│ ├── com-friendster
│ ├── mag240m-homo
│ └── papers100M-undir
└── wholegraph # converted dataset for UGache and WholeGraph
├── com_friendster
├── mag240m_homo
└── ogbn_papers100M
Run the following commands to download and process GNN datasets:
# run following commands in GNN container
cd /ugache/datagen/gnn
python friendster.py
python mag240M.py
python papers100M.py
Apart from downloading ~300GB of raw data, preprocessing takes around 1 hour.
The final datasets in gnnlab and wholegraph occupy 130GB, while the data-raw directory occupies up to 600GB.
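Before running the datagen scripts, it may help to confirm that enough free space is available (our own sanity check):

df -h /datasets_gnn   # data-raw alone may grow to 600GB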
By default, DLR datasets will be placed in /datasets_dlr:
tree /datasets_dlr -L 2
/datasets_dlr
├── data-raw # original downloaded dataset
│ └── criteo_tb
└── processed # converted dataset
├── criteo_tb
└── syn_a12_s100_c800m
Since there is no permanent URL for downloading the Criteo TB dataset, please download it manually from ailab.criteo.com or Aliyun, and place day_0.gz ~ day_23.gz under /datasets_dlr/data-raw/criteo_tb/.
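A quick check (our own) that all 24 day files are in place before processing:

ls /datasets_dlr/data-raw/criteo_tb/day_*.gz | wc -l   # should print 24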
Then, run the following commands to process DLR datasets:
# run following commands in DLR container
cd /ugache/datagen/dlr
python syn.py
cd criteo
bash criteo.sh
Depending on your network, downloading and preprocessing the full Criteo TB dataset may take up to 24 hours and consume around 2TB of disk space. The final dataset in processed occupies 700GB.
Our experiments have been automated using scripts. Each figure in our paper is treated as one experiment and is associated with a subdirectory in ugache/eval. The scripts automatically run the experiment, save the logs into files, parse the output data from the files, and plot the corresponding figure.
tree /ugache/eval -L 2
/ugache/eval
├── dlr
│ ├── figure11-4v100
│ ├── figure11-8a100
│ ├── figure11-8v100
│ ├── figure12-4v100
│ ├── figure12-8a100
│ ├── figure12-8v100
│ └── figure16
└── gnn
├── figure11-4v100
├── figure11-8a100
├── figure11-8a100-fix-cache-rate
├── figure11-8v100
├── figure12-4v100
├── figure12-8a100
├── figure12-8a100-fix-cache-rate
├── figure12-8v100
├── figure13
├── figure14
└── figure15
The additional gnn/figure11-8a100-fix-cache-rate and gnn/figure12-8a100-fix-cache-rate directories fix a misconfigured cache rate. On server C, GNNLab, Rep_U, and UGache should be able to cache all embeddings of Papers100M and Com-Friendster, since each GPU has 80GB of memory. We accidentally used a smaller cache rate during submission.
We provide one-click scripts to reproduce the results on a multi-GPU server. These scripts simply chain the commands from the following "Reproducing single figure" section.
$ cd /ugache/eval/gnn # GNN tests in gnn folder should be run in gnn container
# for DLR tests, enter /ugache/eval/dlr in the dlr container
$ bash run-all-4v100.sh # run scripts that match the platform: run-all-(4v100,8v100,8a100).sh
In each figure* folder, execute the following commands. Take dlr/figure11-4v100 for example:
# tests in dlr folder should be run in dlr container
$ cd /ugache/eval/dlr/figure11-4v100
$ make run
$ make plot
$ ls data*
data.dat data.eps
$ cat data.dat
short_app policy_impl dataset_short step.train
dlrm SOK CR 0.005778
dlrm HPS CR 0.004299
dlrm UGache CR 0.002626
dcn SOK CR 0.007870
dcn HPS CR 0.006381
dcn UGache CR 0.004722
dlrm SOK SYN 0.014536
dlrm HPS SYN 0.018224
dlrm UGache SYN 0.008524
dcn SOK SYN 0.047721
dcn HPS SYN 0.046759
dcn UGache SYN 0.037482
The make run command runs all tests, and logs are saved to the run-logs folder. The make plot command first parses the logs in run-logs to produce a data.dat file, then plots the corresponding figure to data.eps.
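Since data.dat is whitespace-separated with a header row (see the sample above), it can also be inspected programmatically; this snippet is our own and not part of the artifact:

import pandas as pd

df = pd.read_csv("data.dat", sep=r"\s+")   # whitespace-separated with header
# e.g., per-step training time of each policy, per app and dataset
print(df.pivot_table(index=["short_app", "dataset_short"],
                     columns="policy_impl", values="step.train"))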
Each figure folder contains a runner.py file, and make run is simply an alias for python runner.py. The Python script iterates over all configurations, generates a command for each configuration, and runs them via os.system. You may execute python runner.py -m to see which commands it generates and manually run one configuration. Note that manually running a configuration requires launching the NVIDIA-MPS service separately. Since our runner handles this automatically, we recommend using make run or python runner.py to run all configurations.
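The pattern looks roughly like the following sketch (hypothetical names and flags, not the actual runner.py):

import itertools, os, sys

apps = ["dlrm", "dcn"]                 # example configuration axes
policies = ["SOK", "HPS", "UGache"]
datasets = ["CR", "SYN"]

print_only = "-m" in sys.argv[1:]      # -m: show commands without running them
for app, policy, dataset in itertools.product(apps, policies, datasets):
    cmd = f"python main.py --app {app} --policy {policy} --dataset {dataset}"
    if print_only:
        print(cmd)
    else:
        os.system(cmd)                 # the real runner also manages NVIDIA-MPS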
We recommend the eps-preview extension in VS Code to quickly preview EPS figures.
We also provide the original log files used in our paper submission in the run-logs-paper folder. You may run make plot-paper to plot figures directly from these log files and quickly reproduce the figures in the paper before running all tests.
Each figure should be evaluated on the designated platform. The following table shows the platform and estimated time for each figure:
Figure | Platform | Estimated Time |
---|---|---|
dlr/figure11-4v100 | Server A | 30 min |
dlr/figure11-8v100 | Server B | 30 min |
dlr/figure11-8a100 | Server C | 30 min |
dlr/figure12-4v100 | Server A | 30 min |
dlr/figure12-8v100 | Server B | 30 min |
dlr/figure12-8a100 | Server C | 20 min |
dlr/figure16 | Server C | 10 min |
gnn/figure11-4v100 | Server A | 60 min |
gnn/figure11-8v100 | Server B | 60 min |
gnn/figure11-8a100 | Server C | 30 min |
gnn/figure12-4v100 | Server A | 50 min |
gnn/figure12-8v100 | Server B | 50 min |
gnn/figure12-8a100 | Server C | 30 min |
gnn/figure13 | Server C | 70 min |
gnn/figure14 | Server C | 40 min |
gnn/figure15 | Server C | 0 min |
Note: Since server B and server C cannot be accessed publicly, we provide screencasts of the results on these platforms. The table below lists the experimental results produced during the screen recordings.
Server | APP | Screencast | Log & Script Archive | MD5 |
---|---|---|---|---|
Server B | GNN | gnn-8v100.mp4 | gnn-8v100.tar.gz | d7b73603be17d5168363cf9810322b7c |
Server B | DLR | dlr-8v100.mp4 | dlr-8v100.tar.gz | 67e35f264a63be94ef23d2676375879c |
Server C | GNN | gnn-8a100.mp4 | gnn-8a100.tar.gz | 6269cb2fbef963314d5a468e83798973 |
Server C | DLR | dlr-8a100.mp4 | dlr-8a100.tar.gz | 8fc5ebcc8a6a1a02d4a248c6b2326817 |
In each screencast, we first display the branch information of the code repository, then start the experiment using a one-click script. The script first deletes all previous run-logs. After all experiments finish, the entire directory is compressed and its MD5 value is printed. Reviewers can use this value to verify consistency between the provided tarball and the one shown in the screen recording, then run make plot in each evaluated figure folder to plot the figures and examine the results. These screencasts were recorded a few commits before the latest version. This does not invalidate them, since the newer commits only update documentation or add preliminary support for other platforms, neither of which affects the execution flow on these three platforms.
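To verify a downloaded archive against the MD5 listed above (our own example command):

md5sum gnn-8v100.tar.gz   # expect d7b73603be17d5168363cf9810322b7c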