NOTE: This benchmark uses dummy weights by default for faster experiments. It is expected that the generated text is random garbled characters; the throughput and latency numbers are still correct.
The following commands use ~/flexllmgen_offload_dir as the offloading folder by default.
To get the best performance, it is recommended to mount this folder on a fast SSD.
If you use AWS or GCP instances with local SSDs, you can use mount_nvme_aws.sh or mount_nvme_gcp.sh to mount them.
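The SSD setup above can be sketched as a small shell helper. Note that `setup_offload_dir` and the `/mnt/nvme` mount point are hypothetical names for illustration, not part of the benchmark scripts; it assumes the SSD is already mounted:

```shell
# Hypothetical helper: keep the offload data on a fast SSD and point the
# default folder (~/flexllmgen_offload_dir) at it via a symlink.
setup_offload_dir() {
  local ssd_mount="$1"   # assumed SSD mount point, e.g. /mnt/nvme
  local link_path="$2"   # default offload folder, e.g. $HOME/flexllmgen_offload_dir
  mkdir -p "$ssd_mount/flexllmgen_offload_dir"
  ln -sfn "$ssd_mount/flexllmgen_offload_dir" "$link_path"
}

# Example invocation (assumes /mnt/nvme is an SSD mounted beforehand,
# e.g. by mount_nvme_aws.sh or mount_nvme_gcp.sh):
# setup_offload_dir /mnt/nvme "$HOME/flexllmgen_offload_dir"
```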
# fp16
python3 bench_suite.py 6b7_1x1
# with int4 compression
python3 bench_suite.py 6b7_1x1_comp
# fp16
python3 bench_suite.py 30b_1x1
# with int4 compression
python3 bench_suite.py 30b_1x1_comp
# fp16
python3 bench_suite.py 175b_1x1
# with int4 compression
python3 bench_suite.py 175b_1x1_comp
# Install OpenMPI, which the multi-GPU benchmark scripts rely on
sudo apt install openmpi-bin
# 1 node with 4 GPUs
bash bench_6.7b_1x4.sh
# 4 nodes with 1 GPU per node
bash bench_6.7b_4x1.sh
# 1 node with 4 GPUs
bash bench_30b_1x4.sh
# 4 nodes with 1 GPU per node
bash bench_30b_4x1.sh
# 1 node with 4 GPUs
bash bench_175b_1x4.sh
# 4 nodes with 1 GPU per node
bash bench_175b_4x1.sh
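For the 4-node runs, each node contributes one GPU, so the launcher needs to know the participating hosts. Assuming the bench_*_4x1.sh scripts launch via `mpirun` (an assumption; check the scripts for the exact command), a hostfile in OpenMPI's format would be sketched like this, with `node1`..`node4` as placeholder hostnames:

```shell
# Hypothetical OpenMPI hostfile for the 4x1 runs: one slot (process) per node.
# Replace node1..node4 with your actual hostnames or IPs.
cat > hostfile <<'EOF'
node1 slots=1
node2 slots=1
node3 slots=1
node4 slots=1
EOF

# A typical mpirun invocation with such a hostfile would look like:
# mpirun --hostfile hostfile -np 4 <command>
```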