Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang
We introduce LEO, an embodied multi-modal generalist agent capable of grounding, reasoning, chatting, planning, and acting in the 3D world. LEO is trained in a two-stage scheme: (i) 3D vision-language (VL) alignment and (ii) 3D vision-language-action (VLA) instruction tuning.
We collect extensive and diverse data for training LEO. † indicates the task contains data we generated. See Task and Data for details. The data statistics are shown below:
| Dataset | Task | 2D required? | 3D assets | #data |
|---|---|---|---|---|
| LEO-align | object captioning | ✗ | Objaverse | 660k |
| | object referring† | ✗ | ScanNet + 3RScan | 354k |
| | scene captioning† | ✗ | 3RScan | 20k |
| LEO-instruct | 3D captioning | ✗ | ScanNet | 37k |
| | 3D QA† | ✗ | ScanNet + 3RScan | 83k |
| | 3D dialogue† | ✗ | 3RScan | 11k |
| | task planning† | ✗ | 3RScan | 14k |
| | navigation | ✓ | MP3D | 60k |
| | manipulation | ✓ | CLIPort | 300k |
[2024.07] We release a few EAI data examples for demonstration purposes.
[2024.05] LEO is accepted by ICML 2024.
[2024.04] We release the scripts for inference and scaling law analysis, model weights, and training code of EAI tasks.
[2024.03] We release the code and data. The embodied AI (EAI) tasks (navigation and manipulation) need further organization and will be released soon.
[2024.01] We release a Huggingface interactive demo. Chat with LEO and enjoy yourself.
- Clone the GitHub repo.
git clone [email protected]:embodied-generalist/embodied-generalist.git
cd embodied-generalist
- Create a `conda` environment and install dependencies.
conda create -n leo python=3.9
conda activate leo
# install PyTorch, take our version for example
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
# install other dependencies with pip
pip install -r requirements.txt
# install peft separately to escape its install_requires
pip install peft==0.5.0 --no-deps
- Install third-party libraries (for point cloud backbones). Note that if the installation of `PointNext` fails, you can either 1) comment out the line importing `PointNext` in `model/pcd_backbone.py`, or 2) download the compiled file and place it at `model/pointnext/cpp/pointnet2_batch/`, which may help.
cd model
# default PointNet++
cd pointnetpp
python setup.py install
cd ..
# optional: PointNext (if you want to substitute the default PointNet++)
cd pointnext/cpp/pointnet2_batch
python setup.py build_ext --inplace
cd ../../../
cd ..
# sanity check
python -c 'from model.pointnetpp.pointnetpp import PointNetPP'
# for PointNext, run 'from model.pointnext.pointnext import PointNext'
- Go through the Task and Data and Model Weights sections, and you are ready to run.
Data preparation. The data includes two components: scan data and language annotations.
- Scan data. To simplify preparation and save storage, we streamline the scan data (point clouds and instance segments), which takes less than 10GB yet is already sufficient for experiments with LEO. You can download the compressed files from the links below and arrange the data according to the scan data structure illustrated below (a quick loading check follows the structure).
- ScanNet: pcd_with_global_alignment, mask (Mask3D proposals).
- 3RScan: 3RScan-ours-align.
- Cap3D. Please refer to Cap3D data for preparing the point clouds, where we use `pcs_pt`. The corresponding annotation file (`Cap3D_automated_Objaverse_no3Dword.csv`) is included in our released annotations.
# scan data structure
├── ${scannet_base}
│   ├── scan_data
│   │   └── pcd_with_global_alignment
│   │       ├── ${scan_id}.pth
│   └── mask
│       ├── ${scan_id}.mask.npz
├── ${rscan_base}
│   └── 3RScan-ours-align
│       ├── ${scan_id}
│       │   ├── pcds.pth
│       │   ├── pcd-align.pth
│       │   └── inst_to_label.pth
└── ${cap3d_root}
    ├── Cap3D_pcs_pt
    │   ├── ${obj_id}.pt
    └── Cap3D_automated_Objaverse_no3Dword.csv   # included in annotations
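Once the downloads are arranged as above, you can quickly verify that the scan data is in place. The sketch below only shows the loading step; the paths and scan id are placeholders, and the internal layout of the loaded objects is not assumed here, so just inspect what gets printed.

```python
# Minimal check that the streamlined scan data is arranged as expected.
# scannet_base and scan_id are hypothetical placeholders; adjust to your setup.
import numpy as np
import torch

scannet_base = "/path/to/scannet_base"   # placeholder path
scan_id = "scene0000_00"                 # placeholder ScanNet scan id

# point cloud with global alignment applied
pcd = torch.load(f"{scannet_base}/scan_data/pcd_with_global_alignment/{scan_id}.pth")
print(type(pcd))  # inspect the stored structure

# Mask3D instance proposals for the same scan
mask = np.load(f"{scannet_base}/mask/{scan_id}.mask.npz")
print(mask.files)  # list the arrays stored in the archive
```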
- Language annotations. The annotations are split into two parts according to the training stage. We provide a compressed file that bundles all the annotations, which should be organized in the following structure (a quick sanity check follows the tree):
# annotations structure
├── ${alignment_base}
│   ├── obj_caption -> ${cap3d_root}
│   │   ├── Cap3D_pcs_pt
│   │   │   ├── ${obj_id}.pt
│   │   └── Cap3D_automated_Objaverse_no3Dword.csv
│   ├── obj_scene_caption
│   │   ├── 3rscan_prompted.json
│   │   ├── 3rscan_scanscribe.json
│   │   ├── scannet_referit3d_nr3d_train.json
│   │   └── scannet_referit3d_sr3d+_train.json
│   └── scene_caption
│       ├── 3rscan_scenecap_train.json
│       └── 3rscan_scenecap_val.json
└── ${instruction_base}
    ├── scan2cap
    │   ├── scanrefer_train.json
    │   ├── scanrefer_val.json
    │   └── scanrefer_corpus.json
    ├── scanqa
    │   ├── ScanQA_v1.0_train.json
    │   └── ScanQA_v1.0_val.json
    ├── sqa3d
    │   ├── v1_balanced_questions_train_scannetv2.json
    │   ├── v1_balanced_questions_val_scannetv2.json
    │   ├── v1_balanced_questions_test_scannetv2.json
    │   ├── v1_balanced_sqa_annotations_train_scannetv2.json
    │   ├── v1_balanced_sqa_annotations_val_scannetv2.json
    │   ├── v1_balanced_sqa_annotations_test_scannetv2.json
    │   └── axisAlignment.pth
    ├── 3rscanqa
    │   ├── 3rscan_qa_train.json
    │   └── 3rscan_qa_val.json
    ├── dialogue
    │   ├── 3rscan_dialog_train.json
    │   └── 3rscan_dialog_val.json
    └── planning
        ├── 3rscan_plan_train.json
        └── 3rscan_plan_val.json
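As a quick sanity check after unpacking, you can count the entries in a few of the instruction-tuning files. This is only a sketch: the `instruction_base` path is a placeholder, and nothing is assumed about the per-file schema beyond each file being valid JSON.

```python
# Count entries in a few annotation files to confirm they unpacked correctly.
import json
from pathlib import Path

instruction_base = Path("/path/to/instruction_base")  # placeholder path

for rel in [
    "scanqa/ScanQA_v1.0_train.json",
    "3rscanqa/3rscan_qa_train.json",
    "dialogue/3rscan_dialog_train.json",
    "planning/3rscan_plan_train.json",
]:
    with open(instruction_base / rel) as f:
        data = json.load(f)
    print(f"{rel}: {len(data)} entries")
```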
Data configurations. After data preparation, check `configs/data/default.yaml` to update the paths, including `scan_family_base`, `rscan_base`, `alignment_base`, and `instruction_base`.
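A small sketch like the following can confirm that the configured roots point at existing directories before launching. It assumes the four keys sit at the top level of `configs/data/default.yaml`; adjust the lookups if they are nested differently in your copy.

```python
# Verify the data roots configured in configs/data/default.yaml.
# Assumes the four keys are top-level entries of the YAML file.
import os
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/data/default.yaml")
for key in ["scan_family_base", "rscan_base", "alignment_base", "instruction_base"]:
    path = cfg.get(key)
    status = "ok" if path is not None and os.path.isdir(str(path)) else "MISSING"
    print(f"{key}: {path} [{status}]")
```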
Dataloaders. The per-task dataset implementations are in `data/datasets.py`, where `LeoMix` aggregates the various datasets into the training dataset.
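Conceptually, mixing per-task datasets looks like the sketch below. This is only an illustration of the idea, not the actual `LeoMix` implementation in `data/datasets.py`, which handles task-specific sampling and collation.

```python
# Conceptual sketch of mixing several task datasets into one training set.
# The real LeoMix in data/datasets.py may sample and collate differently.
from torch.utils.data import ConcatDataset, DataLoader, Dataset

class ToyTaskDataset(Dataset):
    """Stand-in for a per-task dataset such as 3D QA or scene captioning."""
    def __init__(self, name, size):
        self.name, self.size = name, size

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        return {"task": self.name, "index": idx}

mix = ConcatDataset([ToyTaskDataset("scanqa", 100), ToyTaskDataset("scene_caption", 50)])
loader = DataLoader(mix, batch_size=4, shuffle=True)
batch = next(iter(loader))
print(len(mix), batch["task"])
```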
EAI. We release a small subset of the EAI tasks with a few data examples for demonstration purposes. You can download it here. We recommend putting the extracted folders (`mp3d_objnav` and `cliport`) directly inside the `instruction_base` path. Though testing in the simulator is not incorporated yet, the data is ready for the training and validation of EAI tasks.
Pretrained weights to load.
- LLM: Vicuna-7B. We use Vicuna v1.1 from FastChat, which you can also refer to for access to Vicuna-13B or more advanced versions. Remember to update `cfg_path` in `configs/llm/*.yaml`.
- Point cloud backbone: PointNet++, PointBERT. We have not tried `PointNext`, but everything is ready except the pretrained weights. Remember to update `path` in `configs/vision3d/backbone/*.yaml`.
Trained LEO weights. We release two checkpoints here:
- `align.pth`: the checkpoint after the alignment stage, trained with LoRA.
- `sft_noact.pth`: the checkpoint after the instruction tuning stage, based on `align.pth` and tuned without embodied acting tasks.
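Before wiring a released checkpoint into training or inference, it can help to peek at its contents. The sketch below assumes the checkpoint is a folder containing `pytorch_model.bin` (as expected by `pretrained_ckpt_path` later in this README); the path itself is a placeholder.

```python
# Inspect a released LEO checkpoint (path is a placeholder; adjust to your download).
import torch

state = torch.load("/path/to/sft_noact.pth/pytorch_model.bin", map_location="cpu")
print(f"{len(state)} entries in the state dict")
for name, value in list(state.items())[:5]:
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(name, shape)
```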
Training. The training pipeline is implemented in `trainer/leo_trainer.py`. Make sure the config file `configs/default.yaml` is properly set up before running.
- General setup. We use `wandb` as the default experiment logger. Remember to modify `logger.entity` to your account and initialize `wandb` (a quick login check is sketched after this list). Modify `name`, `note`, and `base_dir` for proper experiment output.
- Model. The components of `LeoAgent` can be configured in `configs/llm`, `configs/vision2d`, and `configs/vision3d`.
- Task. You can configure the tasks by specifying a `yaml` in `configs/task`. You can also run new tasks by creating similar configs.
- GPU usage. We run the experiments on NVIDIA A100-80GB and A800-80GB GPUs. Modify the `dataloader` arguments for your GPUs if necessary.
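Since `wandb` is the default logger, a quick check like the one below can confirm you are logged in under the right account before launching. This is only a sketch; the actual logger setup is driven by `logger.entity` in the config.

```python
# Quick check that wandb is logged in before training starts.
import wandb

wandb.login()  # prompts for an API key if not already logged in
print("default entity:", wandb.Api().default_entity)
```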
We provide running scripts in `scripts/`, covering two-stage training and evaluation. The core is to run `launch.py` with proper arguments. There are three launch modes:
# python launch
python launch.py --mode python --config configs/default.yaml <HYDRA_CONFIG>
# accelerate launch
python launch.py --mode accelerate --config configs/default.yaml <HYDRA_CONFIG>
# SLURM submitit launch, default
python launch.py --mode submitit --config configs/default.yaml <HYDRA_CONFIG>
# for example, run alignment with submitit
# arguments: --name = job name, --qos = QoS, --time = job duration (hours),
#            --partition = node type, --mem_per_gpu = memory per GPU;
#            task and note are hydra overrides (cfg.task selects the task, cfg.note names the exp_dir)
python launch.py --mode submitit \
    --config configs/default.yaml \
    --name leo_tuning \
    --qos lv0b \
    --time 48 \
    --num_nodes 1 \
    --partition HGX \
    --gpu_per_node 4 \
    --mem_per_gpu 100 \
    --port 2050 \
    task=align \
    note=align_lora
Inference. We provide an inference script, `scripts/inference.sh`, which runs a different Python script, `inference.py`, in `python` mode by default:
# single-GPU python-mode launch
python launch.py --mode python \
--run_file inference.py \
--config configs/default.yaml \
note=tuning_noact \
    pretrained_ckpt_path=null
Modify the `probe` arguments in `configs/default.yaml` to customize the inputs for inference. You can select a checkpoint by specifying either `note` or `pretrained_ckpt_path`. For the former, `note` should match the `note` used for the training `exp_dir`. For the latter, point `pretrained_ckpt_path` to a checkpoint folder that contains `pytorch_model.bin`.
Launch mode. For an explanation of the launch arguments, run `python launch.py --help`. Refer to SLURM submitit or Accelerate for more information.
We manually modify some methods of `accelerate.Accelerator` in `common/misc.py`, including `gather_for_metrics` (fix gathering non-tensor objects), `get_state_dict` (save only learnable parameters when calling `save_state`), and `prepare_scheduler` (fix behavior with gradient accumulation).
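For reference, the general pattern for overriding an `Accelerator` method looks like the sketch below. This is a simplified illustration of monkey-patching, not the actual fixes shipped in `common/misc.py`.

```python
# Illustration of the monkey-patching pattern (simplified; see common/misc.py
# for the real fixes). Here, non-tensor inputs bypass the original gather.
import torch
from accelerate import Accelerator

_orig_gather_for_metrics = Accelerator.gather_for_metrics

def patched_gather_for_metrics(self, input_data):
    if isinstance(input_data, torch.Tensor):
        return _orig_gather_for_metrics(self, input_data)
    # fall back: return non-tensor objects untouched instead of raising
    return input_data

Accelerator.gather_for_metrics = patched_gather_for_metrics
```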
@inproceedings{huang2023embodied,
title={An Embodied Generalist Agent in 3D World},
author={Huang, Jiangyong and Yong, Silong and Ma, Xiaojian and Linghu, Xiongkun and Li, Puhao and Wang, Yan and Li, Qing and Zhu, Song-Chun and Jia, Baoxiong and Huang, Siyuan},
booktitle={Proceedings of the International Conference on Machine Learning (ICML)},
year={2024}
}