Unlock Semantic 3D from Images with NVIDIA's Large Spatial Model (LSM)
Want to reconstruct stunning 3D scenes from just a couple of images? NVIDIA's Large Spatial Model (LSM) offers an end-to-end pipeline from unposed images to semantic 3D reconstruction. Read on to discover how to leverage this powerful tool for your projects: this guide walks you through installation, data preparation, training, and inference.
What is a Large Spatial Model (LSM)?
A Large Spatial Model (LSM) is a cutting-edge technology designed to generate semantic 3D models directly from unposed images. Imagine turning ordinary 2D photos of indoor spaces into rich, navigable 3D environments. This technology opens doors for various applications like robotics, virtual reality, and architectural design.
Key Features of LSM
- End-to-End Reconstruction: Convert unposed images directly into semantic 3D models.
- Semantic Understanding: Segments the reconstructed 3D scene into meaningful objects.
- Flexibility: Trained for indoor scenes using datasets like ScanNet and ScanNet++.
Get Started: A Step-by-Step Guide to LSM
Ready to dive in? Here’s how to install and use NVIDIA's LSM.
1. Installation: Setting Up Your Environment
Before you begin, ensure you have the necessary environment set up.
- Clone the Repository:
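  A minimal sketch, assuming the code lives in the NVlabs/LSM repository on GitHub (verify the URL against the paper's project page):

  ```bash
  # Clone the LSM source tree and enter it (URL assumed; see the project page)
  git clone https://github.com/NVlabs/LSM.git
  cd LSM
  ```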
- Create a Conda Environment:
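  For example, a dedicated environment (the Python version here is an assumption; use whatever the repository README pins):

  ```bash
  # Create and activate an isolated environment for LSM
  conda create -n lsm python=3.10 -y
  conda activate lsm
  ```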
- Install PyTorch and Related Packages:
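  One common way to install a CUDA-enabled PyTorch build (the CUDA version is an assumption; match it to your driver and the repository README):

  ```bash
  # Install PyTorch with CUDA 11.8 wheels (adjust cu118 to your setup)
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  ```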
- Install Other Python Dependencies:
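  Assuming the repository ships a standard requirements file:

  ```bash
  # Install the remaining Python dependencies listed by the repo
  pip install -r requirements.txt
  ```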
- Install PointTransformerV3:
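  PointTransformerV3 comes from the upstream Pointcept project; how LSM vendors it is repo-specific, but fetching it from its official repository looks like this (check the LSM README for the expected location):

  ```bash
  # Fetch the official PointTransformerV3 code (integration path assumed)
  git clone https://github.com/Pointcept/PointTransformerV3.git
  ```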
- Install 3D Gaussian Splatting Modules:
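  The Gaussian splatting CUDA extensions come from the original 3D Gaussian Splatting codebase; installing them straight from upstream is one option (these are the upstream URLs, not necessarily the exact versions LSM pins):

  ```bash
  # Differentiable rasterizer and k-nearest-neighbor CUDA modules
  pip install git+https://github.com/graphdeco-inria/diff-gaussian-rasterization.git
  pip install git+https://gitlab.inria.fr/bkerbl/simple-knn.git
  ```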
- Install OpenAI CLIP:
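  OpenAI's CLIP installs directly from its official GitHub repository:

  ```bash
  # Install CLIP from the official repository
  pip install git+https://github.com/openai/CLIP.git
  ```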
- Build the croco Model:
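  LSM inherits the croco backbone from DUSt3R, whose RoPE CUDA kernels are compiled in place; the path below follows the DUSt3R layout and may differ in LSM:

  ```bash
  # Build the cuRoPE extension used by the croco encoder (path assumed)
  cd croco/models/curope
  python setup.py build_ext --inplace
  cd ../../..
  ```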
2. Download Pre-trained Models
Download the necessary model weights to get started quickly.
- Create Checkpoints Directory:
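  A one-line setup:

  ```bash
  # Directory where all pre-trained weights will live
  mkdir -p checkpoints
  ```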
- Download DUSt3R Model Weights:
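  The standard DUSt3R release checkpoint can be fetched directly; this is the upstream NAVER LABS download link, so confirm it matches the variant LSM expects:

  ```bash
  # Download the DUSt3R ViT-Large checkpoint into checkpoints/
  wget -P checkpoints https://download.europe.naverlabs.com/ComputerVision/DUSt3R/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth
  ```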
- Download LSeg Demo Model Weights:
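  The LSeg demo checkpoint (`demo_e200.ckpt`) is distributed by the lang-seg project via Google Drive; the file ID below is a placeholder to replace with the one given in the README:

  ```bash
  # Download demo_e200.ckpt (replace <FILE_ID> with the real Google Drive ID)
  gdown <FILE_ID> -O checkpoints/demo_e200.ckpt
  ```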
- Download LSM Final Checkpoint:
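  The final LSM weights are published alongside the repository; the URL below is a placeholder for the actual release link:

  ```bash
  # Download the LSM checkpoint (substitute the release URL from the repo)
  wget -P checkpoints <LSM_CHECKPOINT_URL>
  ```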
3. Data Preparation: Fueling Your LSM
To effectively train and test your Large Spatial Model, proper data preparation is key.
- For Training:
  - Datasets: ScanNet and ScanNet++ are supported; both require a signed usage agreement for access.
  - Details: Refer to the `data_process/data.md` file in the repository for detailed instructions.
- For Testing:
  - See `data_process/data.md` for test dataset information.
4. Training Your Model
Once your data is ready, initiate the training process to fine-tune your model.
- Command: Run the training script (a sketch follows this list).
- Output Directory: Training results are saved to `checkpoints/output` by default.
- Optional Parameters: Use `--output_dir` to specify a custom directory.
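A minimal sketch of the launch command; the script name is an assumption (check the repository for the actual entry point), while `--output_dir` is the flag documented above:

```bash
# Start training; results land in checkpoints/output unless overridden
python train.py --output_dir checkpoints/output
```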
5. Inference: Reconstructing 3D Scenes
With a trained model, you can now infer 3D scenes from images.
- Data Preparation:
  - Place two indoor scene images in a directory.
  - Example directory structure:

    ```
    demo_images/
    └── indoor/
        ├── scene1/
        │   ├── image1.jpg
        │   └── image2.jpg
        └── scene2/
            ├── room1.png
            └── room2.png
    ```
- Run Inference: Launch the inference script (see the sketch after this list).
- Optional Parameters:
  - `--file_list`: Specify input image paths.
  - `--output_path`: Set the output directory for Gaussian points and rendered video.
  - `--resolution`: Define the processing image resolution.
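A sketch of the launch command; the script name and the resolution value are assumptions, while the three flags are those documented above:

```bash
# Reconstruct a semantic 3D scene from two unposed images
python inference.py \
    --file_list demo_images/indoor/scene1/image1.jpg demo_images/indoor/scene1/image2.jpg \
    --output_path outputs/scene1 \
    --resolution 512
```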
Example Usage: Generating a 3D Scene
Imagine you have two images of a living room. By placing these images in the specified directory structure and running the inference script, LSM will generate a 3D representation of the room, complete with semantic understanding of the objects within it.
Acknowledgment and Citation
This project builds upon the work of many researchers and open-source projects. If you use this work, please cite the original paper.
```bibtex
@misc{fan2024largespatialmodelendtoend,
      title={Large Spatial Model: End-to-end Unposed Images to Semantic 3D},
      author={Zhiwen Fan and Jian Zhang and Wenyan Cong and Peihao Wang and Renjie Li and Kairun Wen and Shijie Zhou and Achuta Kadambi and Zhangyang Wang and Danfei Xu and Boris Ivanovic and Marco Pavone and Yue Wang},
      year={2024},
      eprint={2410.18956},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.18956},
}
```
By following this guide, you'll be well-equipped to explore the capabilities of NVIDIA's Large Spatial Model, transforming 2D images into interactive semantic 3D environments. Now you can harness the power of LSM for all your spatial understanding needs!