How to Evaluate Image Description Models with DLC-Bench: A Step-by-Step Guide
Want to know how well your image description model performs? This guide walks you through evaluating models using DLC-Bench, ensuring accurate and insightful performance metrics. Get ready to dive into the specifics of setting up your environment.
Setting Up Your Environment for DLC-Bench Evaluation
Before evaluating your image description model, you will need to complete the following setup.
Installation Steps
- Install the `dam` package: Needed to run inference with your model. Follow the general installation instructions in the main README.
- Install vLLM: This serves Llama 3.1 8B, enabling efficient evaluation of model outputs. The installation command below is specifically for CUDA 11.8 and Python 3.10.
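The exact pinned command lives in the project README; as a rough sketch, vLLM's documented pattern for CUDA 11.8 / Python 3.10 wheels looks like the following (the version number here is a placeholder, not the one the repo pins):

```bash
# Illustrative only: the vLLM version below is a placeholder; use the version pinned in the README.
export VLLM_VERSION=0.4.2
pip install "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp310-cp310-manylinux1_x86_64.whl" \
  --extra-index-url https://download.pytorch.org/whl/cu118
```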
Preparing the DLC-Bench Evaluation Data
- Download the DLC-Bench dataset: Clone the dataset from Hugging Face Datasets and place the `DLC-Bench` folder in the appropriate directory, as sketched below.
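A minimal sketch of the download, assuming the dataset is published under the `nvidia` organization on Hugging Face; confirm the exact repository path and target directory in the README:

```bash
# Assumed dataset location; verify the repo path before cloning.
git lfs install
git clone https://huggingface.co/datasets/nvidia/DLC-Bench
# Move or symlink the DLC-Bench folder to wherever the evaluation scripts expect it.
```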
Running the Evaluation: Step-by-Step Instructions for Evaluating Image Description Models
Once the environment is set, it is time to evaluate.
Start the vLLM Backend
- Run the vLLM backend to evaluate the generated outputs. It does not need to run on the same GPU as the model.
- Memory management: If you run vLLM and model inference on the same GPU, reduce memory usage by setting `--gpu-memory-utilization 0.5`.
- Start command: The command sketched below starts a vLLM backend on GPU 1.
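A minimal sketch of such a start command, assuming the judge is served as `meta-llama/Llama-3.1-8B-Instruct` through vLLM's OpenAI-compatible server on the default port; check the README for the exact model ID and flags:

```bash
# Pin the server to GPU 1; keep --gpu-memory-utilization 0.5 only when sharing the GPU with inference.
CUDA_VISIBLE_DEVICES=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.5 \
  --port 8000
```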
Obtain Model Outputs with get_model_outputs.py
- Download the model checkpoint: Save the checkpoint to `../checkpoints/`.
- Run inference: Use `get_model_outputs.py` to run the model and cache the outputs (stored in `model_outputs_cache/`); a sketch follows this list. Note: the `nvidia/DAM-3B` model includes prompt augmentation, which can affect benchmark performance.
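The script name comes from the repository, but the flag below is an illustrative assumption; run the script with `--help` to see its real interface:

```bash
# Hypothetical invocation: --model is an assumed flag, not confirmed by the repo.
python get_model_outputs.py --model nvidia/DAM-3B
# Outputs are written to model_outputs_cache/ and reused on later runs.
```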
Evaluate Model Outputs with eval_model_outputs.py
- Run the evaluation script: Use `eval_model_outputs.py` to evaluate the cached model outputs (a sketch follows this list).
- Cache management: The `model_outputs_cache/` directory stores the reference cache generated by `get_model_outputs.py` and `eval_model_outputs.py`. You can remove cache files to re-run the evaluation.
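A sketch of the evaluation step, assuming the script picks up the cached outputs and the running vLLM backend on its own; any additional flags are documented in the script's `--help`:

```bash
# Scores the cached outputs against DLC-Bench using the vLLM-served judge.
python eval_model_outputs.py
# Delete entries under model_outputs_cache/ to force a clean re-run.
```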
Reference results: `Summary (Pos Neg Avg(Pos, Neg)): 0.510, 0.830, 0.670`
Evaluating Your Own Image Description Model Using DLC-Bench
To evaluate your own model, use the `eval_model_outputs.py` script and follow the cache format used in `model_outputs_cache/`. Standardize your model's outputs to match the expected format to ensure compatibility with the evaluation script and an accurate assessment.
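One practical way to learn the expected format is to inspect the reference cache that ships with the evaluation; the command below is only a sketch, since the cache layout is defined by the scripts themselves:

```bash
# Explore the reference cache to see the structure your model's outputs must match.
ls -R model_outputs_cache/
```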