How to Evaluate Image Description Models with DLC-Bench: A Step-by-Step Guide
Want to know how well your image description model performs? This guide walks you through evaluating models using DLC-Bench, ensuring accurate and insightful performance metrics. Get ready to dive into the specifics of setting up your environment.
Setting Up Your Environment for DLC-Bench Evaluation
Before evaluating your image description model, you will need to complete the following setup.
Installation Steps
- Install the `dam` package: Needed to run inference with your model. Follow the general installation instructions in the main README.
- Install vLLM: This serves Llama 3.1 8B, enabling efficient evaluation of model outputs. The installation command below is specifically for CUDA 11.8 and Python 3.10.
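The exact pinned command lives in the project README; as a rough sketch, vLLM's documented pattern for CUDA 11.8 / Python 3.10 wheels looks like the following (the version number here is a placeholder, not the one the repo pins):

```bash
# Illustrative only: the vLLM version below is a placeholder; use the version pinned in the README.
export VLLM_VERSION=0.4.2
pip install "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp310-cp310-manylinux1_x86_64.whl" \
  --extra-index-url https://download.pytorch.org/whl/cu118
```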
Preparing the DLC-Bench Evaluation Data
- Download the DLC-Bench dataset: Clone the dataset from Hugging Face Datasets and place the `DLC-Bench` folder in the appropriate directory, as sketched below.
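A minimal sketch of the download, assuming the dataset is published under the `nvidia` organization on Hugging Face; confirm the exact repository path and target directory in the README:

```bash
# Assumed dataset location; verify the repo path before cloning.
git lfs install
git clone https://huggingface.co/datasets/nvidia/DLC-Bench
# Move or symlink the DLC-Bench folder to wherever the evaluation scripts expect it.
```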
Running the Evaluation: Step-by-Step Instructions for Evaluating Image Description Models
Once the environment is set, it is time to evaluate.
Start the vLLM Backend
- Run the vLLM backend to evaluate the generated outputs. It does not need to run on the same GPU as the model.
- Memory management: If you run vLLM and model inference on the same GPU, reduce memory usage by setting `--gpu-memory-utilization 0.5`.
- Start command: The command sketched below starts a vLLM backend on GPU 1.
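A minimal sketch of such a start command, assuming the judge is served as `meta-llama/Llama-3.1-8B-Instruct` through vLLM's OpenAI-compatible server on the default port; check the README for the exact model ID and flags:

```bash
# Pin the server to GPU 1; keep --gpu-memory-utilization 0.5 only when sharing the GPU with inference.
CUDA_VISIBLE_DEVICES=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.5 \
  --port 8000
```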
Obtain Model Outputs with get_model_outputs.py
- Download the model checkpoint: Save the checkpoint to `../checkpoints/`.
- Run inference: Use `get_model_outputs.py` to run the model and cache the outputs (stored in `model_outputs_cache/`); a sketch follows this list. Note: the `nvidia/DAM-3B` model includes prompt augmentation, which can affect benchmark performance.
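The script name comes from the repository, but the flag below is an illustrative assumption; run the script with `--help` to see its real interface:

```bash
# Hypothetical invocation: --model is an assumed flag, not confirmed by the repo.
python get_model_outputs.py --model nvidia/DAM-3B
# Outputs are written to model_outputs_cache/ and reused on later runs.
```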
Evaluate Model Outputs with eval_model_outputs.py
- Run the evaluation script: Use `eval_model_outputs.py` to evaluate the cached model outputs (a sketch follows this list).
- Cache management: The `model_outputs_cache/` directory stores the reference cache generated by `get_model_outputs.py` and `eval_model_outputs.py`. You can remove cache files to re-run the evaluation.
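A sketch of the evaluation step, assuming the script picks up the cached outputs and the running vLLM backend on its own; any additional flags are documented in the script's `--help`:

```bash
# Scores the cached outputs against DLC-Bench using the vLLM-served judge.
python eval_model_outputs.py
# Delete entries under model_outputs_cache/ to force a clean re-run.
```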
Reference results: `Summary (Pos Neg Avg(Pos, Neg)): 0.510, 0.830, 0.670`
Evaluating Your Own Image Description Model Using DLC-Bench
To evaluate your own model, use the `eval_model_outputs.py` script and follow the cache format used in `model_outputs_cache/`. Standardize your model's outputs to match the expected format to ensure compatibility with the evaluation script and an accurate assessment.
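One practical way to learn the expected format is to inspect the reference cache that ships with the evaluation; the command below is only a sketch, since the cache layout is defined by the scripts themselves:

```bash
# Explore the reference cache to see the structure your model's outputs must match.
ls -R model_outputs_cache/
```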