Unlock the Power of Medical AI: Introducing MedAgentBench for LLM Evaluation
Are you ready to evaluate the potential of medical AI agents in a realistic environment? MedAgentBench is the answer: a virtual EHR (Electronic Health Record) world where you can benchmark the performance of Large Language Models (LLMs) in healthcare scenarios. This guide provides a clear path to implementing and using MedAgentBench.
Why MedAgentBench Matters for Medical AI Innovation
MedAgentBench offers a crucial testing ground for medical AI agents. It allows researchers and developers to:
- Realistically simulate EHR interactions for comprehensive AI evaluation.
- Benchmark different LLMs against standardized medical tasks.
- Advance the development of reliable and effective AI-powered healthcare tools.
Quickstart: Evaluating LLMs with MedAgentBench
Ready to dive in? Here's a simplified, step-by-step guide to evaluating your LLM, such as `gpt-4o-mini`, on MedAgentBench.
Step 1: Lay the Foundation - Prerequisites and Setup
- Clone the MedAgentBench repository.
- Create a dedicated environment and install the dependencies.
- Ensure Docker is installed and running. Docker is essential for setting up the FHIR server.
- Download and run the FHIR server Docker image. Running the server in a container keeps your test environment stable and reproducible.
- Verify the setup in your browser at `http://localhost:8080/`. A FHIR server console should be visible.
- Retrieve `refsol.py`: obtain this crucial file from the designated location and save it as `src/server/tasks/medagentbench/refsol.py`.
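A minimal shell sketch of these steps is shown below. The repository URL, Docker image name, and Python version are assumptions based on the public MedAgentBench release, and the `refsol.py` source path is a placeholder; check the project README for the exact values.

```bash
# Clone the repository (URL assumed; see the project README) and enter it
git clone https://github.com/stanfordmlgroup/MedAgentBench.git
cd MedAgentBench

# Create an isolated environment and install dependencies
conda create -n medagentbench python=3.9 -y
conda activate medagentbench
pip install -r requirements.txt

# Pull and run the prebuilt FHIR server image (image name assumed),
# exposing the console at http://localhost:8080/
docker pull jyxsu6/medagentbench:latest
docker run -d -p 8080:8080 jyxsu6/medagentbench:latest

# Save refsol.py (obtained from the designated location; the source path
# here is a placeholder) where the task server expects it
cp /path/to/refsol.py src/server/tasks/medagentbench/refsol.py
```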
Step 2: Configure Your AI Agent
- OpenAI API Key: Locate the `configs/agents/openai-chat.yaml` file and enter your OpenAI API key. Obtain your key from the OpenAI platform.
- Alternative Models (Gemini, Claude): If using models like Gemini or Claude on Vertex AI, get your access token by running `gcloud auth print-access-token` in your terminal.
- Agent Testing: Verify your agent configuration with the test command sketched below. Modify the `--agent` parameter to test different models.
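Since MedAgentBench builds on the AgentBench harness, the self-test is typically run with the `agent_test` module; the flags and agent name below are assumptions, so adjust them to match your configuration.

```bash
# Smoke-test the configured agent (AgentBench-style invocation, assumed);
# swap the --agent value to test a different model
python -m src.client.agent_test --config configs/agents/openai-chat.yaml --agent gpt-4o-mini
```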
Step 3: Launch the Task Server
Simplify the task worker launch with the provided script:
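The script wraps the AgentBench task-server entry point; if you want to see what it does or run it manually, the underlying invocation is typically the following (the module path comes from the upstream AgentBench framework and is an assumption here).

```bash
# Start the task controller plus workers directly (AgentBench-style, assumed);
# the provided script automates this and spawns 20 workers
python -m src.start_task -a
```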
This script automatically launches 20 task workers. Allow approximately one minute for setup, and look for ".... 200 OK" in the terminal output before proceeding.
Troubleshooting Port Conflicts: If you encounter port conflicts (ports 5000-5015 are required), free up those ports before relaunching; on macOS, for example, the AirPlay Receiver service commonly occupies port 5000 and can be disabled in System Settings. You can identify the offending process as shown below.
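On macOS or Linux, `lsof` shows which process is listening on a given port so you can stop or reconfigure it:

```bash
# Show the process listening on port 5000; repeat for 5001-5015 as needed
lsof -i :5000
```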
Step 4: Initiate the Tasks
Start the task tests with the assigner, as sketched below. If configured correctly, tasks will now be assigned and processed.
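In AgentBench-derived setups, the assigner module dispatches tasks to the running workers; the invocation and config path below are assumptions, so confirm them against the repository.

```bash
# Assign MedAgentBench tasks to the running workers (config path assumed)
python -m src.assigner --config configs/assignments/medagentbench.yaml
```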
Step 5: Unveiling the Results
Access the evaluation results in the `outputs/MedAgentBenchv1/gpt-4o-mini/medagentbench-std/overall.json` file. Analyze this data to understand your medical LLM agent’s performance in the virtual EHR environment and to guide refinements to your models.
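For a quick look at the aggregate metrics, you can pretty-print the JSON from the shell:

```bash
# Pretty-print the aggregate results
python -m json.tool outputs/MedAgentBenchv1/gpt-4o-mini/medagentbench-std/overall.json
```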
Citing MedAgentBench
If MedAgentBench contributes to your research, please cite the following:
```bibtex
@misc{jiang2025medagentbenchrealisticvirtualehr,
  title={MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents},
  author={Yixing Jiang and Kameron C. Black and Gloria Geng and Danny Park and James Zou and Andrew Y. Ng and Jonathan H. Chen},
  year={2025},
  eprint={2501.14654},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2501.14654},
}
```
This guide helps you quickly begin using MedAgentBench to assess and enhance medical AI agents!