Unlock the Power of Medical AI: Introducing MedAgentBench for LLM Evaluation
Are you ready to evaluate the potential of medical AI agents in a realistic environment? MedAgentBench is the answer: a virtual EHR (Electronic Health Record) world where you can benchmark the performance of Large Language Models (LLMs) in healthcare scenarios. This guide provides a clear path to implementing and using MedAgentBench.
Why MedAgentBench Matters for Medical AI Innovation
MedAgentBench offers a crucial testing ground for medical AI agents. It allows researchers and developers to:
- Realistically simulate EHR interactions for comprehensive AI evaluation.
- Benchmark different LLMs against standardized medical tasks.
- Advance the development of reliable and effective AI-powered healthcare tools.
Quickstart: Evaluating LLMs with MedAgentBench
Ready to dive in? Here's a simplified, step-by-step guide to evaluating your LLM, such as `gpt-4o-mini`, on MedAgentBench.
Step 1: Lay the Foundation - Prerequisites and Setup
- Clone the MedAgentBench repository.
- Create a dedicated environment and install the dependencies.
- Ensure Docker is installed and running. Docker is essential for setting up the FHIR server.
- Download and run the FHIR server Docker image. Running the server in a container keeps your test environment stable and reproducible.
- Verify the setup in your browser at `http://localhost:8080/`. A FHIR server console should be visible.
- Retrieve `refsol.py`: obtain this crucial file from the designated location and save it as `src/server/tasks/medagentbench/refsol.py`.
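A minimal shell sketch of these steps is shown below. The repository URL, Docker image name, and Python version are assumptions based on the public MedAgentBench release, and the `refsol.py` source path is a placeholder; check the project README for the exact values.

```bash
# Clone the repository (URL assumed; see the project README) and enter it
git clone https://github.com/stanfordmlgroup/MedAgentBench.git
cd MedAgentBench

# Create an isolated environment and install dependencies
conda create -n medagentbench python=3.9 -y
conda activate medagentbench
pip install -r requirements.txt

# Pull and run the prebuilt FHIR server image (image name assumed),
# exposing the console at http://localhost:8080/
docker pull jyxsu6/medagentbench:latest
docker run -d -p 8080:8080 jyxsu6/medagentbench:latest

# Save refsol.py (obtained from the designated location; the source path
# here is a placeholder) where the task server expects it
cp /path/to/refsol.py src/server/tasks/medagentbench/refsol.py
```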
Step 2: Configure Your AI Agent
- OpenAI API Key: Locate the `configs/agents/openai-chat.yaml` file and enter your OpenAI API key. Obtain your key from the OpenAI platform.
- Alternative Models (Gemini, Claude): If using models like Gemini or Claude on Vertex AI, get your access token by running `gcloud auth print-access-token` in your terminal.
- Agent Testing: Verify your agent configuration with the test command sketched below. Modify the `--agent` parameter to test different models.
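Since MedAgentBench builds on the AgentBench harness, the self-test is typically run with the `agent_test` module; the flags and agent name below are assumptions, so adjust them to match your configuration.

```bash
# Smoke-test the configured agent (AgentBench-style invocation, assumed);
# swap the --agent value to test a different model
python -m src.client.agent_test --config configs/agents/openai-chat.yaml --agent gpt-4o-mini
```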
Step 3: Launch the Task Server
Simplify the task worker launch with the provided script:
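The script wraps the AgentBench task-server entry point; if you want to see what it does or run it manually, the underlying invocation is typically the following (the module path comes from the upstream AgentBench framework and is an assumption here).

```bash
# Start the task controller plus workers directly (AgentBench-style, assumed);
# the provided script automates this and spawns 20 workers
python -m src.start_task -a
```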
This script automatically launches 20 task workers. Allow approximately one minute for setup, and look for ".... 200 OK" in the terminal output before proceeding.
Troubleshooting Port Conflicts: If you encounter port conflicts (ports 5000-5015 are required), free up those ports before relaunching; on macOS, for example, the AirPlay Receiver service commonly occupies port 5000 and can be disabled in System Settings. You can identify the offending process as shown below.
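On macOS or Linux, `lsof` shows which process is listening on a given port so you can stop or reconfigure it:

```bash
# Show the process listening on port 5000; repeat for 5001-5015 as needed
lsof -i :5000
```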
Step 4: Initiate the Tasks
Start the task tests with the assigner, as sketched below. If configured correctly, tasks will now be assigned and processed.
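In AgentBench-derived setups, the assigner module dispatches tasks to the running workers; the invocation and config path below are assumptions, so confirm them against the repository.

```bash
# Assign MedAgentBench tasks to the running workers (config path assumed)
python -m src.assigner --config configs/assignments/medagentbench.yaml
```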
Step 5: Unveiling the Results
Access the evaluation results in the `outputs/MedAgentBenchv1/gpt-4o-mini/medagentbench-std/overall.json` file. Analyze this data to understand your medical LLM agent’s performance in the virtual EHR environment and to guide refinements to your models.
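For a quick look at the aggregate metrics, you can pretty-print the JSON from the shell:

```bash
# Pretty-print the aggregate results
python -m json.tool outputs/MedAgentBenchv1/gpt-4o-mini/medagentbench-std/overall.json
```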
Citing MedAgentBench
If MedAgentBench contributes to your research, please cite the following:
```bibtex
@misc{jiang2025medagentbenchrealisticvirtualehr,
  title={MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents},
  author={Yixing Jiang and Kameron C. Black and Gloria Geng and Danny Park and James Zou and Andrew Y. Ng and Jonathan H. Chen},
  year={2025},
  eprint={2501.14654},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2501.14654},
}
```
This guide helps you quickly begin using MedAgentBench to assess and enhance medical AI agents!