Unlock Medical LLM Mastery: A Quick Guide to MedAgentBench

Are you ready to revolutionize your medical Large Language Model (LLM) agent's performance? MedAgentBench provides a realistic virtual Electronic Health Record (EHR) environment designed to benchmark and refine your medical LLM agents. This guide will give you a rapid, actionable plan to get started.

Why MedAgentBench? Elevate Your Medical LLM Agent

Realistic Simulations: Mimics real-world EHR scenarios.
Comprehensive Benchmarking: Rigorously tests your agent's abilities.
Actionable Insights: Identifies areas for improvement and optimization.

By using MedAgentBench, you gain invaluable data and insights to enhance the capabilities and reliability of your medical LLM agents.

Quick Start: Evaluate Your Agent in 5 Simple Steps

This section walks you through the essential steps to evaluate your agent with MedAgentBench, so you can see where its medical reasoning needs work. Let's get started.

Step 1: Essential Prerequisites – Get Ready to Run

Clone the Repository:

```
git clone [repository URL]
cd MedAgentBench
```

Create Environment:

```
conda create -n medagentbench python=3.9
conda activate medagentbench
pip install -r requirements.txt
```

Docker Installation: Ensure Docker is installed and running correctly.

These are the foundational steps, ensuring your environment is set up for successful agent evaluation.
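
If you would like to confirm the prerequisites before moving on, they can be checked from Python. This is a minimal sketch and not part of MedAgentBench: the helper names `missing_tools` and `docker_daemon_running` are hypothetical, and the sketch assumes that git, conda, and docker are all expected on your PATH.

```python
import shutil
import subprocess

# Tools the steps above rely on (an assumption; adjust to match your setup).
REQUIRED_TOOLS = ("git", "conda", "docker")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def docker_daemon_running():
    """Return True if `docker info` exits successfully (daemon reachable)."""
    try:
        result = subprocess.run(
            ["docker", "info"], capture_output=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        # Docker CLI missing, not executable, or not responding in time.
        return False
    return result.returncode == 0


# Example usage:
#   print("Missing tools:", missing_tools())
#   print("Docker daemon running:", docker_daemon_running())
```

An empty list from `missing_tools()` only shows that the executables exist; `docker_daemon_running()` is the part that checks whether Docker is actually running.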

Step 2: Docker Setup – Launch the FHIR Server

Pull the Docker Image:

```
docker pull jyxsu6/medagentbench:latest
docker tag jyxsu6/medagentbench:latest medagentbench
docker run -p 8080:8080 medagentbench
```

Verification: After the console displays “Started Application…,” navigate to http://localhost:8080/ to confirm the FHIR server console is active.
Download Reference Solution: Download refsol.py from the repository and save it as src/server/tasks/medagentbench/refsol.py.

This step spins up the necessary FHIR server, a critical component of the MedAgentBench environment.
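
Rather than watching the console, you can poll the server until it responds. The sketch below assumes only the URL mentioned above (http://localhost:8080/); the function name `wait_for_server` and the timeout values are illustrative choices, not something MedAgentBench provides.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://localhost:8080/", timeout_s=120.0, interval_s=2.0):
    """Poll `url` until it answers successfully or `timeout_s` elapses.

    Returns True as soon as a request succeeds, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if 200 <= response.status < 400:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused or an HTTP error: the server is not ready yet.
            pass
        time.sleep(interval_s)
    return False


# Example usage:
#   if wait_for_server():
#       print("FHIR server is up")
#   else:
#       print("FHIR server did not respond in time")
```

A successful response only shows that something is answering on port 8080, so it is still worth opening the page once to confirm that it is the FHIR server console.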

Step 3: Configure Your Agent – Connect to the Power Source

OpenAI API Key: Input your OpenAI API key into configs/agents/openai-chat.yaml. Get your key from the OpenAI platform.
Alternative Models: For Gemini or Claude models accessed through Vertex AI, run gcloud auth print-access-token to obtain your access token.
Agent Testing: Run python -m src.client.agent_test to verify that the agent is configured correctly.

This step ensures your agent's model is reachable and correctly configured before any benchmark tasks run.
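
If you script your setup, the access token can be captured programmatically instead of being copied by hand. The following is a sketch under assumptions: `gcloud_access_token` is a hypothetical helper, it requires the gcloud CLI to be installed and authenticated, and where the token goes in your agent config depends on the config file you use, which this guide does not specify.

```python
import subprocess


def gcloud_access_token(run=subprocess.run):
    """Return the token printed by `gcloud auth print-access-token`.

    Raises subprocess.CalledProcessError if gcloud exits with an error,
    and FileNotFoundError if the gcloud CLI is not installed.
    """
    result = run(
        ["gcloud", "auth", "print-access-token"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Example usage (avoid printing the token itself; it is a credential):
#   token = gcloud_access_token()
#   print("Token length:", len(token))
```

These access tokens are short-lived, so expect to fetch a fresh one for a later evaluation run.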

Step 4: Task Server Activation – Initiate the Process

Automated Script: Execute the following command to launch task workers:
```
python -m src.start_task -a
```
Port Availability: Ensure ports 5000-5015 are available.
Wait for Completion: Allow approximately 1 minute for task setup; look for ".... 200 OK" in the terminal.

This step activates the task workers, which are the engines that will execute the evaluation tasks.
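
Before launching the workers, you can check that ports 5000-5015 are free. This is a minimal sketch: `ports_in_use` is a hypothetical helper, and it assumes the workers listen on the local machine, so it tries to bind each port on 127.0.0.1.

```python
import socket


def ports_in_use(ports=range(5000, 5016), host="127.0.0.1"):
    """Return the ports in `ports` that cannot currently be bound on `host`."""
    busy = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                # Bind failed: another process is most likely using the port.
                busy.append(port)
    return busy


# Example usage:
#   busy = ports_in_use()
#   if busy:
#       print("Free these ports before starting the task workers:", busy)
#   else:
#       print("Ports 5000-5015 are available")
```

Run the check before starting the task workers; once they are up, the same call reports their ports as busy, which is expected.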

Step 5: Task Assignment and Result Retrieval – Put Your Agent to the Test

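Start Tasks: Initiate the task tests by running the assigner (typically python -m src.assigner; check the repository README if the entry point differs).
Retrieve Results: For the gpt-4o-mini agent, results are saved to outputs/MedAgentBenchv1/gpt-4o-mini/medagentbench-std/overall.json.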

This is the culmination of your setup, where you'll finally see the performance metrics of your agent.
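
To inspect the results without opening the file by hand, you can load and pretty-print it. The sketch below assumes only that overall.json is valid JSON at the path given above; it makes no assumption about the fields inside, and the helper names `load_results` and `summarize` are hypothetical.

```python
import json
from pathlib import Path

# Path from the step above; it will differ if you evaluate another agent.
RESULTS_PATH = Path(
    "outputs/MedAgentBenchv1/gpt-4o-mini/medagentbench-std/overall.json"
)


def load_results(path=RESULTS_PATH):
    """Parse the results file and return its contents as Python objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize(results, indent=2):
    """Return the results as a readable JSON string with sorted keys."""
    return json.dumps(results, indent=indent, sort_keys=True)


# Example usage:
#   print(summarize(load_results()))
```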

Next Steps: Data-Driven Optimization

Analyze the results to identify areas where your medical LLM agent needs improvement. Refine your agent’s models, prompts, and strategies based on the data from MedAgentBench to continuously improve its performance within a realistic EHR environment.