Hi!
Today is LLM Friday, a day to learn about Research and Tools on Large Language Models in medicine.
A paper published this week analyzed 106,942 interactions between 989 doctors and LLMs.
Here’s what they found:
60.2% of the questions physicians asked the LLMs were research-related, while 12.25% were clinical.
Younger physicians asked more clinical questions than senior physicians.
Female physicians asked more administrative and clinical questions.
Even though these physicians use LLMs to search for information, they have concerns about accuracy and reliability. I have the same concerns when I use publicly available LLMs. I verify every single answer to make sure it is correct.
Do you use LLMs to ask clinical or research questions? Do you trust the answer?
Now, let’s dive into today’s issue.
✨LLMs
This study developed MedAgentBench, a virtual electronic health record (EHR) environment, to test how LLM agents perform in realistic clinical scenarios.
🔬Methods
Data: 100 synthetic patient profiles with 700,000+ data elements (a toy sketch of what one record might contain follows this list).
Physicians wrote 300 clinical tasks such as diagnosis, treatment planning, documentation, and administrative functions.
Models tested: 12 Large Language Models, including Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, DeepSeek-V3, Qwen2.5, and Llama 3.3.
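For intuition, here is a toy, hypothetical Python sketch of what a tiny slice of one synthetic patient record might look like; the field names and values are my own illustration, not the paper's actual data format (MedAgentBench exposes its records through an EHR-style interface):

```python
# Toy illustration only: a small slice of what one synthetic patient profile
# might contain. The real benchmark packs 700,000+ data elements across
# 100 patients, so each record is far richer than this.
synthetic_patient = {
    "patient_id": "synthetic-001",
    "demographics": {"age": 67, "sex": "F"},
    "problem_list": ["type 2 diabetes", "hypertension"],
    "medications": [
        {"name": "metformin", "dose_mg": 500, "frequency": "BID"},
    ],
    "labs": [
        {"test": "HbA1c", "value": 8.2, "unit": "%", "date": "2024-11-02"},
    ],
    "notes": ["Follow-up visit for glycemic control."],
}

# An LLM agent reads (queries) or edits (actions) records like this one.
print(len(synthetic_patient["labs"]), "lab result(s) on file")
```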
They evaluated the models on:
Queries: A question or prompt, for example: "What is the most likely diagnosis?" or "What is the next best step in management?"
Actions: The LLM's ability to carry out a sequence of actions, like ordering tests or writing prescriptions (a minimal sketch of both task types follows below).
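To make the query vs. action distinction concrete, here is a minimal, hypothetical sketch; the task fields, function names, and exact-match scoring are my own illustration, not the benchmark's actual code:

```python
from dataclasses import dataclass, field


@dataclass
class QueryTask:
    """A question the model must answer from the patient record."""
    prompt: str            # e.g. "What is the most likely diagnosis?"
    expected_answer: str   # physician-written reference answer


@dataclass
class ActionTask:
    """A task the model completes by issuing a sequence of EHR actions."""
    instruction: str                                            # e.g. "Order a CBC"
    expected_actions: list[dict] = field(default_factory=list)  # ordered EHR calls


def score_query(task: QueryTask, model_answer: str) -> bool:
    # Toy exact-match check; a real benchmark would use rubrics or expert review.
    return model_answer.strip().lower() == task.expected_answer.strip().lower()


def score_actions(task: ActionTask, model_actions: list[dict]) -> bool:
    # Succeeds only if every expected EHR call is issued with the right arguments,
    # one reason action-based tasks tend to score lower than single-answer queries.
    return model_actions == task.expected_actions


task = QueryTask(prompt="What is the most likely diagnosis?",
                 expected_answer="Community-acquired pneumonia")
print(score_query(task, "community-acquired pneumonia"))  # True
```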
📊 Results
Claude 3.5 Sonnet v2 performed best: 69.7% overall success rate.
Query-based: higher accuracy (Claude 85%).
Action-based: lower accuracy (Claude 54%).
Other models:
GPT-4o: 64% overall (72% query, 56% action).
Gemini 1.5 Pro: 62% overall (53% query, 71% action).
DeepSeek-V3: 62.7% overall (71% query, 55% action).
Common errors observed:
Inaccuracies: Provided incorrect or outdated clinical information.
Hallucinations: Generated fabricated content that sounds real.
Critical omissions: Left essential clinical details or steps out of responses or actions.
🔑 Key Takeaways
LLMs still have limitations in clinical reasoning, accuracy, and handling administrative tasks.
Current LLMs, when tested with MedAgentBench, show variable performance on clinical tasks.
MedAgentBench can help measure progress and identify weaknesses in LLMs.
Because it's open-source, other developers can test and build on it, which accelerates research on medical AI agents: https://github.com/stanfordmlgroup/MedAgentBench
🔗 Jiang Y, Black KC, Geng G, Park D, Zou J, Ng AY, Chen JH. MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents. NEJM AI. Published August 14, 2025. doi:10.1056/AIdbp2500144
🦾TechTools
Medwise is a search engine for doctors that gives evidence-based answers. It's customizable, so hospitals or organizations can integrate their local protocols and guidelines and get answers that match their own practice. It's also free to use for individuals (Link).
Dr.Oracle is an AI medical assistant for clinicians and students with deep reasoning capabilities; it gives accurate medical references and can help you understand study methodology and how to ask clinical questions. It has a free trial (Link).
Hippocratic AI builds AI agents on a medical large language model for patient support tasks like follow-up, pre-operative instructions, chronic care support, and insurance coverage. Not free (Link).
That’s all for now.
If you know people in healthcare who would like to get updates on LLM news, feel free to share it:
↪️Forward this email or 📲copy this link and send it from your phone.
Thank you!
Until next Wednesday.
Itzel Fer, MD PM&R
Forwarded this email? Sign up here
P.S. Have fun this weekend 🏖️