Hi!
Today is LLM Friday, a day to learn about Research and Tools on Large Language Models in medicine.
A paper published this week analyzed 106,942 interactions between 989 doctors and LLMs.
Here’s what they found:
60.2% of the questions physicians asked the LLMs were research-related, while 12.25% were clinical.
Younger physicians asked more clinical questions than senior physicians.
Female physicians asked more administrative and clinical questions.
Even though these physicians use LLMs to search for information, they have concerns about accuracy and reliability. I have the same concerns when I use publicly available LLMs. I verify every single answer to make sure it is correct.
Do you use LLMs to ask clinical or research questions? Do you trust the answer?
Now, let’s dive into today’s issue.
✨LLMs
This study developed MedAgentBench, a virtual electronic health record (EHR) environment, to test how LLM agents perform in realistic clinical scenarios.
🔬Methods
Data: 100 synthetic patient profiles with 700,000+ data elements (a toy sketch of what one record might contain follows this list).
Physicians wrote 300 clinical tasks such as diagnosis, treatment planning, documentation, and administrative functions.
Models tested: 12 Large Language Models, including Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, DeepSeek-V3, Qwen2.5, and Llama 3.3.
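For intuition, here is a toy, hypothetical Python sketch of what a tiny slice of one synthetic patient record might look like; the field names and values are my own illustration, not the paper's actual data format (MedAgentBench exposes its records through an EHR-style interface):

```python
# Toy illustration only: a small slice of what one synthetic patient profile
# might contain. The real benchmark packs 700,000+ data elements across
# 100 patients, so each record is far richer than this.
synthetic_patient = {
    "patient_id": "synthetic-001",
    "demographics": {"age": 67, "sex": "F"},
    "problem_list": ["type 2 diabetes", "hypertension"],
    "medications": [
        {"name": "metformin", "dose_mg": 500, "frequency": "BID"},
    ],
    "labs": [
        {"test": "HbA1c", "value": 8.2, "unit": "%", "date": "2024-11-02"},
    ],
    "notes": ["Follow-up visit for glycemic control."],
}

# An LLM agent reads (queries) or edits (actions) records like this one.
print(len(synthetic_patient["labs"]), "lab result(s) on file")
```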
They evaluated the models on:
Queries: A question or prompt, for example: "What is the most likely diagnosis?" or "What is the next best step in management?"
Actions: The LLM's ability to carry out a sequence of actions, like ordering tests or writing prescriptions (a minimal sketch of both task types follows below).
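To make the query vs. action distinction concrete, here is a minimal, hypothetical sketch; the task fields, function names, and exact-match scoring are my own illustration, not the benchmark's actual code:

```python
from dataclasses import dataclass, field


@dataclass
class QueryTask:
    """A question the model must answer from the patient record."""
    prompt: str            # e.g. "What is the most likely diagnosis?"
    expected_answer: str   # physician-written reference answer


@dataclass
class ActionTask:
    """A task the model completes by issuing a sequence of EHR actions."""
    instruction: str                                            # e.g. "Order a CBC"
    expected_actions: list[dict] = field(default_factory=list)  # ordered EHR calls


def score_query(task: QueryTask, model_answer: str) -> bool:
    # Toy exact-match check; a real benchmark would use rubrics or expert review.
    return model_answer.strip().lower() == task.expected_answer.strip().lower()


def score_actions(task: ActionTask, model_actions: list[dict]) -> bool:
    # Succeeds only if every expected EHR call is issued with the right arguments,
    # one reason action-based tasks tend to score lower than single-answer queries.
    return model_actions == task.expected_actions


task = QueryTask(prompt="What is the most likely diagnosis?",
                 expected_answer="Community-acquired pneumonia")
print(score_query(task, "community-acquired pneumonia"))  # True
```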
📊 Results
Claude 3.5 Sonnet v2 performed best: 69.7% overall success rate.
Query-based: higher accuracy (Claude 85%).
Action-based: lower accuracy (Claude 54%).
Other models:
GPT-4o: 64% overall (72% query, 56% action).
Gemini 1.5 Pro: 62% overall (53% query, 71% action).
DeepSeek-V3: 62.7% overall (71% query, 55% action).
Common errors observed:
Inaccuracies: Provided incorrect or outdated clinical information.
Hallucinations: Generated fabricated content that sounds real.
Critical omissions: Left essential clinical details or steps out of responses or actions.
🔑 Key Takeaways
LLMs still have limitations in clinical reasoning, accuracy, and handling administrative tasks.
Current LLMs, when tested with MedAgentBench, show variable performance on clinical tasks.
MedAgentBench can help measure progress and identify weaknesses in LLMs.
Because it's open-source, other developers can test and build on it, which accelerates research on medical AI agents: https://github.com/stanfordmlgroup/MedAgentBench
🔗 Jiang Y, Black KC, Geng G, Park D, Zou J, Ng AY, Chen JH. MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents. NEJM AI. Published August 14, 2025. doi:10.1056/AIdbp2500144
🦾TechTools
Medwise is a search engine for doctors that gives evidence-based answers. It's customizable, so hospitals or organizations can integrate their local protocols and guidelines and get answers that match their own practice. It's also free to use for individuals (Link).
Dr.Oracle is an AI medical assistant for clinicians and students with deep reasoning capabilities; it gives accurate medical references and can help you understand study methodology and how to ask clinical questions. It has a free trial (Link).
Hippocratic AI builds AI agents on a medical large language model for patient support tasks like follow-up, pre-operative instructions, chronic care support, and insurance coverage. Not free (Link).
That’s all for now.
If you know people in healthcare who would like to get updates on LLM news, feel free to share it:
↪️Forward this email or 📲copy this link and send it from your phone.
Thank you!
Until next Wednesday.
Itzel Fer, MD PM&R
Forwarded this email? Sign up here
P.S. Have fun this weekend 🏖️