🤖 AIBytes

General LLMs Beat Clinical AI Tools

Researchers compared clinical AI tools with general-purpose LLMs on medical benchmarks.

The goal was to see whether specialized clinical tools perform better than frontier LLMs on medical knowledge, clinician alignment, and real clinical questions.

🔬 Methods

The study compared two clinical AI tools, OpenEvidence and UpToDate Expert AI, with three general-purpose LLMs: GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6.
The evaluation used 500 MedQA questions and 500 HealthBench items.
Researchers also built a real clinical queries benchmark of 100 questions.
For these queries, 12 US clinicians reviewed the model answers in a randomized, blinded fashion.
Clinicians rated answers for clinical correctness, completeness, safety, and clarity.

📊 Results

General-purpose LLMs outperformed the clinical AI tools across all three evaluations.
On MedQA, Gemini 3.1 Pro scored highest at 97.4%.
OpenEvidence scored 89.6%, and UpToDate Expert AI scored 88.4%.
On HealthBench, GPT-5.2 scored highest at 88.0.
OpenEvidence scored 62.6, and UpToDate Expert AI scored 61.3.
On real clinical queries, the frontier LLMs also scored higher than the clinical AI tools.
Safety outcomes did not differ across models in this study.

🔑 Key Takeaways

This study challenges the idea that clinical AI tools always outperform general-purpose LLMs.
Frontier LLMs performed better on medical knowledge, clinician alignment, and real clinical queries according to this evaluation.
The real clinical query benchmark makes the study more relevant to daily practice.
Clinicians need clear guardrails when using general LLMs for patient care.
AI tools in healthcare need independent, real-world evaluation.

🔗 Vishwanath K, Alyakin A, Ghosh M, et al. General-purpose large language models outperform specialized clinical AI tools on medical benchmarks. Nature Medicine. 2026. doi:10.1038/s41591-026-04431-5

AI Manages Simulated Follow-Up Care

Researchers tested AMIE, Google’s conversational medical AI system, across simulated follow-up visits.

The goal was to see if AI could help with disease management over time, not just diagnosis.

🔬 Methods

AMIE was compared with 21 primary care physicians.
The study used 100 simulated patient cases.
Each case included 3 text-chat visits.
Cases covered cardiology, pulmonology, OB/GYN/urology, gastroenterology, and neurology/musculoskeletal care.
AMIE and physicians had access to NICE Guidance and BMJ Best Practice guidelines.
Specialist physicians and trained patient actors evaluated the visits.
The researchers also tested medication reasoning with RxQA, a 600-question benchmark based on US and UK drug formularies.

📊 Results

AMIE’s management plans were rated at least as good as the physicians’ plans on all 15 quality measures.
AMIE scored higher on overall plan quality across all 3 visits.
AMIE gave more precise treatment recommendations:
- Visit 1: 96% vs. 62%
- Visit 2: 95% vs. 65%
- Visit 3: 95% vs. 67%
AMIE’s treatment plans were also more aligned with clinical guidelines.
Medication reasoning was still hard for both groups.
On harder medication questions, AMIE performed better than physicians:
- Closed-book: 50.6% vs. 41.5%
- Open-book: 57.9% vs. 47.8%
Peak performance stayed below 75% for both AMIE and physicians.

🔑 Key Takeaways

This study tested AI for follow-up care and disease management, not just diagnosis.
AMIE performed especially well in treatment precision and guideline alignment.
Medication reasoning is still a weak area, even with drug references.
This was a simulated study with text-chat visits and patient actors.
AMIE is not ready for clinical use yet.

🔗 Liévin V, Palepu A, Weng WH, et al. Towards Conversational AI for Disease Management. Nature. 2026. doi:10.1038/s41586-026-10764-5 (Preprint)

🦾TechTools

Atropos Health

You can use Atropos to turn clinical questions into real-world evidence.
It may help when guidelines or trials do not fully answer the question you have.
You can use it to generate evidence reports, run studies on clinical data, and support questions in care, pharmacy, quality, and research.

Ada

Ada is an AI-powered symptom assessment tool.
It helps users think through symptoms and learn what level of care may be needed.
You can also use its medical library to read easy-to-understand health information created by doctors and grounded in research.

📈 Productivity tool of the week

Motion

You can use Motion to plan your day around tasks, deadlines, meetings, and priorities.
It can automatically organize, prioritize, and schedule your tasks.
You can also use it for AI projects, tasks, calendar, meeting notes, docs, reports, workflows, and dashboards in one workspace.

🧬AIMedily Snaps

The big ideas from Stanford Health AI week (Link).
NVIDIA collaborates with Abridge on the first foundation model for clinical conversations (Link).
Wolters Kluwer: How AI is Reshaping the Care Experience (Link).
Doximity: Comparing Medical AI Tools for Healthcare Workflows in 2026 (Link).
AMA: With AI increasingly part of care, transparency and quality are musts (Link).
NHS England is expanding Microsoft Copilot access to more than 500,000 staff (Link).

🧪Research Signals

The Lancet: Beyond language: generative artificial intelligence as a general computing model for medicine (Link).
JAMA: Clinical Evidence and FDA Recalls of Artificial Intelligence–Enabled Medical Devices (Link).
Elsevier: Why almost all ML models for medicine are wrong - and what we need for evidence-based medical AI (Link).
JMIR: Performance of Deep Learning in Classifying Age-Related Macular Degeneration From Images: Systematic Review and Meta-Analysis (Link).
Nature: When silence is safer: a review and decision-theoretic framework for LLM abstention in healthcare (Link).
Nature: Large language models for acute coronary syndrome triage at first medical contact in emergency departments (Link).

🧩TriviaRX

Which disease was treated in the first widely recognized randomized clinical trial in medicine?

A) Diabetes
B) Tuberculosis
C) Smallpox
D) Pneumonia

Now, let’s see if you got last week’s TriviaRX correct: ✅ B) African clawed frog

Before modern pregnancy tests, doctors used the African clawed frog (Xenopus laevis) as a living test. When a frog was injected with urine containing hCG, the pregnancy hormone, it could release eggs within hours.

Want to keep up with AI in medicine in less than 5 minutes a week?

Subscribe free to AIMedily.

Itzel Fer, MD PM&R

Follow me on LinkedIn | Substack | X | Instagram

AIMedily 52
