AIMedily 54

Hi,

Welcome back!

Mexico won its World Cup match yesterday, so I’m starting this issue in a good mood. These days, my home is all about soccer.

And in medical AI, something important happened this week: UpDoc received FDA clearance for a platform that helps clinicians manage patients between visits, starting with type 2 diabetes.

AI is getting closer to patient care.

Let’s dive into today’s issue.

🤖AIBytes

Two clinical studies that deserve a closer look.

When Medical AI Scores Hide Fragile Reasoning

This Nature Medicine study tested whether high benchmark scores from large AI models reflect real medical reasoning, or whether they hide fragile behavior.

Methods

Researchers evaluated several frontier models, including GPT-5, Gemini 2.5 Pro, OpenAI-o3, OpenAI-o4-mini, Claude 3.5 Sonnet, and GPT-4o.

They used multimodal medical benchmarks, including NEJM, JAMA, VQA-RAD, PMC-VQA, and OmniMedVQA.

The team stress-tested the models by:

Removing images
Changing answer order
Replacing distractor choices
Swapping images
Checking whether model explanations were accurate
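For readers who like to see the mechanics, here is a minimal Python sketch of two of these perturbations: removing the image and changing the answer order. It is an illustration only, not the study's code. The `Item` structure and the function names are hypothetical and were made up for this example.

```python
import random
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Item:
    """Hypothetical multiple-choice benchmark item (not the study's format)."""
    question: str
    choices: Tuple[str, ...]
    answer_index: int
    image_path: Optional[str] = None


def remove_image(item: Item) -> Item:
    """Return a copy of the item with the image stripped out."""
    return replace(item, image_path=None)


def shuffle_choices(item: Item, rng: random.Random) -> Item:
    """Return a copy with the choices reordered and the answer index updated."""
    order = list(range(len(item.choices)))
    rng.shuffle(order)
    new_choices = tuple(item.choices[i] for i in order)
    # The correct choice now sits wherever its old index landed in `order`.
    new_answer = order.index(item.answer_index)
    return replace(item, choices=new_choices, answer_index=new_answer)
```

A model that truly relies on the image should lose accuracy under `remove_image`, and a model that reasons about content rather than position should be unaffected by `shuffle_choices`.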

Results

When images were removed, model performance dropped.

But the models still answered correctly more often than expected, even when the image was missing.

On NEJM cases, GPT-5 dropped from 81.3% to 67.4% when images were removed.
Gemini 2.5 Pro dropped from 81.1% to 67.4%.
In image-dependent NEJM cases, models still answered above chance without the image.
GPT-5 reached about 41%, Gemini 2.5 Pro about 40%, and OpenAI-o3 about 39%. Chance level was 20%.
When the original image was replaced with another plausible medical image, GPT-5 fell from 84% to 35%.

Step-by-step reasoning did not reliably help either.

Some models gave fluent explanations, but described image findings that were false or unsupported.

Gu et al., Nature Medicine, 2026.

Key Takeaways

High benchmark scores do not always mean a model is ready for clinical use.
Some models may guess the right answer even when key image information is missing.
Small changes in the question format, answer choices, or images can reveal fragile reasoning.

🔗 Gu Y, Fu J, Liu X, et al. Evaluating the robustness and readiness of large frontier models in health AI applications. Nature Medicine. 2026. doi:10.1038/s41591-026-04501-8

A Real-World Trial of AI in Primary Care

This Nature Medicine trial tested whether adding an LLM-based clinical decision support tool to primary care could improve patient outcomes in real clinical practice.

Methods

Researchers conducted a pragmatic, cluster-randomized trial across 16 primary care facilities in Kenya.

The study included:

103 clinical officers
9,347 patients in the primary analysis
16 primary care facilities

Clinical officers were randomized to use the EHR with or without LLM assistance.

The AI tool used GPT-4o and was embedded inside the clinical workflow. It reviewed information entered in the medical record and generated diagnostic and treatment suggestions.

Clinicians could accept, modify, or ignore the AI recommendations.

The main outcome was treatment failure within 14 days.

Results

Treatment failure was similar in both groups:

AI-assisted group: 102 of 4,693 patients, or 2.2%
Control group: 94 of 4,654 patients, or 2.0%

The difference was not statistically significant.
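The percentages above can be checked with a few lines of Python. This is only an arithmetic check on the reported counts. It is not the trial's statistical analysis, which would need to account for the cluster-randomized design. The variable names are my own.

```python
def failure_rate(events: int, total: int) -> float:
    """Proportion of patients with treatment failure."""
    return events / total


# Counts as reported in the trial summary above.
ai_rate = failure_rate(102, 4693)
control_rate = failure_rate(94, 4654)

# Absolute difference between groups, as a proportion.
risk_difference = ai_rate - control_rate

print(f"AI-assisted: {ai_rate:.1%}")    # AI-assisted: 2.2%
print(f"Control:     {control_rate:.1%}")  # Control:     2.0%
print(f"Difference:  {risk_difference * 100:.2f} percentage points")
```

The absolute gap between the two groups is a small fraction of one percentage point, which fits with the finding of no statistically significant difference.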

The strongest signal was documentation quality. Clinicians using AI were more likely to:

record an appropriate diagnosis
write a more complete note
document an appropriate treatment plan

Patient satisfaction was similar in both groups.

Consultation time was also similar: about 11 minutes in both groups.

No serious adverse events were judged to be related to the AI tool.

Agweyu et al., Nature Medicine, 2026.

Key Takeaways

This is an important real-world trial, not just a benchmark study.
LLM support appeared safe in this primary care setting, but it did not significantly reduce short-term treatment failure.
The strongest benefit was better documentation, not better patient outcomes.

🔗 Agweyu A, Mwaniki P, Menon V, et al. Generative AI-enabled clinical decision support system in primary care: a pragmatic, cluster-randomized trial. Nature Medicine. 2026. doi:10.1038/s41591-026-04503-6

🧬AIMedily Snaps

Medical AI updates worth knowing this week.

UpDoc: Became the first FDA-cleared clinical AI platform to help clinicians manage patients between visits, starting with type 2 diabetes care (Read).
Claude Science: Anthropic launched an AI tool to help scientists read papers, analyze data, make figures, and manage research workflows in one place (Read).
ZS: 2026 Future of Health Report, a survey of nearly 10,000 healthcare consumers and providers across the U.S., Germany and China (Read).
KFF: 3 in 10 adults turn to social media or AI for health information (Read).
Mayo Clinic: Are we deploying AI algorithms without appropriate oversight? (Read).
HHS: The Department of Health and Human Services is working on clearer guidance for healthcare AI (Read).

🧪Research Signals

New papers worth your time.

Nature: An ECG biomarker for sudden cardiac death discovered with deep learning (Paper).
Nature: AI-supported surgical planning and postoperative prediction in adolescent idiopathic scoliosis (Paper).
BMJ: AI for cervical cancer screening and diagnosis using Pap smear images (Paper).
JAMA: AI note summarization in the emergency department (Paper).
Pediatrics: Sociodemographic variability in pediatric emergency decisions by AI (Paper).
Nature: LLMs showed structured reasoning failures when interpreting real oncology notes (Paper).

🦾TechTools

AI tools clinicians may want on their radar.

VisualDx

Clinical decision support tool for building a visual differential diagnosis.
Uses medical images and clinical findings to support point-of-care decisions.
Useful for dermatology, primary care, and urgent care.

Pearl Health

AI-powered platform for value-based primary care.
Helps care teams identify high-priority patients and act earlier.
Useful for health systems managing Medicare patients.

📈 Productivity AI tool of the week:

Shortwave

AI email app built for Gmail.
Helps summarize threads, draft replies, and organize your inbox.
Useful for managing administrative email, research coordination, and communication.

🧩TriviaRX

A quick question to test your knowledge.

Which physician is credited with helping stop the 1854 cholera outbreak in London by persuading authorities to remove the handle from a contaminated water pump?

Options:
A) William Farr
B) Ignaz Semmelweis
C) John Snow
D) Rudolf Virchow

Now, the answer from last week’s TriviaRX: ✅ A) Letheon

In 1846, William Morton used “Letheon” during the famous public ether anesthesia demonstration at Massachusetts General Hospital. The name helped hide that the substance was ether.

That’s it for today.

As always, thank you for being part of this community.

You’re already ahead of the curve in medical AI — don’t keep it to yourself. Share AIMedily with your colleagues who’d appreciate the insights.

Until next week.

Itzel Fer, MD PM&R

Follow me on LinkedIn | Substack | X | Instagram

Forwarded this email? Subscribe free.

P.S. Enjoying AIMedily? 👉 Write a review here (it takes less than a minute).

