Hi!

Welcome back to AIMedily.

It’s almost summer. My kids are done with school, and in Michigan, the weather still can’t decide what it wants to do.

Over the last two weeks, my coworkers and I have been working on a prosthetic ankle called VSPA (check it out). It’s designed to adapt in real time to the needs of the user, and it’s been really interesting to see how patients interact with it during testing.

All of that to say: it has been a busy couple of weeks.

Now, let’s dive into today’s issue.

🤖 AIBytes

Researchers compared clinical AI tools with general-purpose LLMs on medical benchmarks.

The goal was to see whether specialized clinical tools perform better than frontier LLMs on medical knowledge, clinician alignment, and real clinical questions.

🔬 Methods

  • The evaluation used 500 MedQA questions and 500 HealthBench items.

  • Researchers also built a benchmark of 100 real clinical queries.

  • For the real clinical queries, 12 US clinicians reviewed model answers in a randomized and blinded way.

  • Clinicians rated answers for clinical correctness, completeness, safety, and clarity.

📊 Results

  • General-purpose LLMs outperformed the clinical AI tools across all three evaluations.

  • On MedQA, Gemini scored highest at 97.4%.

  • OpenEvidence scored 89.6%, and UpToDate Expert AI scored 88.4%.

  • On HealthBench, GPT-5.2 scored highest at 88.0.

  • OpenEvidence scored 62.6, and UpToDate Expert AI scored 61.3.

  • On real clinical queries, the frontier LLMs scored higher.

  • Safety outcomes did not differ across models in this study.
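
To put the reported gaps in perspective, here is a minimal Python sketch that computes how far each clinical tool sits below the top score on each benchmark. The scores are copied from the bullets above; the variable and function names are illustrative and do not come from the paper.

```python
# Scores copied from the results above; names are illustrative only.
medqa = {"Gemini": 97.4, "OpenEvidence": 89.6, "UpToDate Expert AI": 88.4}
healthbench = {"GPT-5.2": 88.0, "OpenEvidence": 62.6, "UpToDate Expert AI": 61.3}


def gaps_to_top(scores):
    """Return each tool's gap to the highest score, in points."""
    top = max(scores.values())
    return {name: round(top - s, 1) for name, s in scores.items() if s != top}


medqa_gaps = gaps_to_top(medqa)
healthbench_gaps = gaps_to_top(healthbench)
print(medqa_gaps)
print(healthbench_gaps)
```

The gap is under 10 points on MedQA but more than 25 points on HealthBench, so the difference between tools is much larger on the clinician-alignment benchmark than on the medical knowledge exam.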

🔑 Key Takeaways

  • This study challenges the idea that clinical AI tools always outperform general-purpose LLMs.

  • Frontier LLMs performed better on medical knowledge, clinician alignment, and real clinical queries according to this evaluation.

  • The real clinical query benchmark makes the study more relevant to daily practice.

  • Clinicians need clear guardrails when using general LLMs for patient care.

  • AI tools in healthcare need independent, real-world evaluation.

🔗 Vishwanath K, Alyakin A, Ghosh M, et al. General-purpose large language models outperform specialized clinical AI tools on medical benchmarks. Nature Medicine. 2026. doi:10.1038/s41591-026-04431-5

Researchers tested AMIE, Google’s conversational medical AI system, across simulated follow-up visits.

The goal was to see if AI could help with disease management over time, not just diagnosis.

🔬 Methods

  • AMIE was compared with 21 primary care physicians.

  • The study used 100 simulated patient cases.

  • Each case included 3 text-chat visits.

  • Cases covered cardiology, pulmonology, OB/GYN/urology, gastroenterology, and neurology/musculoskeletal care.

  • AMIE and physicians had access to NICE Guidance and BMJ Best Practice guidelines.

  • Specialist physicians and trained patient actors evaluated the visits.

  • The researchers also tested medication reasoning with RxQA, a 600-question benchmark based on US and UK drug formularies.

📊 Results

  • AMIE’s management plans were rated at least as good as the physicians’ plans across all 15 quality measures.

  • AMIE scored higher on overall plan quality across all 3 visits.

  • AMIE gave more precise treatment recommendations:

    • Visit 1: 96% vs. 62%

    • Visit 2: 95% vs. 65%

    • Visit 3: 95% vs. 67%

  • AMIE’s treatment plans were also more aligned with clinical guidelines.

  • Medication reasoning was still hard for both groups.

  • On harder medication questions, AMIE performed better than physicians:

    • Closed-book: 50.6% vs. 41.5%

    • Open-book: 57.9% vs. 47.8%

  • Peak performance stayed below 75% for both AMIE and physicians.
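
As a rough worked example, the percentages above can be turned into percentage-point gaps with a few lines of Python. This is only a sketch: the numbers are copied from the bullets, and the names are illustrative, not the study's own.

```python
# (AMIE %, physician %) pairs copied from the results above.
precision = {"Visit 1": (96, 62), "Visit 2": (95, 65), "Visit 3": (95, 67)}
rxqa_hard = {"Closed-book": (50.6, 41.5), "Open-book": (57.9, 47.8)}


def point_gaps(pairs):
    """Percentage-point gap (AMIE minus physicians) for each label."""
    return {label: round(amie - doc, 1) for label, (amie, doc) in pairs.items()}


precision_gaps = point_gaps(precision)
rxqa_gaps = point_gaps(rxqa_hard)

# Gain from having drug references available (open-book minus closed-book).
amie_gain = round(rxqa_hard["Open-book"][0] - rxqa_hard["Closed-book"][0], 1)
physician_gain = round(rxqa_hard["Open-book"][1] - rxqa_hard["Closed-book"][1], 1)

print(precision_gaps, rxqa_gaps, amie_gain, physician_gain)
```

Read this way, the treatment-precision gap narrows from 34 to 28 points across visits, while access to drug references adds only about 6 to 7 points for either group on the harder medication questions.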

🔑 Key Takeaways

  • This study tested AI for follow-up care and disease management, not just diagnosis.

  • AMIE performed especially well in treatment precision and guideline alignment.

  • Medication reasoning is still a weak area, even with drug references.

  • This was a simulated study with text-chat visits and patient actors.

  • AMIE is not ready for clinical use yet.

🔗 Liévin V, Palepu A, Weng WH, et al. Towards Conversational AI for Disease Management. Nature. 2026. doi:10.1038/s41586-026-10764-5 (Preprint)

🦾 TechTools

  • You can use Atropos to turn clinical questions into real-world evidence.

  • It may help when guidelines or trials do not fully answer the question you have.

  • You can use it to generate evidence reports, run studies on clinical data, and support questions in care, pharmacy, quality, and research.

  • Ada is an AI-powered symptom assessment tool.

  • It helps users think through symptoms and learn what level of care may be needed.

  • You can also use its medical library to read easy-to-understand health information created by doctors and grounded in research.

📈 Productivity tool of the week

  • You can use Motion to plan your day around tasks, deadlines, meetings, and priorities.

  • It can automatically organize, prioritize, and schedule your tasks.

  • You can also use it for AI projects, tasks, calendar, meeting notes, docs, reports, workflows, and dashboards in one workspace.

🧬 AIMedily Snaps

  • The big ideas from Stanford Health AI week (Link).

  • NVIDIA collaborates with Abridge on the first foundation model for clinical conversations (Link).

  • Wolters Kluwer: How AI is Reshaping the Care Experience (Link).

  • Doximity: Comparing Medical AI Tools for Healthcare Workflows in 2026 (Link).

  • AMA: With AI increasingly part of care, transparency and quality are musts (Link).

  • NHS England is expanding Microsoft Copilot access to more than 500,000 staff (Link).

🧪 Research Signals

  • The Lancet: Beyond language: generative artificial intelligence as a general computing model for medicine (Link).

  • JAMA: Clinical Evidence and FDA Recalls of Artificial Intelligence–Enabled Medical Devices (Link).

  • Elsevier: Why almost all ML models for medicine are wrong - and what we need for evidence-based medical AI (Link).

  • JMIR: Performance of Deep Learning in Classifying Age-Related Macular Degeneration From Images: Systematic Review and Meta-Analysis (Link).

  • Nature: When silence is safer: a review and decision-theoretic framework for LLM abstention in healthcare (Link).

  • Nature: Large language models for acute coronary syndrome triage at first medical contact in emergency departments (Link).

🧩 TriviaRX

Which disease was treated in the first widely recognized randomized clinical trial in medicine?

A) Diabetes
B) Tuberculosis
C) Smallpox
D) Pneumonia

Now, let’s see if you got last week’s TriviaRX correct: B) African clawed frog

Before modern pregnancy tests, doctors used the African clawed frog (Xenopus laevis) as a living test. When a patient's urine containing hCG was injected into the frog, it could trigger the frog to release eggs within hours.

That’s it for today.

As always, thank you for taking the time to read.

You’re already ahead of the curve in medical AI — don’t keep it to yourself. Forward AIMedily to your colleagues who’d appreciate the insights.

Until next week.

Itzel Fer, MD PM&R
