In partnership with

Help us make better ads

Did you recently see an ad for beehiiv in a newsletter? We’re running a short brand lift survey to understand what’s actually breaking through (and what’s not).

It takes about 20 seconds, the questions are super easy, and your feedback directly helps us improve how we show up in the newsletters you read and love.

If you’ve got a few moments, we’d really appreciate your insight.

Hi!

Welcome back to AIMedily.

How’s winter treating you? Here in Ann Arbor, it’s been snowy and cold. If the forecast holds, temperatures may drop to –13°F (–25°C) this weekend 🥶. I’m definitely not ready for that.

Now, back to what matters.

This week, Stanford ARISE released a new report that offers a clear, grounded view of what’s happening in AI in medicine.

It’s a comprehensive piece that covers key research, how these tools can be translated into real-world care, and where the field may be heading next.

Ready to dive in?

🤖 AIBytes

Researchers developed MedHELM, a clinician-validated framework to evaluate LLMs across real-world medical tasks, beyond exam questions.

🔬 Methods

  • 29 clinicians across 14 medical specialties reviewed and validated the evaluation framework.

  • The authors organized medical AI tasks into:

    • 5 categories (clinical decision support, clinical note generation, patient communication & education, medical research assistance, administration & workflow)

    • 22 subcategories

    • 121 specific real-world clinical tasks

  • 37 benchmarks: public datasets, gated datasets, and private real-world datasets, including electronic health record (EHR)–based tasks.

  • Models evaluated:
    Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, o3-mini, GPT-4o, GPT-4o mini, Gemini 1.5 Pro, Gemini 2.0 Flash, and Llama 3.3.

  • Evaluation approach:

    • Exact-match scoring for structured tasks.

    • LLM evaluation (three independent models) for open-ended clinical outputs, assessing accuracy, completeness, and clarity.

    • Cost–performance analysis.
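
The two scoring modes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not MedHELM’s actual code; the function names, rubric dimensions, and 1–5 rating scale are hypothetical stand-ins for the paper’s setup (exact-match scoring for structured tasks, three independent LLM judges for open-ended outputs):

```python
# Hypothetical sketch of MedHELM-style scoring (not the real framework's API).

def exact_match_score(prediction: str, reference: str) -> float:
    """Structured tasks: 1.0 if the normalized answer matches exactly, else 0.0."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0

def jury_score(ratings: list[dict]) -> dict:
    """Open-ended tasks: average each rubric dimension across the judge models."""
    dims = ("accuracy", "completeness", "clarity")
    return {d: sum(r[d] for r in ratings) / len(ratings) for d in dims}

# Example: three independent LLM judges each rate an open-ended clinical
# output on a 1-5 scale across the three rubric dimensions.
judges = [
    {"accuracy": 4, "completeness": 5, "clarity": 4},
    {"accuracy": 5, "completeness": 4, "clarity": 4},
    {"accuracy": 4, "completeness": 4, "clarity": 5},
]
print(exact_match_score("Sepsis", "sepsis"))  # structured task
print(jury_score(judges))                     # open-ended task
```

Averaging several independent judges, rather than trusting one model, is a common way to dampen the bias of any single LLM evaluator.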

📊 Results

Top overall performance: DeepSeek R1 and o3-mini (win rate 66% across benchmarks).

Best balance of performance and cost: Claude 3.5 Sonnet (win rate 63%, 15% lower computational cost).

Strongest task areas across models:

  • Clinical note generation

  • Patient communication and education

Weakest area for all models: Administration & workflow.

Performance on medical licensing exams did not reliably predict real-world clinical task performance.

🔑 Key Takeaways

  • High exam scores do not translate into readiness for real clinical deployment.

  • LLM performance varies by task type, not just by model brand.

  • Administrative and workflow tasks remain a major limitation.

  • MedHELM provides a practical way for health systems to select AI tools based on real clinical needs.

🔗Bedi S, Cui H, Fuentes M, et al. Holistic evaluation of large language models for medical tasks with MedHELM. Nature Medicine. 2026.
DOI: https://doi.org/10.1038/s41591-025-04151-2

Researchers tested whether an LLM chatbot could help patients and primary care clinicians create clearer, more complete referrals to specialists—without increasing risk or clinical burden.

🔬 Methods

  • 2,069 patients / care partners

  • 111 specialists across 24 medical disciplines

  • Intervention:

    • PreA (Pre-Assessment) — a co-designed chatbot (GPT-4o mini)

    • Conducted structured history-taking, generated preliminary diagnostic suggestions, suggested tests, and produced referral reports for specialists.

  • Groups:

    • PreA-only (autonomous use)

    • PreA-human (with staff assistance)

    • No-PreA (usual care)

  • Primary outcomes:

    • Consultation duration

    • Physician-rated care coordination

    • Patient-reported ease of communication

  • Secondary outcomes:

    • Physician workload (patients per shift)

    • Patient satisfaction, attentiveness, acceptability

    • Clinical decision-making patterns

📊 Results

  • Consultation time: PreA-only reduced specialist consultation duration by 28.7% (P < 0.001).

  • Care coordination (physician-rated): 113% relative improvement with PreA vs usual referrals, P < 0.001.

  • Patient-reported communication ease: 16% relative improvement with PreA, P < 0.001

  • Autonomy: No significant differences between PreA-only and PreA-human groups, supporting fully autonomous operation.

  • Physician workload: Physicians saw 15.3% more patients per shift without increased waiting times.

  • Clinical safety:

    • No detectable changes in physician clinical decision-making patterns.

  • Referral quality:

    • History-taking: 65.8%

    • Diagnoses: 66.7%

    • Test ordering: 70.7%

  • Co-designed models outperformed locally fine-tuned models across all clinical domains.

🔑 Key Takeaways

  • LLM chatbots can reduce specialist workload while improving patient experience.

  • Co-design with local clinical stakeholders outperforms passive training.

  • Efficiency gains did not compromise clinical reasoning or care.

  • Results are most applicable to high-volume, resource-constrained health systems.

🔗 Tao X, Zhou S, Ding K, et al. An LLM chatbot to facilitate primary-to-specialist care transitions: a randomized controlled trial. Nature Medicine. 2026. doi:10.1038/s41591-025-04176-7

🦾TechTools

  • A voice-based AI platform that converts clinician–patient conversations into clinical documentation.

  • It’s designed to support workflows within existing clinical systems.

  • Reduces manual documentation effort while keeping clinicians fully in the loop.

  • Platform focused on secure aggregation and sharing of longitudinal medical records.

  • Enables consent-driven access to patient data across care settings.

  • Designed to improve information availability for care coordination.

  • AI-powered conversational search engine.

  • Built as a privacy-first, ad-free alternative to commercial search engines.

  • Helps you explore research and ideas before moving to primary sources.

🧬AIMedily Snaps

  • Horizon 1000: The Gates Foundation and OpenAI commit $50 million to strengthen primary healthcare for 1,000 African clinics (Link).

  • OpenEvidence partners with Rwanda Biomedical Center to adapt clinical decision support AI tool for use in low- and middle-income countries (Link).

  • Guiding Principles of Good AI Practice in Drug Development (Link).

  • Top healthcare AI trends in 2026 (Link).

  • NVIDIA and Lilly Announce Co-Innovation AI Lab to Reinvent Drug Discovery (Link).

  • European Medicines Agency and FDA set common principles for AI in medicine development (Link).

🧪 Research Signals

  • Nature: An artificial intelligence-powered learning health system to improve sepsis detection and quality of care: a before-and-after study (Link).

  • NEJM: Assessment of Short-Answer Questions by ChatGPT in a Medical School Course (Link).

  • NEJM: The Paradoxical Challenge of High-Value Medical Artificial Intelligence (Link).

  • JAMA: Ambient Artificial Intelligence Scribes and Physician Financial Productivity (Link).

  • NEJM: Grading LLMs on the Ability to Grade (Link).

🧩TriviaRX

Which everyday medical tool was originally introduced to detect anesthesia-related hypoxia, not to monitor patients continuously?

A) Blood pressure cuff
B) Pulse oximeter
C) Capnography
D) ECG

Now, time for the answer to last week’s TriviaRX: What was one of the first real clinical uses of machine learning in medicine?

B) ECG interpretation

Computer-assisted ECG analysis was used clinically as early as the 1960s.

Thank you for taking the time to read.

You’re already ahead of the curve in medical AI — don’t keep it to yourself. Forward AIMedily to your colleagues who’d appreciate the insights.

Until next Wednesday.

Itzel Fer, MD PM&R

Follow me on LinkedIn | Substack | X | Instagram

Forwarded this email? Sign up here

P.S. Enjoying AIMedily? 👉 Write a review here (it takes less than a minute).
