In partnership with

Help us make better ads

Did you recently see an ad for beehiiv in a newsletter? We’re running a short brand lift survey to understand what’s actually breaking through (and what’s not).

It takes about 20 seconds, the questions are super easy, and your feedback directly helps us improve how we show up in the newsletters you read and love.

If you’ve got a few moments, we’d really appreciate your insight.

Hi!

Welcome back to AIMedily.

How’s winter treating you? Here in Ann Arbor, it’s been snowy and cold. If the forecast holds, temperatures may drop to –13°F (–25°C) this weekend 🥶. I’m definitely not ready for that.

Now, back to what matters.

This week, Stanford ARISE released a new report that offers a clear, grounded view of what’s happening in AI in medicine.

It’s a comprehensive piece that covers key research, how these tools can be translated into real-world care, and where the field may be heading next.

Ready to dive in?

🤖 AIBytes

Researchers developed MedHELM, a clinician-validated framework to evaluate LLMs across real-world medical tasks, beyond exam questions.

🔬 Methods

  • 29 clinicians across 14 medical specialties reviewed and validated the evaluation framework.

  • The authors organized medical AI tasks into:

    • 5 categories (clinical decision support, clinical note generation, patient communication & education, medical research assistance, administration & workflow)

    • 22 subcategories

    • 121 specific real-world clinical tasks

  • 37 benchmarks: public datasets, gated datasets, and private real-world datasets, including electronic health record (EHR)–based tasks.

  • Models evaluated:
    Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, o3-mini, GPT-4o, GPT-4o mini, Gemini 1.5 Pro, Gemini 2.0 Flash, and Llama 3.3.

  • Evaluation approach:

    • Exact-match scoring for structured tasks.

    • LLM evaluation (three independent models) for open-ended clinical outputs, assessing accuracy, completeness, and clarity.

    • Cost–performance analysis.
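
The two scoring modes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not MedHELM’s actual code; the function names, rubric dimensions, and 1–5 rating scale are hypothetical stand-ins for the paper’s setup (exact-match scoring for structured tasks, three independent LLM judges for open-ended outputs):

```python
# Hypothetical sketch of MedHELM-style scoring (not the real framework's API).

def exact_match_score(prediction: str, reference: str) -> float:
    """Structured tasks: 1.0 if the normalized answer matches exactly, else 0.0."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0

def jury_score(ratings: list[dict]) -> dict:
    """Open-ended tasks: average each rubric dimension across the judge models."""
    dims = ("accuracy", "completeness", "clarity")
    return {d: sum(r[d] for r in ratings) / len(ratings) for d in dims}

# Example: three independent LLM judges each rate an open-ended clinical
# output on a 1-5 scale across the three rubric dimensions.
judges = [
    {"accuracy": 4, "completeness": 5, "clarity": 4},
    {"accuracy": 5, "completeness": 4, "clarity": 4},
    {"accuracy": 4, "completeness": 4, "clarity": 5},
]
print(exact_match_score("Sepsis", "sepsis"))  # structured task
print(jury_score(judges))                     # open-ended task
```

Averaging several independent judges, rather than trusting one model, is a common way to dampen the bias of any single LLM evaluator.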

📊 Results

Top overall performance: DeepSeek R1 and o3-mini (win rate 66% across benchmarks).

Best balance of performance and cost: Claude 3.5 Sonnet (win rate 63%, 15% lower computational cost).

Strongest task areas across models:

  • Clinical note generation

  • Patient communication and education

Weakest area for all models: Administration & workflow.

Performance on medical licensing exams did not reliably predict real-world clinical task performance.

🔑 Key Takeaways

  • High exam scores do not translate into readiness for real clinical deployment.

  • LLM performance varies by task type, not just by model brand.

  • Administrative and workflow tasks remain a major limitation.

  • MedHELM provides a practical way for health systems to select AI tools based on real clinical needs.

🔗Bedi S, Cui H, Fuentes M, et al. Holistic evaluation of large language models for medical tasks with MedHELM. Nature Medicine. 2026.
DOI: https://doi.org/10.1038/s41591-025-04151-2

Researchers tested whether an LLM chatbot could help patients and primary care clinicians create clearer, more complete referrals to specialists—without increasing risk or clinical burden.

🔬 Methods

  • 2,069 patients / care partners

  • 111 specialists across 24 medical disciplines

  • Intervention:

    • PreA (Pre-Assessment) — a co-designed chatbot (GPT-4o mini)

    • Conducted structured history-taking, generated preliminary diagnostic suggestions, suggested tests, and produced referral reports for specialists.

  • Groups:

    • PreA-only (autonomous use)

    • PreA-human (with staff assistance)

    • No-PreA (usual care)

  • Primary outcomes:

    • Consultation duration

    • Physician-rated care coordination

    • Patient-reported ease of communication

  • Secondary outcomes:

    • Physician workload (patients per shift)

    • Patient satisfaction, attentiveness, acceptability

    • Clinical decision-making patterns

📊 Results

  • Consultation time: PreA-only reduced specialist consultation duration by 28.7% (P < 0.001).

  • Care coordination (physician-rated): 113% relative improvement with PreA vs usual referrals, P < 0.001.

  • Patient-reported communication ease: 16% relative improvement with PreA, P < 0.001

  • Autonomy: No significant differences between PreA-only and PreA-human groups, supporting fully autonomous operation.

  • Physician workload: Physicians saw 15.3% more patients per shift without increased waiting times.

  • Clinical safety:

    • No detectable changes in physician clinical decision-making patterns.

  • Referral quality:

    • History-taking: 65.8%

    • Diagnoses: 66.7%

    • Test ordering: 70.7%

  • Co-designed models outperformed locally fine-tuned models across all clinical domains.

🔑 Key Takeaways

  • LLM chatbots can reduce specialist workload while improving patient experience.

  • Co-design with local clinical stakeholders outperforms passive training.

  • Efficiency gains did not compromise clinical reasoning or care.

  • Results are most applicable to high-volume, resource-constrained health systems.

🔗 Tao X, Zhou S, Ding K, et al. An LLM chatbot to facilitate primary-to-specialist care transitions: a randomized controlled trial. Nature Medicine. 2026. doi:10.1038/s41591-025-04176-7

🦾TechTools

  • A voice-based AI platform that converts clinician–patient conversations into clinical documentation.

  • It’s designed to support workflows within existing clinical systems.

  • Reduces manual documentation effort while keeping clinicians fully in the loop.

  • Platform focused on secure aggregation and sharing of longitudinal medical records.

  • Enables consent-driven access to patient data across care settings.

  • Designed to improve information availability for care coordination.

  • AI-powered conversational search engine.

  • Built as a privacy-first, ad-free alternative to commercial search engines.

  • Helps you explore research and ideas before moving to primary sources.

🧬AIMedily Snaps

  • Horizon 1000: The Gates Foundation and OpenAI commit $50 million to strengthen primary healthcare for 1,000 African clinics (Link).

  • OpenEvidence partners with Rwanda Biomedical Center to adapt clinical decision support AI tool for use in low- and middle-income countries (Link).

  • Guiding Principles of Good AI Practice in Drug Development (Link).

  • Top healthcare AI trends in 2026 (Link).

  • NVIDIA and Lilly Announce Co-Innovation AI Lab to Reinvent Drug Discovery (Link).

  • European Medicines Agency and FDA set common principles for AI in medicine development (Link).

🧪 Research Signals

  • Nature: An artificial intelligence-powered learning health system to improve sepsis detection and quality of care: a before-and-after study (Link).

  • NEJM: Assessment of Short-Answer Questions by ChatGPT in a Medical School Course (Link).

  • NEJM: The Paradoxical Challenge of High-Value Medical Artificial Intelligence (Link).

  • JAMA: Ambient Artificial Intelligence Scribes and Physician Financial Productivity (Link).

  • NEJM: Grading LLMs on the Ability to Grade (Link).

🧩TriviaRX

Which everyday medical tool was originally introduced to detect anesthesia-related hypoxia, not to monitor patients continuously?

A) Blood pressure cuff
B) Pulse oximeter
C) Capnography
D) ECG

Now, time for the answer to last week’s TriviaRX: What was one of the first real clinical uses of machine learning in medicine?

B) ECG interpretation

Computer-assisted ECG analysis was used clinically as early as the 1960s.

Thank you for taking the time to read.

You’re already ahead of the curve in medical AI — don’t keep it to yourself. Forward AIMedily to your colleagues who’d appreciate the insights.

Until next Wednesday.

Itzel Fer, MD PM&R

Follow me on LinkedIn | Substack | X | Instagram

Forwarded this email? Sign up here

P.S. Enjoying AIMedily? 👉 Write a review here (it takes less than a minute).
