Hi!
Welcome back to AIMedily.
How’s winter treating you? Here in Ann Arbor, it’s been snowy and cold. If the forecast holds, temperatures may drop to –13°F (–25°C) this weekend 🥶. I’m definitely not ready for that.
Now, back to what matters.
This week, Stanford ARISE released a new report that offers a clear, grounded view of what’s happening in AI in medicine.
It’s a comprehensive piece: it covers key research, how these tools can be translated into real-world care, and where the field may be heading next.
Ready to dive in?
🤖 AIBytes
Researchers developed MedHELM, a clinician-validated framework to evaluate LLMs across real-world medical tasks, beyond exam questions.
🔬 Methods
29 clinicians across 14 medical specialties reviewed and validated the evaluation framework.
The authors organized medical AI tasks into:
5 categories (clinical decision support, clinical note generation, patient communication & education, medical research assistance, administration & workflow)
22 subcategories
121 specific real-world clinical tasks
Benchmarks: 37 in total, spanning public, gated, and private real-world datasets, including electronic health record (EHR)-based tasks.
Models evaluated:
Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, o3-mini, GPT-4o, GPT-4o mini, Gemini 1.5 Pro, Gemini 2.0 Flash, and Llama 3.3.
Evaluation approach:
Exact-match scoring for structured tasks.
LLM-based evaluation by three independent judge models for open-ended clinical outputs, assessing accuracy, completeness, and clarity (see the sketch after this list).
Cost–performance analysis.
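To make the LLM-as-judge step concrete, here’s a minimal sketch of jury-style scoring in Python. The judge names, the canned scores, and the call_judge helper are hypothetical stand-ins for real model calls; only the rubric dimensions (accuracy, completeness, clarity) come from the paper.

```python
from statistics import mean

RUBRIC = ("accuracy", "completeness", "clarity")

def call_judge(judge: str, output: str, dimension: str) -> float:
    """Stand-in for one judge LLM. A real implementation would send a
    grading prompt to the judge model and parse a 1-5 rubric score;
    here we return a canned value so the sketch runs end to end."""
    canned = {"judge-a": 4.0, "judge-b": 5.0, "judge-c": 4.0}
    return canned[judge]

def jury_score(output: str, judges=("judge-a", "judge-b", "judge-c")) -> float:
    """Average each rubric dimension across the three independent
    judges, then average the dimensions into one overall score."""
    per_dimension = [
        mean(call_judge(j, output, dim) for j in judges) for dim in RUBRIC
    ]
    return mean(per_dimension)

print(jury_score("Draft patient-education paragraph..."))  # 4.333...
```

Using three judges and averaging their scores is a common way to dampen the bias of any single grader model.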

📊 Results
Top overall performance: DeepSeek R1 and o3-mini (66% win rate across benchmarks; for how win rates are computed, see the sketch after this list).
Best balance of performance and cost: Claude 3.5 Sonnet (63% win rate at 15% lower computational cost).
Strongest task areas across models:
Clinical note generation
Patient communication and education
Weakest area for all models: Administration & workflow.
Performance on medical licensing exams did not reliably predict real-world clinical task performance.
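For readers curious about the win-rate metric, here’s a minimal sketch of how a HELM-style mean win rate is typically computed: the fraction of rival models a model outscores on each benchmark, averaged over benchmarks. The model names and scores below are made up for illustration, not taken from the paper.

```python
from statistics import mean

# Hypothetical benchmark scores: {benchmark: {model: score}}.
scores = {
    "bench-1": {"model-a": 0.82, "model-b": 0.74, "model-c": 0.69},
    "bench-2": {"model-a": 0.55, "model-b": 0.61, "model-c": 0.48},
}

def mean_win_rate(model: str) -> float:
    """Fraction of rival models this model outscores on each
    benchmark, averaged across all benchmarks."""
    per_benchmark = []
    for results in scores.values():
        rivals = [m for m in results if m != model]
        wins = sum(results[model] > results[r] for r in rivals)
        per_benchmark.append(wins / len(rivals))
    return mean(per_benchmark)

for m in ("model-a", "model-b", "model-c"):
    print(m, round(mean_win_rate(m), 2))
# model-a 0.75, model-b 0.75, model-c 0.0
```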
🔑 Key Takeaways
High exam scores do not translate into readiness for real clinical deployment.
LLM performance varies by task type, not just by model brand.
Administrative and workflow tasks remain a major limitation.
MedHELM provides a practical way for health systems to select AI tools based on real clinical needs.
🔗 Bedi S, Cui H, Fuentes M, et al. Holistic evaluation of large language models for medical tasks with MedHELM. Nature Medicine. 2026. doi:10.1038/s41591-025-04151-2
Researchers tested whether an LLM chatbot could help patients and primary care clinicians create clearer, more complete referrals to specialists, without increasing risk or clinical burden.
🔬 Methods
2,069 patients / care partners
111 specialists across 24 medical disciplines
Intervention:
PreA (Pre-Assessment), a co-designed chatbot built on GPT-4o mini
It conducted structured history-taking, generated preliminary diagnostic suggestions, suggested tests, and produced referral reports for specialists (see the pipeline sketch after this list).
Groups:
PreA-only (autonomous use)
PreA-human (with staff assistance)
No-PreA (usual care)
Primary outcomes:
Consultation duration
Physician-rated care coordination
Patient-reported ease of communication
Secondary outcomes:
Physician workload (patients per shift)
Patient satisfaction, attentiveness, acceptability
Clinical decision-making patterns
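As promised above, a minimal sketch of what a PreA-style intake pipeline could look like. The question list and both helpers are hypothetical illustrations; the paper describes the behavior (structured history-taking that ends in a referral report), not this code.

```python
# Minimal sketch of a structured-intake chatbot that ends with a
# draft referral report. All prompts and helpers are hypothetical.
INTAKE_QUESTIONS = [
    "What is your main concern today?",
    "When did the symptoms start?",
    "Any relevant medical history or current medications?",
]

def ask_patient(question: str) -> str:
    """Stand-in for one chatbot turn; a real system would route this
    through the LLM to ask follow-ups and validate answers."""
    return input(question + " ")

def generate_referral(history: dict) -> str:
    """Stand-in for the LLM call that drafts the specialist report
    (preliminary diagnostic suggestions and suggested tests would be
    generated here from the structured history)."""
    lines = [f"- {q} {a}" for q, a in history.items()]
    return "REFERRAL REPORT (draft for specialist review)\n" + "\n".join(lines)

def run_intake() -> str:
    history = {q: ask_patient(q) for q in INTAKE_QUESTIONS}
    return generate_referral(history)

if __name__ == "__main__":
    print(run_intake())
```

Keeping the report explicitly labeled as a draft mirrors the trial’s design: the specialist, not the chatbot, remains the decision-maker.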
📊 Results
Consultation time: PreA-only reduced specialist consultation duration by 28.7% (P < 0.001).
Care coordination (physician-rated): 113% relative improvement with PreA vs usual referrals, P < 0.001 (on reading relative improvements, see the note after this list).
Patient-reported communication ease: 16% relative improvement with PreA, P < 0.001.
Autonomy: no significant differences between the PreA-only and PreA-human groups, supporting fully autonomous operation.
Physician workload: Physicians saw 15.3% more patients per shift without increased waiting times.
Clinical safety:
No detectable changes in physician clinical decision-making patterns.
Referral quality:
History-taking: 65.8%
Diagnoses: 66.7%
Test ordering: 70.7%
Co-designed models outperformed locally fine-tuned models across all clinical domains.
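As flagged above, a quick note on reading the relative-improvement numbers: they are computed against the usual-care baseline, so a 113% relative improvement means the PreA group’s rating was a bit more than double the baseline. A one-line illustration with made-up values:

```python
def relative_improvement(baseline: float, treatment: float) -> float:
    """Percent change relative to the usual-care baseline."""
    return (treatment - baseline) / baseline * 100

# Made-up illustration: a coordination rating rising from 2.0 to 4.26
print(round(relative_improvement(2.0, 4.26), 1))  # 113.0
```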

🔑 Key Takeaways
An LLM chatbot can reduce specialist workload while improving patient experience.
Co-design with local clinical stakeholders outperforms passive training.
Efficiency gains did not compromise clinical reasoning or care.
Results are most applicable to high-volume, resource-constrained health systems.
🔗 Tao X, Zhou S, Ding K, et al. An LLM chatbot to facilitate primary-to-specialist care transitions: a randomized controlled trial. Nature Medicine. 2026. doi:10.1038/s41591-025-04176-7
🦾TechTools
A voice-based AI platform that converts clinician–patient conversations into clinical documentation.
It’s designed to support workflows within existing clinical systems.
Reduces manual documentation effort while keeping clinicians fully in the loop.
Platform focused on secure aggregation and sharing of longitudinal medical records.
Enables consent-driven access to patient data across care settings.
Designed to improve information availability for care coordination.
AI-powered conversational search engine.
Built as a privacy-first, ad-free alternative to commercial search engines.
Helps you explore research and ideas before moving to primary sources.
🧬AIMedily Snaps
Horizon 1000: the Gates Foundation and OpenAI commit $50 million to strengthen primary healthcare across 1,000 African clinics (Link).
OpenEvidence partners with Rwanda Biomedical Center to adapt clinical decision support AI tool for use in low- and middle-income countries (Link).
Guiding Principles of Good AI Practice in Drug Development (Link).
Top healthcare AI trends in 2026 (Link).
NVIDIA and Lilly Announce Co-Innovation AI Lab to Reinvent Drug Discovery (Link).
European Medicines Agency and FDA set common principles for AI in medicine development (Link).
🧪 Research Signals
Nature: An artificial intelligence-powered learning health system to improve sepsis detection and quality of care: a before-and-after study (Link).
NEJM: Assessment of Short-Answer Questions by ChatGPT in a Medical School Course (Link).
NEJM: The Paradoxical Challenge of High-Value Medical Artificial Intelligence (Link).
JAMA: Ambient Artificial Intelligence Scribes and Physician Financial Productivity (Link).
NEJM: Grading LLMs on the Ability to Grade (Link).
🧩TriviaRX
Which everyday medical tool was originally introduced to detect anesthesia-related hypoxia, not to monitor patients continuously?
A) Blood pressure cuff
B) Pulse oximeter
C) Capnography
D) ECG
Now, time for the answer to last week’s TriviaRX: What was one of the first real clinical uses of machine learning in medicine?
✅ B) ECG interpretation
Computer-assisted ECG analysis was used clinically as early as the 1960s.
Thank you for taking the time to read.
You’re already ahead of the curve in medical AI — don’t keep it to yourself. Forward AIMedily to your colleagues who’d appreciate the insights.
Until next Wednesday.
Itzel Fer, MD PM&R
Forwarded this email? Sign up here
P.S. Enjoying AIMedily? 👉 Write a review here (it takes less than a minute).