
Hi!

Today is LLM Friday.

Research on large language models is growing exponentially; more papers on the topic are published every week. But there is still uncertainty about how these models will be integrated into clinical workflows.

How do we ensure safety, accuracy, and ethical use? How do we regulate their use? When should we move from benchmark testing to real clinical scenarios?

There are several questions no one knows the answer to yet, but I believe the best we can do is to keep learning, testing, and gradually integrating them under supervision.

What is your take on this?

For now, let’s dive into today’s issue.

LLMs

This commentary from Nature Medicine analyzes GPT-5’s reliability, safety, and reasoning in clinical settings.

🔬 Methods

What was evaluated: GPT-5's two variants:

  • GPT-5-Thinking

  • GPT-5-Main

Both were compared with GPT-4o, o3, and prior LLMs.

Benchmarks used: HealthBench (5,000 clinician-curated cases), MedQA NOTA variant, and stress tests.

📊 Results

  • GPT-5-Thinking made 65–78% fewer major errors than o3; GPT-5-Main made 44% fewer than GPT-4o.

  • GPT-5 fails in over 50% of difficult clinical scenarios.

  • <1% of generated responses included a safety disclaimer.

  • GPT-5-Thinking maintained safety rules 99% of the time; GPT-5-Main dropped to 79%.

🔑 Key Takeaways

  • GPT-5 models hallucinate less, but they don't eliminate clinical risk, particularly in complex or adversarial scenarios.

  • GPT-5's reasoning and writing skills have improved, but it still predicts; it doesn't reason like a doctor.

  • The disappearance of safety disclaimers is risky in clinical use.

  • Integration into Electronic Health Records should be done in secure, auditable environments.

  • Experts should test and secure LLMs before clinical deployment.

🔗 Handler R, Sharma S, Hernandez-Boussard T. The fragile intelligence of GPT-5 in medicine. Nat Med. 2025. doi:10.1038/s41591-025-04008-8

Researchers developed a 750-question script concordance test (SCT) to evaluate how LLMs adjust clinical judgments with new information, comparing 10 models against 1,070 medical students, 193 residents, and 300 attending physicians across multiple specialties.

🔬 Methods

  • Benchmark: 750 SCT questions from 10 international datasets across multiple specialties.

  • Participants: 1,070 medical students, 193 residents, and 300 attending physicians.

  • LLMs tested: GPT-4o, o3, o1, o4-mini, Gemini 1.5 Pro, Claude 3.5, NotebookLM, DeepSeek, Llama, and Mistral.

  • Goal: Compare LLM reasoning with real clinicians and analyze how prompting and architecture affect accuracy.

📊 Results

  • Top performer: OpenAI’s o3 (67.8 ± 1.2%), followed by GPT-4o (63.9 ± 1.3%).

  • Human comparison: Models matched medical-student performance but fell short of senior residents and attending physicians.

  • Overconfidence pattern: Reasoning-tuned models are extremely confident and rarely choose neutral responses, showing limited flexibility under uncertainty.

  • Statistical tests: Significant model differences (p < 0.001). OpenAI’s o3 outperformed all others.

  • Reasoning models underperformed: OpenAI o1, DeepSeek R1, and Gemini 2.5 scored lowest.

🔑 Key Takeaways

  • Script Concordance Test reveals gaps in adaptive reasoning that traditional multiple-choice benchmarks miss.

  • Reasoning-optimized models do not always reason better — sometimes, explicit reasoning prompts reduce accuracy.

  • LLMs tend to be overconfident and struggle to judge when new data should not change a decision.

  • The new public SCT benchmark (concor.dance) can be used to measure clinical reasoning before real-world deployment (a toy scoring example follows this summary).

🔗 McCoy LG, Swamy R, Sagar N, et al. Assessment of Large Language Models in Clinical Reasoning: A Novel Benchmarking Study. NEJM AI. 2025;2(10). doi:10.1056/AIdbp2500120.
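
For readers new to the format: an SCT item presents a clinical hypothesis, adds a new piece of information, and asks how much that information should change the hypothesis on a Likert scale from −2 to +2. Below is a minimal Python sketch of the classic aggregate scoring method, in which an answer earns partial credit proportional to how many expert panelists chose it. This is a general illustration with made-up numbers, not necessarily the exact scoring used in this study.

```python
from collections import Counter

def sct_item_credit(examinee_answer, panel_answers):
    """Aggregate scoring for one SCT item.

    Credit = (panelists who chose the examinee's answer)
             / (panelists who chose the modal answer),
    so the most popular expert answer earns full credit and an
    answer no panelist chose earns zero.
    """
    counts = Counter(panel_answers)
    modal_count = max(counts.values())
    return counts.get(examinee_answer, 0) / modal_count

def sct_score(examinee_answers, panel_answers_per_item):
    """Mean credit across items, expressed as a percentage."""
    credits = [
        sct_item_credit(answer, panel)
        for answer, panel in zip(examinee_answers, panel_answers_per_item)
    ]
    return 100 * sum(credits) / len(credits)

# Toy data: Likert answers run from -2 ("much less likely") to +2
# ("much more likely"); 0 means the new information should not change
# the initial hypothesis.
panel_by_item = [
    [0, 0, +1, 0, -1],    # item 1: most experts say "no change"
    [+2, +1, +2, +2, +1], # item 2: most experts say "much more likely"
]
model_answers = [+1, +2]  # e.g. parsed from an LLM's Likert replies

print(f"SCT score: {sct_score(model_answers, panel_by_item):.1f}%")
```

Note that a neutral response of 0 ("the new information changes nothing") can earn full credit when the panel agrees; the overconfidence pattern above, where reasoning-tuned models rarely pick neutral answers, is exactly the behavior this scoring penalizes.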

🦾TechTools

To continue discussing how to improve the use of general LLMs in medicine, let’s talk about hallucinations.

What it means:
Hallucinations happen when an AI generates confident but false medical information.

Even small errors—like wrong dosages or fake references—can be risky in healthcare.

How do we lower the risk?

  • Use verified medical sources (PubMed, guidelines) when prompting.

  • Ask AI to show its references and confirm they are real.

  • Provide the LLM with documents or data you trust, so answers are grounded in real sources (see the sketch after this list).

  • Always review.
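
To make the grounding tip concrete, here is a minimal Python sketch using the OpenAI client as one example provider; the file name, model, and question are placeholders, and the same prompt pattern works with any chat-capable LLM. The idea is simply to confine the model to a document you trust and force it to quote its source or admit the answer is not there.

```python
# Minimal sketch: confine the model to a vetted source and require quotes,
# so hallucinated facts and invented references have nowhere to hide.
# Assumes the `openai` package and an OPENAI_API_KEY; the file name, model
# name, and question are placeholders, not recommendations.
from openai import OpenAI

client = OpenAI()

# Placeholder path: your own trusted document (guideline excerpt, local protocol, etc.)
with open("trusted_guideline_excerpt.txt") as f:
    trusted_source = f.read()

prompt = f"""Answer ONLY using the source below.
For every claim, quote the exact passage you relied on.
If the source does not contain the answer, reply "Not stated in the source."

SOURCE:
{trusted_source}

QUESTION:
What is the recommended first-line treatment for uncomplicated cystitis?"""

response = client.chat.completions.create(
    model="gpt-4o",  # example model; any chat-capable LLM works the same way
    temperature=0,   # lower temperature discourages creative (and confabulated) wording
    messages=[
        {"role": "system", "content": "You are a careful clinical assistant. Never invent references."},
        {"role": "user", "content": prompt},
    ],
)
print(response.choices[0].message.content)
```

This pattern supports, rather than replaces, the last bullet: a clinician still reviews the output, and the quoted passages make that review fast.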

🧬AIMedily Snaps

  • A smart stethoscope + AI assistant (Link)

  • AI adoption in clinical practice: can we rely on it? (Link)

  • Medicine, machines, and magic - podcast with Dr. Jonathan Chen from Stanford NEJM AI (Link).

That’s all for now.

You’re already ahead of the curve in medical LLMs — don’t keep it to yourself. Forward AIMedily to a colleague who’d appreciate the insights.

Thank you!

Until next Wednesday.

Itzel Fer, MD PM&R

Follow me on LinkedIn | Substack | X | Instagram

Forwarded this email? Sign up here

