Hi!

Welcome to AIMedily’s LLM Fridays.

A day to share research, tools, and news on Large Language Models (such as ChatGPT, Claude, etc.).

This post will take you less than 5 min to read.

But before we start, let’s talk a bit about AI medical agents: systems trained to perform specific tasks autonomously.

For example, they can detect abnormalities in X-rays, flag urgent cases, suggest possible diagnoses or treatment options based on patient data, take notes, monitor wearable data, message patients, schedule follow-ups, and update care plans.

They facilitate task automation and improve workflows.

They are already used in administrative roles, but in clinical work they still need close supervision.

The main challenges to overcome are accuracy, integration, and regulation.

Now, let’s dive into today’s issue. Are you ready?

✨LLMs

How ChatGPT-4 Scored on 15 Emergency Cases

Researchers asked GPT-4 to manage 15 standardized emergency scenarios.

A board-certified emergency physician managed the same cases and scored GPT-4's performance.

🔬 Methods

Materials: 15 validated emergency scenarios.

The cases varied in clinical complexity and urgency and covered different medical specialties.

Evaluator: A board-certified emergency physician evaluated the cases and scored GPT-4's concordance as:

High (5/5)
Moderate (4/5)
Low (≤3/5)

Areas evaluated (0–5 each):

Diagnosis: Was the main and differential diagnosis correct and complete?
Investigations: Were the tests appropriate and listed in the right order?
Treatment: Was the first treatment safe, correct, and in line with common practice?
Safety and applicability: Could these suggestions be safely used in a real emergency department?
Complexity of decision-making: Does it show understanding of the case’s challenges?

📊 Results

High concordance (5/5): 8/15 cases (53.3%)
Moderate concordance (4/5): 4/15 (26.7%)
Low concordance (≤3/5): 3/15 (20.0%)
The best results were in guideline-driven scenarios, for example, asthma, diabetic ketoacidosis, and ST-elevation myocardial infarction.
The lowest concordance was found in complex cases like stroke with unknown onset, traumatic hemorrhagic shock, and mixed acid-base disturbance.
Errors included not prioritizing the airway in a trauma case and recommending thrombolysis without clear timing.
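
As a quick check of the numbers above, here is a minimal Python sketch that recomputes the percentages from the reported case counts and maps a 0–5 score to the rubric's labels. The function name `concordance_level` and the `counts` dictionary are illustrative assumptions, not anything from the study itself, and the sketch does not try to reproduce how the evaluator arrived at each case's score.

```python
def concordance_level(score: int) -> str:
    """Map a 0-5 concordance score to the rubric label (hypothetical helper)."""
    if not 0 <= score <= 5:
        raise ValueError("score must be between 0 and 5")
    if score == 5:
        return "High"
    if score == 4:
        return "Moderate"
    return "Low"  # 3 or below


# Case counts per category, taken from the results listed above.
counts = {"High": 8, "Moderate": 4, "Low": 3}
total = sum(counts.values())

# Share of cases in each category, rounded to one decimal place.
percentages = {level: round(100 * n / total, 1) for level, n in counts.items()}

for level, pct in percentages.items():
    print(f"{level}: {counts[level]}/{total} ({pct}%)")
```

Running it prints the same three percentages reported in the results, which confirms that the categories add up to all 15 cases.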

🔑 Key Takeaways

ChatGPT-4 was more accurate in structured, guideline-based scenarios.
Performance drops with complex and multifactorial cases.
ChatGPT-4 is useful as a supportive and educational tool. But it cannot replace the clinical reasoning of human physicians, especially in complex or unpredictable real-life scenarios.
Future studies should assess AI tools using real-time patient data to better understand their utility and limitations.

🔗 Gün M. Can AI match emergency physicians in managing common emergency cases? A comparative performance evaluation. BMC Emerg Med. 2025;25(1):142. https://doi.org/10.1186/s12873-025-01303-y

🦾TechTools

You can click on the names to test them (if you haven’t already).

DeepSeek:

This Chinese LLM is low-cost and multilingual.
Good for summarizing documents.
Has shown high accuracy in clinical reasoning.
It’s open source (the code and model weights are openly available).
Raises data privacy concerns.

Gemini:

Google's LLM can be integrated into Google Docs, Sheets, and Gmail.
Has multimodal capabilities: it works with images, audio, and video.
Has web access by default.
Is great for creative work.
Susceptible to hallucinations.

Grok:

Elon Musk's LLM (from xAI) is good for coding and handling large amounts of data.
Cites research consistently.
Best suited to everyday tasks and to people who use X (formerly Twitter).
Limited medical focus.
Lacks safety features.

That’s all for today.

I have a big favor to ask: please share AIMedily with people who work in healthcare.

You can:

↪️Forward this email or 📲copy this link to send it on your phone.

Thank you!

See you next week (enjoy the weekend 😎).

Itzel Fer, MD PM&R

Join my Newsletter 👉 AIMedily.com
