The ops hire that onboards in 30 seconds.
Viktor is an AI coworker that lives in Slack, right where your team already works.
Message Viktor like a teammate: "pull last quarter's revenue by channel," or "build a dashboard for our board meeting."
Viktor connects to your tools, does the work, and delivers the actual report, spreadsheet, or dashboard. Not a summary. The real thing.
There’s no new software to adopt and no one to train.
Most teams start with one task. Within a week, Viktor is handling half of their ops.
Hi!
Welcome back to AIMedily.
I don’t know about you, but this year is moving incredibly fast. I can’t believe it’s already May.
This week was graduation week at the University of Michigan, and the Robotics department celebrated with a robot parade that included the lab where I work. You can check it out here.
Now, let’s get into today’s issue.
🤖 AIBytes
Researchers tested OpenAI’s o1 model against physicians in 6 clinical reasoning experiments, spanning published teaching cases and real emergency department patients.
They compared diagnosis, test selection, management reasoning, and second-opinion performance.
🔬 Methods
The model was tested on published diagnostic and management cases and on 76 real emergency department cases.
In the emergency department study, o1, GPT-4o, and 2 attending physicians each gave differential diagnoses.
Two other attending physicians scored the answers without knowing whether they came from a model or a physician.
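For readers curious how this kind of blinded head-to-head scoring is typically set up, here is a minimal sketch in Python. The responses and IDs are hypothetical stand-ins, not the study’s data or code: answers from every source are pooled, de-identified, and shuffled before the raters see them.

```python
# Minimal sketch of a blinded scoring setup (hypothetical data, not the
# study's code): pool responses, hide their sources, shuffle, then score.
import random

# Hypothetical differential diagnoses for one ED case, keyed by source.
responses = {
    "o1": "1. Pulmonary embolism  2. Pneumonia  3. Acute coronary syndrome",
    "gpt-4o": "1. Pneumonia  2. Pulmonary embolism  3. Pericarditis",
    "physician_1": "1. Pulmonary embolism  2. ACS  3. Pneumonia",
    "physician_2": "1. Pneumonia  2. ACS  3. Pulmonary embolism",
}

# De-identify: assign each response an opaque ID so raters are blinded.
items = list(responses.items())
random.shuffle(items)
blinded = {f"response_{i}": pair for i, pair in enumerate(items)}

# Raters see only the ID and text; the source map is revealed after scoring.
for rid, (_source, text) in blinded.items():
    print(rid, "->", text)  # the source stays hidden until unblinding
```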
📊 Results
In the published diagnosis cases, o1 often included the correct or a very close diagnosis.
In the NEJM case series, it included the correct diagnosis in 78.3% of cases. When very close answers were also counted, this increased to 97.9%.
For next-test selection, o1 chose the correct test in 87.5% of cases.
In structured reasoning tasks, o1 outperformed GPT-4 and physician comparison groups in several experiments. For example, in the NEJM Healer cases, it achieved a perfect score in 78 of 80 cases.
In the emergency department study, o1 identified the exact or very close diagnosis in 67.1% of cases at triage, 72.4% during physician evaluation, and 81.6% at hospital or ICU admission.
It performed better than both attending physicians at each stage.
🔑 Key Takeaways
In this study, o1 outperformed physician baselines on several text-based clinical reasoning tasks.
The biggest gap appeared early in the emergency department, when information was limited and decisions were urgent.
This does not mean the model is ready for independent clinical use. The authors call for prospective trials, monitoring, and safer clinician-AI integration.
The study was limited to text-based tasks and focused mainly on internal medicine and emergency medicine, so the results may not apply to all specialties or real-world multimodal care.
🔗 Brodeur PG, Buckley TA, Kanjee Z, et al. Performance of a large language model on the reasoning tasks of a physician. Science. 2026;392(6797). doi:10.1126/science.adz4433
Researchers built APOLLO, a multimodal foundation model that turns a patient’s full medical record into one patient representation.
It was trained on 25.3 billion medical events from 7.2 million patients and tested across disease prediction, disease progression, treatment response, adverse events, hospital outcomes, and patient retrieval.
🔬 Methods
APOLLO used structured data, clinical notes, reports, and some images from one large U.S. health system.
The model was tested on 322 tasks.
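A common way to reuse one frozen patient representation across hundreds of tasks is to train a lightweight classifier (a “probe”) per task on top of the fixed embeddings. The sketch below illustrates that general pattern with synthetic data; it is an assumption about the approach, not APOLLO’s actual pipeline.

```python
# Sketch of the "one representation, many tasks" pattern with synthetic
# data: freeze the patient embeddings, fit one lightweight probe per task.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 256))                        # stand-in embeddings
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # stand-in task labels

# One small classifier per downstream task, trained on frozen embeddings.
probe = LogisticRegression(max_iter=1000).fit(X[:800], y[:800])
auroc = roc_auc_score(y[800:], probe.predict_proba(X[800:])[:, 1])
print(f"task AUROC: {auroc:.2f}")
```

In this setup the expensive model runs once per patient, and each new task adds only a small classifier, which is what makes evaluating hundreds of tasks tractable.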
📊 Results
APOLLO outperformed the age-sex baseline (see the sketch after this list) in:
74 of 95 new disease tasks
53 of 78 disease progression tasks
30 of 59 treatment response tasks
12 of 17 adverse event tasks
9 of 12 hospital operations tasks
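The “N of M tasks” comparison above boils down to computing a metric such as AUROC for the model and for the age-sex baseline on each task, then counting wins. A toy version with synthetic labels and scores, not the paper’s data:

```python
# Toy version of the per-task "model vs age-sex baseline" tally, using
# synthetic labels and scores rather than the paper's data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
wins, n_tasks = 0, 95
for _ in range(n_tasks):
    y = rng.integers(0, 2, size=500)                 # synthetic task labels
    model_scores = 0.4 * y + 0.8 * rng.random(500)   # stronger signal
    baseline_scores = 0.1 * y + rng.random(500)      # weaker signal
    if roc_auc_score(y, model_scores) > roc_auc_score(y, baseline_scores):
        wins += 1
print(f"model beat baseline in {wins} of {n_tasks} tasks")
```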
Strong examples (APOLLO vs the age-sex baseline) included:
Schizophrenia risk: 0.92 vs 0.65
Acute myocardial infarction risk: 0.82 vs 0.68
Heart failure risk: 0.88 vs 0.77
Trastuzumab survival in breast cancer: 0.93 vs 0.66
Acute kidney injury after NSAIDs: 0.91 vs 0.80
The model also supported patient retrieval using full records, text, or image queries.
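Retrieval like this usually works by mapping queries and patient records into one shared embedding space and ranking by similarity. The sketch below assumes that design (it is not the paper’s code) and uses random vectors in place of real embeddings.

```python
# Sketch of embedding-space patient retrieval with random stand-in vectors;
# assumes queries and records share one embedding space (not the paper's code).
import numpy as np

rng = np.random.default_rng(2)
patients = rng.normal(size=(10_000, 256))
patients /= np.linalg.norm(patients, axis=1, keepdims=True)  # unit vectors

def retrieve(query_vec: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k patients most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = patients @ q                # cosine similarity on unit vectors
    return np.argsort(-sims)[:k]

query = rng.normal(size=256)           # stand-in for an embedded text query
print("top matches:", retrieve(query))
```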

Image from original paper
🔑 Key Takeaways
APOLLO was built as a general patient model, not for one disease only.
Its main strength is that it uses the whole medical record, not just structured EHR data.
The results are promising, but they come from a single health system, so external validation is still needed.
The model is associational, not causal. It predicts risk, but it does not show which treatment is best.
🔗 Zhang A, Ding T, Wagner SJ, et al. A multimodal and temporal foundation model for virtual patient representations at healthcare system scale. arXiv. 2026. doi:10.48550/arXiv.2604.18570
🦾TechTools
A platform that uses AI and retinal scans to estimate cardiometabolic risk quickly and without radiation.
Its platform includes Dr. Noon CVD, which helps assess future cardiovascular disease risk from a retinal image.
Designed for real clinical workflows, with results available in minutes.
An AI-enabled ECG platform that helps identify patients at risk for hidden cardiovascular disease.
Includes FDA-cleared tools for atrial fibrillation risk and low ejection fraction detection.
Built to fit into existing ECG and EHR workflows.
📈 Productivity Tool of the Week:
An AI browser that searches the web and generates one clear answer instead of making you open many tabs.
Includes features like Browse for Me, page summarization, reader mode, and translation.
Useful for quickly comparing tools, reading articles faster, and reducing research clutter on mobile.
🧬AIMedily Snaps
OpenEvidence collaborates with NCCN to integrate canonical oncology treatment algorithms at the point of care (Link).
Eric Topol: The Paradox of Medical AI Implementation (Link).
Google DeepMind: Enabling a new model for healthcare with AI co-clinician (Link).
Stanford AIMI session: Multimodal foundation models for precision oncology (Link).
AI Tool that Estimates Biological Age from Face Photos Could Serve as Prognostic Biomarker for Cancer (Link).
🧪Research Signals
Nature: Towards generalizable AI in medicine via Generalist–Specialist Collaboration (Link).
NPJ: The absence of full lifecycle risk management for AI-based medical devices in radiology (Link).
Nature: Show us the evidence for the value of medical AI (Link).
CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation (Link).
JMIR: Clinical AI is Not (Yet) Trustworthy—But It Could Be (Link).
NEJM: A Typology of Generative Health Care AI — Definitions and Policy Implications (Link).
🧩TriviaRX
Which procedure was performed in 1667, briefly seemed successful, and was then banned for years?
A) Cesarean section
B) Smallpox inoculation
C) Blood transfusion
D) Appendectomy
Now, the answer from last week’s TriviaRX ✅ B) Smallpox
Smallpox became the first human disease eradicated worldwide through vaccination, culminating in WHO’s eradication declaration in 1980.
That’s it for today.
As always, thank you for taking the time to read.
You’re already ahead of the curve in medical AI — don’t keep it to yourself. Forward AIMedily to your colleagues who’d appreciate the insights.
Until next week.
Itzel Fer, MD PM&R
Forwarded this email? Sign up here
P.S. Enjoying AIMedily? 👉 Write a review here (it takes less than a minute).