- Journal: New England Journal of Medicine
- Year: 2024
- Key Finding: GPT-4 achieved diagnostic accuracy comparable to attending physicians on standardized clinical vignettes.
Study Design
This study evaluated GPT-4’s performance on a standardized set of clinical vignettes drawn from published case challenges and board-style questions. The vignettes covered a range of specialties and diagnostic complexity.
Physician performance (attending level) was used as the benchmark comparator. Both the AI and physician cohort answered the same vignettes under controlled conditions. Accuracy was measured as the proportion of correct diagnoses generated within the top three differentials.
Key Findings
- GPT-4 achieved diagnostic accuracy statistically comparable to attending physicians on the overall vignette set
- Performance was strongest in common diagnoses in high-representation specialties (internal medicine, cardiology)
- The model was relatively weak on rare diseases and on vignettes requiring synthesis of physical exam findings that were not explicitly stated
- Physicians outperformed GPT-4 on vignettes requiring integration of nonverbal cues and clinical gestalt
- When allowed to ask clarifying questions (simulated dialogue mode), GPT-4’s accuracy improved significantly
Clinical Implications
This study is frequently cited as evidence that LLMs are “ready for clinical practice” — but that framing overstates the findings. A few important caveats:
- Vignettes are not real patients. Clinical vignettes present curated, structured information. Real patients present with ambiguous, incomplete, and conflicting data.
- Accuracy ≠ safety. A model that gets the right answer 85% of the time may be dangerous in high-stakes diagnostic contexts where 15% error is unacceptable.
- The physician comparator matters. "Comparable to attending physicians" tells you less than it sounds like: physician diagnostic accuracy varies widely by specialty, experience, and fatigue.
- Output quality varies with prompt quality. The study used structured queries. Poorly formatted prompts to the same model yield substantially worse performance.
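To make the prompt-quality point concrete, here is a hypothetical sketch of a structured diagnostic query in the spirit of the study's controlled conditions. The field names and the example patient are invented; the study's actual prompts are not reproduced here.

```python
# Hypothetical structured prompt builder: separates history, exam, and
# labs into labeled fields rather than one unformatted narrative blob.
def build_prompt(age, sex, chief_complaint, history, exam, labs):
    return (
        "You are assisting with differential diagnosis.\n"
        f"Patient: {age}-year-old {sex}\n"
        f"Chief complaint: {chief_complaint}\n"
        f"History: {history}\n"
        f"Exam: {exam}\n"
        f"Labs: {labs}\n"
        "List the three most likely diagnoses, most likely first."
    )

prompt = build_prompt(
    58, "male", "crushing chest pain for 2 hours",
    "hypertension, current smoker", "diaphoretic, BP 160/95",
    "troponin pending",
)
print(prompt)
```

The same clinical facts pasted as a single run-on sentence would reach the model with less structure, which is the kind of degradation the caveat above describes.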
My Takeaway
This is a landmark study, but it’s best read as proof of concept rather than practice-changing evidence. The finding that GPT-4 can match attending-level accuracy on vignettes is genuinely impressive — it means LLMs have crossed a threshold of clinical usefulness.
For clinical practice, the most honest translation is: an LLM can function as a capable differential generator and cognitive second opinion, especially for common diagnoses. It should never replace clinical judgment, but used well, it can reduce anchoring, prompt consideration of less common diagnoses, and speed up documentation.
The key word is “used well.” That requires understanding what these models are good at, where they fail, and how to prompt them effectively — which is exactly what this site is for.