Nur für Besitzer und Autoren: Kontrollseite des Eintrags

Knoedler, Leonard

; Alfertshofer, Michael ; Knoedler, Samuel ; Hoch, Cosima C. ; Funk, Paul F. ; Cotofana, Sebastian ; Maheta, Bhagvat J. ; Frank, Konstantin ; Brébant, Vanessa

; Prantl, Lukas

; Lamby, Philipp

Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis

Knoedler, Leonard

, Alfertshofer, Michael, Knoedler, Samuel, Hoch, Cosima C., Funk, Paul F., Cotofana, Sebastian, Maheta, Bhagvat J., Frank, Konstantin, Brébant, Vanessa

, Prantl, Lukas

und Lamby, Philipp (2024) Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis. JMIR Medical Education 10, e51148.

Veröffentlichungsdatum dieses Volltextes: 19 Jan 2024 07:32
Artikel
DOI zum Zitieren dieses Dokuments: 10.5283/epub.55352

Veröffentlichte Version
Download ( PDF | 392kB)

Lizenz: Creative Commons Namensnennung 4.0 International

Zusammenfassung

Background: The United States Medical Licensing Examination (USMLE) has been critical in medical education since 1992, testing various aspects of a medical student’s knowledge and skills through different steps, based on their training level. Artificial intelligence (AI) tools, including chatbots like ChatGPT, are emerging technologies with potential applications in medicine. However, comprehensive studies analyzing ChatGPT’s performance on USMLE Step 3 in large-scale scenarios and comparing different versions of ChatGPT are limited.

Objective: This paper aimed to analyze ChatGPT’s performance on USMLE Step 3 practice test questions to better elucidate the strengths and weaknesses of AI use in medical education and deduce evidence-based strategies to counteract AI cheating.

Methods: A total of 2069 USMLE Step 3 practice questions were extracted from the AMBOSS study platform. After including 229 image-based questions, a total of 1840 text-based questions were further categorized and entered into ChatGPT 3.5, while a subset of 229 questions were entered into ChatGPT 4. Responses were recorded, and the accuracy of ChatGPT answers as well as its performance in different test question categories and for different difficulty levels were compared between both versions.

Results: Overall, ChatGPT 4 demonstrated a statistically significant superior performance compared to ChatGPT 3.5, achieving an accuracy of 84.7% (194/229) and 56.9% (1047/1840), respectively. A noteworthy correlation was observed between the length of test questions and the performance of ChatGPT 3.5 (ρ=–0.069; P=.003), which was absent in ChatGPT 4 (P=.87). Additionally, the difficulty of test questions, as categorized by AMBOSS hammer ratings, showed a statistically significant correlation with performance for both ChatGPT versions, with ρ=–0.289 for ChatGPT 3.5 and ρ=–0.344 for ChatGPT 4. ChatGPT 4 surpassed ChatGPT 3.5 in all levels of test question difficulty, except for the 2 highest difficulty tiers (4 and 5 hammers), where statistical significance was not reached.

Conclusions: In this study, ChatGPT 4 demonstrated remarkable proficiency in taking the USMLE Step 3, with an accuracy rate of 84.7% (194/229), outshining ChatGPT 3.5 with an accuracy rate of 56.9% (1047/1840). Although ChatGPT 4 performed exceptionally, it encountered difficulties in questions requiring the application of theoretical concepts, particularly in cardiology and neurology. These insights are pivotal for the development of examination strategies that are resilient to AI and underline the promising role of AI in the realm of medical education and diagnostics.

Alternative Links zum Volltext

Beteiligte Einrichtungen

Medizin > Zentren des Universitätsklinikums Regensburg > Zentrum für Plastische-, Hand- und Wiederherstellungschirurgie
Browse Publikationen

Details

Dokumentenart

Artikel

Titel eines Journals oder einer Zeitschrift

JMIR Medical Education

Verlag:

JMIR

Open Access Art:

Gold (mit APC - bezahlt UR)

Band:

Seitenbereich:

e51148

Datum

5 Januar 2024

Institutionen

Medizin > Zentren des Universitätsklinikums Regensburg > Zentrum für Plastische-, Hand- und Wiederherstellungschirurgie

Identifikationsnummer

Wert	Typ
10.2196/51148	DOI

Stichwörter / Keywords

ChatGPT; United States Medical Licensing Examination; artificial intelligence; USMLE; USMLE Step 1; OpenAI; medical education; clinical decision-making

Dewey-Dezimal-Klassifikation

600 Technik, Medizin, angewandte Wissenschaften > 610 Medizin

Status

Veröffentlicht

Begutachtet

Ja, diese Version wurde begutachtet

An der Universität Regensburg entstanden

URN der UB Regensburg

urn:nbn:de:bvb:355-epub-553526

Dokumenten-ID

55352

Bibliographische Daten exportieren

Nur für Besitzer und Autoren: Kontrollseite des Eintrags

Downloadstatistik

Altmetric

Alternative Statistik (altmetrics)

Weitere Literatur (mittels CORE)

nach oben