| Published version: Download (PDF, 651 kB) | License: Creative Commons Attribution 4.0 International |
Evaluating ChatGPT, Gemini and other Large Language Models (LLMs) in orthopaedic diagnostics: A prospective clinical study
Pagano, Stefano, Strumolo, Luigi, Michalk, Katrin, Schiegl, Julia, Pulido, Loreto C., Reinhard, Jan, Maderbacher, Guenther, Renkawitz, Tobias and Schuster, Marie (2025)
Evaluating ChatGPT, Gemini and other Large Language Models (LLMs) in orthopaedic diagnostics: A prospective clinical study.
Computational and Structural Biotechnology Journal 28, pp. 9-15.
Publication date of this full text: 21 Jan 2025 17:20
Article
DOI for citing this document: 10.5283/epub.74732
Abstract
Background: Large Language Models (LLMs) such as ChatGPT are gaining attention for their potential applications
in healthcare. This study aimed to evaluate the diagnostic sensitivity of various LLMs in detecting hip or
knee osteoarthritis (OA) using only patient-reported data collected via a structured questionnaire, without prior
medical consultation.
Methods: A prospective observational study was conducted at an orthopaedic outpatient clinic specialized in hip
and knee OA treatment. A total of 115 patients completed a paper-based questionnaire covering symptoms,
medical history, and demographic information. The diagnostic performance of several LLMs was analysed: four
versions of ChatGPT, two of Gemini, as well as Llama, Gemma 2, and Mistral-Nemo. Model-generated
diagnoses were compared against those provided by experienced orthopaedic clinicians, which served as the
reference standard.
Results: GPT-4o achieved the highest diagnostic sensitivity at 92.3 %, significantly outperforming other LLMs.
The completeness of patient responses to symptom-related questions was the strongest predictor of accuracy for
GPT-4o (p < 0.001). Inter-model agreement was moderate among GPT-4 versions, whereas models such as
Llama-3.1 demonstrated notably lower accuracy and concordance.
Conclusions: GPT-4o demonstrated high accuracy and consistency in diagnosing OA based solely on patient-reported
questionnaires, underscoring its potential as a supplementary diagnostic tool in clinical settings.
Nevertheless, the reliance on patient-reported data without direct physician involvement highlights the critical
need for medical oversight to ensure diagnostic accuracy. Further research is needed to refine LLM capabilities
and expand their utility in broader diagnostic applications.
Details
| Document type | Article |
|---|---|
| Journal title | Computational and Structural Biotechnology Journal |
| Publisher | Elsevier |
| Volume | 28 |
| Page range | pp. 9-15 |
| Date | 26 December 2025 |
| Institutions | Medicine > Chair of Orthopaedics |
| Identification number | DOI: 10.5283/epub.74732 |
| Keywords | Large Language Models (LLMs), GPT-4o, ChatGPT, Gemini, Llama, Gemma 2, Mistral-Nemo, Hip osteoarthritis, Knee osteoarthritis, Diagnostic sensitivity, Musculoskeletal disorders, Orthopaedic diagnostics, Patient-reported data, Artificial intelligence in healthcare |
| Dewey Decimal Classification | 600 Technology, medicine, applied sciences > 610 Medicine |
| Status | Published |
| Peer-reviewed | Yes, this version has been peer-reviewed |
| Created at the University of Regensburg | Yes |
| URN of the UB Regensburg | urn:nbn:de:bvb:355-epub-747322 |
| Document ID | 74732 |
Download statistics