Abstract
Introduction/background:
Spinal cord injuries (SCI) present complex challenges for patients, who increasingly turn to online resources for supplementary information. Large language models (LLMs) like ChatGPT and Google Gemini have emerged as potential tools for patient education. However, concerns about the accuracy, clarity, and comprehensiveness of their responses remain, particularly in specialized fields such as SCI. This study aimed to evaluate the performance of ChatGPT 4, ChatGPT 3.5, and Google Gemini in addressing common patient questions about SCI.
Material and methods:
A systematic process was used to identify 10 key patient questions related to SCI from online sources, PubMed, and Google Trends. These questions were submitted to ChatGPT 4, ChatGPT 3.5, and Google Gemini using a standardized prompt and a 150-word response cap to elicit expert-like responses. Eight blinded spine surgeons evaluated the chatbot-generated answers for quality, clarity, empathy, and comprehensiveness using a validated rating system. Responses were categorized as “excellent,” “satisfactory with minimal clarification,” “satisfactory with moderate clarification,” or “unsatisfactory.”
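The abstract does not disclose the exact prompt wording or query tooling. As a minimal Python sketch, assuming the OpenAI API is used for the ChatGPT models, a standardized single-turn query with a 150-word cap might look as follows (the prompt text, model identifier, and helper function are illustrative assumptions, not the study's actual materials):

# Illustrative sketch only: prompt wording, model identifier, and tooling
# are assumptions; the abstract does not specify the study's materials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical standardized prompt enforcing the 150-word response cap
SYSTEM_PROMPT = (
    "You are a spine surgeon answering a patient's question about "
    "spinal cord injury (SCI). Answer in no more than 150 words."
)

def ask_model(question: str, model: str = "gpt-4") -> str:
    # Single-turn query: one system prompt plus one user question, no history
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask_model("What are the long-term effects of a spinal cord injury?"))

A single-turn design like this keeps each question independent of the others, which matches the standardized, non-conversational setting described above.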
Results:
Across all three models, the majority of responses were rated as either excellent or requiring only minimal clarification. ChatGPT 4 achieved the highest proportion of high-quality responses, with nearly 90% rated as “excellent” or “minimal clarification required.” ChatGPT 3.5 and Google Gemini performed similarly, with slightly lower proportions of high-quality responses. No statistically significant differences in overall performance were observed between the models.
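The abstract does not name the statistical test used to compare the models. As a sketch under that caveat, a chi-square test of independence on the rating contingency table is one plausible approach; the counts below are hypothetical placeholders, not study data:

# Illustrative sketch: the counts are hypothetical, and the abstract does
# not state which test the study actually used.
from scipy.stats import chi2_contingency

# Rows: ChatGPT 4, ChatGPT 3.5, Google Gemini
# Columns: excellent, minimal, moderate, unsatisfactory
# (8 raters x 10 questions = 80 ratings per model)
ratings = [
    [52, 18, 8, 2],   # hypothetical ChatGPT 4 counts
    [45, 22, 10, 3],  # hypothetical ChatGPT 3.5 counts
    [44, 23, 10, 3],  # hypothetical Google Gemini counts
]

chi2, p, dof, expected = chi2_contingency(ratings)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")  # p > 0.05: no significant difference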
Conclusion:
In a standardized, single-turn, 150-word setting, publicly available LLMs produced largely satisfactory answers to common SCI questions, with comparable performance across models. LLMs can be recommended as adjuncts for general patient education, provided their outputs are reviewed within clinical care. Further studies should test multi-turn interactions, include patient and multidisciplinary evaluators, compare chatbot responses with clinician-authored answers, and evaluate the performance of domain-specific medical LLMs.