The Performance of GPT-3.5, GPT-4, and Bard on the Japanese National Dentist Examination: A Comparison Study

Cited by: 20

Authors:

Ohta, Keiichi [1]

Ohta, Satomi [2]

Affiliations:

[1] Kobe Univ, Sch Med, Kobe, Japan

[2] Dent, Kobe, Japan

Source:

CUREUS JOURNAL OF MEDICAL SCIENCE | 2023 / Vol. 15 / Issue 12

Keywords:

japan; national dentist examination; artificial intelligence in dentistry; google bard; chatgpt-3.5; chatgpt-4

DOI:

10.7759/cureus.50369

Chinese Library Classification (CLC) number:

R5 [Internal Medicine]

Discipline classification code:

100201 [Internal Medicine]

Abstract:

Purpose: This study aims to evaluate the performance of three large language models (LLMs), the Generative Pre-trained Transformer (GPT)-3.5, GPT-4, and Google Bard, on the 2023 Japanese National Dentist Examination (JNDE) and assess their potential clinical applications in Japan.

Methods: A total of 185 questions from the 2023 JNDE were used. These questions were categorized by question type and category. McNemar's test compared the correct response rates between two LLMs, while Fisher's exact test evaluated the performance of LLMs in each question category.

Results: The overall correct response rates were 73.5% for GPT-4, 66.5% for Bard, and 51.9% for GPT-3.5. GPT-4 showed a significantly higher correct response rate than Bard and GPT-3.5. In the category of essential questions, Bard achieved a correct response rate of 80.5%, surpassing the passing criterion of 80%. In contrast, both GPT-4 and GPT-3.5 fell short of this benchmark, with GPT-4 attaining 77.6% and GPT-3.5 only 52.5%. The scores of GPT-4 and Bard were significantly higher than that of GPT-3.5 (p<0.01). For general questions, the correct response rates were 71.2% for GPT-4, 58.5% for Bard, and 52.5% for GPT-3.5. GPT-4 outperformed GPT-3.5 and Bard (p<0.01). The correct response rates for professional dental questions were 51.6% for GPT-4, 45.3% for Bard, and 35.9% for GPT-3.5. The differences among the models were not statistically significant. All LLMs demonstrated significantly lower accuracy for dentistry questions compared to other types of questions (p<0.01).

Conclusions: GPT-4 achieved the highest overall score in the JNDE, followed by Bard and GPT-3.5. However, only Bard surpassed the passing score for essential questions. To further understand the application of LLMs in clinical dentistry worldwide, more research on their performance in dental examinations across different languages is required.
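The abstract names McNemar's test for comparing two models on the same questions. As a rough illustration of how such a paired comparison works, the sketch below computes a two-sided exact McNemar p-value from the discordant pairs (questions that exactly one model answered correctly). The function names `discordant_counts` and `mcnemar_exact` and the per-question results are hypothetical and chosen only for illustration; they are not data or code from the study, and the authors may have used a different variant of the test (for example, the chi-square approximation).

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(a: Sequence[bool], b: Sequence[bool]) -> Tuple[int, int]:
    """Count questions where exactly one of two models answered correctly."""
    if len(a) != len(b):
        raise ValueError("both models must be scored on the same questions")
    only_a = sum(1 for x, y in zip(a, b) if x and not y)
    only_b = sum(1 for x, y in zip(a, b) if y and not x)
    return only_a, only_b


def mcnemar_exact(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each discordant question is equally likely
    to favor either model, so the smaller count follows Binomial(n, 0.5).
    """
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical per-question results (True = correct); not data from the study.
model_a = [True, True, True, False, True, True, False, True]
model_b = [True, False, False, False, True, False, False, True]
only_a, only_b = discordant_counts(model_a, model_b)
p_value = mcnemar_exact(only_a, only_b)
print(only_a, only_b, p_value)
```

Only the discordant questions enter the test; questions that both models got right or both got wrong carry no information about which model is better.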


Pages: 6
