Large language models propagate race-based medicine

Cited by: 112
Authors
Omiye J.A. [1,2]
Lester J.C. [3]
Spichak S. [4]
Rotemberg V. [5]
Daneshjou R. [1,2]
Affiliations
[1] Department of Dermatology, Stanford School of Medicine, Stanford, CA
[2] Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA
[3] Department of Dermatology, University of California San Francisco, San Francisco, CA
[4] Independent Researcher, Toronto, ON
[5] Dermatology Service, Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, NY
Funding
National Institutes of Health (US)
DOI
10.1038/s41746-023-00939-z
Abstract
Large language models (LLMs) are being integrated into healthcare systems, but these models may recapitulate harmful, race-based medicine. The objective of this study is to assess whether four commercially available LLMs propagate harmful, inaccurate, race-based content when responding to eight different scenarios that check for race-based medicine or widespread misconceptions around race. Questions were derived from discussions among four physician experts and from prior work on race-based medical misconceptions believed by medical trainees. We assessed the four LLMs with nine different questions, each interrogated five times, for a total of 45 responses per model. All models had examples of perpetuating race-based medicine in their responses, and models were not always consistent in their responses when asked the same question repeatedly. LLMs are being proposed for use in the healthcare setting, with some models already connecting to electronic health record systems. However, our findings show that these LLMs could potentially cause harm by perpetuating debunked, racist ideas. © 2023, Springer Nature Limited.
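The evaluation protocol summarized above (nine questions, each asked five times per model, yielding 45 responses per model) amounts to a simple repeated-query loop. The sketch below is a minimal illustration under stated assumptions, not the study's actual harness: query_model is a hypothetical wrapper around whichever chat API each vendor exposes, and the model names and question strings are placeholders rather than the study's prompts.

```python
# Minimal sketch of the repeated-query protocol described in the abstract.
# Assumptions: query_model() is a hypothetical vendor-API wrapper; the model
# names and questions below are placeholders, not the study's actual prompts.
from collections import defaultdict

MODELS = ["model-a", "model-b", "model-c", "model-d"]  # four commercial LLMs
QUESTIONS = [f"question-{i}" for i in range(1, 10)]    # nine probes for race-based medicine
RUNS_PER_QUESTION = 5                                  # 9 questions x 5 runs = 45 responses per model


def query_model(model_name: str, prompt: str) -> str:
    """Hypothetical wrapper; replace with the vendor-specific chat API call."""
    raise NotImplementedError


def collect_responses() -> dict:
    """Gather all responses, keyed by (model, question), for later manual review."""
    responses = defaultdict(list)
    for model in MODELS:
        for question in QUESTIONS:
            for _ in range(RUNS_PER_QUESTION):
                responses[(model, question)].append(query_model(model, question))
    return responses
```

Repeating each question five times is what allows the inconsistency reported in the abstract to surface: responses to the same prompt can be compared across runs as well as across models.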