Do Large Language Models Perform Equally Across Languages? A Comparison of Responses to Frequently Asked Questions in Anesthesiology

Hadi Ufuk Yörükoğlu; Can Aksu; Pervez Sultan; Serkan Tulgar

doi:10.12659/MSM.951815

28 February 2026 : Clinical Research

Hadi Ufuk Yörükoğlu^{ABCEF 1*}, Can Aksu^{AE 2}, Pervez Sultan^{AE 3}, Serkan Tulgar^{ABEF 4}

Med Sci Monit 2026; 32:e951815

Table 4. Summary of the evaluation of responses to frequently asked questions by different large language models: quality metrics and statistical analysis results.

| Domain | Item | ChatGPT English, median [IQR] (range) | ChatGPT Turkish, median [IQR] (range) | DeepSeek English, median [IQR] (range) | DeepSeek Turkish, median [IQR] (range) | p |
|---|---|---|---|---|---|---|
| Content quality | Accuracy | 4 [4–5] (1–5) | 4 [3–5] (2–5) | 4 [3–5] (1–5) | 4 [3–5] (1–5) | |
| | Comprehensiveness | 4 [4–5] (2–5) | 4 [3–4] (1–5) | 4 [3–5] (1–5) | 4 [3–5] (1–5) | |
| | Safety | 4 [3–5] (1–5) | 4 [3–5] (1–5) | 4 [3–5] (1–5) | 4 [3–5] (1–5) | 0.112 |
| Communication quality | Understanding | 4 [4–5] (2–5) | 4 [3–5] (2–5) | 4 [3–5] (1–5) | 4 [3–5] (1–5) | |
| | Empathy | 4 [4–5] (2–5) | 4 [3–5] (1–5) | 4 [3–5] (1–5) | 4 [3–5] (1–5) | 0.185 |
| | Ethics | 4 [3–5] (2–5) | 4 [3–5] (1–5) | 4 [3–5] (1–5) | 4 [3–5] (1–5) | 0.150 |
| Content quality | Mean (95% CI) | 4.13 (3.66–4.60) | 3.75 (3.23–4.27) | 4.00 (3.45–4.55) | 3.81 (3.26–4.35) | |
| Communication quality | Mean (95% CI) | 4.12 (3.60–4.64) | 3.81 (3.22–4.39) | 3.92 (3.38–4.47) | 3.76 (3.19–4.33) | 0.059 |
| Overall quality | Mean (95% CI) | 4.13 (3.65–4.61) | 3.78 (3.24–4.32) | 3.96 (3.43–4.49) | 3.78 (3.24–4.33) | |

Data for accuracy, comprehensiveness, safety, understanding, empathy, and ethics are presented as median [IQR] (range) to provide a comprehensive summary of the distribution. The interquartile range (IQR) is the range between the 25th and 75th percentiles, and the range gives the minimum and maximum values. Data for content, communication, and overall quality are presented as mean (95% confidence interval).
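The two summary formats described in the table note can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names and the sample ratings are hypothetical, the quartiles use the "inclusive" interpolation method (statistical packages differ on this), and the 95% confidence interval uses a normal approximation, which is an assumption because the table note does not state how the interval was computed.

```python
from math import sqrt
from statistics import NormalDist, mean, median, quantiles, stdev


def median_iqr_range(scores):
    """Return (median, (Q1, Q3), (min, max)) for a list of ratings."""
    # Quartile interpolation is an assumption; other methods can give
    # slightly different Q1/Q3 values on small samples.
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores), (q1, q3), (min(scores), max(scores))


def mean_ci95(scores):
    """Return (mean, (lower, upper)) using a normal-approximation 95% CI."""
    m = mean(scores)
    z = NormalDist().inv_cdf(0.975)  # about 1.96
    half_width = z * stdev(scores) / sqrt(len(scores))
    return m, (m - half_width, m + half_width)


# Hypothetical 1-5 Likert ratings, not data from the study.
ratings = [4, 5, 3, 4, 5, 4, 2, 5, 4, 3]

med, (q1, q3), (lo, hi) = median_iqr_range(ratings)
print(f"median [IQR] (range): {med} [{q1}-{q3}] ({lo}-{hi})")

avg, (ci_lo, ci_hi) = mean_ci95(ratings)
print(f"mean (95% CI): {avg:.2f} ({ci_lo:.2f}-{ci_hi:.2f})")
```

With small samples, a t-based interval would be wider than the normal approximation used here.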