Do Large Language Models Perform Equally Across Languages? A Comparison of Responses to Frequently Asked Questions in Anesthesiology

Hadi Ufuk Yörükoğlu; Can Aksu; Pervez Sultan; Serkan Tulgar

doi:10.12659/MSM.951815

28 February 2026 : Clinical Research

Med Sci Monit 2026; 32:e951815

Table 6. Metric-Based Evaluations and Statistical Significance. The table shows the P values from comparing the responses of ChatGPT and DeepSeek to each of the 10 questions, separately for English and Turkish. Statistically significant differences are in bold.

Accuracy	Comprehensiveness	Safety
0.564	0.951	1.000
	0.527	0.068
0.194	0.317	0.655
0.705	0.129	0.705
1.000	0.655	0.157
0.564	0.564	1.000
0.739	0.655	0.317
0.112	0.157	0.131
1.000	0.317	0.564
0.129		0.279
1.000	0.157	0.257
1.000	0.257	1.000
0.480	0.180	0.102
0.655	0.739	0.317
0.334	0.265	0.234
0.167	0.705	0.084
0.705	1.000	0.480
0.655	0.257	0.180
0.317	0.317	0.317
1.000	1.000	0.414
0.157	0.160	0.317
0.70	0.655	0.096
0.257	1.000	1.000
0.705	0.102	1.000
0.705	1.000	1.000
0.739	0.763	0.655
0.655	0.564	0.655
0.480	0.564	0.129
0.257	0.480	0.480
0.096	0.096	0.238
0.234	0.564	0.317
0.891	0.414	0.157
0.589	0.157	0.180
0.705	0.655	0.564
0.166	0.096	0.157
0.121	0.121	0.161
0.739	0.109	0.414
0.317	0.257	0.564
	0.480	0.317
0.429	0.257	0.655
