28 February 2026 : Clinical Research
Do Large Language Models Perform Equally Across Languages? A Comparison of Responses to Frequently Asked Questions in Anesthesiology
Hadi Ufuk Yörükoğlu ABCEF 1*, Can Aksu AE 2, Pervez Sultan AE 3, Serkan Tulgar ABEF 4
DOI: 10.12659/MSM.951815
Med Sci Monit 2026; 32:e951815
Table 6. Metric-Based Evaluations and Statistical Significance. The table shows the P values derived from the comparison of responses provided by ChatGPT and DeepSeek for each of the 10 questions in English and Turkish separately. Statistically significant differences are in bold.
| Question | Accuracy | Comprehensiveness | Safety | |
|---|---|---|---|---|
| 0.564 | 0.951 | 1.000 | ||
| 0.527 | 0.068 | |||
| 0.194 | 0.317 | 0.655 | ||
| 0.705 | 0.129 | 0.705 | ||
| 1.000 | 0.655 | 0.157 | ||
| 0.564 | 0.564 | 1.000 | ||
| 0.739 | 0.655 | 0.317 | ||
| 0.112 | 0.157 | 0.131 | ||
| 1.000 | 0.317 | 0.564 | ||
| 0.129 | 0.279 | |||
| 1.000 | 0.157 | 0.257 | ||
| 1.000 | 0.257 | 1.000 | ||
| 0.480 | 0.180 | 0.102 | ||
| 0.655 | 0.739 | 0.317 | ||
| 0.334 | 0.265 | 0.234 | ||
| 0.167 | 0.705 | 0.084 | ||
| 0.705 | 1.000 | 0.480 | ||
| 0.655 | 0.257 | 0.180 | ||
| 0.317 | 0.317 | 0.317 | ||
| 1.000 | 1.000 | 0.414 | ||
| 0.157 | 0.160 | 0.317 | ||
| 0.70 | 0.655 | 0.096 | ||
| 0.257 | 1.000 | 1.000 | ||
| 0.705 | 0.102 | 1.000 | ||
| 0.705 | 1.000 | 1.000 | ||
| 0.739 | 0.763 | 0.655 | ||
| 0.655 | 0.564 | 0.655 | ||
| 0.480 | 0.564 | 0.129 | ||
| 0.257 | 0.480 | 0.480 | ||
| 0.096 | 0.096 | 0.238 | ||
| 0.234 | 0.564 | 0.317 | ||
| 0.891 | 0.414 | 0.157 | ||
| 0.589 | 0.157 | 0.180 | ||
| 0.705 | 0.655 | 0.564 | ||
| 0.166 | 0.096 | 0.157 | ||
| 0.121 | 0.121 | 0.161 | ||
| 0.739 | 0.109 | 0.414 | ||
| 0.317 | 0.257 | 0.564 | ||
| 0.480 | 0.317 | |||
| 0.429 | 0.257 | 0.655 |






