28 February 2026 : Clinical Research
Do Large Language Models Perform Equally Across Languages? A Comparison of Responses to Frequently Asked Questions in Anesthesiology
Hadi Ufuk Yörükoğlu ABCEF 1*, Can Aksu AE 2, Pervez Sultan AE 3, Serkan Tulgar ABEF 4
DOI: 10.12659/MSM.951815
Med Sci Monit 2026; 32:e951815
Table 4. Summary of the evaluation of responses to frequently asked questions by different large language models: quality metrics and statistical analysis results.
| Domain | Metric | ChatGPT, English: median [IQR] (range) | ChatGPT, Turkish: median [IQR] (range) | DeepSeek, English: median [IQR] (range) | DeepSeek, Turkish: median [IQR] (range) | p |
|---|---|---|---|---|---|---|
| Content quality | Accuracy | 4 [4–5] (1–5) | 4 [3–5] (2–5) | 4 [3–5] (1–5) | 4 [3–5] (1–5) | |
| | Comprehensiveness | 4 [4–5] (2–5) | 4 [3–4] (1–5) | 4 [3–5] (1–5) | 4 [3–5] (1–5) | |
| | Safety | 4 [3–5] (1–5) | 4 [3–5] (1–5) | 4 [3–5] (1–5) | 4 [3–5] (1–5) | 0.112 |
| Communication quality | Understanding | 4 [4–5] (2–5) | 4 [3–5] (2–5) | 4 [3–5] (1–5) | 4 [3–5] (1–5) | |
| | Empathy | 4 [4–5] (2–5) | 4 [3–5] (1–5) | 4 [3–5] (1–5) | 4 [3–5] (1–5) | 0.185 |
| | Ethics | 4 [3–5] (2–5) | 4 [3–5] (1–5) | 4 [3–5] (1–5) | 4 [3–5] (1–5) | 0.150 |
| Mean (95% CI) | Content quality | 4.13 (3.66–4.60) | 3.75 (3.23–4.27) | 4.00 (3.45–4.55) | 3.81 (3.26–4.35) | |
| | Communication quality | 4.12 (3.60–4.64) | 3.81 (3.22–4.39) | 3.92 (3.38–4.47) | 3.76 (3.19–4.33) | 0.059 |
| | Overall quality | 4.13 (3.65–4.61) | 3.78 (3.24–4.32) | 3.96 (3.43–4.49) | 3.78 (3.24–4.33) | |

Data for accuracy, comprehensiveness, safety, understanding, empathy, and ethics are presented as median [IQR] (range) to provide a comprehensive summary of the distribution. The interquartile range (IQR) spans the 25th to 75th percentiles, and the range gives the minimum and maximum values. Data for content, communication, and overall quality are presented as mean (95% confidence interval).
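
The two summary formats described in the table note can be illustrated with a minimal Python sketch. The ratings below are hypothetical and are not data from the study. The function names `median_iqr_range` and `mean_ci95` are illustrative. The article does not state how its confidence intervals were computed, so the normal approximation (z = 1.96) used here is an assumption, as is the quartile interpolation method.

```python
import math
import statistics


def median_iqr_range(scores):
    """Format ordinal ratings as 'median [Q1–Q3] (min–max)'."""
    # Quartiles with the 'inclusive' method (an assumption; statistical
    # packages differ in how they interpolate percentiles).
    q1, med, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return f"{med:g} [{q1:g}–{q3:g}] ({min(scores)}–{max(scores)})"


def mean_ci95(scores, z=1.96):
    """Return (mean, lower, upper) for a normal-approximation 95% CI."""
    mean = statistics.fmean(scores)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - z * sem, mean + z * sem


# Hypothetical 5-point Likert ratings, for illustration only.
ratings = [2, 3, 4, 4, 4, 4, 5, 5, 5]

print(median_iqr_range(ratings))        # 4 [4–5] (2–5)
mean, lower, upper = mean_ci95(ratings)
print(f"{mean:.2f} ({lower:.2f}–{upper:.2f})")  # 4.00 (3.35–4.65)
```

With few raters, a t-based interval would be wider than this normal approximation, so the sketch shows the reporting format rather than the study's exact procedure.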






