28 February 2026: Database Analysis
Do Large Language Models Perform Equally Across Languages? A Comparison of Responses to Frequently Asked Questions in Anesthesiology
Hadi Ufuk Yörükoğlu ABCEF 1*, Can Aksu AE 2, Pervez Sultan AE 3, Serkan Tulgar ABEF 4
DOI: 10.12659/MSM.951815
Med Sci Monit 2026; 32:e951815
Abstract
BACKGROUND: With the increasing use of large language model (LLM) chatbots in healthcare, evaluating their ability to provide reliable and understandable information in multiple languages is critical, particularly in fields such as anesthesia, where patient education is essential. The study primarily aimed to compare the quality of ChatGPT 4.0’s and DeepSeek V3’s English responses, with secondary aims to evaluate content and communication differences between English and Turkish responses.
MATERIAL AND METHODS: Anesthesiologists proficient in both languages were recruited as experts. Ten frequently asked questions in anesthesia were selected and translated for evaluation. Responses from ChatGPT 4.0 and DeepSeek V3 in both English and Turkish were assessed for overall quality and content quality (accuracy, comprehensiveness, and safety) and communication quality (understanding, empathy/tone, and ethics), and Turkish and English responses were compared by the evaluators.
RESULTS: Eleven experts evaluated the responses. The English responses of ChatGPT 4.0 were superior to the English responses of DeepSeek V3 in overall quality (P<0.001). The English responses of ChatGPT 4.0 were superior to its Turkish responses in terms of overall, content, and communication quality (P<0.001 each), and the English responses of DeepSeek V3 were superior to its Turkish responses in terms of overall (P<0.001), content (P<0.001), and communication (P=0.001) quality.
CONCLUSIONS: ChatGPT 4.0 performed better than DeepSeek V3 in the English language in terms of overall quality of responses to 10 frequently asked questions in the field of anesthesia and the English responses provided by ChatGPT 4.0 and DeepSeek V3 outperformed the Turkish responses.
Keywords: Anesthesia, Artificial Intelligence, Surveys and Questionnaires, Anesthesiology, Natural Language Processing, Multilingualism, Patient Education as Topic, Quality of Health Care, Comparative Study
Introduction
Anesthesia is administered in diverse settings, from light sedation to general and regional techniques [1–3]. Anesthesia can be a source of anxiety [4]. Therefore, patients are increasingly seeking information from online sources; however, these sources can be misleading and can increase anxiety further [5].
Since the introduction of ChatGPT, various large language model (LLM) chatbots have been introduced and integrated into many areas of life, including medicine and the subspecialty of anesthesia [6]. DeepSeek is a newer LLM chatbot [7] that differs in its working principles and has attracted significant attention and quickly gained popularity since its release because of the quality of its responses [8]. While studies have evaluated the responses of different LLM chatbots to anesthesia-related frequently asked questions (FAQs) in terms of scientific content and communication quality [9–13], none has compared the responses of ChatGPT with those of DeepSeek. Several studies have demonstrated superior responses from LLMs when composed in English compared to Spanish or Japanese [14,15]. However, no studies have compared LLM responses in Turkish, a language spoken by 90 million people worldwide, with responses in English.
We compared the responses of ChatGPT 4.0 and DeepSeek V3 LLM chatbots to 10 FAQs related to anesthesia in English and Turkish. The primary aim of the study was to compare the overall quality of ChatGPT’s and DeepSeek’s responses in English. Secondary aims were to compare the content and communication of the answers between English and Turkish responses, and to show the internal correlation of the responses.
Material and Methods
STUDY DESIGN:
Rather than independently collecting new questions, we adopted the 10 anesthesia FAQs previously defined by Nguyen et al [16], which were originally obtained from the patient FAQ pages of the top 10 US anesthesia programs ranked by US News [17]. These selected questions comprehensively address topics related to preoperative evaluation, anesthesia practices, and postoperative care. Two authors (ST and HUY) translated the questions into Turkish, and the other authors then confirmed the equivalence of the English and Turkish versions. The questions (Tables 1, 2) were then put to ChatGPT 4.0 and DeepSeek V3 separately in both English and Turkish. DeepSeek V3 was free at that time (March 2025), whereas the paid version of ChatGPT 4.0 was used. A new session was used for each question and each language/model combination to avoid contamination by prior context. As a result, the influence of longer conversation history or memory should be minimal in our design.
PARTICIPANTS:
The evaluator panel was established in collaboration with the Turkish Regional Anesthesia Society (RAD). On July 31, 2024, an email invitation was sent to all 739 members of the society, requesting them to complete a survey designed to assess their qualifications for comparing English responses with Turkish responses. Participants were given a 15-day period to respond to the survey. The qualifications to participate included proficiency in English language examinations, professional experience as anesthesiologists in English-speaking countries, roles as examiners in anesthesia board examinations conducted in English, successful completion of anesthesia board examinations in English, and experience presenting at international conferences in English.
From the 30 respondents, 20 met the criteria for inclusion and were selected as evaluators. To avoid potential bias, this study did not include evaluators as authors. This decision was made to ensure that the evaluators could assess the responses independently, maintaining the objectivity of the evaluation process.
MEASUREMENTS:
The primary aim of this study was to compare the overall quality of responses provided by ChatGPT 4.0 and DeepSeek V3 to FAQs in the field of anesthesia in both English and Turkish. The secondary aims were to compare the responses between languages in terms of content quality (accuracy, comprehensiveness, and safety) and communication quality (understanding, empathy/tone, and ethics), and to determine which language version participants preferred [14,16].
A survey was created using Google Forms. For each response provided by the LLMs, the English version was presented first, followed by the Turkish version. The answers to the questions were presented in a randomized order to enhance the blindness of the evaluation process.
Each response was assessed using a 5-point Likert scale (strongly disagree, somewhat disagree, neither disagree nor agree, somewhat agree, strongly agree) based on 6 evaluation criteria: 3 related to content quality – accuracy, comprehensiveness, and safety – and 3 related to communication quality – clarity of understanding, tone/empathy, and ethical considerations. In addition, the English and Turkish responses to each question were compared using the 5-point Likert scale (Turkish answer is much better, Turkish answer is slightly better, no difference, English answer is slightly better, English answer is much better). Evaluators were asked to rate each domain based on their clinical expertise and judgement; however, no formal written definitions of the quality domains were provided. This may have contributed to variability in scoring. An email was sent to evaluators on March 26. A reminder was sent to those who had not responded on April 2.
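The scoring scheme described above can be sketched in code. This is a minimal illustration, not the authors' procedure: it assumes the agreement labels are coded 1 to 5 (the study refers to "Likert scale 4–5" for satisfactory ratings but does not state its coding), and the names AGREEMENT_SCALE, encode, and median_iqr are hypothetical. The quartile convention used here may differ from the one SPSS applies.

```python
from statistics import median, quantiles

# Assumed coding: 1 = strongly disagree ... 5 = strongly agree.
AGREEMENT_SCALE = {
    "strongly disagree": 1,
    "somewhat disagree": 2,
    "neither disagree nor agree": 3,
    "somewhat agree": 4,
    "strongly agree": 5,
}


def encode(labels, scale=AGREEMENT_SCALE):
    """Convert Likert labels to integer scores (case-insensitive)."""
    return [scale[label.strip().lower()] for label in labels]


def median_iqr(scores):
    """Return the median and the (Q1, Q3) pair for a list of scores.

    Uses the 'inclusive' quartile method; other software may use a
    different convention and give slightly different quartiles.
    """
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores), (q1, q3)


if __name__ == "__main__":
    # Hypothetical ratings of one response on one criterion.
    ratings = encode(["Strongly agree", "Somewhat agree", "Somewhat disagree"])
    print(ratings, median_iqr(ratings))
```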
SAMPLE SIZE AND STATISTICAL ANALYSIS:
Most studies assessing LLM chatbot responses to medical-related inquiries have used a panel of 5 to 8 experts [14–16]. Considering the previous studies, we determined that a minimum of 5 experts would be essential. However, in our study, which focused on evaluating the responses that were generated by LLMs for the FAQs in different languages, we opted to involve at least 8 evaluators, with a preference for a significantly larger cohort.
All statistical analyses were performed using SPSS version 26.0 (SPSS, Chicago, IL, USA). The normality of the data distribution was evaluated using the Shapiro-Wilk test. Subsequently, descriptive statistics were calculated, including the median (inter-quartile range [IQR]) and minimum–maximum or mean (95% confidence interval) as deemed appropriate. The Wilcoxon test was used to compare the evaluation results between the 2 groups. The correlation among the evaluation results provided by the LLMs was analyzed through Spearman’s correlation test. A significance level of P<0.05 was set to determine if there was a statistically significant difference. The agreement among the 11 evaluators was assessed using a two-way random effects intraclass correlation coefficient (ICC[2,1]) for absolute agreement. We also calculated ICC(2,k) to estimate the reliability of the mean rating across all evaluators.
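For readers unfamiliar with the two agreement statistics named above, the following is a minimal sketch of how ICC(2,1) and ICC(2,k) can be derived from the mean squares of a two-way layout (rated items in rows, evaluators in columns). The analyses in this study were run in SPSS; the function name icc2 and the toy ratings are assumptions made for illustration only.

```python
from statistics import mean


def icc2(ratings):
    """Two-way random effects, absolute-agreement ICC.

    ratings: list of rows, one row per rated item and one column per
             evaluator, with no missing values.
    Returns a tuple (ICC(2,1), ICC(2,k)).
    """
    n = len(ratings)      # number of rated items
    k = len(ratings[0])   # number of evaluators
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Sums of squares for items (rows), evaluators (columns), and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average


if __name__ == "__main__":
    # Hypothetical Likert scores: 4 responses rated by 3 evaluators.
    toy = [[5, 4, 4], [3, 3, 2], [4, 5, 4], [2, 3, 3]]
    print(icc2(toy))
```

The single-rater value describes how well one evaluator's score would generalize, whereas the average-rater value describes the reliability of the panel mean, which is why the two can differ widely when individual evaluators disagree.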
Results
Out of the 20 individuals contacted, 11 participated in our survey. All 11 who agreed to participate completed all scoring items for each LLM and FAQ. Two of the evaluators had 1–5 years of experience, 5 had 11–15 years of experience, and 4 had over 16 years of experience as an anesthesia specialist. One of the evaluators was working as an anesthesia specialist in the United States and 2 in England. Five evaluators working in Turkey had board certificates with English language exams, while 2 were examiners in more than 1 board exam conducted in English. All the evaluators had duties in congresses and meetings where the language of presentation was English. All of the evaluators had advanced English proficiency and had attended courses organized by Turkish or European anesthesia associations. Evaluators participated on a voluntary basis and did not receive any financial or material compensation.
A comparison of the overall quality of responses provided by ChatGPT 4.0 and DeepSeek V3 in both English and Turkish is presented in Table 3. English responses of ChatGPT 4.0 were superior to the English responses of DeepSeek V3, but there was no difference between the Turkish responses of ChatGPT 4.0 and DeepSeek V3.
Table 4 shows a summary of the evaluation of responses. The English responses of ChatGPT 4.0 outperformed the Turkish responses of ChatGPT 4.0 across all parameters. The English responses of DeepSeek were also superior in terms of content quality and overall quality. However, there was no difference in communication quality. No significant differences were observed in the safety, empathy, and ethics parameters.
Results of expert comparisons between English and Turkish responses of ChatGPT 4.0 and DeepSeek V3 are provided in Table 5. Table 6 shows the metric-based evaluations and statistical significance of responses provided by ChatGPT and DeepSeek for each of the 10 questions in English and Turkish separately. The results indicate that overall, the English responses were superior to the Turkish responses of ChatGPT 4.0 and DeepSeek V3 except for the 5th question.
Table 7 shows the proportion of responses rated as acceptable and satisfactory (Likert scale 4–5) across the English and Turkish responses of ChatGPT 4.0 and DeepSeek V3. ChatGPT 4.0 and DeepSeek V3 had rates over 70% in the English responses, but the English responses of ChatGPT 4.0 had the highest satisfaction rates. In contrast, satisfaction rates were below 70% for the Turkish responses. The single-rater ICC (2,1) for overall quality scores was 0.03, indicating substantial variability between individual evaluators. The reliability of the mean score across 11 evaluators (ICC[2,k]) was 0.23. Correlation analysis was performed to evaluate the relationships among the responses of the different AI models – English ChatGPT 4.0, Turkish ChatGPT 4.0, English DeepSeek V3, and Turkish DeepSeek V3 – based on scores assigned using a Likert scale (Figure 1).
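Two of the quantities reported above follow from simple arithmetic, sketched here under stated assumptions. The helper names satisfactory_rate and average_rater_icc are hypothetical; the first assumes scores coded 1 to 5 with 4 and 5 counted as satisfactory, and the second applies the Spearman-Brown step-up, which links ICC(2,1) and ICC(2,k) when both come from the same mean squares.

```python
def satisfactory_rate(scores, threshold=4):
    """Percentage of Likert scores (1-5) at or above the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    hits = sum(1 for s in scores if s >= threshold)
    return 100.0 * hits / len(scores)


def average_rater_icc(single_icc, k):
    """Spearman-Brown step-up: ICC(2,k) implied by ICC(2,1) and k raters."""
    return k * single_icc / (1 + (k - 1) * single_icc)


if __name__ == "__main__":
    # Hypothetical scores, not study data.
    print(satisfactory_rate([5, 4, 3, 2, 4]))
    # Rounded single-rater ICC from the text, 11 evaluators.
    print(round(average_rater_icc(0.03, 11), 3))
```

With the rounded single-rater value of 0.03, the step-up gives about 0.25 rather than the reported 0.23. That gap is consistent with rounding, since an unrounded single-rater value near 0.027 would yield about 0.23.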
Discussion
Our study found that English responses from both chatbots were superior to Turkish ones. The English responses of both ChatGPT 4.0 and DeepSeek V3 were satisfactory (77.8% and 70.6% respectively). However, the English version of ChatGPT 4.0 was superior to the English version of DeepSeek V3. There was no significant difference between the Turkish responses.
There are few studies comparing LLM chatbot responses to anesthesia questions [9–11,16]. Lee et al [12] asked questions about labor epidurals and showed that ChatGPT and Bard have similar accuracy, although Bard gave longer answers. However, while ChatGPT’s responses were at a university level, Bard’s responses were at a high school level and were more readable. In the study of Nguyen et al [16], ChatGPT and Bard were comparable, and both outperformed Bing Chat. Since we aimed to compare responses in different languages, we decided to use ChatGPT for comparison with DeepSeek, which had not been compared before, because ChatGPT was superior to other LLMs at the time the study was conducted [16]. Furthermore, ChatGPT’s responses had previously been compared across languages, and ChatGPT is more widely used in Turkey [14,15].
Considering the English responses, we observed that both ChatGPT 4.0 and DeepSeek V3 provided satisfactory and consistent answers (77.8% and 70.6% satisfaction rates, respectively). However, ChatGPT 4.0’s responses outperformed DeepSeek V3’s responses.
Although there are various evaluation criteria, we found the parameters defined by Nguyen et al [16] to be simple yet comprehensive; therefore, we used the same parameters. As there is no previous study evaluating the Turkish responses of LLM chatbots in the field of anesthesia, we chose to focus on FAQs in anesthesia to ensure broader coverage rather than narrowing down to a specific topic. Moreover, the questions used in Nguyen’s study already cover a large portion of anesthesia practices, so we used the same set of questions in our study.
Our findings also highlight the need for continued development of LLMs in under-represented languages. Improving Turkish-language performance will require larger and more diverse non-English training datasets, as well as model refinement tailored to regional clinical communication needs. Future work should explore language-specific fine-tuning and medical specialty-oriented LLMs to ensure more equitable, high-quality AI-generated information across different linguistic populations. Language diversity is an important factor in the field of medicine, and patients should be able to receive the same quality of medical information and be able to ask their questions with equal clarity, regardless of their ethnicity or native language [18]. With globalization, these concerns have increased [19]. Therefore, LLM chatbot models can help bridge communication gaps and may play a role in addressing these concerns. However, studies investigating the responses of LLM chatbots to anesthesia-related questions in different languages remain limited. In a study by Ando et al [14], which used the same 10 questions as our study, ChatGPT’s responses in English and Japanese were compared, and English responses were found to be superior. However, that study used the ChatGPT-3.5 version, as the authors considered the free version more accessible. In contrast, we used the paid version, ChatGPT 4.0, as we believed the more advanced model might perform better in another language. Nonetheless, the English responses generated by ChatGPT 4.0 were still superior to the Turkish responses. Likewise, in a study by Gonzalez Fiol et al [15], ChatGPT’s responses in English were found to be superior to those in Spanish when answering commonly asked patient questions about labor epidurals. Similarly, DeepSeek’s responses in English demonstrated superior performance in terms of accuracy, comprehensiveness, safety, and overall understanding.
One possible explanation for the higher quality of English responses is the predominance of English-language biomedical sources in the training data of most LLMs. When generating Turkish answers, these models may internally retrieve information primarily from English texts and then translate or reformulate it into Turkish. This process may reduce clarity or introduce unnatural phrasing, which could negatively affect perceived quality. Nonetheless, the responses were free from unsafe or inaccurate recommendations.
This study had some limitations. Since the 4 different responses to the 10 selected questions were already lengthy, we did not include English and Turkish responses from LLM chatbots other than ChatGPT and DeepSeek, to avoid making the evaluation process overly complex and exhausting, which might have reduced response rates. The evaluation of responses from other LLM chatbots in different languages could be the subject of future studies.
Although the answers were not presented in the same order in the evaluation form, the presence of emojis in ChatGPT’s responses may have compromised this randomization. Additionally, the English response was always shown before the Turkish response. However, reviewers were able to go back and change their scores. The low inter-rater reliability observed in our study indicates substantial variability in how individual experts scored the LLM-generated responses. This level of heterogeneity is expected in subjective quality assessments, but it also suggests that future studies may benefit from providing standardized domain definitions, calibration sessions for evaluators, or using larger evaluator panels to improve reliability. Another limitation is that response word counts were not extracted, preventing us from analyzing whether longer answers were subconsciously perceived as higher quality. Lastly, the content of the responses to the fifth question, which addressed medicolegal issues, may have been influenced by differences in national laws across countries. However, when examining the evaluations of the remaining responses, we do not believe that this issue significantly affected the overall results.
Our findings have several practical implications. First, the consistently higher scores of English responses suggest that patients and clinicians who can access English language explanations may benefit from more accurate, complete, and comprehensible LLM-generated information. In contrast, the lower performance in Turkish highlights a risk of unequal access to high-quality educational content for non-English speakers. This gap underscores the need for further optimization of LLMs in under-represented languages and for targeted improvements in domain-specific Turkish medical language processing.
Conclusions
ChatGPT 4.0 outperformed DeepSeek V3 in English response quality, and both English responses surpassed Turkish ones. The lower Turkish performance risks inequitable access and underscores the need for future research to optimize under-represented languages, including domain-specific Turkish medical language processing.
Tables
Table 1. English 10 frequently asked questions.
Table 2. Turkish 10 frequently asked questions.
Table 3. Comparison of the overall quality of responses by ChatGPT and DeepSeek in English and Turkish.
Table 4. Summary of evaluation of responses to frequently asked questions by different large language models: Quality metrics and statistical analysis results.
Table 5. Comparison of the English vs Turkish responses.
Table 6. Metric-Based Evaluations and Statistical Significance. The table shows the P values derived from the comparison of responses provided by ChatGPT and DeepSeek for each of the 10 questions in English and Turkish separately. Statistically significant differences are in bold.
Table 7. Proportion of responses rated as acceptable and satisfactory (Likert scale 4–5) across the large language models and evaluation metrics.
References
1. Du AL, Robbins K, Waterman RS, National trends in nonoperating room anesthesia: Procedures, facilities, and patient characteristics: Curr Opin Anaesthesiol, 2021; 34(4); 464-69
2. Ardon AE, Prasad A, McClain RL, Regional anesthesia for ambulatory anesthesiologists: Anesthesiol Clin, 2019; 37(2); 265-87
3. Le Guen M, Liu N, Chazot T, Fischler M, Closed-loop anesthesia: Minerva Anestesiol, 2016; 82(5); 573-81
4. Friedrich S, Reis S, Meybohm P, Preoperative anxiety: Curr Opin Anaesthesiol, 2022; 35(6); 674-78
5. Kassahun WT, Mehdorn M, Wagner TC, The effect of preoperative patient-reported anxiety on morbidity and mortality outcomes in patients undergoing major general surgery: Sci Rep, 2022; 12; 6312
6. Schulman J, Zoph B, Kim C, ChatGPT: Optimizing language models for dialogue: OpenAI Blog, 2022 Available at: https://openai.com/blog/chatgpt
7. Normile D, Chinese firm’s large language model makes a splash: Science, 2025; 387(6731); 238
8. Gibney E, China’s cheap, open AI model DeepSeek thrills scientists: Nature, 2025; 638(8049); 131
9. Lim DYZ, Ke YH, Sng GGR, Large language models in anaesthesiology: Use of ChatGPT for American Society of Anesthesiologists physical status classification: Br J Anaesth, 2023; 131; e73-e75
10. Choi J, Oh AR, Park J, Evaluation of the quality and quantity of artificial intelligence-generated responses about anesthesia and surgery: using ChatGPT 3.5 and 4.0: Front Med (Lausanne), 2024; 11; 1400153
11. Ismaiel N, Nguyen TP, Guo N, The evaluation of the performance of ChatGPT in the management of labor analgesia: J Clin Anesth, 2024; 98; 111582
12. Lee D, Brown M, Hammond J, Readability, quality and accuracy of generative artificial intelligence chatbots for commonly asked questions about labor epidurals: A comparison of ChatGPT and Bard: Int J Obstet Anesth, 2025; 61; 104317
13. Mootz AA, Carvalho B, Sultan P, The accuracy of ChatGPT-generated responses in answering commonly asked patient questions about labor epidurals: A survey-based study: Anesth Analg, 2024; 138(5); 1142-44
14. Ando K, Sato M, Wakatsuki S, A comparative study of English and Japanese ChatGPT responses to anaesthesia-related medical questions: BJA Open, 2024; 10; 100296
15. Gonzalez Fiol A, Mootz AA, He Z, Accuracy of Spanish and English-generated ChatGPT responses to commonly asked patient questions about labor epidurals: A survey-based study among bilingual obstetric anesthesia experts: Int J Obstet Anesth, 2025; 61; 104290
16. Nguyen TP, Carvalho B, Sukhdeo H, Comparison of artificial intelligence large language model chatbots in answering frequently asked questions in anaesthesia: BJA Open, 2024; 10; 100280
17. : The best medical schools for anesthesiology, ranked Available from: https://www.usnews.com/best-graduateschools/top-medicalschools/anesthesiology-rankings
18. Yu K-H, Beam AL, Kohane IS, Artificial intelligence in healthcare: Nat Biomed Eng, 2018; 2(10); 719-31
19. Douglass K, Narayan L, Allen R, Language diversity and challenges to communication in Indian emergency departments: Int J Emerg Med, 2021; 14(1); 57