Performance of AI Chatbots in Preliminary Diagnosis of Maxillofacial Pathologies

Ridvan Guler; Emine Yalcin

doi:10.12659/MSM.949076

09 July 2025: Clinical Research

Med Sci Monit 2025; 31:e949076

Abstract

BACKGROUND: Artificial intelligence (AI) has shown significant potential in transforming healthcare by enabling accurate, data-driven decision-making. This study compared the performance of the AI chatbots ChatGPT, Grok, Blackbox, and Claude AI in preliminary diagnosis of maxillofacial pathologies.

MATERIAL AND METHODS: This study included 23 patients (9 cysts, 14 neoplasms) who underwent operations at Dicle University Faculty of Dentistry between 2017 and 2024 and had their diagnoses histopathologically confirmed. For each case, 4 differential diagnosis options were prepared in question format and directed to the AI platforms. The accuracy of the answers given by the chatbots was analyzed by comparing them with the definitive histopathological diagnoses of the cases. Statistical analysis used the chi-square and Fisher-Freeman-Halton tests to compare performance among the chatbots. Statistical significance was set at p<0.05.

RESULTS: ChatGPT answered 15 out of 23 questions correctly, achieving a success rate of 65.2%. Grok and Blackbox AI each achieved a success rate of 52.17%, while Claude AI achieved the lowest success rate, at 30.43%. When cases were categorized into cysts and neoplasms, Blackbox AI showed the highest accuracy for cyst cases (66.6%), while ChatGPT had the highest accuracy for neoplasm cases (71.4%). No statistically significant difference was observed in the distribution of correct and incorrect answers among the chatbots (p=0.125). Likewise, no statistically significant difference was observed among the chatbots in the distribution of correct answers between cyst and neoplasm cases (p=0.654).

CONCLUSIONS: Although all 4 AI chatbots achieved certain levels of accuracy, ChatGPT showed superior performance compared to other chatbots. The development of these chatbots could be beneficial for diagnostic accuracy and treatment recommendations in dentistry.

Keywords: Artificial Intelligence, Oral and Maxillofacial Surgeons, Pathology, Oral, Humans, Female, Male, Diagnosis, Differential, Middle Aged, Adult, Cysts, Generative Artificial Intelligence

Introduction

Artificial intelligence (AI) refers to the use of machines and technology to acquire, process, and apply skills and knowledge gained through education or experience. AI is typically an integrated platform capable of performing tasks similar to those associated with human intelligence [1]. AI aims to build systems that can learn, reason, and adapt based on data and feedback. The field of AI encompasses various branches such as machine learning, natural language processing (NLP), deep learning, and computer vision [2].

NLP is a branch of AI that studies the interaction between human language and computers. NLP enables the analysis, understanding, and generation of human language using computer programs and algorithms [3]. Large-language models (LLMs) are a type of NLP model capable of performing various language tasks, such as text completion, summarization, translation, and question answering [4].

AI-powered chatbots are conversational agents that mimic human interaction using machine learning (ML) and NLP techniques through written, verbal, and visual communication forms [5]. One AI-powered chatbot, Chat Generative Pre-Trained Transformer (ChatGPT), was released by OpenAI (San Francisco, California) on November 30, 2022, and quickly gained public attention for its ability to interact with users in a human-like manner, generate large amounts of data, and answer questions accurately [6]. Anthropic (California, USA) has developed LLMs named Claude 3 Opus and Claude 3.5 Sonnet, capable of reading and analyzing both text and image data [7]. xAI has recently introduced Grok, an advanced AI-powered chatbot equipped with real-time access to online information sources, capable of generating contextually relevant textual responses to posts on the X platform [8]. Leveraging a sophisticated LLM architecture, Grok combines natural language processing with visual understanding, thereby enabling multimodal interaction and enhanced comprehension of both text and images [9]. Other advanced AI chatbot models, such as Blackbox AI and YouChat, are also available in various versions.

The impact of AI and deep learning in medicine and oral maxillofacial surgery should not be underestimated. Modern medical imaging allows clinicians to visualize anatomy and pathology, and AI’s contribution to diagnostic and prognostic predictability can lead to revolutionary changes in clinical practice [10]. AI applications in dentistry and healthcare offer multiple services for healthcare professionals, including better diagnosis, decision support, digital data recording, image analysis, disease prevention, treatment planning, reduction of treatment errors, and facilitation of discovery and research [11]. AI can assist dentists, particularly when they need to make critical decisions quickly. By eliminating human errors in decision-making, AI can reduce the workload of dentists while providing superior and consistent medical treatment [12]. AI can assist dentists in various dental tasks such as treatment planning and clinical diagnosis, as it can detect and aid in identifying oral maxillofacial defects and abnormalities [13,14].

In dentistry, clinical preliminary diagnosis is crucial. The inability to make an early diagnosis due to a lack of clinical experience or delays in obtaining pathology test results can negatively affect the patient’s treatment duration and quality. In maxillofacial surgery, more accurate and faster preliminary diagnosis predictions for maxillofacial pathologies, especially benign and malignant tumors, can greatly contribute to faster treatment planning. While numerous studies have explored the use of chatbots in various areas of healthcare, a review of the existing literature revealed a notable gap in research specifically focused on maxillofacial pathologies. This study aimed to address this underexplored area, with the intention of contributing meaningfully to both the current body of knowledge and the development of future research in this domain.

We believe AI applications such as ChatGPT 4.0, Grok 2, Blackbox 3, and Claude 3 AI can accelerate the decision-making process, reduce the workload of dentists, and provide superior and consistent medical treatment during the preliminary diagnosis process. Therefore, we compared the performance of these 4 AI chatbots in the preliminary diagnosis of maxillofacial pathologies to identify the most effective tool. The study focused on the question, “To what extent do different AI chatbots provide accurate preliminary diagnoses of maxillofacial pathologies?”

Material and Methods

ETHICAL CONSIDERATIONS:

This comparative, cross-sectional study compared the performance of AI chatbots ChatGPT 4.0, Grok 2, Blackbox 3 and Claude 3 AI in evaluating information for the preliminary diagnosis of maxillofacial pathologies. The study was conducted in compliance with the principles of the Declaration of Helsinki. Ethical approval was obtained from the Dicle University Faculty of Dentistry Clinical Research Ethics Committee (Date: November 27, 2024, Protocol code: 2024–29). Informed written consent was obtained from all participants before their records were included. Participants’ privacy was ensured through use of anonymized data, and personal information was kept confidential. All data were used for research purposes only.

STUDY DESIGN AND DATA COLLECTION METHODS:

Four different AI chatbots were tested in the study: ChatGPT 4.0, Grok 2, Blackbox 3, and Claude 3 AI. Patients who applied to Dicle University Faculty of Dentistry, Oral and Maxillofacial Surgery Clinic between 2017 and 2024 and were observed to have maxillofacial pathology (eg, cyst, neoplasm) were included in this study. Patients participating in the study were aged 18–65 years. The operations for these patients were performed in our clinic, and definitive diagnoses were determined histopathologically.

We excluded patients with no pathology detected on panoramic radiographs between 2017 and 2024 and cases with unclear radiographic images. We also excluded cases with more than 1 pathology detected on the same radiograph, cases with artifacts in the images, and those without histopathological diagnosis.

An a priori power analysis was conducted using the effect size observed in the study data (w=0.61); the minimum recommended sample size to achieve 80% power at a 5% type I error rate was calculated to be 22. A total of 23 different cases (9 cysts and 14 neoplasms) were randomly selected, and 4 differential diagnosis options were prepared for each case in a question format. The patient sample included 12 males and 11 females.

These case questions were directed to the AI platforms. The current study evaluated the subscription AI tools ChatGPT 4.0, Grok 2, Blackbox 3, and Claude 3 AI from November 2024 to December 2024. Each question was submitted in a conversation session using the phrase “What is the most likely preliminary diagnosis?” To prevent memory bias, each question was submitted in a new conversation session in incognito mode. The same process was repeated twice for each chatbot. All chatbots provided identical responses to the questions during the second trial. Results from all 4 AI tools were copied and pasted into a Microsoft Word file (Microsoft Corporation, Redmond, Washington, USA).

The responses provided by each chatbot were then compared with the definitive histopathological diagnoses in our archive, and answers that were an exact match were considered correct. No validated measuring devices were used in the study. Primary data analysis involved calculating the percentage of total correct answers provided by each chatbot. Secondary data analysis involved the percentage of correct answers provided by the chatbots for cyst and neoplasm cases.

While it is important to acknowledge that the reasoning provided by AI tools is often detailed, accurate, and clearly articulated, and that the responses adequately address the topic, it is equally essential to consider their limitations, such as the potential for inaccuracies, ambiguity, and incompleteness, and the inability to provide scientifically reliable guidance. Therefore, to avoid bias, the reasons provided by the AI tools for their responses were not analyzed.
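
The sample-size figure reported in this section (w=0.61, 80% power, 5% type I error rate, minimum N=22) can be approximated with a short calculation. The sketch below is not the authors' procedure: it assumes a chi-square test with 1 degree of freedom and uses the normal approximation to the noncentrality parameter, and the function name `min_n_chi_square_df1` is hypothetical. Under those assumptions, the required N is the noncentrality parameter divided by w squared, rounded up.

```python
from math import ceil
from statistics import NormalDist


def min_n_chi_square_df1(w: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate minimum N for a chi-square test with Cohen's effect size w.

    Assumption (not stated in the article): 1 degree of freedom, so the
    noncentrality parameter can be approximated as (z_{1-alpha/2} + z_{power})^2.
    """
    z = NormalDist()  # standard normal distribution
    noncentrality = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2
    return ceil(noncentrality / w ** 2)


n_required = min_n_chi_square_df1(0.61)
print(n_required)
```

With w=0.61 this approximation gives the same minimum of 22 cases as reported; a test with more degrees of freedom would require a larger sample.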

To minimize selection bias, cases were randomly selected. To avoid measurement bias, all chatbots were tested by the same person, and evaluations were performed using a standardized scoring system. In this study, the exposure variables were different AI chatbots (ChatGPT, Grok, Claude, and Blackbox). The outcome variable was the accuracy of their preliminary diagnoses regarding maxillofacial pathologies.

STATISTICAL ANALYSIS:

Categorical variables are expressed as n (%). In Table 1, the chi-square test was used to evaluate the relationship between the variables. Since the majority of the expected cell frequencies in this table were above 5, the assumptions of the chi-square test were met and the test was applied with confidence. In Table 2, since some of the cell frequencies were below 5, the necessary assumptions for the chi-square test could not be met. Therefore, the Fisher-Freeman-Halton test, which can give reliable results in smaller samples and low cell frequencies, was used. There were no missing data in the dataset; therefore, no imputation or other missing data handling techniques were necessary. Statistical analyses were conducted using SPSS (IBM Corp., released 2017, IBM SPSS Statistics for Windows, Version 25.0. Armonk, NY: IBM Corp.), and P<0.05 was considered statistically significant.
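
As an illustration of the chi-square comparison described above, the following sketch recomputes the test from the correct and incorrect answer counts reported in the Results (15, 12, 12, and 7 correct out of 23). It is an independent re-calculation, not the SPSS procedure used by the authors; the 4×2 layout of the table and the closed-form tail probability for 3 degrees of freedom are assumptions of this sketch.

```python
from math import erfc, exp, pi, sqrt

# (correct, incorrect) answers out of 23 cases per chatbot, from the Results
observed = {
    "ChatGPT 4.0": (15, 8),
    "Grok 2": (12, 11),
    "Blackbox 3": (12, 11),
    "Claude 3": (7, 16),
}


def chi_square_statistic(table):
    """Pearson chi-square statistic for a rows x columns contingency table."""
    rows = list(table.values())
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for obs, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            stat += (obs - expected) ** 2 / expected
    return stat


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)


chi2 = chi_square_statistic(observed)  # df = (4 - 1) * (2 - 1) = 3
p_value = chi2_sf_df3(chi2)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

Every expected cell count in this table is 11.5, well above 5, and the resulting p-value agrees with the reported P=0.125.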

Results

The research focused on comparing the diagnostic accuracy of 4 AI chatbots in identifying maxillofacial pathologies. Among the patients included in the study, 12 were male (52%) and 11 were female (48%). The mean age of the patients was 36.1±10 years. Radiographic images were uploaded to the AI chatbots, and 4 differential diagnoses were provided for the preliminary diagnosis of the case. Each chatbot was asked the question, ‘What is the most probable preliminary diagnosis based on the radiographic and intraoral pathological images?’ for each case, and the responses were recorded (Tables 1, 3). ChatGPT 4.0 answered 15 out of 23 questions correctly, achieving a success rate of 65.2%, while Grok 2 AI and Blackbox 3 AI achieved success rates of 52.17%, with 12 correct answers each. Claude 3 AI, with 7 correct answers, achieved a lower success rate of 30.43% compared to the other chatbots (Table 2). We believe that Claude 3 AI’s relatively low performance may be attributed to its limited natural language processing capabilities and the constraints of its training datasets. It was observed that none of the chatbots provided correct responses for the cases of odontoma and dentigerous cyst. We believe this may be due to the radiographic images resembling those of other pathologies listed in the differential diagnosis options. Overall, Blackbox 3 AI, Grok 2, and ChatGPT 4.0 appeared to agree with one another, whereas Claude 3 AI was found to disagree with the others in most cases. The results indicate that ChatGPT 4.0 achieved the highest level of accuracy and delivered more reliable outcomes compared to the other chatbots. This finding is consistent with the initial research objective, as providing accurate diagnoses was a central aim of the study.

Upon analysis of the responses, it was generally observed that the chatbots exhibited limited agreement. In the Stafne bone cyst case, all chatbots correctly identified the pathology (Figure 1). However, in a case histopathologically diagnosed as a dentigerous cyst, ChatGPT 4.0, Grok 2, and Claude 3 AI incorrectly diagnosed it as ‘ameloblastoma’, while Blackbox 3 AI identified it as ‘keratocyst’ (Figure 2). Regarding a case of florid cemento-osseous dysplasia, ChatGPT 4.0, Grok 2, and Blackbox 3 AI correctly recognized the condition, while Claude 3 AI incorrectly identified it as ‘Periapical Cemental Dysplasia’ (Figure 3).

When the cases were divided into 2 categories (cysts and neoplasms), Blackbox 3 AI had the highest correct rate for cyst cases (66.6%), while ChatGPT 4.0 had the highest correct rate for neoplasm cases (71.4%). The AI chatbots with the lowest correct rate for cyst cases were Grok 2 (44.4%) and Claude 3 AI (44.4%), while for neoplasm cases, Claude 3 AI had the lowest correct rate (21.4%). Thus, Blackbox 3 AI performed best in cyst cases, while ChatGPT 4.0 achieved a numerically higher success rate than the other chatbots in neoplasm cases (Table 4).
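
The per-category rates above can be cross-checked against the overall totals. The sketch below back-calculates the number of correct answers per category from the reported percentages and case counts (9 cysts, 14 neoplasms). The counts not stated in the text (ChatGPT 4.0 for cysts; Grok 2 and Blackbox 3 for neoplasms) are derived by subtraction from each chatbot's total and should be read as an inference, not as values taken from Table 4.

```python
TOTAL_CYSTS, TOTAL_NEOPLASMS = 9, 14

# Overall correct answers out of 23, as reported in the Results
total_correct = {"ChatGPT 4.0": 15, "Grok 2": 12, "Blackbox 3": 12, "Claude 3": 7}

# Percentages stated in the text; the other cells are derived below
reported_cyst_pct = {"Grok 2": 44.4, "Blackbox 3": 66.6, "Claude 3": 44.4}
reported_neoplasm_pct = {"ChatGPT 4.0": 71.4, "Claude 3": 21.4}


def split_correct(name):
    """Return (correct cysts, correct neoplasms) for one chatbot."""
    if name in reported_cyst_pct:
        cysts = round(reported_cyst_pct[name] / 100 * TOTAL_CYSTS)
    else:
        neoplasms = round(reported_neoplasm_pct[name] / 100 * TOTAL_NEOPLASMS)
        cysts = total_correct[name] - neoplasms
    return cysts, total_correct[name] - cysts


by_category = {name: split_correct(name) for name in total_correct}

for name, (cysts, neoplasms) in by_category.items():
    print(
        f"{name}: cysts {cysts}/{TOTAL_CYSTS} ({100 * cysts / TOTAL_CYSTS:.1f}%), "
        f"neoplasms {neoplasms}/{TOTAL_NEOPLASMS} "
        f"({100 * neoplasms / TOTAL_NEOPLASMS:.1f}%)"
    )
```

Claude 3 AI serves as a consistency check, since both of its rates are reported: 4 of 9 cysts plus 3 of 14 neoplasms gives the stated total of 7 correct answers.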

We found no statistically significant difference in the distribution of correct and incorrect responses given by the chatbots to the case questions (P=0.125), suggesting that their general diagnostic performance may be similar. However, the lack of statistical significance could also be influenced by the sample size or variability in case complexity. Moreover, focusing solely on preliminary diagnosis of lesions may not reflect chatbot performance in broader clinical contexts. Additionally, we found no statistically significant difference in the distribution of correct responses given by the chatbots based on the differentiation of cysts and neoplasms (P=0.654). This result suggests that the systems may share a common limitation in making subtle diagnostic distinctions. This highlights the importance of evaluating not only the accuracy of diagnoses, but also how the chatbots arrive at their decisions. Therefore, the generalizability and applicability of our findings should be interpreted with these factors in mind.
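
For readers unfamiliar with the Fisher-Freeman-Halton test applied to the cyst versus neoplasm comparison, the following sketch shows how the exact test works on a 4×2 table: it enumerates every table with the same row and column totals and sums the probabilities of those no more likely than the observed one. The cell counts are back-calculated from the reported percentages and totals, so they are an assumption of this sketch, as are the function names; the authors ran the test in SPSS.

```python
from math import comb

# Correct answers per chatbot as (cysts, neoplasms); derived from the reported
# percentages and totals, not copied from Table 4 (assumption of this sketch).
correct_by_category = [(5, 10), (4, 8), (6, 6), (4, 3)]


def enumerate_first_column(row_totals, remaining):
    """Yield every first-column fill consistent with the row and column totals."""
    if len(row_totals) == 1:
        if 0 <= remaining <= row_totals[0]:
            yield (remaining,)
        return
    for a in range(min(row_totals[0], remaining) + 1):
        for rest in enumerate_first_column(row_totals[1:], remaining - a):
            yield (a,) + rest


def table_weight(first_col, row_totals):
    """Unnormalized hypergeometric probability of a table (exact integer)."""
    weight = 1
    for a, r in zip(first_col, row_totals):
        weight *= comb(r, a)
    return weight


def freeman_halton_p(rows):
    """Exact p-value for an r x 2 table: total probability of all tables
    that are no more likely than the observed one."""
    row_totals = [a + b for a, b in rows]
    observed_col = tuple(a for a, _ in rows)
    observed_weight = table_weight(observed_col, row_totals)
    extreme = 0
    for col in enumerate_first_column(row_totals, sum(observed_col)):
        weight = table_weight(col, row_totals)
        if weight <= observed_weight:
            extreme += weight
    return extreme / comb(sum(row_totals), sum(observed_col))


p_value = freeman_halton_p(correct_by_category)
print(f"Fisher-Freeman-Halton p = {p_value:.3f}")
```

Because the weights are kept as integers until the final division, the comparison against the observed table is exact. Whether the result matches the reported P=0.654 depends on the derived counts being the ones the authors used.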

Discussion

This study evaluated the diagnostic performance of 4 AI chatbots in assisting dentists with preliminary diagnoses based on case images. ChatGPT 4.0 achieved the highest accuracy (65.2%), followed by Grok 2 AI and Blackbox 3 AI (52.17%), while Claude 3 AI had the lowest accuracy (30.43%), possibly due to differing training data. Notably, Blackbox 3 AI showed better performance in cyst cases (66.6%), whereas ChatGPT 4.0 excelled in neoplasm cases (71.4%). ChatGPT 4.0 may have benefited from more advanced natural language processing capabilities and comprehensive training datasets, enabling it to provide more accurate and relevant responses. Its ability to correctly identify malignant lesions could potentially shorten the diagnostic and treatment process, thereby improving patient prognosis.

Oral and maxillofacial surgery, despite being a practical specialty where surgeons primarily work in operating rooms, is increasingly incorporating AI into all phases of the field, including screening, diagnosis, treatment decision-making, surgical procedures, and follow-up [15].

AI has been used in various branches of dentistry, such as oral and maxillofacial surgery, orthodontics, periodontology, endodontics, and prosthodontics, for diagnosis and treatment planning [16]. AI has also been applied to dental radiology to enhance image interpretation. Two-dimensional digital radiography consists of thousands of pixels. Within the grid, each pixel unit represents a different brightness level. All programs “learn” to analyze digital images based on these features [17]. Despite its potential in clinical applications, such as diagnosis, treatment, and public health initiatives, AI technologies are still underutilized in clinical practice [18]. AI can assist in decision-making, reduce human error in diagnosis, and alleviate the stress dentists face during decision-making processes [19].

In dentistry, intraoral radiographs, panoramic images, cephalograms, and CT scans are routinely used in clinical practice for diagnosis, treatment planning, and treatment evaluation. AI’s ability to analyze radiographic images produced by X-rays is of great importance for the fields of medicine and dentistry, as these images can be digitally coded and easily converted into computational language [20]. Due to the significant biological differences in pathological formations in the maxillofacial area, different treatment strategies exist. Therefore, determining and distinguishing these pathologies before surgical intervention is crucial. Panoramic radiographs are the most frequently used and practical imaging methods prior to surgery. Ariji et al used AI to classify 4 types of radiolucent lesions in the mandible (ameloblastoma, odontogenic keratocysts, dentigerous cysts, and radicular cysts) while simultaneously identifying the location of lesions in the jaws [21]. Lee et al used AI to detect and diagnose 3 types of odontogenic cystic lesions (odontogenic keratocysts, dentigerous cysts, and periapical cysts) [22]. Similar to the studies conducted by Ariji et al and Lee et al, we investigated the extent to which different AI chatbots provide accurate information in the preliminary diagnosis of maxillofacial pathologies.

Liu et al compared Claude 3 Opus and ChatGPT (GPT-4) in diagnosing melanoma from dermatoscopic images and found that Claude 3 Opus outperformed ChatGPT in distinguishing malignant from benign lesions, despite both models showing limitations in accuracy [23]. Ueda et al conducted a study using ChatGPT’s GPT-4 model to answer exam questions related to radiology diagnoses based on textual information obtained from clinical history and imaging findings. They found that ChatGPT correctly answered 54% (170 out of 313) of the exam questions [24]. In our study, ChatGPT 4.0 demonstrated the highest accuracy (65.2%) in identifying preliminary diagnoses from radiographic images, while Claude 3 AI showed the worst performance (30.43%) among the evaluated chatbots. Our findings appear to be consistent with the study by Ueda et al, but do not agree with the results reported by Liu et al.

In 2023, Gomes et al conducted a study using an InceptionV3-based framework to train and validate a CNN-based model that automatically categorizes oral lesion types from photographs. This study produced significant results in diagnosing oral ulcers [25]. Ranjan et al compared the performance of ChatGPT 3.5 and Gemini in answering 200 microbiology questions, and found similar overall accuracy (71% vs 70.5%), with Gemini performing better in general microbiology and immunology, and ChatGPT 3.5 excelling in applied microbiology [26]. In the present study, ChatGPT 4.0 achieved the highest performance in preliminary diagnoses, with a 65.2% success rate (15/23), followed by Grok 2 AI and Blackbox 3 AI, both scoring 52.17% (12/23). Claude 3 AI showed the worst performance, with a 30.43% success rate (7/23). Notably, none of the chatbots correctly identified cases of dentigerous cyst or odontoma. ChatGPT 4.0 was the only chatbot to provide accurate diagnoses for clear cell odontogenic carcinoma and ameloblastoma, while Claude 3 AI underperformed in multiple cases, including calcifying odontogenic tumor and squamous cell carcinoma. Our findings appear to be consistent with the study by Ranjan et al.

Sarangi et al conducted a study to evaluate the accuracy and reliability of Microsoft Bing, Claude, ChatGPT 3.5, and Perplexity in pulmonary embolism imaging using various case scenarios. Perplexity achieved the highest accuracy (0.83) in open-ended questions, while Claude had the lowest accuracy (0.58), and both Bing and ChatGPT scored 0.75. In questions where all valid answers were selected, Bing had the highest accuracy (0.96), Perplexity had the lowest (0.56), and both Claude and ChatGPT scored 0.6 [27]. In our study, ChatGPT showed better performance in interpreting images to identify cases compared to other applications. However, no chatbots achieved a success rate of 90% or higher.

Many studies have explored the use of AI in dentistry. Fatima et al found that Bing provided the best responses to questions on orthognathic surgery [28]. Talaat et al demonstrated that an AI model for temporomandibular joint osteoarthritis diagnosis had comparable or better accuracy than oral radiologists [29]. Mahmoud et al evaluated 4 different AI chatbots (GPT-4.0, GPT-3.5, Gemini, and Copilot) in oral and maxillofacial surgery exams and found that ChatGPT 4.0 outperformed other models in accuracy and error correction, showing its potential in enhancing surgical education [30]. Our findings appear to be consistent with the study by Mahmoud et al. ChatGPT 4.0 may have exhibited improved performance in generating accurate and contextually appropriate responses, owing to advancements in natural language processing algorithms and the incorporation of more extensive and diverse training datasets.

Several recent studies have emphasized the growing role of AI chatbots in dental diagnostics. Mohammad-Rahimi et al compared 6 chatbots and found that Claude performed best in oral pathology and medicine, while GPT-4 achieved the highest scores in oral radiology and had the best overall performance [31]. Grinberg et al demonstrated that ChatGPT-4 provided consistent and reproducible differential diagnoses for oral mucosal lesions, showing success in identifying potentially malignant cases [32]. Similarly, Cuevas-Nunez et al evaluated GPT-4’s performance in diagnosing oral and maxillofacial lesions histopathologically, and reported a moderate accuracy rate of 59.8% [33]. In line with these findings, our study showed that ChatGPT 4.0 outperformed the other chatbots in interpreting radiographic images and identifying maxillofacial pathology cases, reinforcing its potential as a diagnostic support tool in dentistry.

Although chatbots do not always reduce the time required to interpret cases, AI assistance can facilitate the more effective processing of complex information and potentially accelerate efficient diagnoses [32]. Moreover, these systems are particularly valuable for trainees and in resource-limited settings. Additionally, they provide accurate differential diagnoses for both common and specific conditions [32]. Artificial intelligence is acknowledged for assisting in disease diagnosis, prognosis prediction, and the development of patient-specific treatment approaches. Despite their erroneous responses, chatbots show promise in helping dentists make critical decisions quickly [34]. In this context, comparing the image-based diagnostic performance of AI applications like ChatGPT 4.0, Grok 2, Blackbox 3, and Claude 3 AI can provide valuable insights into their strengths and weaknesses, guiding the selection and optimization of AI-assisted diagnostic tools in dentistry. We believe that AI can particularly guide specialist physicians working in limited environments and those who do not encounter a sufficient number of cases during their training.

Beyond clinical practice, AI chatbots may be integrated into undergraduate education to assist medical and dental students, as well as residents, in enhancing their knowledge and perspectives on oral lesions. To better understand the potential of AI chatbots in maxillofacial pathology, larger-scale studies involving a wider variety of oral lesions are needed. Furthermore, the future development of a dedicated tool specifically designed for the diagnosis of oral lesions would be a welcome advancement, and additional research would be beneficial in shaping such a tool.

These findings highlight the potential of chatbots in aiding diagnostic processes. However, our study has several limitations. The relatively small sample size of 23 clinical cases may limit the generalizability of the findings and may have reduced the statistical power of the analysis, possibly masking meaningful differences among the chatbots. Additionally, the exclusive focus on preliminary diagnosis of lesions may not reflect chatbot performance in broader clinical contexts. ChatGPT 4.0, Grok 2, Blackbox 3, and Claude 3 AI were selected for this study due to their widespread availability, advanced natural language processing capabilities, and growing popularity in various domains including healthcare and diagnostics. However, the selection of chatbots based on popularity and accessibility, while practical, may exclude other potentially high-performing models. Future research should explore a broader spectrum of medical conditions and expand the dataset to provide a more comprehensive evaluation of chatbots in clinical decision-making. To maximize their clinical utility, these applications need further research and improvement. Exploring the evolving role of AI chatbots in diagnostic imaging requires additional research and validation, to which our study contributes.

Conclusions

This study demonstrated that AI chatbots, particularly ChatGPT 4.0, hold promise as supportive tools in the preliminary diagnosis of maxillofacial pathologies. Among the 4 models evaluated, ChatGPT 4.0 showed the highest diagnostic accuracy, especially in neoplasm cases, while Claude 3 AI had the worst overall performance. These findings suggest that ChatGPT 4.0 may be more suitable for specific diagnostic tasks in dental settings. However, all chatbots made diagnostic errors, indicating that their current capabilities are not sufficient to replace professional clinical judgment. Although chatbots can provide accurate information and guidance, they are unable to offer personalized care, emotional support, or perform physical treatments. As such, chatbots should be regarded not as a replacement for face-to-face care in dentistry, but as a complementary tool. Recognizing their limitations, further validation studies with multicenter and more diverse datasets are essential to improve their reliability and integration into clinical workflows. Over time, ongoing advancements in AI could enhance diagnostic accuracy and support treatment planning more effectively in dentistry.

Figures

Figure 1. All chatbots gave correct answers to the Stafne bone cyst case.

Figure 2. All chatbots gave incorrect answers to the dentigerous cyst case.

Figure 3. ChatGPT 4.0, Grok 2, and Blackbox 3 AI chatbots gave correct answers to the florid cemento-osseous dysplasia case.

Tables

Table 1. Answers given by chatbots for preliminary diagnosis of each cyst case.

Table 2. True-false distributions of chatbots’ answers to case questions.

Table 3. Answers given by chatbots for preliminary diagnosis of each neoplasm case.

Table 4. Distribution of correct answers given by chatbots according to cyst and neoplasm distinction.

References

1. Revilla-León M, Gómez-Polo M, Vyas S, Artificial intelligence applications in implant dentistry: A systematic review: J Prosthet Dent, 2023; 129(2); 293-300

2. Mese I, Taslicay CA, Sivrioglu AK, Improving radiology workflow using ChatGPT and artificial intelligence: Clin Imaging, 2023; 103; 109993

3. Garcia EV, Integrating artificial intelligence and natural language processing for computer-assisted reporting and report understanding in nuclear cardiology: J Nucl Cardiol, 2023; 30(3); 1180-90

4. Hadi A, Tran E, Nagarajan B, Kirpalani A, Evaluation of ChatGPT as a diagnostic tool for medical learners and clinicians: PLoS One, 2024; 19(7); e0307383

5. Aggarwal A, Tam CC, Wu D, Artificial intelligence-based chatbots for promoting health behavioral changes: Systematic review: J Med Internet Res, 2023; 25; e40789

6. Jacobs T, Shaari A, Gazonas CB, Is ChatGPT an accurate and readable patient aid for third molar extractions?: J Oral Maxillofac Surg, 2024; 82(10); 1239-45

7. Kurokawa R, Ohizumi Y, Kanzawa J, Diagnostic performances of Claude 3 Opus and Claude 3.5 Sonnet from patient history and key images in Radiology’s “Diagnosis Please” cases: Jpn J Radiol, 2024; 42(12); 1399-402

8. Shiraishi M, Sowa Y, Tomita K, Performance of artificial intelligence chatbots in answering clinical questions on japanese practical guidelines for implant based breast reconstruction: Aesthetic Plast Surg, 2025; 49(7); 1947-53

9. Kumar R, Waisberg E, Ong J, Optimizing autonomous artificial intelligence diagnostics for neuro-ocular health in space missions: Life Sci Space Res (Amst), 2025; 44; 64-66

10. Rekawek P, Rajapakse CS, Panchal N, Artificial intelligence: The future of maxillofacial prognosis and diagnosis?: J Oral Maxillofac Surg, 2021; 79(7); 1396-97

11. Eggmann F, Weiger R, Zitzmann NU, Implications of large language models such as ChatGPT for dental medicine: J Esthet Restor Dent, 2023; 35(7); 1098-102

12. Tiwari A, Gupta N, Singla D, Artificial intelligence’s use in the diagnosis of mouth ulcers: A systematic review: Cureus, 2023; 15(9); e45187

13. Ahmed N, Abbasi MS, Zuberi F, Artificial intelligence techniques: Analysis, application, and outcome in dentistry – a systematic review: Biomed Res Int, 2021; 2021; 9751564

14. Gao S, Wang X, Xia Z, Artificial intelligence in dentistry: A narrative review of diagnostic and therapeutic applications: Med Sci Monit, 2025; 31; e946676

15. Puladi B, Gsaxner C, Kleesiek J, The impact and opportunities of large language models like ChatGPT in oral and maxillofacial surgery: A narrative review: Int J Oral Maxillofac Surg, 2024; 53(1); 78-88

16. Thurzo A, Urbanová W, Novák B, Where is the artificial intelligence applied in dentistry? Systematic review and literature analysis: Healthcare (Basel), 2022; 10(7); 1269

17. Chen YW, Stanley K, Att W, Corrigendum: Artificial intelligence in dentistry: Current applications and future perspectives: Quintessence Int, 2020; 51(5); 430

18. Pagallo U, O’Sullivan S, Nevejans N, The underuse of AI in the health sector: Opportunity costs, success stories, risks and recommendations: Health Technol (Berl), 2024; 14(1); 1-14

19. Patil S, Albogami S, Hosmani J, Artificial intelligence in the diagnosis of oral diseases: Applications and pitfalls: Diagnostics (Basel), 2022; 12(5); 1029

20. Lee JH, Kim DH, Jeong SN, Diagnosis of cystic lesions using panoramic and cone beam computed tomographic images based on deep learning neural network: Oral Dis, 2020; 26(1); 152-58

21. Ariji Y, Yanashita Y, Kutsuna S, Automatic detection and classification of radiolucent lesions in the mandible on panoramic radiographs using a deep learning object detection technique: Oral Surg Oral Med Oral Pathol Oral Radiol, 2019; 128(4); 424-30

22. Liu Z, Liu J, Zhou Z, Differential diagnosis of ameloblastoma and odontogenic keratocyst by machine learning of panoramic radiographs: Int J Comput Assist Radiol Surg, 2021; 16(3); 415-22

23. Liu X, Duan C, Kim MK, Claude 3 Opus and ChatGPT With GPT-4 in dermoscopic image analysis for melanoma diagnosis: Comparative performance analysis: JMIR Med Inform, 2024; 12; e59273

24. Ueda D, Mitsuyama Y, Takita H, ChatGPT’s diagnostic performance from patient history and imaging findings on the Diagnosis Please quizzes: Radiology, 2023; 308(1); e231040

25. Gomes RF, Schmith J, Figueiredo RM, Use of artificial intelligence in the classification of elementary oral lesions from clinical images: Int J Environ Res Public Health, 2023; 20(5); 3894

26. Ranjan J, Ahmad A, Subudhi M, Assessment of artificial intelligence platforms with regard to medical microbiology knowledge: An analysis of ChatGPT and Gemini: Cureus, 2024; 16(5); e60675

27. Sarangi PK, Datta S, Swarup MS, Radiologic decision-making for imaging in pulmonary embolism: Accuracy and reliability of large language models – Bing, Claude, ChatGPT, and Perplexity: Indian J Radiol Imaging, 2024; 34(4); 653-60

28. Fatima K, Singh P, Amipara H, Accuracy of artificial intelligence-based virtual assistants in responding to frequently asked questions related to orthognathic surgery: J Oral Maxillofac Surg, 2024; 82(8); 916-21

29. Talaat WM, Shetty S, Al Bayatti S, An artificial intelligence model for the radiographic diagnosis of osteoarthritis of the temporomandibular joint: Sci Rep, 2023; 13(1); 15972

30. Mahmoud R, Shuster A, Kleinman S, Evaluating artificial intelligence chatbots in oral and maxillofacial surgery board exams: Performance and potential: J Oral Maxillofac Surg, 2025; 83(3); 382-89

31. Mohammad-Rahimi H, Khoury ZH, Alamdari MI, Performance of AI chatbots on controversial topics in oral medicine, pathology, and radiology: Oral Surg Oral Med Oral Pathol Oral Radiol, 2024; 137(5); 508-14

32. Grinberg N, Whitefield S, Kleinman S, Assessing the performance of an artificial intelligence based chatbot in the differential diagnosis of oral mucosal lesions: Clinical validation study: Clin Oral Investig, 2025; 29(4); 188

33. Cuevas-Nunez M, Silberberg VIA, Arregui M, Diagnostic performance of ChatGPT-4.0 in histopathological description analysis of oral and maxillofacial lesions: A comparative study with pathologists: Oral Surg Oral Med Oral Pathol Oral Radiol, 2025; 139(4); 453-61

34. Idrees M, Farah CS, Shearston K, A machine-learning algorithm for the reliable identification of oral lichen planus: J Oral Pathol Med, 2021; 50(9); 946-53

Figures

Figure 1. All chatbots gave correct answers to the Stafne bone cyst case.

Figure 2. All chatbots gave incorrect answers to the dentigerous cyst case.

Figure 3. ChatGPT 4.0, Grok 2, and Blackbox 3 AI chatbots gave correct answers to the florid cemento-osseous dysplasia case.

Tables

Table 1. Answers given by chatbots for preliminary diagnosis of each cyst case.

Table 2. True-false distributions of chatbots’ answers to case questions.

Table 3. Answers given by chatbots for preliminary diagnosis of each neoplasm case.

Table 4. Distribution of correct answers given by chatbots according to cyst and neoplasm distinction.
