AI-Powered Clinical Decision Support in Dentistry: Comparative Evaluation of Large Language Models for Oral Medicine and Periodontal Diagnosis

Rayan Mohammedfarooq Meer; Abdullah Alqarni; Basem Mohammed Akily; Hattan Zaki; Mostafa Ibrahim Fayad; Mohammed Hosny H. AbdElaziz; Mohamed Omar Elboraey

doi:10.12659/MSM.951721

01 April 2026: Clinical Research


Rayan Mohammedfarooq Meer^{ABCE 1}, Abdullah Alqarni^{CDE 2}, Basem Mohammed Akily^{BEF 3}, Hattan Zaki^{DEF 3}, Mostafa Ibrahim Fayad^{CDEF 4}, Mohammed Hosny H. AbdElaziz^{DEF 4}, Mohamed Omar Elboraey^{ABCEF 1,5*}


Med Sci Monit 2026; 32:e951721


Abstract


BACKGROUND: This study evaluates the diagnostic performance of 3 prominent artificial intelligence (AI)-powered large language models (LLMs) – ChatGPT, Copilot, and Gemini – as AI assistants for the diagnosis of oral lesions and periodontal conditions using comprehensive statistical analysis.

MATERIAL AND METHODS: A retrospective study was conducted on 385 cases with definite diagnoses from the College of Dentistry, Taibah University, Saudi Arabia. Clinical and radiographic images were presented to each AI model, and the diagnostic performance of the LLMs was evaluated using a 5-point Likert scale across 8 criteria: diagnostic concordance, time efficiency, ease of use, clarity of explanation, comprehensiveness, ability to answer questions, reliability, and diagnostic range. Statistical analysis included descriptive statistics with 95% confidence intervals, Friedman tests, post-hoc pairwise comparisons, correlation analysis, effect size calculations, and reliability assessment using Cronbach’s alpha.

RESULTS: ChatGPT demonstrated superior performance with an overall score of 4.846±0.075, followed by Copilot (4.433±0.163) and Gemini (4.234±0.088). Friedman tests revealed statistically significant differences across all evaluation criteria (P<0.001). Post-hoc analyses showed ChatGPT significantly outperformed both Gemini and Copilot in all criteria. Internal consistency was excellent for all systems (Cronbach α: 0.801-0.911).

CONCLUSIONS: The LLMs, particularly ChatGPT, demonstrate significant potential as reliable AI assistants for oral and periodontal diagnosis. The comprehensive statistical analysis confirms the superior performance of ChatGPT across multiple evaluation dimensions, supporting its potential integration into clinical practice.

Keywords: Oral Medicine, periodontal diseases, Image Interpretation, Computer-Assisted, Dentistry, Artificial Intelligence, Pelvic Neoplasms, periodontitis, Poloxamer, Diagnosis, Combined Modality Therapy

Introduction

The integration of artificial intelligence (AI) into healthcare has emerged as one of the most transformative developments in modern medicine, with applications spanning from diagnostic imaging to clinical decision support systems [1,2]. The rapid advancement of AI technologies, particularly in the realm of natural language processing and machine learning, has opened unprecedented opportunities for enhancing clinical practice and improving patient outcomes [3,4]. Among these technological innovations, large language models (LLMs) have garnered significant attention for their potential to revolutionize medical diagnosis and clinical decision-making processes [5,6].

LLMs, exemplified by systems such as ChatGPT (OpenAI), Google’s Gemini (formerly Bard), and Microsoft’s Copilot, represent a paradigm shift in AI capabilities, demonstrating remarkable proficiency in understanding complex medical terminology, processing clinical information, and generating coherent diagnostic recommendations [7,8]. These models, trained on vast datasets encompassing medical literature, clinical guidelines, and diagnostic protocols, have shown promising results in various medical specialties, including internal medicine, radiology, and pathology [9–11].

Recent systematic reviews and meta-analyses have highlighted the growing body of evidence supporting the clinical utility of LLMs in healthcare settings. A comprehensive review by Chen et al identified over 200 studies examining LLM applications in medical diagnosis, revealing an overall diagnostic accuracy of 52.1% across various medical specialties [12]. Similarly, the landmark study by Rodriguez-Martinez et al demonstrated that LLMs could achieve diagnostic performance comparable to that of junior physicians in complex clinical scenarios, with ChatGPT showing superior performance in differential diagnosis generation [13].

The application of AI in oral medicine and dentistry has experienced remarkable growth over the past decade, with numerous studies demonstrating the potential of machine learning algorithms and deep learning networks in enhancing diagnostic accuracy for oral pathologies [14–16]. Recent advances in AI-powered diagnostic systems have shown particular promise in the detection of oral cancers, periodontal diseases, and dental caries, with some studies reporting diagnostic accuracies exceeding 90% [17–19]. The recent comprehensive review by Thompson et al highlighted that AI applications in oral medicine have evolved from simple pattern recognition systems to sophisticated diagnostic assistants capable of processing multimodal clinical data [20].

Despite these advances, the clinical diagnosis of oral lesions and periodontal conditions remains a complex and challenging process, often requiring specialized expertise and comprehensive evaluation of clinical and radiographic data. Misdiagnosis or delayed diagnosis can significantly affect patient morbidity and mortality, particularly in the context of potentially malignant disorders and aggressive periodontal disease. Therefore, there is a critical need for advanced decision-support tools that can assist clinicians in improving diagnostic accuracy and efficiency in this specialized field [13].

However, the specific application of conversational AI and LLMs in oral medicine remains relatively underexplored, compared with that in other medical specialties. While traditional AI models in dentistry have primarily focused on image analysis and pattern recognition, LLMs offer unique advantages in processing textual clinical information, generating comprehensive diagnostic reports, and providing explanatory reasoning for their recommendations [21,22]. The recent study by Kumar et al was among the first to evaluate the performance of ChatGPT in oral pathology diagnosis, reporting promising results but highlighting the need for more comprehensive comparative studies [23].

The comparative evaluation of different LLMs has become increasingly important as healthcare institutions consider implementing these technologies in clinical workflows. Recent comparative studies have revealed significant performance variations among different AI models. The multicenter study by Patel et al comparing ChatGPT-4, Microsoft Copilot, and Google Gemini in medical question-answering tasks found that ChatGPT consistently outperformed other models across multiple evaluation metrics [24]. Similarly, the cross-sectional study by Rossettini et al evaluating these 3 LLMs in healthcare sciences entrance examinations reported superior accuracy for ChatGPT-4, followed by Copilot and Gemini [25].

The present study is novel and distinct from previous comparative analyses (eg, Patel et al, Rossettini et al) [24,25] in 2 critical aspects. First, it is the first to specifically evaluate the performance of these general-purpose LLMs in a specialized diagnostic task within the field of oral medicine and periodontology. Second, it employs a robust, multidimensional evaluation framework using a 5-point Likert scale across 8 criteria, providing a more comprehensive assessment of clinical utility beyond simple accuracy metrics. This approach addresses the lack of standardized, comprehensive evaluation criteria in the existing literature.

Emerging evidence suggests that the performance of LLMs may vary significantly across different medical specialties and clinical contexts. The comparative analysis by Fattah et al revealed that while ChatGPT demonstrated higher accuracy in general medical inquiries, the performance gap between different LLMs varied depending on the complexity and specificity of medical questions [26]. This finding underscores the importance of specialty-specific evaluations, particularly in fields such as oral medicine, in which the diagnostic challenges may differ substantially from those of general medical practice.

The integration of LLMs into clinical practice also raises important considerations regarding reliability, consistency, and clinical utility. Recent studies have emphasized the importance of comprehensive evaluation frameworks that assess not only diagnostic accuracy but also factors such as explanation quality, user experience, and clinical workflow integration [27,28]. The systematic review by Johnson et al identified 8 key performance dimensions that should be considered when evaluating AI diagnostic assistants, including accuracy, reliability, usability, and comprehensiveness [29].

Furthermore, the rapid evolution of LLM technology necessitates continuous evaluation and comparison of different systems. The longitudinal study by Martinez-Lopez et al demonstrated that model updates and improvements could significantly affect diagnostic performance, with newer versions of ChatGPT showing substantial improvements over earlier iterations [30]. This dynamic nature of LLM technology highlights the importance of conducting regular comparative assessments to inform clinical decision-making and technology adoption strategies.

Despite the growing interest in LLM applications in healthcare, several gaps remain in the current literature. First, there is a paucity of comprehensive comparative studies evaluating multiple LLMs simultaneously across standardized evaluation criteria. Second, most existing studies focus on general medical applications, with limited research specifically addressing oral medicine and periodontal conditions. Third, most published studies lack robust statistical analysis and effect size calculations, limiting the ability to draw meaningful conclusions about clinical significance.

The present study aims to address these gaps by providing a comprehensive comparative evaluation of 3 leading LLMs – ChatGPT, Copilot, and Gemini – as diagnostic assistants in oral medicine. By using a rigorous methodology with standardized evaluation criteria and advanced statistical analysis in this research, we seek to provide evidence-based insights into the relative performance of these AI models in the context of oral and periodontal diagnosis. The findings of this study will contribute to the growing body of literature on AI applications in dentistry and provide practical guidance for clinicians and healthcare institutions considering the implementation of LLM-based diagnostic support systems.

The primary research question addressed by this study is: How does the diagnostic performance of ChatGPT, Copilot, and Gemini compare when evaluated against a standardized, multidimensional framework for the diagnosis of oral lesions and periodontal conditions from clinical and radiographic images?

Material and Methods

STUDY DESIGN AND ETHICAL CONSIDERATIONS:

This study was a cross-sectional, retrospective evaluation of 3 AI LLMs – ChatGPT, Copilot, and Gemini – used as AI assistants for the diagnosis of different oral lesions and periodontal conditions. The diagnostic performance of the AI models was evaluated using a series of predefined 5-point Likert scales ranging from 1 (poor performance) to 5 (excellent performance).

Ethics approval for this study was obtained from the Institutional Review Board (IRB) of Taibah University, Faculty of Dentistry on May 2, 2024, with IRB number: TUCDREC/020524/MELBORAEY. As the study used anonymized, retrospective clinical and radiographic images, the requirement for individual patient consent was waived by the IRB. All data were de-identified in compliance with the Declaration of Helsinki and local ethical guidelines.

SAMPLE SIZE CALCULATION:

The sample size was calculated using the following equation: n = Z² × p × (1 − p) / E², where Z represents the Z-score corresponding to the desired confidence level, p is the estimated proportion of correct diagnoses, and E denotes the margin of error. With a Z-score of 1.96 (for 95% confidence), an estimated proportion (p) of 0.5 (to maximize sample size), and a margin of error (E) of 0.05, the required sample size was calculated as: n = (1.96)² × 0.5 × (1 − 0.5) / (0.05)² = 384.16.

Rounding up gives a minimum of 385 cases, which the study sample of 385 cases met.
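The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function name `sample_size_for_proportion` is illustrative and not taken from the study.

```python
import math


def sample_size_for_proportion(z: float = 1.96, p: float = 0.5, margin: float = 0.05) -> float:
    """Return the unrounded sample size n = z^2 * p * (1 - p) / margin^2."""
    return (z ** 2) * p * (1 - p) / (margin ** 2)


n_raw = sample_size_for_proportion()   # values stated in the text: Z=1.96, p=0.5, E=0.05
n_required = math.ceil(n_raw)          # a sample size must be a whole number of cases
print(round(n_raw, 2), n_required)
```

Rounding up rather than to the nearest integer keeps the margin of error at or below the target.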

STUDY POPULATION AND DATA COLLECTION:

A total of 385 definitively diagnosed oral lesions and periodontal conditions were included in this study. The conditions were diagnosed by expert professionals depending on clinical and radiographic data, with the inclusion of histological findings, if applicable. The cases were selected using a convenience sampling method from the archives of the Oral Medicine and Periodontology Departments at the Faculty of Dentistry, Taibah University, KSA, spanning the period from January 2020 to December 2024.

The inclusion criteria were as follows: (1) availability of high-quality clinical and/or radiographic images; (2) confirmed final diagnosis by at least 2 independent specialists (oral medicine consultant and/or periodontist), supported by histopathological findings where applicable; and (3) cases covering a broad spectrum of common oral and periodontal conditions. The exclusion criteria were as follows: (1) cases with incomplete clinical records or equivocal final diagnoses, and (2) cases with image quality that was deemed insufficient for remote visual diagnosis.

AI MODEL EVALUATION PROCEDURE:

The exposure variable was the specific AI model (ChatGPT, Copilot, Gemini), and the outcome variable was the diagnostic performance as measured by the composite Likert score.

Only clinical and radiographic images were uploaded to the AI models. Images were standardized to a uniform PNG format (1800×2500 pixels) and anonymized by removing all patient identifiers. The images were presented to the AI models using the following consistent, predefined prompt for each case: “Analyze the attached clinical and/or radiographic image(s) and provide the most likely diagnosis, a differential diagnosis, the rationale for your conclusion, and treatment plan”.

To ensure reproducibility and control for version updates, all AI models were accessed and tested on the same day (Jan 20, 2025) using their then-current, publicly available versions (ChatGPT-4, Copilot Pro with GPT-4, and Gemini Advanced). The same prompts and image formats were used for each AI model.

EVALUATION AND SCORING PROCEDURE:

Eight core criteria were used to establish frameworks for standard assessment of the AI models’ capabilities. Scores were assigned for the 8 criteria: diagnostic concordance, time efficiency, ease of use, clarity of explanation, comprehensiveness, ability to answer questions, reliability, and diagnostic range.

DIAGNOSTIC CONCORDANCE:

The diagnostic concordance of the model was assessed using a 5-point scale ranging from 1 (very inaccurate) to 5 (very accurate). A score of 5 was assigned when the model accurately identified the definitive and precise diagnosis of the disease using the provided image and specified clinical guidance, reflecting a very accurate performance with almost no errors. A score of 4 indicated that the correct diagnosis appeared as the first option in the differential diagnosis proposed by the intelligent model, representing a mostly accurate result with few errors. A score of 3 indicated a moderately accurate performance, in which the correct diagnosis was located within the upper third of the differential diagnostic options generated, except when it was the first option, which corresponded to a score of 4. A score of 2 indicated generally inaccurate performance, with the correct diagnosis appearing within the middle third of the differential diagnostic list. Finally, a score of 1 indicated a very inaccurate outcome, assigned when the correct diagnosis was found within the lower third of the differential diagnosis list or when only the main disease category was correctly identified.
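One way to read the rubric above is as a function of where the correct diagnosis appears in the model's output. The sketch below is an interpretation, not the evaluators' actual procedure: the function name, its arguments, the handling of a correct diagnosis that is absent from the list (scored 1), and the assignment of boundary positions to the higher third are all assumptions.

```python
from typing import Optional


def concordance_score(primary_is_correct: bool, rank: Optional[int], n_options: int) -> int:
    """Map the position of the correct diagnosis to a 1-5 concordance score.

    primary_is_correct: the model's stated definitive diagnosis matches the reference.
    rank: 1-based position of the correct diagnosis in the differential list,
          or None if it is not listed (assumed to score 1).
    n_options: number of entries in the differential list.
    """
    if primary_is_correct:
        return 5
    if rank is None or n_options <= 0:
        return 1  # assumption: not listed, or only the disease category matched
    if rank == 1:
        return 4
    # Integer comparisons avoid floating-point issues at the third boundaries.
    if 3 * rank <= n_options:
        return 3  # upper third of the list
    if 3 * rank <= 2 * n_options:
        return 2  # middle third
    return 1      # lower third


print(concordance_score(False, 2, 6))
```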

TIME EFFICIENCY:

This criterion assessed the speed at which the AI model generated diagnostic results. Scores were defined as follows: a score of 1 indicated a very slow result causing significant delays in diagnosis, generated over 8 seconds; 2 indicated a slow result with noticeable delays, generated from 6 to 8 seconds; 3 indicated moderate speed with acceptable delays, generated from 4 to 6 seconds; 4 indicated a fast result with minimal delays, generated from 2 to 4 seconds; and 5 indicated a very fast result with no perceptible delays, generated in under 2 seconds.

The time efficiency criterion evaluated the user-perceived responsiveness of each model within a simulated clinical workflow. Several external factors can influence time efficiency, such as network latency, server load, and platform infrastructure. However, the inclusion of this criterion reflects the practical efficiency an oral health professional would experience when integrating these tools into daily clinical practice, rather than focusing solely on the intrinsic computational speed of the models.
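The response-time bands can be expressed as a small lookup function. The stated bands share their endpoints (for example, 6 seconds falls in both "4 to 6" and "6 to 8"), so this sketch assumes each shared boundary belongs to the slower band, consistent with "under 2 seconds" scoring 5 and "over 8 seconds" scoring 1. The function name is illustrative.

```python
def time_efficiency_score(seconds: float) -> int:
    """Map a response time in seconds to the 1-5 time efficiency score."""
    if seconds < 2:
        return 5  # under 2 s
    if seconds < 4:
        return 4  # 2 s up to (not including) 4 s; boundary handling is an assumption
    if seconds < 6:
        return 3  # 4 s up to (not including) 6 s
    if seconds <= 8:
        return 2  # 6 s to 8 s inclusive
    return 1      # over 8 s


for t in (1.5, 2.0, 4.0, 6.0, 8.0, 8.1):
    print(t, time_efficiency_score(t))
```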

EASE OF USE:

The ease of use of the AI models’ web interface was evaluated using a 5-point scale based on 5 predefined usability criteria. The assessment followed the sequential order of the 5 evaluated criteria: (1) ease of access to the web page, (2) fast loading time, (3) clarity of icons and user interface elements, (4) absence of lag or server downtime during operation, and (5) lack of mandatory login requirements to access the AI module’s web page. A score of 5 was assigned when all 5 criteria were fully met, while the score decreased by 1 point for each missing criterion. Specifically, a score of 4 indicated that 1 criterion was not fulfilled, 3 indicated that 2 criteria were missing, 2 indicated that 3 criteria were missing, and 1 indicated that 4 criteria were missing.
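The deduction rule above amounts to subtracting the number of unmet criteria from 5. The text does not define a score for the case in which all 5 criteria are unmet, so the sketch below assumes a floor of 1; the criterion keys are illustrative labels for the 5 items listed.

```python
USABILITY_CRITERIA = (
    "easy_access",
    "fast_loading",
    "clear_interface",
    "no_lag_or_downtime",
    "no_mandatory_login",
)


def ease_of_use_score(met: dict) -> int:
    """Return 5 minus the number of unmet usability criteria, floored at 1 (assumption)."""
    unmet = sum(1 for name in USABILITY_CRITERIA if not met.get(name, False))
    return max(1, 5 - unmet)


example = {name: True for name in USABILITY_CRITERIA}
example["no_mandatory_login"] = False  # hypothetical platform that requires a login
print(ease_of_use_score(example))
```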

CLARITY OF EXPLANATION:

This scale assessed the degree of clarity in the explanation provided by each AI model. The scale ranged from 1 to 5. Each AI model was required to provide a final diagnosis, a differential diagnosis, treatment options, and a diagnostic rationale or explanation. A score of 5 was assigned when the explanation was very clear and easy to understand, presented in simple and precise language, and supported by a valid diagnosis with comprehensive justification. A score of 4 was given when the explanation was clear and understandable but 1 of the aforementioned elements was incomplete or insufficiently justified. A score of 3 was given for an acceptable explanation that could be clearer, in which more than 1 component was summarized or only partially addressed. A score of 2 was given when the explanation was unclear, the justification was inadequate, and multiple key elements were missing. Finally, a score of 1 was given when the explanation was very unclear and overly concise, with many essential points omitted.

COMPREHENSIVENESS:

Each AI model was given the same clinical image and standardized prompt, and its response was evaluated using a 5-point comprehensiveness scale. The assessment considered inclusion of 5 key elements: definitive diagnosis, disease category, differential diagnosis, treatment options, and additional relevant details such as precautions or systemic associations. A score of 5 indicated complete and accurate coverage of all components, while 1 point was deducted for each missing or insufficient element, with 1 representing minimal or incomplete information. Scores of 2, 3, and 4 reflected increasing levels of completeness, from partial coverage with major gaps to nearly comprehensive responses with minor omissions.

ABILITY TO ANSWER QUESTIONS:

This criterion was used to evaluate each AI model’s capacity to provide clarifications or respond to questions about its explanation. Scores were defined as follows: a score of 1 indicated the AI model was unable to answer questions related to the explanation; 2 indicated it answered some questions but inadequately; 3 indicated it answered most questions adequately; 4 indicated it answered all questions clearly and understandably; and 5 indicated it answered all questions excellently and in detail.

RELIABILITY:

The reliability of each AI model was evaluated by testing its stability and reproducibility when presented with identical inputs. Each model was queried 3 consecutive times using the same clinical image and standardized prompt. Responses were compared for consistency across key elements: diagnosis, disease category, differential diagnosis, treatment plan, and additional relevant details. A consistency index (CI) was calculated as: CI = (number of consistent elements across trials/total number of evaluated elements)×100.

Models were then rated on a 5-point scale as follows: 1 indicated very inconsistent: CI <40%; 2 indicated generally inconsistent: CI=40–59%; 3 indicated moderately consistent: CI=60–79%; 4 indicated mostly consistent: CI=80–94%; and 5 indicated very consistent: CI ≥95%.
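The consistency index and its rating bands can be sketched as follows. The published bands are given in whole percentages (for example, 80–94% and ≥95%), so values that fall between bands, such as 94.5%, are not defined; this sketch assumes continuous thresholds at 40, 60, 80, and 95. The function names are illustrative.

```python
def consistency_index(n_consistent: int, n_total: int) -> float:
    """CI = (consistent elements across trials / total evaluated elements) x 100."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_consistent / n_total


def reliability_score(ci_percent: float) -> int:
    """Map a consistency index (percent) to the 1-5 reliability score."""
    if ci_percent >= 95:
        return 5  # very consistent
    if ci_percent >= 80:
        return 4  # mostly consistent (assumed to include values between 94 and 95)
    if ci_percent >= 60:
        return 3  # moderately consistent
    if ci_percent >= 40:
        return 2  # generally inconsistent
    return 1      # very inconsistent


# Hypothetical case: 4 of the 5 key elements matched across the 3 repeated queries.
ci = consistency_index(4, 5)
print(ci, reliability_score(ci))
```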

DIAGNOSTIC RANGE:

To evaluate the diagnostic range of each AI model, a structured approach was applied. First, a diverse case set was compiled, including common conditions and rare or complex cases to ensure broad coverage and focus on the model’s ability to handle varied case complexity rather than absolute precision.

The diagnostic range score was calculated as follows: score (%)=(accuracy scale/5)×100.

Models were rated on a 5-point scale, where 1 (<20%) indicated very limited range: very few cases diagnosed; 2 (20–39%) indicated limited range: some cases diagnosed; 3 (40–59%) indicated moderate range; 4 (60–79%) indicated broad range: most cases diagnosed; and 5 (≥80%) indicated very broad range: covered nearly all common and rare cases.

SCORING PROCEDURE:

For each criterion, 3 independent, blinded evaluators (2 oral medicine consultants and 1 periodontist) independently assigned scores based on the predefined 5-point Likert scale. The evaluators were blinded to which AI model produced each diagnosis, to minimize potential bias. Discrepancies in scores were resolved through consensus discussions among the reviewers. The final score for each criterion was calculated as the mean of the 3 reviewers’ scores. A composite score was also computed by averaging the scores across all criteria to provide an overall performance rating for each AI model.
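The two averaging steps described above (mean across the 3 evaluators per criterion, then mean across criteria) can be sketched with the standard library. The rating tuples, model label, and evaluator labels below are hypothetical placeholders, not study data, and only 2 of the 8 criteria are shown for brevity.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ratings: (model, case_id, criterion, evaluator, score)
ratings = [
    ("model_a", 1, "diagnostic concordance", "E1", 5),
    ("model_a", 1, "diagnostic concordance", "E2", 4),
    ("model_a", 1, "diagnostic concordance", "E3", 5),
    ("model_a", 1, "clarity of explanation", "E1", 4),
    ("model_a", 1, "clarity of explanation", "E2", 4),
    ("model_a", 1, "clarity of explanation", "E3", 4),
]


def criterion_means(rows):
    """Mean of the evaluators' scores for each (model, case, criterion)."""
    grouped = defaultdict(list)
    for model, case_id, criterion, _evaluator, score in rows:
        grouped[(model, case_id, criterion)].append(score)
    return {key: mean(scores) for key, scores in grouped.items()}


def composite_scores(rows):
    """Mean of the criterion-level scores for each (model, case)."""
    grouped = defaultdict(list)
    for (model, case_id, _criterion), value in criterion_means(rows).items():
        grouped[(model, case_id)].append(value)
    return {key: mean(values) for key, values in grouped.items()}


print(composite_scores(ratings))
```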

STATISTICAL ANALYSIS:

Data were analyzed using Python 3.11 with pandas, numpy, matplotlib, seaborn, and scipy libraries. Descriptive statistics were calculated, including means, standard deviations, and 95% confidence intervals using the t-distribution. For continuous data, normality was assessed using the Kolmogorov-Smirnov test.
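A 95% confidence interval for a mean based on the t-distribution can be computed as in the following sketch. It uses numpy and scipy, which the text lists among the libraries used; the helper name `mean_ci` and the sample scores are illustrative, and this is not the authors' analysis script.

```python
import numpy as np
from scipy import stats


def mean_ci(scores, confidence: float = 0.95):
    """Return (mean, sample SD, (lower, upper)) using the t-distribution."""
    x = np.asarray(scores, dtype=float)
    n = x.size
    m = x.mean()
    sd = x.std(ddof=1)                      # sample standard deviation
    sem = sd / np.sqrt(n)                   # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return m, sd, (m - t_crit * sem, m + t_crit * sem)


# Hypothetical Likert scores for one criterion
m, sd, (lower, upper) = mean_ci([4, 5, 5, 4, 5])
print(f"mean = {m:.3f}, SD = {sd:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```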

For comparing the 3 AI models across multiple criteria, Friedman tests were performed as the non-parametric equivalent of repeated measures ANOVA. Post-hoc pairwise comparisons were conducted using Wilcoxon signed-rank tests with appropriate corrections for multiple comparisons. An exploratory analysis using Cohen’s d was conducted to assess the proximity of the models’ performance to the maximum expert rating (5.0), and effect size r was calculated for pairwise comparisons. Cronbach’s alpha was used to assess the internal consistency of the composite performance construct. In this context, the alpha coefficient reflects the reliability of the combined evaluation domains in capturing overall model performance, rather than suggesting redundancy or unidimensionality among the 8 heterogeneous criteria.
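The omnibus and post-hoc steps can be sketched with scipy as shown below. Several choices here are assumptions rather than details from the text: the scores are hypothetical placeholders, the multiple-comparison correction is taken to be Bonferroni (the text does not name the method), and the effect size r is approximated as |Z|/√N with |Z| recovered from the two-sided P value and N taken as the number of paired cases.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical per-case scores (30 cases) for three placeholder models.
scores = {
    "model_a": np.array([5, 5, 5, 4, 5] * 6),
    "model_b": np.array([4, 5, 4, 4, 5] * 6),
    "model_c": np.array([4, 4, 3, 4, 4] * 6),
}
n_cases = len(scores["model_a"])

# Omnibus test: the same cases are scored under all three models (repeated measures).
friedman_stat, friedman_p = stats.friedmanchisquare(*scores.values())

# Post-hoc pairwise Wilcoxon signed-rank tests.
pairs = list(combinations(scores, 2))
pairwise = {}
for first, second in pairs:
    result = stats.wilcoxon(scores[first], scores[second])
    # Bonferroni correction (assumed; the correction method is not named in the text).
    p_adjusted = min(1.0, result.pvalue * len(pairs))
    # Approximate |Z| from the two-sided P value, then r = |Z| / sqrt(N).
    z_abs = stats.norm.isf(result.pvalue / 2)
    pairwise[(first, second)] = {"p_adjusted": p_adjusted, "r": z_abs / np.sqrt(n_cases)}

print(f"Friedman chi-square = {friedman_stat:.2f}, p = {friedman_p:.2g}")
for pair, values in pairwise.items():
    print(pair, f"adjusted p = {values['p_adjusted']:.2g}, r = {values['r']:.2f}")
```

With heavily tied Likert data, scipy may fall back to a normal approximation for the Wilcoxon P value, which is one reason to report the effect size alongside it.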

There were no missing data, as the study was based on complete, confirmed case files. For statistical analysis, all selected cases were included.

To minimize methodological bias, potential sources of error were identified and systematically addressed. Selection bias was mitigated by including a large and demographically diverse sample of 385 clinical cases representing oral and periodontal conditions, as detailed in Table 1. This ensured adequate variability and improved the representativeness of the study population.

Measurement bias was reduced by blinding all evaluators to the AI model (ChatGPT, Copilot, or Gemini) and by using a standardized, predefined 5-point Likert scale with clear operational definitions for each criterion, thereby enhancing objectivity and reproducibility.

All performance metrics derived from Likert scales are reported as raw mean scores out of 5.0. To avoid methodological misinterpretation, no percentage-based transformations were used to describe diagnostic accuracy; instead, these scores were interpreted as the degree of proximity to the ideal expert response.

Statistical significance was set at P<0.05 for all analyses.

Results

DESCRIPTIVE STATISTICS AND PERFORMANCE OVERVIEW:

The comprehensive statistical analysis revealed significant differences in performance among the 3 AI models. ChatGPT demonstrated superior performance with a mean composite Likert score of 4.846±0.075 out of 5.0. This was followed by Copilot (4.433±0.163) and Gemini (4.234±0.088). Figure 1 shows the overall comparative performance of the 3 AI models across all criteria.

A detailed breakdown of the mean scores for each AI module across all 8 evaluation criteria is presented in Table 2. The comparison of mean Likert scores of ChatGPT, Copilot, and Gemini across 8 evaluation criteria is presented in Figure 2. A performance radar chart for the 3 AI models is presented in Figure 3.

COMPARATIVE STATISTICAL ANALYSIS:

Friedman tests (Table 3) revealed statistically significant differences across all evaluation criteria (P<0.001), indicating that the performance of the 3 AI models was not equivalent.

Post-hoc analyses using Wilcoxon signed-rank tests (Table 4) showed that ChatGPT significantly outperformed both Gemini and Copilot in all criteria (P<0.001 for all pairwise comparisons). Copilot also significantly outperformed Gemini across all criteria (P<0.01 for all pairwise comparisons).

RELIABILITY ANALYSIS:

Internal consistency reliability shown by Cronbach’s alpha coefficient was 0.911 for ChatGPT, 0.845 for Copilot, and 0.801 for Gemini.
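Coefficients of this kind follow from the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores), where k is the number of criteria. The sketch below implements that formula with the standard library; the function name and the toy score matrix are illustrative and unrelated to the study data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of rows, each holding k item (criterion) scores."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("at least two items are required")
    columns = list(zip(*rows))                      # one tuple of scores per item
    item_variance_sum = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Toy example: three perfectly consistent items give alpha = 1.
toy_scores = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(cronbach_alpha(toy_scores))
```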

CORRELATION ANALYSIS:

Correlation analysis was performed to examine the relationships between the 8 evaluation criteria within each AI model. While a detailed presentation of the correlation matrix is beyond the scope of the primary findings, the key insight was that for all models, diagnostic concordance was highly correlated with clarity of explanation (r >0.85).

Correlation analysis revealed interesting patterns within each AI model. Gemini showed the strongest inter-criterion correlations, with significant correlations between time efficiency and ease of use (r=0.755), and between time efficiency and ability to answer questions (r=0.702).

Regarding the models’ alignment with the ideal score, the exploratory effect size calculations (Figure 4) indicated varying degrees of proximity to the maximum rating, with ChatGPT showing the highest concordance.

SENSITIVITY ANALYSIS ON TIME EFFICIENCY:

Sensitivity analysis was performed to address concerns regarding the time efficiency metric. The analysis involved recalculating the composite score without including the time efficiency criterion. The results showed that the overall ranking and the significance of the differences remained unchanged: ChatGPT (4.86±0.07), Copilot (4.45±0.16), and Gemini (4.25±0.09). The Friedman test result remained highly significant (P<0.001).

Discussion

COMPARISON WITH EXISTING LITERATURE AND NOVELTY:

The findings of this comprehensive comparative evaluation provide robust evidence for the differential performance of 3 leading LLMs in oral medicine diagnosis. ChatGPT significantly outperformed both Copilot and Gemini across all evaluation criteria, with a mean composite Likert score of 4.846±0.075, corresponding to approximately 97% of the maximum attainable rating on the evaluation scale. This score reflects high proximity to the maximum expert rating rather than a statistical percentage of diagnostic correctness. This finding aligns with recent comparative studies in other medical specialties and provides important insights into the potential clinical implementation of LLM-based diagnostic assistants in oral medicine.

The superior performance of ChatGPT in the present study agrees with the findings of recent comparative studies, such as the results of Patel et al and Rossettini et al, which reported ChatGPT’s consistent advantage in medical question-answering and entrance examinations, respectively [24,25], with ChatGPT achieving accuracy of 78.2%, followed by Copilot (71.4%) and Gemini (65.8%) [25]. Similarly, the comprehensive analysis by Fattah et al examined the performance of ChatGPT and Gemini across a wide range of medical inquiries [26]. The authors reported that the strong performance of the LLMs may be attributed to factors such as advanced capabilities in processing complex medical information, generating accurate diagnostic recommendations, and the benefits of continuous model updates and fine-tuning processes [30,31]. However, the present study advances the literature by focusing on a multimodal interpretation of the provided clinical and radiographic images to generate diagnostic suggestions in oral medicine, an area previously underexplored. The use of a multidimensional Likert scale, which assesses aspects such as clarity of explanation and reliability in addition to diagnostic concordance, provides a more nuanced measure of clinical utility than the simple diagnostic concordance metrics used in many existing papers. This methodological distinction, coupled with the specialty-specific focus, establishes the novelty and main contribution of this study, which provides a direct and comprehensive comparison of the 3 models.

Furthermore, a key finding is the high correlation between diagnostic concordance and clarity of explanation (r >0.85) across all models. This suggests that the models capable of generating more accurate diagnoses are also capable of providing clearer, more defensible reasoning, a critical factor for clinician trust and responsible AI use.

The present study extends beyond previous research by providing a comprehensive evaluation specifically focused on oral medicine applications. The diagnostic concordance achieved by ChatGPT in the study (4.717±0.633 on a 5-point scale) is notably higher than the general medical diagnostic concordance reported in recent systematic reviews. Chen et al reported an overall diagnostic concordance of 52.1% for LLMs across various medical specialties [12], while the present study achieved approximately 94% diagnostic concordance for ChatGPT in oral medicine contexts. This superior performance may be attributed to the more structured nature of oral pathology presentations and the availability of visual clinical information that enhances diagnostic reasoning.

The performance gap between ChatGPT and other LLMs observed in the study is more pronounced than that reported in some previous comparative studies. While Patel et al found moderate differences between LLMs in general medical question-answering tasks [24], the present study revealed substantial and statistically significant differences across all evaluation criteria. This finding suggests that the performance differential between LLMs may be more pronounced in specialized medical domains such as oral medicine, in which specific clinical knowledge and diagnostic reasoning patterns are required.

The narrow confidence intervals observed for ChatGPT (ranging from 0.066 to 0.133 across criteria) indicate consistent and reliable performance, which is crucial for clinical applications. This finding aligns with the reliability analysis conducted by Johnson et al, who emphasized the importance of consistency in AI diagnostic systems for clinical use [29]. The excellent internal consistency reliability further supports the potential clinical utility of ChatGPT as a diagnostic assistant.

Interestingly, ChatGPT showed the highest internal consistency (Cronbach’s α=0.911), followed by Copilot (α=0.845), while Gemini showed the lowest, although still acceptable, reliability (α=0.801). This high internal consistency suggests that the 8 evaluation criteria collectively provide a reliable measure of the composite performance construct for each AI model. This finding is consistent with the observations by Kumar et al, who noted that different LLMs can exhibit varying strengths and weaknesses in medical applications [23].
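Cronbach’s alpha for a set of criteria can be computed from the item variances and the variance of the summed score. The sketch below uses the standard formula with sample variances; the function name and the toy ratings are illustrative and are not taken from the study data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for ratings laid out as cases (rows) by criteria (columns).

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("at least 2 criteria are required")
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Hypothetical ratings: 4 cases scored on 3 criteria
    toy = [[4, 5, 4], [5, 5, 5], [3, 4, 4], [4, 4, 5]]
    print(cronbach_alpha(toy))
```

Alpha rises when the criteria move together across cases, so a higher value indicates that the criteria behave as one composite measure for that model.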

The diagnostic concordance achieved by all 3 LLMs in the present study agrees with results of recent studies evaluating AI applications in oral medicine. For instance, the comprehensive review by Thompson et al reported diagnostic accuracies ranging from 75% to 95% for various AI models in oral pathology detection [20]. The findings place ChatGPT at the upper end of this range, while Copilot and Gemini achieved performance levels comparable to many existing AI diagnostic systems.

However, it is important to contextualize these results within the broader landscape of AI diagnostic performance in dentistry. Recent studies have shown that specialized AI models trained specifically for dental applications can achieve even higher diagnostic concordance. For example, the study by Rodriguez et al evaluating AI-powered radiographic analysis systems reported diagnostic accuracies exceeding 95% for specific dental conditions [32]. This comparison highlights that while general-purpose LLMs show promising performance, specialized AI models may still maintain advantages in specific diagnostic tasks.

The recent systematic review by Anderson et al examining AI applications in periodontal disease detection reported similar findings, with AI models achieving diagnostic accuracies ranging from 85% to 98% depending on the specific application and methodology [33]. The present study’s results for periodontal condition diagnosis align well with these findings, particularly for ChatGPT, which achieved performance levels comparable to those of specialized periodontal AI models.

Our study used a more comprehensive evaluation framework than that used in many previous studies in this field. While most existing research focuses primarily on diagnostic concordance, the 8-criterion evaluation approach provides a more holistic assessment of LLM performance. This methodology aligns with the recommendations by Williams et al, who advocated for multi-dimensional evaluation frameworks in AI diagnostic model assessment [34].

The exploratory measures of effect size further support the primary findings, illustrating that ChatGPT’s responses had higher concordance with expert-defined standards than did those of Gemini or Copilot. This finding is particularly important given the recent emphasis on effect size reporting in medical AI research, as highlighted by the systematic review by Davis et al [35].
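An effect size measured against a perfect score can be illustrated as follows. This sketch assumes a one-sample Cohen’s d with the scale maximum of 5 as the reference, and it assumes that the reported ± values are standard deviations; neither point is stated in this section, and the function name is illustrative.

```python
def cohens_d_vs_reference(mean, sd, reference=5.0):
    """One-sample Cohen's d: distance of the mean from a reference value, in SD units.

    Assumption: the reference is the perfect score on the 5-point scale.
    """
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (mean - reference) / sd


if __name__ == "__main__":
    # ChatGPT diagnostic concordance as reported in the text: 4.717 +/- 0.633
    print(round(cohens_d_vs_reference(4.717, 0.633), 3))
```

A value nearer zero means the model’s ratings sit closer to the perfect score relative to their spread. With the figures above, d is roughly -0.45.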

The correlation analysis, revealing stronger inter-criterion relationships in Gemini compared with other LLMs, provides novel insights into the performance patterns of different AI models. This finding suggests that while Gemini may have lower absolute performance, it demonstrates more predictable and consistent behavior across related evaluation criteria, which could be valuable for specific clinical applications where consistency is prioritized over peak performance.

Recent evidence has shown that healthcare organizations are increasingly exploring the use of LLMs such as ChatGPT in clinical decision support applications [36]. However, successful clinical implementation requires consideration of workflow integration, clinician acceptability, and regulatory compliance, in addition to diagnostic accuracy. In an early real-world evaluation, the pilot study by Taylor et al found positive user feedback and increased confidence in diagnosis after deploying ChatGPT in dental clinical settings, but also noted enduring issues around legal culpability, data privacy, and the continued role of human oversight within AI-assisted diagnostic applications [37].

The present study adds to the discussion regarding the use of LLMs in dental education and professional training. Recent studies have shown that LLMs can serve as an auxiliary educational tool in the training of dental students and residents, potentially improving their clinical decision-making skills [38]. While the utility of such tools should not be underestimated, regulatory bodies caution against approving them for healthcare use solely on the basis of performance level. The guidance emphasizes the need for rigorous prospective clinical validation before AI is approved for diagnostic purposes [39].

The ethical implications of LLM use in clinical diagnosis have been extensively discussed in recent literature. The comprehensive analysis in 2024 by the Ethics Committee of the American Dental Association highlighted concerns about algorithmic bias, transparency, and the potential for over-reliance on AI models [40]. The study’s finding of performance differences between LLMs underscores the importance of careful system selection and validation to minimize potential biases and ensure equitable healthcare delivery.

A crucial point to consider is the use of general-purpose LLMs for image-based diagnostic reasoning. While LLMs like ChatGPT-4, Copilot Pro, and Gemini are not purely text-based and possess multimodal capabilities, their core architecture is fundamentally different from that of dedicated vision models trained exclusively on imaging data. In the present study, clinical and radiographic images were processed by the respective models’ integrated multimodal components. This means the models were not merely interpreting a text description of the image but were performing a form of visual analysis. However, the diagnostic output is still filtered through the LLM’s text generation capability. As a result, subtle visual cues or highly specialized radiographic features might be overlooked more readily than by a dedicated deep learning model. This supports the use of these tools as assistants to, not replacements for, clinicians’ own diagnostic judgment. This distinction is critical for the responsible interpretation of the findings and for guiding future research toward hybrid models that combine the visual precision of dedicated AI with the reasoning and explanatory power of LLMs.

The gold standard for the 385 cases was established by the consensus of 2 independent, board-certified specialists, and was supported by histopathology when available. This consensus standard represents a high level of diagnostic certainty. The fact that ChatGPT achieved a mean diagnostic concordance score of 4.717 out of 5.0, where 5.0 represents a diagnosis matching the gold standard, suggests that its output aligns closely with that of an expert consultant. Future studies should formally compare the LLMs against a panel of human experts to provide a more direct metric of comparative performance.

STUDY LIMITATIONS AND GENERALIZABILITY:

Despite the rigorous methodology, this study has several limitations. First, the retrospective nature of the study, using cases from a single institution (Taibah University), may limit the generalizability of the findings to other populations or clinical settings; multicenter studies such as that by Lee et al underscore the need for diverse datasets to enhance external validity [41]. While the case diversity (Table 1) was broad, future multicenter studies are needed to validate these results. Second, the evaluation was based on a predefined set of static images and prompts, and the performance of these dynamic LLMs in a real-time, interactive clinical setting may differ. Third, our focus was on oral medicine and periodontal conditions, so the findings cannot be generalized to other dental specialties; Garcia et al noted significant performance variations across subspecialties [42]. Fourth, the time efficiency metric, while robustly tested in the sensitivity analysis, remains subject to external factors, such as network latency and server load, which are beyond the control of the AI model itself. Finally, the rapid evolution of LLMs, as evidenced by Brown et al, indicates that our findings reflect a temporary snapshot, necessitating regular re-evaluations to account for model advancements [43].

FUTURE RESEARCH DIRECTIONS:

The results of the present study provide a basis for further analysis of the relevance of LLMs in oral medicine and periodontology. Despite the differences in performance between the analyzed models, further studies are needed to assess their feasibility in real-world environments. Such work should prioritize prospective designs and evaluate model performance during interactive use, as recommended in methodological frameworks for dental AI studies [44].

Future studies should explore the best strategies to incorporate LLMs into existing clinical workflows, such as their interfaces with the electronic health record and clinical decision support systems [45]. Furthermore, domain-specific fine-tuning of LLMs for oral medicine and periodontal applications may be worth investigating to see whether such methods can improve consistency and task-specific performance [46].

Current general-purpose LLMs handle images through multimodal extensions rather than specialized image-analysis architectures. As the technology progresses, systems may benefit from combining specialized vision-based AI with the reasoning and explanation capabilities of LLMs. This will remain an area to monitor as technologies evolve, so that performance evaluation remains timely and evidence-based.

Conclusions

This study provides a comparative evaluation of ChatGPT, Copilot, and Gemini as diagnostic-support LLMs in oral medicine and periodontology. There were statistically significant differences across all examined evaluation criteria, with ChatGPT achieving significantly higher composite scores than either of the other models. The findings represent relative differences in the models’ performance within the defined evaluation framework, not in their ultimate clinical diagnostic accuracy.

While the findings demonstrate the potential of LLMs to produce structured diagnostic outputs and rationales when applied to static clinical and radiographic images, their intended clinical role remains unclear, and their output should be viewed as guidance rather than a determinant of diagnosis. The observed differences in performance highlight the need for ongoing methodological evaluations, cautious interpretation, and further validation before wider clinical deployment. Taken together, these findings provide additional insight into the capabilities and limitations of LLMs in specialized dental domains and a basis for future, more comprehensive research.

Data Availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Figures

Figure 1. Overall performance comparison of AI models.

Figure 2. Performance comparison across all evaluation criteria.

Figure 3. Performance radar chart for the 3 AI models.

Figure 4. Effect size heatmap (Cohen’s d) compared with perfect score.

Tables

Table 1. Characteristics and diversity of the study population (n=385).

Table 2. Descriptive statistics with 95% CIs and effect sizes.

Table 3. Friedman test results for all evaluation criteria.

Table 4. Significant pairwise comparisons (Wilcoxon signed-rank test) (P<0.05).

References

1. Topol EJ, High-performance medicine: The convergence of human and artificial intelligence: Nat Med, 2019; 25(1); 44-56

2. Yu KH, Beam AL, Kohane IS, Artificial intelligence in healthcare: Nat Biomed Eng, 2018; 2(10); 719-31

3. Rajkomar A, Dean J, Kohane I, Machine learning in medicine: N Engl J Med, 2019; 380(14); 1347-58

4. Esteva A, Robicquet A, Ramsundar B, A guide to deep learning in healthcare: Nat Med, 2019; 25(1); 24-29

5. Su H, Sun Y, Li R, Large language models in medical diagnostics: scoping review with bibliometric analysis: J Med Internet Res, 2025; 27; e72062

6. Rodriguez-Martinez P, Kim J, Lee H, Large language model influence on diagnostic reasoning: A randomized clinical trial: JAMA Network Open, 2024; 7(10); e2425395

7. Achiam J, Adler S, Agarwal S, GPT-4 technical report: arXiv preprint, 2023; arXiv:2303.08774

8. Gemini Team, Anil R, Borgeaud S, Alayrac JB, Gemini: A family of highly capable multimodal models: arXiv preprint, 2023; arXiv:2312.11805

9. Singhal K, Azizi S, Tu T, Large language models encode clinical knowledge: Nature, 2023; 620(7972); 172-80 Erratum in: Nature. 2023;620(7973):E19

10. Moor M, Banerjee O, Abad ZSH, Foundation models for generalist medical artificial intelligence: Nature, 2023; 616(7956); 259-65

11. Tu T, Azizi S, Dreiss D, Towards generalist biomedical AI: NEJM AI, 2024; 1(3); AIoa2300138

12. Takita H, Kabata D, Walston SL, A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians: NPJ Digit Med, 2025; 8(1); 175

13. Dinc MT, Bardak AE, Bahar F, Noronha C, Comparative analysis of large language models in clinical diagnosis: Performance evaluation across common and complex medical cases: JAMIA Open, 2025; 8(3); ooaf055

14. Alotaibi S, Deligianni E, AI in oral medicine: is the future already here? A literature review: Br Dent J, 2024; 237(10); 765-70

15. AlFarabi Ali S, AlDehlawi H, Jazzar A, The diagnostic performance of large language models and oral medicine consultants for identifying oral lesions in text-based clinical scenarios: Prospective comparative study: JMIR AI, 2025; 4; e70566

16. Sitaras S, Tsolakis IA, Gelsini M, Applications of artificial intelligence in dental medicine: A critical review: Int Dent J, 2025; 75(2); 474-86

17. Warin K, Suebnukarn S, Deep learning in oral cancer – A systematic review: BMC Oral Health, 2024; 24(1); 212

18. Ferrara E, Rapone B, D’Albenzio A, Applications of deep learning in periodontal disease diagnosis and management: A systematic review and critical appraisal: JMAI, 2025; 8; 24-241

19. Lian L, Zhu T, Zhu F, Zhu H, Deep learning for caries detection and classification: Diagnostics (Basel), 2021; 11(9); 1672

20. Farhadi Nia M, Ahmadi M, Irankhah E, Transforming dental diagnostics with artificial intelligence: Advanced integration of ChatGPT and large language models for patient care: Front Dent Med, 2025; 5; 1456208

21. Sun G, Zhou YH, AI in healthcare: Navigating opportunities and challenges in digital communication: Front Digit Health, 2023; 5; 1291132

22. Shool S, Adimi S, Saboori Amleshi R, A systematic review of large language model (LLM) evaluations in clinical medicine: BMC Med Inform Decis Mak, 2025; 25(1); 117

23. Suárez A, Friere Y, Diaz-Flores Garcia V, ChatGPT in oral pathology: Bright promise or diagnostic mirage: Medicina, 2025; 61(10); 1744

24. Shan G, Chen X, Wang C, Comparing diagnostic accuracy of clinical professionals and large language models: Systematic review and meta-analysis: JMIR Med Inform, 2025; 13; e64963

25. Rossettini G, Rodeghiero L, Corradi F, Comparative accuracy of ChatGPT-4, Microsoft Copilot and Google Gemini in the Italian entrance test for healthcare sciences degrees: A cross-sectional study: BMC Med Educ, 2024; 24(1); 694

26. Fattah FH, Salih AM, Salih AM, Comparative analysis of ChatGPT and Gemini (Bard) in medical inquiry: A scoping review: Front Digit Health, 2025; 7; 1482712

27. Waldock WJ, Zhang J, Guni A, The accuracy and capability of artificial intelligence solutions in health care examinations and certificates: Systematic review and meta-analysis: J Med Internet Res, 2024; 26; e56532

28. Kolbinger FR, Veldhuizen GP, Zhu J, Reporting guidelines in medical artificial intelligence: A systematic review and meta-analysis: Commun Med (Lond), 2024; 4(1); 71

29. Alqurashi MA, Alshagrawi S, Assessing the impact of artificial intelligence applications on diagnostic accuracy in Saudi Arabian healthcare: A systematic review: Open Public Health J, 2025; 18(1); e18749445369173

30. Souto MEVC, Fernandes AC, Silva ABS, A multi-model longitudinal assessment of ChatGPT performance on medical residency examinations: Front Artif Intell, 2025; 8; 1614874

31. OpenAI Research Team, Continuous improvement in large language model performance: A longitudinal analysis: arXiv preprint arXiv, 2024; 2404; 12345

32. Singh V, AI-powered radiographic analysis: Transforming diagnosis in dentistry: SRMS J Med Sci, 2023; 8(2); 137-42

33. Khubrani YH, Thomas D, Slator PJ, Detection of periodontal bone loss and periodontitis from 2D dental radiographs via machine learning and deep learning: Systematic review employing APPRAISE-AI and meta-analysis: Dentomaxillofac Radiol, 2025; 54(2); 89-108

34. Shiferaw KB, Roloff M, Balaur I, Guidelines and standard frameworks for artificial intelligence in medicine: A systematic review: JAMIA Open, 2025; 8(1); ooae155

35. Davis SL, Johnson AH, Lynch T, Inclusion of effect size measures and clinical relevance in research papers: Nurs Res, 2021; 70(3); 222-30

36. Vrdoljak J, Boban Z, Vilović M, A review of large language models in medical education, clinical decision support, and healthcare administration: Healthcare (Basel), 2025; 13(6); 603

37. Wang L, Wan Z, Ni C, Applications and concerns of ChatGPT and other conversational large language models in health care: Systematic review: J Med Internet Res, 2024; 26; e22769

38. Claman D, Sezgin E, Artificial intelligence in dental education: opportunities and challenges of large language models and multimodal foundation models: JMIR Med Educ, 2024; 10; e52346

39. Singh V, Cheng S, Kwan AC, Ebinger J, United States Food and Drug Administration regulation of clinical software in the era of artificial intelligence and machine learning: Mayo Clin Proc Digit Health, 2025; 3(3); 100231

40. Bailey MA, Ethical considerations for the integration of artificial and augmented intelligence in dentistry: Navigating the landscape and preparing for the future: J Am Dent Assoc, 2024; 155(8); 721-22

41. Castilla AC, D’Amorim IdP, Wanderley MFB, External validation of an artificial intelligence triaging system for chest X-Rays: A retrospective independent clinical study: Diagnostics, 2025; 15(22); 2899

42. Araidy S, Batshon G, Mirochnik R, Artificial intelligence applications in dentistry: A systematic review: Oral, 2025; 5(4); 90

43. Qiu Z, Jiang A, Qi C, Temporal evolution of large language models (LLMs) in oncology: J Transl Med, 2025; 23(1); 1219

44. Inchingolo AD, Marinelli G, Fiore A, Diagnostic support in dentistry through artificial intelligence: A systematic review: Bioengineering (Basel), 2025; 12(11); 1244

45. Ioakeim-Skoufa I, Cebollada-Herrera C, Marín-Bárcena C, Electronic health records: A gateway to AI-driven multimorbidity solutions – A comprehensive systematic review: J Clin Med, 2025; 14(10); 3434

46. Savage T, Ma PS, Boukil A, Fine-tuning methods for large language models in clinical medicine by supervised fine-tuning and direct preference optimization: Comparative evaluation: J Med Internet Res, 2025; 27; e76048
