Logo Medical Science Monitor


07 May 2026: Lab/In Vitro Research  

Preliminary Evaluation of Large Language Models in Kennedy Classification and Removable Partial Denture Planning: An In Silico Study

Ebru Sümer Ekin ABDEF 1*, Yalçın Değer BDF 2, Berivan Dündar Yilmaz CDE 2, İsmail Yildiz C 3

DOI: 10.12659/MSM.953353

Med Sci Monit 2026; 32:e953353


Abstract


BACKGROUND: This study evaluated text-based large language models for Kennedy classification and removable partial denture planning in partially edentulous cases, where comparative evidence on accuracy remains limited.

MATERIAL AND METHODS: Twenty-six partially edentulous cases, defined according to the Fédération Dentaire Internationale tooth numbering system and classified as Kennedy Classes I to IV, were included. Case scenarios were constructed by an experienced prosthodontist. Questions were independently submitted to 4 large language models without prior training or prompting: ChatGPT-5.2, Claude Sonnet 4.6, Gemini 3 Flash, and Perplexity. Responses were evaluated by 2 independent prosthodontists (different from the case developer) – blinded to the identity of each model – using predefined criteria for Kennedy classification accuracy and prosthetic planning consistency. Statistical analyses included intergroup comparisons and effect size estimation.

RESULTS: Statistically significant differences were identified among the models in both tasks (P<0.001). For Kennedy classification, the effect size was moderate (Kendall’s W=0.402). Gemini 3 Flash achieved the highest mean score, followed by ChatGPT-5.2, Perplexity, and Claude Sonnet 4.6. Similarly, significant differences were observed in removable partial denture planning performance (Kendall’s W=0.424); Gemini 3 Flash scored significantly higher than the other models (P≤0.001).

CONCLUSIONS: Large language models showed variable performance under zero-shot conditions. Although Gemini 3 Flash achieved higher scores, the moderate effect sizes warrant cautious interpretation. These models may serve as adjunctive decision-support tools in prosthodontics but cannot replace clinical judgment.

Keywords: Artificial Intelligence; Classification; Denture, Partial, Removable; Prosthodontics

Introduction

Removable partial dentures (RPDs) remain a mainstay in the rehabilitation of partial edentulism due to their cost-effectiveness, reversibility, and broad range of clinical indications [1]. Their clinical success depends on accurate diagnosis and classification of the partially edentulous arch, as well as appropriate prosthetic design and treatment planning based on established biomechanical principles [2].

The Kennedy classification system is widely used to describe patterns of partial edentulism in RPD treatment planning. Its consistent application relies on adherence to Applegate’s rules, which standardize classification by defining criteria such as the timing of extractions, consideration of posterior teeth, and identification of modification spaces. These principles improve classification accuracy and reliability in both clinical and research settings [1]. In the present study, cases were classified according to the Kennedy classification system following Applegate’s rules.
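As an illustration of how these rules operate on a tooth chart, the following Python sketch derives a Kennedy class and modification count from a set of missing FDI tooth numbers for one arch. It is a simplified reading of the rules and not the rubric used in this study: third molars are ignored, every other missing tooth is assumed to be replaced, and the function names `arch_order` and `kennedy_class` are hypothetical.

```python
def arch_order(upper=True):
    """FDI numbers from right second molar to left second molar.

    Third molars are left out on purpose (simplifying assumption).
    """
    right, left = (1, 2) if upper else (4, 3)
    return ([right * 10 + t for t in range(7, 0, -1)]
            + [left * 10 + t for t in range(1, 8)])


def kennedy_class(missing, upper=True):
    """Return (Kennedy class, number of modification spaces) for one arch."""
    order = arch_order(upper)
    absent = [tooth in missing for tooth in order]
    if not any(absent) or all(absent):
        raise ValueError("arch must be partially edentulous")

    # Collect contiguous edentulous spans as (start_index, end_index).
    runs, start = [], None
    for i, gone in enumerate(absent):
        if gone and start is None:
            start = i
        if not gone and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(order) - 1))

    last = len(order) - 1
    # A span that reaches the end of the arch is a distal extension.
    distal = [r for r in runs if r[0] == 0 or r[1] == last]
    if len(distal) == 2:
        return "I", len(runs) - 2
    if len(distal) == 1:
        return "II", len(runs) - 1
    # Indices 6 and 7 are the two central incisors, so a single span
    # covering both crosses the midline: Class IV, which takes no modifications.
    if len(runs) == 1 and runs[0][0] <= 6 and runs[0][1] >= 7:
        return "IV", 0
    # Otherwise all spans are tooth-bounded: Class III plus modifications.
    return "III", len(runs) - 1
```

For example, `kennedy_class({46, 47, 34}, upper=False)` treats the missing right molars as the distal extension that sets the class and the isolated left premolar space as one modification.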

Artificial intelligence is increasingly utilized in dentistry for diagnostic support, treatment planning, and outcome prediction [3]. In prosthodontics, artificial intelligence contributes to prosthetic design, implant planning, and digital workflow integration, enabling more efficient and predictable treatment processes [4,5]. Digital technologies such as intraoral scanning and computer-aided design/computer-aided manufacturing systems further enhance prosthesis accuracy and reduce manual errors [6,7].

Large language models are artificial intelligence systems capable of generating contextually coherent responses and performing complex natural language processing tasks, including analytical reasoning and domain-specific problem solving [8,9]. Despite their increasing use, the reproducibility of large-language-model-based evaluations remains a concern because outputs may vary according to prompt structure and contextual inputs. Standardized evaluation frameworks are essential to ensure consistency and comparability across studies.

Recent artificial-intelligence-based approaches have been introduced for dental arch classification and decision support in RPD treatment planning. For example, convolutional neural network models have demonstrated high accuracy in classifying partially edentulous arches [10], and artificial-intelligence-driven systems have been developed to assist clinicians in selecting appropriate prosthetic designs [11]. However, these approaches are primarily image-based and fundamentally differ from text-based large language model systems, highlighting the need for dedicated evaluation of large language model performance in text-driven prosthodontic tasks. Despite these advances, standardized comparative evidence evaluating the performance of large language models in structured, expertise-dependent prosthodontic tasks (eg, Kennedy classification and RPD planning) remains limited.

In the present study, the performance of large language models was evaluated under standardized, task-specific zero-shot conditions to better reflect routine clinical scenarios. The models were required to perform 2 primary tasks using partially edentulous cases defined according to the Fédération Dentaire Internationale tooth numbering system [12]: (1) determination of Kennedy classification and identification of modification spaces, and (2) proposal of RPD treatment plans consistent with established biomechanical principles. ChatGPT-5.2, Claude Sonnet 4.6, Gemini 3 Flash, and Perplexity were evaluated as pretrained large language models. The aim of this study was to comparatively evaluate the performance of large language models in Kennedy classification and RPD planning tasks under standardized, task-specific zero-shot conditions.

Material and Methods

STUDY DESIGN:

This study was designed as an in silico comparative performance evaluation based on the analysis of responses generated by publicly accessible large language models. Given that no biological material derived from human or animal subjects was used, ethical committee approval was not required.

Within the scope of the study, 26 partially edentulous cases were constructed according to the Fédération Dentaire Internationale tooth numbering system and structured within the framework of Kennedy Classes I to IV. For each case, the large language models were required to perform 2 primary tasks: (1) accurately determine the Kennedy classification and the number of modification spaces, and (2) propose an RPD treatment plan consistent with biomechanical principles.

The partially edentulous scenarios were systematically constructed by a prosthodontist with at least 15 years of clinical experience, considering the principles of Kennedy classification and the designation of modification spaces. The scenarios were structured as follows:

With this configuration, 26 distinct partially edentulous scenarios were generated, encompassing various arch types and modification combinations. The prosthodontist responsible for constructing the case scenarios did not participate in evaluation or scoring of the model-generated responses to avoid potential bias. The full list of constructed case scenarios is provided in Table 1 to allow independent assessment of case distribution and reproducibility.

EVALUATION OF LARGE LANGUAGE MODELS:

Prompts were submitted between January 20 and 25, 2026, by a single researcher using newly created user accounts through the publicly accessible free web versions of large-language-model artificial intelligence tools. The models included in the evaluation and their accessed versions were as follows: ChatGPT (GPT-5.2 free web version, OpenAI, USA), Claude (Claude Sonnet 4.6 free web version, Anthropic, USA), Gemini (Gemini 3 Flash free web version, Google, USA), and Perplexity (Perplexity AI free web version; version number not explicitly specified by the platform, Perplexity AI, USA). Free versions were specifically selected to reflect the most widely accessible tools and ensure methodological consistency by avoiding variability associated with different subscription tiers and update cycles, rather than to evaluate performance under premium or optimized conditions.

All models were accessed via standard web browser interfaces without the use of APIs or developer tools. The selected models were chosen to represent widely used, publicly accessible large language models with diverse architectures and design characteristics, enabling comparative evaluation across different system paradigms in clinical text-based tasks. Models were evaluated using their default versions and standard settings under identical user conditions. External browsing, search grounding, and retrieval tools were disabled where available to ensure that outputs reflected base model performance only. All prompts were manually entered in English using the same device and network conditions. The exact prompt template used for all cases is reported below to ensure reproducibility.

To avoid influence from prior interactions, each case was evaluated in a separate chat session; the conversation history was cleared at the beginning of each session. This approach ensured that responses were generated solely from the provided input, with minimal contextual priming and sequential effects [13]. Each model produced only a single response per case; no revisions or additional prompting were allowed. All models were evaluated independently using an identical standardized prompt to reduce prompt-structure-related variability [14]. No additional training, fine-tuning, or example-based prompting was applied.

The standardized prompt used in the study was as follows: “Based on the missing tooth numbers specified according to the FDI (Fédération Dentaire Internationale) system, determine the Kennedy classification and, if present, the modification spaces (state only the number) in accordance with Applegate’s rules. Subsequently, briefly propose a removable partial denture (RPD) treatment plan consistent with established prosthodontic principles.” The same prompt structure was applied to all cases without modification, and no follow-up prompts or clarifications were allowed. Accordingly, the prompting approach used in this study can be defined as a standardized, task-specific zero-shot setting, as the models were provided with structured task instructions without any prior task-specific training, fine-tuning, or iterative prompting.
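A minimal sketch of how a per-case input could be assembled around this fixed prompt is shown below. The text does not state how the missing teeth were written in each case, so the `Missing teeth (FDI):` line and the helper name `build_case_prompt` are assumptions for illustration only; the prompt wording itself is copied from the paragraph above.

```python
# Prompt text quoted from the study; everything else here is hypothetical.
STANDARD_PROMPT = (
    "Based on the missing tooth numbers specified according to the FDI "
    "(Fédération Dentaire Internationale) system, determine the Kennedy "
    "classification and, if present, the modification spaces (state only "
    "the number) in accordance with Applegate’s rules. Subsequently, briefly "
    "propose a removable partial denture (RPD) treatment plan consistent "
    "with established prosthodontic principles."
)


def build_case_prompt(missing_teeth):
    """Combine one case's missing teeth with the unchanged standard prompt."""
    teeth = ", ".join(str(t) for t in sorted(missing_teeth))
    # Assumed case header format; the source does not specify it.
    return f"Missing teeth (FDI): {teeth}\n\n{STANDARD_PROMPT}"
```

Keeping the instruction text in a single constant mirrors the zero-shot design: only the case data changes between submissions, and each result would be entered once in a fresh chat session.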

This analysis was limited to partial edentulism scenarios. In routine clinical practice, additional factors – such as tooth mobility, periodontal status, radiographic findings, and occlusal relationships – may further influence the definitive treatment plan.

EVALUATION OF PROSTHETIC CLASSIFICATION AND RPD PLANNING:

The accuracies of prosthetic classification and modification space identification were evaluated according to Kennedy classification principles. Prosthetic planning was assessed based on major connector selection, direct and indirect retainers, support, stability, and overall biomechanical design. All artificial-intelligence-generated prosthetic plans were evaluated using a component-based scoring system based on structured assessment of key prosthodontic components.

Responses were documented in Microsoft Word, anonymized, and assessed by 2 independent prosthodontists with at least 15 years of clinical experience who were blinded to the identity of each model. Scoring was performed using a predefined rubric. Kennedy classification accuracy was evaluated using a 3-point Likert scale (0=incorrect, 1=partially correct, 2=completely correct) (Table 2), and RPD planning quality was assessed using a 5-point Likert scale (1=very poor, 5=excellent) (Table 3) [15,16]. The rubric was developed based on established prosthodontic principles, including the Kennedy classification and Applegate’s rules; its content validity was evaluated by 3 senior prosthodontists using Lynn’s Content Validity Index method [17–19]. Independent scoring was performed prior to consensus.
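For readers unfamiliar with the Content Validity Index, the sketch below shows the calculation in its commonly described form: each expert rates an item for relevance on a 4-point scale, the item-level index is the proportion of experts rating it 3 or 4, and the scale-level average is the mean of the item-level values. The individual expert ratings are not reported here, so the example ratings and the function names are hypothetical.

```python
def item_cvi(ratings):
    """Proportion of experts who rated the item 3 or 4 (relevant)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(1 for r in ratings if r >= 3) / len(ratings)


def scale_cvi_average(items):
    """Mean of the item-level indices across all rubric items."""
    return sum(item_cvi(r) for r in items) / len(items)


# Hypothetical relevance ratings from 3 experts for 2 rubric items.
example_items = [[4, 4, 3], [4, 3, 2]]
example_scale_cvi = scale_cvi_average(example_items)
```

With only 3 experts, a single dissenting rating lowers an item's index to about 0.67, which is why small panels are usually expected to agree unanimously.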

Discrepancies between evaluators were recorded and resolved through discussion to reach consensus; the level of agreement before consensus is reflected in the reported quadratic weighted kappa statistics. Final statistical analyses were conducted using consensus-based scores. To enhance reproducibility, all experimental conditions, including model versions, access dates, prompt structure, and evaluation criteria, were standardized and explicitly defined. Kennedy classification scores are presented in Table 4; because both evaluators assigned identical scores for all cases, a single consensus score is reported. Detailed individual and consensus RPD planning scores for each model, as assessed by the 2 prosthodontists, are presented in Table 5.

STATISTICAL ANALYSIS:

Statistical analyses were performed using SPSS Statistics version 21.0 (IBM Corp., Armonk, NY, USA). Continuous variables are presented as mean±standard deviation and median (minimum–maximum) values; categorical variables are expressed as frequencies and percentages (%). Because the Likert-type scores were ordinal and not normally distributed, the models were compared using the nonparametric Friedman test. Effect size was reported using Kendall’s coefficient of concordance (Kendall’s W). For statistically significant Friedman test results, pairwise comparisons were performed using the Dunn-Bonferroni procedure, with the Bonferroni correction controlling the type I error rate across multiple comparisons.

Inter-rater agreement between the 2 prosthodontists was assessed using quadratic weighted kappa statistics. All statistical analyses were conducted based on consensus scores derived from the 2 independent prosthodontists to minimize inter-evaluator variability. All analyses were 2-tailed, and P-values <0.05 were considered statistically significant.
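The quadratic weighted kappa used for inter-rater agreement can be sketched as follows. This is a plain-Python illustration of the standard formula (disagreement weights proportional to the squared distance between categories), not the SPSS routine; the function name and the choice to raise an error when kappa is undefined are assumptions.

```python
def quadratic_weighted_kappa(rater_a, rater_b, categories):
    """Cohen's kappa with quadratic disagreement weights.

    categories: ordered list of possible scores, e.g. [1, 2, 3, 4, 5].
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be paired and non-empty")
    index = {c: i for i, c in enumerate(categories)}
    k, n = len(categories), len(rater_a)

    # Cross-tabulate the two raters' scores.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1
    row_tot = [sum(r) for r in observed]
    col_tot = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    obs_dis = exp_dis = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            obs_dis += weight * observed[i][j]
            exp_dis += weight * row_tot[i] * col_tot[j] / n  # chance-expected count
    if exp_dis == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1 - obs_dis / exp_dis
```

Because the weights grow with the square of the score difference, a 1-versus-5 disagreement on the RPD planning scale is penalized far more heavily than a 3-versus-4 disagreement.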

Results

Complete inter-rater agreement (quadratic weighted κ=1.00) was observed for Kennedy classification and modification scores. Regarding RPD planning scores, weighted κ values were 0.80 for ChatGPT, 0.84 for Claude, 0.54 for Gemini, and 0.74 for Perplexity, corresponding to substantial, almost perfect, moderate, and substantial agreement, respectively, according to Landis and Koch [20].
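The verbal labels above follow the Landis and Koch bands, which can be written as a small lookup. The function name is hypothetical; the cut-offs are the conventional ones from that scheme.

```python
def landis_koch_label(kappa):
    """Map a kappa value to its Landis and Koch agreement label."""
    if kappa < 0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


# Weighted kappa values reported for the RPD planning scores.
reported = {"ChatGPT": 0.80, "Claude": 0.84, "Gemini": 0.54, "Perplexity": 0.74}
labels = {model: landis_koch_label(k) for model, k in reported.items()}
```

Note that 0.80 sits on the upper edge of the substantial band, so the ChatGPT value is labeled substantial rather than almost perfect.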

The Friedman test for Kennedy classification and modification scores revealed a statistically significant difference among the 4 large language models (N=26; χ²(3)=31.333; P<0.001). The effect size was moderate (Kendall’s W=0.402). Mean scores were as follows: Gemini 3 Flash (1.84±0.37), ChatGPT-5.2 (1.54±0.58), Perplexity (1.04±0.77), and Claude Sonnet 4.6 (0.81±0.80). Dunn-Bonferroni-adjusted pairwise comparisons demonstrated statistically significant differences between Gemini 3 Flash and Claude Sonnet 4.6 (P<0.001), and between Gemini 3 Flash and Perplexity (P=0.005). Additionally, ChatGPT-5.2 scored significantly higher than Claude Sonnet 4.6 (P=0.043). All reported P-values were adjusted for multiple comparisons. No statistically significant differences were observed in the remaining pairwise comparisons (P>0.05). The distribution of classification-modification scores is illustrated in Figure 1.

The Friedman test for RPD planning scores demonstrated a statistically significant difference among the models (N=26; χ²(3)=33.108; P<0.001). The effect size was moderate (Kendall’s W=0.424). Mean scores were as follows: Gemini 3 Flash (4.23±0.71), ChatGPT-5.2 (3.04±1.18), Perplexity (2.85±0.92), and Claude Sonnet 4.6 (2.73±1.25). Bonferroni-adjusted pairwise comparisons demonstrated that Gemini 3 Flash scored significantly higher than all other models (ChatGPT-5.2, Claude Sonnet 4.6, and Perplexity) (P≤0.001 for all comparisons). No statistically significant differences were observed among the remaining models (P>0.05). The distribution of RPD planning scores is illustrated in Figure 2. Table 6 summarizes the mean and median scores for each model in classification and RPD planning performance.
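As a consistency check, the reported effect sizes agree with the standard relation between the Friedman chi-square and Kendall's W, taking N=26 cases and k=4 models:

```latex
W = \frac{\chi^2}{N(k-1)}, \qquad
W_{\text{classification}} = \frac{31.333}{26 \times 3} \approx 0.402, \qquad
W_{\text{RPD planning}} = \frac{33.108}{26 \times 3} \approx 0.424
```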

Discussion

LIMITATIONS:

This study was conducted in silico and has not been validated under clinical conditions. The evaluation was solely based on predefined text-based scenarios and did not incorporate visual or multimodal data, which may influence clinical decision-making. Additionally, only publicly accessible model versions were examined; subsequent updates or architectural modifications may affect future performance.

Given the evolving nature of large language models, including version updates and dynamic parameter optimization, identical prompts may yield variable outputs over time, hindering reproducibility. Furthermore, each prompt was submitted only once in a new chat session for each model, and repeated trials were not performed; accordingly, response variability across multiple trials was not assessed.

Conclusions

Large language models demonstrated differences in both Kennedy classification and RPD planning performance under standardized, task-specific zero-shot conditions. Although Gemini 3 Flash achieved higher scores, the moderate effect sizes indicate that these findings should be interpreted with caution. These models may serve as adjunctive decision-support tools in prosthodontic workflows; however, their outputs are influenced by input structure and evaluation criteria. Therefore, they should not be considered a substitute for professional clinical judgment, and further validation in real clinical settings is required.

References

1. Carr AB, Brown DT: McCracken’s removable partial prosthodontics, 2016, St Louis, Elsevier

2. Mousa MA, Abdullah JY, Jamayet NB, Biomechanics in removable partial dentures: A literature review of FEA-based studies: Biomed Res Int, 2021; 2021; 5699962

3. Karchi RP, Nagaraj E, Kondody RT, Artificial intelligence in prosthodontics: New paradigm shift: J Dent Med Sci, 2023; 22(7); 6-9

4. Pareek M, Kaushik B, Artificial intelligence in prosthodontics: A scoping review on current applications and future possibilities: Int J Adv Med, 2022; 9; 367-70

5. Zubair A, Babu ADKBS, Ganesan L, Artificial intelligence in prosthodontics: Revolutionizing removable partial denture design and fabrication: Oral Health Res Clin Evid, 2025; 2(2); 86-89

6. Modgil S, Hutton TJ, Hammond P, Combining biometric and symbolic models for customised, automated prosthesis design: Artif Intell Med, 2002; 25; 227-45

7. Soni M, Soni P, Soni P, The role of digital workflow in customizing the prosthetic solutions: A literature review: J Pharm Bioallied Sci, 2025; 17(2); 1095-97

8. Deiana G, Dettori M, Arghittu A, Artificial intelligence and public health: Evaluating ChatGPT responses to vaccination myths and misconceptions: Vaccines, 2023; 11(7); 1217

9. Taşyürek M, Adıgüzel Ö, Gündoğan M, Comparative evaluation of the responses from ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1 chatbots to patient inquiries about endodontic treatment: Int Dent Res, 2025; 15(3); 123-35

10. Takahashi T, Nozaki K, Gonda T, A system for designing removable partial dentures using artificial intelligence. Part 1. Classification of partially edentulous arches using a convolutional neural network: J Prosthodont Res, 2021; 65(1); 115-18

11. Chen Q, Wu J, Li S, An ontology-driven, case-based clinical decision support model for removable partial denture design: Sci Rep, 2016; 6; 27855

12. International Organization for Standardization: ISO 3950: 2016 Dentistry – designation system for teeth and areas of the oral cavity, 2016, Geneva, International Organization for Standardization

13. Antaki F, Touma S, Milad D, Evaluating the performance of ChatGPT in ophthalmology: An analysis of its successes and shortcomings: Ophthalmol Sci, 2023; 3; 100324

14. Kung TH, Cheatham M, Medenilla A, Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models: PLoS Digit Health, 2023; 2(2); e0000198

15. Boone HN, Boone DA, Analyzing Likert data: J Ext, 2012; 50(2); 48

16. Sarıbaş E, Determination of dental anxiety levels in dental faculty students: Int Dent Res, 2023; 13(2); 90-98

17. Polit DF, Beck CT, The content validity index: Are you sure what’s being reported?: Res Nurs Health, 2006; 29(5); 489-97

18. Rutherford-Hemming T, Determining content validity and reporting a content validity index for simulation scenarios: Nurs Educ Perspect, 2015; 36(6); 389-93

19. Francis L, Viswambharan P, Nair VV, Comparative evaluation of accuracy, completeness and readability of common patient queries related to prosthodontic treatment by two artificial intelligence models: Cureus, 2025; 17(12); e98458

20. Landis JR, Koch GG, The measurement of observer agreement for categorical data: Biometrics, 1977; 33(1); 159-74

21. Alkaissi H, McFarlane SI, Artificial hallucinations in ChatGPT: Implications in scientific writing: Cureus, 2023; 15(2); e35179

22. Dashti M, Londono J, Ghasemi S, How much can we rely on artificial intelligence chatbots such as the ChatGPT software program to assist with scientific writing?: J Prosthet Dent, 2025; 133(4); 1082-88

23. Anthropic: Introducing Claude [Internet], 2023 [cited 2026 Feb 24]. Available from: https://www.anthropic.com/claude

24. AI Perplexity: About Perplexity [Internet], 2023 [cited 2026 Feb 25]. Available from: https://www.perplexity.ai

25. Freire Y, Laorden AS, Pérez JO, ChatGPT performance in prosthodontics: Assessment of accuracy and repeatability in answer generation: J Prosthet Dent, 2024; 131(4); 659e1-6

26. Huang C, Wang J, Wang S, A review of deep learning in dentistry: Neurocomputing, 2023; 554; 126629

27. Takahashi T, Nozaki K, Gonda T, Deep learning based detection of dental prostheses and restorations: Sci Rep, 2021; 11(1); 1960

28. Chen Q, Lin S, Wu J, Automatic drawing of customized removable partial denture diagrams: J Oral Sci, 2020; 62(2); 236-38

29. Kutlu IU, Yıldırım A, Is ChatGPT a reliable instrument for prosthetic dentistry?: Gumus Univ Saglik Bilim Derg, 2025; 14(4); 1372-80

30. Gravina AG, Pellegrino R, Palladino G, Charting new AI education in gastroenterology: Dig Liver Dis, 2024; 56(8); 1304-11

31. Taşyürek M, Adıgüzel Ö, Cangül S, Comparative evaluation of responses provided by ChatGPT-5 and other models: J Med Dent Invest, 2025; 6; e25016

32. Er Öİ, Adıgüzel Ö, Taşyürek M, Comparison of chatbot responses to dental avulsion injuries: H Sci Mon, 2025; 3(1); e251201

33. Mutlucan UO, Zortuk Ö, Evaluation of AI models in patient education about lumbar disc herniation: J Med Dent Invest, 2025; 6; e250131

34. Khurshid Z, Waqas M, Hasan S, Deep learning architecture to infer Kennedy classification: Int Dent J, 2025; 75; 223-35

35. Dashti M, Khosraviani F, Azimi T, Assessing ChatGPT-4’s performance on the US prosthodontic exam: BMC Med Educ, 2025; 25(1); 761


Medical Science Monitor eISSN: 1643-3750