Authors: Vangelova, A.; Gancheva, V. S.
Title: AI-Based Automated Scoring Layer Using Large Language Models and Semantic Analysis
Keywords: artificial intelligence, automated scoring, Bloom’s taxonomy, large language models, natural language processing, open-ended questions, RAG, semantic analysis

Featured Application: This study presents an AI-based scoring layer for the automated assessment of open-ended student responses. The proposed framework combines large language models, Retrieval-Augmented Generation (RAG), and analytical rubrics to support criterion-based, context-grounded evaluation in e-learning environments. It can be integrated into platforms such as Moodle to assist instructors in grading, improve consistency, reduce scoring time, and support faster and more structured feedback for learners.

Abstract: Automated scoring of open-ended questions is an important research direction in educational technology and artificial intelligence, as manual grading is time-consuming and often subject to inter-rater variation. This paper proposes an AI-based framework for automated scoring that combines large language models (LLMs), Retrieval-Augmented Generation (RAG), analytical rubrics, and structured machine-readable output within a Moodle-supported e-learning environment. The framework supports context-grounded, criterion-based evaluation by combining the student response, retrieved instructional context, and rubric-defined scoring criteria within a controlled assessment workflow. The proposed approach aims to improve the consistency, traceability, and practical applicability of automated scoring for open-ended responses. To examine its performance, an experimental study was conducted in a real university setting involving a five-task open-ended examination. AI-generated scores were compared with independent human scores using agreement, reliability, correlation, and error metrics. The results indicate a strong level of agreement between automated and expert scoring within the tested setting, together with relatively low average deviation. These findings suggest that the proposed framework has practical potential for supporting automated assessment in digital learning environments, while also highlighting the importance of careful interpretation within the scope of the experimental design.
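To illustrate the metric families named above (error and correlation between AI-generated and human scores), the following sketch computes mean absolute error and Pearson correlation for a five-task examination. The score values are invented for demonstration only and do not come from the study.

```python
# Hypothetical illustration: comparing AI-generated and human scores for a
# five-task open-ended examination. All score values are invented examples.
import statistics

human = [8.0, 6.5, 9.0, 7.0, 5.5]  # independent human scores per task
ai    = [7.5, 6.5, 8.5, 7.5, 5.0]  # AI-generated scores per task

# Mean absolute error: the average deviation between the two score sets.
mae = sum(abs(h - a) for h, a in zip(human, ai)) / len(human)

# Pearson correlation: strength of the linear relationship between raters.
def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

r = pearson(human, ai)
print(f"MAE = {mae:.2f}, Pearson r = {r:.2f}")
```

In practice the study also reports reliability and agreement statistics (e.g., intraclass correlation and kappa coefficients, per refs. 44 and 48), which are typically computed with a statistics package rather than by hand.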

References

  1. Pecuchova, J.; Benko, Ľ.; Drlik, M. Automated Grading of Open-Ended Questions in Higher Education Using GenAI Models. Int. J. Artif. Intell. Educ. 2025, 35, 3813–3846. https://doi.org/10.1007/s40593-025-00517-2
  2. Jauhiainen, J.; Guerra, A.G. Evaluating Students’ Open-Ended Written Responses with LLMs: Using the RAG Framework for GPT-3.5, GPT-4, Claude-3, and Mistral-Large. Adv. Artif. Intell. Mach. Learn. 2024, 4, 3097–3113. https://doi.org/10.54364/AAIML.2024.44177
  3. Tang, X.; Chen, H.; Lin, D.; Li, K. Harnessing LLMs for Multi-Dimensional Writing Assessment: Reliability and Alignment with Human Judgments. Heliyon 2024, 10, e34262. https://doi.org/10.1016/j.heliyon.2024.e34262
  4. Yeung, S.A. Comparative Study of Rule-Based, Machine Learning and Large Language Model Approaches in Automated Writing Evaluation (AWE). In Proceedings of the 15th International Learning Analytics and Knowledge Conference (LAK’25), Dublin, Ireland, 3–7 March 2025; pp. 984–991. https://doi.org/10.1145/3706468.3706566
  5. Lan, G.; Li, Y.; Yang, J.; He, X. Investigating a Customized Generative AI Chatbot for Automated Essay Scoring in a Disciplinary Writing Task. Assess. Writ. 2025, 66, 100959. https://doi.org/10.1016/j.asw.2025.100959
  6. Grévisse, C. LLM-Based Automatic Short Answer Grading in Undergraduate Medical Education. BMC Med. Educ. 2024, 24, 1060. https://doi.org/10.1186/s12909-024-06026-5
  7. Latif, E.; Zhai, X. Fine-Tuning ChatGPT for Automatic Scoring. Comput. Educ. Artif. Intell. 2024, 6, 100210. https://doi.org/10.1016/j.caeai.2024.100210
  8. Xu, J.; Liu, J.; Lin, M.; Lin, J.; Yu, S.; Zhao, L.; Shen, J. EPCTS: Enhanced Prompt-Aware Cross-Prompt Essay Trait Scoring. Neurocomputing 2025, 621, 129283. https://doi.org/10.1016/j.neucom.2024.129283
  9. Mendonça, P.C.; Quintal, F.; Mendonça, F. Evaluating LLMs for Automated Scoring in Formative Assessments. Appl. Sci. 2025, 15, 2787. https://doi.org/10.3390/app15052787
  10. Qiu, H.; White, B.; Ding, A.; Costa, R.; Hachem, A.; Ding, W.; Chen, P. SteLLA: A Structured Grading System Using LLMs with RAG. arXiv 2025, arXiv:2501.09092. https://doi.org/10.48550/arXiv.2501.09092
  11. Chu, S.; Kim, J.; Wong, B.; Yi, M. Rationale Behind Essay Scores: Enhancing S-LLM’s Multi-Trait Essay Scoring with Rationale Generated by LLMs. arXiv 2025, arXiv:2410.14202. https://doi.org/10.48550/arXiv.2410.14202
  12. Seßler, K.; Fürstenberg, M.; Bühler, B.; Kasneci, E. Can AI Grade Your Essays? A Comparative Analysis of Large Language Models and Teacher Ratings in Multidimensional Essay Scoring. In Proceedings of the 15th International Learning Analytics and Knowledge Conference, Dublin, Ireland, 3–7 March 2025; pp. 462–472. https://doi.org/10.1145/3706468.3706527
  13. Papachristou, I.; Dimitroulakos, G.; Vassilakis, C. Automated Test Generation and Marking Using LLMs. Electronics 2025, 14, 2835. https://doi.org/10.3390/electronics14142835
  14. Emirtekin, E. Large Language Model-Powered Automated Assessment: A Systematic Review. Appl. Sci. 2025, 15, 5683. https://doi.org/10.3390/app15105683
  15. Gao, R.; Merzdorf, H.E.; Anwar, S.; Hipwell, M.C.; Srinivasa, A.R. Automatic Assessment of Text-Based Responses in Post-Secondary Education: A Systematic Review. Comput. Educ. Artif. Intell. 2024, 6, 100206. https://doi.org/10.1016/j.caeai.2024.100206
  16. Zlatkin-Troitschanskaia, O.; Fischer, J.; Braun, H.I.; Shavelson, R.J. Advantages and Challenges of Performance Assessment of Student Learning in Higher Education. In International Encyclopedia of Education, 4th ed.; Elsevier: Amsterdam, The Netherlands, 2023; pp. 312–330. https://doi.org/10.1016/B978-0-12-818630-5.02055-8
  17. Sun, J.; Song, T.; Peng, W.; Song, J. A Survey of Automated Essay Scoring: Challenges, Advances, and Future. Neurocomputing 2025, 650, 130916. https://doi.org/10.1016/j.neucom.2025.130916
  18. Dikli, S. An Overview of Automated Scoring of Essays. J. Technol. Learn. Assess. 2006, 5. Available online: https://ejournals.bc.edu/index.php/jtla/article/view/1640/1489 (accessed on 3 March 2025)
  19. Fateen, M.; Wang, B.; Mine, T. Beyond Scores: A Modular RAG-Based System for Automatic Short Answer Scoring with Feedback. IEEE Access 2024, 12, 185371–185385. https://doi.org/10.1109/ACCESS.2024.3508747
  20. Zhuang, M.; Long, S.; Martin, F.; Castellanos-Reyes, D. The Affordances of Artificial Intelligence (AI) and Ethical Considerations Across the Instruction Cycle: A Systematic Review of AI in Online Higher Education. Internet High. Educ. 2025, 67, 101039. https://doi.org/10.1016/j.iheduc.2025.101039
  21. Sychev, O.; Anikin, A.; Prokudin, A. Automatic Grading and Hinting in Open-Ended Text Questions. Cogn. Syst. Res. 2020, 59, 264–272. https://doi.org/10.1016/j.cogsys.2019.09.025
  22. Aydın, B.; Kışla, T.; Elmas, N.T.; Bulut, O. Automated Scoring in the Era of Artificial Intelligence: An Empirical Study with Turkish Essays. System 2025, 133, 103784. https://doi.org/10.1016/j.system.2025.103784
  23. Stephen, T.C.; Gierl, M.C.; King, S. Automated Essay Scoring (AES) of Constructed Responses in Nursing Examinations: An Evaluation. Nurse Educ. Pract. 2021, 54, 103085. https://doi.org/10.1016/j.nepr.2021.103085
  24. Jung, J.Y.; Tyack, L.; von Davier, M. Towards the Implementation of Automated Scoring in International Large-Scale Assessments: Scalability and Quality Control. Comput. Educ. Artif. Intell. 2025, 8, 100375. https://doi.org/10.1016/j.caeai.2025.100375
  25. Mizumoto, A.; Eguchi, M. Exploring the Potential of Using an AI Language Model for Automated Essay Scoring. Res. Methods Appl. Linguist. 2023, 2, 100050. https://doi.org/10.1016/j.rmal.2023.100050
  26. Pack, A.; Barrett, A.; Escalante, J. Large Language Models and Automated Essay Scoring of English Language Learner Writing: Insights into Validity and Reliability. Comput. Educ. Artif. Intell. 2024, 6, 100234. https://doi.org/10.1016/j.caeai.2024.100234
  27. Birla, N.; Jain, M.K.; Panwar, A. Automated Assessment of Subjective Assignments: A Hybrid Approach. Expert Syst. Appl. 2022, 203, 117315. https://doi.org/10.1016/j.eswa.2022.117315
  28. Li, X.; Chen, M.; Nie, J.-Y. SEDNN: Shared and Enhanced Deep Neural Network Model for Cross-Prompt Automated Essay Scoring. Knowl.-Based Syst. 2020, 210, 106491. https://doi.org/10.1016/j.knosys.2020.106491
  29. Wang, Q. A Multifaceted Architecture to Automate Essay Scoring for Assessing English Article Writing: Integrating Semantic, Thematic, and Linguistic Representations. Comput. Electr. Eng. 2024, 118, 109308. https://doi.org/10.1016/j.compeleceng.2024.109308
  30. Bonthu, S.; Rama Sree, S.; Krishna Prasad, M.H.M. Improving the Performance of Automatic Short Answer Grading Using Transfer Learning and Augmentation. Eng. Appl. Artif. Intell. 2023, 123, 106292. https://doi.org/10.1016/j.engappai.2023.106292
  31. Tan, L.Y.; Hu, S.; Yeo, D.J.; Cheong, K.H. A Comprehensive Review on Automated Grading Systems in STEM Using AI Techniques. Mathematics 2025, 13, 2828. https://doi.org/10.3390/math13172828
  32. Meyer, J.; Jansen, T.; Schiller, R.; Liebenow, L.W.; Steinbach, M.; Horbach, A.; Fleckenstein, J. Using LLMs to Bring Evidence-Based Feedback into the Classroom: AI-Generated Feedback Increases Secondary Students’ Text Revision, Motivation, and Positive Emotions. Comput. Educ. Artif. Intell. 2024, 6, 100199. https://doi.org/10.1016/j.caeai.2023.100199
  33. Quah, B.; Zheng, L.; Sng, T.J.H.; Yong, C.W.; Islam, I. Reliability of ChatGPT in Automated Essay Scoring for Dental Undergraduate Examinations. BMC Med. Educ. 2024, 24, 962. https://doi.org/10.1186/s12909-024-05881-6
  34. Zhao, X. A Hybrid Deep Learning and Fuzzy Logic Framework for Feature-Based Evaluation of English Language Learners. Sci. Rep. 2025, 15, 33657. https://doi.org/10.1038/s41598-025-17738-z
  35. He, X.; Xiao, X.; Fang, J.; Li, Y.; Li, Y.; Zhou, R. Exercise-Aware Higher-Order Thinking Skills Assessment via Fine-Tuned Large Language Model. Knowl.-Based Syst. 2025, 324, 113808. https://doi.org/10.1016/j.knosys.2025.113808
  36. Firoozi, T.; Bulut, O.; Gierl, M. Language Models in Automated Essay Scoring: Insights for the Turkish Language. Int. J. Assess. Tools Educ. 2023, 10, 149–163. https://doi.org/10.21449/ijate.1394194
  37. Johnsi, R.; Kumar, G.B. Enhancing Automated Essay Scoring by Leveraging LSTM Networks with Hyper-Parameter Tuned Word Embeddings and Fine-Tuned LLMs. Eng. Res. Express 2025, 7, 025272. https://doi.org/10.1088/2631-8695/adcf74
  38. Córdova-Esparza, D.-M. AI-Powered Educational Agents: Opportunities, Innovations, and Ethical Challenges. Information 2025, 16, 469. https://doi.org/10.3390/info16060469
  39. Tyndall, E.; Gayheart, C.; Some, A.; Genz, J.; Wagner, T.; Langhals, B. Impact of Retrieval Augmented Generation and Large Language Model Complexity on Undergraduate Exams Created and Taken by AI Agents. Data Policy 2025, 7, e57. https://doi.org/10.1017/dap.2025.10024
  40. Kinder, A.; Briese, F.J.; Jacobs, M.; Dern, N.; Glodny, N.; Jacobs, S.; Leßmann, S. Effects of Adaptive Feedback Generated by a Large Language Model: A Case Study in Teacher Education. Comput. Educ. Artif. Intell. 2025, 8, 100349. https://doi.org/10.1016/j.caeai.2024.100349
  41. Villegas-Ch, W.; Gutierrez, R.; García-Ortiz, J.; Guevara, V. Explainable Educational Assistant Integrated in Moodle: Automated Semantic Assessment and Adaptive Tutoring Based on NLP and XAI. Discov. Artif. Intell. 2025, 5, 191. https://doi.org/10.1007/s44163-025-00438-y
  42. Oğuz, E. Can Generative AI Figure Out Figurative Language? The Influence of Idioms on Essay Scoring by ChatGPT, Gemini, and Deepseek. Assess. Writ. 2025, 66, 100981. https://doi.org/10.1016/j.asw.2025.100981
  43. Morris, W.; Crossley, S.; Holmes, L.; Ou, C.; Dascalu, M.; McNamara, D. Formative Feedback on Student-Authored Summaries in Intelligent Textbooks Using Large Language Models. Int. J. Artif. Intell. Educ. 2025, 35, 1022–1043. https://doi.org/10.1007/s40593-024-00395-0
  44. Koo, T.K.; Li, M.Y. A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research. J. Chiropr. Med. 2016, 15, 155–163. https://doi.org/10.1016/j.jcm.2016.02.012
  45. Cisneros-González, J.; Gordo-Herrera, N.; Barcia-Santos, I.; Sánchez-Soriano, J. JorGPT: Instructor-Aided Grading of Programming Assignments with Large Language Models (LLMs). Future Internet 2025, 17, 265. https://doi.org/10.3390/fi17060265
  46. Ferreira Mello, R.; Pereira Junior, C.; Rodrigues, L.; Pereira, F.D.; Cabral, L.; Costa, N.; Ramalho, G.; Gasevic, D. Automatic Short Answer Grading in the LLM Era: Does GPT-4 with Prompt Engineering Beat Traditional Models? In Proceedings of the 15th International Learning Analytics and Knowledge Conference, Dublin, Ireland, 3–7 March 2025; pp. 93–103. https://doi.org/10.1145/3706468.3706481
  47. Cipriano, E.; Ferrato, A.; Limongelli, C.; Schicchi, D.; Taibi, D. Leveraging Large Language Models to Assist Teachers in Code Grading. In Artificial Intelligence in Education; Cristea, A.I., Walker, E., Lu, Y., Santos, O.C., Isotani, S., Eds.; Springer Nature: Cham, Switzerland, 2025; Volume 15880, pp. 204–217. https://doi.org/10.1007/978-3-031-98459-4_15
  48. Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159–174. https://doi.org/10.2307/2529310

Issue

Applied Sciences (Switzerland), vol. 16, 2026, https://doi.org/10.3390/app16073537

Type: journal article, publication in a journal with an impact factor, publication in a refereed journal, indexed in Scopus