Authors: Mateev, V. M., Marinova, I. I.
Title: Equation Tokenization for LLM Data Processing
Keywords: data processing, equations, LSTM, machine learning, neural network, text classification, text tokenization
Abstract: This work presents a data processing approach for tokenizing scientific equations as part of the predefined stage of language models. Equations are treated as single-line text expressions, which makes it possible to use 1D function prediction neural networks of reduced complexity. Three scenarios for using tokenized equations are considered: next token prediction, probabilistic token aggregation, and token appearance correlation with data fitting. These scenarios are inspired by text processing language models. An LSTM neural network example is implemented for predicting the next symbol in an equation.
References
- S. Kashid, K. Kumar, P. Saini, A. Negi and A. Saini, "Approach of a Multilevel Secret Sharing Scheme for Extracted Text Data," 2022 IEEE Students Conference on Engineering and Systems (SCES), Prayagraj, India, 2022, pp. 1-5, doi: 10.1109/SCES55490.2022.9887697.
- T. Islam, M. Hossain and M. F. Arefin, "Comparative Analysis of Different Text Summarization Techniques Using Enhanced Tokenization," 2021 3rd International Conference on Sustainable Technologies for Industry 4.0 (STI), Dhaka, Bangladesh, 2021, pp. 1-6, doi: 10.1109/STI53101.2021.9732589.
- S. Ananthi, R. Venkateswaran, G. Vijaya, J. Arun and R. Sathya, "Data Mining and Text Mining Approach: Query based Information Retrieval using Genetic Algorithm," 2024 5th International Conference on Image Processing and Capsule Networks (ICIPCN), Dhulikhel, Nepal, 2024, pp. 561-564, doi: 10.1109/ICIPCN63822.2024.00098.
- N. Chen, "Text Classification Model Based on Long Short-Term Memory with L2 Regularization," 2024 Second International Conference on Data Science and Information System (ICDSIS), Hassan, India, 2024, pp. 1-4, doi: 10.1109/ICDSIS61070.2024.10594621.
- P. Prakrankamanant and E. Chuangsuwanich, "Tokenization-based data augmentation for text classification," 2022 19th International Joint Conference on Computer Science and Software Engineering (JCSSE), Bangkok, Thailand, 2022, pp. 1-6, doi: 10.1109/JCSSE54890.2022.9836268.
- T. Bharadwaj, S. Ajay and S. Ahmad, "Harnessing Text Analysis for Automated Document Understanding," 2024 Second International Conference on Advances in Information Technology (ICAIT), Chikkamagaluru, Karnataka, India, 2024, pp. 1-6, doi: 10.1109/ICAIT61638.2024.10690743.
- A. Pandiaraj, N. Ramshankar and R. Venkatesan, "Unlocking the Power of Long Short-Term Memory Networks: A Text Classification Approach," 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), Delhi, India, 2023, pp. 1-6, doi: 10.1109/ICCCNT56998.2023.10308054.
- H. Liang, "Research on Pre-training Model of Natural Language Processing Based on Recurrent Neural Network," 2021 IEEE 4th International Conference on Information Systems and Computer Aided Education (ICISCAE), Dalian, China, 2021, pp. 542-546, doi: 10.1109/ICISCAE52414.2021.9590748.
- R. Hafeez, S. Khan, I. A. Khan and M. A. Abbas, "Does preprocessing really impact automatically generated taxonomy," 2017 13th International Conference on Emerging Technologies (ICET), Islamabad, Pakistan, 2017, pp. 1-6, doi: 10.1109/ICET.2017.8281710.
- S. Behera, N. Kolipakula, P. Swetha, Y. Yogish and J. R. Prathuri, "Text Classification using Machine Learning Algorithms on Encrypted Data," 2023 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), Bangalore, India, 2023, pp. 1-6, doi: 10.1109/CONECCT57959.2023.10234826.
- A. Martin-Delgado, "The new SI and the fundamental constants of nature," European Journal of Physics, vol. 41, no. 6, Art. no. 063003, 2020.
Issue: 8th International Symposium on Innovative Approaches in Smart Technologies, ISAS 2024 - Proceedings, pp. 1-4, 2024, https://doi.org/10.1109/ISAS64331.2024.10845568
Copyright: IEEE