Autors: Gancheva, V. S., Stoev H.
Title: Optimization and Performance Analysis of CAT Method for DNA Sequence Similarity Searching and Alignment
Keywords: bioinformatics, biological data, DNA sequences, metadata, performance analysis, sequence alignment, similarity searching

Abstract: Bioinformatics is a rapidly developing field enabling scientific experiments via computer models and simulations. In recent years, there has been an extraordinary growth in biological databases. Therefore, it is extremely important to propose effective methods and algorithms for the fast and accurate processing of biological data. Sequence comparisons are the best way to investigate and understand the biological functions and evolutionary relationships between genes on the basis of the alignment of two or more DNA sequences in order to maximize the identity level and degree of similarity. This paper presents a new version of the pairwise DNA sequences alignment algorithm, based on a new method called CAT, where a dependency with a previous match and the closest neighbor are taken into consideration to increase the uniqueness of the CAT profile and to reduce possible collisions, i.e., two or more sequence with the same CAT profiles. This makes the proposed algorithm suitable for finding the exact match of a concrete DNA sequence in a large set of DNA data faster. In order to enable the usage of the profiles as sequence metadata, CAT profiles are generated once prior to data uploading to the database. The proposed algorithm consists of two main stages: CAT profile calculation depending on the chosen benchmark sequences and sequence comparison by using the calculated CAT profiles. Improvements in the generation of the CAT profiles are detailed and described in this paper. Block schemes, pseudo code tables, and figures were updated according to the proposed new version and experimental results. Experiments were carried out using the new version of the CAT method for DNA sequence alignment and different datasets. New experimental results regarding collisions, speed, and efficiency of the suggested new implementation are presented. Experiments related to the performance comparison with Needleman–Wunsch were re-executed with the new version of the algorithm to confirm that we have the same performance. A performance analysis of the proposed algorithm based on the CAT method against the Knuth–Morris–Pratt algorithm, which has a complexity of O(n) and is widely used for biological data searching, was performed. The impact of prior matching dependencies on uniqueness for generated CAT profiles is investigated. The experimental results from sequence alignment demonstrate that the proposed CAT method-based algorithm exhibits minimal deviation, which can be deemed negligible if such deviation is considered permissible in favor of enhanced performance. It should be noted that the performance of the CAT algorithm in terms of execution time remains stable, unaffected by the length of the analyzed sequences. Hence, the primary benefit of the suggested approach lies in its rapid processing capabilities in large-scale sequence alignment, a task that traditional exact algorithms would require significantly more time to perform.

References

  1. EMBL’s European Bioinformatics Institute (EMBL-EBI) Available online: https://www.ebi.ac.uk/about/our-impact (accessed on 30 November 2023)
  2. Luo R. Liu B. Xie Y. Li Z. Huang W. Yuan J. He G. Chen Y. Pan Q. Liu Y. et al. SOAPdenovo2: An empirically improved memory-efficient short-read de novo assembler Gigascience 2012 1 18 10.1186/2047-217X-1-18 23587118
  3. Needleman S.B. Wunsch C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins J. Mol. Biol. 1970 48 443 453 10.1016/0022-2836(70)90057-4 5420325
  4. Smith T.F. Waterman M.S. Identification of common molecular subsequences J. Mol. Biol. 1981 147 195 197 10.1016/0022-2836(81)90087-5 7265238
  5. Regnier M. Knuth-Morris-Pratt algorithm: An analysis Mathematical Foundations of Computer Science 1989—Proceedings of the Porabka-Kozubnik, Poland, August 28–September 1, 1989. Proceedings Lecture Notes in Computer Science Kreczmar A. Mirkowska G. Springer Berlin/Heidelberg, Germany 1989 Volume 379 10.1007/3-540-51486-4_90
  6. Altschul S.F. Gish W. Miller W. Myers E.W. Lipman D.J. Basic local alignment search tool J. Mol. Biol. 1990 215 403 410 10.1016/S0022-2836(05)80360-2 2231712
  7. Altschul S.F. Madden T.L. Schäffer A.A. Zhang J. Zhang Z. Miller W. Lipman D.J. Gapped BLAST and PSIBLAST: A new generation of protein database search programs Nucleic Acids Res. 1997 25 3389 3402 10.1093/nar/25.17.3389 9254694
  8. Borovska P. Gancheva V. Landzhev N. Massively parallel algorithm for multiple biological sequences alignment In Proceeding of the 36th IEEE International Conference on Telecommunications and Signal Processing (TSP) Rome, Italy 2–4 July 2013 638 642 10.1109/TSP.2013.6614014
  9. Gancheva V. Stoev H. An algorithm for pairwise DNA sequences alignment Bioinformatics and Biomedical Engineering—Proceedings of the 10th International Work-Conference, IWBBIO 2023, Meloneras, Gran Canaria, Spain, July 12–14, 2023, Proceedings, Part I Springer Cham, Switzerland 2023 Volume 13919 10.1007/978-3-031-34953-9_4
  10. Liu Y. Yan Y. Ren J. Marshall S. Sequence similarity alignment algorithm in bioinformatics: Techniques and challenges Advances in Brain Inspired Cognitive Systems—Proceedings of the 10th International Conference, BICS 2019, Guangzhou, China, July 13–14, 2019, Proceedings Lecture Notes in Computer Science Ren J. Hussain A. Zhao H. Huang K. Zheng J. Cai J. Chen R. Xiao Y. Springer Cham, Switzerland 2020 Volume 11691 10.1007/978-3-030-39431-8_53
  11. Karp R.M. Rabin M.O. Efficient randomized pattern-matching algorithms IBM J. Res. Dev. 1987 31 249 260 10.1147/rd.312.0249
  12. Harde P. Comparative study of string matching algorithms for DNA dataset Int. J. Comput. Sci. Eng. 2018 6 1067 1074 10.26438/ijcse/v6i5.10671074
  13. Tun N. Thin M. Comparison of three pattern matching algorithms using DNA Sequences Int. J. Sci. Eng. Technol. Res. 2014 3 6916 6920
  14. Chao J. Tang F. Xu L. Developments in algorithms for sequence alignment: A review Biomolecules 2022 12 546 10.3390/biom12040546 35454135
  15. Spouge J.L. Speeding up dynamic programming algorithms for finding optimal lattice paths SIAM J. Appl. Math. 1989 49 1552 1566 10.1137/0149094
  16. Zhang F. Qiao X.Z. Liu Z.Y. A parallel Smith-Waterman algorithm based on divide and conquer Proceedings of the Fifth International Conference on Algorithms and Architectures for Parallel Processing ICA3PP Beijing, China 23–25 October 2002 10.1109/ICAPP.2002.1173568
  17. Lipman D.J. Pearson W.R. Rapid and sensitive protein similarity searches Science 1985 227 1435 1441 10.1126/science.2983426 2983426
  18. Pearson W.R. Lipman D.J. Improved tools for biological sequence comparison Proc. Natl. Acad. Sci. USA 1988 85 2444 2448 10.1073/pnas.85.8.2444 3162770
  19. Pedretti K. Casavant T. Braun R. Scheetz T. Birkett C. Roberts C. Three complementary approaches to parallelization of local BLAST service on workstation clusters Fifth International Conference on Parallel Computing Technologies (PaCT)—Proceedings of the 5th International Conference, PaCT-99, St. Petersburg, Russia, September 6–10, 1999 Proceedings Lecture Notes in Computer Science (LNCS) Springer Berlin/Heidelberg, Germany 1999 Volume 1662
  20. Costa R. Lifschitz S. Database allocation strategies for parallel BLAST evaluation on clusters Distrib. Parallel Databases 2003 13 99 127 10.1023/A:1021569823663
  21. Oehmen C. Nieplocha J. ScalaBLAST: A scalable implementation of BLAST for high-performance data-intensive bioinformatics analysis IEEE Trans. Parallel Distrib. Syst. 2006 17 740 749 10.1109/TPDS.2006.112
  22. Thorsen O. Jiang K. Peters A. Smith B. Lin H. Feng W. Sosa C. Parallel genomic sequence-search on a massively parallel system Proceedings of the ACM 4th International Conference on Computing Frontiers Ischia, Italy 7–9 May 2007
  23. Lin H. Balaji P. Poole R. Sosa C. Ma X. Feng W.C. Massively parallel genomic sequence search on the Blue Gene/P architecture Proceedings of the ACM/IEEE Conference on Supercomputing Austin, TX, USA 25 August 2009
  24. Farrar M. Striped Smith-Waterman speeds database searches six times over other SIMD implementations Bioinformatics 2007 23 156 161 10.1093/bioinformatics/btl582 17110365
  25. Sathe S.R. Shrimankar D.D. Parallelizing and analyzing the behavior of sequence alignment algorithm on a cluster of workstations for large datasets Int. J. Comput. Appl. 2013 74 18 30
  26. Kaur K. Chakraborty S. Gupta M.K. Accelerating Smith-Waterman algorithm for faster sequence alignment using graphical processing unit Phys. Conf. Ser. 2022 2161 012028 10.1088/1742-6596/2161/1/012028
  27. Lipták P. Kiss A. Szalai-Gindl J.M. Heuristic pairwise alignment in database environments Genes 2022 13 2005 10.3390/genes13112005 36360242
  28. Grešová K. Vaculík O. Alexiou P. Using attribution sequence alignment to interpret deep learning models for miRNA binding site prediction Biology 2023 12 369 10.3390/biology12030369 36979061
  29. Petty T. Hannig J. Huszar T.I. Iyer H. A New string edit distance and applications Algorithms 2022 15 242 10.3390/a15070242
  30. Gancheva V. Stoev H. DNA sequence alignment method based on trilateration Bioinformatics and Biomedical Engineering—Proceedings of the 7th International Work-Conference, IWBBIO 2019, Granada, Spain, May 8–10, 2019, Proceedings, Part II Lecture Notes in Computer Science Springer Cham, Switzerland 2019 271 283 11466 10.1007/978-3-030-17935-9_25

Issue

Genes, vol. 15, 2024, , https://doi.org/10.3390/genes15030341

Вид: статия в списание, публикация в издание с импакт фактор, публикация в реферирано издание, индексирана в Scopus и Web of Science