Motivation: Oxford Nanopore sequencing, which produces long reads at low cost, has enabled many breakthroughs in genomics. However, the large number of errors in Nanopore genome assemblies reduces the accuracy of genome analysis. Polishing is a procedure that corrects errors in a genome assembly and thereby improves the reliability of downstream analysis. The performance of existing polishing methods is still not satisfactory.

Results: We developed a novel polishing method, NeuralPolish, which corrects errors in assemblies using alignment matrix construction and orthogonal Bi-GRU networks. In this method, we designed an alignment feature matrix to represent the read-to-assembly alignment. Each row of the matrix represents a read, and each column represents the bases aligned to one position of the contig. In the network architecture, a bi-directional GRU network first processes the alignment matrix row by row to extract the sequence information within each read. A second bi-directional GRU network then processes the resulting feature matrix column by column to calculate a probability distribution for each position. Finally, a CTC decoder generates the polished sequence with a greedy algorithm. We tested the method on five real data sets with three assembly tools (Wtdbg2, Flye and Canu), and compared five polishing methods: NeuralPolish, Racon, MarginPolish, HELEN and Medaka. Comprehensive experiments demonstrate that NeuralPolish produces more accurate assemblies with fewer errors than the other polishing methods, and that it improves the accuracy of assemblies obtained from different assemblers.
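The final decoding step can be illustrated with a minimal sketch of generic greedy CTC decoding: take the most probable symbol at each position, collapse consecutive repeats, then drop blanks. This is not the NeuralPolish implementation. The alphabet, the blank symbol `-`, the function name `ctc_greedy_decode` and the toy probabilities are all assumptions made for illustration.

```python
from itertools import groupby

# Assumed output alphabet; index 0 is a hypothetical CTC blank symbol.
ALPHABET = ["-", "A", "C", "G", "T"]
BLANK = "-"


def ctc_greedy_decode(probs, alphabet=ALPHABET, blank=BLANK):
    """Greedy CTC decoding of per-position probability distributions.

    probs: list of rows, each row a list of probabilities over `alphabet`.
    Returns the decoded string.
    """
    # Step 1: pick the most probable symbol at every position.
    best = [alphabet[max(range(len(row)), key=row.__getitem__)] for row in probs]
    # Step 2: collapse runs of identical consecutive symbols into one.
    collapsed = [symbol for symbol, _ in groupby(best)]
    # Step 3: remove blanks, which only separate genuine repeats.
    return "".join(symbol for symbol in collapsed if symbol != blank)


# Toy distributions (made up, not model output).
toy_probs = [
    [0.10, 0.70, 0.10, 0.05, 0.05],  # A
    [0.10, 0.60, 0.10, 0.10, 0.10],  # A again, merged with the previous A
    [0.80, 0.05, 0.05, 0.05, 0.05],  # blank
    [0.10, 0.60, 0.10, 0.10, 0.10],  # A after a blank, kept as a new base
    [0.10, 0.10, 0.10, 0.60, 0.10],  # G
    [0.10, 0.10, 0.10, 0.10, 0.60],  # T
]

decoded = ctc_greedy_decode(toy_probs)
print(decoded)  # AAGT
```

The blank symbol is what lets the decoder emit two identical bases in a row: without the blank at the third position, the three `A` predictions would collapse into a single `A`.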
ASJC Scopus subject areas
- Computational Theory and Mathematics
- Computational Mathematics
- Molecular Biology
- Statistics and Probability
- Computer Science Applications