Novel computational methods for promoter identification and analysis

  • Ramzan Umarov

Student thesis: Doctoral Thesis

Abstract

Promoters are key regions that are involved in differential transcription regulation of protein-coding and RNA genes. The gene-specific architecture of promoter sequences makes it extremely difficult to devise a general strategy for their computational identification. Accurate prediction of promoters is fundamental for interpreting gene expression patterns, and for constructing and understanding genetic regulatory networks. In the last decade, the genomes of many organisms have been sequenced and their gene content has mostly been identified. Promoters and transcriptional start sites (TSS), however, are still left largely undetermined, and efficient software able to accurately predict promoters in newly sequenced genomes is not yet available in the public domain. While there have been many attempts to develop computational promoter identification methods, reliable tools to analyze long genomic sequences are still lacking. In this dissertation, I present the methods I have developed for prediction of promoters for different organisms. The first two methods, TSSPlant and PromCNN, achieved state-of-the-art performance for discriminating promoter and non-promoter sequences for plant and eukaryotic promoters respectively. For TSSPlant, a large number of features were crafted and evaluated to train an optimal classifier. PromCNN was built using a deep learning approach that extracts features from the data automatically. The trained model demonstrated the ability of a deep learning approach to grasp complex promoter sequence characteristics. For the latest method, DeeReCT-PromID, I focus on prediction of the exact positions of the TSSs inside the eukaryotic genomic sequences, testing every possible location. This is a more difficult task, requiring not only an accurate classifier, but also appropriate selection of unique predictions among multiple overlapping high-scoring genomic segments.
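The selection step described above, picking unique predictions among overlapping high-scoring segments, can be illustrated with a minimal sketch. It assumes that a classifier has already produced one score per genomic position, and it uses a greedy suppression rule; the function name `select_tss_peaks` and the parameters `threshold` and `min_distance` are hypothetical and are not taken from the thesis, which may use a different selection procedure.

```python
def select_tss_peaks(scores, threshold=0.5, min_distance=50):
    """Greedily pick non-overlapping high-scoring positions.

    scores       -- per-position classifier scores (assumed precomputed)
    threshold    -- minimum score for a position to be a candidate (assumed value)
    min_distance -- minimum spacing between two kept positions (assumed value)
    """
    # Candidate positions are those scoring at or above the threshold.
    candidates = [i for i, s in enumerate(scores) if s >= threshold]
    # Visit candidates from the highest score to the lowest.
    candidates.sort(key=lambda i: scores[i], reverse=True)

    kept = []
    for i in candidates:
        # Keep a position only if it is far enough from every kept position.
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)
    return sorted(kept)


# Toy example: two clusters of high scores collapse to one position each.
toy_scores = [0.1, 0.9, 0.8, 0.2, 0.7, 0.95, 0.3]
peaks = select_tss_peaks(toy_scores, threshold=0.5, min_distance=3)
```

In this toy example the positions next to a stronger neighbour are suppressed, so each cluster of overlapping candidates yields a single predicted site.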
The new method significantly outperforms previous promoter prediction programs by considerably reducing the number of false positive predictions. Specifically, to reduce the false positive rate, the models are adaptively and iteratively trained by changing the distribution of samples in the training set based on the false positive errors made in the previous iteration. The new methods are used to gain insights into the design principles of the core promoters. Using model analysis, I have identified the most important core promoter elements and their effect on the promoter activity. Furthermore, the importance of each position inside the core promoter was analyzed and validated using a large single nucleotide polymorphism data set. I have developed a novel general approach to detect long-range interactions in the input of a deep learning model, which was used to find related positions inside the promoter region. The final model was applied to the genomes of different species without a significant drop in performance, demonstrating the high generality of the developed method.
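The iterative training idea mentioned above can be sketched as a hard-negative loop: after each round, negatives that the current model wrongly scores as promoters are added to the training set before retraining. This is only one common way to shift the sample distribution toward false positives and is not claimed to be the exact procedure of the thesis. The function name, the `fit` and `score` callables, and all parameter values are hypothetical placeholders for a real model.

```python
import random


def train_with_hard_negatives(fit, score, positives, negative_pool,
                              n_initial=100, rounds=3, threshold=0.5, seed=0):
    """Retrain a model while adding its false positives to the negatives.

    fit(positives, negatives) -> model   (hypothetical training callable)
    score(model, sequence)    -> float   (hypothetical scoring callable)
    """
    rng = random.Random(seed)
    # Start from a random subset of the available negative sequences.
    negatives = rng.sample(negative_pool, min(n_initial, len(negative_pool)))
    model = fit(positives, negatives)

    for _ in range(rounds):
        in_training = set(negatives)
        # Negatives outside the training set that the model calls promoters.
        false_positives = [x for x in negative_pool
                           if x not in in_training and score(model, x) >= threshold]
        if not false_positives:
            break  # No remaining errors on the pool, so stop early.
        negatives.extend(false_positives)
        model = fit(positives, negatives)

    return model, negatives
```

Because `fit` and `score` are passed in, the same loop can wrap any classifier; the stopping rule and the choice to add every false positive, rather than a weighted sample, are assumptions of this sketch.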
Date of Award: Mar 2 2020
Original language: English (US)
Awarding Institution
  • Computer, Electrical and Mathematical Science and Engineering
Supervisor: Xin Gao

Keywords

  • promoter
  • deep learning
  • transcription
