The development of chromatin immunoprecipitation followed by sequencing (ChIP-seq) has enabled the construction of genome-wide maps of protein-DNA interactions. Such maps provide information about transcriptional regulation at the epigenetic level (histone modifications and histone variants) and at the level of transcription factor (TF) activity.
This dissertation presents novel computational methods for ChIP-seq data analysis and their applications. It addresses four main challenges. First, I address the problem of detecting histone modifications in ChIP-seq data from cancer samples. Copy number variations (CNVs) in cancer samples introduce statistical biases that lead to inaccurate predictions when standard methods are used. To overcome this issue, I developed HMCan, an algorithm designed specifically to handle ChIP-seq cancer data by accounting for the presence of CNVs. On ChIP-seq data from cancer cells, HMCan produces unbiased predictions that are more accurate than those of standard state-of-the-art methods.
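To illustrate why copy number matters, the sketch below rescales per-bin read counts to the level expected at normal ploidy. It is not HMCan's actual normalization: the function name, the fixed ploidy of 2, and the per-bin copy-number input are all assumptions made for illustration.

```python
def normalize_by_copy_number(bin_counts, copy_numbers, ploidy=2.0):
    """Rescale per-bin read counts to the level expected at normal ploidy.

    Hypothetical helper for illustration only; it does not reproduce HMCan.
    bin_counts   -- read counts per genomic bin
    copy_numbers -- estimated copy number of each bin (same length)
    ploidy       -- reference copy number (assumed to be 2 here)
    """
    if len(bin_counts) != len(copy_numbers):
        raise ValueError("bin_counts and copy_numbers must have equal length")

    normalized = []
    for count, copy_number in zip(bin_counts, copy_numbers):
        if copy_number <= 0:
            # Homozygous deletion: no DNA present, so no signal to rescale.
            normalized.append(0.0)
        else:
            # An amplified bin yields more reads for the same enrichment,
            # so scale its count down (and scale a lost bin up).
            normalized.append(count * ploidy / copy_number)
    return normalized


# A bin amplified to 4 copies with 20 reads matches a diploid bin with 10.
corrected = normalize_by_copy_number([10, 20, 30], [2, 4, 1])
```

Without such a correction, an amplified region with ordinary enrichment can look like a strong peak, and a region with copy loss can hide a real one.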
Second, I address the problem of identifying changes in histone modifications between two ChIP-seq samples with different genetic backgrounds (for example, cancer vs. normal). In addition to CNVs, differences in antibody efficiency between samples and the presence of sample replicates make this problem challenging. To overcome these issues, I developed the HMCan-diff algorithm as an extension of HMCan. HMCan-diff implements robust normalization methods to address the challenges listed above, and it significantly outperforms other state-of-the-art methods on data containing cancer samples.
Third, I investigate and analyze enhancer predictions made by different methods based on ChIP-seq data. The analysis shows that predictions generated by different methods overlap poorly. To overcome this issue, I developed DENdb, a database that integrates enhancer predictions from different methods. DENdb also integrates several types of experimental data, including ChIP-seq data for TF binding sites.
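One common way to quantify how well two sets of predicted regions agree is the base-pair Jaccard index. The sketch below computes it for half-open intervals on a single chromosome. The function names and the choice of metric are assumptions for illustration and are not necessarily the measure used in the dissertation.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bases(intervals):
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


def jaccard_overlap(regions_a, regions_b):
    """Base-pair Jaccard index: shared bases divided by bases in either set."""
    union = covered_bases(list(regions_a) + list(regions_b))
    if union == 0:
        return 0.0
    # Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
    intersection = covered_bases(regions_a) + covered_bases(regions_b) - union
    return intersection / union


# Two hypothetical enhancer calls that share 50 of 150 covered bases.
score = jaccard_overlap([(0, 100)], [(50, 150)])
```

A value near 0 means the two methods call largely different regions, and a value near 1 means they largely agree.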
Finally, I present an extensive computational comparison of ab initio motif identification methods based on TF ChIP-seq data. The comparison covered 10 methods and 159 TF datasets. Its results indicate that simple methods outperform high-order models.
|Date of Award||Apr 25 2017|
|Original language||English (US)|
- Computer, Electrical and Mathematical Science and Engineering
|Supervisor||Vladimir Bajic (Supervisor)|
- computer science
- machine learning
- transcription regulation