Contributions to Computational Methods for Association Extraction from Biomedical Data: Applications to Text Mining and In Silico Toxicology

  • Arwa B. Raies

Student thesis: Doctoral Thesis


The task of association extraction involves identifying links between different entities. Here, we make contributions to two applications in the biomedical field. The first application is in the domain of text mining and aims at extracting associations between methylated genes and diseases from the biomedical literature. Gathering such associations can benefit disease diagnosis and treatment decisions. We developed the DDMGD database to provide a comprehensive repository of information related to genes methylated in diseases, gene expression, and disease progression. Using DEMGD, a text mining system that we developed, together with additional post-processing, we extracted ~100,000 such associations from free text. Estimated on 2,500 hand-curated entries, the accuracy of the extracted associations is 82%.

The second application is in the domain of computational toxicology, which aims at identifying relationships between chemical compounds and toxicity effects. Identifying the toxicity effects of chemicals is a necessary step in many processes, including drug design. To extract these associations, we propose using multi-label classification (MLC) methods. These methods have not been comprehensively benchmarked in predictive toxicology, although such benchmarking could help identify guidelines for overcoming their existing deficiencies. Therefore, we performed extensive benchmarking and analysis of ~19,000 MLC models. We demonstrated variability in the performance of these models under several conditions and determined the best-performing model, which achieves an accuracy of 91% on an independent testing set.

Finally, we propose a novel framework, LDR (learning from dense regions), for developing MLC and multi-target regression (MTR) models from datasets with missing labels. The framework is generic, so it can be applied to predict associations between samples and discrete or continuous labels. Our assessment shows that LDR performed better than the baseline approach (i.e., the binary relevance algorithm) when evaluated using four MLC and five MTR datasets. LDR achieved accuracy scores of up to 97% on testing MLC datasets and R² scores of up to 88% on testing MTR datasets. Additionally, we developed a novel method for minority oversampling to tackle the problem of imbalanced MLC datasets. Our method improved the precision score of LDR by 10%.
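
For readers unfamiliar with the baseline named above, the following is a minimal sketch of the binary relevance idea: one independent binary classifier is trained per label, and samples whose value for that label is missing are simply skipped. It is not the thesis's implementation and does not show LDR. The `NearestCentroid` class, the function names, and the toy data are all hypothetical, chosen only to keep the example self-contained.

```python
from typing import List, Optional, Sequence

Label = Optional[int]  # 1, 0, or None when the label is missing


class NearestCentroid:
    """Tiny stand-in binary classifier (an assumption, not the thesis's model)."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> "NearestCentroid":
        self.centroids = {}
        for cls in (0, 1):
            rows = [x for x, t in zip(X, y) if t == cls]
            if rows:
                # Per-feature mean of all training rows belonging to this class.
                self.centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x: Sequence[float]) -> int:
        def sq_dist(c: Sequence[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(x, c))

        # Return the class whose centroid is closest to x.
        return min(self.centroids, key=lambda cls: sq_dist(self.centroids[cls]))


def fit_binary_relevance(
    X: Sequence[Sequence[float]], Y: Sequence[Sequence[Label]]
) -> List[NearestCentroid]:
    """Train one independent classifier per label column, skipping missing labels."""
    models = []
    for j in range(len(Y[0])):
        known = [(x, row[j]) for x, row in zip(X, Y) if row[j] is not None]
        Xj = [x for x, _ in known]
        yj = [t for _, t in known]
        models.append(NearestCentroid().fit(Xj, yj))
    return models


def predict_binary_relevance(
    models: Sequence[NearestCentroid], x: Sequence[float]
) -> List[int]:
    """Predict every label for one sample by querying each per-label model."""
    return [m.predict_one(x) for m in models]


# Toy data: 4 samples, 2 features, 2 labels; None marks a missing label.
X = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
Y = [[0, 1], [0, None], [1, 0], [None, 0]]

models = fit_binary_relevance(X, Y)
prediction = predict_binary_relevance(models, [0.95, 1.0])
print(prediction)
```

Because each label is modeled in isolation, this baseline ignores any dependence between labels and discards every sample for which the target label is unknown.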
Date of Award: Nov 29 2018
Original language: English (US)
Awarding Institution
  • Computer, Electrical and Mathematical Science and Engineering
Supervisor: Vladimir Bajic


  • machine learning
  • text mining
  • computational toxicology
  • DNA methylation
  • multi-label classification
  • multi-target regression
