A Predictive Framework for Integrating Disparate Genomic Data Types Using Sample-Specific Gene Set Enrichment Analysis and Multi-Task Learning
Understanding the root molecular and genetic causes driving complex traits is a fundamental challenge in genomics and genetics. Numerous studies have used variation in gene expression to understand complex traits, but the underlying genomic variation that contributes to these expression changes is not well understood. In this study, we developed a framework to integrate gene expression and genotype data to identify biological differences between samples from opposing complex trait classes that are driven by expression changes and genotypic variation. This framework utilizes pathway analysis and multi-task learning to build a predictive model and discover pathways relevant to the complex trait of interest. We simulated expression and genotype data to test the predictive ability of our framework and to measure how well it uncovered pathways with genes both differentially expressed and genetically associated with a complex trait. We found that the predictive performance of the multi-task model was comparable to other similar methods. Also, methods like multi-task learning that considered enrichment analysis scores from both data sets found pathways with both genetic and expression differences related to the phenotype. We used our framework to analyze differences between estrogen receptor (ER) positive and negative breast cancer samples. An analysis of the top 15 gene sets from the multi-task model showed they were all related to estrogen, steroids, cell signaling, or the cell cycle. Although our study suggests that multi-task learning does not enhance predictive accuracy, the models generated by our framework do provide valuable biological pathway knowledge for complex traits.