Increasing the Power to Detect Causal Associations by Combining Genotypic and Expression Data in Segregating Populations
To dissect common human diseases such as obesity and diabetes, a systematic approach is needed to study how genes interact with one another, and with genetic and environmental factors, to determine clinical end points or disease phenotypes. Bayesian networks provide a convenient framework for extracting relationships from noisy data and are frequently applied to large-scale data to derive causal relationships among variables of interest. Given the complexity of molecular networks underlying common human disease traits, and the fact that biological networks can change depending on environmental conditions and genetic factors, large datasets, generally involving multiple perturbations (experiments), are required to reconstruct and reliably extract information from these networks. With limited resources, the experimental design must balance coverage of multiple perturbations against the number of subjects within a single perturbation. Increasing the number of experiments, or the number of subjects in an experiment, is an expensive and time-consuming way to improve network reconstruction. Integrating multiple types of data from existing subjects might be more efficient. For example, it has recently been demonstrated that combining genotypic and gene expression data in a segregating population leads to improved network reconstruction, which in turn may lead to better predictions of the effects of experimental perturbations on any given gene. Here we simulate data based on networks reconstructed from biological data collected in a segregating mouse population and quantify the improvement in network reconstruction achieved using genotypic and gene expression data, compared with reconstruction using gene expression data alone.
We demonstrate that networks reconstructed using the combined genotypic and gene expression data achieve a level of reconstruction accuracy that exceeds that of networks reconstructed from expression data alone, and that fewer subjects may be required to achieve this superior reconstruction accuracy. We conclude that this integrative genomics approach to reconstructing networks not only leads to more predictive network models, but also may save time and money by decreasing the amount of data that must be generated under any given condition of interest to construct predictive network models.

Complex phenotypes such as common human diseases are caused by variations in DNA in many genes that interact in complex ways with a number of environmental factors. These multifactorial gene and environmental perturbations induce changes in molecular networks that in turn lead to phenotypic changes in the organism under study. The comprehensive monitoring of transcript abundances using gene expression microarrays in different tissues over a large number of individuals in a population can be used to reconstruct molecular networks that underlie higher-order phenotypes such as disease. The cost of generating these large-scale gene activity measurements over large numbers of individuals can be prohibitive. However, by integrating DNA variation and gene activity data monitored in each individual in a given population of interest, we demonstrate that the power to elucidate molecular networks that drive complex phenotypes can be significantly enhanced, without increasing the sample size. Using a biologically realistic simulation framework, we demonstrate that molecular networks reconstructed using the combined DNA variation and gene activity data are more accurate than molecular networks reconstructed from gene activity data alone, implying that adding DNA variation data might allow us to use fewer subjects to produce molecular networks that better explain complex phenotypes such as disease.
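To give some intuition for why genotypic data can help orient edges that expression data alone leaves ambiguous, the toy simulation below may be useful. It is not the Bayesian network reconstruction procedure used in this work. It is a minimal sketch under assumed conditions: a single hypothetical locus `L` with three genotype classes, a linear effect of `L` on a transcript `A`, and a linear effect of `A` on a second transcript `B`, with unit-variance Gaussian noise. All names and parameter values are illustrative assumptions. The correlation between `A` and `B` is symmetric and says nothing about direction, whereas the partial correlations involving `L` differ between the two candidate orderings.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / ((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5


def simulate(n, seed=0):
    """Simulate the assumed causal chain L -> A -> B (hypothetical model)."""
    rng = random.Random(seed)
    # Genotype at one locus, coded 0/1/2 (assumed equally likely for simplicity).
    L = [rng.choice([0, 1, 2]) for _ in range(n)]
    # Transcript A depends on genotype; transcript B depends on A only.
    A = [g + rng.gauss(0, 1) for g in L]
    B = [a + rng.gauss(0, 1) for a in A]
    return L, A, B


L, A, B = simulate(5000)

# Expression data alone: a symmetric association with no direction.
r_ab = pearson(A, B)

# With genotype as an anchor, the two orderings make different predictions.
# Under L -> A -> B, L and B should be roughly independent given A ...
pc_LB_given_A = partial_corr(L, B, A)
# ... while L and A should remain associated given B.
pc_LA_given_B = partial_corr(L, A, B)

print(f"corr(A, B)         = {r_ab:.3f}")
print(f"pcorr(L, B | A)    = {pc_LB_given_A:.3f}")
print(f"pcorr(L, A | B)    = {pc_LA_given_B:.3f}")
```

In this sketch the asymmetry arises because variation in DNA can influence transcript levels but is not itself caused by them, so the locus acts as a fixed upstream anchor. Real data involve many loci, many transcripts, and nonlinear or confounded relationships, which is why a full network reconstruction approach is needed in practice.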