Variable selection in regression mixture modeling for the discovery of gene regulatory networks
Publication type:
Article
Authors:
Gupta, Mayetri; Ibrahim, Joseph G.
Affiliations:
University of North Carolina; University of North Carolina Chapel Hill
Journal:
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
ISSN/ISBN:
0162-1459
DOI:
10.1198/016214507000000068
Publication date:
2007
Pages:
867-880
Keywords:
monte-carlo
identification
algorithm
database
sites
Abstract:
The profusion of genomic data through genome sequencing and gene expression microarray technology has facilitated statistical research in determining gene interactions regulating a biological process. Current methods generally consist of a two-stage procedure: clustering gene expression measurements and searching for regulatory switches, typically short, conserved sequence patterns (motifs) in the DNA sequence adjacent to the genes. This process often leads to misleading conclusions as incorrect cluster selection may lead to missing important regulatory motifs or making many false discoveries. Treating cluster memberships as known, rather than estimated, introduces bias into analyses, preventing uncertainty about cluster parameters. Further, there is underutilization of the available data, as the sequence information is ignored for purposes of expression clustering and vice versa. We propose a way to address these issues by combining gene clustering and motif discovery in a unified framework, a mixture of hierarchical regression models, with unknown components representing the latent gene clusters, and genomic sequence features linked to the resultant gene expression through a multivariate hierarchical regression. We demonstrate a Monte Carlo method for simultaneous variable selection (for motifs) and clustering (for genes). The selection of the number of components in the mixture is addressed by computing the analytically intractable Bayes factor through a novel multistage mixture importance sampling approach. This methodology is used to analyze a yeast cell cycle dataset to determine an optimal set of motifs that discriminates between groups of genes and simultaneously finds the most significant gene clusters.