Some Statistical Strategies for DAE-seq Data Analysis: Variable Selection and Modeling Dependencies Among Observations

Publication type:
Article
Authors:
Rashid, Naim; Sun, Wei; Ibrahim, Joseph G.
Affiliations:
University of North Carolina; University of North Carolina Chapel Hill
Journal:
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
ISSN/ISBN:
0162-1459
DOI:
10.1080/01621459.2013.869222
Publication date:
2014
Pages:
78-94
Keywords:
hidden markov-models; human genome; regression-models; time-series; oracle properties; chip; likelihood; chromatin; estimator; algorithm
Abstract:
In DAE (DNA after enrichment)-seq experiments, genomic regions related to certain biological processes are enriched/isolated by an assay and are then sequenced on a high-throughput sequencing platform to determine their genomic positions. Statistical analysis of DAE-seq data aims to detect genomic regions with significant aggregations of isolated DNA fragments (enriched regions) versus all the other regions (background). However, many confounding factors may influence DAE-seq signals. In addition, the signals in adjacent genomic regions may exhibit strong correlations, which invalidate the independence assumption employed by many existing methods. To mitigate these issues, we develop a novel autoregressive hidden Markov model (AR-HMM) to account for covariate effects and violations of the independence assumption. We demonstrate that our AR-HMM leads to improved performance in identifying enriched regions in both simulated and real datasets, especially in epigenetic datasets with broader regions of DAE-seq signal enrichment. We also introduce a variable selection procedure in the context of the HMM/AR-HMM where the observations are not independent and the mean value of each state-specific emission distribution is modeled by some covariates. We study the theoretical properties of this variable selection procedure and demonstrate its efficacy in simulated and real DAE-seq data. In summary, we develop several practical approaches for DAE-seq data analysis that are also applicable to more general problems in statistics. Supplementary materials for this article are available online.