Sequence basis of transcription initiation in the human genome

Publication type:
Article
Authors:
Dudnyk, Kseniia; Cai, Donghong; Shi, Chenlai; Xu, Jian; Zhou, Jian
Affiliations:
University of Texas System; University of Texas Southwestern Medical Center; St Jude Children's Research Hospital
Journal:
SCIENCE
ISSN/ISBN:
0036-8075
DOI:
10.1126/science.adj0116
Publication date:
2024-04-26
Keywords:
core promoters; U1 snRNP; RNA; architecture; alignment
Abstract:
INTRODUCTION The promoter is responsible for initiating the transcription process and is an essential part of any gene. However, present understanding of how promoter sequence drives transcription initiation is still incomplete and cannot explain most human promoters.
RATIONALE Each promoter has a distinct profile of transcription initiation signals that show the frequency and position of transcription start sites and can be experimentally measured at base-pair resolution. We hypothesized that these profiles reflect the underlying mechanisms of transcription initiation and that solving the machine-learning task of predicting the transcription initiation signal profile from the promoter sequence would provide insights into the sequence-based regulation of transcription initiation. To this end, we developed a deep learning-inspired explainable modeling approach, which involved first developing a deep-learning model for the prediction task and then designing a simple explainable model based on the analyses of the deep-learning model.
RESULTS With this approach, we developed an explainable machine-learning model, Puffin, and showed that a small set of sequence patterns and simple rules are sufficient to explain most human promoters. These sequence patterns have different activating or repressive effects on transcription initiation depending on their position and strand relative to the transcription start site. We identified three types of sequence patterns: motifs, initiators, and trinucleotides. Motifs are the main contributors to transcription initiation, initiators fine-tune the local preference of transcription start sites, and trinucleotides capture the residual dependencies. The effects of individual sequence patterns at each base-pair location were combined additively at the log scale.
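One way to write down the additive log-scale rule stated above is the following schematic formula. The notation is hypothetical and not taken from the paper: $\hat{y}_s(i)$ is the predicted initiation signal on strand $s$ at position $i$, $a_p(j) \ge 0$ is how strongly sequence pattern $p$ matches at position $j$, $e_{p,s}(d)$ is that pattern's effect on strand $s$ at offset $d$ from the match (positive for activating, negative for repressive), and $b_s$ is a baseline term.

```latex
\log \hat{y}_s(i) \;=\; b_s \;+\; \sum_{p \,\in\, \text{patterns}} \; \sum_{j} a_p(j)\, e_{p,s}(i - j)
```

Under this reading, a pattern's influence depends only on its match strength and on its position and strand relative to the predicted base pair, and contributions from different patterns multiply on the output scale because they add on the log scale.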
Although most motifs identified by Puffin match known transcription factor motifs, the position- and strand-specific effects of these motifs on transcription initiation had not been previously characterized. We uncovered both motifs with directional effects and motifs with bidirectional effects. Directional motifs such as TATA and YY1 have strong activating effects on transcription initiation signals, preferentially on one strand, whereas bidirectional motifs such as NFY, ETS, SP, ZNF143, NRF1, and CREB activate transcription initiation on both strands in opposite directions. Each motif has distinct strand- and position-specific effects, which likely reflect the underlying mechanisms of transcription activation. We validated the Puffin model with various experimental data, including by verifying the predicted effects of depleting the transcription factors NF-Y and YY1 on transcription initiation signals. We also developed a new CRISPR-Cap assay to assess the impact of sequence perturbations on transcription initiation signals in the native genome, and we verified that the sequence perturbation effects aligned with model predictions.
CONCLUSION Puffin enabled us to explain transcription initiation from sequence at both the motif level and the base-pair level in human promoters and across mammalian genomes. With this capability, we demonstrated the associations between motif contributions and the cell-type specificity of promoters. We postulate that this link between motif contributions and cell-type specificity is explained by motif contributions tuning the response curve of promoters to external signals of transcriptional activation. Moreover, we elucidated the sequence basis of bidirectional transcription initiation at most human promoters. Qualitatively, bidirectional motifs, which are present in most promoters, are likely the main driver of bidirectional transcription, and we quantified the shared base-pair contributions of transcription on both strands.
Furthermore, by comparing human and mouse data and sequence conservation across 241 mammalian species, we show that the transcription initiation rules are conserved across mammalian species. Overall, we show that our explainable sequence-based machine-learning model provides rich insights into understanding the sequence basis of transcription initiation. Looking forward, our approach of building explainable models based on insights learned from deep-learning models can be applied to studying the sequence basis of other genome regulatory processes.
Figure caption: A unified model that explains the sequence basis of transcription initiation in the human genome. Puffin predicts transcription initiation signals by first detecting sequence patterns that appear in the DNA sequence and then applying the effects of every sequence pattern on the transcription initiation signal. The model includes three types of sequence patterns: motifs, initiators, and trinucleotides. Strand-specific base pair-resolution transcription initiation signals are predicted by combining motif effects additively in log scale and then transforming to output scale. bp, base pairs.
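The two-step scheme in the caption (detect sequence patterns, then apply each pattern's position-specific effect and sum in log scale) can be illustrated with a toy sketch. This is not the Puffin implementation: the function names, the match/mismatch scoring matrix, the threshold, the effect curve, and the use of a plain exponential as the output transform are all assumptions made for illustration, and only a single strand is modeled.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (4, L) array; unknown bases stay all-zero."""
    x = np.zeros((4, len(seq)))
    for i, base in enumerate(seq.upper()):
        if base in BASES:
            x[BASES.index(base), i] = 1.0
    return x

def scan(x, pwm, threshold):
    """Step 1: slide a (4, w) scoring matrix along the sequence.

    Returns a length-L array of rectified match strengths, indexed by the
    start position of each window (hypothetical scoring scheme).
    """
    w = pwm.shape[1]
    length = x.shape[1]
    act = np.zeros(length)
    for i in range(length - w + 1):
        score = float((x[:, i:i + w] * pwm).sum())
        act[i] = max(0.0, score - threshold)
    return act

def predict(seq, patterns):
    """Step 2: spread each pattern's effect curve around its matches,
    add all contributions in log scale, then map to the output scale.

    patterns: list of (pwm, threshold, effect) tuples. effect has odd
    length 2r+1; effect[r + d] is the log-scale effect d bp downstream
    of the match start. exp() as the output transform is an assumption.
    """
    x = one_hot(seq)
    log_signal = np.zeros(len(seq))
    for pwm, threshold, effect in patterns:
        act = scan(x, pwm, threshold)
        log_signal += np.convolve(act, effect, mode="same")
    return np.exp(log_signal)

# Toy pattern: +1 for each base matching "TATA", -1 for each mismatch.
tata_pwm = -np.ones((4, 4))
for pos, base in enumerate("TATA"):
    tata_pwm[BASES.index(base), pos] = 1.0

# Toy effect curve: activates initiation 2 bp downstream of the match start.
tata_effect = np.array([0.0, 0.0, 0.0, 0.0, 2.0])

seq = "GGGGTATAGGGGGGGGGGGG"
signal = predict(seq, [(tata_pwm, 3.0, tata_effect)])
peak = int(np.argmax(signal))
```

In this toy input the only exact TATA match starts at index 4, so the predicted signal peaks at index 6 and stays at the baseline value of 1 elsewhere. Passing a second pattern to `predict` adds its contribution to the same log-scale track, which is what makes each pattern's effect separately readable.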