PCRMS: a database of predicted cis-regulatory modules and constituent transcription factor binding sites in genomes

Table 1.

Comparison of the contents of the three databases

Database	Species	CRMC mean length (bp)	CRMC number	CRMC coverage of genome (%)	TFBS number	TFBS coverage of genome (%)
PCRMS	H. sapiens	981	1 404 973	44.03	90 671 016	16.71
PCRMS	M. musculus	1493	920 068	50.39	104 251 155	20.34
GeneHancer	H. sapiens	1489	394 086	18.99	X	X
GeneHancer	M. musculus	X	X	X	X	X
SCREEN	H. sapiens	273	926 535	8.2	X	X
SCREEN	M. musculus	272	339 815	3.39	X	X

Technical implementation

The current version of PCRMS (v2) was developed using MySQL 5.7.17 (http://www.mysql.com) and runs on a Linux-based Apache2 server (http://www.apache.org). PHP 7.4 (http://www.php.net) scripts were used for back-end processing. The interactive interface and responsive features were implemented using Bootstrap 4 (https://getbootstrap.com/), jQuery (http://jquery.com) and DataTables (https://datatables.net/). NCBI Sequence Viewer 3.38.0 (https://www.ncbi.nlm.nih.gov/projects/sviewer/) was used for visualization.

Results and discussion

Predicted CRMs and constituent TFBSs in the human and mouse genomes

Applying dePCRM2 to the TF ChIP-seq datasets available to us (6/1/2019) in each organism, we predicted 1 404 973 and 920 068 CRMCs in the human (47) and mouse genomes, comprising 44.03% and 50.39% of their genomes, respectively. These CRMCs contain 90 671 016 and 104 251 155 TFBSs, comprising 16.71% and 20.34% of the human and mouse genomes, respectively. We compared the numbers and lengths of our CRMCs with those of cCREs in the SCREEN database (46) and those of enhancers in the GeneHancer database (39). cCREs were predicted based on overlaps among hundreds or thousands of DNase-seq, ATAC-seq and histone mark ChIP-seq datasets in various cell/tissue types in an organism (46). Enhancers in GeneHancer were predicted by combining nine sets of earlier predicted and experimentally determined human CRMs using a voting scheme (39). As shown in Table 1, the numbers of our predicted CRMCs are much larger than those of cCREs and that of GeneHancer enhancers. Our predicted CRMCs also cover much larger proportions of the genomes than do those in SCREEN or GeneHancer (Table 1). We attribute the larger numbers and higher genome coverage of our predicted CRMCs to two reasons. First, the three methods used different types of input data, which might capture different features of CRMs and thus have different capabilities of predicting CRMs. Specifically, the input data for predicting cCREs were DNase-seq, ATAC-seq and histone mark ChIP-seq data, those for predicting GeneHancer enhancers were CRM sets earlier predicted or experimentally determined by different groups, and those for predicting CRMCs were TF ChIP-seq data. Second, the number of predicted cCREs was limited by the number of called DNase I hypersensitive sites, transposase-accessible sites and epigenetic mark peaks, while the number of GeneHancer enhancers was constrained by the sizes of the earlier predicted and experimentally determined enhancer sets.
In contrast, by appropriately extending the originally called short TF binding peaks, we could greatly increase the power of available TF ChIP-seq data as we demonstrated earlier (47), since binding sites of cooperative TFs tend to be closely located on a genome segment to form a CRM (1), while a called short binding peak to which the ChIP-ed TF binds can be only a part of a longer CRM. For instance, the extended binding peaks (1000bp) in the 6079 human ChIP-seq datasets cover 77.47% of the mappable genome, and the extended parts of the peaks contribute almost half (47.10%) of the coverage (47). dePCRM2 predicts 56.84% of the covered genome to be CRMC positions, and 42.13% of them are predicted solely based on the extended parts of originally called binding peaks (47). Importantly, we have shown that CRMC positions predicted from the extended parts of originally called binding peaks are under evolutionary constraints as strong as those on positions predicted from the originally called binding peaks, and thus are likely true CRMC positions (47). On the other hand, due to the noisy nature of ChIP-seq data (51–53), 37.82% of genome positions covered by originally called binding peaks are not predicted to be CRMCs, and they are largely selectively neutral (47).
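The effect of peak extension on genome coverage can be illustrated with a minimal sketch. It assumes that peaks are half-open (start, end) intervals on a single chromosome and that each peak shorter than 1000 bp is extended symmetrically about its midpoint; the function names, the symmetric-extension rule and the toy coordinates are hypothetical and are not taken from the dePCRM2 implementation.

```python
def extend_peak(start, end, target=1000):
    """Extend a peak symmetrically about its midpoint to `target` bp.

    Peaks that are already at least `target` bp long are returned
    unchanged (an assumption of this sketch).
    """
    if end - start >= target:
        return start, end
    mid = (start + end) // 2
    new_start = max(0, mid - target // 2)  # do not run off the chromosome start
    return new_start, new_start + target


def covered_bp(intervals):
    """Count positions covered by at least one half-open interval."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching interval: grow the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


peaks = [(1000, 1200), (1900, 2100), (5000, 5300)]  # toy coordinates
extended = [extend_peak(s, e) for s, e in peaks]
original_cov = covered_bp(peaks)
extended_cov = covered_bp(extended)
extra_fraction = (extended_cov - original_cov) / extended_cov
```

In this toy case the first two extended peaks overlap and merge into one covered block, which is the situation in which closely located binding sites of different TFs can be joined into a single candidate region.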

The lengths of our predicted CRMCs in the human and mouse genomes have similar distributions (Figure 1A), ranging from a few hundred bp to a few thousand bp with mean lengths of 981bp and 1,439bp, respectively, which are shorter than those of known human (2049bp) and mouse (2432bp) enhancers in the VISTA database (48), indicating that a portion of our CRMCs are only components of longer CRMs as we argued earlier (47). In contrast, the lengths of cCREs in the human and mouse genomes are almost uniform with mean lengths of 273bp and 272bp, respectively (46) (Table 1), while the lengths of GeneHancer enhancers show a periodic pattern (47) with a mean length of 1489bp. Such erratic lengths of cCREs and GeneHancer enhancers are likely artifacts of the underlying algorithms as we argued earlier (47). On the other hand, as we pointed out earlier (47), accurate prediction of the lengths of CRMs is a highly challenging task, because a truncated enhancer can still be functional (1), and a super-enhancer may contain multiple discrete short enhancers (54). Thus, the length of a CRM depends on how it is defined. dePCRM2 predicts a CRMC as a cluster of TFBSs with the distance between any two adjacent TFBSs being shorter than 300bp (Methods and (47)). While cCREs might be shorter discrete units of longer CRMs, we estimated an FDR of 23.12% for the human cCRE positions based on their largely neutral evolutionary behavior (47). GeneHancer enhancers have a mean length of 1489bp, which is shorter than that of known human enhancers (2049bp) in the VISTA database (48), and we estimated an FDR of 29.28% for the genome positions of the GeneHancer enhancers based on their largely neutral evolutionary behavior (47). We have shown that our predicted CRMC and TFBS positions in the human genome are highly accurate based on validations using multiple independent data (47), and the same is true for the predicted CRMCs and TFBSs in the mouse genome (manuscript in preparation, P.N. and Z.S.).
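The stated definition of a CRMC, a cluster of TFBSs in which any two adjacent TFBSs are less than 300 bp apart, can be sketched as a single pass over sorted sites. This is an illustrative reading of that rule only: the function name, the dictionary layout, and the choice to measure the gap from the end of the current cluster to the start of the next site are assumptions, and the actual dePCRM2 pipeline involves further steps (Methods and (47)).

```python
def cluster_tfbs(sites, max_gap=300):
    """Group TFBS intervals on one chromosome into clusters.

    A site joins the current cluster when the gap between the cluster's
    end and the site's start is shorter than `max_gap` bp; otherwise it
    starts a new cluster.
    """
    clusters = []
    for start, end in sorted(sites):
        if clusters and start - clusters[-1]["end"] < max_gap:
            cluster = clusters[-1]
            cluster["end"] = max(cluster["end"], end)
            cluster["n_sites"] += 1
        else:
            clusters.append({"start": start, "end": end, "n_sites": 1})
    return clusters


# Toy TFBS coordinates (hypothetical), as half-open (start, end) pairs.
sites = [(100, 112), (150, 160), (400, 415), (900, 910), (1000, 1012)]
crmcs = cluster_tfbs(sites)
lengths = [c["end"] - c["start"] for c in crmcs]
```

Under this rule the cluster length is determined entirely by where TFBSs happen to be predicted, which is one way to see why CRM length depends on how a CRM is defined.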

Figure 1.

Summary of the lengths of CRMCs and the numbers of TFBSs in a CRMC in the human and mouse genomes. A. Distributions of the lengths of CRMCs in the human and mouse genomes. B. Distributions of the number of TFBSs in a CRMC in the human and mouse genomes. C. Scatter plot of the number of TFBSs in a CRMC vs its length in the human genome. D. Scatter plot of the number of TFBSs in a CRMC vs its length in the mouse genome.

The number of TFBSs in a CRMC in either the human genome or the mouse genome varies widely, ranging from a few to a few hundred, with a mean/median number of 90/34 and 183/67, respectively (Figure 1B). Interestingly, the number of TFBSs in a CRMC is largely linearly related to the length of the CRMC in both the human (Figure 1C) and the mouse (Figure 1D) genomes, indicating that the density of TFBSs is largely the same in most of the CRMCs. In contrast, no information on de novo predicted TFBSs in cCREs or enhancers is available in the SCREEN or GeneHancer databases (Table 1).

To evaluate the significance of the CRMCs, dePCRM2 computes a P-value for each predicted CRMC based on its $S_{\mathrm{CRM}}$ score. We have shown earlier that the longer a CRMC, the higher its $S_{\mathrm{CRM}}$ score, the smaller its P-value, and the stronger the evolutionary constraint it is subject to (47). Therefore, both the $S_{\mathrm{CRM}}$ score and its associated P-value capture essential features of a true CRM. This result also justifies our assumption that a genome segment containing closely located putative TFBSs is more likely to be a CRM than a segment without such sequence patterns. It is based on this assumption that dePCRM2 predicts CRMs. The assumption is clearly in agreement with the well-known notion that a functional genome segment such as a CRM must contain certain sequence patterns (i.e., clusters of TFBSs) that are unlikely to occur by chance, and that the longer the patterns, the less likely they are to occur by chance.

However, as dePCRM2 predicts CRMCs based on the predicted TFBSs in the genome, false positive and false negative predictions of TFBSs would result in false positive, false negative and incomplete predictions of CRMs. We estimated the FDR of our motif-finder ProSampler used in the dePCRM2 pipeline to be about 8% (50). Thus, we designed dePCRM2 to further filter out potentially false positive motifs returned by ProSampler in the extended binding peaks in a dataset based on their co-occurring patterns (see Methods and (47)). We estimated the FDR of the predicted CRMC positions to be about 0.05%; thus, the FDR for TFBSs is likely further reduced (47). However, as we indicated earlier (47), due to the limitation of the available TF ChIP-seq datasets, our predicted TFBSs are still incomplete, and a proportion of our predicted CRMCs might be only components of long CRMs whose full prediction depends on more data becoming available in the future. Nonetheless, the short CRMC components can be effectively filtered out using a higher $S_{\mathrm{CRM}}$ score cutoff or a lower P-value cutoff (47). To assist users who might be interested in CRMCs with different lengths, statistical significance or evolutionary constraints, in addition to making the entire sets of predicted CRMs available for bulk downloading, we provide four options of P-value cutoffs (P-value < 0.05, 0.01, $5 \times 10^{-6}$ and $1 \times 10^{-6}$) to query the database. Table 2 summarizes the predicted CRMs using these P-value cutoffs; they are subsets of the CRMCs with different length distributions and conservation levels (47). Clearly, the smaller the P-value cutoff, the longer the predicted CRMs.
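Selecting CRMs at the four P-value cutoffs amounts to a simple filter followed by a summary, sketched below. The record layout (`start`, `end`, `p_value` keys) and the toy values are hypothetical; the database's own schema and the way dePCRM2 derives P-values are not shown here.

```python
CUTOFFS = (0.05, 0.01, 5e-6, 1e-6)  # the four cutoffs offered for queries


def filter_by_pvalue(crmcs, cutoff):
    """Keep only the CRMCs whose P-value is below `cutoff`."""
    return [c for c in crmcs if c["p_value"] < cutoff]


def summarize(crms):
    """Return the number of CRMs and their mean length in bp."""
    n = len(crms)
    mean_len = sum(c["end"] - c["start"] for c in crms) / n if n else 0.0
    return {"number": n, "mean_length": mean_len}


# Toy CRMCs in which longer candidates have smaller P-values.
crmcs = [
    {"id": "a", "start": 0, "end": 500, "p_value": 0.2},
    {"id": "b", "start": 1000, "end": 2000, "p_value": 0.03},
    {"id": "c", "start": 5000, "end": 7000, "p_value": 2e-6},
    {"id": "d", "start": 9000, "end": 12000, "p_value": 1e-7},
]
summary = {cutoff: summarize(filter_by_pvalue(crmcs, cutoff)) for cutoff in CUTOFFS}
```

Because each stricter cutoff selects a subset of the previous one, the number of CRMs can only decrease as the cutoff is lowered, while the mean length rises whenever the removed candidates are the shorter ones.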

Table 2.

Summary of the predicted CRMs at different P-values in the human and mouse genomes

Species	P-value	CRM mean length (bp)	CRM number	CRM coverage of genome (%)	TFBS number	TFBS coverage of genome (%)
H. sapiens	0.05	1162	1 155 151	43.47	89 948 206	16.54
H. sapiens	0.01	1292	1 020 679	42.72	88 912 654	16.32
H. sapiens	5.00E-06	2292	428 628	31.81	71 478 114	12.88
H. sapiens	1.00E-06	2624	327 396	27.82	64 136 635	11.47
M. musculus	0.05	1749	777 409	49.9	103 718 473	20.21
M. musculus	0.01	1944	688 033	49.06	102 730 265	19.99
M. musculus	5.00E-06	3182	338 635	39.53	88 579 892	16.96
M. musculus	1.00E-06	3780	250 606	34.75	80 002 349	15.2

Web interface to the database

We provide a user-friendly web interface to the PCRMS database for quickly querying and browsing predicted CRMs and TFBSs at different statistical significance levels in each organism, as well as three functional analysis modules. Using these modules, the user can (i) search the closest CRM to a given gene, (ii) search all CRMs in the upstream and/or downstream regions of a gene of interest and (iii) search the TFBSs of a TF on one or more chromosomes in an organism (Figure 2).

Figure 2.

Overview of data integration and analysis modules and features of the PCRMS database.

Browse of database contents

We provide a Browse function by which the user can browse all CRMs predicted at a selected P-value cutoff on one or multiple selected chromosomes in a selected organism and inspect each CRM and its constituent TFBSs in detail. The user starts in the search form (Figure 3A) by selecting an organism (e.g. H. sapiens), one or more chromosomes (e.g. chrX) and a P-value cutoff (e.g. 1E-06). The search returns all the predicted CRMs (n = 8762) on the chromosome (chrX) of the organism (H. sapiens) in the interactive CRM list table (Figure 3B). Clicking on a CRM of interest, e.g. the first CRM hse1000017 in the table, pops up the CRM information table (Figure 3C), where some parameters of the CRM are shown in the left panel and the locus of the CRM is displayed in the NCBI sequence viewer (shadowed rectangle) in the right panel, enabling detailed inspection of the genomic context of the CRM, including its neighboring genes and other annotations, using the zooming and translation functions of the viewer. For instance, the viewer reveals that hse1000017 is located in the second through the fifth introns, and spans the third through fifth exons, of the BCOR gene, which codes for a corepressor of the transcriptional repressor BCL6. Both BCOR and BCL6 are involved in B lymphocyte differentiation (55, 56). Interestingly, hse1000017 overlaps two regulatory sequences annotated as ‘enhancer’ and ‘transcriptional cis-regulation’, and many ClinVar variants are located in hse1000017 (Figure 3C). Finally, clicking on the CRM ID (e.g. hse1000017) in the right panel of the CRM information table (Figure 3C) displays the CRM’s 5094 constituent TFBSs in the interactive TFBS table (Figure 3D), which includes the coordinates of the TFBSs, their UM IDs, binding scores, UM logos and matched known motifs. The vast majority of these TFBSs match those of known TF families (Figure 3D).

Figure 3.

The browse functions. A. In the search form, the user selects an organism (e.g. H. sapiens), one or more chromosomes (e.g. chrX) and a P-value cutoff (e.g. 1E-06). B. The search results are displayed in the CRM list table. Shown is a snapshot of the resulting CRM list table containing 8762 predicted CRMs on chrX of H. sapiens. The first CRM hse1000017 in the list table is selected for further visualization. C. In the CRM information table, some parameters of the selected CRM hse1000017 are shown in the right panel, and the locus is displayed in the NCBI sequence viewer for further inspection. Clicking on ‘hse1000017’ in the right panel of the CRM information table displays its constituent TFBSs. D. A snapshot of the TFBS table of hse1000017 containing its 5094 constituent TFBSs.

Figure 4.

Search the closest CRM(s) to a gene. A. In the search form, the user selects an organism (e.g. H. sapiens) and a P-value cutoff (e.g. 1E-06), and inputs a gene name (e.g. GLI3). B. The search results are displayed in the CRM list table. Shown is a snapshot of the returned CRM list table containing 53 predicted CRMs. The third CRM hse1003435 in the list table is selected for further inspection. C. In the CRM information table, some parameters of the selected CRM hse1003435 are displayed in the right panel, and the locus is displayed in the NCBI sequence viewer for further inspection. Clicking on ‘hse1003435’ in the right panel of the CRM information table displays its constituent TFBSs. D. A snapshot of the TFBS table of hse1003435 containing its 988 constituent TFBSs.

Figure 5.

Search CRM(s) in a region around a gene. A. In the search form, the user selects an organism (e.g. H. sapiens) and a P-value cutoff (e.g. 1E-06), and inputs a gene name (e.g. SOX2). B. The search results are displayed in the CRM list table. Shown is a snapshot of the 102 returned CRMs in the table. The second CRM hse1002109 in the list table is selected for further inspection. C. In the CRM information table, parameters of the selected CRM hse1002109 are shown in the right panel, and the locus is displayed in the NCBI sequence viewer for further inspection. Clicking on ‘hse1002109’ in the right panel of the CRM information table displays its constituent TFBSs. D. A snapshot of the TFBS table of hse1002109 containing its 1344 constituent TFBSs.

Figure 6.

Search TFBSs of a TF. A. In the search form, the user selects an organism (e.g. H. sapiens), inputs the name of a TF (e.g. RUNX1) and selects a chromosome (e.g. chrX). B. A snapshot of the resulting TFBS table containing 2678 TFBSs of RUNX1 on chrX of H. sapiens.

In both the interactive CRM list table (Figure 3B) and the interactive TFBS table (Figure 3D), the user can change the number of entries displayed per page, sort the results by different columns, filter the results using the search box and set the visible columns. The user can copy the selected items or export them to a file in CSV or Excel format; by default, all records are exported if no item is selected (Figure 3).

Functional analyses

To facilitate analysis of potential CRM-gene relationships and the TFBS landscape of specific TFs, we provide three functional analysis modules. First, using the ‘select the closest CRMs to a gene’ function, the user can search the closest CRMs to a gene (e.g. GLI3) in an organism (e.g. H. sapiens) at a P-value cutoff (e.g. $1 \times 10^{-6}$) (Figure 4A). The search returns the interactive CRM list table containing all CRMs to which the gene is the closest among all genes on the chromosome (Figure 4B). In the example of the GLI3 gene, a total of 53 CRMs are returned. The user can inspect any of them by clicking on the CRM ID, which pops up the information table of the CRM as we demonstrated earlier (Figure 3C). For instance, clicking on the third CRM hse1003435 in the table displays it in the NCBI sequence viewer, revealing that the CRM is located in the third and fourth introns, and spans the fourth and fifth exons, of the GLI3 gene (shadowed rectangle in Figure 4C). Interestingly, hse1003435 overlaps two annotated enhancers and many ClinVar variants (Figure 4C). Finally, clicking on the CRM ID hse1003435 in the right panel of the CRM information table (Figure 4C) displays the CRM’s 988 constituent TFBSs in the interactive TFBS table (Figure 4D). Most of these TFBSs match those of known TF families, while the cognate TFs of a few remain to be determined (TBD) (Figure 4D).
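The closest-gene query described above can be sketched as a nearest-interval assignment. The sketch assumes that genes and CRMs on one chromosome are given as half-open intervals and that the distance between a CRM and a gene is the gap between the two intervals (zero when they overlap); these definitions, the function names and the toy coordinates are hypothetical, and the distance measure actually used by PCRMS is not specified here.

```python
def interval_distance(a_start, a_end, b_start, b_end):
    """Gap in bp between two half-open intervals; 0 if they overlap."""
    return max(0, b_start - a_end, a_start - b_end)


def crms_closest_to(gene_name, genes, crms):
    """Return the IDs of CRMs whose nearest gene is `gene_name`.

    `genes` maps a gene name to (start, end); `crms` is a list of
    (crm_id, start, end). Ties go to the gene listed first in `genes`.
    """
    hits = []
    for crm_id, c_start, c_end in crms:
        nearest = min(
            genes,
            key=lambda g: interval_distance(c_start, c_end, genes[g][0], genes[g][1]),
        )
        if nearest == gene_name:
            hits.append(crm_id)
    return hits


# Toy annotation on a single chromosome (hypothetical names and coordinates).
genes = {"geneA": (1000, 5000), "geneB": (20000, 26000)}
crms = [
    ("crm1", 1500, 2500),    # inside geneA
    ("crm2", 8000, 9000),    # 3000 bp from geneA, 11000 bp from geneB
    ("crm3", 15000, 16000),  # 10000 bp from geneA, 4000 bp from geneB
    ("crm4", 27000, 28000),  # 1000 bp past the end of geneB
]
closest_to_a = crms_closest_to("geneA", genes, crms)
closest_to_b = crms_closest_to("geneB", genes, crms)
```

Note that a CRM located inside a gene, as with hse1003435 in GLI3, has distance zero under this definition and is therefore always assigned to that gene.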

Second, using the ‘select CRMs around a gene’ function, the user can search in an organism (e.g. H. sapiens) all CRMs in the upstream and/or downstream regions (e.g. 0.5Mbp) of a given gene (e.g. SOX2) (Figure 5A). The search returns all CRMs in the interactive CRM list table (Figure 5B). As before, each CRM can be inspected in its information table by clicking on the CRM ID. In the example of the SOX2 gene of H. sapiens, a total of 102 CRMs on chr3 are returned in the CRM list table (Figure 5B). Inspection of the second CRM hse1002109 in the NCBI sequence viewer reveals that the CRM is located in the sixth and seventh introns, and spans the seventh exon, of the SOX2 gene. Interestingly, it overlaps two annotated enhancer sequences, as well as many ClinVar variants (Figure 5C). Clicking on the CRM ID hse1002109 in the right panel of the CRM information table (Figure 5C) displays the CRM’s 1344 constituent TFBSs in the interactive TFBS table (Figure 5D). Some of these TFBSs match those of known TF families, while the cognate TFs of the others remain to be determined (TBD).
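A window query of this kind can be sketched as an interval-overlap test. The sketch assumes a gene given as (start, end, strand) and treats "upstream" as lower coordinates for plus-strand genes and higher coordinates for minus-strand genes; the function name, the strand handling and the toy coordinates are assumptions and do not describe the PCRMS back end.

```python
def crms_around_gene(gene, crms, upstream=500_000, downstream=500_000):
    """Return IDs of CRMs overlapping a window around a gene.

    `gene` is (start, end, strand); `crms` is a list of (crm_id, start, end),
    all as half-open intervals on the same chromosome.
    """
    g_start, g_end, strand = gene
    if strand == "-":
        # For a minus-strand gene, upstream lies at higher coordinates.
        upstream, downstream = downstream, upstream
    win_start = max(0, g_start - upstream)
    win_end = g_end + downstream
    return [cid for cid, s, e in crms if s < win_end and e > win_start]


gene = (1_000_000, 1_002_000, "+")  # toy gene
crms = [
    ("x1", 400_000, 401_000),
    ("x2", 600_000, 601_000),
    ("x3", 1_001_000, 1_001_500),
    ("x4", 1_400_000, 1_401_000),
    ("x5", 1_600_000, 1_601_000),
]
both_sides = crms_around_gene(gene, crms)
upstream_only = crms_around_gene(gene, crms, downstream=0)
minus_upstream = crms_around_gene((1_000_000, 1_002_000, "-"), crms, downstream=0)
```

CRMs overlapping the gene body itself are included in every result here because the window always contains the gene; a real interface may or may not make the same choice.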

Third, using the ‘search TFBSs of a transcription factor’ function, the user can retrieve all TFBSs of a given TF (e.g. RUNX1) on one or more selected chromosomes (e.g. chrX) in an organism (e.g. H. sapiens) (Figure 6A). The results are returned in the interactive TFBS table (Figure 6B). In the example of the TF RUNX1, a total of 2678 binding sites are found on chrX of H. sapiens.

Batch download

Using the Download function from the home page, the user can download all predicted CRMCs and constituent TFBSs in an organism in a file in BED format.
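Downloaded BED files can be read with a few lines of code, as in the sketch below. It relies only on the general BED convention that the first three tab-separated columns are the chromosome, a 0-based start and an exclusive end; treating a fourth column as an identifier, and the toy records themselves, are assumptions, since the exact column layout of the PCRMS files is not described here.

```python
import io


def read_bed(handle):
    """Yield (chrom, start, end, name) tuples from a BED-like text stream.

    Blank lines and header lines (comments, 'track', 'browser') are skipped.
    `name` is None when the line has only three columns.
    """
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else None
        yield chrom, start, end, name


# A small in-memory example standing in for a downloaded file.
example = io.StringIO(
    "# toy records\n"
    "chrX\t100\t600\tcrm_a\n"
    "chrX\t900\t1500\tcrm_b\n"
    "chr3\t50\t250\n"
)
records = list(read_bed(example))
total_bp = sum(end - start for _, start, end, _ in records)
```

For a real file, the same generator can be applied to an open file handle, e.g. `with open(path) as fh: records = list(read_bed(fh))`, where `path` is wherever the user saved the download.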

Future development

In the future, we will add predicted CRMCs and TFBSs in other important model organisms such as the worm (C. elegans) and the fly (D. melanogaster). We will also update the predictions in each organism when more data are available. By integrating more data sources, we will add more information about the CRMCs, including their predicted functional states (active or non-active) in various cell/tissue types, their predicted target genes and causal variants of complex traits and diseases.

Conclusions

We have developed the PCRMS database that contains the most comprehensive collections of accurately predicted CRMs and constituent TFBSs in the human and mouse genomes. The web interface to PCRMS allows the user to browse, search and visualize the CRMs and constituent TFBSs. It also provides three functional analysis modules to search the closest CRM(s) to a gene, the CRM(s) in a region around a gene and the TFBS landscape of a specific TF. The results can be inspected interactively and exported to files in different formats. All the predicted CRMCs and TFBSs in an organism can be downloaded in BED format. PCRMS will facilitate the research community’s efforts to characterize the regulatory genomes in important organisms.

Acknowledgements

The authors would like to acknowledge members of the Office of Technical Service of the College of Computing and Informatics at UNC Charlotte for the security review and deployment of the databases.

Funding

US National Science Foundation (DBI-1661332). The funding bodies played no role in the design of the study, in the collection, analysis and interpretation of data, or in the writing of the manuscript.

Conflict of interest

The authors declare that they have no competing interests.

Declarations

Ethics approval and consent to participate. Ethics approval is not applicable to this study.

Data availability

All predicted CRMCs of human and mouse can be freely downloaded at https://cci-bioinfo.uncc.edu.

Consent for publication

Not applicable.

Author contributions

Z.S. and P.N. conceived and designed the project. P.N. carried out the computational analysis and built the database and the web interface. Z.S. and P.N. wrote the manuscript. All the authors read and approved the final manuscript.

References

1. Davidson E.H. (2006) The Regulatory Genome: Gene Regulatory Networks In Development And Evolution. Academic Press, Amsterdam.
2. Hindorff L.A., Sethupathy, Junkins H.A. et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. U.S.A., 106, 9362–9367.
3. Ramos E.M., Hoffman, Junkins H.A. et al. (2014) Phenotype-Genotype Integrator (PheGenI): synthesizing genome-wide association study (GWAS) data with existing genomic resources. Eur. J. Hum. Genet., 144–147.
4. Wittkopp P.J. and Kalay (2012) Cis-regulatory elements: molecular mechanisms and evolutionary processes underlying divergence. Nat. Rev. Genet.
5. Rubinstein and de Souza F.S. (2013) Evolution of transcriptional enhancers and animal diversity. Philos. Trans. R. Soc. Lond., B, Biol. Sci., 368, 20130017.
6. Siepel and Arbiza (2014) Cis-regulatory elements and human evolution. Curr. Opin. Genet. Dev.
7. King and Wilson (1975) Evolution at two levels in humans and chimpanzees. Science, 188, 107–116.
8. Maurano M.T., Humbert, Rynes et al. (2012) Systematic localization of common disease-associated variation in regulatory DNA. Science, 337, 1190–1195.
9. Kasowski, Kyriazopoulou-Panagiotopoulou, Grubert et al. (2013) Extensive variation in chromatin states across humans. Science, 342, 750–752.
10. Kilpinen, Waszak S.M., Gschwind A.R. et al. (2013) Coordinated effects of sequence variation on DNA binding, chromatin structure, and transcription. Science, 342, 744–747.
11. McVicker, van de Geijn, Degner J.F. et al. (2013) Identification of genetic variants that affect histone modifications in human cells. Science, 342, 747–749.
12. Huang and Ovcharenko (2015) Identifying causal regulatory SNPs in ChIP-seq enhancers. Nucleic Acids Res., 225–236.
13. Ward L.D. and Kellis (2012) Interpreting noncoding genetic variation in complex traits and human disease. Nat. Biotechnol., 1095–1106.
14. Pai A.A., Pritchard J.K. and Gilad (2015) The genetic and mechanistic basis for variation in gene regulation. PLoS Genet., e1004857.
15. Schmidt, Wilson M.D., Spyrou et al. (2009) ChIP-seq: using high-throughput sequencing to discover protein–DNA interactions. Methods, 240–248.
16. Song and Crawford G.E. (2010) DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harb. Protoc., 2010, pdb.prot5384.
17. Buenrostro J.D., Chang H.Y. et al. (2015) ATAC-seq: a method for assaying chromatin accessibility genome-wide. Curr. Protoc. Mol. Biol., 109.
18. Simon J.M., Giresi P.G., Davis I.J. et al. (2012) Using formaldehyde-assisted isolation of regulatory elements (FAIRE) to isolate active regulatory DNA. Nat. Protoc., 256.
19. Schones D.E., Cui, Cuddapah et al. (2008) Dynamic regulation of nucleosome positioning in the human genome. Cell, 132, 887–898.
20. Consortium EP. (2004) The ENCODE (ENCyclopedia Of DNA Elements) project. Science, 306, 636–640.
21. Consortium EP. (2011) A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol., e1001046.
22. Bernstein B.E., Stamatoyannopoulos J.A., Costello J.F. et al. (2010) The NIH Roadmap Epigenomics Mapping Consortium. Nat. Biotechnol., 1045–1048.
23. Kundaje, Meuleman, Ernst et al. (2015) Integrative analysis of 111 reference human epigenomes. Nature, 518, 317–330.
24. Consortium G.T. (2013) The genotype-tissue expression (GTEx) project. Nat. Genet., 580–585.
25. Whitington, Frith M.C., Johnson et al. (2011) Inferring transcription factor complexes from ChIP-seq data. Nucleic Acids Res., e98.
26. Sun, Guns, Fierro A.C. et al. (2012) Unveiling combinatorial regulation through the combination of ChIP information and in silico cis-regulatory module detection. Nucleic Acids Res., e90.
27. Polychronidou and Lohmann (2012) COPS: detecting co-occurrence and spatial arrangement of transcription factor binding motifs in genome-wide datasets. PLoS One, e52055.
28. Rohr C.O., Parra R.G., Yankilevich et al. (2013) INSECT: IN-silico SEarch for Co-occurring Transcription factors. Bioinformatics, 2852–2858.
29. Ernst and Kellis (2012) ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods, 215–216.
30. Ernst, Kheradpour, Mikkelsen T.S. et al. (2011) Mapping and analysis of chromatin state dynamics in nine human cell types. Nature, 473.
31. Hoffman M.M., Buske O.J., Wang et al. (2012) Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat. Methods, 473–476.
32. Andersson, Gebhard, Miguel-Escalada et al. (2014) An atlas of active enhancers across human cell types and tissues. Nature, 507, 455–461.
33. Khan and Zhang (2016) dbSUPER: a database of super-enhancers in mouse and human genome. Nucleic Acids Res., D164–D171.
34. Jiang, Qian, Bai et al. (2019) SEdb: a comprehensive human super-enhancer database. Nucleic Acids Res., D235–D243.
35. Ashoor, Kleftogiannis, Radovanovic et al. (2015) DENdb: database of integrated human enhancers. Database: j. biol. databases curation, 2015, bav085.
36. Dreos, Ambrosini, Cavin Perier et al. (2013) EPD and EPDnew, high-quality promoter resources in the next-generation sequencing era. Nucleic Acids Res., D157–D164.
37. Dimitrieva and Bucher (2013) UCNEbase—a database of ultraconserved non-coding elements and genomic regulatory blocks. Nucleic Acids Res., D101–D109.
38. Visel, Taher, Girgis et al. (2013) A high-resolution enhancer atlas of the developing telencephalon. Cell, 152, 895–908.
39. Fishilevich, Nudel, Rappaport et al. (2017) GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database: j. biol. databases curation, 2017, bax028.
40. Wang, Dai, Berry L.D. et al. (2019) HACER: an atlas of human active enhancers to interpret regulatory variants. Nucleic Acids Res., D106–D112.
41. Cai, Cui, Tan et al. (2019) RAEdb: a database of enhancers identified by high-throughput reporter assays. Database: j. biol. databases curation, 2019, bay140.
42. Wang, Zhang et al. (2018) HEDD: human enhancer disease database. Nucleic Acids Res., D113–D120.
43. Zhang, Shi, Zhu et al. (2018) DiseaseEnhancer: a resource of human disease-associated enhancer catalog. Nucleic Acids Res., D78–D84.
44. Wei, Zhang, Shang et al. (2016) SEA: a super-enhancer archive. Nucleic Acids Res., D172–D179.
45. Gao and Qian (2020) EnhancerAtlas 2.0: an updated resource with enhancer annotation in 586 tissue/cell types across nine species. Nucleic Acids Res., D58–D64.
46. Moore J.E., Purcaro M.J., Pratt H.E. et al. (2020) Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature, 583, 699–710.
47. (2021) Accurate prediction of cis-regulatory modules reveals a prevalent regulatory genome of humans. NAR Genom. Bioinform., lqab052.
48. Visel, Minovitsky, Dubchak et al. (2007) VISTA Enhancer Browser—a database of tissue-specific human enhancers. Nucleic Acids Res., D88–D92.
49. Mei, Qin et al. (2017) Cistrome Data Browser: a data portal for ChIP-Seq and chromatin accessibility data in human and mouse. Nucleic Acids Res., D658–D662.
50. Zhang et al. (2019) ProSampler: an ultra-fast and accurate motif finder in large ChIP-seq datasets for combinatory motif discovery. Bioinformatics, 4632–4639.
51. Mendoza-Parra M.A., Van Gool, Mohamed Saleem M.A. et al. (2013) A quality control system for profiles obtained by ChIP sequencing. Nucleic Acids Res., e196.