Evidence classification of high-throughput protocols and confidence integration in RegulonDB Open Access

Evidence classification of HT methods

Strength of evidence	Method	Evidence code in RegulonDB
1. TSSs
Strong evidence	Identification of TSSs using at least two different strategies of enrichment for primary transcripts, consistent biological replicates	RS-EPT-CBR
	Identification of TSSs of ncRNA, using at least two different strategies of enrichment for primary transcripts, consistent biological replicates, and evidence for a non-coding gene	RS-EPT-ENCG-CBR
Weak evidence	All other RNA-seq protocols	RS
2. Regulatory interactions
Strong evidence	ChIP analysis and statistical validation of TF-binding sites	CHIP-SV
Weak evidence	ChIP analysis; example: ChIP-chip and ChIP-seq	CHIP
	Gene expression analysis using RNA-seq or microarray analysis	GEA
	Genomic SELEX (systematic evolution of ligands by exponential enrichment)	GSELEX
	ROMA (run-off transcription microarray analysis)	ROMA
3. TUs
Strong evidence	Mapping of signal intensities by RNA-seq and evidence for a single gene, consistent biological replicates	MSI-ESG-CBR
	PET (paired end di-tagging)	PET
Weak evidence	Mapping of signal intensities by microarray analysis or RNA-seq	MSI


Table 1


Classification of HT-protocols—transcription start sites

RNA-seq protocols

RNA-seq is a powerful application used to quantitatively analyse transcriptomes. Examples are the comparative analyses of complete sets of RNA transcribed in different growth conditions, the identification of regulons, transcription start sites (TSSs), and TUs (8–18). The basic principle of RNA-seq is the analysis, by next-generation sequencing technologies, of cDNA libraries obtained by reverse transcription of RNA pools (19–24). This is achieved by a series of consecutive steps: RNA extraction and depletion, reverse transcription into DNA, introduction of adaptor sequences at the 5′- and 3′-ends of the cDNA, PCR amplification of the cDNA library (optional), followed by next-generation sequencing and mapping of the sequence reads onto the reference genome. For each step, different protocols have been published, which can be assembled in a modular fashion. As a consequence, RNA-seq protocols exhibit great variability. For instance, protocols differ in the enrichment of RNA, in the construction of the cDNA libraries, and also depending on whether the analyses aim at the comparative quantification of transcripts, the identification of TSSs or the analysis of TUs. For quantitative expression analyses, the isolated RNA is fragmented to get an even distribution of reads along the length of the transcripts. In contrast, for the identification of TSSs, that is, the identification of primary 5′-ends of transcripts, this step must be omitted.

RNA degradation is a major source of false positives in RNA-seq

The purification and analysis of bacterial mRNA is more challenging than that of eukaryotic mRNA.

For instance, bacterial mRNA is polycistronic and frequently contains internal initiation and termination sites, resulting in a complex transcriptional profile with overlapping TUs (25). Moreover, isolation of mRNA using oligo-dT selection is not possible since the majority of bacterial RNA lacks poly(A) tails. To remove the abundant ribosomal RNA and increase the rate of mRNA reads, different rRNA depletion methods are required, such as the removal of rRNA by hybridization to rRNA-specific probes (26).

The greatest challenge, however, is the instability of prokaryotic mRNA, which exhibits an average half-life of ∼3–8 min (27,28), ranging from less than a minute to half an hour, resulting in a large fraction of processed RNA molecules. Therefore, the unambiguous identification of TSSs requires an efficient measure to distinguish the 5′-ends of such processed or degraded mRNA ends from those of genuine transcripts.

The enrichment for 5′-triphosphate ends reduces detection of RNA-degradation products

Degradation intermediates and processed RNA products can be distinguished from primary transcripts by means of the chemical nature of their 5′-ends, since primary transcripts carry 5′-triphosphate ends (5′-PPP) (11,12), while processed and degraded RNA carries a 5′-monophosphate (5′-P). This can be exploited to specifically enrich for primary transcripts. One strategy utilizes 5′-dependent terminator exonuclease (TEX), which degrades RNA carrying a 5′-P end, while RNA carrying a 5′-PPP end is not a substrate of this enzyme and therefore is not degraded. In dRNA-seq (differential RNA-seq), reads derived from a TEX-treated library are compared with an untreated library to discriminate between primary and processed 5′-ends (11,12). Comparison of TEX-treated RNA libraries with untreated libraries has demonstrated that a large proportion of RNA libraries is degraded or processed RNA (12). As a consequence, read coverages obtained by dRNA-seq are shifted towards the 5′-end, with peak profiles rising at the position of the TSSs (11). However, the pyrophosphohydrolase activity present in bacteria, encoded by the rppH gene in E. coli, converts 5′-PPP ends into 5′-P ends and thereby masks genuine TSSs. Therefore, the direct subtraction of the 5′-P ends is not an option.
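The comparison of a TEX-treated with an untreated library can be illustrated with a minimal sketch. It assumes that read-start counts per genomic position are already available and normalized between libraries; the function name, the thresholds and the pseudocount are hypothetical and are not taken from the cited protocols.

```python
from typing import Dict, List


def enriched_5prime_ends(
    tex_plus: Dict[int, int],
    tex_minus: Dict[int, int],
    min_reads: int = 10,       # assumed minimum read starts in the treated library
    min_ratio: float = 2.0,    # assumed enrichment ratio, illustrative only
    pseudocount: float = 1.0,  # avoids division by zero for uncovered positions
) -> List[int]:
    """Return positions whose read starts are enriched in the TEX-treated library."""
    candidates = []
    for pos, treated in sorted(tex_plus.items()):
        untreated = tex_minus.get(pos, 0)
        ratio = (treated + pseudocount) / (untreated + pseudocount)
        # Primary 5'-PPP ends survive TEX treatment, so they should be enriched.
        if treated >= min_reads and ratio >= min_ratio:
            candidates.append(pos)
    return candidates


# Toy read-start counts per position (invented numbers).
tex_plus = {100: 80, 250: 12, 400: 5}
tex_minus = {100: 20, 250: 40, 400: 1}
print(enriched_5prime_ends(tex_plus, tex_minus))  # → [100]
```

Position 250 is depleted after TEX treatment and behaves like a processed 5′-end, while position 400 has too few reads to be called under the assumed threshold.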

The usefulness of the dRNA-seq protocol has been shown in a recent analysis of the Synechocystis transcriptome. Of the 64 TSSs that had previously been identified by classical transcription initiation mapping, 44 were detected in this study and confirmed by the published results (16). In addition to the use of TEX, other protocols can be used for the enrichment of 5′-PPP ends. For instance, the ligation of biotinylated adapters to processed RNAs carrying a 5′-P end allows their removal using magnetic streptavidin (1). Another method is 5′-tagRACE, which involves the differential tagging of 5′-P and 5′-PPP ends (29).

Due to the inherently noisy nature of the transcriptome, the random errors of the experiments due to bias in library construction, amplification and sequencing efficiency (30–33), and the fact that it is not straightforward to discriminate primary from processed transcripts, high reproducibility is required in order to be confident of the TSS assignment. Therefore, classification as strong evidence requires that the data are validated by multiple biological and technical replicates, which may be analysed either within the same study or, even better, in independent studies. In addition, data have to be supported by at least two different enrichment methodologies, for instance a combination of dRNA-seq and the differential ligation of adaptors to processed transcripts (Table 1).

An even more critical case is the identification of TSSs of non-coding RNAs (ncRNAs). Such RNAs lack an apparent open reading frame. Therefore, their corresponding TUs escape detection by conventional sequence analysis. Identification of ncRNAs by RNA-seq is particularly prone to false-positive results, which may occur due to the spurious synthesis of second strand cDNA, or residual genomic DNA contaminating the RNA pool (9,34), as well as ‘false priming’ (35,36), caused by priming of the reverse transcription reaction in hairpin structures in the RNA or by other, partially complementary RNA molecules. In addition, it has been reported that a substantial fraction of the detected transcripts could be the result of spurious transcription initiation events at promoter-like sequences (37,38). Therefore, the identification of TSSs of ncRNA by the above combination of different enrichment strategies requires verification and is only classified as strong evidence if the ncRNA is validated by additional experimental evidence, such as northern blots or quantitative PCR (39,40) (Table 1).

RNA-seq protocols without enrichment for 5′-PPP ends are classified as weak evidence

In addition to the enrichment for primary transcripts, other measures to minimize false TSSs have been employed. These include the use of cutoff values for sequence counts (41,42) or restricting the location of potential TSSs to certain windows within 5′-untranslated regions. Cutoff values are claimed to be efficient measures to reduce the background noise of read starts.

However, these are not suited to reduce the number of false positives derived from non-random RNA degradation (43), stochastic transcriptional events (10) and PCR biases that arise during library construction. Non-random RNA degradation is in part due to sequence preferences for AU-rich regions, as shown for RNase E, as well as hotspots for RNases due to secondary structure elements of the RNA (43–45). Similarly, restricting the location of TSSs to certain windows within the 5′-untranslated region of a gene (41) does enrich for bona fide TSSs, but does not efficiently exclude RNA degradation products. In addition, this strategy overlooks genuine TSSs located within genes and in antisense orientation. A recently described transcriptome sequencing approach is flow cell reverse transcription sequencing (FRT-seq) (46), in which RNA is reverse transcribed on the flow cell without further amplification of the cDNA. FRT-seq avoids biases that are introduced at the amplification step, but like RNA-seq, it does not discriminate sufficiently between primary and processed or degraded transcripts. Accordingly, we rate these protocols as weak evidence (Table 1).
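The TSS rules of Table 1 can be summarized as a small decision function. This is a sketch for illustration, not RegulonDB code: the function and parameter names are hypothetical, and assigning the code RS to an ncRNA TSS that lacks independent validation is an assumption based on the statement that such TSSs are only classified as strong evidence once the ncRNA is validated.

```python
from typing import Iterable, Tuple


def classify_tss_evidence(
    enrichment_strategies: Iterable[str],
    consistent_replicates: bool,
    is_ncrna: bool = False,
    ncrna_validated: bool = False,
) -> Tuple[str, str]:
    """Return (evidence code, strength) for a TSS identified by RNA-seq."""
    # Strong evidence needs at least two different enrichment strategies
    # for primary transcripts plus consistent biological replicates.
    distinct = set(enrichment_strategies)
    if len(distinct) < 2 or not consistent_replicates:
        return ("RS", "weak")
    if is_ncrna:
        # ncRNA TSSs additionally need evidence for the non-coding gene
        # (assumption: without it, the TSS stays at the weak code RS).
        if ncrna_validated:
            return ("RS-EPT-ENCG-CBR", "strong")
        return ("RS", "weak")
    return ("RS-EPT-CBR", "strong")


print(classify_tss_evidence(["TEX", "adaptor ligation"], True))
# → ('RS-EPT-CBR', 'strong')
print(classify_tss_evidence(["TEX"], True))
# → ('RS', 'weak')
```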

Classification of HT protocols—TUs

Identification of TUs by RNA-seq and microarrays

HT technologies assign TUs if the expression levels of neighbouring genes correlate. Using microarrays (47–50) or RNA-seq analyses (11,41,42), TUs can be inferred by mapping the hybridization intensities or peak values onto the bacterial genome. Operons are assigned if the continuous coverage extends into one or more co-directional neighbouring genes, including the intergenic regions. Evaluation of expression levels is frequently combined with computational approaches for the prediction of operons, which integrate, for instance, intergenic distances or the location of promoter and TF-binding sites (TFBSs) (51). However, the assignment of TUs on the basis of expression correlation has several limitations. For instance, signal intensities might not correlate with a particular TU if additional transcripts, driven by internal promoters, overlap the TU. Furthermore, differentiation between co-transcription and co-regulation of neighbouring genes that are expressed under similar growth conditions is ambiguous.
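The coverage-based assignment described above can be sketched as follows. The sketch assumes a per-position coverage list for one strand-agnostic toy genome and a fixed coverage threshold; the gene names, coordinates and the threshold are invented for illustration, and the sketch deliberately ignores the limitations discussed here (internal promoters, co-regulation without co-transcription).

```python
from typing import List, NamedTuple, Sequence


class Gene(NamedTuple):
    name: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str  # "+" or "-"


def group_by_continuous_coverage(
    genes: Sequence[Gene], coverage: Sequence[int], min_cov: int = 5
) -> List[List[str]]:
    """Group adjacent co-directional genes joined by uninterrupted coverage."""
    groups: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if groups:
            prev = groups[-1][-1]
            # Coverage must stay above threshold across the previous gene,
            # the intergenic region and the current gene.
            span = coverage[prev.start:gene.end]
            if gene.strand == prev.strand and all(c >= min_cov for c in span):
                groups[-1].append(gene)
                continue
        groups.append([gene])
    return [[g.name for g in grp] for grp in groups]


# Toy example: geneA and geneB share coverage, geneC is on the other strand.
coverage = [8] * 21 + [0] * 3 + [6] * 6
genes = [
    Gene("geneA", 0, 10, "+"),
    Gene("geneB", 13, 21, "+"),
    Gene("geneC", 24, 30, "-"),
]
print(group_by_continuous_coverage(genes, coverage))
# → [['geneA', 'geneB'], ['geneC']]
```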

Another limitation is that the sequence coverage frequently varies considerably over the length of a transcript. Such non-uniform read distributions occur during the random hexamer priming and PCR amplification step, due to positional nucleotide biases, GC content (31,52), and transcript length biases (53,54). Depending on the fragmentation method employed, read coverages are differently biased towards the transcript ends (23,55). Coverage is more uniform within the transcript if the RNA is fragmented prior to reverse transcription, but relatively depleted for both 5′- and 3′-ends, while fragmentation of the cDNA creates biases towards the 3′-end (23).

Like RNA-seq, microarray analyses suffer from limitations, such as measurement noise, biases due to systematic variations between experimental conditions or sample handling, labelling biases and preferential amplification due to the variable hybridization strength of the probe–target pairs (56–58). Microarray analyses also suffer from signal saturation errors and exhibit a much narrower dynamic range when compared with RNA-seq (59).

Therefore, the identification of TUs on the basis of uniform levels of signal intensities, using either RNA-seq or microarray analysis, is ambiguous and classified as weak evidence with two exceptions. One exception is the identification of a monocistronic TU that is flanked by neighbouring genes transcribed in the opposite direction, which is classified as strong evidence (Table 1). The other exception is the detection of cotranscribed genes in the same mRNA molecule using paired-end RNA-seq with different insert sizes (1,60). This method provides strong evidence that both RNA ends are derived from the same transcript. As is the case for other methods that are classified as strong evidence, this requires in addition validation by consistent biological replicates (Table 1).

Classification of HT protocols—regulatory interactions

Evidence for regulatory interactions derived from gene expression analysis

Transcriptome analysis by RNA-seq or microarrays may also provide evidence for regulatory binding sites (61–64), based on a comparative analysis of the expression of potential target genes, and dependent on changes in the activity of the TF. For instance, in classical experimentation, a commonly used technique is the analysis of a promoter-lacZ fusion in response to the deletion, over-expression or mutation of the TF. HT transcriptional profiling monitors the entire cascade of changes in gene expression, as a response to the deletion or overexpression of a regulatory protein. However, these responses include indirect effects, such as the regulation by additional TFs, sRNAs, as well as effects due to metabolic changes induced by the altered gene expression. Therefore, as is the case for classical gene expression analyses, the identification of regulatory binding sites by global transcriptome analyses is classified as weak evidence (Table 1).

An alternative method used for the characterization of regulatory networks of TFs and sigma factors is run-off transcription-microarray analysis (ROMA) (65–67). ROMA resembles an HT in vitro transcription assay, using purified RNAP, regulatory proteins and a genomic DNA pool as the template. The resulting mRNA pool is subsequently reverse transcribed into cDNA and analysed on microarrays, relative to the transcripts generated in the absence of the regulatory protein. In contrast to in vivo transcriptional profiling, ROMA avoids false positives stemming from indirect regulation and offers an advantage in the detection of short-lived mRNA transcripts. However, ROMA includes other sources of false positives, most importantly read-through transcripts into adjacent genes due to inefficient transcription termination in vitro, as well as ambiguities derived from impure protein preparations or the microarray analysis as such (65). Therefore, ROMA is classified as weak evidence (Table 1).

Use of chromatin immunoprecipitation technology for the identification of TFBSs

The chromatin immunoprecipitation (ChIP) technology allows probing protein–DNA interactions inside living cells and has been widely used to characterize regulatory transcriptional networks under various physiological conditions (68–71). Briefly, proteins that interact with DNA are covalently crosslinked in vivo to their target sites with formaldehyde. Cells are subsequently lysed and the chromatin is fragmented by sonication or enzymatic treatment. Next, DNA fragments carrying crosslinked protein are co-immunoprecipitated using a highly specific antibody directed against the protein of interest. After reversal of the crosslinking, the enriched DNA fragments are analysed either by hybridization to microarrays, designed as low- or high-density tiling arrays (ChIP-on-chip or ChIP-chip), or by HT sequencing (ChIP-seq), followed by a computational analysis of the sequence data, which involves a statistical analysis for quality control and normalization of the data, the identification of significantly enriched regions and the identification of binding motifs.

Resolution in the initial mapping of the binding regions is much higher for ChIP-seq when compared with ChIP-chip. In ChIP-chip, resolution depends on several factors, such as the size of the fragments generated by shearing, or the density of the tiling arrays, and usually is within a range of 300–500 bp (72), while resolution in ChIP-seq is up to a single base pair with reduced noise and a broader dynamic range (73). For these reasons and due to the rapid development of next-generation sequencing techniques, ChIP-seq is rapidly replacing the analysis by microarrays.

The DNA library obtained by co-immunoprecipitation is enriched in DNA fragments carrying the desired binding regions, but it is not pure. The challenge in ChIP technology is to identify the DNA fragments carrying the bona fide binding sites in a large background, a source of systematic and stochastic noise. False positives can occur at all three basic steps in ChIP technology: (i) the preparation of the DNA pool carrying the potential binding sites, (ii) the characterization of the DNA fragments by hybridization to the microarrays or HT sequencing and (iii) the computational analysis including mapping of the potential binding regions to the genome, peak detection and sequence motif analysis. For instance, false positives derived from the preparation of the DNA pool can be due to non-specific interactions of the protein of interest with DNA or other DNA-binding proteins, or due to cross-reactivity of the antibody. In addition, systematic variations between experimental conditions, such as sample handling, or biases introduced during labelling or amplification steps, such as a GC bias, give rise to false positives at the peak-calling step (31,68,73,74). High background noise has been reported to result from complementary sequences or non-unique gene loci on the chromosome as well as insufficient RNase treatment (75). In addition, it has been reported that false positives can be caused by large protein–DNA complexes, which preferentially form at highly transcribed regions. Such complexes can survive washing and elution steps due to the incomplete reversion of crosslinking and retention of the complexes in spin columns. These complexes are eluted at a later step under denaturing conditions, resulting in a contamination of the DNA pool (75).

Since, as mentioned above, the lengths of the enriched sequences vary between 300 and 500 bp, this partial result still requires the precise computational identification of the binding sites. Some of the sequences might be false positives with no TFBSs, whereas other sequences may have binding sites for other cofactors. In order to control these issues, and for homogeneity in the evaluation of experiments performed in different laboratories, the best alternative would ideally be the use of a common computational strategy with well-established programs universally available to the community.

In conclusion, even though ChIP technology is a powerful method, it carries several potential pitfalls and is classified as weak evidence (Table 1). However, confidence scores for individual binding sites can be assessed by a standardized statistical analysis to allow a higher classification of strength of evidence for a subset of the data. This is discussed in more detail in sections below.

Use of genomic systematic evolution of ligands by exponential enrichment for the identification of TFBSs

Genomic systematic evolution of ligands by exponential enrichment (SELEX) is a variant of the classic SELEX protocol. Like ChIP technology, it is a powerful technique to identify DNA-binding sites for a TF. Its basic principle is to enrich fragmented genomic DNA (whereas classic SELEX starts with random DNA) in several iterative cycles consisting of the binding reaction, affinity purification of the complexes formed between DNA and the protein of interest, and amplification of the potential target regions (76–79). One major difference between the ChIP and the SELEX technology is that ChIP is directed towards the identification of sites that are bound in vivo under specific growth conditions, while SELEX identifies binding sites that are bound in an in vitro reaction. In SELEX, false positives can originate from aggregates or unspecific interactions with the affinity matrix. The selection for such nonspecifically bound DNA fragments depends strongly on the number of iterative cycles (78). In addition, the binding conditions, for instance ionic strength or pH, as well as the high local concentration of protein–DNA complexes upon enrichment on the affinity matrix, might not reflect physiological conditions. Therefore, genomic SELEX as such is classified as weak evidence (Table 1). Classification as strong evidence requires additional, independent evidence that the identified sites function in vivo (see section for cross-validation).

Statistical validation of ChIP data and consistency with position weight matrices generated from classic experimental evidence

Regulatory binding sites exhibit characteristic sequence patterns, which are commonly represented as sequence logos or position weight matrices (PWMs) and describe the specificity of a DNA-binding protein (80,81). Such PWMs represent a weighted average of aligned sequences and provide the basis for the genome-wide computational predictions of TFBSs (82,83). The sequence motif analysis serves to pinpoint the exact location of binding sites in potential target regions obtained by ChIP. This can be achieved either by scanning for a known sequence motif or by performing a de novo motif analysis (84,85). Moreover, binding sites identified by such a sequence motif analysis come with a statistical confidence score and/or P-value. This offers the possibility to rate the confidence levels of the identified objects according to these values and, using a stringent threshold value, to validate subsets of the identified binding sites as strong evidence.
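How a PWM pinpoints a site within a target region can be illustrated with a minimal log-odds sketch. It assumes a uniform background, a pseudocount of 1 and a handful of invented aligned sites; it is a generic textbook construction and does not reproduce the scoring or P-value calculation of any particular tool.

```python
import math
from typing import Dict, List, Sequence, Tuple

PWM = List[Dict[str, float]]


def build_pwm(sites: Sequence[str], pseudocount: float = 1.0,
              background: float = 0.25) -> PWM:
    """Build a log2-odds PWM from equally long, aligned binding sites."""
    total = len(sites) + 4 * pseudocount
    pwm: PWM = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm: PWM, seq: str) -> float:
    """Sum the column weights for one sequence of the same width as the PWM."""
    return sum(col[base] for col, base in zip(pwm, seq))


def best_hit(pwm: PWM, region: str) -> Tuple[float, int]:
    """Scan a region and return (best score, 0-based start of the best window)."""
    width = len(pwm)
    return max((score(pwm, region[i:i + width]), i)
               for i in range(len(region) - width + 1))


# Invented aligned sites and an invented target region.
sites = ["TTGACA", "TTGACA", "TTGACT", "TTTACA"]
pwm = build_pwm(sites)
print(best_hit(pwm, "GGCATTGACAGG")[1])  # → 4
```

Only the forward strand is scanned here; a realistic scan would also consider the reverse complement and convert scores into P-values against a background model.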

For consistency, such an approach requires the use of defined algorithms and criteria. Here, we present an approach to evaluate the confidence levels of TFBSs using the tools ‘matrix-quality’ (86,87), ‘peak-motifs’ (88), ‘footprint-discovery’ (89) and ‘matrix-scan’ (90), which belong to the software suite Regulatory Sequence Analysis Tools (RSAT) (90). These tools are publicly available at http://rsat.ulb.ac.be/, together with documentation on their use.

To identify sites with high confidence, we first obtain a PWM using peak-motifs or footprint-discovery. Peak-motifs facilitates the discovery of binding motifs using a combination of several algorithms at a time, and it detects not only the strongest motif but also secondary ones, providing valuable information concerning cofactors and the mechanism of function of TFs (88). The major difference between this and other previously proposed algorithms lies in its efficiency. The program is significantly faster than other comparable algorithms and allows motif discovery in full-size ChIP datasets (88). Thus, peak-motifs allows PWMs to be built from a set of known binding sites, or a de novo motif analysis to be performed using the raw ChIP data as an input. The discovered motif is compared with the annotated matrices in RegulonDB, to detect whether it corresponds to the annotated one for the TF of the ChIP experiment. Alternatively, a multi-genome approach is useful in cases where only a few binding sites are known for a given TF and there is no annotated matrix. Using the program ‘footprint-discovery’, conserved motifs in promoter regions of orthologous target genes (phylogenetic footprints) can be detected at different taxonomical levels of E. coli (86,89).

Next, the quality of the discovered PWMs, that is, the discriminative power of the matrices, can be evaluated by using the program matrix-quality. This program analyses matrices by comparing the theoretical and empirical weight score distributions for each PWM in a group of sequences (86). It can also be used to evaluate the quality of raw datasets derived from ChIP experiments, that is, to evaluate the level of enrichment for putative TFBSs in different collections of sequences, for a given PWM (86). The program uses one matrix representing the TF-binding motif and the peak sequences as input. The output is a graph displaying one curve for the enrichment expected by chance and one for the observed enrichment in the peaks. These two curves should differ clearly in the enrichment of binding sites with high scores (86). If there is no enrichment, there are two possible explanations: false positives dilute the collected regions, or the TFBSs in that collection differ considerably from the previously reported ones used to build the matrix.

Using the PWM with the best enrichment, TFBSs that pass a threshold P-value are identified and localized using matrix-scan. In contrast to genome-wide computational prediction of binding sites, our approach for statistical validation requires that the positive predictive value is strongly favoured at the expense of sensitivity. This is important to prevent spurious sites from being accepted with strong evidence or confidence. We use a P-value of 1e−5 or lower as a stringent cutoff. Binding sites that meet this cutoff are classified as strong evidence, and binding sites that do not, as weak evidence (Table 1). It is important to note that this approach for evaluating sites produced by ChIP-seq is consistent with the evaluation of the quality of PWMs coming from manual curation (86). That is to say, we are being congruent in assessing evidence for knowledge, irrespective of the methods used to generate it. For a full application of the pipeline to a ChIP-chip experiment on PurR sites, see the new RegulonDB paper and Supplementary Material (87).
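The thresholding step can be written down in a few lines. The cutoff and the evidence codes CHIP-SV and CHIP come from the text and Table 1; the function name and the example P-values are hypothetical, and the sketch assumes that a P-value per site has already been produced by the motif scan.

```python
from typing import Tuple

P_VALUE_CUTOFF = 1e-5  # stringent cutoff stated in the text


def classify_chip_site(p_value: float) -> Tuple[str, str]:
    """Return (evidence code, strength) for a ChIP-derived binding site."""
    if p_value <= P_VALUE_CUTOFF:
        return ("CHIP-SV", "strong")  # statistically validated site
    return ("CHIP", "weak")           # ChIP evidence without validation


# Invented P-values for three hypothetical sites.
for site, p in [("site_1", 3e-7), ("site_2", 1e-5), ("site_3", 2e-4)]:
    print(site, classify_chip_site(p))
# site_1 ('CHIP-SV', 'strong')
# site_2 ('CHIP-SV', 'strong')
# site_3 ('CHIP', 'weak')
```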

Classification of multiple evidence and introduction of the new confidence score ‘confirmed’

In the past, we have judged and classified the strength of evidence for single types of evidence. As a consequence, the strength of evidence for a given object or assertion was derived from one experiment, which is the experiment with the highest score. However, in scientific experimental research, an assertion and its degree of confidence are usually derived from a combination of different approaches. Such additional experiments are conducted with two intentions: to confirm or reproduce the assertion on the one hand, and to exclude alternative explanations on the other hand. Reproducibility is a prerequisite; to account for it in HT experiments, we require the use of biological replicates as well as the use of at least two independent enrichment strategies for the assignment of strong evidence to RNA-seq methods. We now present a strategy to account for the second intention, the exclusion of alternative explanations or false positives, termed ‘independent cross-validation’.

A decrease in the number of false positives is achieved, if false positives can be mutually excluded by evaluating the results of two methods or strategies together, compared with each experiment alone (Figure 1). This requires that the following conditions are met. (i) The two methodologies have to be independent, that is, they should not use common raw materials or common experimental steps. (ii) Both methods have to point to the same object or assertion. Both approaches might, however, analyse different aspects or properties of the assertion. For instance, a promoter can be located by the identification of a TSS or an RNAP-binding site. Cross-validation of TFBSs and promoters requires that the exact location of the object is specified for each individual evidence. For instance, gel mobility shift assays provide evidence for the interaction with a binding region, but the exact location is not determined, and therefore cannot be combined with other evidence for cross-validation of TFBSs. (iii) There must be little overlap in potential false positives or alternative explanations for both independent methodologies. For instance, genuine TSSs mapped by transcription initiation mapping are diluted by false positives derived from RNA processing or degradation. However, these TSSs can be validated by RNAP FP since false positives derived from RNA degradation or processing are excluded by the second experiment. Therefore, if combined, the intersection of both methods should contain TSSs with a higher confidence level than the individual experiments alone. 
In contrast, the combined evaluation of the following two methodologies does not result in a higher confidence level: To confirm that an activator binds to the 5′-upstream region of a target gene and regulates its expression, it is either possible to analyse in vivo expression of a promoter–reporter gene fusion in a wild-type and mutant background, or to perform gel-mobility-shift assays using cell extracts of wild-type and mutant strains. Here, the alternative explanation for a positive result, which is the indirect regulation of the target gene, is not excluded when evaluating the results of both methods together since it is common to both. (iv) Finally, as a fourth requirement, the sample population needs to be large enough to ensure a low probability for the coincidental identification of a false positive by the two independent methodologies. (v) Cross-validation of HT experiments requires consistent biological replicates.
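The idea that the intersection of two independent methods carries higher confidence can be sketched as a position-matching step. The sketch assumes that both methods report exact genomic positions for the same kind of object; the tolerance window, the function name and the example coordinates are hypothetical and are not part of the criteria listed above, and the sketch does not check conditions (i), (iii), (iv) or (v).

```python
from typing import Iterable, List


def cross_validated(
    positions_a: Iterable[int],
    positions_b: Iterable[int],
    tolerance: int = 0,  # assumed matching window in bp, illustrative only
) -> List[int]:
    """Return positions from method A that are supported by method B."""
    reference = sorted(positions_b)
    return [
        pos for pos in sorted(positions_a)
        if any(abs(pos - ref) <= tolerance for ref in reference)
    ]


# Invented positions: TSSs from initiation mapping vs. RNAP footprints.
tss_mapping = [120, 560, 900]
rnap_footprints = [121, 900, 1500]
print(cross_validated(tss_mapping, rnap_footprints, tolerance=2))  # → [120, 900]
```

The position at 560 has no counterpart in the second data set and therefore keeps the confidence level of the single experiment.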

Figure 1

Schematic overview of the evaluation of confidence in RegulonDB. Confidence is evaluated in two stages. In the first stage, individual methods are classified as weak or strong evidence. In the second stage, subsets of data are validated by integrating multiple types of evidence using two strategies, statistical validation and independent cross-validation. Statistical validation is applied to ChIP datasets. It involves the evaluation of both the quality of the dataset and the quality of the discovered PWMs, and validates binding sites that score above a stringent threshold value. Cross-validation integrates multiple types of evidence and requires that the types of evidence that are combined are independent and mutually exclude false positives. Weak evidence is cross-validated to strong evidence, whereas strong evidence is cross-validated to confirmed evidence.

Using these criteria, we can now define combinations of HT experiments or classical evidence that allow an upgrade from weak to strong evidence (Table 2). Moreover, it is also possible to cross-validate data that have already been classified as strong evidence. To this end, a third confidence score, designated ‘confirmed’, is introduced. The possible combinations of experiments that allow an upgrade to confirmed confidence are also shown in Table 2. With this approach, we can create a new class of objects or assertions that are annotated in RegulonDB with very high reliability, a step towards building gold standard sets.

Table 2

Independent cross-validation of weak and strong evidence

Cross-validation of weak evidence
Regulatory interactions	Genomic SELEX, ROMA (run-off transcription-microarray analysis)
	In vivo gene expression analysis
Cross-validation of strong evidence
Promoter	FP with purified RNA-polymerase
	In vitro transcription assay using purified proteins
	Transcription initiation mapping; examples: 5′-RACE; primer extension; nuclease S1 mapping; RNA-seq data, classified as strong evidence
	Evidence inferred from SM; example: expression analysis when putative promoter element is mutated
TFBSs	FP using purified protein
	Evidence inferred from SM; example: expression analysis when putative TFBSs are mutated
	ChIP data, classified as strong evidence; example: ChIP data, statistically validated
	Genomic SELEX data, classified as strong evidence; example: Genomic SELEX, cross-validated by in vivo gene expression analysis
TUs	Polar mutations which affect transcription of a downstream gene
	Northern blotting; RNA-seq data classified as strong evidence

For each object, the table lists the types of evidence that can be combined with each other to allow an upgrade to confirmed confidence. Any two methods from different rows can be combined, whereas types of evidence in the same row cannot. For instance, different protocols for transcription initiation mapping cannot be combined for cross-validation, since these methods all use mRNA as the starting material and therefore share a common source of false positives, namely RNA processing or degradation. Similarly, TUs identified by northern blotting cannot be cross-validated by RNA-seq. Cross-validation of TFBSs and promoters requires that each individual piece of evidence specifies the exact location of the object.
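The row rule of Table 2 for strong evidence can be written down as a small lookup. The following Python sketch is only an illustration of that rule, not the curation code used for RegulonDB. The short evidence labels are hypothetical, except FP, SM and CHIP-SV, which are used in the text, and the requirement that each piece of evidence specifies the exact location of the object is not modelled.

```python
from typing import Dict, List, Set

# One set per row of Table 2 (cross-validation of strong evidence).
# Evidence types in the same row cannot validate each other.
# Labels other than FP, SM and CHIP-SV are illustrative assumptions.
TABLE2_ROWS: Dict[str, List[Set[str]]] = {
    "promoter": [
        {"FP-RNAP"},
        {"IN-VITRO-TRANSCRIPTION"},
        {"5-RACE", "PRIMER-EXTENSION", "S1-MAPPING", "RNA-SEQ-STRONG"},
        {"SM"},
    ],
    "TFBS": [
        {"FP"},
        {"SM"},
        {"CHIP-SV"},
        {"GSELEX-STRONG"},
    ],
    "TU": [
        {"POLAR-MUTATION"},
        {"NORTHERN", "RNA-SEQ-STRONG"},
    ],
}


def confidence(object_type: str, strong_evidence: Set[str]) -> str:
    """Return 'confirmed' if the evidence spans two or more rows, else 'strong'.

    Every label in strong_evidence is assumed to be rated strong already.
    Labels that do not occur in Table 2 for this object type are ignored.
    """
    rows = TABLE2_ROWS[object_type]
    rows_hit = {i for i, row in enumerate(rows) if row & strong_evidence}
    if not rows_hit:
        raise ValueError("no evidence listed in Table 2 for this object type")
    return "confirmed" if len(rows_hit) >= 2 else "strong"


print(confidence("TFBS", {"FP", "CHIP-SV"}))                  # confirmed
print(confidence("promoter", {"5-RACE", "PRIMER-EXTENSION"}))  # strong
```

Counting distinct rows rather than distinct experiments is the design choice that matters here: two protocols from the same row add no confidence, because they share a source of false positives.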

To exemplify this approach, we have cross-validated the evidence for TFBSs of PurR. Table 3 shows the strong types of single evidence from classical experiments that support the individual PurR-binding sites, namely FP and evidence derived from a mutational analysis of the TFBSs (site mutation, SM). In addition, most of these sites are supported by strong evidence derived from the statistical validation of an HT ChIP-chip analysis (87). All three types of evidence, FP, SM analysis and statistically validated ChIP-chip data (CHIP-SV), can be combined for independent cross-validation (Table 2). As a result, 14 out of 23 TFBSs are cross-validated to confirmed evidence, while 9 TFBSs are supported by a single type of strong evidence and are not cross-validated (Table 3).

Table 3

Independent cross-validation of single types of evidence for PurR-binding sites

^aFor each gene or operon, the evidence types that are annotated as strong evidence in RegulonDB are given, as well as the strong evidence derived from the statistical validation of a ChIP-chip analysis of PurR-binding sites (61, 87). ^bFor independent cross-validation, the three evidence types, FP, SM analysis and ChIP-chip data rated as strong evidence by statistical validation (CHIP-SV) (87), are combined pairwise to confirmed evidence.

With the exception of glyA, all of the confirmed binding sites belong to genes involved in the central pathways for the de novo synthesis of purines and pyrimidines (Figure 2), which is in agreement with the role of PurR as the master regulator of these pathways. TFBSs that are supported by strong evidence but not upgraded to confirmed evidence belong either to these pathways, to genes involved in nucleoside or nucleobase uptake (codBA, tsx and xanP), or to genes of nitrogen metabolism (glnB and speA). This demonstrates that independent cross-validation is well suited to identifying data that reflect the well-established knowledge of the scientific literature, representing the ‘textbook knowledge’ in RegulonDB.

Figure 2

De novo pathways of purine and pyrimidine synthesis in E. coli. PurR is the master regulator for purine (left) and pyrimidine (right) de novo biosynthesis. Genes that carry binding sites that have been cross-validated to confirmed evidence are shown in bold. With the exception of glyA (not shown), all genes that carry binding sites supported by confirmed evidence belong to these two central pathways of nucleotide biosynthesis. Abbreviations: PRPP, 5-phosphoribosyl-1-diphosphate; PRA, 5-phosphoribosylamine; GAR, 5′-phosphoribosyl-1-glycinamide; FGAR, 5′-phosphoribosyl-N-formylglycinamide; FGAM, 5′-phosphoribosyl-N-formylglycinamidine; AIR, 5′-phosphoribosyl-5-aminoimidazole; N5-CAIR, 5′-phosphoribosyl-5-aminoimidazole-N-5-carboxylate; CAIR, 5′-phosphoribosyl-5-aminoimidazole-4-carboxylate; SAICAR, 5′-phosphoribosyl-4-(N-succinocarboxamide)-5-aminoimidazole; AICAR, 5′-phosphoribosyl-4-carboxamide-5-aminoimidazole; FAICAR, 5′-phosphoribosyl-4-carboxamide-5-formamidoimidazole; IMP, inosine 5′-monophosphate; AMP, adenosine 5′-monophosphate; GMP, guanosine 5′-monophosphate; Gln, glutamine; CP, carbamoyl phosphate; CA, carbamoyl aspartate; DHO, dihydroorotate; OA, orotate; OMP, orotidine 5′-monophosphate; UMP, uridine 5′-monophosphate; CTP, cytidine 5′-triphosphate.

Discussion

The data collected in RegulonDB are diverse in two respects. On the one hand, the different types of evidence vary widely in confidence; on the other hand, the objects themselves, e.g. TUs, TFBSs or promoters, have different characteristics and are supported by different types of evidence. As a consequence, we need a strategy for confidence assessment that is applicable to all kinds of objects and that makes the strengths of confidence comparable between the different types of objects.

The criteria presented here follow the same principles of science as applied by wet-laboratory scientists, where data are confirmed by repetitions on the one hand, and by additional experimental strategies to exclude alternative explanations on the other.

The rating of the single evidence is the primary criterion for reliability and provides the foundation of our classification scheme. Validation of the data, to upgrade from weak to strong or from strong to confirmed evidence, additionally requires high congruence, that is, confirmation of the data by truly independent methods that reduce alternative explanations for the findings. This approach is superior to a strategy in which confidence is rated solely according to the number of experiments supporting the assertion, irrespective of the type of evidence. Such a rating system could introduce a bias by giving weight to spurious alternative explanations.

It should be pointed out that evidence or confidence scores are always an estimate, not a precise rating. When rating a type of evidence, we rate the protocol as such, but it is difficult to judge whether the protocol has been properly implemented in a given experiment. This ambiguity also pertains to classical wet-laboratory experiments. For instance, in RegulonDB, a gel mobility shift assay using purified proteins is rated as strong evidence for TFBSs. However, the reliability of such an experiment strongly depends on the conditions, such as salt concentration, pH or protein concentration. Using too high a protein concentration increases the risk of nonspecific interactions or even of binding by a contaminating protein present in the preparation. Judging whether such an experiment has been conducted properly is, at least in part, also the task of the peer-review process for the publication of results.

To judge the confidence level of single types of evidence, the ideal solution would be to precisely assess the success rate of each evidence type, that is, to determine how often an assertion derived from a certain type of evidence is confirmed or disproved by subsequent experiments. However, in scientific publications, an assertion is usually supported by several different experiments, which are conducted in parallel to confirm the statement or to disprove alternative models. Therefore, each published piece of evidence is validated to varying extents by the accompanying pieces of evidence, and an assessment of the success rate of an individual type of evidence would actually measure the averaged overall confidence of the published datasets, together with the additional cited evidence used to support the assertion. For instance, a common method to study the regulation of a target gene by a TF is gene expression analysis, in which expression of a fusion between the target promoter and a reporter gene is measured. In RegulonDB, this is classified as weak evidence because of the potential for indirect regulatory mechanisms. In classical experimentation, gene expression analysis is frequently validated by in vitro DNA-binding experiments, which are classified as strong evidence. In fact, all 17 PurR-binding sites that are supported by FP (Table 3) are also supported by gene expression analysis, in most cases within the same study. Thus, in an evaluation of the success rate of classical gene expression analysis, this evidence would inherit an apparently strong confidence from the FP experiments. In contrast to these classical experiments, the HT gene expression analysis by Cho et al. (61) finds that the expression of 56 genes or operons is directly or indirectly affected in response to PurR and adenine. This difference in the number of targets detected by classical and HT gene expression analysis demonstrates the potential for detecting indirect regulation, as well as the extent to which classical experiments are verified by additional experiments within each individual study. Therefore, to achieve an adequate rating of single types of evidence, we have to build on our knowledge and expert judgement of direct versus indirect effects and alternative regulatory mechanisms. This will provide the foundation for the overall classification of strength of confidence in RegulonDB.

Our three-tier rating system allows the user to recognize the confidence level of individual data at a glance. To this end, the different degrees of confidence have to be clearly visualized. Currently, weak and strong evidence are visually distinguishable both in RegulonDB and in EcoCyc. For instance, promoters with strong evidence are displayed with a solid-line arrow, whereas those with weak evidence are displayed with a dashed-line arrow. This system can easily be extended by using thick solid lines for confirmed objects.
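The display convention described above amounts to a mapping from confidence level to arrow style. A minimal Python sketch of such a mapping follows. It is an illustration only: the class and function names and the numeric line widths are assumptions, and it does not describe how RegulonDB or EcoCyc implement their displays.

```python
from enum import Enum
from typing import Dict, Tuple


class Confidence(Enum):
    WEAK = "weak"
    STRONG = "strong"
    CONFIRMED = "confirmed"


# Dashed for weak and solid for strong follow the text; the thick solid line
# for confirmed objects is the proposed extension. Widths are hypothetical.
ARROW_STYLE: Dict[Confidence, Tuple[str, float]] = {
    Confidence.WEAK: ("dashed", 1.0),
    Confidence.STRONG: ("solid", 1.0),
    Confidence.CONFIRMED: ("solid", 2.5),
}


def arrow_style(level: Confidence) -> Tuple[str, float]:
    """Return (line style, line width) for a promoter arrow."""
    return ARROW_STYLE[level]


for level in Confidence:
    print(level.value, arrow_style(level))
```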

Another closely related question is how the different data types, namely computational predictions, HT data and classical wet-laboratory experiments, are going to be displayed and made available to users. At present, we filter HT-generated data and only add, for instance, ChIP sites that have an identified binding site within the expected upstream regions close to promoters. In addition, computationally predicted promoters are included within upstream regions only if there is no experimentally determined promoter within the region. These two cases illustrate our role, which can be described as that of ‘strict guardians’ of the classic paradigm of transcriptional regulation. The advantage of this policy is that the number of less reliable data is kept to a minimum. The drawback is that we might be losing valuable information. In fact, we have had situations where a predicted promoter was withdrawn after the experimental identification of a second promoter in the same region, but had to be annotated again later after confirmation by additional experiments. Since computational predictions and HT data are very valuable for the scientific community, we definitely need an annotation policy for displaying data of diverse origins (classical experiments, computational predictions and HT data) in an integrated fashion.

Given the criteria proposed here, we consider it a better and more useful strategy for the community to expand the ‘downloadable datasets’ that have been available in RegulonDB for years, and to offer a variety of complete datasets, including HT-generated datasets, in a separate genome browser. A menu will let the user select which datasets to display, so that the data can be toggled in and out on demand, using either the data type or the confidence score as a filter. The HT-generated datasets will first be marked with our confidence score following the criteria discussed here. The information needed by any laboratory to submit a dataset is available in RegulonDB.
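Filtering by data type or by confidence score, as envisaged above, could look roughly like the following Python sketch. It is a hypothetical illustration and not the genome browser's implementation: the record fields, the data-type labels and the dataset names are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

# Ordering of the three confidence tiers, lowest to highest.
CONFIDENCE_RANK = {"weak": 0, "strong": 1, "confirmed": 2}


@dataclass(frozen=True)
class Dataset:
    name: str
    data_type: str    # e.g. "classical", "HT" or "computational" (assumed labels)
    confidence: str   # "weak", "strong" or "confirmed"


def visible_datasets(
    datasets: Iterable[Dataset],
    data_types: Optional[Set[str]] = None,
    min_confidence: str = "weak",
) -> List[Dataset]:
    """Keep datasets of the selected types at or above min_confidence.

    data_types=None means that no filter on data type is applied.
    """
    threshold = CONFIDENCE_RANK[min_confidence]
    return [
        d for d in datasets
        if (data_types is None or d.data_type in data_types)
        and CONFIDENCE_RANK[d.confidence] >= threshold
    ]


# Hypothetical datasets, for illustration only.
catalogue = [
    Dataset("predicted promoters", "computational", "weak"),
    Dataset("ChIP sites, statistically validated", "HT", "strong"),
    Dataset("cross-validated TFBSs", "classical", "confirmed"),
]
print([d.name for d in visible_datasets(catalogue, min_confidence="strong")])
print([d.name for d in visible_datasets(catalogue, data_types={"HT"})])
```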

We consider the proposed three-tier system a logical and consistent expansion of the strong and weak assignments that we have used for years. This confidence assignment will facilitate the comparison and integration of the different sources of knowledge of the regulatory network of E. coli. It will also facilitate future benchmarking studies for predictive methods as well as for HT studies. These criteria are not unique to a single bacterium; given the common genome organization of regulatory elements and the common experimental challenges, they should be equally applicable to the biocuration and organization of any bacterial regulatory network.

Acknowledgements

We are grateful to Yalbi Balderas for the TF conformation cross-validation discussion, to Stephen Busby for suggesting the introduction of the third confidence level confirmed, and we would also like to thank Alfredo Mendoza for fruitful discussions. We acknowledge César Bonavides-Martínez for his excellent technical support.

Funding

This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health [GM071962], by the Consejo Nacional de Ciencia y Tecnología (CONACyT) [CB2008-103686-Q and 179997] and by the Programa de Apoyo a Proyectos de Investigación e Innovación Tecnológica (PAPIIT-UNAM) [IN210810 and IN209312]. Funding for open access charge: National Institutes of Health [GM071962].

Conflict of interest. None declared.

References

1. Gama-Castro, Salgado, Peralta-Gil, et al. RegulonDB Version 7.0: transcriptional regulation of Escherichia coli K-12 integrated within genetic sensory response units (gensor units). Nucleic Acids Res., 2011, Database issue, pp. D98–D105.
2. Keseler, Collado-Vides, Santos-Zavaleta, et al. EcoCyc: a comprehensive database of Escherichia coli biology. Nucleic Acids Res., 2011, Database issue, pp. D583–D590.
3. Lane, Argoud-Puy, Britan, et al. NeXtProt: a knowledge platform for human proteins. Nucleic Acids Res., 2012, Database issue, pp. D76–D83.
4. de Boer, Hughes. YetTFaSCo: a database of evaluated yeast transcription factor sequence specificities. Nucleic Acids Res., 2012, Database issue, pp. D169–D179.
5. Kerrien, Aranda, Breuza, et al. The IntAct molecular interaction database in 2012. Nucleic Acids Res., 2012, Database issue, pp. D841–D846.
6. Licata, Briganti, Peluso, et al. MINT, the molecular interaction database: 2012 update. Nucleic Acids Res., 2012, Database issue, pp. D857–D861.
7. Gama-Castro, Jimenez-Jacinto, Peralta-Gil, et al. RegulonDB (Version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Res., 2008, Database issue, pp. D120–D124.
8. Passalacqua, Varadarajan, Ondov, et al. Structure and complexity of a bacterial transcriptome. J. Bacteriol., 2009, vol. 191, pp. 3203–3211.
9. Perkins, Kingsley, Fookes, et al. A strand-specific RNA-Seq analysis of the transcriptome of the typhoid bacillus Salmonella Typhi. PLoS Genet., 2009, p. e1000569.
10. Yoder-Himes, Chain, Zhu, et al. Mapping the Burkholderia cenocepacia niche response via high-throughput sequencing. Proc. Natl Acad. Sci. USA, 2009, vol. 106, pp. 3976–3981.
11. Sharma, Hoffmann, Darfeuille, et al. The primary transcriptome of the major human pathogen Helicobacter pylori. Nature, 2010, vol. 464, pp. 250–255.
12. Albrecht, Sharma, Reinhardt, et al. Deep sequencing-based discovery of the Chlamydia trachomatis transcriptome. Nucleic Acids Res., 2010, pp. 868–877.
13. Filiatrault, Stodghill, Bronstein, et al. Transcriptome analysis of Pseudomonas syringae identifies new genes, noncoding RNAs, and antisense activity. J. Bacteriol., 2010, vol. 192, pp. 2359–2372.
14. Wang, Mao, et al. Single-nucleotide resolution analysis of the transcriptome structure of Clostridium beijerinckii NCIMB 8052 using RNA-Seq. BMC Genomics, 2011, p. 479.
15. Chaudhuri, Kanji, et al. Quantitative RNA-seq analysis of the Campylobacter jejuni transcriptome. Microbiology, 2011, vol. 157, Pt 10, pp. 2922–2932.
16. Mitschke, Georg, Scholz, et al. An experimentally anchored map of transcriptional start sites in the model cyanobacterium Synechocystis sp. PCC6803. Proc. Natl Acad. Sci. USA, 2011, vol. 108, pp. 2124–2129.
17. Kroger, Dillon, Cameron, et al. The transcriptional landscape and small RNAs of Salmonella enterica serovar Typhimurium. Proc. Natl Acad. Sci. USA, 2012, vol. 109, pp. E1277–E1286.
18. Raghavan, Sage, Ochman. Genome-wide identification of transcription start sites yields a novel thermosensing RNA and new cyclic AMP receptor protein-regulated genes in Escherichia coli. J. Bacteriol., 2011, vol. 193, pp. 2871–2874.
19. Costa, Angelini, De Feis, et al. Uncovering the complexity of transcriptomes with RNA-seq. J. Biomed. Biotechnol., 2010, vol. 2010, p. 853916.
20. Croucher, Thomson. Studying bacterial transcriptomes using RNA-Seq. Curr. Opin. Microbiol., 2010, pp. 619–624.
21. Levin, Yassour, Adiconis, et al. Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nat. Methods, 2010, pp. 709–715.
22. van Vliet. Next generation sequencing of microbial transcriptomes: challenges and opportunities. FEMS Microbiol. Lett., 2010, vol. 302.
23. Wang, Gerstein, Snyder. RNA-seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet., 2009.
24. Mader, Nicolas, Richard, et al. Comprehensive identification and quantification of microbial transcriptomes by genome-wide unbiased methods. Curr. Opin. Biotechnol., 2011.
25. Salgado, Gama-Castro, Peralta-Gil, et al. RegulonDB (Version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Res., 2006, Database issue, pp. D394–D397.
26. Wurtzel, Singh, et al. Validation of two ribosomal RNA removal methods for microbial metatranscriptomics. Nat. Methods, 2010, pp. 807–812.
27. Selinger, Saxena, Cheung, et al. Global RNA half-life analysis in Escherichia coli reveals positional patterns of transcript degradation. Genome Res., 2003, pp. 216–223.
28. Bernstein, Khodursky, Lin, et al. Global analysis of mRNA decay and abundance in Escherichia coli at single-gene resolution using two-color fluorescent DNA microarrays. Proc. Natl Acad. Sci. USA, 2002, pp. 9697–9702.
29. Fouquier d'Herouel, Wessner, Halpern, et al. A simple and efficient method to search for selected primary transcripts: non-coding and antisense RNAs in the human pathogen Enterococcus faecalis. Nucleic Acids Res., 2011, p. e46.
30. Minoche, Dohm, Himmelbauer. Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems. Genome Biol., 2011, p. R112.
31. Dohm, Lottaz, Borodina, et al. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res., 2008, p. e105.
32. Sendler, Johnson, Krawetz. Local and global factors affecting RNA sequencing analysis. Anal. Biochem., 2011, vol. 419, pp. 317–322.
33. Leek, Scharpf, Bravo, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet., 2010, pp. 733–739.
34. Perocchi, Clauder-Munster, et al. Antisense artifacts in transcriptome microarray experiments are resolved by actinomycin D. Nucleic Acids Res., 2007, p. e128.
35. Beiter, Reich, Weigert, et al. Sense or antisense? False priming reverse transcription controls are required for determining sequence orientation by reverse transcription-PCR. Anal. Biochem., 2007, vol. 369, pp. 258–261.
36. Timofeeva, Skrypina. Background activity of reverse transcriptases. Biotechniques, 2001, pp. 24, 26, 28.
37. Nicolas, Mader, Dervyn, et al. Condition-dependent transcriptome reveals high-level regulatory architecture in Bacillus subtilis. Science, 2012, vol. 335, pp. 1103–1106.
38. Raghavan, Sloan, Ochman. Antisense transcription is pervasive but rarely conserved in enteric bacteria. mBio, 2012, pii: e00156.
39. Sharma, Vogel. Experimental approaches for the discovery and characterization of regulatory small RNA. Curr. Opin. Microbiol., 2009, pp. 536–546.
40. Huttenhofer, Vogel. Experimental approaches to identify non-coding RNAs. Nucleic Acids Res., 2006, pp. 635–646.
41. Cho, Zengler, Qiu, et al. The transcription unit architecture of the Escherichia coli genome. Nat. Biotechnol., 2009, pp. 1043–1049.
42. Mendoza-Vargas, Olvera, et al. Genome-wide identification of transcription start sites, promoters and transcription factor binding sites in E. coli. PLoS One, 2009, p. e7526.
43. Lenz, Doron-Faigenboim, Ron, et al. Sequence features of E. coli mRNAs affect their degradation. PLoS One, 2011, p. e28544.
44. Mackie, Genereaux. The role of RNA structure in determining RNase E-dependent cleavage sites in the mRNA for ribosomal protein S20 in vitro. J. Mol. Biol., 1993, vol. 234, pp. 998–1012.
45. Mackie, Genereaux, Masterman. Modulation of the activity of RNase E in vitro by RNA sequences and secondary structures 5′ to cleavage sites. J. Biol. Chem., 1997, vol. 272, pp. 609–616.
46. Mamanova, Turner. Low-bias, strand-specific transcriptome Illumina sequencing by on-flowcell reverse transcription (FRT-Seq). Nat. Protoc., 2011, pp. 1736–1747.
47. Tjaden, Saxena, Stolyar, et al. Transcriptome analysis of Escherichia coli using high-density oligonucleotide probe arrays. Nucleic Acids Res., 2002, pp. 3732–3738.
48. Roback, Beard, Baumann, et al. A predicted operon map for Mycobacterium tuberculosis. Nucleic Acids Res., 2007, pp. 5085–5095.
49. Sabatti, Rohlin, et al. Co-expression pattern from DNA microarray experiments as a tool for operon prediction. Nucleic Acids Res., 2002, pp. 2886–2893.
50. Kobayashi, Akitomi, Fujii, et al. The entire organization of transcription units on the Bacillus subtilis genome. BMC Genomics, 2007, p. 197.
51. Taboada, Ciria, Martinez-Guerrero, et al. ProOpDB: prokaryotic operon database. Nucleic Acids Res., 2012, Database issue, pp. D627–D631.
52. Hansen, Brenner, Dudoit. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res., 2010, p. e131.
53. Oshlack, Wakefield. Transcript length bias in RNA-seq data confounds systems biology. Biol. Direct, 2009.
54. Gao, Fang, Zhang, et al. Length bias correction for RNA-seq data in gene set analyses. Bioinformatics, 2011, pp. 662–669.
55. Mortazavi, Williams, McCue, et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods, 2008, pp. 621–628.
56. Koren, Tirosh, Barkai. Autocorrelation analysis reveals widespread spatial biases in microarray experiments. BMC Genomics, 2007, p. 164.
57. Lee, Shultz, et al. Assessing probe-specific dye and slide biases in two-color microarray data. BMC Bioinformatics, 2008, p. 314.
58. Kelley, Feizi, Ideker. Correcting for gene-specific dye bias in DNA microarrays using the method of maximum likelihood. Bioinformatics, 2008.
59. Shendure. The beginning of the end for microarrays? Nat. Methods, 2008, pp. 585–587.
60. Sengupta, Bolin, Ruotti, et al. Single read and paired end mRNA-Seq Illumina libraries from 10 nanograms total RNA. J. Vis. Exp., 2011, p. 3340.
61. Cho, Federowicz, Embree, et al. The PurR regulon in Escherichia coli K-12 MG1655. Nucleic Acids Res., 2011, pp. 6456–6464.
62. Prieto, Kahramanoglou, Ali, et al. Genomic analysis of DNA binding and gene regulation by homologous nucleoid-associated proteins IHF and HU in Escherichia coli K12. Nucleic Acids Res., 2012, pp. 3524–3537.
63. Filenko, Spiro, Browning, et al. The NsrR regulon of Escherichia coli K-12 includes genes encoding the hybrid cluster protein and the periplasmic, respiratory nitrite reductase. J. Bacteriol., 2007, vol. 189, pp. 4410–4417.
64. Oshima, Aiba, Masuda, et al. Transcriptome analysis of all two-component regulatory system mutants of Escherichia coli K-12. Mol. Microbiol., 2002, pp. 281–291.
65. Maclellan, Eiamphungporn, Helmann. ROMA: an in vitro approach to defining target genes for transcription regulators. Methods, 2009.
66. Maciag, Peano, Pietrelli, et al. In vitro transcription profiling of the sigmaS subunit of bacterial RNA polymerase: re-definition of the sigmaS regulon and identification of sigmaS-specific promoter sequence elements. Nucleic Acids Res., 2011, pp. 5338–5355.
67. Zheng, Constantinidou, Hobman, et al. Identification of the CRP regulon using in vitro and in vivo transcriptional profiling. Nucleic Acids Res., 2004, pp. 5874–5893.
68. Buck, Lieb. ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics, 2004, pp. 349–360.
69. Collas, Dahl. Chop it, chip it, check it: the current status of chromatin immunoprecipitation. Front. Biosci., 2008, pp. 929–943.
70. Grainger, Busby. Global regulators of transcription in Escherichia coli: mechanisms of action and methods for study. Adv. Appl. Microbiol., 2008, p. 113.
71. Wade, Struhl, Busby, et al. Genomic analysis of protein–DNA interactions in bacteria: insights into transcription and chromosome organization. Mol. Microbiol., 2007.
72. Fan, Lamarre-Vincent, Wang, et al. Extensive chromatin fragmentation improves enrichment of protein binding sites in chromatin immunoprecipitation experiments. Nucleic Acids Res., 2008, p. e125.
73. Park. ChIP-Seq: advantages and challenges of a maturing technology. Nat. Rev. Genet., 2009, pp. 669–680.
74. Cheung, Down, Latorre, et al. Systematic bias in high-throughput sequencing data and its correction by BEADS. Nucleic Acids Res., 2011, p. e103.
75. Waldminghaus, Skarstad. ChIP on chip: surprising results are often artifacts. BMC Genomics, 2010, p. 414.
76. Lorenz, von Pelchrzim, Schroeder. Genomic systematic evolution of ligands by exponential enrichment (Genomic SELEX) for the identification of protein-binding RNAs independent of their expression levels. Nat. Protoc., 2006, pp. 2204–2212.
77. Shimada, Yamamoto, Ishihama. Novel members of the Cra regulon involved in carbon metabolism in Escherichia coli. J. Bacteriol., 2011, vol. 193, pp. 649–659.
78. Schutze, Wilhelm, Greiner, et al. Probing the SELEX process with next-generation sequencing. PLoS One, 2011, p. e29604.
79. Ogawa, Biggin. High-throughput SELEX determination of DNA sequences bound by transcription factors in vitro. Methods Mol. Biol., 2012, vol. 786.
80. Schneider, Stephens. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res., 1990, pp. 6097–6100.
81. Stormo. DNA binding sites: representation and discovery. Bioinformatics, 2000.
82. Ahmad, Sarai. PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics, 2005.
83. GuhaThakurta. Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res., 2006, pp. 3585–3598.
84. Tompa, Bailey, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol., 2005, pp. 137–144.
85. Stormo, Hartzell 3rd. Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl Acad. Sci. USA, 1989, pp. 1183–1187.
86. Medina-Rivera, Abreu-Goodger, Thomas-Chollier, et al. Theoretical and empirical quality assessment of transcription factor-binding motifs. Nucleic Acids Res., 2011, pp. 808–824.
87. Salgado, Peralta-Gil, Gama-Castro, et al. RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more. Nucleic Acids Res., 2013, pp. D203–D213.
88. Weber Sde, Sant'Anna, Schrank. Unveiling Mycoplasma hyopneumoniae promoters: sequence definition and genomic distribution. DNA Res., 2012, pp. 103–115.
89. Thomas-Chollier, Herrmann, Defrance, et al. RSAT peak-motifs: motif analysis in full-size ChIP-Seq datasets. Nucleic Acids Res., 2012, p. e31.
90. Janky, van Helden. Evaluation of phylogenetic footprint discovery for predicting bacterial cis-regulatory elements and revealing their evolution. BMC Bioinformatics, 2008.
91. Thomas-Chollier, Defrance, Medina-Rivera, et al. RSAT 2011: regulatory sequence analysis tools. Nucleic Acids Res., 2011, Web Server issue, pp. W86–W91.
92. Devroede, Thia-Toong, Gigot, et al. Purine and pyrimidine-specific repression of the Escherichia coli carAB operon are functionally and structurally coupled. J. Mol. Biol., 2004, vol. 336.
93. Rolfes, Zalkin. Regulation of Escherichia coli purF. Mutations that define the promoter, operator, and purine repressor gene. J. Biol. Chem., 1988, vol. 263, pp. 19649–19652.