Abstract

The use of high-throughput RNA sequencing technology (RNA-seq) allows whole transcriptome analysis, providing an unbiased and unabridged view of alternative transcript expression. Coupling splicing variant-specific expression with its functional inference is still an open and difficult issue for which we created the DataBase of Alternative Transcripts Expression (DBATE), a web-based repository storing expression values and functional annotation of alternative splicing variants. We processed 13 large RNA-seq panels from human healthy tissues and in disease conditions, reporting expression levels and functional annotations gathered and integrated from different sources for each splicing variant, using a variant-specific annotation transfer pipeline. The possibility to perform complex queries by cross-referencing different functional annotations permits the retrieval of desired subsets of splicing variant expression values that can be visualized in several ways, from simple to more informative. DBATE is intended as a novel tool to help appreciate how, and possibly why, the transcriptome expression is shaped.

Database URL: http://bioinformatica.uniroma2.it/DBATE/

Introduction

Alternative splicing (AS) permits the synthesis of multiple transcript variants from a single gene, thus increasing the diversity of RNAs and proteins encoded by a genome (1, 2). The number of known splicing variants in the human transcriptome stored in Ensembl is growing at a dramatic pace. Through the use of recent high-throughput RNA sequencing technologies (RNA-seq), it has been demonstrated that ∼95% of multi-exon genes undergo AS in panels of human tissues (3), shaping the expressed transcriptome in various ways (4) and generating an exceedingly complex repertoire of mRNAs (5). Splicing variant expression deconvolution algorithms, such as Cufflinks (6), IsoEM (7), Scripture (8), RSEM (9) and SpliceSeq (10), allow the reliable (as validated in many instances by RT-PCR) quantitative estimation of the transcription of individual splicing variants of a gene from RNA-seq data. Yet, the functional interpretation of such expression data, or the change of splicing variant-specific expression patterns in different tissues or conditions, is still overly difficult. In the simplest cases, splicing promotes the inclusion or removal of specific exons corresponding to whole-protein domains to which a specific function can be assigned, but often splicing patterns are much more complex and the effect of splicing on the protein product function(s) is much more elusive. A number of indications suggest that in a considerable fraction of cases, splicing can radically change the protein product function and/or fold (11–14), and a non-negligible amount of splicing variants shows structural inconsistencies (e.g. low degrees of residue packing in the protein core or large fractions of hydrophobic residues exposed to the solvent), and lack of known functional regions. As a consequence, despite the large amount of data available about AS variants and protein functional annotations, there are no resources dedicated to the integrated retrieval of such information.

A number of databases offer storage or download of next-generation sequencing data (15, 16), but splicing variant-level expression analysis remains hard to approach for biomedical researchers, given the exceedingly large amount of data to be processed, the computational power required and the nature of the analysis algorithms, which are usually intended for computational biologists. A small number of recent databases storing RNA-seq expression data only provide gene-level expression values (e.g. 17). Splicing variant-level annotations are starting to be available in databases such as Uniprot, but they can be difficult to interpret without a reference context and are still largely incomplete. SpliceSeq (10) provides a user-friendly interactive graphic environment, integrated with isoform-specific functional annotations from Uniprot, but splicing variant-level expression estimation must be run by the user. Many databases exist that collect AS variants (18–21), but they do not tackle variant functional annotation. There are no general tools that can be used to infer whether a given variant is actually translated, and whether its eventual protein product is stable and contains functional regions and residues. Various resources have been developed for the analysis of specific effects of AS, for example ProSAS (22) for the analysis of the changes introduced by AS on protein structures, or AS-ALPS and AS-EAST (23, 24) for the analysis of the effect of AS on protein–protein interfaces and other structure-based functional assignments. A web server, MAISTAS (25), provides a framework to test the structural consistency of a splicing variant, but has a limited range of application because it requires a high sequence identity between the variant under analysis and a template with known 3D structure. As a consequence, the integration of transcript-level RNA-seq expression data with their functional characterization must currently be approached by combining different tools and cross-referencing heterogeneous databases and data types.

We aim at filling this void with the DataBase of Alternative Transcripts Expression (DBATE). We processed 13 large public RNA-seq panels from human healthy tissues and in disease conditions. For each splicing variant in each sample, we report the estimated transcript expression and its functional annotations, extracted and integrated from different sources: Ensembl (26), Pfam (27), Uniprot/Swiss-Prot (28), GO (29) and mentha (Calderone & Cesareni, in press; http://mentha.uniroma2.it/). The user can access splicing variant expression levels of the genes or transcripts of interest, compare them among different samples and perform more complex queries by cross-referencing the available annotations. The interface is designed to facilitate the data retrieval, available in five different formats: HTML tables, Excel spreadsheets, plain text tab-separated files, barplots and heatmaps.

DBATE content

DBATE provides the expression level of each human transcript annotated in Ensembl (release 67), estimated in 13 different panels of human tissues/cell lines available in the Gene Expression Omnibus (GEO) (30) and enriched with functional annotation. These panels have been chosen to cover the largest number of samples from human healthy tissues, organs or cell lines; for seven of them, both normal and tumoral conditions are provided. Each sample in each panel was processed independently, and we provide tools for the comparison of any set of samples selected by the user. The list of available data sets is provided in Table 1, which reports, for each data set, its GEO identifier, the samples it contains, the total number of reads, the sequencing technology used, a description of the data set content and the literature reference (when available).

Table 1.

Data sets included in DBATE

| GEO GSE identifier | Samples | Number of reads (×10^6) | Read length (bp) | Description | Reference |
|---|---|---|---|---|---|
| GSE12946^a | Adipose, brain, breast, colon, heart, liver, lymph node, skeletal muscle, testes, BT474, HME, MB435, MCF-7, T47D | 224 | 32^e | The Wang data set, from which we selected 14 samples, 9 in normal condition and 5 in tumoral condition | 31 |
| GSE17274^b | Three female (HSF1, HSF2, HSF3) and three male liver samples (HSM1, HSM2, HSM3) | 72 | 35^e | Sex-specific gene expression in liver in three males and three females | 32 |
| GSE29119^b | Breast cancer (HCC1954) and normal breast cells (HMEC) | 97 | 36^e | Gene expression analysis of breast cancer | 33 |
| GSE29155^a | Prostate epithelial (PrEC) and prostate adenocarcinoma (LNCaP) cell lines | 9 | 36^e, 40^f | Transcription profiling of human prostate epithelial and adenocarcinoma cell lines | 34 |
| GSE29580^b | Normal and tumor samples from two colorectal cancer patients | 40 | 36^e | Whole transcriptome sequencing of colorectal cancer | NA |
| GSE29968^b | Matched esophageal squamous cells from three carcinoma patients | 118 | 38^e | Transcriptome analysis of human esophageal squamous cell carcinoma in three pairs of matched patient-derived tumor samples and their adjacent non-tumorous tissues | 35 |
| GSE30611^c | Adipose, adrenal, brain, breast, colon, heart, kidney, liver, lung, lymph node, ovary, prostate, skeletal muscle, testes, thyroid, white blood cells | 2000 | 50^f | The Illumina BodyMap 2.0 Project, comprising transcription profiling of individual and mixtures of 16 human tissues | NA |
| GSE30772^c | Mitochondrion, mitoplasm | 45 | 35^e | Examination of the mitochondrial transcriptome | 36 |
| GSE32689^b | Pooled oocytes, pooled sister polar bodies, single oocyte, single sister polar body | 120 | 42^e | Transcriptome of the human polar body, providing four conditions: pooled oocytes and their sister polar bodies and a single oocyte and its sister polar body | 37 |
| GSE33328^d | Peripheral brain tissue, tumor brain tissue | 49 | 75^e | Transcriptomic profiling of a glioblastoma multiforme patient with control peripheral brain tissue | 38 |
| GSE37769^c | THP1 cells | 287 | 100^e | Expression analysis of the THP1 (human monocytic leukemia) cell line | 39 |
| GSE38685^b | Prostate epithelial (PrEC) and prostate adenocarcinoma (LNCaP) cell lines | 35 | 75^e | Transcription profiling of human prostate epithelial and adenocarcinoma cell lines | 40 |
| GSE43925^b | THP1 high glucose, THP1 normal glucose | 60 | 42^e | Expression analysis of human THP-1 monocytes in normal conditions and treated with high glucose | 41 |

^a Platform: Genome Analyzer

^b Platform: Genome Analyzer IIx

^c Platform: HiSeq 2000

^d Platform: Genome Analyzer II

^e Single end reads

^f Paired end reads

The current DBATE release includes 13 data sets retrieved from the Gene Expression Omnibus (GEO). The table reports, for each data set, its GEO GSE identifier, the samples it contains, the total number of reads (in millions), the read length, a brief description of the data set content and the literature reference when available (NA indicates that the data were deposited in GEO but the study is still unpublished). Superscripts indicate the sequencing technology used (Genome Analyzer, Genome Analyzer II, Genome Analyzer IIx or HiSeq 2000) and whether the reads were sequenced as single or paired ends.

RNA-seq analysis

All the data sets were checked for read quality using FastQC (v0.10.1; http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). We trimmed bases from read ends when their quality scores were <20. Reads shorter than 10 bases after trimming were discarded.
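The trimming and filtering rule can be illustrated with a minimal Python sketch. The text does not name the trimming tool or state whether one or both read ends were trimmed, so the function name `trim_read`, the base-by-base trimming from both ends and the list-of-integers quality encoding are assumptions made here for illustration only; the two thresholds (quality 20, length 10) are those given above.

```python
def trim_read(seq, quals, min_quality=20, min_length=10):
    """Trim low-quality bases from both read ends (assumed behaviour).

    seq   -- read sequence as a string
    quals -- per-base quality scores as integers, same length as seq
    Returns (trimmed_seq, trimmed_quals), or None if the trimmed read
    is shorter than min_length and must be discarded.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")

    start, end = 0, len(seq)
    # Advance past low-quality bases at the 5' end.
    while start < end and quals[start] < min_quality:
        start += 1
    # Step back over low-quality bases at the 3' end.
    while end > start and quals[end - 1] < min_quality:
        end -= 1

    if end - start < min_length:
        return None  # read too short after trimming: discard
    return seq[start:end], quals[start:end]
```

A read whose high-quality core is exactly 10 bases long is kept, because only reads strictly shorter than 10 bases are discarded.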

We used the Tuxedo Suite, comprising Bowtie (v0.12.7) (42), TopHat (v1.4.1) (43) and Cufflinks (v1.3.0) (6), to align the sequence reads produced in each experiment to the reference human genome hg19, seeking only unique genome matches with up to two mismatches (Bowtie), to identify splicing junctions (TopHat) and to evaluate the normalized expression of individual splicing variants (Cufflinks), reported in Fragments Per Kilobase of transcript per Million mapped reads (FPKM). Cufflinks is the de facto standard algorithm for the isoform deconvolution problem, and it is widely used (44). More recent algorithms are available (45–48), but in the absence of appropriate benchmarks for this kind of algorithm, we deemed it more appropriate to select the most popular tool. We also tested two additional splicing variant expression estimation algorithms, IsoEM (7) and Scripture (8), on the Wang data set (see above), finding a high correlation with the Cufflinks expression estimates (Pearson correlation coefficient >0.8 for both algorithms).
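As an aid to reading FPKM values and the correlation check, the following Python sketch spells out the FPKM unit and the Pearson correlation coefficient. It is not the Cufflinks procedure: Cufflinks assigns fragments to isoforms with a statistical model before normalizing, whereas `fpkm` below only applies the unit definition to a fragment count that is assumed to be already assigned to a transcript. Both function names are hypothetical.

```python
import math


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    Dividing by (length / 1e3) and by (total / 1e6) gives the 1e9 factor.
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def pearson(x, y):
    """Pearson correlation coefficient between two equally long vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation undefined for a constant vector")
    return cov / (sd_x * sd_y)
```

For instance, 500 fragments assigned to a 2000 bp transcript in a library of 10 million mapped fragments correspond to 25 FPKM. To compare two estimation tools, `pearson` would be applied to their per-transcript estimates for the same sample.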

Functional annotation

Various sources of functional information have been integrated in DBATE, gathering functional annotations from the Ensembl, Uniprot/Swiss-Prot, Pfam, Gene Ontology and mentha databases. With the exception of protein interaction information from mentha (described later in the article), all annotations are mapped to individual splicing variants. Each feature can be used to filter the input query, selecting only those genes having at least one splicing variant carrying the desired annotation or set of annotations. It should be noted that, although the annotation transfer pipeline is intended for protein-coding transcripts, DBATE also stores expression levels for non-coding RNAs, which are assuming a central role in many cellular processes (49, 50).

We extracted from the Ensembl database (release 67) the definition (i.e. the exon composition) of each transcript, linked to its gene (ENSG), transcript (ENST) and associated protein (ENSP) identifiers and to its genomic coordinates, defined by chromosome, absolute start–end on the human genome and strand. In total, DBATE stores 22 403 protein-coding Ensembl genes and their associated 100 357 transcripts, of which 94 737 (94.4%) encode a different protein sequence (the remaining variants differ only in the untranslated regions). Additionally, DBATE stores 96 136 non-coding transcripts: 52% are non-coding isoforms of protein-coding genes, whereas the remaining 48% are products of genes for which no transcript is translated. The latter might encode functional RNAs, i.e. RNAs that are not translated because they have no ORF, but exert their function as RNA molecules (51, 52). A rapidly increasing interest in these non-coding RNA (ncRNA) genes motivated their inclusion in DBATE, even though our protein annotation transfer pipeline cannot be applied to them because they do not encode proteins.

We collected all the human entries of the Uniprot/Swiss-Prot database, resulting in 20 231 different entries. For each entry, we gathered the Uniprot ID, Uniprot Primary Accession Number, sequence of the main splicing variant, protein existence evidence, features and post-translational modifications. All the information stored in Uniprot is related to the protein product of the main splicing variant. This splicing variant is usually selected as the longest one, the most biologically relevant one (e.g. more commonly expressed, or better characterized) or the first discovered one. We mapped to Ensembl transcripts 16 distinct Swiss-Prot features, selected as those that are most directly linked to protein function. These features are TOPO_DOM, NP_BIND, REGION, BINDING, DISULFID, MOTIF, MOD_RES, DOMAIN, DNA_BIND, REPEAT, ZN_FING, LIPID, ACT_SITE, METAL, SITE and CA_BIND. A total of 36 728 Ensembl transcripts encode proteins that are annotated with at least one Swiss-Prot feature. Additionally, we mapped 19 post-translational modifications provided by Swiss-Prot entries to Ensembl transcripts: acetylation, ADP-ribosylation, amidation, disulfide bond, gamma-carboxyglutamic acid, glutathionylation, glycation, glycoprotein, hydroxylation, iodination, lipoprotein, methylation, myristate, nitration, oxidation, phosphoprotein, S-nitrosylation, sulfation and Ubl conjugation. A total of 27 867 Ensembl transcripts encode proteins that are annotated with at least one post-translational modification. The annotation transfer pipeline is based on the mapping between each protein amino acid and its corresponding genomic codon, identified by chromosome, strand and a triplet of genomic positions. We aligned each human Uniprot sequence with each Ensembl transcript sequence using the Needleman–Wunsch algorithm to find the correspondence between the Uniprot sequence and all splicing variants of its encoding gene. Using the genomic coordinates of each transcript exon and the alignment between the Uniprot sequence and the most similar transcript (discarding all cases where there is no clear correspondence with any Ensembl transcript), we mapped the genomic location of the codons encoding each annotated amino acid residue and verified the presence of each codon in the splicing variants of that gene, obtaining an estimate of how many annotated residues are present in each known transcript.
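The residue-to-codon mapping at the core of this pipeline can be sketched in Python as follows. This is an illustration under stated assumptions, not the DBATE implementation: the function names are hypothetical, coding exon segments are assumed to be given as 1-based inclusive (start, end) genomic intervals on a single chromosome, the coding sequence is assumed to start at the first coding base (no incomplete leading codon), and a residue is considered present in another variant only if the same triplet of genomic positions is read as one codon there.

```python
def residue_codons(cds_exons, strand):
    """Map each residue index to its codon as a triplet of genomic positions.

    cds_exons -- coding segments as (start, end), 1-based inclusive (assumed)
    strand    -- "+" or "-"
    Returns a list where element i is the position triplet of residue i.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")

    # Walk the exons in transcript order (reversed on the minus strand).
    ordered = sorted(cds_exons, reverse=(strand == "-"))
    positions = []
    for start, end in ordered:
        block = range(start, end + 1)
        positions.extend(reversed(block) if strand == "-" else block)

    # Group consecutive coding bases into codons; a codon may span a junction.
    n_codons = len(positions) // 3
    return [tuple(positions[3 * i:3 * i + 3]) for i in range(n_codons)]


def codon_is_encoded(codon, variant_codons):
    """True if the same genomic triplet is read as one codon in the variant."""
    return codon in set(variant_codons)
```

With two coding segments (100–104 and 200–203) on the plus strand, the second residue is encoded by the triplet (103, 104, 200), which spans the exon junction; a variant that skips either segment no longer contains that codon.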

This procedure maps each annotation at the amino acid residue level from the main splicing variant (as identified in Uniprot) to every other splicing variant. Each splicing variant of a gene may encode only a subset of the residues associated with an annotation in the main splicing variant; therefore, we defined the annotation coverage as the fraction of annotated amino acids found in a given splicing variant with respect to the total number of annotated residues in the main Uniprot splicing variant. Coverage varies between 100% (all annotated residues included) and 0% (the functional feature is completely removed by splicing events). For example, if a splicing variant contains 5 of the 10 annotated residues of the main splicing variant, the coverage of the annotation on this splicing variant is 50%. Even if an annotation is found on a transcript with high or complete coverage, that splicing variant is not necessarily able to perform the function, because we cannot estimate whether and how much the transcript is translated, and because the function might depend on a specific local or global folding, or on the presence of disjoined regulatory regions, which would be extremely difficult to infer. Yet, if the annotated residues are not encoded by a splicing variant, that variant cannot perform that function, regardless of its translation, folding and regulation. Hence, we provide a transcript-level transfer of functional amino acids, which can be a useful starting point for a more detailed functional characterization.
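The coverage definition reduces to a simple ratio, shown here as a short Python sketch. The function name and the representation of residues as hashable identifiers (e.g. codon position triplets) are assumptions made for illustration.

```python
def annotation_coverage(annotated_residues, variant_residues):
    """Percentage of annotated residues of the main variant found in a variant.

    annotated_residues -- identifiers of the residues carrying the annotation
                          in the main Uniprot splicing variant
    variant_residues   -- identifiers of the residues encoded by the variant
    """
    annotated = set(annotated_residues)
    if not annotated:
        raise ValueError("the annotation has no residues")
    present = annotated & set(variant_residues)
    return 100.0 * len(present) / len(annotated)
```

Applied to the example in the text, a variant encoding 5 of 10 annotated residues yields a coverage of 50%.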

Protein-domain composition was retrieved from the Pfam database and mapped to individual splicing variants using the server edition of PfamScan: 70 298 transcripts encode for a protein product that has at least one Pfam domain.

Finally, 72 916 transcripts are annotated with at least one Gene Ontology (GO) term. In total, 51 015 transcripts are associated with at least one GO term in the ‘biological process’ domain; 63 868 transcripts are associated with at least one term in the ‘molecular function’ domain; 51 634 transcripts are associated with at least one term in the ‘cellular component’ domain. Ensembl and Biomart recently started to associate GO terms with individual splicing variants. Although such annotations are largely incomplete, they provide variant-specific information in many cases. GO term annotations will be frequently updated over time.

The mentha interactome database collects protein–protein interactions (PPI) retrieved from five different PPI databases, namely IntAct (53), MINT (54), DIP (55), BioGRID (56) and MatrixDB (57), with the aim of eliminating redundancy between these different sources. The main motivation of the mentha database is that currently available curated databases of protein interactions offer only a limited view of the interactome, which can be expanded and made more consistent by their integration. We have combined this database with DBATE through a Java applet that browses PPIs using the unique Uniprot ID associated with each transcript.

Querying DBATE

The DBATE database archives a wide variety of functional annotations. The user can access the splicing variant expression level of genes or transcripts of interest, or perform more complex queries using an advanced form through the use of cross-referenced annotations.

The DBATE user interface is intended to offer increasingly complex queries in an intuitive fashion. In the simplest query, the user inputs a transcript ID and retrieves pre-computed expression levels in FPKM units for all tissue samples in BodyMap (the default sample group), available as a simple TAB-delimited plain-text file, a Microsoft Excel table file and an HTML table. All functional annotations of the input transcript are reported and organized in panels. When the user inputs a gene ID or a gene name, all its splicing variants are returned. Moreover, the database can be accessed via one or more GO terms or Pfam domain IDs; mixed queries are also allowed. Individual variants can be manually selected from the list of variants matching the query. In addition to the different tabular outputs and annotation panels, if the number of variants is higher than one (and lower than 100), DBATE also offers a heatmap grouping similar expression patterns across all selected samples.

Using the advanced options input form, the user can create more complex queries. First, specific samples from all RNA-seq data sets can be selected. Then, functional annotations can be chosen to filter the input transcripts for only those matching the selected features. As a case study, we report the analysis of proteins containing the K Homology (KH) domain (Pfam ID: PF00013), a domain that binds RNA, promoting its degradation, and that is involved in splicing regulation (58, 59). The KH domain has been found in some cases associated with repeated protein sequences such as the ankyrin repeat (60, 61), and its phosphorylation can modulate its binding ability (62). DBATE can be easily queried to retrieve all splicing variants that encode protein products containing the KH domain; the query can then be refined to retain only splicing variants encoding phosphorylation sites and containing repeat units in their protein sequences, and to retrieve expression values in the desired pool of samples. Such a composite query, which cross-links different types of information, retrieves 10 transcripts belonging to 4 different genes: ANKRD17, KHSRP, HNRNPK and ANKHD1. Their expression patterns are reported in the heatmap in Figure 1, showing that splicing variants for these proteins can have different and tissue-specific expression. Interestingly, not all the ANKHD1 variants contain the KH domain. An ANKHD1 splicing variant lacking the KH domain, which has important roles in apoptosis, is reported in the literature (63). Retrieving the full list of ANKHD1 variants that contain the KH domain, together with their expression patterns, is not a trivial task, but it can be immediately obtained from DBATE by simply querying the ANKHD1 gene. DBATE reports that 10 out of 19 ANKHD1 splicing variants lack the KH domain, and their expression patterns can help in elucidating their cellular roles.
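The logic of such a composite query can be pictured as a conjunction of annotation filters over transcript records. The Python sketch below is purely illustrative: DBATE is queried through its web interface, and the `Transcript` record, its fields and the transcript and gene identifiers in the example are hypothetical placeholders, not DBATE data or a DBATE API. Only the filter values (PF00013, phosphoprotein, REPEAT) come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    """Hypothetical in-memory record of one annotated splicing variant."""
    transcript_id: str
    gene: str
    pfam: set = field(default_factory=set)      # Pfam domain IDs
    features: set = field(default_factory=set)  # Swiss-Prot feature keys
    ptms: set = field(default_factory=set)      # post-translational modifications


def composite_query(transcripts, pfam_id, ptm, feature):
    """Keep only variants carrying all three requested annotations."""
    return [
        t for t in transcripts
        if pfam_id in t.pfam and ptm in t.ptms and feature in t.features
    ]


# Placeholder records (identifiers are invented for the example).
EXAMPLE_TRANSCRIPTS = [
    Transcript("TX1", "GENE_A", {"PF00013"}, {"REPEAT"}, {"phosphoprotein"}),
    Transcript("TX2", "GENE_A", set(), {"REPEAT"}, {"phosphoprotein"}),
    Transcript("TX3", "GENE_B", {"PF00013"}, set(), {"phosphoprotein"}),
]

hits = composite_query(EXAMPLE_TRANSCRIPTS, "PF00013", "phosphoprotein", "REPEAT")
genes = sorted({t.gene for t in hits})
```

In this toy example only the first record satisfies all three conditions; the second lacks the KH domain, mirroring the situation described for some ANKHD1 variants.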

Figure 1.

Example of a combination of complex queries in DBATE. This heatmap reports expression values in the BodyMap panel of human tissues for splicing variants that encode protein products containing the Pfam KH domain (PF00013), which are phosphorylated and contain repetitive units. This combination of information can be easily obtained using the web interface of DBATE, which returns in this case 10 different splicing variants that belong to the genes ANKRD17, KHSRP, HNRNPK and ANKHD1. Their expression patterns show that splicing variants of these different proteins can have tissue-specific behaviors. The heatmap image is generated by an automated procedure using the heatmap.2 function of the statistical software R, and then loaded on the web interface as part of the results page. The color code of the heatmap ranges from red (lower FPKM values) through black (medium expression values) to green (higher expression values).

For each query, protein interaction data from the mentha browser can be visualized through a Java applet integrated in the results page. For each splicing variant derived from the query, a list of unique Uniprot primary accession numbers is used to interrogate mentha. For each query protein, all physically interacting proteins retrieved in mentha are reported as connected by an edge to the query proteins, and the expression FPKM in a selected tissue of each transcript is reported, for both the query proteins and their binding partners. Each network node is color-coded by the expression level of the dominant isoform, whereas clicking the single node reports all the different splicing variants with their expression values. The mentha browser additionally allows expansion and pruning of the network. The interaction network for the 4 KH domain-containing genes selected in the previous paragraph is reported in Figure 2, where they display close connectivity mediated by common binding partners.

Figure 2. Protein interaction network for the ANKRD17, KHSRP, HNRNPK and ANKHD1 genes, retrieved from the mentha database and plotted by the mentha browser applet. The mentha database stores manually curated protein-protein interactions (PPIs) from five different PPI databases and has been integrated into the DBATE web interface. These four genes were selected through a complex query on DBATE that retrieves all the splicing variants that encode protein products containing the KH (K Homology) domain and that are also phosphorylated and contain repeated units. The network includes all primary binding partners of the four genes. Nodes represent genes, and edges join genes whose protein products are known to physically interact. Nodes corresponding to the query proteins are larger and highlighted with blue circles. Each node is colored according to the expression level of its most expressed splicing variant. Color ranges from red (lower FPKM values) through black (medium expression values) to green (higher expression values). White nodes represent genes for which no splicing variant is expressed in the selected tissue. Protein interaction networks generated by the mentha browser can also be manually expanded and pruned.


Finally, expression data from all data sets and transcripts can be downloaded as static tab-separated text files.
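As a rough illustration of how such a downloaded file might be read, the sketch below parses a tab-separated table with the Python standard library. The column names (`transcript_id`, `gene`, one column per tissue) and the values are assumptions made for the example; the real DBATE files may be laid out differently.

```python
import csv
import io
from typing import Dict, TextIO

# Hypothetical excerpt of a tab-separated expression file.
sample = (
    "transcript_id\tgene\tbrain\tliver\n"
    "T1\tKHSRP\t12.0\t3.5\n"
    "T2\tKHSRP\t0.0\t8.25\n"
)

def load_expression(handle: TextIO) -> Dict[str, dict]:
    """Read a TSV table into {transcript_id: {"gene": ..., "fpkm": {tissue: value}}}."""
    reader = csv.DictReader(handle, delimiter="\t")
    table: Dict[str, dict] = {}
    for row in reader:
        transcript = row.pop("transcript_id")
        gene = row.pop("gene")
        # Every remaining column is treated as one tissue/sample.
        table[transcript] = {
            "gene": gene,
            "fpkm": {tissue: float(value) for tissue, value in row.items()},
        }
    return table

table = load_expression(io.StringIO(sample))
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the in-memory `io.StringIO` object.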

Data organization and web interface

DBATE has been implemented in the MySQL database management system, version 5.1.17, on a Linux Xubuntu server. It contains expression levels for 196 494 splicing variants (57 659 genes) computed in 36 samples: 28 in the normal condition and 8 in the tumoral condition. For each splicing variant, information from Ensembl, UniProt/Swiss-Prot, Pfam and GO has been integrated as explained in the ‘Functional Annotation’ section.
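To give an idea of how cross-referenced annotations can be combined in a relational layout, the following sketch uses an entirely hypothetical three-table schema. It is not the DBATE schema (that is documented in the online help pages), and it uses the Python standard-library `sqlite3` module as a stand-in for MySQL so that it runs without a server. Table names, column names, variant identifiers and values are all illustrative assumptions.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variant    (variant_id TEXT PRIMARY KEY, gene TEXT);
CREATE TABLE annotation (variant_id TEXT, source TEXT, term TEXT);
CREATE TABLE expression (variant_id TEXT, sample TEXT, fpkm REAL);
""")

# Made-up rows: only V1 carries all three annotations.
conn.executemany("INSERT INTO variant VALUES (?, ?)",
                 [("V1", "KHSRP"), ("V2", "KHSRP"), ("V3", "GENE_X")])
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", [
    ("V1", "pfam", "PF00013"),
    ("V1", "feature", "phosphorylation"),
    ("V1", "feature", "repeat"),
    ("V2", "pfam", "PF00013"),
    ("V3", "feature", "phosphorylation"),
])
conn.executemany("INSERT INTO expression VALUES (?, ?, ?)",
                 [("V1", "brain", 12.0), ("V1", "liver", 3.5), ("V2", "brain", 1.0)])

def variants_with_all(db: sqlite3.Connection, terms: List[str]) -> List[str]:
    """Return the variants annotated with every term in `terms`."""
    placeholders = ",".join("?" for _ in terms)
    sql = (
        f"SELECT variant_id FROM annotation WHERE term IN ({placeholders}) "
        "GROUP BY variant_id HAVING COUNT(DISTINCT term) = ? "
        "ORDER BY variant_id"
    )
    return [row[0] for row in db.execute(sql, (*terms, len(terms)))]

# KH domain (PF00013) AND phosphorylated AND containing repeats.
hits = variants_with_all(conn, ["PF00013", "phosphorylation", "repeat"])

# Expression values of the first matching variant across samples.
rows = conn.execute(
    "SELECT sample, fpkm FROM expression WHERE variant_id = ? ORDER BY sample",
    (hits[0],),
).fetchall()
```

The `GROUP BY ... HAVING COUNT(DISTINCT term)` pattern is one common way to express an "all of these annotations" filter; the production database may well implement the same intersection differently.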

The web interface is implemented with Python CGI scripts, HTML and JavaScript. All graphs are generated with the statistical software R v.2.15.2 and loaded into the web interface through Python CGI. The entity–relationship schema of the database is included in the online DBATE help pages.

Conclusions

Next-generation sequencing technologies have revolutionized the analysis of the transcriptome, providing a panoramic view of all the transcriptional activity in a given sample. Although such high-throughput experiments provide an enormous wealth of data, few tools exist to make sense of them. DBATE, freely available at http://bioinformatica.uniroma2.it/DBATE/, is an integrated resource that can support the functional interpretation of whole-transcriptome expression experiments by providing pre-computed expression levels and annotations that can be cumbersome for the biomedical scientist to generate.

A semi-automated pipeline was built to process additional data sets and populate the database with them as they become available; it will be extended to more annotation sources and sequencing technologies. The pipeline consists of initial steps of data retrieval, organization and quality checking, performed manually by DBATE curators, followed by automated stages of expression estimation and annotation. We initially selected for inclusion in DBATE public data sets from a list retrieved from GEO, including 53 RNA-seq human panels obtained with Illumina technology (GA, GAII, GAIIx or HiSeq). All remaining unprocessed data sets are currently in a queue and will be progressively added. DBATE updates are planned each semester. Finally, DBATE will be expanded into a web server or web service-based tool for the annotation and characterization of user-submitted RNA-seq panel data.

Funding

Associazione Italiana per la Ricerca sul Cancro (AIRC) (IG10298 to M.H.C.). Funding for open access charge: PRIN 2010 (prot. 20108XYHJS_006 to M.H.C.).

Conflict of interest. None declared.

Author notes

Present address: Valerio Bianchi, Center for Genomic Science of IIT@SEMM, Istituto Italiano di Tecnologia, via Adamello 16, Milan 20139, Italy.

Citation details: Bianchi,V., Colantoni,A., Calderone,A., et al. DBATE: database of alternative transcripts expression. Database (2013) Vol. 2013: article ID bat050; doi:10.1093/database/bat050.