dbWFA: a web-based database for functional annotation of Triticum aestivum transcripts

Abstract

The functional annotation of genes based on sequence homology with genes from model species genomes is time-consuming because it is necessary to mine several unrelated databases. The aim of the present work was to develop a functional annotation database for common wheat Triticum aestivum (L.). The database, named dbWFA, is based on the reference NCBI UniGene set, an expressed gene catalogue built by expressed sequence tag clustering, and on full-length coding sequences retrieved from the TriFLDB database. Information from good-quality heterogeneous sources, including annotations for model plant species Arabidopsis thaliana (L.) Heynh. and Oryza sativa L., was gathered and linked to T. aestivum sequences through BLAST-based homology searches. Even though the complexity of the transcriptome cannot yet be fully appreciated, we developed a tool to easily and promptly obtain information from multiple functional annotation systems (Gene Ontology, MapMan bin codes, MIPS Functional Categories, PlantCyc pathway reactions and TAIR gene families). The use of dbWFA is illustrated here with several query examples. We were able to assign a putative function to 45% of the UniGenes and 81% of the full-length coding sequences from TriFLDB. Moreover, comparison of the annotation of the whole T. aestivum UniGene set along with curated annotations of the two model species assessed the accuracy of the annotation provided by dbWFA. To further illustrate the use of dbWFA, genes specifically expressed during the early cell division or late storage polymer accumulation phases of T. aestivum grain development were identified using a clustering analysis and then annotated using dbWFA. The annotation of these two sets of genes was consistent with previous analyses of T. aestivum grain transcriptomes and proteomes.

Database URL: urgi.versailles.inra.fr/dbWFA/

Introduction

Triticum aestivum (L.), common wheat or bread wheat, is one of the most important staple crops in the world. It is cultivated worldwide and provides >20% of the calories and proteins in the human diet (http://faostat.fao.org). Although ongoing sequencing efforts have already produced important genomic resources (1–4), the complete sequencing and annotation of the hexaploid (2n = 6× = 42, AABBDD) T. aestivum genome has yet to be achieved. A first version of the genome of the bread wheat cv. Chinese Spring has recently been published (4), providing the scientific community with highly valuable genomic and evolutionary information, which will facilitate genome-wide analysis of bread wheat. However, because of the low-coverage (5-fold) shotgun sequencing method used in this project, this resource does not represent a high-quality draft of the wheat genome in terms of sequence completion, quality and annotation. Genome-wide analysis of gene expression by means of expression microarrays or transcriptome sequencing (RNA-Seq) is now being adopted in T. aestivum (5, 6), but analysing such large datasets requires extensive annotation efforts. Data fragmentation and technical and semantic heterogeneity can severely limit the efficient extraction and interpretation of biological data (7, 8).

More and more genomic information is becoming available for T. aestivum research. Various resources and associated tools grant the user a structural overview of expressed sequence tag (EST) (ITEC, http://avena.pw.usda.gov/genome/) (9, 10) or bacterial artificial chromosome clone libraries (11, 12) for instance (13–15). Important initiatives are underway to facilitate the breeding of improved Triticeae varieties. The TriticeaeGenome project (www.triticeaegenome.eu) grants access to comprehensive information extracted from experimental data to provide a better understanding of Triticeae genomes (16). The global database GrainGenes (http://wheat.pw.usda.gov/) provides a variety of services and bioinformatics tools for the Triticeae and Avena sativa research communities. The HarvEST database (http://harvest.ucr.edu/) (17), dedicated to several crop species, including T. aestivum and Hordeum vulgare, provides access to curated EST assemblies, comparative analysis tools and links to orthologues in related model plant species. Together, these resources compile and cross-reference a great deal of information on physical and genetic mapping, markers, sequence variations and quantitative trait loci. To some extent, they also provide information leading indirectly to predicted gene product functions, but none of them is focused on functional gene annotation, and it is necessary to navigate through numerous unlinked resources to extract functional information.

Recently, pipelines for the automated annotation of genomic sequences of T. aestivum and related species have been developed (3, 18). These pipelines are based on the prediction of gene models within genome sequences, so they are not able to functionally annotate sequences originating from transcriptome sequencing like ESTs. Because no reference genome sequence is yet available for T. aestivum, a massive sequencing effort has produced more than a million ESTs (http://wheat.pw.usda.gov/genome/). To deal with the high level of redundancy of this resource, these sequences were clustered (i.e. overlapping and partial polyA-tailed expressed sequences are grouped) to provide a reference set of unique expressed genes, NCBI UniGenes (http://www.ncbi.nlm.nih.gov/UniGene). The assembly conditions used to build NCBI UniGenes make it the most comprehensive coding DNA (cDNA) assembly available to date, and UniGene assemblies have been used as a reference set of sequences for many species. An additional effort was made to construct full-length cDNA sequences that were included in the TriFLDB database (19). Full-length cDNA sequences are most commonly used in genome annotation as a resource for cross-species comparative analyses. Currently, TriFLDB is the most reliable source of full-length cDNA sequences in T. aestivum. TriFLDB includes annotations based on homologies found by searching protein databases, extensive Gene Ontology (GO) annotations and InterProScan results. Recently, a new collection of nearly 1 million ESTs, assembled into contigs and singlets, was annotated with GO terms (20), but meaningful prediction of gene function requires more than one system of annotation.

After the sequencing of the first plant genome, Arabidopsis thaliana in 2000 (21), several plant sequencing projects have been successful. The sequenced genomes most closely related to the T. aestivum genome are those of Oryza sativa ssp. indica (22), Oryza sativa ssp. japonica (23), Zea mays (24), Glycine max (25), Sorghum bicolor (26), Brachypodium distachyon (27) and Hordeum vulgare (28). Both structural and functional annotation resources for these species are developing steadily. One of the most effective methods to annotate a transcript is to find its orthologous counterparts in well-annotated closely related genomes (29, 30). Although H. vulgare (L.) would be expected to be the most useful reference because it is more closely related to T. aestivum, comprehensive and high-quality annotations of gene function are only available for O. sativa and A. thaliana (2), essentially owing to the lengthy and accurate annotation efforts undertaken.

To annotate ESTs or transcripts using sequence homology, it is necessary to navigate through unrelated databases. Some tools using this homology approach have been developed. For instance, Blast2GO (31) can be queried using T. aestivum sequences to give GO results. ONDEX (7), developed with the challenges of functional annotation of T. aestivum genome in mind, combines data integration from various sources and various mining methods, including graph-based analyses, to annotate wheat gene functions according to a wisely chosen set of annotation standards. However, ONDEX does not provide easy access to workable static results that are often required in research. To fill this gap, we developed dbWFA, an open-access database relating the T. aestivum UniGene set and the full-length cDNA sequences from TriFLDB to A. thaliana (TAIR10) (32) and O. sativa (pseudomolecules version 7.0) (33) annotation through BLAST (34) results. dbWFA also includes the inventory database of T. aestivum transcription factors (wDBTF) (35) and hand-curated gene families (36). As an all-in-one interface for the annotation of T. aestivum sequences, dbWFA will be useful to the researchers working on T. aestivum and more generally on cereals, particularly for comparative cereal genomics and functional genomics. The web implementation of dbWFA provides an easy-to-use interface to annotate transcript sequences from T. aestivum, with functional information from multiple pervasive annotation systems. Here, the use of dbWFA is illustrated with several query examples, and the quality of the annotation method is assessed by comparing the MapMan bin annotation of all the transcripts of the T. aestivum NimbleGen 40 k microarray (37) with that of A. thaliana and O. sativa. The use of dbWFA is further illustrated by analysing the annotation of 433 genes specifically expressed during either the early cell division or the late storage polymer accumulation (SPA) phases of grain development.

Data Content, Database Architecture and Web Interface

Five functional classification/annotation systems were integrated (Figure 1) in dbWFA to offer a fast and efficient functional annotation tool for T. aestivum UniGenes:

GO (http://www.geneontology.org) (38), a non-redundant structured hierarchy of ontologies, which is the most widely used functional annotation system in bioinformatics. The GO project provides an efficient annotation standard that can be applied to numerous species. It is built on a controlled vocabulary of terms for describing gene function. dbWFA includes GO annotation data (OBO version 1.2) for A. thaliana and O. sativa.
Plant Metabolic Network (PMN; http://www.plantcyc.org) (39), which provides a broad network of curated databases on primary and secondary plant metabolism, including pathways, enzymes, genes, compounds and reactions from several plant species. dbWFA contains data from AraCyc (version 9.0) for A. thaliana and RiceCyc (version 3.2) for O. sativa.
MapMan (http://mapman.gabipd.org) (40), which is a user-driven tool for large datasets (e.g. gene expression data from microarrays) visualized in the context of diagrams of metabolic pathways or other processes. MapMan annotation data (bin tree version 1.1) for both A. thaliana and O. sativa are stored in dbWFA. The dbWFA database also provides a function to automatically generate MapMan T. aestivum mapping files.
Munich Information Center for Protein Sequences Functional Catalogue (MIPS FunCat; http://www.helmholtz-muenchen.de/en/mips/projects/funcat) (41), which provides a hierarchical scheme for the functional description of proteins of prokaryotic and eukaryotic origin. MIPS FunCat annotations for A. thaliana (MAtDB version 2.1) are stored in dbWFA.
A. thaliana Gene Family Information (TAIR version 10; http://www.arabidopsis.org/browse/genefamily) (42), which provides gene family information for the plant model species A. thaliana.

The 17 541 full-length cDNA sequences from the TriFLDB and other public databases and all the transcript sequences from the T. aestivum UniGene set (builds #55, #58, #59 and #60) were processed using the BLASTx algorithm against A. thaliana and O. sativa predicted protein sequences (Figure 1). Build #55 (the one used to develop the T. aestivum NimbleGen 40 k microarray) (37) and the following major releases were retained, as users may have developed resources based on different builds of the UniGene set even though NCBI only stores the most recent build. BLAST results with an e-value >10⁻³ were not stored in the database, as we considered this would be too poor a match for most research. No other filter was applied to the BLAST results before their insertion into the database. All the parameters from the BLAST tabular results were kept, and >30 × 10⁶ BLAST results for the UniGenes and 95 × 10⁶ BLAST results for the ESTs shaping the UniGene clusters were stored, so they could be rapidly screened when querying the database.
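The storage rule described above (discard hits whose e-value exceeds 10⁻³ and keep every other column) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the dbWFA loading code: it assumes the hits are in the standard 12-column BLAST tabular layout, and the function name, the identifiers and the hit values are hypothetical.

```python
# Column order of the standard 12-column BLAST tabular output (assumed input format).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

EVALUE_MAX = 1e-3  # hits with a larger e-value are not stored


def parse_and_filter(lines, evalue_max=EVALUE_MAX):
    """Parse tab-separated BLAST hits and keep those with e-value <= evalue_max."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_COLUMNS, line.split("\t")))
        # Convert the columns used for filtering and later thresholds.
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= evalue_max:
            kept.append(hit)
    return kept


# Hypothetical hits, for illustration only (not taken from dbWFA).
example_lines = [
    "query_1\tsubject_A\t79.6\t245\t50\t0\t1\t735\t10\t254\t1e-120\t430",
    "query_1\tsubject_B\t30.0\t40\t28\t0\t1\t120\t5\t44\t0.5\t25.0",
]
stored = parse_and_filter(example_lines)
print([hit["sseqid"] for hit in stored])
```

Only the first hypothetical hit passes the filter; all its columns are retained so that identity and coverage thresholds can still be applied at query time.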

Figure 1

Simplified diagram of the data integration process.

The database also contains curated information on T. aestivum transcription factors (2891 transcripts), E3 ubiquitin ligases of the ubiquitin-proteasome system (876 transcripts), hormone-responsive genes (467 transcripts) and seed storage proteins (55 transcripts; Figure 1). Transcription factor UniGenes were retrieved from the wDBTF database (34). E3 ligase and hormone-responsive UniGenes were recovered from the NCBI and TAIR databases using all A. thaliana and O. sativa E3 ligase and hormone-responsive sequences as the query in homology searches using the BLASTn, BLASTx and tBLASTx programs (36). The BLAST hits were filtered using an e-value threshold of 10⁻⁵ and an alignment length exceeding 80 bp. All sequences were checked for consistency and for the presence of specific protein signatures using the InterProScan program (http://www.ebi.ac.uk/Tools/pfa/iprscan/). For seed storage protein UniGenes, homology searches were performed on the whole UniGene build #55, using BLASTx and T. aestivum seed storage protein sequences as reference. No preliminary filter was applied to BLASTx results. Instead, all the alignments were carefully examined, and similarity in known conserved critical regions of seed storage proteins was given priority over e-value and BLAST score alone. In dbWFA, curated UniGene annotations are assigned to T. aestivum transcripts without any intermediate BLAST result.

Following the recommendations of the International Wheat Genome Sequencing Consortium (IWGSC) for annotating T. aestivum genomic sequences (3), the percentages of coverage (with respect to the length of the orthologous proteins) and identity are used to assign functional annotations to a transcript. In dbWFA, users can define the value of these two parameters, but we strongly recommend using the cutoff values suggested by the IWGSC, where BLAST results with an identity >45% and coverage >50% are assigned a ‘putative function’ and BLAST results with identity and coverage >90% are assigned a ‘known function’.
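The two-level cutoff rule above can be written as a small Python function. This is a minimal sketch, not the dbWFA implementation: the function names are hypothetical, and coverage is assumed to be the aligned length divided by the length of the matched protein, expressed in percent.

```python
def coverage_percent(alignment_length, subject_length):
    """Coverage relative to the length of the matched protein (assumed definition)."""
    return 100.0 * alignment_length / subject_length


def annotation_level(identity, coverage):
    """Classify one BLAST hit using the IWGSC cutoffs (both values in percent)."""
    if identity > 90 and coverage > 90:
        return "known function"
    if identity > 45 and coverage > 50:
        return "putative function"
    return "unassigned"


# Identity and coverage values taken from the Query 1 table of this article.
print(annotation_level(identity=81.4, coverage=59.7))
print(annotation_level(identity=47.08, coverage=59.7))
```

Both hits fall in the 'putative function' class: their identity exceeds 45% and their coverage exceeds 50%, but neither value exceeds 90%.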

All data are stored in a MySQL database. The integration of the database allows one to assign the functional annotation from any of the systems described above to the transcripts of interest and vice versa. The dbWFA database thus provides a very powerful resource for the annotation of T. aestivum UniGenes. To find the most commonly sought types of information from dbWFA, simple yet pertinent queries with their parameters can be sent through a web-based interface (Figure 2). The results are delivered as HTML pages, and an export procedure is available to retrieve the data in spreadsheet format. The HTML result pages provide links redirecting the user to the websites of the different annotation systems, allowing a global analysis of the annotation results. The web interface can also be used to automatically create MapMan mapping files for the search results. Although the dbWFA web interface only allows data mining of common queries, specific queries can be performed using the SQL database, which can be downloaded from the dbWFA website. The modularity of the database will facilitate the integration of new T. aestivum data as transcripts are sequenced and annotated through different pipelines.

Figure 2

Screen capture of the web interface of the dbWFA database. (A) Page for querying PMN pathways. Similar pages can be used to query the MIPS Functional Category, TAIR gene families, GO and MapMan bins. A list of GO terms can be queried simultaneously. (B) Page for querying UniGene or full-length cDNA sequence annotations. (C) Result page for annotated UniGenes.

Using dbWFA: Percentage of Annotated UniGenes, Comparison of T. aestivum UniGene and A. thaliana and O. sativa Whole-Genome Annotation and Query Examples

Thirty-four percent (13 713 transcript sequences), 40% (14 843), 35% (20 016) and 35% (20 034) of the transcript sequences of the UniGene builds #55, #58, #59 and #60, respectively, have a putative functional annotation in at least one of the annotation resources. Eighty-one percent of the 17 541 full-length cDNA sequences from TriFLDB have a putative functional annotation in at least one of the annotation resources. The number of transcripts and full-length cDNA sequences annotated in the different resources are given in Table 1. BLASTn analysis revealed that 12 478 full-length cDNA sequences matched a sequence in the UniGene set (build #60) with a coverage and identity threshold value >50 and 90%, respectively. Among these 12 478 correspondences, 10 996 and 5 932 full-length cDNA sequences and UniGene sequences, respectively, have a putative functional annotation in at least one of the annotation resources. This result highlights the additional information brought by the full-length cDNA sequences.

Table 1

Number of T. aestivum transcripts from the NCBI UniGene set (build #60) and full-length cDNA (FL cDNA) sequences retrieved from the TriFLDB database, annotated with a putative function (coverage >50%, identity >45%) in at least one annotation system

Functional annotation systems	Number of annotated transcripts
	O. sativa		A. thaliana		Total^a
	NCBI UniGene	FL cDNA	NCBI UniGene	FL cDNA	NCBI UniGene	FL cDNA
MIPS functional classification			12 943	10 864	12 943	10 864
PlantCyc pathway reactions	2193	2106	2093	2208	3067	2911
GOs	13 142	8014	10 444	10 850	16 079	12 279
TAIR A. thaliana gene families			4498	3797	4498	3797
MapMan bins	19 248	14 032	13 202	10 897	20 033	14 224
Curated pathways or functions
Hormone-responsive genes					467
Ubiquitin-proteasome system					876
Transcription factors					2891

^aNumber of transcripts and full-length cDNA sequences annotated with a putative function in at least one model species.

The quality of the annotation method is illustrated by comparing the MapMan bin annotation of all the transcripts of the T. aestivum NimbleGen 40 k microarray (developed with UniGene build #55) and the full-length cDNA sequences from TriFLDB with the annotation of A. thaliana and O. sativa imported from MapMan and recorded in the database. The MapMan bins were used here because this annotation system is available for the three species. Overall, there was no clear bias between the three species (Figure 3), and the percentages of genes in the 26 categories for the three species were well correlated (T. aestivum versus A. thaliana: r = 0.96, P < 0.001; T. aestivum versus O. sativa: r = 0.69, P < 0.001), with no significant bias (P < 0.001). The higher correlation found with A. thaliana compared with O. sativa is mainly because there are fewer annotated transcripts in the DNA bin for O. sativa than for A. thaliana and T. aestivum (r = 0.90 for T. aestivum versus O. sativa when this bin is not considered). For the full-length coding sequences retrieved from TriFLDB and other public databases, the correlations between T. aestivum and A. thaliana and between T. aestivum and O. sativa were the same (r = 0.90, P < 0.001). The pairwise correlations between the four MapMan bin annotations presented were remarkably high, all >0.9 when the DNA bin was omitted. Similar results were obtained for the PlantCyc pathway reactions and GO (data not shown).
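The comparison above rests on correlating, across functional categories, the percentage of annotated genes in one species with that in another. A minimal Python sketch of that calculation is given below. It assumes a plain Pearson correlation on percentages; the counts are hypothetical and are not the data behind Figure 3, and the article does not state which software was used for this step.

```python
import math


def to_percent(counts):
    """Convert per-category counts to percentages of the total."""
    total = sum(counts)
    return [100.0 * c / total for c in counts]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical numbers of annotated genes in four categories (illustration only).
wheat_counts = [120, 80, 40, 10]
model_counts = [240, 160, 80, 20]

r = pearson_r(to_percent(wheat_counts), to_percent(model_counts))
print(round(r, 3))
```

Because the hypothetical counts are exactly proportional, the two percentage profiles are identical and r is 1 (up to rounding); real profiles that differ in a single large category, as described above for the DNA bin, would lower r noticeably.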

Figure 3

Radar plot (log scale) of the MapMan bin annotations for A. thaliana, O. sativa and T. aestivum UniGene (build #60) and full-length coding sequences. Data are percent of the total number of MapMan bin annotations (Table 1). Similar results were obtained with builds #55, #58 and #59 (data not shown). Some bins have been merged to make the figure clearer.

Unlike many annotation tools, dbWFA makes it possible to query multiple annotation systems simultaneously. To demonstrate various features of the dbWFA database, some query examples are presented in Box 1, using either the website or the database installed on a local machine.

Box 1. Query Examples

To demonstrate the usefulness of dbWFA, several biologically relevant queries that can be performed using the current system are presented. In these examples, the UniGene build #55 was used, with coverage and identity thresholds of 50 and 45%, respectively, as recommended by the IWGSC to assign a putative function to a transcript.

Query 1

Find all T. aestivum transcripts likely to have a phytoene synthase activity

UniGene			Matching sequences		Alignment parameters
Id number	Representing sequence	Description	Id number	Description	Coverage (%)	Identity (%)
Ta.41960	Ta_S16057905	T. aestivum clone wr1.pk0139.g3:fis, full insert mRNA sequence	LOC_OS06G51290	Phytoene synthase, chloroplast precursor, putative, expressed	59.7	81.4
			AT5G17230	Phytoene synthase	58.0	79.6

Ta.66029	Ta_S26027774	FGAS000498 T. aestivum FGAS: Library 2 Gate 3? T. aestivum cDNA, mRNA sequence	LOC_OS06G51290	phytoene synthase, chloroplast precursor, putative, expressed	55.3	48.9
			AT5G17230	Phytoene synthase	59.7	47.08

The first committed step in the biosynthesis of carotenoids is the condensation of two geranylgeranyl diphosphate molecules to produce phytoene. This reaction is catalysed by phytoene synthase and is a rate-controlling step in the plastid-localized carotenoid pathway (43). We could query the database for the PlantCyc pathway reaction 2.5.1.32 using its web interface. The result of this query is shown in the above table. Two T. aestivum transcripts were annotated with a putative phytoene synthase activity. In good agreement with this result, previous studies showed that Poaceae species possess a duplicated phytoene synthase gene (44). A thorough analysis of the two annotated UniGene sequences confirmed that they correspond to the duplicated phytoene synthase gene found in Poaceae. A third phytoene synthase has been isolated in Z. mays and T. aestivum (45, 46). Although the three O. sativa phytoene synthase genes are present in the database, the T. aestivum UniGene of this third phytoene synthase gene was not found in dbWFA.

The phytoene synthase activity also corresponds to GO:0016767 and MapMan bin 16.1.4.1. Searching dbWFA for this GO term or MapMan bin yields the same results as above. It is possible to combine several overlapping systems (e.g. PlantCyc pathway reaction and GO) in a single MySQL query when the database is installed on a local machine. It is also possible to compare queries and to take their union or intersection using MySQL, depending on the intended outcome.
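The idea of combining two annotation systems in one query, and of taking the union or intersection of their results, can be sketched as follows. This is only an illustration under loud assumptions: the table and column names are hypothetical and do not reflect the actual dbWFA schema, Python's built-in sqlite3 module stands in for a local MySQL installation, and the last row is an invented UniGene added so that union and intersection differ.

```python
import sqlite3

# In-memory stand-in for a local copy of the database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE annotation (unigene TEXT, source TEXT, term TEXT)")
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?)",
    [
        ("Ta.41960", "PlantCyc", "2.5.1.32"),
        ("Ta.41960", "GO", "GO:0016767"),
        ("Ta.66029", "PlantCyc", "2.5.1.32"),
        ("Ta.66029", "GO", "GO:0016767"),
        ("Ta.hypothetical", "GO", "GO:0016767"),  # invented row, illustration only
    ],
)

ONE_TERM = "SELECT unigene FROM annotation WHERE source = ? AND term = ?"


def combine(conn, operator, first, second):
    """Return UniGenes matching both terms (INTERSECT) or either term (UNION)."""
    if operator not in ("UNION", "INTERSECT"):
        raise ValueError("operator must be UNION or INTERSECT")
    sql = f"{ONE_TERM} {operator} {ONE_TERM} ORDER BY unigene"
    return [row[0] for row in conn.execute(sql, first + second)]


plantcyc = ("PlantCyc", "2.5.1.32")
go = ("GO", "GO:0016767")
both = combine(conn, "INTERSECT", plantcyc, go)
either = combine(conn, "UNION", plantcyc, go)
print(both)
print(either)
```

INTERSECT is not available in every MySQL version; where it is missing, an equivalent self-join on the UniGene identifier can be used instead.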

Query 2

Find as much information as possible about a list of transcripts

UniGene		GO	TAIR	MIPS	PlantCyc	MapMan
Id number	Match	GO	TAIR	MIPS	PlantCyc	MapMan
Ta.41960	AT5G17230	GO:0009507		01.06.06.13	2.5.1.32	16.1.4.1
	Phytoene synthase	GO:0016117		70.26.03	2.5.1.32
		GO:0016767
		GO:0046905

The efficiency of the database stems from its multiple systems of annotation. The cross-system annotation feature of dbWFA is integrated in the web interface in the ‘Transcript(s) annotation’ search method. This type of query could be used to obtain information for a list of UniGenes of interest in the different annotation systems integrated in dbWFA. Querying the UniGene set for the first phytoene synthase transcript retrieved in Query 1 yields the annotation shown in the above table. On the web interface, the user can choose to display only the best hit (as in the above table) or the five best hits with percentages of coverage and identity greater than the thresholds set by the user. The user can also choose the systems of annotation to include in the query and the model species. The results redirect the user to the web pages of the different annotation systems, which allows more detailed information to be obtained on the annotation of the list of transcripts of interest.

Query 3

Find all the transcripts putatively involved in the glycolytic pathway for a transcriptome analysis in MapMan

Bin code	Name	Identifier	Description	Type
4.1	Glycolysis.cytosolic branch	Ta_S16058223	Similar to UTP–glucose-1-phosphate uridylyltransferase, putative, expressed	T
4.1	Glycolysis.cytosolic branch	Ta_S16058223	Coverage: 99.5745%, identity: 92.75%	T
4.1.10	Glycolysis.cytosolic branch.non-phosphorylating glyceraldehyde 3-phosphate dehydrogenase (NPGAP-DH)	Ta_S13048872	Similar to aldehyde dehydrogenase	T
4.1.10		Ta_S13048872	Coverage: 100%, identity: 87.1%	T
4.1.10	Glycolysis.cytosolic branch.non-phosphorylating glyceraldehyde 3-phosphate dehydrogenase (NPGAP-DH)	Ta_S13048873	Similar to aldehyde dehydrogenase	T
4.1.10		Ta_S13048873	Coverage: 100%, identity: 79.23%	T
4.1.11	Glycolysis.cytosolic branch.aldolase	Ta_S15902802	Similar to aldolase superfamily protein	T
4.1.11	Glycolysis.cytosolic branch.aldolase	Ta_S15902802	Coverage: 50.1873%, identity: 85.07%	T
4.1.11	Glycolysis.cytosolic branch.aldolase	Ta_S17888674	Similar to aldolase superfamily protein	T
4.1.11	Glycolysis.cytosolic branch.aldolase	Ta_S17888674	Coverage: 88.5475%, identity: 48.91%	T

In the search method ‘MapMan mapping file generator’, the user can select a metabolic pathway and automatically create a mapping file to visualise the results of transcriptomic experiments performed with the T. aestivum custom NimbleGen 40 k microarray using the -omic data viewing and analysing tool MapMan. The glycolytic pathway corresponds to the bin code 4. The first five lines of the table generated by dbWFA for this query are shown above. When the database is installed on a local machine, several pathways could be queried simultaneously to create a custom T. aestivum mapping file for MapMan.
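The layout of the table above (two rows per identifier, one with the description of the best match and one with its coverage and identity) can be reproduced with a short Python sketch. This is not the dbWFA generator: the function names and the dictionary keys are hypothetical, the column labels are copied from the table above, and the exact header that MapMan expects should be checked against the MapMan documentation.

```python
import csv
import io

# Column labels as shown in the table above (assumed, not verified against MapMan).
HEADER = ["Bin code", "Name", "Identifier", "Description", "Type"]


def mapping_rows(hits):
    """Yield two rows per hit: a description row and an alignment-statistics row."""
    for hit in hits:
        common = [hit["bin"], hit["name"], hit["identifier"]]
        yield common + ["Similar to " + hit["description"], "T"]
        stats = "Coverage: {}%, identity: {}%".format(hit["coverage"], hit["identity"])
        yield common + [stats, "T"]


def write_mapping(hits, handle):
    """Write a tab-delimited mapping table to an open text handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(mapping_rows(hits))


# One entry taken from the Query 3 table of this article.
hits = [
    {
        "bin": "4.1.11",
        "name": "Glycolysis.cytosolic branch.aldolase",
        "identifier": "Ta_S15902802",
        "description": "aldolase superfamily protein",
        "coverage": 50.1873,
        "identity": 85.07,
    }
]

buffer = io.StringIO()
write_mapping(hits, buffer)
mapping_text = buffer.getvalue()
print(mapping_text)
```

Passing the hits of several pathways to the same function would give a single combined mapping table, which is the local-installation use case mentioned above.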

Identification and Annotation of UniGenes Specifically Expressed During Either the Early or Late Stage of Grain Development

A total of 39 029 transcripts from the UniGene set (build #55) and 1613 transcription factors from the wDBTF database not present in the UniGene set are spotted on the custom T. aestivum NimbleGen 40 k microarray (36). Previous studies have shown that 18 140 (44.6%) of these transcripts are expressed during T. aestivum grain development (47). In dbWFA, 34–40% (depending on the build) of these transcripts have a putative functional annotation.

T. aestivum grain development comprises several distinct phases, starting with a syncytial then a cellularization phase (ca. 0–100°Cdays after anthesis), followed by a first differentiation phase of active endosperm cell division (ECD), expansion and differentiation (ca. 100–250°Cdays after anthesis), a second differentiation phase when storage polymers rapidly accumulate (ca. 250–750°Cdays after anthesis) and a maturation phase when grain rapidly desiccates (ca. 750–900°Cdays after anthesis) (48, 49). The transitions between these phases are associated with major changes in the grain transcriptome (5, 36, 50, 51) and proteome (52, 53).

To validate the principle underpinning the database and provide another example of the usefulness of dbWFA, we analysed the functional annotation of the transcripts specifically expressed during either the ECD or SPA phase of grain development. We used transcriptome data obtained with the custom T. aestivum NimbleGen 40 k microarray for the T. aestivum cultivar Recital grown under standard conditions in a greenhouse and sampled every 34–117°Cdays between 132 and 686°Cdays after anthesis (35). Transcripts with different patterns of expression were classified with the J-Express 2012 software package (54) using Euclidean distance–based k-means clustering. For this analysis, the number of clusters was empirically set at 25 because this value allowed us to clearly discriminate between gene expression clusters specific to the ECD and SPA phases of grain development. One cluster of 238 genes contained genes expressed exclusively during ECD stages (Figure 4A and B). Two other clusters contained genes expressed exclusively during SPA stages. The latter two clusters were merged to form a single SPA cluster of 195 genes. dbWFA was then used to retrieve the functional classification of the transcripts from both clusters. The MIPS Functional Classification was used, as it was the most informative and straightforward for comparisons with previous studies.
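The clustering step can be illustrated with a minimal Euclidean k-means written in plain Python. This is a sketch of the general technique only, not the J-Express implementation used in the analysis: the expression profiles are hypothetical, two clusters are used instead of 25, and the initial centroids are supplied explicitly so that the result is reproducible.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(profiles, initial_centroids, max_iter=100):
    """Assign each profile to its nearest centroid until assignments stop changing."""
    centroids = [list(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in profiles
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [p for p, a in zip(profiles, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(v) / len(members) for v in zip(*members)]
    return assignment, centroids


# Hypothetical normalized expression at four sampling dates (illustration only):
# the first two profiles are high early, the last two are high late.
profiles = [
    [10, 9, 7, 7],
    [11, 9, 7, 6],
    [7, 7, 10, 12],
    [6, 7, 11, 12],
]
labels, centroids = kmeans(profiles, initial_centroids=[profiles[0], profiles[2]])
print(labels)
```

With these hypothetical profiles the early-expressed and late-expressed transcripts end up in separate clusters, which mirrors how ECD-specific and SPA-specific clusters were singled out among the 25 clusters.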

Figure 4

Functional annotation of genes specifically expressed during either the early cell division or late SPA phases of T. aestivum grain development. (A) Heat map of expression for early- and late-development-specific genes. (B) Normalized expression of the early- and late-development-specific gene clusters. Transcripts with normalized expression <7 were not considered to be expressed (i.e. not different from the background noise). Data are medians ± 1 SD. (C) MIPS Functional Categories of genes from both UniGene clusters.

Using the IWGSC-recommended coverage (50%) and identity (45%) thresholds, 68 (29%) of the ECD-specific transcripts and 129 (66%) of the SPA-specific transcripts were assigned to an MIPS functional category (Figure 4C). The annotation results were consistent with previous transcriptome (5, 51) and proteome (52, 53) studies of developing T. aestivum and H. vulgare (55) grain. The functional classifications of the two gene clusters differed. Not surprisingly, 12 transcripts involved in cell fate and cell type, tissue differentiation and organ differentiation were specifically expressed during the ECD phase, whereas no SPA phase–specific transcripts were annotated as being in these MIPS functional categories. Also in good agreement with our knowledge of grain development, 55 seed storage protein transcripts were specifically expressed during the SPA phase, while none was found among the annotated ECD-specific genes.
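The threshold step and the reported percentages can be checked with a short sketch. The coverage and identity cut-offs and the cluster sizes come from the text; the hit dictionaries, the function names and the choice of inclusive comparisons (>=) are assumptions made for illustration.

```python
MIN_COVERAGE = 50.0  # percent, threshold stated in the text
MIN_IDENTITY = 45.0  # percent, threshold stated in the text


def passes_thresholds(hit):
    """True if a homology hit meets both cut-offs (inclusive, by assumption)."""
    return hit["coverage"] >= MIN_COVERAGE and hit["identity"] >= MIN_IDENTITY


def annotated_percentage(n_annotated, n_total):
    """Percentage of transcripts that received an annotation."""
    return 100.0 * n_annotated / n_total


# Hypothetical hits, invented to show the filter; not taken from dbWFA.
EXAMPLE_HITS = [
    {"query": "transcript_A", "coverage": 82.0, "identity": 61.5},
    {"query": "transcript_B", "coverage": 48.0, "identity": 90.0},
    {"query": "transcript_C", "coverage": 75.0, "identity": 40.0},
]

if __name__ == "__main__":
    kept = [h["query"] for h in EXAMPLE_HITS if passes_thresholds(h)]
    print(kept)
    # Cluster sizes and annotated counts as reported in the text.
    print(round(annotated_percentage(68, 238)))   # ECD cluster
    print(round(annotated_percentage(129, 195)))  # SPA cluster
    print(round(annotated_percentage(68 + 129, 238 + 195)))  # both clusters
```

The last figure combines both clusters (197 of 433 transcripts) and corresponds to the proportion of transcripts of interest that were annotated overall.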

Quantitative differences in the annotation of these two clusters of transcripts were also observed. Several transcripts in the SPA-specific cluster were involved in cell rescue, defence and virulence and in the interaction with the environment. In particular, transcripts involved in plant hormonal regulation were overrepresented in the SPA-specific gene cluster. Transcripts coding for proteins involved in protein synthesis and proteins with metabolic functions were overrepresented in the ECD cluster. These results agree with previous transcriptome (5, 36, 56) and proteome analyses (53). Finally, we note a substantial difference in the MIPS functional category ‘protein with binding function or co-factor requirement’, with more transcripts involved in DNA binding in the ECD cluster and more transcripts involved in RNA binding in the SPA cluster.
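The text does not state how overrepresentation was assessed. One common option, shown here purely as an illustration, is a one-sided Fisher's exact test on a 2 × 2 table of category membership by cluster. The example uses counts reported earlier in this section (12 of the 68 annotated ECD-specific transcripts in the cell fate and differentiation categories versus none of the 129 annotated SPA-specific transcripts); using the annotated transcripts as denominators, and counting each transcript once, are assumptions.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins,
    where X is the count in the top-left cell.
    """
    row1 = a + b          # size of the first cluster
    col1 = a + c          # transcripts in the category, both clusters
    n = a + b + c + d     # all transcripts considered
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, x) * comb(n - row1, col1 - x) for x in range(a, upper + 1)
    )
    return numerator / comb(n, col1)


if __name__ == "__main__":
    # Rows: ECD cluster, SPA cluster. Columns: in category, not in category.
    p = fisher_one_sided(12, 68 - 12, 0, 129 - 0)
    print(f"one-sided p-value: {p:.3g}")
```

A small p-value would indicate that the category is more frequent in the first cluster than expected by chance; when many categories are tested, a multiple-testing correction would normally be applied as well.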

All these data closely match published results for T. aestivum and H. vulgare, which supports the accuracy of the automatic annotation provided by dbWFA. Unlike several other T. aestivum transcriptome analyses where a complex process had to be carried out to assign a functional annotation to selected transcripts and/or proteins (5, 57), here only a single request was made to the dbWFA database to retrieve the functional classification of 45% of the transcripts of interest, and 40% of the 40 642 transcripts of the T. aestivum NimbleGen 40 k microarray. This percentage of annotated transcripts is similar to that previously reported (38%) for the T. aestivum Affymetrix GeneChip® microarray (4).
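To make the idea of a single request concrete, the sketch below retrieves the functional categories for a list of transcripts with one SQL query. dbWFA itself is distributed as a MySQL database whose schema is not described in this section, so everything here is hypothetical: the `annotation` table, its column names, the transcript identifiers and the use of an in-memory SQLite database as a stand-in.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical annotation table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation (transcript_id TEXT, system TEXT, category TEXT)"
    )
    # Invented rows; identifiers and categories are placeholders.
    conn.executemany(
        "INSERT INTO annotation VALUES (?, ?, ?)",
        [
            ("Ta.demo1", "MIPS", "storage protein"),
            ("Ta.demo2", "MIPS", "protein synthesis"),
            ("Ta.demo2", "GO", "translation"),
            ("Ta.demo3", "MIPS", "metabolism"),
        ],
    )
    return conn


def fetch_categories(conn, transcript_ids, system):
    """Return (transcript_id, category) rows for all IDs in one query."""
    if not transcript_ids:
        return []
    placeholders = ",".join("?" for _ in transcript_ids)
    query = (
        "SELECT transcript_id, category FROM annotation "
        f"WHERE system = ? AND transcript_id IN ({placeholders}) "
        "ORDER BY transcript_id, category"
    )
    return conn.execute(query, [system, *transcript_ids]).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for row in fetch_categories(conn, ["Ta.demo1", "Ta.demo2"], "MIPS"):
        print(row)
```

Passing the whole cluster as a parameter list keeps the lookup to one round trip, and transcripts without a matching row simply do not appear in the result, which is how the unannotated fraction would show up.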

Outlook

The dbWFA database was created by integrating numerous data sources. As a result, it is a practical source of heterogeneous data for functional annotation of T. aestivum transcripts. The website grants access to the most common queries that can be applied to the database, and the freely available MySQL database is a powerful tool for more specific requests. Although further analyses are required to confirm the dbWFA annotation results, the database provides an efficient and fast solution for acquiring a wide range of functional information. cDNA resources are useful to predict exonic regions from genomic sequences; thus, efforts to annotate the UniGene resources will significantly contribute to the analysis of sequence data produced by ongoing T. aestivum initiatives and other genome sequencing projects.

The version of dbWFA presented here is operational. The aim is not to restrict the database to O. sativa and A. thaliana annotations, but to expand it with data from other plant species as their functional annotation becomes more consistent. Integrating InterProScan (58) into the workflow could be a valuable way of augmenting the process. Integrating AFAWE (59) would also provide an annotation workflow with different function prediction tools. However, the current version of AFAWE cannot be used independently of its web interface, so it would have to be reimplemented using the tools called in its workflow, which are available as web services. Finally, the upcoming integration of a BLAST program in the workflow will allow users to annotate their own sequences and will also make dbWFA applicable to other species.

Acknowledgements

The authors thank Dr Etienne Paux and Dr Catherine Feuillet (INRA, UMR1095 GDEC, Clermont-Ferrand, France) for useful discussions and advice, and Mr Sébastien Reboux, Ms Claire Viseux and Mr Michael Alaux (INRA, URGI, Versailles, France) for installing and maintaining the database on the URGI server.

Funding

This work was supported by a PhD grant from the French Ministry for Higher Education and Research to J.V.

Conflict of interest. None declared.

References

1. Feuillet, Eversole. Physical mapping of the wheat genome: a coordinated effort to lay the foundation for genome sequencing and develop tools for breeders. Isr. J. Plant Sci. 2007 (pg. 307–313).
2. Feuillet, Leach, Rogers, et al. Crop genome sequencing: lessons and rationales. Trends Plant Sci. 2011.
3. Leroy, Guilhot, Sakai, et al. TriAnnot: a versatile and high performance pipeline for the automated annotation of plant genomes. Front. Plant Sci. 2012.
4. Brenchley, Spannagl, Pfeifer, et al. Analysis of the bread wheat genome using whole-genome shotgun sequencing. Nature 2012, vol. 491 (pg. 705–710).
5. Wan, Poole, Huttly, et al. Transcriptome analysis of grain development in hexaploid wheat. BMC Genomics 2008, pg. 121.
6. Pellny, Lovegrove, Freeman, et al. Cell walls of developing wheat starchy endosperm: comparison of composition and RNA-Seq transcriptome. Plant Physiol. 2012, vol. 158 (pg. 612–627).
7. Köhler, Baumbach, Taubert, et al. Graph-based analysis and visualization of experimental results with ONDEX. Bioinformatics 2006 (pg. 1383–1390).
8. Lysenko, Hindle, Taubert, et al. Data integration for plant genomics–exemplars from the integration of Arabidopsis thaliana databases. Briefings Bioinformatics 2009 (pg. 676–693).
9. Lazo, Chao, Hummel, et al. Development of an expressed sequence tag (EST) resource for wheat (Triticum aestivum L.): EST generation, unigene analysis, probe selection and bioinformatics for a 16,000-locus bin-delineated map. Genetics 2004, vol. 168 (pg. 585–593).
10. Zhang, Sreenivasulu, Weschke, et al. Large-scale analysis of the barley transcriptome based on expressed sequence tags. Plant J. 2004 (pg. 276–290).
11. Allouis, Moore, Bellec, et al. Construction and characterisation of a hexaploid wheat (Triticum aestivum L.) BAC library from the reference germplasm ‘Chinese Spring’. Cereal Res. Commun. 2003 (pg. 331–338).
12. Safár, Bartos, Janda, et al. Dissecting large and complex genomes: flow sorting and BAC cloning of individual chromosomes from bread wheat. Plant J. 2004 (pg. 960–968).
13. Wilkinson, Winfield, Barker GLA, et al. CerealsDB 2.0: an integrated resource for plant breeders and scientists. BMC Bioinformatics 2012, pg. 219.
14. Lai, Berkman, Lorenc, et al. WheatGenome.info: an integrated database and portal for wheat genome information. Plant Cell Physiol. 2012.
15. Dong, Schlueter, Brendel. PlantGDB, plant genome database and analysis tools. Nucleic Acids Res. 2004, Database issue (pg. D354–D359).
16. Feuillet, Stein, Rossini, et al. Integrating cereal genomics to support innovation in the Triticeae. Funct. Integr. Genomics 2012 (pg. 573–583).
17. Wanamaker, Roose, et al. HarvEST. Methods Mol Biol. 2007, vol. 406 (pg. 161–177).
18. Estill, Bennetzen. The DAWGPAWS pipeline for the annotation of genes and transposable elements in plant genomes. Plant Methods 2009.
19. Mochida, Yoshida, Sakurai, et al. TriFLDB: a database of clustered full-length coding sequences from triticeae with applications to comparative grass genomics. Plant Physiol. 2009, vol. 150 (pg. 1135–1146).
20. Manickavelu, Kawaura, Oishi, et al. Comprehensive functional analyses of expressed sequence tags in common wheat (Triticum aestivum). DNA Res. 2012 (pg. 165–177).
21. The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000, vol. 408 (pg. 796–815).
22. Wang, et al. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 2002, vol. 296.
23. The International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature 2005, vol. 436 (pg. 793–800).
24. Schnable, Ware, Fulton, et al. The B73 maize genome: complexity, diversity, and dynamics. Science 2009, vol. 326 (pg. 1112–1115).
25. Schmutz, Cannon, Schlueter, et al. Genome sequence of the palaeopolyploid soybean. Nature 2010, vol. 463 (pg. 178–183).
26. Paterson, Bowers, Bruggmann, et al. The Sorghum bicolor genome and the diversification of grasses. Nature 2009, vol. 457 (pg. 551–556).
27. The International Brachypodium Initiative. Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature 2010, vol. 463 (pg. 763–768).
28. Mayer KFX, Waugh, Langridge, et al. A physical, genetic and functional sequence assembly of the barley genome. Nature 2012, vol. 491 (pg. 711–716).
29. Van Bel, Proost, Wischnitzki, et al. Dissecting plant genomes with the PLAZA comparative genomics platform. Plant Physiol. 2012, vol. 158 (pg. 590–600).
30. Dassanayake, Haas, et al. The genome of the extremophile crucifer Thellungiella parvula. Nat. Genet. 2011 (pg. 913–918).
31. Conesa, Götz, Garcia-Gomez, et al. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 2005 (pg. 3674–3676).
32. Lamesch, Berardini, et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 2011 (pg. D1202–D1210).
33. Ouyang, Zhu, Hamilton, et al. The TIGR rice genome annotation resource: improvements and new features. Nucleic Acids Res. 2007 (pg. D883–D887).
34. Altschul, Madden, Schäffer, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997 (pg. 3389–3402).
35. Romeuf, Tessier, Dardevet, et al. wDBTF: an integrated database resource for studying wheat transcription factor families. BMC Genomics 2010, pg. 185.
36. Capron, Mouzeyar, Boulaflous, et al. Transcriptional profile analysis of E3 ligase and hormone-related genes expressed during wheat grain development. BMC Plant Biol. 2012.
37. Rustenholz, Choulet, Laugier, et al. A 3,000-loci transcription map of chromosome 3B unravels the structural and functional features of gene islands in hexaploid wheat. Plant Physiol. 2011, vol. 157 (pg. 1596–1608).
38. Ashburner, Ball, Blake, et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 2011.
39. Zhang, Dreher, Karthikeyan, et al. Creation of a genome-wide metabolic pathway database for Populus trichocarpa using a new approach for reconstruction and curation of metabolic pathways for plants. Plant Physiol. 2010, vol. 153 (pg. 1479–1491).
40. Thimm, Bläsing, Gibon, et al. Mapman: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes. Plant J. 2004 (pg. 914–939).
41. Ruepp, Zollner, Maier, et al. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res. 2004 (pg. 5539–5545).
42. Rhee, Beavis, Berardini, et al. The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res. 2003 (pg. 224–228).
43. Cunningham, Gantt. Genes and enzymes of carotenoids biosynthesis in plants. Annu. Rev. Plant Physiol. Plant Mol. Biol. 1998 (pg. 557–583).
44. Gallagher, Matthews, et al. Gene duplication in the carotenoid biosynthetic pathway preceded evolution of the grasses. Plant Physiol. 2004, vol. 135 (pg. 1776–1783).
45. Vallabhaneni, Wurtzel. PSY3, a new member of the phytoene synthase gene family conserved in the poaceae and regulator of abiotic stress-induced root carotenogenesis. Plant Physiol. 2008, vol. 146 (pg. 1333–1345).
46. Dibari, Murat, Chosson, et al. Deciphering the genomic structure, function and evolution of carotenogenesis related phytoene synthases in grasses. BMC Genomics 2012, pg. 221.
47. Romeuf. Identification in silico des facteurs de transcription du blé tendre (Triticum aestivum) et mise en évidence des facteurs de transcription impliqués dans la synthèse des protéines de réserve. 2010. Ph.D. Thesis. Université Clermont-Ferrand II, Blaise Pascal, Clermont-Ferrand, France, pp. 223.
48. Bennett, Rao, Smith, et al. Cell development in the anther, the ovule, and the young seed of Triticum aestivum L. Var. chinese spring. Philos. T. R. Soc. B 1975, vol. 266.
49. Evers, Millar. Cereal grain structure and development: some implications for quality. J. Cereal Sci. 2002 (pg. 261–284).
50. Drea, Leader, Arnold, et al. Systematic spatial analysis of gene expression during wheat Caryopsis. Plant Cell. 2005 (pg. 2172–2185).
51. Laudencia-Chingcuanco, Stamova, You, et al. Transcriptional profiling of wheat caryopsis development using cDNA microarrays. Plant Mol. Biol. 2007 (pg. 651–668).
52. Nadaud, Girousse, Debiton, et al. Proteomic and morphological analysis of early stages of wheat grain development. Proteomics 2010 (pg. 2901–2910).
53. Tasleem-Tahir, Nadaud, Chambon, et al. Expression profiling of starchy endosperm metabolic proteins at 21 stages of wheat grain development. J. Proteome Res. 2012 (pg. 2754–2773).
54. Dysvik, Jonassen. J-Express: exploring gene expression data using Java. Bioinformatics 2001 (pg. 369–370).
55. Sreenivasulu, Radchuk, Strickert, et al. Gene expression patterns reveal tissue-specific signaling networks controlling programmed cell death and ABA-regulated maturation in developing barley seeds. Plant J. 2006 (pg. 310–327).
56. Clarke, Hobbs, Skylas, et al. Genes active in developing wheat endosperm. Funct. Integr. Genomics 2000.
57. Szucs, Jäger, Jurca, et al. Histological and microarray analysis of the direct effect of water shortage alone or combined with heat on early grain development in wheat (Triticum aestivum). Physiol Plant. 2010, vol. 140 (pg. 174–188).
58. Goujon, McWilliam, et al. A new bioinformatics analysis tools framework at EMBL–EBI. Nucleic Acids Res. 2010 (pg. W695–W699).
59. Jöcker, Hoffmann, Groscurth, et al. Protein function prediction and annotation in an integrated environment powered by web services (AFAWE). Bioinformatics 2008 (pg. 2393–2394).

Author notes

Present address: Zhanwu Dai, INRA, ISVV, UMR1287 Écophysiologie et Génomique Fonctionnelle de la Vigne (EGFV), F-33 882 Villenave d’Ornon, France

Citation details: Vincent,J., Dai,Z.W., Ravel,C. et al. dbWFA: a web-based database for functional annotation of Triticum aestivum transcripts. Database (2013) Vol. 2013: article ID bat014; doi:10.1093/database/bat014

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
