Tetrahymena Functional Genomics Database (TetraFGD): an integrated resource for Tetrahymena functional genomics

Abstract

The ciliated protozoan Tetrahymena thermophila is a useful unicellular model organism for studies of eukaryotic cellular and molecular biology. Research on T. thermophila has contributed to the discovery of a number of fundamental biological principles. Since the macronuclear genome was sequenced, substantial progress has been made in functional genomics research on T. thermophila, including genome-wide microarray analysis of the T. thermophila life cycle, a T. thermophila gene network analysis based on the microarray data and transcriptome analysis by deep RNA sequencing. To meet the growing demands of the Tetrahymena research community, we integrated these data into a publicly accessible database: the Tetrahymena functional genomics database (TetraFGD). TetraFGD contains three major resources: the RNA-Seq transcriptome, microarray data and gene networks. The RNA-Seq data define gene structures and the transcriptome, with special emphasis on exon–intron boundaries; the microarray data describe gene expression at 20 time points during three major stages of the T. thermophila life cycle; the gene network data identify potential gene–gene interactions among 15 049 genes. The TetraFGD provides user-friendly search functions that assist researchers in accessing gene models, transcripts, gene expression data and gene–gene relationships. In conclusion, the TetraFGD is an important functional genomics resource for researchers who focus on Tetrahymena or other ciliates.

Database URL: http://tfgd.ihb.ac.cn/

Introduction

Tetrahymena thermophila is a free-living ciliated protozoan that normally has two types of functionally distinct nuclei (1), the silent germ line micronucleus (MIC) and the actively transcribed somatic macronucleus (MAC) in each cell. Its typical eukaryotic biology and many molecular genetic tools have enabled Tetrahymena researchers to contribute to landmark discoveries of fundamental eukaryotic cellular mechanisms, such as the first cytoskeletal motor (2), catalytic RNA (3), telomere structure (4) and telomerase (5) and the role of small RNAs in programmed somatic genome rearrangement (6). Although its genome has been sequenced and its genetics and molecular biology have been extensively studied, research on T. thermophila has been limited by the lack of some basic genomic resources, in particular, functional genomics data.

In 2006, the MAC genome sequence project of T. thermophila was completed (7), providing the first ciliate genome sequence. Subsequently, the Tetrahymena genome database (TGD and TGD Wiki, http://ciliate.org), containing the genome sequence and the predicted gene models, was established (8, 9). The first insights into the Tetrahymena transcriptome came from expressed sequence tag (EST) sequences generated by the initial genome sequence project and the protist EST program, which can be accessed through the NCBI EST database and the PEP database (TBestDB, http://www.bch.umontreal.ca/pepdb/pepdb.html), respectively. Miao et al. (10) initiated Tetrahymena functional genomics by establishing the first T. thermophila microarray platform and performing a genome-wide investigation of gene expression during the three major physiological and developmental stages of the T. thermophila life cycle. These microarray data can be accessed via the Tetrahymena gene expression database (TGED, http://tged.ihb.ac.cn/) (11). Recently, the microarray data were extended to infer a T. thermophila gene network (12), providing the first insight into gene–gene relationships in T. thermophila. Finally, deep RNA sequencing (RNA-Seq) provided a more comprehensive and detailed analysis of the transcriptome and greatly improved the gene models (13). Given these recent genomic studies, an integrated database is needed to provide easy access to functional genomics data for members of the scientific research community interested in T. thermophila.

In this study, we describe the Tetrahymena functional genomics database (TetraFGD), which provides user-friendly search functions for accessing gene models, transcripts, gene expression data and gene–gene relationships in Tetrahymena.

Content

The TetraFGD is an online resource containing three major functional genomic data sets of T. thermophila: (i) RNA-Seq data; (ii) microarray data; and (iii) gene network.

RNA-Seq data

These data were obtained by Illumina deep RNA sequencing (13) of six RNA samples from three major physiological and developmental stages of T. thermophila: one in growth (∼3.5 × 10⁵ cells/ml), three in starvation (mating types V and VI at 3 h, and mating type VI at 15 h) and two in conjugation (2 and 8 h after mixing of two mating types). To ensure that the RNA-Seq data could be compared with the microarray data, the time points were selected as a subset of those in the previous microarray expression studies, which covered 20 states of the three stages of the T. thermophila life cycle. After mapping the RNA-Seq data to the genome, more than 96% of the predicted genes had detectable reads. More than 30 000 transcripts were assembled, including >1000 new transcripts that were not found by gene scanning. These transcripts were used to improve the previous gene models. Comparison of the RNA-Seq data with the gene annotation (13) revealed errors in over 7000 predicted gene models and greatly improved the identification of coding sequences, untranslated regions and exon–intron boundaries. Although the RNA-Seq technology is powerful, some assembled transcripts do not contain complete open reading frames because of limited coverage and assembly. Gene scanning, by contrast, may be less accurate than RNA-Seq at predicting transcript structure, but it supplies complete open reading frames. The two approaches are therefore complementary, and we shared the RNA-Seq data with the Tetrahymena team at the Broad Institute, who are using them to update gene prediction and genome annotation.
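The comparison between a predicted gene model and an RNA-Seq assembled transcript can be pictured with a minimal Python sketch. It assumes that each model is available as a list of (start, end) exon coordinates on the same scaffold and strand; the function names and the toy coordinates are illustrative and are not taken from the pipeline described in (13).

```python
def introns_from_exons(exons):
    """Return the set of (start, end) introns implied by a list of exons.

    Exons are 1-based, inclusive (start, end) pairs on one scaffold.
    """
    ordered = sorted(exons)
    return {
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    }


def boundary_conflicts(predicted_exons, transcript_exons):
    """Compare the introns of a predicted model with those of a transcript."""
    predicted = introns_from_exons(predicted_exons)
    observed = introns_from_exons(transcript_exons)
    return {
        "only_in_prediction": sorted(predicted - observed),
        "only_in_transcript": sorted(observed - predicted),
    }


# Toy coordinates (hypothetical): the second intron starts at a different position.
predicted = [(100, 200), (260, 400), (450, 600)]
assembled = [(100, 200), (260, 380), (450, 600)]
report = boundary_conflicts(predicted, assembled)
```

Because an assembled transcript may be only a fragment of a gene, an intron that appears only in the prediction is not necessarily an error; a practical comparison would probably be restricted to the region covered by the transcript.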

Microarray data

The TetraFGD now contains microarray gene expression data covering 20 time points during the three major physiological and developmental stages of the T. thermophila life cycle: 3 points in growth, 7 points in starvation and 10 points in conjugation. Microarray expression values might be wrong if a gene was mis-predicted, whereas the RNA-Seq assembled transcripts provide more accurate transcription information. Therefore, the TetraFGD now integrates two types of microarray expression values: (i) values normalized based on the predicted genes (whose gene models may be incorrectly predicted); and (ii) values normalized based on the RNA-Seq assembled transcripts (some of which are transcription fragments). Both types of expression values were normalized using the microarrays reported previously (10).

For the predicted genes, two kinds of normalized expression values are available. Taking the gene TTHERM_00257230 (http://tfgd.ihb.ac.cn/search/detail/gene/TTHERM_00257230) as an example, two expression profiles are shown when you search the database, represented by a blue and a red line. The blue line represents the expression values from the Tetrahymena gene expression database (TGED) (11), and the red line represents the expression values that were normalized by the laboratory of Prof. Ronald Pearlman using a different method (most of the raw data are the same as those used for the blue line, with 10 new microarrays from the Tetrahymena conjugation stage added).

For the RNA-Seq assembled transcripts (each transcript was regarded as a gene model, although it may be only a transcription fragment), all previously designed microarray probes (10) were re-mapped to the RNA-Seq assembled transcripts, and the microarray expression values for these transcripts were re-normalized as in the study by Miao et al. (10) and are provided in the TetraFGD.
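The re-mapping step can be illustrated with a short Python sketch. It is a simplification under stated assumptions: probes are matched to transcripts by exact substring search on the given strand only, and the per-transcript summary is a plain median, which stands in for (and does not reproduce) the normalization used by Miao et al. (10). All identifiers, sequences and intensities below are made up.

```python
from collections import defaultdict
from statistics import median


def remap_probes(probes, transcripts):
    """Assign each probe to every transcript that contains its exact sequence."""
    hits = defaultdict(list)
    for probe_id, probe_seq in probes.items():
        for transcript_id, transcript_seq in transcripts.items():
            if probe_seq in transcript_seq:
                hits[transcript_id].append(probe_id)
    return dict(hits)


def summarize(hits, intensities):
    """Summarize probe intensities per transcript with the median (a stand-in)."""
    return {
        transcript_id: median(intensities[p] for p in probe_ids)
        for transcript_id, probe_ids in hits.items()
    }


# Hypothetical toy data, for illustration only.
probes = {"p1": "ACGTAC", "p2": "GGATCC", "p3": "TTTTGA"}
transcripts = {"t1": "AAACGTACGGATCCAA", "t2": "CCTTTTGACC"}
intensities = {"p1": 100.0, "p2": 300.0, "p3": 50.0}

hits = remap_probes(probes, transcripts)
summary = summarize(hits, intensities)
```

A transcript that contains no probe sequence simply receives no expression value in this sketch, which mirrors the fact that expression can only be reported for features covered by the array design.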

Gene network

Network analysis can be used to identify genes in the same biological processes or pathways and to infer interactions among biomolecules, such as physical association, metabolite flow, regulatory relationships and co-expression relationships. An important resource in the TetraFGD is the Tetrahymena gene network (TGN). The TGN was constructed using the context likelihood of relatedness algorithm (CLR, a mutual information-based method that extends relevance networks) (14) based on expression data from 67 Roche NimbleGen single-channel microarrays (12), which means that connected genes in the TGN have similar expression profiles. After gene filtering (12), 15 049 genes were used to infer the gene network. With a CLR Z-score threshold of 3.49, 1 958 477 gene–gene interactions were included in the TGN (12). The larger the Z-score between two genes in the TGN, the more reliable the interaction (the more similar the expression profiles) between them. Several experimentally verified cases showed that genes connected in the TGN are likely to have related functions, such as the proteasome complex (12), the adenosine triphosphate synthase complex (15) and genes involved in DNA rearrangement during Tetrahymena MAC development (6). Thus, the TGN is an important resource for studying Tetrahymena genes at the pathway level.
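To make the role of the CLR Z-score concrete, the following Python sketch applies the scoring step of CLR as described by Faith et al. (14) to a small mutual-information matrix. It assumes that the mutual information has already been estimated, uses each gene's values against all other genes as its background distribution, and treats a score equal to the threshold as an edge; these details, the function names and the toy matrix are assumptions made for illustration and may differ from the implementation used to build the TGN (12).

```python
import math
from statistics import mean, pstdev


def clr_z_scores(mi):
    """Convert a symmetric mutual-information matrix into CLR Z-scores."""
    n = len(mi)
    # Background for each gene: mean and standard deviation of its MI values
    # against every other gene (the diagonal is excluded here).
    background = []
    for i in range(n):
        others = [mi[i][j] for j in range(n) if j != i]
        background.append((mean(others), pstdev(others)))

    def one_sided(value, mu, sigma):
        # Negative deviations are clipped to zero, as in CLR.
        return max(0.0, (value - mu) / sigma) if sigma > 0 else 0.0

    z = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            zi = one_sided(mi[i][j], *background[i])
            zj = one_sided(mi[i][j], *background[j])
            z[i][j] = z[j][i] = math.hypot(zi, zj)  # sqrt(zi^2 + zj^2)
    return z


def edges_above(z, gene_ids, threshold=3.49):
    """List gene pairs whose CLR Z-score reaches the threshold."""
    return [
        (gene_ids[i], gene_ids[j], z[i][j])
        for i in range(len(gene_ids))
        for j in range(i + 1, len(gene_ids))
        if z[i][j] >= threshold
    ]


# Toy matrix (hypothetical): only genes A and B share high mutual information.
toy_mi = [
    [0.0, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
toy_ids = ["A", "B", "C", "D"]
toy_z = clr_z_scores(toy_mi)
toy_edges = edges_above(toy_z, toy_ids, threshold=1.5)
```

With only four genes the scores cannot approach 3.49, so the toy call lowers the threshold; in a network of 15 049 genes each background distribution is far larger and the published threshold becomes meaningful.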

Construction

The schema of the TetraFGD is shown in Figure 1. Gene models, transcript sequences, microarray expression data and gene network data are stored in a MySQL database, and the transcript sequences are also formatted as a Basic Local Alignment Search Tool (BLAST) database. The web interface for searching these data was written in PHP (Hypertext Preprocessor). Apache2 (http://httpd.apache.org/) running on a CentOS operating system is used as the web server. To provide a convenient way to check the gene models and the transcripts, Gbrowse2 (http://gmod.org/wiki/GBrowse) was set up for graphically viewing the RNA-Seq data. A BLAST web server was also set up for sequence-based searching against the formatted transcript database. Search functions make it easy to access the three major resources in the TetraFGD.
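As a rough picture of how the four kinds of data could sit together in a relational store, the sketch below builds a small in-memory database in Python. It is not the TetraFGD schema: SQLite stands in for MySQL so that the example is self-contained, and every table name, column name and record is hypothetical.

```python
import sqlite3

# Hypothetical tables, one per kind of data named in the text.
SCHEMA = """
CREATE TABLE gene (gene_id TEXT PRIMARY KEY, description TEXT);
CREATE TABLE transcript (
    transcript_id TEXT PRIMARY KEY,
    gene_id TEXT REFERENCES gene(gene_id),
    sequence TEXT
);
CREATE TABLE expression (
    feature_id TEXT, time_point TEXT, value REAL,
    PRIMARY KEY (feature_id, time_point)
);
CREATE TABLE network_edge (
    gene_a TEXT, gene_b TEXT, z_score REAL,
    PRIMARY KEY (gene_a, gene_b)
);
"""


def build_demo_db():
    """Create the hypothetical schema and load two placeholder records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO gene VALUES (?, ?)", ("GENE_A", "placeholder description"))
    conn.execute("INSERT INTO transcript VALUES (?, ?, ?)", ("TX_1", "GENE_A", "ATGC"))
    conn.commit()
    return conn


def transcripts_for_gene(conn, gene_id):
    """Return the transcript IDs linked to one gene, in sorted order."""
    rows = conn.execute(
        "SELECT transcript_id FROM transcript WHERE gene_id = ? ORDER BY transcript_id",
        (gene_id,),
    )
    return [row[0] for row in rows]


demo_conn = build_demo_db()
demo_transcripts = transcripts_for_gene(demo_conn, "GENE_A")
```

Keeping gene models and transcripts in separate tables linked by gene ID is one way to support searches by either identifier; whether the TetraFGD is organized this way is not stated in the text.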

Figure 1

The schema of the TetraFGD.

Use and discussion

The TetraFGD can be accessed via the World Wide Web at http://tfgd.ihb.ac.cn/. An integrated search box at the top of each page of the TetraFGD website provides quick access to the data. You can use ‘Gene ID’, ‘Keyword’ or ‘Transcript ID’ to search the database (Figure 2A). If you search by ‘Gene ID’, you should provide a ‘TTHERM_XXXXXXXX’ style ID, which was originally generated by gene prediction and is also used in the Tetrahymena genome database (TGD, http://www.ciliate.org/). If you search by ‘Keyword’, you can type any word(s) in the search box, and the database will return the record(s) whose gene name (based on the gene annotation, not on gene names from individual published studies) contains your keyword. If you search by ‘Transcript ID’, you should provide a transcript fragment ID from the RNA-Seq database. This ID can be found in either of two ways: (i) by using the ‘TTHERM_XXXXXXXX’ ID to find its related transcript fragment ID(s), either by searching the database or through the look-up table (http://tfgd.ihb.ac.cn/index/version); or (ii) by using your sequence in a BLAST search against the RNA-Seq assembled transcript database (http://tfgd.ihb.ac.cn/tool/blast).
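The three query types can be told apart programmatically, as in the following Python sketch. It assumes that a Gene ID is ‘TTHERM_’ followed by exactly eight digits (as in TTHERM_00257230) and that a Transcript ID is ‘gene_’ followed by nine digits (inferred from the example gene_000012474 in Figure 3B); both patterns and the function names are assumptions. The detail-page address follows the form of the TTHERM_00257230 link given in the microarray section.

```python
import re

GENE_ID = re.compile(r"TTHERM_\d{8}")      # assumed pattern, e.g. TTHERM_00257230
TRANSCRIPT_ID = re.compile(r"gene_\d{9}")  # assumed pattern, e.g. gene_000012474


def classify_query(text):
    """Guess which search type a query string corresponds to."""
    query = text.strip()
    if GENE_ID.fullmatch(query):
        return "Gene ID"
    if TRANSCRIPT_ID.fullmatch(query):
        return "Transcript ID"
    return "Keyword"


def gene_detail_url(gene_id):
    """Build the gene detail address for a well-formed Gene ID."""
    if not GENE_ID.fullmatch(gene_id):
        raise ValueError("not a TTHERM_XXXXXXXX style ID: %r" % gene_id)
    return "http://tfgd.ihb.ac.cn/search/detail/gene/" + gene_id


example_kind = classify_query(" TTHERM_00257230 ")
example_url = gene_detail_url("TTHERM_00257230")
```

Checking the ID style before sending a request is a cheap way to catch identifiers copied from a different annotation version or with a missing digit.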

Figure 2

Screenshots of searching function interfaces of the TetraFGD website. (A) Integrated searching box. The red arrow indicates the pull-down button to the searching menu for choosing the ‘Gene ID’, ‘Keyword’ or ‘Transcript ID’. (B–D) Individual searching boxes for RNA-Seq, microarray and gene network.

In addition to the integrated search box, we have also designed individual search functions for RNA-Seq (Figure 2B), microarray (Figure 2C) and gene network (Figure 2D) data.

Searching RNA-Seq data

The TetraFGD displays the RNA-Seq data graphically and performs its search function through Gbrowse (http://gmod.org/wiki/Main_Page). Typically, a Gene ID is used to search the data (Figure 2B), but a keyword (gene annotation), a transcript ID or a scaffold region is also accepted. The Gbrowse search result page shows four tracks: a predicted gene model track, an RNA-Seq assembled transcript track (linked to the transcript sequence and its microarray expression information), an RNA-Seq coverage plot track and a microarray probe track (Figure 3A). Through these tracks, you can check for gene prediction mistakes and retrieve the transcript sequence and the exon–intron structure of the gene. These data are useful for downstream studies of gene function. In Gbrowse, you can select any region of interest and export its sequence in FASTA format by clicking ‘Download Decorated FASTA File’ in the pull-down box. In addition, Gbrowse allows the user to export a high-resolution image.

Figure 3

Screenshots of search result interfaces of the TetraFGD website. (A) Gbrowse snapshot showing the RNA-Seq search result for the gene TTHERM_00257230. (B) Microarray result page for RNA-Seq assembled transcript, taking the gene_000012474 as an example (searching this ID in the top searching box by choosing the ‘Transcript ID’). (C and D) The screenshot montages of results for ‘Single-Search’ and ‘Multi-Search’ in TGN.

Searching microarray data

TetraFGD currently contains microarray data for 20 time points during the three major physiological and developmental stages of the T. thermophila life cycle. A Gene ID can be used to search the data either in the top search box or on the individual microarray search page (Figure 2C). You can search the microarray expression values based on either the predicted gene models or the RNA-Seq assembled transcripts. For the predicted gene models, the microarray result page follows the style of the TGED: it includes the Gene ID (hyperlinked to the TGD), description and sequence (cDNA and protein) and shows the expression profile [see details in (11)]. For the RNA-Seq assembled transcripts, re-normalized microarray expression data were added to the TetraFGD to help researchers obtain more accurate expression information, and the result page gives the summary information, the sequence and the expression profile (Figure 3B). In addition, a new function retrieves the detailed gene expression values for each state when the ‘Value Table=>’ button is clicked (Figure 3B).
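Once the 20 expression values of a gene have been retrieved, they can be grouped by life-cycle stage, as in this short Python sketch. It relies on the 3, 7 and 10 time points per stage stated in the microarray section and assumes that the values are ordered growth, then starvation, then conjugation; that ordering and the function names are assumptions, and the example profile is made up.

```python
# Number of time points per stage, as stated in the text (3 + 7 + 10 = 20).
STAGE_SIZES = (("growth", 3), ("starvation", 7), ("conjugation", 10))


def split_by_stage(values):
    """Split a 20-value expression profile into the three life-cycle stages."""
    expected = sum(size for _, size in STAGE_SIZES)
    if len(values) != expected:
        raise ValueError("expected %d values, got %d" % (expected, len(values)))
    stages, start = {}, 0
    for name, size in STAGE_SIZES:
        stages[name] = list(values[start:start + size])
        start += size
    return stages


def peak_stage(values):
    """Name the stage that contains the highest expression value."""
    stages = split_by_stage(values)
    return max(stages, key=lambda name: max(stages[name]))


# Hypothetical profile whose values simply increase over time.
profile = list(range(20))
by_stage = split_by_stage(profile)
top = peak_stage(profile)
```

Grouping the values this way makes it straightforward to ask stage-level questions, such as whether a gene is expressed mainly during conjugation.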

Searching Tetrahymena gene network data

The TGN was constructed with 15 049 genes (12) using the CLR algorithm (14), and it provides a resource for retrieving genes that may be functionally related. On the gene network page, one can enter a single gene ID for ‘Single-Search’ or multiple gene IDs for ‘Multi-Search’ (Figure 2D). These two search functions were designed for different needs. ‘Single-Search’ finds all potential interacting partners of the query gene and returns a list of candidate genes with expression patterns similar to that of your query gene in the TGN (Figure 3C). ‘Multi-Search’ finds the potential interactions among a set of genes and returns the gene–gene interactions among your supplied genes in the TGN (Figure 3D). All search results can be downloaded as a tab-delimited text file that can be visualized using the Cytoscape software (http://www.cytoscape.org/) (16).
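The difference between the two search modes can be sketched in a few lines of Python operating on an edge list. The sketch assumes that each interaction is a (gene, gene, Z-score) triple; the function names, the column headers of the tab-delimited output and the toy edges are hypothetical and do not describe the actual download format.

```python
import csv
import io


def single_search(edges, gene):
    """Return all partners of one gene, strongest Z-score first."""
    partners = []
    for a, b, z in edges:
        if gene == a:
            partners.append((b, z))
        elif gene == b:
            partners.append((a, z))
    return sorted(partners, key=lambda item: item[1], reverse=True)


def multi_search(edges, genes):
    """Return only the interactions whose two ends are both in the query set."""
    wanted = set(genes)
    return [(a, b, z) for a, b, z in edges if a in wanted and b in wanted]


def to_tab_delimited(edges):
    """Serialize edges as tab-delimited text (hypothetical column names)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "target", "z_score"])
    writer.writerows(edges)
    return buffer.getvalue()


# Hypothetical toy network.
toy_network = [("G1", "G2", 5.2), ("G1", "G3", 3.6), ("G2", "G4", 4.1)]
neighbours = single_search(toy_network, "G1")
subnetwork = multi_search(toy_network, ["G1", "G2", "G4"])
table = to_tab_delimited(subnetwork)
```

A three-column table of this kind (source, target, edge attribute) is a common shape for network import in tools such as Cytoscape.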

Other services in the TetraFGD

Besides the functions described above, the TetraFGD now provides a BLAST web server (17), which allows the RNA-Seq assembled transcripts to be searched with a nucleotide sequence by BlastN or with a protein sequence by TBlastN (http://tfgd.ihb.ac.cn/tool/blast). Moreover, detailed ‘Help’ information is provided to make the database easy to use. On the TetraFGD ‘Search Help’ page (http://tfgd.ihb.ac.cn/index/schhelp), we supply detailed introductions to and explanations of the search functions for accessing all the resources. The ‘Sample Preparation’ page (http://tfgd.ihb.ac.cn/index/smphelp) describes the standardized culture conditions for growth, starvation and conjugation used in preparing RNA for microarrays and RNA-Seq.

It is worth noting that T. thermophila gene annotations have been updated several times. The microarray platform was designed according to the 2006 version of the genome annotation, whereas the RNA-Seq data were analysed using the 2008 version, and there are some differences between the two. To make the database convenient to use, we therefore provide a look-up table (http://tfgd.ihb.ac.cn/index/version) for conversions among 2006 genome annotation IDs, 2008 genome annotation IDs, transcript IDs and gene descriptions. The user will receive a warning when a search uses an improper gene ID.
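A look-up table of this kind can be used locally for ID conversion, as the Python sketch below suggests. It assumes a tab-delimited file with one row per record and four columns; the column names (id_2006, id_2008, transcript_id, description) and the placeholder rows are hypothetical, and the real table may be laid out differently, for instance with several transcripts per gene.

```python
import csv
import io


def load_lookup(handle):
    """Index a tab-delimited look-up table by each of its ID columns."""
    by_2006, by_2008, by_transcript = {}, {}, {}
    for row in csv.DictReader(handle, delimiter="\t"):
        by_2006[row["id_2006"]] = row
        by_2008[row["id_2008"]] = row
        by_transcript[row["transcript_id"]] = row
    return by_2006, by_2008, by_transcript


def to_2008_id(gene_id, by_2006, by_2008):
    """Convert a 2006 or 2008 annotation ID to its 2008 form."""
    if gene_id in by_2008:
        return gene_id
    if gene_id in by_2006:
        return by_2006[gene_id]["id_2008"]
    raise KeyError("unknown gene ID: %r" % gene_id)


# Placeholder rows (not real TetraFGD records).
DEMO_TABLE = (
    "id_2006\tid_2008\ttranscript_id\tdescription\n"
    "OLD_0001\tNEW_0001\tTX_0001\tplaceholder protein\n"
    "OLD_0002\tNEW_0002\tTX_0002\tanother placeholder\n"
)
by_2006, by_2008, by_transcript = load_lookup(io.StringIO(DEMO_TABLE))
converted = to_2008_id("OLD_0001", by_2006, by_2008)
```

Raising an error for an unknown ID plays the same role here as the warning that the website shows for an improper gene ID.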

Further development of the TetraFGD

Future development of the TetraFGD will include uploading and integrating additional Tetrahymena functional genomics data sets, such as microarray gene expression, transcriptome and re-sequencing data from T. thermophila under stress from exposure to pollutants or showing the effects of specific gene mutations, as well as phosphorylation proteomics data.

Conclusions

The TetraFGD website substantially improves on the original TGED website through the addition of RNA-Seq and gene network databases as well as BLAST searching. To facilitate access to these resources, a user-friendly web interface was developed. The TetraFGD website (together with the earlier TGED website) has already attracted considerable interest from scientists worldwide: web traffic records indicate that the sites received, on average, >150 unique visits per day from >40 countries during the past 2 years. In conclusion, the TetraFGD is an important integrated functional genomics resource, which is freely available to interested researchers.

Funding

Knowledge Innovation Program of CAS (KSCX2-EW-G-6-4); the Scientific Research Foundation for the Returned Overseas Chinese Scholars State Education Ministry; the open foundation of the State Key Laboratory of Genetics Resources and Evolution (GREKF10-09); National Scientific Data Sharing Platform for population and health, biologic medicine information center of China (2005DKA32402 to W.M.).

Conflict of interest. None declared.

Acknowledgements

The authors thank Prof. Martin Gorovsky (University of Rochester) for his critical review of the manuscript. They also thank the Gbrowse community members for their help by email with installing Gbrowse on the server, and the Tetrahymena community for suggestions and for pointing out bugs in the database.

References

1. Asai, Forney. Tetrahymena thermophila, Methods in Cell Biology, 2000, Orlando, FL, Academic Press.

2. Gibbons, Rowe. Dynein: a protein with adenosine triphosphatase activity from cilia, Science, 1965, vol. 149 (pg. 424–426).

3. Kruger, Grabowski, Zaug, et al. Self-splicing RNA: autoexcision and autocyclization of the ribosomal RNA intervening sequence of Tetrahymena, Cell, 1982 (pg. 147–157).

4. Blackburn, Gall. A tandemly repeated sequence at the termini of the extrachromosomal ribosomal RNA genes in Tetrahymena, J. Mol. Biol., 1978, vol. 120.

5. Greider, Blackburn. Identification of a specific telomere terminal transferase activity in Tetrahymena extracts, Cell, 1985 (pg. 405–413).

6. Yao, Chao. RNA-guided DNA deletion in Tetrahymena: an RNAi-based mechanism for programmed genome rearrangements, Annu. Rev. Genet., 2005 (pg. 537–559).

7. Eisen, Coyne, et al. Macronuclear genome sequence of the ciliate Tetrahymena thermophila, a model eukaryote, PLoS Biol., 2006, pg. e286.

8. Stover, Krieger, Binkley, et al. Tetrahymena genome database (TGD): a new genomic resource for Tetrahymena thermophila research, Nucleic Acids Res., 2006 (pg. D500–D503).

9. Stover, Punia, Bowen, et al. Tetrahymena Genome Database Wiki: a community-maintained model organism database, Database (Oxford), 2012, vol. 2012, pg. bas007.

10. Miao, Xiong, Bowen, et al. Microarray analyses of gene expression during the Tetrahymena thermophila life cycle, PLoS One, 2009, pg. e4429.

11. Xiong, et al. Tetrahymena Gene Expression Database (TGED): a resource of microarray data and co-expression analyses for Tetrahymena, Sci. China Life Sci., 2011.

12. Xiong, Yuan, Fillingham, et al. Gene network landscape of the ciliate Tetrahymena thermophila, PLoS One, 2011, pg. e20124.

13. Xiong, Zhou, et al. Transcriptome analysis of the model protozoan, Tetrahymena thermophila, using deep RNA sequencing, PLoS One, 2012, pg. e30630.

14. Faith, Hayete, Thaden, et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles, PLoS Biol., 2007.

15. Nina, Dudkina, Kane, et al. Highly divergent mitochondrial ATP synthase complexes in Tetrahymena thermophila, PLoS Biol., 2010, pg. e1000418.

16. Smoot, Ono, Ruscheinski, et al. Cytoscape 2.8: new features for data integration and network visualization, Bioinformatics, 2011 (pg. 431–432).

17. Altschul, Madden, Schaffer, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res., 1997 (pg. 3389–3402).

Author notes

†These authors contributed equally to this work.

Citation details: Xiong,J., Lu,Y., Feng,J., et al. Tetrahymena Functional Genomics Database (TetraFGD): an integrated resource for Tetrahymena functional genomics. Database (2013) Vol. 2013: article ID bat008; doi: XX.XXXX/database/bat008

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
