Xiao Wen, Lin Gao, Xingli Guo, Xing Li, Xiaotai Huang, Ying Wang, Haifu Xu, Ruijie He, Chenglong Jia, Feixiang Liang, lncSLdb: a resource for long non-coding RNA subcellular localization, Database, Volume 2018, 2018, bay085, https://doi.org/10.1093/database/bay085
Abstract
Although long non-coding RNAs (lncRNAs) may play important roles in cellular function and biological processes, we still know little about them. Growing evidence indicates that the subcellular localization of lncRNAs may provide clues to their function. To help researchers functionally characterize thousands of lncRNAs, we developed a database-driven application, lncSLdb, which stores and manages user-collected qualitative and quantitative subcellular localization information on lncRNAs obtained by literature mining. The current release contains >11 000 transcripts from three species. Based on the region in which an lncRNA accumulates, we classify transcripts into three basic localization types (nucleus, cytoplasm and nucleus/cytoplasm). In some cases, the nucleus and cytoplasm types can be divided into three more specific subtypes (chromosome, nucleoplasm and ribosome). Besides browsing and downloading data, lncSLdb provides a set of tools to search by gene symbol, genome coordinates or sequence similarity. We hope that lncSLdb will provide a convenient platform for researchers to investigate the functions and molecular mechanisms of lncRNAs from the perspective of subcellular localization.
Introduction
Long non-coding RNAs (lncRNAs) are non-coding transcripts longer than 200 nucleotides (1, 2). In recent years, with the development of biological techniques, especially the broad application of high-throughput RNA sequencing (RNA-Seq) (3, 4), more and more novel lncRNAs have been identified and annotated in genomes (5–7). Growing evidence suggests that lncRNAs have important functions in various aspects of cellular function and biological processes (8–10). However, the function of most lncRNAs is still unclear (10).
Unlike mRNAs, which are transported to the cytoplasm and translated into proteins on ribosomes, lncRNAs have little coding potential. As with proteins, the function of lncRNAs depends heavily on their subcellular localization (10, 11). lncRNAs that accumulate in the nucleus may take part in nuclear organization or regulate gene expression before transcription (11, 12), whereas lncRNAs that accumulate in the cytoplasm have important roles in post-transcriptional regulation and post-translational modification (11, 12). For example, the lncRNA Airn, which accumulates in the nucleus, is involved in silencing Igf2r by overlapping with its promoter (13); Neat1 is an essential component of paraspeckles and is associated with the nuclear retention of structured or edited mRNAs (14). The cytoplasmic lncRNA NKILA can influence NF-κB activation by inhibiting IKK-induced IκBα phosphorylation (15); TUG1 and CTB-89H12.4 can regulate PTEN expression by acting as sponges that compete with PTEN transcripts for microRNAs (16).
The subcellular localization of lncRNAs is therefore an important property for understanding their function. Researchers have already investigated the subcellular localization of a number of lncRNAs, and there is a great need for integrated platforms to manage, search and analyse these data. Amaral et al. (17) published lncRNAdb, which contains subcellular localization information for ∼80 lncRNA genes. Zhang et al. (18) developed a database, RNALocate, that collects the subcellular localization of all kinds of RNA and contains >1700 lncRNA genes from 10 different species. Mas Ponte et al. (19) published LncATLAS, which collects the subcellular localization of 7267 human lncRNA genes in 15 cell lines and defines the relative concentration index (RCI) for measuring localization. However, these systems usually focus on lncRNA genes rather than lncRNA transcripts and cover only a small fraction of the available lncRNAs in different species. We also note that these systems provide only limited support for qualitative and/or quantitative experimental results, such as photos or expression levels in different cell compartments. More details are shown in Table 1.
Database | #LncRNA gene | #LncRNA transcript | #Localization entry | #Species | Data type | Data source | #Paper |
---|---|---|---|---|---|---|---|
lncSLdb | 9494 (a) | 11 698 (b) | 14 973 | 7 | Figure, expression ratio, description | Text Mining | 99 |
lncATLAS | 7267 | - | 30 580 | 1 | Expression ratio | ENCODE | 1 |
RNALocate | 1792 | - | 2383 | 10 | Description | Text Mining | 192 |
lncRNAdb | ∼80 | - | 91 | ∼2 | Description | Text Mining | - |
(a) 5581 with annotation, 3913 without annotation
(b) 5356 with official names, 6342 without official names
# refers to “The number of”
We developed an lncRNA subcellular localization system (lncSLdb), which collects qualitative and quantitative subcellular localization information on lncRNAs by manual curation of the literature. The current release contains subcellular localization information for >11 000 lncRNA transcripts from 9494 genes and three main species (human, mouse and fruit fly), classified into three basic subcellular localization types (nucleus, cytoplasm and nucleus/cytoplasm) and three subtypes (ribosome, chromosome and nucleoplasm), all of which are supported by biological experiments. Our aim is to provide a comprehensive platform to help researchers investigate the subcellular localization of lncRNAs and, from there, their functions and potential molecular mechanisms. For each lncRNA, lncSLdb collects gene IDs/symbols, transcript IDs, genome coordinates, gene/transcript biotype, subcellular localization and the relative expression ratio or experimental pictures. The data set used by our system can be downloaded freely. Furthermore, researchers can submit new lncRNA subcellular localization data to lncSLdb.
Data collection and implementation
We searched published papers in the PubMed Central (PMC) database using ‘long non coding RNA subcellular localization’ and ‘lncRNA subcellular localization’ as keywords, which returned >3000 papers. All papers are filtered manually to determine whether they are related to lncRNA subcellular localization. Papers that are not in the result set but are cited by a paper in the result set are also considered. The current release includes ∼100 papers, filtered from the first 1000 search results and their references (Figure 1). We also collected gene/transcript genome information from other databases, such as FlyBase (20), Ensembl (21), UCSC (22), MGD (23), GenBank (24) and Gencode (25).
lncSLdb is developed in HTML/JSP and Java, using MySQL (http://www.mysql.com/) as the database management system. The web interface is based on the Bootstrap (http://getbootstrap.com/2.3.2/) and AdminLTE (https://www.almsaeedstudio.com/) frameworks, with JavaScript scripts developed to support user interaction.
Database structure and content
For every localization item in lncSLdb, we consider three aspects: transcript information, gene information and subcellular localization information. All information contained in lncSLdb is listed in Table 2.
Transcript information | |
---|---|
Transcript ID | The ID of the transcript |
Chromosome | The chromosome of the transcript |
Start | The start position of the transcript |
End | The end position of the transcript |
Strand | The strand of the transcript |
Biotype | The biotype of the transcript |
Sequence source | The source of the transcript sequence |
Gene information | |
Gene symbol | The official symbol of the gene |
ensembl id | The Ensembl ID of the gene |
alias | The alias of the gene |
chromosome | The chromosome of the gene |
start | The start position of the gene |
end | The end position of the gene |
strand | The strand of the gene |
biotype | The biotype of the gene |
species | The species of the gene |
version | The reference version of the genomic information |
Subcellular Localization Information | |
cell | The cell line or tissue used for the experiment |
method | The method used for the experiment |
localization | The subcellular localization of the transcript in this experiment |
pmid | The PMID of the article reporting this experiment |
title | The title of the article reporting this experiment |
source | The qualitative or quantitative results of this experiment |
Transcript information records the basic information about each transcript, including transcript ID, genomic coordinates and biotype. Since novel lncRNAs are being identified daily, many of these transcripts still have no official names. We therefore add the genomic coordinates (transcript start position, transcript end position, chromosome and strand) as an identifier for every transcript. We fetch the genomic coordinates from Ensembl (21), UCSC (22), MGD (23), GenBank (24) and FlyBase (20) according to the transcript IDs. For transcripts without official IDs, we use the genomic coordinates described in the corresponding articles. GRCh37 and GRCh38 are used as the reference genomes for human, GRCm38 for mouse and BDGP6 for fruit fly. We also obtain the transcript biotype from the Ensembl database for transcripts with Ensembl IDs. For transcripts with GenBank accession numbers, we use FEELnc (26), a tool for lncRNA annotation, to classify transcripts into biotypes by comparing their genomic locations with those of Gencode (25) transcripts. The biotype of the remaining transcripts is taken from the description in the corresponding papers, or marked as ‘lncRNA’ if no description is given.
Gene information consists of gene symbol, Ensembl ID, alias, genomic coordinates and gene biotype. Since an lncRNA gene may have many isoforms, which may have different subcellular localization types, we gather all transcripts belonging to the same gene to show its localization type. For intronic lncRNAs, the information of the host gene is used as gene information. To avoid mismatches caused by alias names, we convert all names to Ensembl IDs and take the gene symbol from the Ensembl database. All other names are treated as aliases. For genes that cannot be found in the Ensembl database, the Ensembl ID field is set to unknown and the known gene name is used as the gene symbol. For transcripts that do not belong to any gene, the gene is marked as unknown.
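The gene-level grouping described above can be sketched as follows. This is only an illustration under stated assumptions, not the lncSLdb implementation: the function name `localization_by_gene`, the tuple layout of the input and the placeholder identifiers (`GENE_A`, `TX_1`, ...) are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# One localization entry: (gene, transcript ID, localization type).
# This tuple layout is an assumption made for the sketch.
Entry = Tuple[str, str, str]


def localization_by_gene(entries: Iterable[Entry]) -> Dict[str, Dict[str, Set[str]]]:
    """Group entries by gene, then by transcript, collecting localization types."""
    grouped: Dict[str, Dict[str, Set[str]]] = defaultdict(dict)
    for gene, transcript_id, localization in entries:
        # A transcript may be reported in several experiments, so keep a set.
        grouped[gene].setdefault(transcript_id, set()).add(localization)
    return {gene: dict(transcripts) for gene, transcripts in grouped.items()}


if __name__ == "__main__":
    # Placeholder records, not real database content.
    demo = [
        ("GENE_A", "TX_1", "nucleus"),
        ("GENE_A", "TX_2", "cytoplasm"),
        ("GENE_B", "TX_3", "nucleus"),
    ]
    for gene, transcripts in localization_by_gene(demo).items():
        print(gene, transcripts)
```

Keeping a set of localization types per transcript preserves the case where isoforms of one gene, or repeated experiments on one isoform, disagree.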
We consider three basic types of subcellular localization in a cell: accumulated in the nucleus, accumulated in the cytoplasm and accumulated in both (nucleus/cytoplasm). Where the reported location is more precise, our system records the most specific subregion within the nucleus or cytoplasm. In the data we collected, some lncRNAs accumulate on the chromosome or in the nucleoplasm within the nucleus, and some accumulate on the ribosome within the cytoplasm. The subcellular localization type of an lncRNA is taken directly from the paper. If the authors did not state the type explicitly, we provide a reference type: a transcript is considered nuclear accumulated if its nuclear expression level is more than 2-fold its cytoplasmic expression level, cytoplasm accumulated if its cytoplasmic expression level is more than 2-fold its nuclear expression level, and accumulated in both otherwise, similar to the definition in (30).
The current release contains >11 000 transcripts from ∼100 papers, mainly involving three species. Specifically, there are 9003 transcripts for human, 2630 for mouse, 59 for fruit fly and 6 for other species. In total, we collected >14 000 subcellular localization entries. The distribution of localization types is shown in Figure 2.
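The 2-fold fallback rule can be written in a few lines of Python. This is a minimal sketch, not the lncSLdb implementation: the function name `classify_localization` and the `fold` parameter are assumptions introduced here, and the two inputs are taken to be non-negative expression levels measured on the same scale.

```python
def classify_localization(nuclear: float, cytoplasmic: float, fold: float = 2.0) -> str:
    """Assign a reference localization type from two expression levels.

    Follows the rule stated in the text: strictly more than `fold` times the
    other compartment means accumulation in that compartment; anything else
    is reported as 'nucleus/cytoplasm'.
    """
    if nuclear < 0 or cytoplasmic < 0:
        raise ValueError("expression levels must be non-negative")
    if nuclear > fold * cytoplasmic:
        return "nucleus"
    if cytoplasmic > fold * nuclear:
        return "cytoplasm"
    # Includes the boundary case of exactly `fold` times, and two zero levels.
    return "nucleus/cytoplasm"


if __name__ == "__main__":
    print(classify_localization(10.0, 2.0))  # nuclear level is 5-fold higher
    print(classify_localization(1.0, 5.0))   # cytoplasmic level is 5-fold higher
    print(classify_localization(3.0, 2.0))   # neither exceeds 2-fold
```

Comparing with a product rather than a ratio avoids division by zero when one compartment has no detectable expression.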
Querying the database
lncSLdb is available online at: http://bioinformatics.xidian.edu.cn/lncSLdb. Users can browse, query and download data through the web interface.
On the browse page, all items are listed and can be filtered by species, localization and transcript biotype. Every item has a detail page about the transcripts and their localization, including transcript ID, transcript genome coordinates, subtype, the method and cell used for the experiment, the reference article and its PMID, the localization conclusion and the specific result. Transcripts belonging to the same gene are listed on the same detail page, with the gene information shown at the top.
On the search page, we provide a comprehensive query tool. Users can query lncRNA localization using a gene name or transcript name as the keyword and selecting a specific species, biotype and subcellular localization type. We also offer a tool to search for transcripts in a genome region, in order to find novel transcripts without official names. In addition, there is a tool for finding the localization type of homologous transcripts by supplying sequences in FASTA format.
All data can be downloaded from the download page in txt or Microsoft Excel format. We also open an SQL interface so that users can develop their own programs to access our database.
Researchers can submit new subcellular localization data to lncSLdb online. More details can be found on the submission and help pages.
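A genome-region search of this kind comes down to an interval-overlap test on transcripts from the same chromosome. The sketch below shows one way to express it and is not the lncSLdb code: the `Transcript` class, the `search_region` function and the placeholder records are hypothetical, and coordinates are assumed to be 1-based and inclusive, with a hit defined as any overlap with the query region.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Transcript:
    """Minimal transcript record using the coordinate fields listed in Table 2."""
    transcript_id: str
    chromosome: str
    start: int
    end: int
    strand: str


def search_region(transcripts: Iterable[Transcript], chromosome: str,
                  start: int, end: int,
                  strand: Optional[str] = None) -> List[Transcript]:
    """Return transcripts on `chromosome` that overlap [start, end] (inclusive)."""
    if start > end:
        raise ValueError("start must not exceed end")
    hits = []
    for t in transcripts:
        if t.chromosome != chromosome:
            continue
        if strand is not None and t.strand != strand:
            continue
        # Two closed intervals overlap when each starts before the other ends.
        if t.start <= end and t.end >= start:
            hits.append(t)
    return hits


if __name__ == "__main__":
    # Placeholder records, not real database content.
    demo = [
        Transcript("TX_A", "chr1", 100, 500, "+"),
        Transcript("TX_B", "chr1", 600, 900, "-"),
        Transcript("TX_C", "chr2", 100, 500, "+"),
    ]
    for hit in search_region(demo, "chr1", 450, 650):
        print(hit.transcript_id)
```

A linear scan is enough for a sketch; a production system would more likely push the same overlap condition into an indexed database query.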
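To give a feel for programmatic access, the sketch below runs a parameterized query against a small in-memory table. It makes several loud assumptions: the table name `localization` and its column names are hypothetical (loosely modelled on the fields in Table 2, not the real lncSLdb schema), the rows are placeholders, and Python's built-in `sqlite3` module stands in for MySQL so that the example runs without a server. Connecting to the actual database would need a MySQL driver and the connection details given on the lncSLdb help page.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE localization (
            transcript_id TEXT,
            gene_symbol   TEXT,
            species       TEXT,
            cell          TEXT,
            method        TEXT,
            localization  TEXT
        )
        """
    )
    # Placeholder rows, not real database content.
    rows = [
        ("TX_1", "GENE_A", "human", "CELL_X", "METHOD_1", "nucleus"),
        ("TX_2", "GENE_A", "human", "CELL_X", "METHOD_2", "cytoplasm"),
        ("TX_3", "GENE_B", "mouse", "CELL_Y", "METHOD_1", "nucleus/cytoplasm"),
    ]
    conn.executemany("INSERT INTO localization VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def find_by_gene(conn: sqlite3.Connection, gene_symbol: str,
                 species: str) -> List[Tuple[str, str, str, str]]:
    """Return (transcript_id, cell, method, localization) rows for one gene."""
    cursor = conn.execute(
        "SELECT transcript_id, cell, method, localization "
        "FROM localization "
        "WHERE gene_symbol = ? AND species = ? "
        "ORDER BY transcript_id",
        (gene_symbol, species),  # bound parameters, never string formatting
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = build_demo_db()
    for row in find_by_gene(connection, "GENE_A", "human"):
        print(row)
    connection.close()
```

Binding the gene symbol and species as parameters keeps user-supplied search terms out of the SQL text itself.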
Discussion and future prospects
Increasing evidence shows that lncRNAs play important roles in cell activities, but we still know little about their basic properties, such as their subcellular localization. The study of protein subcellular localization has helped researchers understand protein function. We hope that work on lncRNA subcellular localization can provide another view from which to explain lncRNA function and biogenesis (11). Although several databases containing lncRNA subcellular localization have been developed (17–19), they cover only a small fraction of the available lncRNAs in different species. Here, we developed lncSLdb, an lncRNA subcellular localization database that collects qualitative and quantitative localization information for >10 000 lncRNAs from published articles, covering three species and classified into three basic subcellular localization types and three subtypes. To our knowledge, this is the most complete database of lncRNA subcellular localization to date. We hope that lncSLdb can provide researchers with an integrated platform for studying the basic properties and subcellular localization of lncRNAs, and further for determining whether lncRNAs share the same or a similar export mechanism with mRNAs, as well as other potential molecular roles. We are interested in mining the features of transcripts in different cellular compartments and predicting the distribution of lncRNAs across them. We will continue to update and improve lncSLdb in the future.
Acknowledgements
We thank Xiaofei Yang for help with the development of lncSLdb. We thank Peizhuo Wang and Ran Duan for discussions about the design of the web server. We are grateful to Hao Lin and Quan Zou for suggestions about the manuscript.
Funding
National Natural Science Foundation of China (61532014, 61672407, 61432010 and 91530113). Funding for open access charge: National Natural Science Foundation of China.
Conflict of interest. None declared.
Database URL: http://bioinformatics.xidian.edu.cn/lncSLdb