Nor Afiqah-Aleng, Sarahani Harun, Mohd Rusman Arief A-Rahman, Nor Azlan Nor Muhammad, Zeti-Azura Mohamed-Hussein, PCOSBase: a manually curated database of polycystic ovarian syndrome, Database, Volume 2017, 2017, bax098, https://doi.org/10.1093/database/bax098
Abstract
Polycystic ovarian syndrome (PCOS) is one of the main causes of infertility and affects 5–20% of women of reproductive age. Despite the increased prevalence of PCOS, the mechanisms involved in its pathogenesis and pathophysiology remain unclear. The expansion of omics approaches for studying the mechanisms of PCOS has led to the identification of vast numbers of PCOS-related proteins, making it a challenge to collate and deposit this deluge of data in one place. A knowledge-based repository named PCOSBase was developed to systematically store all proteins related to PCOS. These proteins were compiled from various online databases and published expression studies. Rigorous criteria were developed to identify those that were highly related to PCOS. They were manually curated and analysed to provide additional information on gene ontologies, pathways, domains, tissue localizations and diseases associated with PCOS. Other proteins that might interact with the PCOS-related proteins identified in this study were also included. Currently, 8185 PCOS-related proteins have been identified and assigned to 13 237 gene ontology terms, 1004 pathways, 7936 domains, 29 disease classes, 1928 diseases, 91 tissues and 320 472 interactions. All publications related to PCOS are also indexed in PCOSBase. Data entries are searchable from the main page and from the search, browse and datasets tabs. A protein advanced search is provided to search for specific proteins. To date, PCOSBase has the largest collection of PCOS-related proteins. PCOSBase aims to become a self-contained database that can be used to further understand PCOS pathogenesis and to identify potential PCOS biomarkers.
Database URL: http://pcosbase.org
Introduction
Polycystic ovarian syndrome (PCOS) is an endocrine disorder that is characterized by a combination of two out of three features, i.e. ovulatory dysfunction, hyperandrogenism and the presence of polycystic ovaries (1). PCOS is difficult to diagnose because these features can lead to various phenotypic manifestations (2). Clinical findings show that women with PCOS have a higher risk of developing other complications such as endometrial cancer (3), diabetes (4), hypertension (5) and depression (6). These phenotypic manifestations and disease associations significantly hinder progress in deciphering the cause of PCOS (7).
Transcriptomics (8) and proteomics (9) have been used to identify differences in genes and proteins between non-PCOS and PCOS women, and the resulting data could be used to elucidate the cause of PCOS. The number of published expression studies has increased significantly since 2003, contributing to the vast amount of PCOS-related molecular data. Unfortunately, these molecular data are scattered across various general biological databases (GenBank and UniProt) and the literature, which makes it difficult to find all genes and proteins related to PCOS. This limitation led us to develop PCOSBase, which houses 8185 manually curated PCOS-related proteins. These proteins were filtered from 17 492 proteins identified from 30 expression studies and 9 databases. Bioinformatic analyses were performed on these proteins to characterize and classify them into specific datasets based on their molecular characteristics. PCOSBase also provides indexed publications related to PCOS. These features distinguish PCOSBase from the previously published PCOSKB (10), which contained 241 sequences as of July 2017. Detailed information on proteins and diseases related to PCOS can be found in PCOSBase, but it holds no information on protein-drug associations of the kind described in Open Targets (www.targetvalidation.org). Open Targets lists 1119 proteins identified as drug targets for PCOS, and 73% of those can be found in PCOSBase (11). PCOS is the focus of this study because information on, and understanding of, its complex molecular mechanism are inadequate, while it is associated with many well-described diseases identified from clinical findings. For this reason, PCOSBase serves as a comprehensive, medically oriented repository that will aid in providing and integrating accurate molecular information for an in-depth understanding of PCOS.
Herein, the development and current status of PCOSBase are described, and the web interfaces provided are discussed systematically. PCOSBase can be accessed online at http://pcosbase.org (PCOSBase v1.0, last updated on 21 November 2017).
Materials and methods
Data collection
Keywords including ‘Polycystic Ovary Syndrome,’ ‘Polycystic Ovary Syndrome 1,’ ‘PCOS,’ ‘polycystic ovaries,’ ‘PCO,’ ‘PCO1,’ ‘Stein-Leventhal,’ ‘Stein Leventhal,’ ‘Stein-Leventhal Syndrome,’ ‘Polycystic Ovary Disease,’ ‘Polycystic Ovarian Disease,’ ‘PCOD,’ ‘Sclerocystic Ovarian Degeneration,’ ‘Sclerocystic Ovary Syndrome,’ ‘Sclerocystic Ovarian Disease’ and ‘Bilateral PCOS’ were searched in nine disease-associated databases, namely OMIM (12), HGMD (13), DisGeNET (14), MalaCards (15), PhenomicDb (16), DISEASES (17), DGA (18), GWASdb (19) and GWAS Catalog (20).
The PCOS keywords above, together with additional keywords such as ‘gene expression,’ ‘protein expression,’ ‘expression,’ ‘transcriptomics,’ ‘proteomics’ or ‘microarray,’ were also used to search for relevant publications in PubMed (21), ArrayExpress (22), ScienceDirect and Scopus. Genes and proteins reported as significantly expressed in those publications were included as PCOS-related proteins. These publications were indexed and listed in PCOSBase.
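The keyword combination described above can be sketched as a Boolean query string. This is only an illustration: the text does not state which operators were used, so the OR-within-group and AND-between-groups structure is an assumption, the function name `build_query` is hypothetical, and only a subset of the PCOS keywords is shown.

```python
# Subset of the PCOS synonyms and the expression-related keywords quoted in the text.
PCOS_TERMS = ["Polycystic Ovary Syndrome", "PCOS", "Stein-Leventhal Syndrome"]
EXPRESSION_TERMS = [
    "gene expression", "protein expression", "expression",
    "transcriptomics", "proteomics", "microarray",
]


def build_query(disease_terms, expression_terms):
    """Assumed logic: OR within each keyword group, AND between the two groups."""
    def group(terms):
        return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"
    return f"{group(disease_terms)} AND {group(expression_terms)}"


query = build_query(PCOS_TERMS, EXPRESSION_TERMS)
print(query)
```

A string of this shape could be pasted into a literature search box; the exact field tags and syntax differ between PubMed, ScienceDirect and Scopus and are not covered here.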
All genes and proteins from the disease-associated databases and published expression studies were compared against the NCBI Gene (23) and UniProt (24) databases to obtain their unique Gene IDs and UniProt IDs. Overlapping entries obtained from more than one database or study were combined.
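A minimal sketch of the combining step, assuming that entries count as overlapping when they share the same Gene ID and UniProt ID. The identifiers below are placeholders rather than real accessions, and `merge_records` is a hypothetical helper, not the authors' pipeline.

```python
from collections import defaultdict

# Placeholder records: each hit carries its mapped IDs and the source it came from.
records = [
    {"gene_id": "G1", "uniprot_id": "U1", "source": "DisGeNET"},
    {"gene_id": "G1", "uniprot_id": "U1", "source": "OMIM"},
    {"gene_id": "G2", "uniprot_id": "U2", "source": "expression study"},
]


def merge_records(records):
    """Collapse hits with the same (Gene ID, UniProt ID) pair, keeping every source."""
    merged = defaultdict(set)
    for rec in records:
        merged[(rec["gene_id"], rec["uniprot_id"])].add(rec["source"])
    return {key: sorted(sources) for key, sources in merged.items()}


merged = merge_records(records)
for key, sources in merged.items():
    print(key, sources)
```

Keeping the list of sources per protein preserves the provenance of each entry after duplicates are collapsed.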
Functional annotations
To better understand the function of PCOS-related proteins, extensive information on the proteins, such as chromosomal location, gene ontology (GO), pathway, protein structural information, tissue localization, disease-related information and protein-protein interactions (PPIs), was retrieved from online databases such as NCBI Gene (23), UniProt (24), Gene Ontology Consortium (25), KEGG (26), BioCarta (27), WikiPathways (28), InterPro (29), Human Protein Atlas (30), DisGeNET (14) and HIPPIE (31), or was obtained from our own bioinformatics analysis where necessary.
Database organization and architecture
All collected data, including relevant information on PCOS-related proteins, functional annotation information and PCOS publications, were organized in 29 tables. Twenty-eight of these tables are linked to one another; the PCOS publications table is not linked to the others (Figure 1).
PCOSBase was built as a relational database using MySQL Server 5.0.11. The web interfaces were designed using Laravel 5.4 (PHP web framework), HTML and JavaScript.
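The relational layout can be illustrated with a small sketch. It uses Python's built-in `sqlite3` as a stand-in for MySQL, and every table and column name here is hypothetical; the actual 29-table schema is the one shown in Figure 1. The point is only that annotation tables are linked to the protein table through keys, while the publications table stands alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL server
conn.executescript("""
CREATE TABLE protein (uniprot_id TEXT PRIMARY KEY, description TEXT);
CREATE TABLE go_term (go_id TEXT PRIMARY KEY, name TEXT);
-- Link table: one row per protein-to-GO-term assignment.
CREATE TABLE protein_go (
    uniprot_id TEXT REFERENCES protein(uniprot_id),
    go_id      TEXT REFERENCES go_term(go_id),
    PRIMARY KEY (uniprot_id, go_id)
);
-- Standalone table with no foreign keys to the others.
CREATE TABLE publication (pmid TEXT PRIMARY KEY, title TEXT);
""")

# Placeholder rows, not real PCOSBase entries.
conn.execute("INSERT INTO protein VALUES ('U1', 'placeholder protein')")
conn.execute("INSERT INTO go_term VALUES ('GO_X', 'placeholder term')")
conn.execute("INSERT INTO protein_go VALUES ('U1', 'GO_X')")

rows = conn.execute("""
    SELECT p.uniprot_id, g.name
    FROM protein p
    JOIN protein_go pg ON pg.uniprot_id = p.uniprot_id
    JOIN go_term g ON g.go_id = pg.go_id
""").fetchall()
print(rows)
```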
Results and Discussion
Database summary
Figure 2 depicts the organization of the three data types in PCOSBase, i.e. PCOS-related proteins, diseases and publications. Currently, PCOSBase contains 8185 PCOS-related proteins retrieved from nine databases and 30 expression studies. Characterization of these proteins has resulted in their classification into 13 237 GO terms, 7936 domains, 91 tissues with cell types, 320 472 interactions and 1004 pathways, with most of the proteins located in metabolic pathways. Prediction of the diseases associated with PCOS revealed 1928 diseases, which were classified into 29 disease classes. A total of 14 368 articles on PCOS are indexed in this database. The number of entries in each dataset is summarized in Table 1.
Dataset | Entries |
---|---|
PCOS-related proteins | 8185 |
Gene ontologies | 13 237 |
Biological processes | 8971 |
Cellular components | 1305 |
Molecular functions | 2961 |
Domains | 7936 |
Pathways | 1004 |
Interactions | 320 472 |
PCOS-related diseases | 1928 |
Disease classes | 29 |
Tissues | 91 |
Databases | 9 |
Resources | 30 |
Transcriptomics | 19 |
Proteomics | 11 |
Publications | 14 368 |
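
Some rows of Table 1 are breakdowns of the row above them. The short check below, using only the counts listed in the table, confirms that the three GO categories add up to the gene ontologies total and that the transcriptomic and proteomic studies add up to the resources total. The variable names are illustrative.

```python
# Counts copied from Table 1.
go_breakdown = {
    "Biological processes": 8971,
    "Cellular components": 1305,
    "Molecular functions": 2961,
}
resources_breakdown = {"Transcriptomics": 19, "Proteomics": 11}

go_total = sum(go_breakdown.values())
resources_total = sum(resources_breakdown.values())
print(go_total, resources_total)  # 13237 30
```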
Database interface and access
The PCOSBase interface contains six main menus, i.e. About, Search, Browse, Datasets, Network and Help, which help the user navigate easily to the respective pages.
The homepage displays the total data statistics for every table and five menus that navigate users to the pages described below. A search box is also provided on this page.
Information on PCOSBase and PCOS can be accessed on the ‘About’ page.
The ‘Search’ page provides two search options, i.e. Simple Search and Protein Advanced Search. Simple Search works in the same way as the search box on the homepage. Users can search for proteins, GO terms, pathways, diseases, domains and tissues that match a particular keyword. For example, if the keyword ‘androgen’ is searched, all entries in PCOSBase that contain the term ‘androgen’ will appear. Protein Advanced Search, in contrast, allows users to retrieve information on proteins with a particular combination of annotations, for instance proteins associated with both the GO term ‘single fertilization’ and the disease ‘female infertility.’ It gives users the option to find proteins that match any combination of six different fields (protein description, GO, pathway, domain, tissue and disease).
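The difference between the two search modes can be sketched over a small in-memory list. This is an illustration only: the entries are placeholders, the function names are hypothetical, and matching by case-insensitive substring with AND between fields is an assumption about behaviour that the text does not specify.

```python
# The six searchable fields named in the text.
FIELDS = ("description", "go", "pathway", "domain", "tissue", "disease")

# Placeholder entries, not real PCOSBase records.
entries = [
    {"id": "P_A", "description": "placeholder androgen-binding protein",
     "go": ["single fertilization"], "disease": ["female infertility"]},
    {"id": "P_B", "description": "placeholder kinase",
     "go": ["single fertilization"], "disease": ["hypertension"]},
]


def _values(entry, field):
    """Return the annotations of one field as a list of strings."""
    value = entry.get(field, [])
    return [value] if isinstance(value, str) else list(value)


def simple_search(entries, keyword):
    """Match the keyword against every field of every entry."""
    kw = keyword.lower()
    return [e["id"] for e in entries
            if any(kw in v.lower() for f in FIELDS for v in _values(e, f))]


def advanced_search(entries, **criteria):
    """Keep only entries that satisfy all of the given field criteria."""
    def matches(entry):
        return all(any(term.lower() in v.lower() for v in _values(entry, field))
                   for field, term in criteria.items())
    return [e["id"] for e in entries if matches(entry := e)]


print(simple_search(entries, "androgen"))
print(advanced_search(entries, go="single fertilization", disease="female infertility"))
```

With these placeholder entries, the simple search returns every entry mentioning the keyword anywhere, while the advanced search narrows the result to entries that carry both annotations.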
Users can access all 11 datasets in PCOSBase through the ‘Browse’ page. These datasets are classified according to their biological information, as described below:
- PCOS-related proteins dataset: lists the 8185 proteins related to PCOS that were retrieved from various sources.
- GO dataset: contains GO vocabulary information for all PCOS-related proteins.
- Pathways dataset: contains all identified pathways in which PCOS-related proteins are involved.
- Interactions dataset: contains information on the PPIs of PCOS-related proteins.
- Domains dataset: contains information on the domains present in all PCOS-related proteins.
- Tissues dataset: provides information on the tissues and cell types in which PCOS-related proteins are expressed.
- Databases dataset: lists the publicly available databases from which PCOS-related proteins were obtained.
- Resources dataset: contains the transcriptomic and proteomic expression studies from which PCOS-related proteins were retrieved.
- PCOS-related diseases dataset: contains the identified diseases that are associated with PCOS-related proteins.
- Disease classes dataset: contains information on PCOS-related diseases classified according to the Medical Subject Headings tree.
- Publications dataset: provides all publications from PubMed that relate to PCOS.
The Datasets dropdown menu links to all datasets in PCOSBase. The Datasets tab is placed in the header and appears on every page of PCOSBase, allowing users to quickly select and jump to the desired dataset page.
The Network menu contains all networks constructed using the PCOS-related proteins, Interactions and PCOS-related diseases datasets. Currently, PCOSBase provides only several static PCOS networks. Figure 3 shows one of the networks found in this menu; it depicts the association of PCOS with other diseases.
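One simple way to derive a disease association network from protein annotations is to link two diseases whenever they share at least one protein. The text does not describe how the PCOSBase networks were actually constructed, so the sketch below is an assumed approach with placeholder data and a hypothetical function name.

```python
from itertools import combinations

# Placeholder mapping from disease to its associated proteins.
disease_proteins = {
    "PCOS": {"P_A", "P_B", "P_C"},
    "disease X": {"P_B", "P_C"},
    "disease Y": {"P_D"},
}


def shared_protein_network(disease_proteins):
    """Return edges between diseases, weighted by the number of shared proteins."""
    edges = {}
    for d1, d2 in combinations(sorted(disease_proteins), 2):
        shared = disease_proteins[d1] & disease_proteins[d2]
        if shared:
            edges[(d1, d2)] = len(shared)
    return edges


edges = shared_protein_network(disease_proteins)
print(edges)
```

The resulting edge list could be passed to a graph drawing tool to produce a static figure; diseases with no shared proteins, such as the placeholder ‘disease Y’, stay unconnected.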
The Help menu provides the PCOSBase user manual, the database schema and all references used to retrieve the data. All terms, definitions and references used in PCOSBase are also provided on the Help page.
Conclusion and future perspective
In the next few years, the volume of PCOS molecular data is expected to increase, especially with the application of new sequencing technologies such as next-generation sequencing to the analysis of PCOS samples. To ensure that PCOSBase remains up to date, all information in this database will be updated periodically. Comprehensive cataloguing of all types of data in PCOS publications is important to ensure that they are accessible to PCOS researchers and clinicians for quick and easy reference. Ultimately, the genomic and molecular information in this database will serve as a reliable repository that can be used to search for potential PCOS biomarkers, towards the development of improved diagnostics and treatment for PCOS.
Acknowledgements
Pusat Penyelidikan Bioinformatik, Institut Biologi Sistem (INBIOSIS), Universiti Kebangsaan Malaysia provided the computing facilities used in this project. The authors thank the reviewers for providing constructive suggestions to improve this manuscript and PCOSBase.
Funding
This work was supported by Malaysia Ministry of Higher Education (FRGS/1/2014/SG05/UKM/02/6 and ERGS/1/2013/STG07/UKM/02/3) and Ministry of Science, Technology and Innovation (UKM-MGI-NBD0005-2007) awarded to ZAMH. The Ph.D. scholarship to NAA is funded by MyBrain15, Ministry of Higher Education.
Conflict of interest. None declared.
References
Author notes
Citation details: Afiqah-Aleng,N., Harun,S., A-Rahman,M.R.A. et al. PCOSBase: a manually curated database of polycystic ovarian syndrome. Database (2017) Vol. 2017: article ID bax098; doi:10.1093/database/bax098