CNSA: a data repository for archiving omics data Open Access

Definitions and main fields of data objects

Data object	Definition	Main fields
Project	An overall description of a single research initiative	Project name, project title, public description, sample scope, data type, submitter, funding information, publication
Sample	A description of biological source material	Sample type, sample name, organism, taxonomy ID, collection, isolate, tissue, location, phenotype, disease
Experiment	A description of sample-specific sequencing library, instrument and sequencing methods	File type, sequencing platform, library strategy, library source, library layout
Run	A description of the sequencing data files that belong to the related experiment	File name, MD5 value
Assembly	A collection of genomic sequences that are used to represent the genome of an organism.	Molecule type, coverage, sequencing technology, assembly method
Variation	Genome variations of any species	Variation type, position, variation, detection method, clinical significance, phenotype, condition

Data object	Definition	Main fields
Project	An overall description of a single research initiative	Project name, project title, public description, sample scope, data type, submitter, funding information, publication
Sample	A description of biological source material	Sample type, sample name, organism, taxonomy ID, collection, isolate, tissue, location, phenotype, disease
Experiment	A description of sample-specific sequencing library, instrument and sequencing methods	File type, sequencing platform, library strategy, library source, library layout
Run	A description of the sequencing data files that belong to the related experiment	File name, MD5 value
Assembly	A collection of genomic sequences that are used to represent the genome of an organism.	Molecule type, coverage, sequencing technology, assembly method
Variation	Genome variations of any species	Variation type, position, variation, detection method, clinical significance, phenotype, condition

Table 1

Open in new tab Download slide

Definitions and main fields of data objects

Data object	Definition	Main fields
Project	An overall description of a single research initiative	Project name, project title, public description, sample scope, data type, submitter, funding information, publication
Sample	A description of biological source material	Sample type, sample name, organism, taxonomy ID, collection, isolate, tissue, location, phenotype, disease
Experiment	A description of sample-specific sequencing library, instrument and sequencing methods	File type, sequencing platform, library strategy, library source, library layout
Run	A description of the sequencing data files that belong to the related experiment	File name, MD5 value
Assembly	A collection of genomic sequences that are used to represent the genome of an organism.	Molecule type, coverage, sequencing technology, assembly method
Variation	Genome variations of any species	Variation type, position, variation, detection method, clinical significance, phenotype, condition

Data object	Definition	Main fields
Project	An overall description of a single research initiative	Project name, project title, public description, sample scope, data type, submitter, funding information, publication
Sample	A description of biological source material	Sample type, sample name, organism, taxonomy ID, collection, isolate, tissue, location, phenotype, disease
Experiment	A description of sample-specific sequencing library, instrument and sequencing methods	File type, sequencing platform, library strategy, library source, library layout
Run	A description of the sequencing data files that belong to the related experiment	File name, MD5 value
Assembly	A collection of genomic sequences that are used to represent the genome of an organism.	Molecule type, coverage, sequencing technology, assembly method
Variation	Genome variations of any species	Variation type, position, variation, detection method, clinical significance, phenotype, condition

Data model in CNSA A. At present, CNSA has six data objects, and the corresponding prefixes of accession numbers are marked in red. B. Correlation model for Ruili Botanical Garden project.

Figure 1

It is worth mentioning that CNSA has created a correlation model of living samples, sample and analytical data on some projects such as the Ruili Botanical Garden project (https://db.cngb.org/search/project/CNPhis0000538) [15] and Culturable Genome Reference (https://db.cngb.org/search/project/CNP0000126/) [16]. Figure 1B illustrates the interrelationship of the living samples, sample and analytical data for the Ruili Botanical Garden project. CNSebb is the prefix of the accession number of the living samples which can be applied to the E-BioBank (EBB: https://db.cngb.org/ebb/), a shared platform for sample resources in the CNGB. Take a sample (CNSebb2000255) from the Ruili Botanical Garden project as an example, CNSebb2000255 is the living sample accession. On the detail page of this living sample (https://db.cngb.org/search/sample/CNSebb2000255/), the associated sample information (CNShis0046812) and its related analysis data (experiment/run: CNXhis0119000/CNRhis0123700) can be retrieved, so that all data can be traced throughout the life cycle from the living sample to the sample information to the analytical data.

Data submission and curation

Users need to register, login and fill in the submitter’s information before creating a submission. Generally, the order of data submission is project, sample and related experiments, runs, assemblies and variations (Figure 2). The data to be submitted includes metadata and data files. At its most basic level, metadata is ‘data about data’. Here metadata is data that describes a data object, such as the attributes listed in the ‘Main Fields’ column of Table 1. To collect metadata, CNSA provides a user-friendly and easy-to-use submission process that supports Chinese and English bilingual interfaces. It is worth mentioning that CNSA supports batch submission of samples, experiments, runs, assemblies and variations. Therefore, users can fill in a batch submission template and submit a batch of data in one submission process. Compared to a single submission, batch submission greatly improves the efficiency of data submission. Users can choose single submission or batch submission according to the number of entries. To simplify the submission of data files, CNSA supports data files to be uploaded via FTP. Moreover, in order to ensure the integrity of the submitted data, users need to submit the corresponding MD5 checksums of data files while submitting the metadata. The system will also automatically check the standardization of some field information submitted by users, such as format and character limit. It verifies the MD5 checksums of the data files and gives corresponding prompts if there are errors. After the data is submitted successfully, each piece of data will be assigned an accession number, and each project will be assigned a DOI which is a persistent identifier.

Process of data submission to CNSA.

Figure 2

Open in new tab Download slide

In addition, like INSDC members, there are two data management manners for projects, public and controlled. The data submitter can choose a data management manner of the project. All metadata and data files associated with the project set to be public will be public on the release date set by the submitter. The public data will be open to the world, and users can access or use it freely without logging in or registering. However, the metadata associated with the project set to be controlled will be public on the metadata release date set by the submitter, and the data files will be under controlled access. Other registered users can submit an application to the CNGB Data Access (CDA: https://db.cngb.org/data_access/) to apply for access to controlled data. Data applicants can use the controlled data only after the data access application has been reviewed and approved. After the project is successfully submitted, it will undergo materials review, such as review of materials related to ethics or human genetic resources. All submitted data must pass the material review before data biocuration. It can only be released or controlled after being approved by biocurators. In addition to online check of standardization of some field information, the metadata will also be manually reviewed to ensure its completeness, relevance and correctness. In order to increase the reusability of data, CNSA only accepts data files in commonly used formats, such as FASTQ, BAM, FASTA, VCF. Moreover, quality control such as checking the correctness of the formats and statistic the data quality using fastp [17] is performed on raw sequencing data files in FASTQ format.

Data statistics of CNSA A. Numbers of Projects, Samples, Assemblies, Experiments and runs in CNSA. B. File sizes of Runs and Assemblies in CNSA. All statistics are based on data submitted from November 2017 to May 2020.

Figure 3

Open in new tab Download slide

Summary of sequence types and amount of several sequence archive databases

Table 2

Summary of sequence types and amount of several sequence archive databases

Database	Sequence types	Amount
INSDC	Next-generation reads, capillary reads, annotated sequences	7.2 trillion bases^a
TCGA	Genomic, epigenomic, transcriptomic and proteomic sequence reads for tumor and normal samples	1.4 petabyte^b
GSA	Raw sequence reads of omics	2.3 petabyte^c
CNSA	Raw sequence reads of omics and assemblies	2.6 petabyte^d

Database	Sequence types	Amount
INSDC	Next-generation reads, capillary reads, annotated sequences	7.2 trillion bases^a
TCGA	Genomic, epigenomic, transcriptomic and proteomic sequence reads for tumor and normal samples	1.4 petabyte^b
GSA	Raw sequence reads of omics	2.3 petabyte^c
CNSA	Raw sequence reads of omics and assemblies	2.6 petabyte^d

^aBased on the data statistics of release 119: https://www.ddbj.nig.ac.jp/stats/release-e.html#data_category

^bBased on the data statistics of May 24, 2020: https://portal.gdc.cancer.gov/repository?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_category%22%2C%22value%22%3A%5B%22sequencing%20reads%22%5D%7D%7D%5D%7D&searchTableTab=files

^cBased on the statistics of May 24, 2020: https://bigd.big.ac.cn/gsa/

^dBased on the statistics of May 24, 2020: https://db.cngb.org/cnsa/statistic/

Table 2

Summary of sequence types and amount of several sequence archive databases

Database	Sequence types	Amount
INSDC	Next-generation reads, capillary reads, annotated sequences	7.2 trillion bases^a
TCGA	Genomic, epigenomic, transcriptomic and proteomic sequence reads for tumor and normal samples	1.4 petabyte^b
GSA	Raw sequence reads of omics	2.3 petabyte^c
CNSA	Raw sequence reads of omics and assemblies	2.6 petabyte^d

Database	Sequence types	Amount
INSDC	Next-generation reads, capillary reads, annotated sequences	7.2 trillion bases^a
TCGA	Genomic, epigenomic, transcriptomic and proteomic sequence reads for tumor and normal samples	1.4 petabyte^b
GSA	Raw sequence reads of omics	2.3 petabyte^c
CNSA	Raw sequence reads of omics and assemblies	2.6 petabyte^d

^aBased on the data statistics of release 119: https://www.ddbj.nig.ac.jp/stats/release-e.html#data_category

^cBased on the statistics of May 24, 2020: https://bigd.big.ac.cn/gsa/

^dBased on the statistics of May 24, 2020: https://db.cngb.org/cnsa/statistic/

Data archive and statistics

Currently, CNSA archives omics data from around the world, including six objects (Project, Sample, Experiment, Run, Assembly, Variation). The data is made public or controlled based on the submitter’s settings. To ensure data security, CNSA adopts high-performance distributed object storage for data archiving, and double data backup on physically independent disks.

As of May 24, 2020, CNSA has archived a total of 2177 projects, 334 824 samples, 237 859 experiments, 371 066 runs, and 13 890 assemblies for 3079 species (Figure 3A), submitted by 470 submitters from 108 institutions. The total amount of archived run files and assembly files have reached 2631 TB (Figure 3B). Moreover, CNSA has supported 124 articles published in 82 journals.

We summarized the type and amount of sequence data archived in several sequence archive databases such as INSDC, TCGA and Genome Sequence Archive (GSA) [18] (Table 2), which will be helpful for the users when selecting specific databases for bioinformatics research.

Data retrieval and reference

As mentioned in the submission process, each piece of data will be assigned an accession number, and each project will be assigned a DOI which is a persistent identifier. All public data can be searched in CNSA by accession numbers of Project, Sample, Experiment, Run, Assembly or any other combination of keywords. For example, if users are interested in collecting data for a genome-wide association study (GWAS) on breast cancer in women, they can directly enter any keyword in the search input box on the CNSA homepage (https://db.cngb.org/cnsa/), such as GWAS. Since CNSA and CNGBdb share the same search engine, the search page will automatically jump to CNGBdb. Users can also use CNGBdb’s advanced search (https://db.cngb.org/search/advanced/all/). First, select a database and then add multiple search keywords to filter, such as (GWAS) AND (breast cancer) AND (young women), users can also filter the results by checking some filter conditions on the left side of the page. All public data files can be freely accessed through the FTP site (ftp://ftp.cngb.org/). Moreover, both the data accession number and DOI assigned by CNSA can be used to reference the submitted data to support the publication of scientific research results.

Conclusions and perspectives

In conclusion, CNSA is a data repository for archiving omics data, including raw sequencing data and its analysis result. Currently, online submissions of projects, samples, experiments, runs, assemblies and variations are available. Moreover, compared with similar databases, an advantage worth mentioning is that CNSA has created a correlation model of living samples, sample information and analytical data on some projects. From now on, CNSA will practice the correlation model on more projects to make all data can be traced throughout the life cycle from the living sample to the sample information to the analytical data and promote the scientific and rational use of biological living samples. All public data resources of CNSA are freely worldwide scientific communities. In compliance with data standards commonly used in the life sciences, CNSA is committed to building a comprehensive and curated data repository for the storage, management and sharing of omics data, and improving the data standards to alleviate the growing management pressure of biological big data and support academic research and the bio-industry.

In order to promote the sharing and exchange of information, technology and resources of life science and facilitate the rational and efficient use of life resources, CNSA will continue to upgrade and expand. The infrastructure will be upgraded to improve the efficiency of the system and user experience. In addition, new data types associated with the omics data such as sequence, protein, metabolism, expression, clinic and image will be gradually added to the database to enrich our database and meet the needs of more users in the future.

Acknowledgements

We gratefully thank other colleagues in the CNGB who helped to create and maintain the CNSA.

Funding

This study was funded by the Guangdong Provincial Key Laboratory of Genome Read and Write (No. 2017B030301011).

Conflict of interest. None declared.

X.G. and F.C. contributed equally to this work.

References

Samuel

G.N.

and

Farsides

(

2017

)

The UK’s 100,000 Genomes Project: manifesting policymakers’ expectations

New Genet. Soc.

336

–

353

International Cancer Genome C, Hudson TJ, Anderson W, et al

(

2010

)

International network of cancer genome projects

Nature

464

993

–

998

PubMed

Tomczak

Czerwinska

and

Wiznerowicz

(

2015

)

The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge

Contemp. Oncol.

A68

–

A77

Levy

Chen

Clarke

et al. (

2020

)

Socioeconomic differences in health-care use and outcomes for stroke and ischaemic heart disease in China during 2009-16: a prospective cohort study of 0.5 million adults

Lancet Glob. Health

e591

–

e602

Exposito-Alonso

Drost

H.G.

Burbano

H.A.

et al. (

2020

)

The earth BioGenome project: opportunities and challenges for plant genomics and conservation

Plant J. Cell and Molec Biol

102

222

–

229

Karsch-Mizrachi

Takagi

Cochrane

et al. (

2018

)

The international nucleotide sequence database collaboration

Nucleic Acid. Res.

D48

–

D51

Kodama

Mashima

Kosuge

et al. (

2018

)

DNA Data Bank of Japan: 30th anniversary

Nucleic Acid. Res.

D30

–

D35

Cook

C.E.

Bergman

M.T.

Cochrane

et al. (

2018

)

The European Bioinformatics Institute in 2017: data coordination and integration

Nucleic Acid. Res.

D21

–

D29

Sayers

E.W.

Beck

Brister

J.R.

et al. (

2020

)

Database resources of the National Center for Biotechnology Information

Nucleic Acid. Res.

–

D16

10.

Droege

Barker

Seberg

et al. (

2016

)

The global genome biodiversity network (GGBN) data standard specification

Database

2016

, baw125.

11.

Terry

S.F.

(

2014

)

The global alliance for genomics & health

Genet. Test Mol Bioma

375

–

376

12.

Neumann

and

Brase

(

2014

)

DataCite and DOI names for research data

J. Comput. Aid. Mol. Des.

1035

–

1041

13.

National Genomics Data Center M, Partners

(

2020

)

Database resources of the National Genomics Data Center in 2020

Nucleic Acid. Res.

D24

–

D33

14.

Wang

Liu

Zhang

E.C.

et al. (

2019

)

The China National GeneBank horizontal line owned by all, completed by all and shared by all

Hereditas

761

–

772

PubMed

15.

Liu

Wei

Yang

et al. (

2019

)

Molecular digitization of a botanical garden: high-depth whole-genome sequencing of 689 vascular plant species from the Ruili Botanical Garden

GigaScience

, giz007.

16.

Zou

Xue

Luo

et al. (

2019

)

1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses

Nat. Biotech.

179

–

185

17.

Chen

Zhou

Chen

et al. (

2018

)

fastp: an ultra-fast all-in-one FASTQ preprocessor

Bioinformatics

i884

–

i890

18.

Members

B.I.G.D.C.

(

2018

)

Database resources of the BIG Data Center in 2018

Nucleic Acid. Res.

D14

–

D20