Abstract

Data generation, driven by rapid advances in genomic technologies, is fast outpacing our analysis capabilities. Faced with this flood of data, more hardware and software resources are added to accommodate data sets whose structure has not specifically been designed for analysis. This leads to unnecessarily lengthy processing times and excessive data handling and storage costs. Current efforts to address this have centered on developing new indexing schemas and analysis algorithms, whereas the root of the problem lies in the format of the data itself. We have developed a new data structure for storing and analyzing genotype and phenotype data. By leveraging data normalization techniques, database management system capabilities and a novel multi-table, multidimensional database structure, we have eliminated the following: (i) unnecessarily large data set size due to high levels of redundancy, (ii) sequential access to these data sets and (iii) common bottlenecks in analysis times. The resulting novel data structure horizontally divides the data to circumvent traditional problems associated with the use of databases for very large genomic data sets. The resulting data set required 86% less disk space and performed analytical calculations 6248 times faster than a standard approach, without any loss of information.

Database URL: http://castor.pharmacogenomics.ca

Introduction

Since the release of a working draft of the human genome project, there has been a proliferation of technologies to perform large-scale genotyping. The research possibilities provided by genome-wide analysis have created a deluge of data whose growth is reminiscent of Moore’s Law (1). In a single year, one massively parallel sequencing machine can produce nearly nine times the amount of data currently housed in the US Library of Congress (2–4).

Currently, the lengthy analysis times required for the vast quantities of genotype data generated make interactive analysis impractical (3). Sequential access, such as retrieving data from flat files, e.g. PLINK input files (5), has the limitation that all prior data must be processed in order to access a datum at the end of the file, and this process must be repeated for each variation of the original analysis. Furthermore, such approaches are memory-bound. Although PLINK provides a solution for many users and places no fixed limits on the size of the data file (5), larger data sets require an ever-increasing amount of RAM. For example, a sample set of 20 000 individuals and 1-million SNPs would require ∼8–16 GB of RAM (http://pngu.mgh.harvard.edu/∼purcell/plink/faq.shtml#faq5).

Alternatively, the use of databases has been hampered by challenges in data loading time and performance. The development of new techniques and tools is therefore necessary, as historical solutions have been rendered impractical by the extreme volume of data generated (2–4).

Efforts at reworking this analysis process have focused on three main areas: data structures, data indexing and data analysis. Initially, the improvement of data structures began with a logical model of genomic and phenotypic data using object-oriented structures (2,6), relational databases (2), or mark-up languages (7), which add substantial model-description metadata. These structures are more suited to providing data context and long-term storage than high-speed analysis (8), although some allow basic analytical querying (9). Recent data indexing efforts seek to improve pattern or sequence search performance. While these have shown a significant performance increase for specific targeted tasks, they have the drawback of increasing the data set size by up to 10 times (10–12). Hardware-based solutions such as cloud computing, peer-to-peer networks and other distributed computational concepts are now used to prolong the useful lifespan of software by increasing processing power. Other solutions circumvent the problem of large data sets altogether at the cost of losing content (13). Current data warehousing and data sharing methodologies are making progress but fall short of providing a solution for rapid analysis (14).

There are three obvious areas for improvement: (i) reduce data set sizes without any loss of information (also reducing long-term data storage costs), (ii) eliminate the need for sequential access and (iii) organize data to allow for rapid analysis. Our solution attempts to address all these areas of concern. Using established computer science principles we have developed the comprehensive analysis and storage (CASTOR) methodology, a normalized, multi-table and multidimensional database structure for storing and analyzing genotype and phenotype data.

Methods

Data normalization

Data normalization restructuring techniques reduce redundancy and increase the flexibility of a poorly structured data set without loss of information (15). These techniques are frequently used to make a data structure suitable for implementation in a relational database management system (RDBMS). Common genomic data sets such as Illumina’s Genomestudio output files, certain PLINK input files and the Gencode GTF format are all examples of data structures that, despite being produced by, or for, automated analysis, contain a significant amount of redundant data and therefore violate the principles of normalization.

In a typical Illumina Genomestudio results file, 63% of the output is composed of unnecessarily redundant data. Although only a single instance of each datum is required to communicate the necessary information, fields such as the sample identifier, the name of the single nucleotide polymorphism (SNP) in question and the SNP position are needlessly repeated for each row contained in the sequential file (Table 1). Since both SNP name and SNP position are associated with the SNP in question and not the sample, their inclusion on each row violates the second normal form. Because of this, data that should take up a total of 17-million characters (9 character SNP name + 8 character SNP location = 17 characters × 1 000 000 SNPs), or 0.009% of the final data set, instead take up 119-billion characters (17 characters × 1 000 000 SNPs × 7000 samples), or 63% of the final data set.

Table 1.

Genomic data structure with a large amount of duplicate data

Sample identifier | SNP        | SNP value | SNP position
Sample 1          | rs3094315  | CC        | 742 429
Sample 1          | rs41480945 | CC        | 21 227 772
Sample 1          | rs4040617  | CG        | 95 952 929
Sample 2          | rs3094315  | TT        | 742 429
Sample 2          | rs41480945 | AT        | 21 227 772
Sample 2          | rs4040617  | CC        | 95 952 929

This data set combines two different data structures, SNP information (SNP name and position) and sample information (sample identifier), in order to accommodate a datum that depends on both (the SNP value).

In order to address this redundancy, we have separated the data set into two individual but related tables: a SNP reference table, containing a list of all SNPs used in the study and their associated positions, and a genotype results table, containing sample information and all genotypic results.

The SNP reference table uses SNP name as the primary key and related fields as non-prime attributes. This results in one row of information for each SNP present in the study. The genotype results table contains a single row for each sample in the study, with each column representing the results of an individual SNP. This format is similar to the PLINK PED file format (5), which also has one sample per row, using columns to represent the SNPs. This approach leads to a large number of columns. A study involving 1 000 000 SNPs would result in a data set with 1 000 001 columns (one column for sample id, and one column for each SNP). This is impractical as a sequential file, and impossible to implement as a database structure, as a table with these dimensions is not supported by any current database management system (DBMS).
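As an illustration, a minimal SQL sketch of the normalized SNP reference table follows; the table name, column names and types here are ours for illustration and are not necessarily the exact CASTOR schema:

-- SNP reference table: one row per SNP, keyed on the SNP name.
create table snps_dim (
  snp_name     varchar(16) not null,  -- e.g. 'rs3094315'
  snp_position int         not null,  -- e.g. 742429
  primary key (snp_name)
);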

Multi-table

While a DBMS cannot accommodate an unlimited number of columns per table, most can accommodate a nearly unlimited number of tables per database. The number of tables is limited by the capacity of the underlying filesystem or, in the case of Microsoft’s SQL Server, by the number of database objects permitted (over 2 billion). It is this property that we exploit to accommodate our 1-million-SNP-wide structure, horizontally dividing the single, large genotype results table of 1 million SNPs into 2000 tables, each with 501 columns (500 SNPs and a sample identifier as primary key). This new structure is currently supported by all major DBMSs.

Each column in the genotype results table is denoted generically (snp1, snp2, etc.) and is included in the SNP reference table, allowing rapid identification of the specific SNP. As a result, uploading a new data set requires no structural changes (such as renaming each column).
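A sketch of one of the 2000 genotype fact tables under this scheme follows (names and types are illustrative); the SNP reference table then records which real SNP each generic column holds:

-- One horizontally divided genotype fact table: a sample identifier
-- plus 500 generically named SNP columns holding numeric genotype codes.
create table gtypes1 (
  patient_id int not null primary key,  -- sample identifier
  snp1   tinyint,  -- genotype code for the 1st SNP assigned to this table
  snp2   tinyint,
  /* ... snp3 through snp499 ... */
  snp500 tinyint
);

-- In the SNP reference table, each SNP is mapped to its fact table and
-- generic column, e.g. rs3094315 -> gtypes1.snp1, so a new data set can
-- be loaded without renaming any columns.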

Multidimensional encoding

Multidimensional databases are optimized for rapid and ad hoc computer-aided analysis or online analytical processing (OLAP) (16) by encoding all alphanumeric data as numeric data, and isolating descriptive data from the data required for the analysis. Using this methodology we have divided the information into dimension tables and fact tables. Dimension tables contain descriptive data including all the original alphanumeric descriptors and the code that replaces them in the fact tables. The fact tables contain only numeric data and are used to conduct the bulk of the analysis. Each possible combination of two alleles is encoded numerically into 10 values (Table 2).

Table 2.

Genotype dimension table (see genotypes_dim in Figure 1)

Code | Genotype | Allele_a | Allele_c | Allele_g | Allele_t
1    | AA       | 2        | 0        | 0        | 0
2    | CC       | 0        | 2        | 0        | 0
3    | GG       | 0        | 0        | 2        | 0
4    | TT       | 0        | 0        | 0        | 2
5    | AC       | 1        | 1        | 0        | 0
6    | AG       | 1        | 0        | 1        | 0
7    | AT       | 1        | 0        | 0        | 1
8    | CG       | 0        | 1        | 1        | 0
9    | CT       | 0        | 1        | 0        | 1
10   | GT       | 0        | 0        | 1        | 1

This encoding results in a smaller, faster, and more flexible data set, which is more suitable for analysis. While the structure and content change, none of the information contained in the initial data set is lost.
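A sketch of the genotype dimension table of Table 2 expressed in SQL (the multi-row insert syntax shown is MySQL-style; the column types are assumptions):

-- Dimension table mapping each numeric genotype code back to its
-- alphanumeric genotype and per-allele counts.
create table genotypes_dim (
  code     tinyint not null primary key,
  genotype char(2) not null,
  allele_a tinyint not null,
  allele_c tinyint not null,
  allele_g tinyint not null,
  allele_t tinyint not null
);

insert into genotypes_dim values
  (1, 'AA', 2, 0, 0, 0), (2, 'CC', 0, 2, 0, 0), (3, 'GG', 0, 0, 2, 0),
  (4, 'TT', 0, 0, 0, 2), (5, 'AC', 1, 1, 0, 0), (6, 'AG', 1, 0, 1, 0),
  (7, 'AT', 1, 0, 0, 1), (8, 'CG', 0, 1, 1, 0), (9, 'CT', 0, 1, 0, 1),
  (10, 'GT', 0, 0, 1, 1);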

Phenotypes are similarly encoded. Phenotype data already in numeric format remain unchanged; however, an entry is made in the phenotypes_dim table (Table 3) to ensure that the context of the phenotype is not lost. Each alphanumeric phenotype is assigned an integer code in the phenotypes_discrete_dim table and a parent entry is added to the phenotypes_dim table (Tables 3 and 4). Numeric codes from phenotypes_discrete_dim are used to populate the phenotype fact tables. Using this methodology, almost all alphanumeric values are converted to numeric values, making these tables suitable for automated analysis. Note, however, that free-form text entries cannot be encoded in this way and should therefore be avoided whenever automated analysis is the goal.

Table 3.

Phenotype dimension table (see phenotypes_dim, Figure 1)

Id | Name                      | Discrete | Description                      | Column
1  | Medication dosing (units) | 0        | Medication dose per day in units | ptype1
2  | Pain severity             | 0        | Severity of patient pain         | ptype2
3  | Smoking status            | 1        | Never, former, current           | ptype3
Table 4.

Discrete phenotype dimension table (see phenotypes_discrete_dim, Figure 1)

Code | Phenotypes_dim_id | Label
1    | 3                 | Never smoked
2    | 3                 | Former smoker
3    | 3                 | Current smoker
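As a sketch, the smoking-status phenotype of Tables 3 and 4 would be encoded with one parent entry and three discrete codes (column order follows the tables above; the SQL is illustrative):

-- Parent entry: phenotype 3 is discrete and stored in fact column ptype3.
insert into phenotypes_dim values (3, 'Smoking status', 1, 'Never, former, current', 'ptype3');

-- Integer codes for each discrete value; these codes populate the ptype3
-- column of the phenotype fact tables.
insert into phenotypes_discrete_dim values (1, 3, 'Never smoked');
insert into phenotypes_discrete_dim values (2, 3, 'Former smoker');
insert into phenotypes_discrete_dim values (3, 3, 'Current smoker');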

Test platforms

All tests were run on Dell PE2900 servers with two quad-core Xeon E5410 processors (2 × 6 MB cache, 2.33 GHz, 1333 MHz FSB), 16 GB of 667 MHz dual-ranked DIMMs and 8 × 300 GB 15K RPM SCSI 3 Gbps drives mounted in RAID 1+0 for 1.2 mirrored terabytes of disk space. The operating system was Red Hat Enterprise Linux 5, and MySQL Community Server 5.0.67 compiled for RHEL5 (MyISAM) or Oracle 11G was used as the DBMS.

Tests were conducted using both DBMSs, but only Oracle 11G was able to manage the 7 billion rows contained in the original data set; it was therefore used for all load, statistical and data-return comparisons on the original data set.

Two computer hosts were used. The first host handled only the dimension tables and the software client responsible for issuing the database queries and collecting the results. The second host, configured to maximize the performance of the DBMS responsible for manipulating the fact tables, performed the analysis.

Evaluation

In the absence of a sufficiently large publicly available data set, a very large data set composed of 7000 subjects, each with 7000 phenotypes (both quantitative and dichotomous) and 1 000 000 bi-allelic genotypes, for a total of 7 049 000 000 data points, was randomly generated and used to evaluate the performance of our novel database structure. A test suite was written in Perl (17), which created the database structure, disabled indices before the data set was loaded, loaded the data set and then re-enabled the indices. Load time was defined as the sum of the time required to perform these operations.
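For the MySQL/MyISAM configuration, the per-table portion of this load sequence would look roughly as follows (the file and table names are illustrative; the actual test suite was implemented in Perl):

-- Suspend index maintenance, bulk load one fact table, then rebuild
-- the indices in a single pass.
alter table gtypes1 disable keys;
load data local infile 'gtypes1.tsv' into table gtypes1;
alter table gtypes1 enable keys;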

Our test suite then measured the impact of the new schema using the same computer hardware and operating system, for a direct comparison. To avoid comparing the speed of a DBMS versus a sequential file, which would require the evaluation of a great number of hardware and operating system variables, the original data set was also loaded into the DBMS for evaluation (see Supplemental Data for more detail).

The database management system and multidimensional nature of the data were kept constant for both the original and CASTOR data sets to measure only the efficiencies of the novel database structure. A variety of statistical and common GWAS analyses were performed on the data sets (mean, square root, minimum, standard deviation, variance, allele count with a phenotypic filter) (Table 5). In addition, we tested how rapidly data could be located and retrieved from the databases.

Table 5.

Query return times of common statistical analyses based on a single table query (genotype or phenotype)

Query                                                          | CASTOR (s) | Original (s)
Query (gtypes) avg(int)                                        | 0.347017   | 871.454132
Query (gtypes) sqrt(int)                                       | 0.096701   | 0.050104
Query (gtypes) min(int)                                        | 0.319485   | 716.520514
Query (ptypes) stddev(int)                                     | 0.014837   | 1341.081771
Query (ptypes) avg(float)                                      | 0.010675   | 1417.641397
Query (ptypes) sqrt(float)                                     | 0.003062   | 12.227921
Query (ptypes) min(float)                                      | 0.009296   | 0.014992
Query (ptypes) stddev(float)                                   | 0.013895   | 3.202807
Query (ptypes) var_pop(float)                                  | 0.010164   | 16.41966
Query (gtypes) count(int) where int is 1                       | 0.325984   | 16.669058
Query (gtypes) count(int) where int is 3                       | 0.358017   | 641.80022
Query (gtypes) count(int) where int is 4 and patient_id = 1234 | 0.027244   | 668.470442

All queries performed on the Oracle 11G DBMS.

Results

Our CASTOR approach converts the sequential file into a normalized and indexed, direct-access database (Figure 1). Combining all normalization techniques the data set was reduced from 98.4 to 13.8 GB, a decrease in disk space usage of 86%, without loss of information (Table 6). Removing redundant SNP information alone reclaimed over 50 GB of space.

Figure 1.

CASTOR data diagram.

Table 6.

Performance comparison results

                                                           | Original         | CASTOR
Size of genotype data                                      | 97 GB            | 6.8 MB × 2000 tables = 13.3 GB
Size of phenotype data                                     | 1.4 GB           | 34 MB × 14 tables = 476 MB
Total data set size                                        | 98.4 GB          | 13.776 GB
Oracle 11G: load time (min)                                | 493.5 (8.23 h)   | 90.3 (1.51 h)
Oracle 11G: total time to run all performance tests (min)  | 937.2 (15.62 h)  | 0.15 (9.1 s)
Oracle 11G: total time to perform evaluation (min)         | 1430.7 (23.85 h) | 90.18 (1.50 h)

Once loaded, the data can be reused and reanalyzed without the need to repeat either the conversion or the data load. The significant decrease in load time (90.3 min versus 8.23 h for the original data set) is primarily a result of data set size reduction due to normalization, and the corresponding reduction in index size due to the horizontal segmentation of this data set. The smaller indices are easily loaded into available memory when needed, removing the need to use slower hard disk based virtual memory space often required by larger indices.

As each column in the CASTOR data structure represents a SNP, the database metadata itself is responsible for SNP indexing, thus obviating the need to separately index the SNPs for rapid data access, as has been the focus of earlier efforts (10–12).

The genotype table from the non-optimized original database (single table) structure had a row count of 7 billion (1 million genotypes for 7000 samples) and the phenotype table had 49 million rows (7000 phenotypes for 7000 samples). With an alphanumeric index (such as the combination of SNP name and sample id), the index alone would take up 111 GB of memory (9 character SNP name + 8 character sample id = 17 characters × 7 billion records). Our CASTOR database has significantly fewer rows (7000 per table, one row per sample), but has 2000 genotype tables and 14 phenotype tables, dividing the data set into smaller, fragmented indices. Using a single row per sample, the index on each CASTOR table is 55 KB (7000 samples × 8 character sample id), allowing for very rapid load times. The total size of the CASTOR indices (across all 2000 genotype tables) is 107 MB, but since the indices are fragmented across many tables, only those indices needed to fulfill a specific query are loaded at any given time. Table 6 illustrates the benefits of the CASTOR approach.

Since DBMSs are optimized for column-oriented calculations, using each column as a list of all genotypes for a particular SNP (one SNP across all samples) optimizes the data set for GWAS-type analyses while still supporting row-based calculations across SNPs when necessary.
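For example, a genotype count for a single SNP across all samples reads only one column of one fact table; a sketch, with gtypes1 and snp17 as illustrative names:

select g.genotype, count(*) as n
from gtypes1 t, genotypes_dim g
where t.snp17 = g.code
group by g.genotype;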

The final result is a CASTOR data set (containing all of the original information) that is very wide, comprising over 2000 tables and 1-million columns for genotypes alone, but quite short, with only a single row per sample in each table (Figure 1). Moving from a single-table to a multi-table data structure reduced the time required to conduct the performance analysis by 99.9% (15.62 h compared to 9.1 s) (Table 6).

Discussion

Applying both well-known and novel data transformation and data architecture techniques, we have arrived at a simple and elegant solution that achieves a significant data set size reduction and a dramatic increase in processing speed. As data are loaded into the database, they are normalized to remove duplications, encoded into numeric data and then divided into the novel multidimensional, multi-table structure specifically designed for the analysis of large genetic data sets. Converting the original data set into a multidimensional data set has many advantages, such as enabling the use of OLAP (16) and increasing the speed of analysis (Table 5) by eliminating slower alphanumeric data from the analysis tasks. An additional benefit is the further reduction of the size of the data set while preserving all of the information contained in the original.

A multi-dimensional encoding scheme can furthermore be used to encode more than just the initial data. For example, the genotypes dimension table (Table 2) not only encodes the 10 possibilities of genotype pairs, it also easily separates homozygous pairs (code ≤4) from heterozygous pairs (code ≥5). Counting alleles, a basic calculation in a GWAS, can be accommodated with the following structured query language (SQL) query:

select sum(genotypes_dim.allele_a)
from genotypes_dim, gtypes1
where gtypes1.snp2 = genotypes_dim.code
and genotypes_dim.allele_a > 0

Here, genotypes_dim is the database table that holds the information for each genotype; allele_a is the count of A alleles in a particular genotype; gtypes1 is the table containing the genotypes for the first 500 SNPs; and snp2 is the column containing the genotype code for the second SNP in that table.
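The same pattern extends to the allele count with a phenotypic filter used in the performance tests (Table 5); a sketch, assuming a phenotype fact table ptypes1 keyed on the same patient_id and holding smoking status in column ptype3, with code 3 denoting a current smoker as in Table 4:

select sum(genotypes_dim.allele_a)
from genotypes_dim, gtypes1, ptypes1
where gtypes1.snp2 = genotypes_dim.code
and ptypes1.patient_id = gtypes1.patient_id
and ptypes1.ptype3 = 3
and genotypes_dim.allele_a > 0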

If adopted, this approach would offload basic statistical manipulations to the database, provide a platform for automated initial quality control and analysis, and result in savings in disk storage, data archiving and transfer time. Our CASTOR approach, if adopted for biological data sets, would provide a much more reasonable starting point that could enable analytical solutions on laptop computers or other non-specialized hardware, while still benefitting from the performance improvements available to cloud computing and other hardware-based solutions. The CASTOR approach will help meet the demand for high-speed analysis by providing a solid foundation to handle ever-increasing amounts of genetic data. Our data set can scale to several million samples and a nearly unlimited number of SNPs with nothing more than a linear impact on performance.

Aside from the stated performance benefits, CASTOR also has a potential impact on storage costs associated with this data. Based on published estimates, the average long-term storage cost currently is $25/month/GB (18), including all overheads. Prior to any normalization, the original data set composed of 7000 samples with 1-million SNPs and 7000 phenotypes would cost $29 520 per year in total storage costs. Based on the same published estimates, the same information in CASTOR format would cost $4140 per year.

The next step is to incorporate the CASTOR approach into commonly used software packages such as PLINK. The CASTOR approach, as it is DBMS-based, readily accommodates multi-processing and multi-core processor architecture. This should significantly reduce the time required to perform GWAS or similar analyses, allow for the development of new algorithms, as well as extend the lifespan of current software tools by eliminating hardware bottlenecks.

Funding

Génome Québec. Funding for open access charge: Génome Québec.

Conflict of interest. None declared.

Acknowledgements

Thanks to Christopher Beck, Amr Al Mallah and Madjid Hihi for their advice and encouragement; Paul Guelpa for his graphical skills and feedback; and Andrea Smith for comments and suggestions.

References

1. Moore, G.E. (2000) Cramming more components onto integrated circuits. In: Readings in Computer Architecture. Morgan Kaufmann Publishers Inc., San Francisco, CA, pp. 56–59.
2. Antofie, A., Lateur, M., Oger, R. et al. (2007) A new versatile database created for geneticists and breeders to link molecular and phenotypic data in perennial crops: the AppleBreed DataBase. Bioinformatics, 23, 882–891.
3. Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P. (1996) The KDD process for extracting useful knowledge from volumes of data. Commun. ACM, 39, 27–34.
4. Férey, N., Gros, P.E., Hérisson, J. et al. (2005) Visual data mining of genomic databases by immersive graph-based exploration. In: Proceedings of the Third International Conference on Computer Graphics and Interactive Techniques in Australasia and South East Asia. ACM, Dunedin, New Zealand, pp. 143–146.
5. Purcell, S., Neale, B., Todd-Brown, K. et al. (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. Am. J. Hum. Genet., 81, 559–575.
6. Barbasiewicz, A., Liu, L., Lang, B.F. et al. (2002) Building a genome database using an object-oriented approach. In Silico Biol., 2, 213–217.
7. Cohn, J.D. (2000) XML and genomic data. SIGBIO Newslett., 20, 22–24.
8. Sen, A. and Sinha, A.P. (2005) A comparison of data warehousing methodologies. Commun. ACM, 48, 79–84.
9. De Francesco, E., Di Santo, G., Palopoli, L. et al. (2009) A summary of genomic databases: overview and discussion. In: Sidhu, A.S. and Dillon, T.S. (eds), Studies in Computational Intelligence. Springer, Germany, pp. 37–54.
10. Barsky, M., Stege, U., Thomo, A. et al. (2008) A new method for indexing genomes using on-disk suffix trees. In: Proceeding of the 17th ACM Conference on Information and Knowledge Management. ACM, Napa Valley, CA, pp. 649–658.
11. Cooper, G., Raymer, M., Doom, T. et al. (2004) Indexing genomic databases. In: Fourth IEEE Symposium on Bioinformatics and Bioengineering. Taichung, Taiwan, ROC, pp. 587–591.
12. Phoophakdee, B. and Zaki, M.J. (2007) Trellis+: an effective approach for indexing genome-scale sequences using suffix trees. Pac. Symp. Biocomput., 13, 90–101.
13. Hong, M.G., Pawitan, Y., Magnusson, P.K.E. et al. (2009) Strategies and issues in the detection of pathway enrichment in genome-wide association studies. Hum. Genet., 126, 289–301.
14. Olund, G., Brinne, A., Lindqvist, P. et al. (2009) Unleashing genotypes in epidemiology - a novel method for managing high throughput information. J. Biomed. Inform., 42, 1029–1034.
15. Ramakrishnan, R. and Gehrke, J. (2003) Database Management Systems. McGraw-Hill, New York, NY.
16. Colliat, G. (1996) OLAP, relational, and multidimensional database systems. SIGMOD Rec., 25, 64–69.
17. Wall, L. (2000) Programming Perl. O'Reilly & Associates, Inc., Sebastopol, CA.
18. Butts, S. (2009) How to Use Single Instancing to Control Storage Expense. eWeek, Ziff Davis Enterprise, New York, NY.

Author notes

Present address: Sharon Marsh and Tibor van Rooij, Faculty of Pharmacy and Pharmaceutical Sciences, 3126 Dentistry/Pharmacy Centre, University of Alberta, Edmonton, Alberta, T6G 2N8 Canada.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary data