Ensembl BioMarts: a hub for data retrieval across taxonomic space Open Access

Table 1.

Summary of data available at the Ensembl BioMart as of Ensembl release 61

Data set	Description of data content
Ensembl Genes 61	Genes from 52 species with annotated external references, protein domains, multi species comparison (orthologs, possible orthologs and paralogs), variation (germline and somatic), regulation (probe set mapping for microarray platforms), gene ontology, expression (GNF/Atlas) and transcript splicing event data
Ensembl Variation 61	Variation data for 18 species including human somatic mutation data from COSMIC (7), human structural variation, human phenotype, Genome Wide Association Studies (GWAS) and variation set data. Strain-specific data is available for certain other species.
Ensembl Regulation 61	Regulation data for human, mouse and Drosophila melanogaster (annotated, regulatory and external features)
Vega 41	Manually curated genes for human, mouse and zebrafish by the HAVANA group at WTSI and displayed in the VEGA database (21)
Reactome	Manually curated and peer-reviewed pathways from the BioMart (22) at http://www.reactome.org/cgi-bin/mart
PRIDE (EBI UK)	Proteomics data from the PRIDE PRoteomics IDEntifications (17) BioMart database at http://www.ebi.ac.uk/pride/prideMart.do

The Ensembl Genomes BioMarts are created using the BioMart database schemas generated by the Ensembl project and these are adapted to suit the specific requirements for each of the domains. A gene-centric database is available for each of the five domains and a variation-centric database is available for Protists, Fungi, Metazoa and Plants (Table 2).

Table 2.

Summary of data available at the Ensembl Genomes BioMarts as of Ensembl Genomes release 8

Data set	Description of data content
Ensembl Bacteria 8	249 genomes across 10 different clades (Gene database)
Ensembl Protists 8	11 species including Plasmodium falciparum, Plasmodium knowlesi, Plasmodium vivax and three oomycete genomes (Gene database for all species and Variation database for one species)
Ensembl Fungi 8	13 species, including eight Aspergillus species, Neosartorya fischeri, Puccinia graminis f. sp. tritici, Saccharomyces cerevisiae, Schizosaccharomyces pombe (Gene database for all species and Variation database for one species)
Ensembl Metazoa 8	30 species, including 12 Drosophila, five Caenorhabditis, Aedes aegypti and Apis mellifera (Gene database for all species and Variation database for two species)
Ensembl Plants 8	10 species, including Arabidopsis lyrata, Arabidopsis thaliana, Brachypodium distachyon, Oryza sativa, Oryza sativa indica group and Zea mays (Gene database for all species and Variation database for four species)

The Ensembl BioMart tables are available for download from the FTP site (ftp://ftp.ensembl.org/pub) for each release (e.g. the Ensembl Genes 61 BioMart database is available from ftp://ftp.ensembl.org/pub/release-61/mysql/ensembl_mart_61). Users can access the BioMarts through the web interface, the BioMart API, the biomaRt package from Bioconductor (19), SOAP-based and RESTful web services, and a publicly available MySQL server offering direct access to the BioMart databases (http://www.ensembl.org/info/data/mysql.html). Help and documentation are summarized in Table 3. The Ensembl and Ensembl Genomes BioMarts are also available through the BioMart central portal at http://www.biomart.org. Three Ensembl mirrors have been created to improve website performance for users around the globe. These mirrors, located on the west and east coasts of the USA (http://uswest.ensembl.org, http://useast.ensembl.org) and in Asia (http://asia.ensembl.org), also host the Ensembl BioMarts to facilitate more effective data access.
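As a rough illustration of the RESTful route, the sketch below builds a BioMart-style XML query in Python and encodes it into a request URL. The `martservice` path, the data set name `hsapiens_gene_ensembl` and the filter and attribute names are assumptions based on common BioMart conventions rather than details given in this article; they should be checked against the configuration of the mart being queried.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumed endpoint; the article does not spell out the web service path.
MARTSERVICE_URL = "http://www.ensembl.org/biomart/martservice"


def build_query_xml(dataset, filters, attributes):
    """Return a BioMart-style XML query string for a single data set."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # One <Filter> element per restriction, one <Attribute> per output column.
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


def build_request_url(xml_query, base_url=MARTSERVICE_URL):
    """URL-encode the XML query into a GET request for the web service."""
    return base_url + "?" + urllib.parse.urlencode({"query": xml_query})


# Assumed internal names for a "genes on chromosome 22" query.
xml_query = build_query_xml(
    "hsapiens_gene_ensembl",
    {"chromosome_name": "22"},
    ["ensembl_gene_id", "start_position", "end_position"],
)
request_url = build_request_url(xml_query)
```

Fetching `request_url` with any HTTP client should return tab-separated rows if the assumed names are valid; the request itself is left out so that the sketch runs offline.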

Table 3.

Summary of sources of help and documentation at Ensembl

Information resource	URL or Email address
Ensembl frequently asked questions	http://www.ensembl.org/Help/Faq
BioMart frequently asked questions	http://www.biomart.org/faqs.html
Tutorials	http://www.ensembl.org/info/website/tutorials
YouTube videos	http://www.youtube.com/user/EnsemblHelpdesk
Ensembl news containing information about updates to mart databases	http://www.ensembl.org/info/website/news
Ensembl Blog	http://www.ensembl.info
Ensembl archives containing archived BioMart databases	http://www.ensembl.org/info/website/archives
Ensembl helpdesk mailing list	helpdesk@ensembl.org
Ensembl Genomes helpdesk mailing list	helpdesk@ensemblgenomes.org
Ensembl Genomes portal website containing project information	http://www.ensemblgenomes.org

Query examples

To demonstrate the utility of the Ensembl and Ensembl Genomes BioMarts we present several biologically relevant queries that can be performed using available tools and interfaces.

Query #1: The G-protein coupled receptor domain (GPCR) has the InterPro ID of IPR000276. Find the human protein-coding genes in Ensembl that code for this domain, and investigate whether any of them are detectable with the Affy HuGene 1_0 st v1 array.

Database: Data sets	Filters	Attributes
Ensembl Genes 61: Homo sapiens genes (GRCh37.p2)	Gene type: protein_coding	Ensembl Gene ID
	Limit to genes with these family or domain IDs: IPR000276	Associated Gene Name
		Affy HuGene 1_0 st v1

The GPCR genes make up a large protein family that covers a wide range of functions. A scientist may already know the InterPro ID of the GPCR rhodopsin-like domain and wish to find out how many Ensembl genes encode this domain and whether those genes are detectable with the Affy HuGene 1_0 st v1 array. To run this query, the user selects the protein_coding filter in the GENE filter section and filters on the known InterPro ID in the PROTEIN DOMAINS filter section. Attributes are selected from the Features:GENE and Features:EXTERNAL sections (Figure 1).

Figure 1.

There are 777 Ensembl protein coding genes that code for the GPCR domain with InterPro ID (IPR000276) and that are detectable with the Affy HuGene 1_0 st v1 array 25.
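A gene count such as the one in Figure 1 can be checked offline once the results have been exported. The sketch below assumes a three-column TSV export (gene ID, gene name, probe set ID) in which genes without a matching probe set have an empty last column; the sample rows are made-up placeholders, not real results.

```python
import csv
import io


def count_genes_with_probe(tsv_text, gene_col=0, probe_col=2):
    """Count distinct gene IDs that have at least one non-empty probe set ID."""
    genes = set()
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) > probe_col and row[probe_col].strip():
            genes.add(row[gene_col])
    return len(genes)


# Placeholder rows: GENE_A has two probe sets, GENE_B has none.
sample = (
    "GENE_A\tNAME_A\tPROBE_1\n"
    "GENE_A\tNAME_A\tPROBE_2\n"
    "GENE_B\tNAME_B\t\n"
    "GENE_C\tNAME_C\tPROBE_3\n"
)
n_detectable = count_genes_with_probe(sample)
```

Counting distinct gene IDs rather than rows matters here because a gene mapped to several probe sets appears on several rows.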

Query #2: esv263 is the DGVa accession number of a structural variation from Redon et al. (20). What genomic region does this copy number variation span?

Database: Data sets	Filters	Attributes
Ensembl Variation 61: Homo sapiens Structural Variation	Limit to variants with these IDs: esv263	Chromosome Name
		Sequence region start (bp)
		Sequence region end (bp)
		Structural Variation Name
		Structural Variation Description
		Source Name

Recent studies such as Redon et al. (20) have mapped copy number variations (CNV) in the human population. Redon et al. (20) studied 270 individuals from four populations whose DNA was screened for CNVs. Having read the article, a user may be interested in finding out more about a particular structural variation, such as the size of the genomic region that a particular structural variation spans (Figure 2). To do this query, the user must filter on the Structural Variation Name in the GENERAL STRUCTURAL VARIATION FILTERS and the attributes can be selected from the STRUCTURAL VARIATION attribute section.

Figure 2.

The esv263 structural variation from DGVa occurs between 16 265 092 and 16 446 378 bp on chromosome 12.
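The size of the region follows directly from the two coordinates. The short sketch below assumes that start and end are 1-based and inclusive, as is the convention for Ensembl coordinates; the function name is illustrative.

```python
def span_length(start_bp, end_bp):
    """Length of a region given 1-based, inclusive start and end coordinates."""
    if end_bp < start_bp:
        raise ValueError("end must not be smaller than start")
    return end_bp - start_bp + 1


# Coordinates of esv263 on chromosome 12, as reported in Figure 2.
esv263_length = span_length(16_265_092, 16_446_378)
print(esv263_length)  # 181287
```

Under that assumption the variation spans roughly 181 kb.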

Query #3: Are there any genes in Ensembl that contain somatic mutations associated with tumors in the eye?

Database: Data sets	Filters	Attributes
Ensembl Variation 61: Homo sapiens Somatic Variation (COSMIC 50)	Phenotype: COSMIC: tumor_site:eye	Variation ID
		Chromosome name
		Position on Chromosome (bp)
		Allele
		Phenotype description
		Associated gene
		Ensembl Gene ID

The COSMIC project focuses on somatic mutations relating to human cancers. A somatic variation data set has been incorporated into the Ensembl Variation BioMart database to give users access to this data. A scientist can select from a list of COSMIC phenotypes from the GENERAL VARIATION FILTERS filter section, choose a selection of useful attributes from the Variation:SEQUENCE VARIATION and Variation:GENE ASSOCIATED INFORMATION attribute sections and export their results in a selection of file formats (Figure 3).

Figure 3.

There are 100 single nucleotide polymorphisms in the human somatic variation data set associated with tumors in the eye. The list of Ensembl gene IDs containing these variations can be downloaded for further study, or the user can click an entry in the Ensembl Gene ID column of the interface, which links to the main Ensembl website.

Query #4: Find the HGNC symbols for a list of human variations.

Database: Data sets	Filters	Attributes
Ensembl Variation 61: Homo sapiens variation (dbSNP 132;ENSEMBL)	Limit to variants with these IDs dbSNP rs IDs: rs348, rs362, rs364, rs565, rs645	Variation ID
		Chromosome name
		Position on chromosome (bp)
		Ensembl Gene ID
Ensembl Genes 61: Homo sapiens genes (GRCh37.p2)		HGNC ID
Ensembl Genes 61: Homo sapiens genes (GRCh37.p2)		HGNC symbol

This query requires the user to select filters and attributes from the human data set in the Variation BioMart database and attributes from the human data set in the Ensembl Genes BioMart database. Linking two data sets is a useful feature of the BioMart technology that allows complex cross-database queries to be constructed. Here the user has a list of dbSNP IDs and wants the Ensembl gene IDs that contain these variations, together with their corresponding HGNC IDs and symbols (Figure 4). The user must first upload the list of dbSNP IDs in the GENERAL VARIATION FILTERS section and then select the required attributes from the Variation:SEQUENCE VARIATION and Variation:GENE ASSOCIATED INFORMATION attribute sections. Next, the user selects the second data set [Homo sapiens genes (GRCh37.p2) from the Ensembl Genes mart] from the left sidebar and selects HGNC ID and HGNC symbol from the Features:EXTERNAL attribute section.

Figure 4.

Five dbSNP rs IDs were used to filter the human variation data set and Ensembl gene IDs containing these five variations were selected in the attributes. Then linking to the second data set, human gene data set from Ensembl Genes database, the HGNC ID and symbol were selected in the attribute section to retrieve the corresponding gene names from HGNC. They are FAN1, MTMR10 and EEF1DP3.
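Conceptually, linking the two data sets behaves like a join on the Ensembl Gene ID. The sketch below imitates such a join locally on two small in-memory tables. The gene IDs, the symbols and the assignment of rs IDs to genes are placeholders invented for illustration, and the inner-join behaviour is an assumption of this sketch; real BioMart linking is performed on the server.

```python
def link_on_gene_id(variation_rows, gene_rows):
    """Join variation rows to gene rows on the shared 'gene_id' key."""
    genes_by_id = {g["gene_id"]: g for g in gene_rows}
    linked = []
    for v in variation_rows:
        gene = genes_by_id.get(v["gene_id"])
        if gene is None:
            continue  # assumed inner-join behaviour: unmatched rows are dropped
        linked.append({**v, "hgnc_symbol": gene["hgnc_symbol"]})
    return linked


# Placeholder tables standing in for the two data sets.
variation_rows = [
    {"variation_id": "rs348", "gene_id": "GENE_1"},
    {"variation_id": "rs362", "gene_id": "GENE_2"},
    {"variation_id": "rs364", "gene_id": "GENE_UNKNOWN"},
]
gene_rows = [
    {"gene_id": "GENE_1", "hgnc_symbol": "SYMBOL_1"},
    {"gene_id": "GENE_2", "hgnc_symbol": "SYMBOL_2"},
]
linked = link_on_gene_id(variation_rows, gene_rows)
```

Each linked row carries the variation ID together with the symbol of the gene that contains it, which mirrors the column layout of the combined result table.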

Query #5: Find the genes from Escherichia coli strain K12 that are found within the region ‘360473–365601’ and discover whether there are any orthologs in the related strains E. coli O157:H7 EC4115 and E. coli DH10B.

Database: Data sets	Filters	Attributes
Ensembl Bacteria Bacterial Mart (Release 8): Escherichia coli K12 genes	Gene start (bp): 360473	Ensembl Gene ID
	Gene end (bp): 365601	Ensembl Transcript ID
		Associated Gene Name
		Escherichia coli DH10B Ensembl Gene ID
		Escherichia coli DH10B Chromosome Start (bp)
		Escherichia coli DH10B Chromosome End (bp)
		Escherichia coli O157:H7 EC4115 Ensembl Gene ID
		Escherichia coli O157:H7 EC4115 Chromosome Start (bp)
		Escherichia coli O157:H7 EC4115 Chromosome End (bp)

This query finds the E. coli K12 genes that lie in the given region and then checks whether they have orthologs in two related strains of E. coli. Such comparisons are interesting because they may highlight genes that have been acquired by some strains or lost in others (Figure 5). To run this query, add the gene start and end coordinates in the REGION filter section and then select the attributes from the Homologs:GENE and Homologs:ORTHOLOGS attribute sections.

Figure 5.

The genes in the filtered region were lacA, lacY and lacZ and we can see that there are no orthologs for the lacZ gene in the E. coli DH10B strain.
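Genes lacking an ortholog show up as empty ortholog columns in the exported table, so they can be picked out with a simple scan. In the sketch below the column names and ortholog IDs are hypothetical; only the three gene names and the missing DH10B ortholog of lacZ come from the result above, and the EC4115 column is filled in purely for illustration.

```python
def genes_without_ortholog(rows, strain_column):
    """Return the names of genes whose ortholog column for a strain is empty."""
    return sorted(
        row["gene_name"]
        for row in rows
        if not row.get(strain_column, "").strip()
    )


# Placeholder export: ortholog IDs are invented, empty string means no ortholog.
rows = [
    {"gene_name": "lacA", "dh10b_gene_id": "DH10B_GENE_1", "ec4115_gene_id": "EC4115_GENE_1"},
    {"gene_name": "lacY", "dh10b_gene_id": "DH10B_GENE_2", "ec4115_gene_id": "EC4115_GENE_2"},
    {"gene_name": "lacZ", "dh10b_gene_id": "", "ec4115_gene_id": "EC4115_GENE_3"},
]
missing_in_dh10b = genes_without_ortholog(rows, "dh10b_gene_id")
```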

Query #6: The three-gene APL1 locus encodes essential components of the mosquito immune defense against malaria parasites. Find the variations within the APL1A, APL1B and APL1C genes as well as the strain name, strain genotype, allele and biotype.

Database: Data sets	Filters	Attributes
Ensembl Metazoa Metazoa Variation Mart (release 8): Anopheles gambiae variations (AgamP3)	Ensembl Gene IDs: AGAP007035 AGAP007036 AGAP007033	Variation ID
		Chromosome name
		Position on Chromosome (bp)
		Allele
		dbSNP rsID
		Strain Name
		Strain Genotype
		Ensembl Gene ID
		Biotype

The Ensembl Metazoa Variation BioMart database consolidates single nucleotide polymorphisms from high-density, genome-wide mosquito SNP-genotyping array mapping and enables users to retrieve variations from the SNP-array identified through sequencing of two genetically diverged molecular forms of A. gambiae, Mopti (M) and Savanna (S) (23). This resource could help to analyze parasite susceptibility alleles from population subgroups. Query 6 shows how a user can obtain variation data for a particular gene or set of genes of interest (Figure 6). To do this query, the user must upload the gene IDs to the GENE ASSOCIATED VARIATION FILTERS section and then select the attributes of interest from the Variation: SEQUENCE VARIATION and Variation:GENE ASSOCIATED INFORMATION sections.

Figure 6.

Having first retrieved the Ensembl gene IDs for the three APL1 genes, these are used to filter the A. gambiae data set. Fifty variations were retrieved that lie within the three genes of the APL1 locus.
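When strain attributes are included, the same variation may plausibly appear on several rows, one per strain, so per-gene totals should count distinct variation IDs. The sketch below assumes that row layout; the gene IDs are those used in Query #6, while the variation IDs are placeholders.

```python
from collections import Counter


def variations_per_gene(rows):
    """Count distinct variation IDs for each gene ID."""
    seen = set()
    counts = Counter()
    for variation_id, gene_id in rows:
        if (variation_id, gene_id) not in seen:
            seen.add((variation_id, gene_id))
            counts[gene_id] += 1
    return counts


# Placeholder rows: VAR_1 is listed twice, as it would be for two strains.
rows = [
    ("VAR_1", "AGAP007033"),
    ("VAR_1", "AGAP007033"),
    ("VAR_2", "AGAP007033"),
    ("VAR_3", "AGAP007036"),
]
counts = variations_per_gene(rows)
```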

Query #7: Find the coding sequence for all human genes on chromosome 22 along with the gene name and gene start and end.

Database: Data sets	Filters	Attributes
Ensembl Genes 61: Homo sapiens genes (GRCh37.p2)	Chromosome 22	Coding sequence
		Ensembl Gene ID
		Associated Gene Name
		Associated Gene DB
		Gene Start (bp)
		Gene End (bp)

The BioMart technology allows for the download of sequence information in a usable format. This is a powerful feature that allows users to retrieve flanking sequence, exon sequence, 3′ and 5′-UTR, cDNA sequence, coding sequence and protein sequence. Query 7 illustrates how to retrieve coding sequences for all genes on chromosome 22 as well as obtaining information about the gene name and the location of the gene start and end (Figure 7). To do this query, select the chromosome from the REGION filter section and the attributes of interest from the Sequences:SEQUENCES and Sequence:Header Information attribute sections.

Figure 7.

The ability to retrieve sequence information for genes of interest is a powerful feature of the BioMart tool. Here a user can download the coding sequence for all genes on chromosome 22 as well as additional information about each gene and this can be exported in a useful format.
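Exported sequences can be read back into a program with a few lines of code. The sketch below assumes that the export is FASTA and that the selected header attributes appear on the header line separated by `|` in the order they were chosen; the records are invented placeholders, and the field names are illustrative.

```python
def parse_fasta(text, field_names):
    """Parse FASTA text into dicts; header fields are assumed '|'-separated."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New record: map header values onto the supplied field names.
            values = line[1:].split("|")
            current = dict(zip(field_names, values))
            current["sequence"] = ""
            records.append(current)
        elif current is not None:
            # Sequence lines may wrap, so concatenate them.
            current["sequence"] += line
    return records


# Placeholder export with two records and a wrapped sequence.
sample = (
    ">GENE_X|NAME_X|100|250\n"
    "ATGGCC\n"
    "TTTTAA\n"
    ">GENE_Y|NAME_Y|300|420\n"
    "ATGTGA\n"
)
records = parse_fasta(sample, ["gene_id", "gene_name", "gene_start", "gene_end"])
```

Header values are kept as strings; coordinates would need converting to integers before any arithmetic.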

Discussion and future directions

The BioMart interface and querying platform provide the Ensembl and Ensembl Genomes projects with the tools needed to build BioMart databases from the various source databases produced by the projects. The BioMart databases and accompanying interface give users a fast and flexible means of querying customized sets of biological data through a wide range of access methods. The BioMart software also supports federation with other databases of scientific interest, so that cross-database queries can be run, and it allows the Ensembl and Ensembl Genomes databases to be incorporated easily into other portals such as www.biomart.org.

As scientific activity evolves, and in an effort to provide the most useful resources for our users, both the Ensembl and Ensembl Genomes projects will incorporate data from additional species and handle new types of data, which will be included in the project BioMarts. In the future, we plan to move both projects to the new BioMart 0.8 code (24) and to incorporate the new interface into the main Ensembl website.

Funding

The Wellcome Trust provides majority funding for the Ensembl project (grant number WT062023) with additional support from the European Commission under SLING, grant agreement number 226073 (Integrating Activity) within Research Infrastructures of the FP7 Capacities Specific Programme; the UK Biotechnology and Biological Sciences Research Council (grant numbers BB/F019793/1, BB/I001077/1); the Bill and Melinda Gates Foundation; and the European Molecular Biology Laboratory. Funding for open access charge: The Wellcome Trust.

Conflict of interest. None declared.

Acknowledgements

The authors thank all the users of the Ensembl and Ensembl Genomes projects especially those who have provided us with feedback about the Ensembl BioMarts. The authors would also like to thank the members of the BioMart team at the Ontario Institute for Cancer Research (OICR), especially Dr Arek Kasprzyk, for providing sustained technical support and assistance over the years.

References

1. Flicek, Amode, Barrell, et al. Ensembl 2011. Nucleic Acids Res., 2011, pp. D800–D806.

2. Foelo, Sherry. NCBI dbSNP Database: content and searching. In: Weiner, Gabriel, Stephens (eds). Genetic Variation: A Laboratory Manual. Cold Spring Harbour, NY: Cold Spring Harbour Laboratory Press, 2007.

3. Chen, Cunningham, Rios, et al. Ensembl variation resources. BMC Genomics, 2010, p. 293.

4. Hunter, Apweiler, Attwood, et al. InterPro: the integrative protein signature database. Nucleic Acids Res., 2009, pp. D211–D215.

5. Church, Lappalainen, Sneddon, et al. Public data archives for genomic structural variation. Nat. Genet., 2010, pp. 813–814.

6. Bruford, Lush, Wright, et al. The HGNC database in 2008: a resource for the human genome. Nucleic Acids Res., 2008, pp. D445–D448.

7. Forbes, Tang, Bindal, et al. COSMIC (the Catalogue of Somatic Mutations in Cancer): a resource to investigate acquired mutations in human cancer. Nucleic Acids Res., 2010, pp. D652–D657.

8. Kersey, Lawson, Birney, et al. Ensembl genomes: extending ensembl across the taxonomic space. Nucleic Acids Res., 2010, pp. D563–D569.

9. Curwen, Eyras, Andrews, et al. The Ensembl automatic gene annotation system. Genome Res., 2004, pp. 942–950.

10. Vilella, Severin, Ureta-Vidal, et al. EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates. Genome Res., 2009, pp. 327–335.

11. Ballester, Johnson, Proctor, Flicek. Consistent annotation of gene expression arrays. BMC Genomics, 2010, p. 294.

12. McLaren, Pritchard, Rios, et al. Deriving the consequences of genomic variants with the Ensembl API and SNP effect predictor. Bioinformatics, 2010, pp. 2069–2070.

13. Stabenau, McVicker, Melsopp, et al. The Ensembl core software libraries. Genome Res., 2004, pp. 929–933.

14. Parker, Bragin, Brent, et al. Using caching and optimization techniques to improve performance of the Ensembl website. BMC Bioinformatics, 2010, p. 239.

15. Smedley, Haider, Ballester, et al. BioMart – biological queries made easy. BMC Genomics, 2009.

16. Raney, Cline, Rosenbloom, et al. ENCODE whole-genome data in the UCSC genome browser (2011 update). Nucleic Acids Res., 2011, pp. D871–D875.

17. Vizcaíno, Reisinger, Côté, Martens. PRIDE and “Database on Demand” as valuable tools for computational proteomics. Meth. Mol. Biol., 2011, vol. 696, p. 105.

18. Croft, O'Kelly, et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res., 2011, pp. D691–D697.

19. Durinck, Spellman, Birney, Huber. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat. Protoc., 2009, pp. 1184–1191.

20. Redon, Ishikawa, Fitch, et al. Global variation in copy number in the human genome. Nature, 2006, vol. 444, pp. 444–454.

21. Wilming, Gilbert, Howe, et al. The vertebrate genome annotation (Vega) database. Nucleic Acids Res., 2008, pp. D753–D760.

22. Shepherd, Forbes, Beare, et al. The Reactome BioMart. Database, 2011.

23. Neafsey, Lawniczak, Park, et al. SNP genotyping defines complex gene-flow boundaries among African malaria vector mosquitoes. Science, 2010, vol. 330, pp. 514–517. Erratum in: Science. 330, 1477.

24. Zhang, Haider, Guberman, et al. BioMart: A data federation framework for large collaborative projects. Database, 2011.