Data mining using the Catalogue of Somatic Mutations in Cancer BioMart

Table 1.

Total contents in v51 of the COSMIC database, January 2011 release

Curated data type	Curated data count
Experiments	2 946 792
Tumours	577 304
Mutations	167 193
References	11 062
Genes	19 000
Fusions	5573
Structural variants	2729
Whole-cancer genomes	51
Whole-cancer exomes	332

Query examples

COSMICMart allows data to be filtered by six categories (Figure 1): cancer sample, gene, mutation, tumour site, histology and other (e.g. Ensembl Gene ID, Swissprot ID, Entrez Gene ID). The interface has a number of pre-selected filters and attributes; for example, mutated samples are selected by default. Users can change these to suit their requirements. Results are displayed in tabular form and can be exported in various formats for further analysis.
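Outside the web interface, BioMart queries are commonly written as a small XML document that names a data set, its filters and its attributes. The following is a minimal sketch of how the BRAF cell-line query from Figure 1 might be assembled in Python. The function `build_biomart_query` is hypothetical, the `Query` attribute values follow the pattern commonly seen in BioMart 0.7 queries, and every filter and attribute name below is an assumed placeholder: the internal names used by COSMICMart are not given in the text and should be checked against the server configuration.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes, formatter="TSV"):
    """Return a BioMart-style XML query string for a single data set.

    The XML declaration and DOCTYPE line are omitted for brevity.
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": formatter,          # tab-separated output
        "header": "1",                   # include column headers
        "uniqueRows": "1",               # collapse duplicate rows
        "datasetConfigVersion": "0.6",   # assumed configuration version
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# All names below are assumed placeholders, not confirmed COSMICMart identifiers.
xml_query = build_biomart_query(
    dataset="COSMIC51",
    filters={
        "mutated_sample": "yes",
        "sample_source": "cell-line",
        "gene_name": "BRAF",
        "aa_mutation_type": "substitution_missense",
    },
    attributes=["sample_name", "cosmic_mutation_id", "aa_mutation_syntax"],
)
print(xml_query)
```

Keeping filters in a dictionary and attributes in a list mirrors the two panels of the web interface, so a query designed interactively can be transcribed almost line by line.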

Figure 1.

Example of how COSMICMart can be queried. This query searches for all cell lines with missense substitution mutations in the BRAF gene (A). Attributes can be selected (B) to display in the results table (C).

Query #1: ‘Find all missense substitution mutations for BRAF in cell lines, and display sample, mutation, site, and histology information’ (Figure 1, Table 2).

Table 2.

Data sets, filters and attributes selected for query #1

Data sets	Filters	Attributes
COSMIC51	Mutated sample: yes	Sample name
	Sample source: cell-line	Sample source
	Gene name: BRAF	Gene name
	AA mutation type: substitution—missense	Cosmic mutation ID (COSM ID)
		CDS mutation syntax
		AA mutation syntax
		Primary site
		Primary histology
		Tumour source
		Pubmed ID

Missense mutations are the most common variant type in COSMIC; over 90% of mutations in BRAF are missense mutations at the p.V600 position. The results are returned as a tabular summary with links back to the COSMIC website. The sample name field links to the COSMIC sample overview web page, and the mutation ID (COSM ID) links to the COSMIC mutation summary page (Figure 2). From the COSMIC mutation summary web page, there are links to the Ensembl contig view so that the mutation can be viewed in a genomic context. There are also links to GMOD’s GBrowse, where COSMIC coding and non-coding mutations, gene footprints, structural rearrangements and copy number variants can be viewed (11).
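Once a result table has been exported, the share of mutations at a given residue can be tallied from the AA mutation syntax column. The sketch below is one way to do this in Python, assuming the column holds simple protein-level strings of the form `p.V600E`; the helper `tally_positions` is hypothetical, the input list is made up for illustration and is not COSMIC output, and anything other than a single-residue substitution is skipped.

```python
import re
from collections import Counter

# Matches single-residue substitutions such as "p.V600E":
# reference residue, position, mutant residue.
MISSENSE = re.compile(r"^p\.([A-Z])(\d+)([A-Z])$")


def tally_positions(aa_syntax_values):
    """Count simple substitutions per reference residue and position."""
    counts = Counter()
    for value in aa_syntax_values:
        match = MISSENSE.match(value.strip())
        if match:
            ref, pos, _alt = match.groups()
            counts[f"p.{ref}{pos}"] += 1
    return counts


# Illustrative input only; these rows are not taken from COSMIC.
example = ["p.V600E", "p.V600E", "p.V600K", "p.G469A", "p.?"]
counts = tally_positions(example)
share = counts["p.V600"] / sum(counts.values())
print(counts.most_common(1), share)
```

Run against a real export, the same tally would show how strongly the BRAF results concentrate at p.V600.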

Figure 2.

The COSMIC sample (A) and mutation (B) summary pages are linked directly from the COSMICMart output table.

Query #2: ‘Find all gene fusion mutations involving the FUS gene with a primary site of bone, and display mutation and sample information’ (Table 3).

Table 3.

Data sets, filters and attributes selected for query #2

Data sets	Filters	Attributes
COSMIC51	Mutated sample: yes	Cosmic sample ID
	Gene name: FUS	Sample name
	CDS mutation type: inferred breakpoint, observed mRNA	Sample source
	Primary site: bone	Cosmic fusion mutation ID
		CDS mutation syntax
		Pubmed ID

Gene fusions have been associated with a number of specific tumour types, including prostate and blood tumours. These biomarkers can be useful in diagnosis and as targets for drug therapies. COSMIC holds annotations for an increasing number of gene fusion mutations, which are viewable using COSMICMart. The COSMIC fusion mutation ID links to the gene fusion summary pages, which give a graphical view of the different fusion structures observed. Many of the papers describing gene fusions have identified more than one gene fusion product for the same genes in a single sample. Observed mRNAs are the actual expressed products reported in the results. However, to aid display and website navigation, we have inferred the genomic breakpoint from the experimental data.
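Because a single sample can carry more than one fusion product, it can help to group an exported result table by sample. The following Python sketch does this for tab-separated output, assuming the column headers match the attribute names in Table 3 (the real export headers may differ). The function `fusions_per_sample` is hypothetical, and the rows are placeholders rather than COSMIC data.

```python
import csv
import io
from collections import defaultdict


def fusions_per_sample(tsv_text, sample_col, fusion_col):
    """Map each sample to the sorted list of distinct fusion IDs reported for it."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    grouped = defaultdict(set)
    for row in reader:
        grouped[row[sample_col]].add(row[fusion_col])
    return {sample: sorted(ids) for sample, ids in grouped.items()}


# Placeholder rows for illustration; not real COSMIC identifiers.
example_tsv = (
    "Sample name\tCosmic fusion mutation ID\n"
    "sample_A\tfusion_1\n"
    "sample_A\tfusion_2\n"
    "sample_B\tfusion_1\n"
)

grouped = fusions_per_sample(example_tsv, "Sample name", "Cosmic fusion mutation ID")
# Keep only samples with more than one distinct fusion product.
multi = {sample: ids for sample, ids in grouped.items() if len(ids) > 1}
print(multi)
```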

Query #3: ‘Find variation information in Ensembl for all genes from mutated samples with a primary site of breast, and display COSMIC gene, mutation and sample information along with Ensembl variation information’ (Table 4).

Table 4.

Data sets, filters and attributes selected for query #3

Data sets	Filters	Attributes
COSMIC51	Mutated sample: yes	Cosmic sample ID
	Primary site: breast	Sample name
		Sample source
		Cosmic mutation ID (COSM ID)
		CDS mutation syntax
		AA mutation syntax
Ensembl: Homo sapiens genes		Features: Ensembl gene ID
		Features: Ensembl transcript ID
		Variations: variation source
		Variations: source description
		Variations: reference ID
		Variations: allele

COSMICMart is federated with Ensembl (12), which allows BioMart queries to return and integrate data from both resources. For instance, linking the two resources allows the retrieval of variation data from both (somatic mutations from COSMIC and germline polymorphisms from Ensembl) for a particular gene, set of genes or cancer type. There is increasing awareness of how genomic variation can affect a tumour’s sensitivity or resistance to anti-cancer agents. While this genetic variation can be familial or somatic, an understanding of common genetic variation around known cancer genes can be of great value to investigations searching for loci that modify a tumour’s response to drug therapy (13–15). This query is built by first selecting the filters and attributes in the COSMIC BioMart and then clicking the ‘Dataset’ link at the bottom of the left-hand margin of the BioMart interface. An additional data set, in this instance Ensembl, can then be selected from the drop-down list, which allows a federated query between COSMIC and Ensembl. The filters and attributes are then set in the usual way using the Ensembl BioMart to produce an integrated query.
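In BioMart's XML query format, a federated query of this kind is typically written as one `Query` element containing two `Dataset` elements, one per resource. The Python sketch below illustrates that shape for query #3. The function `build_federated_query` is hypothetical, the COSMIC filter and attribute names are assumed placeholders, and the Ensembl names (`hsapiens_gene_ensembl`, `ensembl_gene_id`, `ensembl_transcript_id`) are those used by the public Ensembl mart and may differ on a given server or release. The variation attributes from Table 4 are left out because their internal names are not given in the text. Such a query can only be resolved if the server has a link configured between the two data sets.

```python
import xml.etree.ElementTree as ET


def build_federated_query(datasets, formatter="TSV"):
    """Return a BioMart-style XML query spanning several linked data sets."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": formatter,
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",   # assumed configuration version
    })
    for spec in datasets:
        ds = ET.SubElement(query, "Dataset",
                           {"name": spec["name"], "interface": "default"})
        for name, value in spec.get("filters", {}).items():
            ET.SubElement(ds, "Filter", {"name": name, "value": value})
        for name in spec.get("attributes", []):
            ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


federated_xml = build_federated_query([
    {
        # Assumed placeholder names for the COSMIC side of the query.
        "name": "COSMIC51",
        "filters": {"mutated_sample": "yes", "primary_site": "breast"},
        "attributes": ["cosmic_sample_id", "sample_name", "cosmic_mutation_id"],
    },
    {
        # Ensembl side: no filters, gene and transcript identifiers only.
        "name": "hsapiens_gene_ensembl",
        "attributes": ["ensembl_gene_id", "ensembl_transcript_id"],
    },
])
print(federated_xml)
```

The order of the `Dataset` elements follows the order in which the data sets are chosen in the interface: COSMIC first, then Ensembl.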

Future directions

COSMIC will continue to curate newly discovered cancer genes and is committed to updating existing cancer genes, with a data release every 2 months. This will ensure that the scientific community has an up-to-date catalogue of somatic mutations implicated in human cancer. COSMICMart is also automatically updated with each new COSMIC release, which allows the data set to be easily mined and integrated with other resources. COSMIC has been successfully adapted to hold complete catalogues of somatic mutations for individual cancer samples. Currently, COSMIC holds genome-wide data on 383 tumour samples (51 whole genomes and 332 whole exomes; Table 1), and we expect this to increase in the near future.

We intend to federate COSMICMart with further BioMart-driven data resources in addition to the current link with Ensembl. Linking our data to PRIDE (16), UniProt (17) and InterPro (18) will allow COSMIC somatic mutation data to be linked to protein and peptide annotation, while the addition of the Reactome (19) database will allow the incorporation of pathway data. We also intend to create direct links between COSMIC and the ICGC Data Portal (http://dcc.icgc.org/) so that somatic mutation data can be integrated between the two data resources.

Funding

Funding for open access charge: Wellcome Trust (grant reference 077012/Z/05/Z).

Conflict of interest. None declared.

References

1. Haider, Ballester, Smedley, et al. BioMart Central Portal–unified access to biological data. Nucleic Acids Res. 2009.

2. Den Dunnen, Antonarakis. Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Hum. Mutat. 2000.

3. Forbes, Tang, Bindal, et al. COSMIC (the Catalogue of Somatic Mutations in Cancer): a resource to investigate acquired mutations in human cancer. Nucleic Acids Res. 2010 (pg. D652–D657).

4. Forbes, Bhamra, Bamford, et al. The Catalogue of Somatic Mutations in Cancer (COSMIC). Curr. Protoc. Hum. Genet. 2008, Chapter 10, 11.

5. Petitjean, Mathe, Kato, et al. Impact of mutant p53 functional properties on TP53 mutation patterns and tumor phenotype: lessons from recent developments in the IARC TP53 database. Hum. Mutat. 2007 (pg. 622–629).

6. Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008, vol. 455 (pg. 1061–1068).

7. Sjöblom, Jones, Wood, et al. The consensus coding sequences of human breast and colorectal cancers. Science 2006, vol. 314 (pg. 268–274).

8. Parsons, Jones, Zhang, et al. An integrated genomic analysis of human glioblastoma multiforme. Science 2008, vol. 321 (pg. 1807–1812).

9. Ding, Getz, Wheeler, et al. Somatic mutations affect key pathways in lung adenocarcinoma. Nature 2008, vol. 455 (pg. 1069–1075).

10. International Cancer Genome Consortium, Hudson, Anderson, et al. International network of cancer genome projects. Nature 2010, vol. 464 (pg. 993–998).

11. Stein, Mungall, Shu, et al. The generic genome browser: a building block for a model organism system database. Genome Res. 2002 (pg. 1599–1610).

12. Flicek, Aken, Ballester, et al. Ensembl's 10th year. Nucleic Acids Res. 2010 (pg. D557–D562).

13. Sharma, Haber, Settleman. Cell line-based platforms to evaluate the therapeutic efficacy of candidate anticancer agents. Nat. Rev. Cancer 2010 (pg. 241–253).

14. Jänne, Gray, Settleman. Factors underlying sensitivity of cancers to small-molecule kinase inhibitors. Nat. Rev. Drug Discov. 2009 (pg. 709–723).

15. McDermott, Sharma, Settleman. High-throughput lung cancer cell line screening for genotype-correlated sensitivity to an EGFR kinase inhibitor. Methods Enzymol. 2008, vol. 438 (pg. 331–341).

16. Vizcaíno, Reisinger, Côté, et al. PRIDE and "Database on Demand" as valuable tools for computational proteomics. Methods Mol. Biol. 2011, vol. 696 (pg. 105).