AntiFam: a tool to help identify spurious ORFs in protein annotation Open Access

Table 1.

AntiFam entries derived from Pfam families

Pfam accession number (identifier)	Last Pfam release present	Reason for deleting from Pfam	No. of matches in UniProt	No. of matches in metagenomics data set^a
PF07612 (DUF1575)	15.0	Proteins may not be expressed. Evidence for homology to known protein on opposite strand	3	0
PF07616 (DUF1578)	15.0	Proteins may not be expressed. Evidence for homology to known protein on opposite strand	6	6
PF07630 (DUF1591)	15.0	Proteins may not be expressed. Evidence for homology to known protein on opposite strand	6	0
PF07633 (DUF1594)	15.0	Proteins may not be expressed. Evidence for homology to known protein on opposite strand	5	0
PF11370 (DUF3170)	25.0	Probable shadow ORF of Clp protease	16	7
PF11194 (DUF2825)	25.0	Probable CRISPR^b repeat regions	159	18
PF11664 (DUF3264)	25.0	Probable CRISPR repeat regions	21	13
PF10695 (Cw-hydrolase)	25.0	Antisense to rRNA (9)	225	1,654
PF10919 (DUF2699)	26.0	Shadow ORF of PF00665 (integrase core domain 1)	25	11
PF07641 (DUF1596)	26.0	Dubious genome annotation. Family comprises only three sequences from Rhodopirellula baltica, two overlapping	3	0

The final two columns show the number of matches of each AntiFam entry to UniProtKB and to a metagenomic data set.

^aThe metagenomic set of sequences is the same as that used by Pfam (14).

^bCRISPR, Clustered Regularly Interspaced Short Palindromic Repeats.

Table 2.

AntiFam entries derived from custom multiple sequence alignment

Identifier	Type of spurious family	No. of matches in UniProt	No. of matches in metagenomics data set^a
Spurious_ORF_10	Translated bacterial tRNA, tRNA01	196	795
Spurious_ORF_11	Translated bacterial tRNA, tRNA02	89	170
Spurious_ORF_12	Translated bacterial tRNA, tRNA03	143	408
Spurious_ORF_13	Translated bacterial tRNA, tRNA04	77	671
Spurious_ORF_14	Translated bacterial tRNA, tRNA05	156	191
Spurious_ORF_15	Translated bacterial tRNA, tRNA06	31	63
Spurious_ORF_16	Translated bacterial tRNA, tRNA07	40	17
Spurious_ORF_17	Translated bacterial tRNA, tRNA08	5	10
Spurious_ORF_18	Translated bacterial tRNA, tRNA09	4	39
Spurious_ORF_19	Translated bacterial tRNA, tRNA10	7	12
Spurious_ORF_20	Translated bacterial tRNA, tRNA11	43	28
Spurious_ORF_21	PrfB frameshift	24	5
Spurious_ORF_22	From a lncRNA, LINC00174	26	1

^aThe metagenomic set of sequences is the same as that used by Pfam (14).

AntiFam is aimed primarily at bioinformaticians, for use as part of genome annotation projects. We have therefore not implemented a standalone website for viewing AntiFam entries. The AntiFam alignments and profile HMMs can be downloaded from the following URL: ftp://ftp.sanger.ac.uk/pub/databases/Pfam/AntiFam/
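As a rough illustration of how the downloaded profile HMMs might be used in an annotation pipeline, the sketch below parses the tabular output of an HMMER3 `hmmsearch` run and collects the sequences that match any AntiFam model. The command shown in the comment, the file names and the function name `antifam_hits` are assumptions made for this example and are not part of the AntiFam distribution.

```python
from collections import defaultdict

# Assumed upstream command (file names are hypothetical):
#   hmmsearch --cut_ga --tblout antifam.tbl AntiFam.hmm proteins.fasta
# --cut_ga asks HMMER to apply the gathering thresholds stored in each
# model, provided the models carry them.


def antifam_hits(tblout_lines):
    """Map each matching sequence to a list of (model, E-value, score) hits.

    Expects lines in HMMER3 --tblout format: whitespace-separated columns
    with the target (sequence) name first, the query (model) name third and
    the full-sequence E-value and bit score in columns five and six.
    """
    hits = defaultdict(list)
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        if len(fields) < 6:
            continue  # ignore malformed rows rather than guessing
        seq_id, model = fields[0], fields[2]
        evalue, score = float(fields[4]), float(fields[5])
        hits[seq_id].append((model, evalue, score))
    return dict(hits)


if __name__ == "__main__":
    with open("antifam.tbl") as handle:  # hypothetical output file
        for seq_id, matches in sorted(antifam_hits(handle).items()):
            models = ", ".join(model for model, _, _ in matches)
            print(f"{seq_id}\tcandidate spurious ORF\t{models}")
```

Sequences reported this way are candidates for review or removal rather than confirmed errors; how a hit is handled is a decision for each annotation project.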

Of the 1310 proteins identified in UniProtKB as probably spurious, the large majority were from TrEMBL, the unreviewed part of UniProtKB. No annotator had therefore been involved in creating these entries: they had been generated automatically from records in the European Nucleotide Archive, GenBank or the DNA Data Bank of Japan (DDBJ). These protein entries are being checked for removal from UniProtKB. One spurious protein found in the reviewed Swiss-Prot section of UniProtKB was Y114_CHLMU (Q9PLI5), an uncharacterized protein from Chlamydia muridarum. It belonged to the previously mentioned spurious Cw-hydrolase family and was removed in UniProt release 2011_10. An additional 13 spurious proteins were identified in the reviewed portion of UniProtKB, of which 8 are due to translations of non-coding RNA:

O67358.1 Aquifex aeolicus Trigger factor contains frameshift extension;
P19773.1 Mycobacterium tuberculosis protein matching DUF2699;
P47080.1 yeast protein YJL007C product of a dubious gene prediction;
P92540.1 Arabidopsis protein;
Q04100.1 yeast protein YDR445C product of dubious gene prediction and partly overlaps YDR444W;
Q52M62.3 human product of a dubious coding sequence (CDS) prediction. Probable non-coding RNA;
Q6ZQT7.1 human product of a dubious CDS prediction. Probable non-coding RNA;
Q6ZRM9.1 human product of a dubious CDS prediction. Probable non-coding RNA;
Q75L30.1 human product of a dubious CDS prediction. Probable non-coding RNA;
Q9CJR2.1 Pasteurella multocida tRNA-derived match;
Q9CMD0.1 P. multocida tRNA-derived match;
Q9CMX0.1 P. multocida tRNA-derived match; and
Q9CMZ6.1 P. multocida tRNA-derived match.

Identification of problematic Pfam families

In addition to the families reported by Pfam users, we tried to identify whether further spurious families existed. The large majority of proteins in the TrEMBL portion of UniProtKB come from translations found in entries in the European Nucleotide Archive, GenBank or DDBJ. We therefore scanned TrEMBL to identify UniProtKB entries whose coding regions overlapped one another within the same nucleotide entry. We confined the scan to prokaryotic entries because the nature of overlaps there is relatively simple compared with the complex patterns of interlacing and nesting found in eukaryotic gene structures. The scan identified 73 853 overlapping proteins. This list of proteins was then used to identify further Pfam families that contained numerous overlapping genes. We ordered the Pfam families by the fraction of overlapping proteins found within each family. This list can be found in Supplementary Table S1. Ranking by fraction rather than by absolute count means that large, well-known families, which are likely to contain many overlaps by chance, do not appear at the top of the list.
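The two steps described above, flagging proteins whose coding regions overlap on the same nucleotide entry and then ranking families by the fraction of flagged members, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the input layout, the function names `overlapping_proteins` and `rank_families`, and the choice to count any shared base on either strand as an overlap are all hypothetical.

```python
from collections import defaultdict


def overlapping_proteins(cds):
    """Return the ids of proteins whose coding region overlaps another one.

    `cds` is a list of (protein_id, nucleotide_entry, start, end) tuples
    with 1-based inclusive coordinates. Strand is ignored, so overlaps on
    the same or the opposite strand are both counted (an assumption).
    """
    by_entry = defaultdict(list)
    for pid, entry, start, end in cds:
        by_entry[entry].append((min(start, end), max(start, end), pid))

    flagged = set()
    for features in by_entry.values():
        features.sort()
        # Sweep left to right, remembering the feature that reaches furthest.
        reach_end, reach_pid = -1, None
        for start, end, pid in features:
            if reach_pid is not None and start <= reach_end:
                flagged.update((pid, reach_pid))
            if end > reach_end:
                reach_end, reach_pid = end, pid
    return flagged


def rank_families(family_members, flagged):
    """Order families by the fraction of their members that are flagged."""
    ranking = []
    for family, members in family_members.items():
        members = set(members)
        if not members:
            continue  # an empty family has no defined fraction
        fraction = len(members & flagged) / len(members)
        ranking.append((family, fraction, len(members)))
    # Highest fraction first; ties are broken by family name.
    ranking.sort(key=lambda row: (-row[1], row[0]))
    return ranking


if __name__ == "__main__":
    cds = [  # made-up coordinates, for illustration only
        ("P1", "entry1", 1, 300),
        ("P2", "entry1", 250, 600),
        ("P3", "entry1", 700, 900),
    ]
    families = {"FamA": ["P1", "P2"], "FamB": ["P3"]}
    for row in rank_families(families, overlapping_proteins(cds)):
        print(row)
```

Because the score is a fraction, a family with thousands of members and a handful of chance overlaps ranks below a small family in which every member overlaps another gene.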

Future plans

The first release of AntiFam contains only a modest number of families, but we see several ways to increase this. The first is to add more families based on non-coding RNAs. We currently have only one ribosomal RNA-based family and could add many more. Proteins related to ribosomal RNAs can be identified initially using tblastn, which compares a protein to a nucleotide sequence translated in all six reading frames. We could also compare a large database of RNA sequences to the protein sequence databases to identify further potentially spurious proteins. To date, we have only been able to investigate the Pfam families with the highest fraction of overlapping proteins. In the coming months, we will investigate this list more thoroughly to determine whether any further Pfam families should be deleted from Pfam and added to AntiFam.
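To make the idea of a six-frame comparison concrete, the sketch below translates a nucleotide sequence in all six reading frames, which is the conceptual step that allows a protein to be compared with an rRNA or other non-coding sequence. It is not how tblastn is implemented; the function names are hypothetical and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translate(sequence):
    """Return a dict mapping '+1'..'+3' and '-1'..'-3' to translations."""
    dna = sequence.upper().replace("U", "T")  # accept RNA input as well
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translate("ATGGCCTAA").items():
        print(frame, protein)
```

Stop codons are kept as `*` in the output, so a spurious ORF derived from an RNA gene would appear as a stretch between two stops in one of the six frames.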

Conclusions

The first release of AntiFam contains 23 families derived from Pfam as well as a small number of non-coding RNAs that were erroneously translated into protein sequences. We expect this number to grow in the future and have several ideas to help us achieve this. This should increase the power of AntiFam to reduce the number of spurious ORFs finding their way into the sequence databases. We hope that AntiFam will become an indispensable tool for quality control in metagenomic and genomic studies. We are particularly keen for biocurators and experimental biologists to remain vigilant and alert us to new cases of spurious ORFs so that we can add them to this resource.

Supplementary Data

Supplementary data are available at Database Online.

Funding

Wellcome Trust (grant number WT077044/Z/05/Z); National Human Genome Research Institute (grant number R01 HG004881). Funding for open access charge: Wellcome Trust (grant number WT077044/Z/05/Z).

Conflict of interest. None declared.

Acknowledgements

We are grateful to James Tripp from University of California Santa Cruz, who took the time to alert us to one of these spurious families.

References

1. Magrane, UniProt Consortium. UniProt Knowledgebase: a hub of integrated protein data. Database 2011, vol. 2011, bar009.

2. Brenner. Errors in genome annotation. Trends Genet. 1999 (pg. 132-133).

3. Schnoes, Brown, Dodevski, Babbitt. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput. Biol. 2009, pg. e1000605.

4. Bork, Bairoch. Go hunting in sequence databases but watch out for the traps. Trends Genet. 1996 (pg. 425-427).

5. Delcher, Bratke, Powers, Salzberg. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 2007 (pg. 673-679).

6. Veloso, Riadi, Aliaga, et al. Large-scale, multi-genome analysis of alternate open reading frames in bacteria and archaea. Omics 2005 (pg. 105).

7. Normark, Bergstrom, Edlund, et al. Overlapping genes. Annu. Rev. Genet. 1983 (pg. 499-525).

8. Punta, Coggill, Eberhardt, et al. The Pfam protein families database. Nucleic Acids Res. 2011 (pg. D290-D301).

9. Tripp, Hewson, Boyarsky, et al. Misannotations of rRNA can now generate 90% false positive protein matches in metatranscriptomic studies. Nucleic Acids Res. 2011 (pg. 8792-8802).

10. Kall, Krogh, Sonnhammer. A combined transmembrane topology and signal peptide prediction method. J. Mol. Biol. 2004, vol. 338 (pg. 1027-1036).

11. Wootton, Federhen. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 1996, vol. 266 (pg. 554-571).

12. Craigen, Cook, Tate, Caskey. Bacterial peptide chain release factors: conserved primary structure and possible frameshift regulation of release factor 2. Proc. Natl Acad. Sci. USA 1985 (pg. 3616-3620).

13. Eddy. A new generation of homology search tools based on probabilistic inference. Genome Inform. 2009 (pg. 205-211).