BactPepDB: a database of predicted peptides from an exhaustive survey of complete prokaryote genomes

BactPepDB entries by categories

	Small	Medium	Large
RefSeq	3946 (0.6)	143 157 (30.4)	362 395 (63.2)
Potential pseudogenes	201 228 (28.6)	51 752 (11.0)	11 743 (2.1)
Intergenic	324 533 (46.2)	189 573 (40.2)	121 897 (21.3)
Entity-overlapping	173 270 (24.6)	86 907 (18.4)	77 012 (13.4)


The three categories correspond to small (<30 amino acids), medium (30–50 amino acids) and large (>50 amino acids) peptide sizes. Peptides already annotated in RefSeq are distinguished from the BactPepDB newcomers, which are categorized as potential pseudogenes, intergenic or entity-overlapping. Fractions (%) of each size category are given in brackets.

Table 1.


Considering intra-genus and intra-order conservation, we observe that close to 184 000 of the new SCSs detected in intergenic regions are conserved to some extent across different species of a genus, whereas 112 000 of them are conserved across different taxonomic families of an order. Table 2 compares the fraction of conserved SCSs among these new intergenic SCSs with that among the SCSs already present in RefSeq. Interestingly, the fraction of intergenic SCSs that are conserved is similar to that of RefSeq, which suggests that the information carried by these newcomers is consistent with the preexisting annotations. The fraction of conserved peptides identified in intergenic regions appears stable across peptide sizes, whether conservation is considered across the species of a genus (intra-genus) or across the families of an order (intra-order). Overall, depending on the taxonomic level required for conservation to be considered significant, at least 18% of the new intergenic SCSs are conserved.

Table 2.

Conserved SCSs

		Small	Medium	Large
Intra-genus	RefSeq	978 (25)	40 913 (29)	152 429 (42)
Intra-genus	New intergenic SCSs	84 276 (26)	57 584 (30)	41 447 (34)
Intra-order	RefSeq	750 (19)	20 041 (14)	60 194 (17)
Intra-order	New intergenic SCSs	61 662 (19)	28 047 (15)	21 940 (18)


Numbers and fractions (% within brackets) of peptide entries that are conserved across species of a genus (intra-genus) and across families of an order (intra-order). Fractions are relative to the total number of entries in each category (see Table 1). The three categories correspond to small (<30 amino acids), medium (30–50 amino acids) and large (>50 amino acids) peptide sizes. Peptides already annotated in RefSeq are distinguished from the new intergenic SCSs of BactPepDB.
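The fractions in Table 2 can be rechecked from the counts in Tables 1 and 2. The following is a minimal Python sketch of that arithmetic; the dictionary layout and the function name `conserved_percentages` are illustrative choices, not part of BactPepDB.

```python
# Totals per size category, copied from Table 1.
TOTALS = {
    "RefSeq": {"small": 3946, "medium": 143157, "large": 362395},
    "New intergenic SCSs": {"small": 324533, "medium": 189573, "large": 121897},
}

# Conserved counts, copied from Table 2.
CONSERVED = {
    ("Intra-genus", "RefSeq"): {"small": 978, "medium": 40913, "large": 152429},
    ("Intra-genus", "New intergenic SCSs"): {"small": 84276, "medium": 57584, "large": 41447},
    ("Intra-order", "RefSeq"): {"small": 750, "medium": 20041, "large": 60194},
    ("Intra-order", "New intergenic SCSs"): {"small": 61662, "medium": 28047, "large": 21940},
}


def conserved_percentages(conserved, totals):
    """Return {(level, source): {size: percentage rounded to an integer}}."""
    return {
        (level, source): {
            size: round(100 * count / totals[source][size])
            for size, count in counts.items()
        }
        for (level, source), counts in conserved.items()
    }


PERCENTAGES = conserved_percentages(CONSERVED, TOTALS)
for key, row in PERCENTAGES.items():
    print(key, row)
```

Rounded to whole percentages, the recomputed values agree with the bracketed figures in Table 2.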


Evaluation of predictions

As the database contains predicted candidates, it is important to assess how well it can assist the identification of truly expressed peptides. One way to assess the added value of BactPepDB comes from experimental studies focusing on specific genomes. For instance, 14 CDSs that were missing from the initial annotation of Vibrio splendidus LGP32 were recently uncovered (57). We found that 12 of our predictions overlap these missing CDSs. To assess this on a larger scale, we also compared two versions of BactPepDB built on two versions of the RefSeq database (dated 11 June 2013 and 30 September 2013). For the genomes common to both versions, 125 newly annotated peptides of 10 to 80 amino acids were added, 33 of which are not of the ‘predicted’ kind and were biologically confirmed. BactGeneSHOW had correctly predicted 89 of these newcomers in the previous version of BactPepDB, 24 of which are now biologically confirmed. About 70% of the newly annotated peptides were therefore already present in BactPepDB before making it into the RefSeq database. Among those predicted peptides, 83 are conserved across different species and only six are unique in their respective order, which supports peptide conservation as a good indicator of the likelihood of expression.

As the core of BactPepDB relies on BactGeneSHOW, we also ran other gene prediction programs over these genomes to assess BactGeneSHOW performance. GeneMarkHMM 2.6 was able to retrieve 94 of these newcomers, whereas Prodigal 2.5 could only find 38 of them. Although GeneMarkHMM 2.6 slightly outperformed BactGeneSHOW, it is important to note that GeneMarkHMM was apparently ineffective for some genomes, for instance Flavobacterium psychrophilum JIP02/86, where it detected none of the five newly annotated genes whereas BactGeneSHOW retrieved them. Indeed, GeneMarkHMM relies on precalculated heuristic models, which may not be suitable for all species, whereas BactGeneSHOW relies on a self-learning algorithm.
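The comparison between predictors amounts to a recall over the 125 newly annotated peptides. A minimal Python sketch of that calculation follows, using only the counts quoted above; the variable and function names are illustrative.

```python
NEWLY_ANNOTATED = 125  # peptides added between the two RefSeq versions

# Number of newcomers retrieved by each program, as reported in the text.
RETRIEVED = {
    "BactGeneSHOW": 89,
    "GeneMarkHMM 2.6": 94,
    "Prodigal 2.5": 38,
}


def recall(found, total):
    """Fraction of the reference peptides that a program retrieved."""
    return found / total


RECALL = {tool: recall(n, NEWLY_ANNOTATED) for tool, n in RETRIEVED.items()}

for tool, value in sorted(RECALL.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tool}: {value:.1%}")
# GeneMarkHMM 2.6: 75.2%
# BactGeneSHOW: 71.2%
# Prodigal 2.5: 30.4%
```

These are aggregate figures; as noted above, per-genome behaviour can differ markedly from the overall recall.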

Finally, another important point to assess is the expected proportion of false positives present in the database. Although this is a very difficult question to answer, we recently gained some insight through RNA deep sequencing data, which reveal smaller intergenic transcripts and mRNA extensions. Analysis of new transcripts from Escherichia coli str. K-12 substr. MG1655 (56, 58) showed that only 74 predicted sequences of BactPepDB overlap the 1094 potentially non-coding transcripts (ncRNA) and long 5′-UTR extensions detected in the intergenic regions of MG1655. This is consistent with expectations, because only a very small fraction of these 1094 transcribed regions is supposed to code for peptides.
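Counting predictions that overlap transcribed regions is an interval-overlap test. The sketch below shows one way to do it in Python. It assumes closed (start, end) coordinates on a single replicon and ignores strand, and the coordinates are hypothetical; how overlaps were actually computed for BactPepDB is not specified here.

```python
def overlaps(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) share a position."""
    return a[0] <= b[1] and b[0] <= a[1]


def count_overlapping(predictions, transcripts):
    """Count predictions that overlap at least one transcript."""
    return sum(any(overlaps(p, t) for t in transcripts) for p in predictions)


# Hypothetical coordinates, for illustration only.
predictions = [(100, 190), (500, 620), (900, 960)]
transcripts = [(150, 400), (1000, 1200)]

N_OVERLAPPING = count_overlapping(predictions, transcripts)
print(N_OVERLAPPING)  # 1
```

For genome-scale inputs, sorting both lists and sweeping through them once would avoid the quadratic comparison used in this sketch.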

Searching for Bactibase homologs

We illustrate here the use of BactPepDB to search for homologs of Bactibase (23) entries. Bactibase is, in our experience, the only database devoted to antimicrobial peptides for which the complete sequence collection could be downloaded. Among the 219 entries, 197 peptide sequences are between 10 and 80 residues long and contain no non-standard amino acids. The genomic information (chromosomal and plasmidic) corresponding to the genus/species was present in BactPepDB for 146 of them. However, this does not imply that the peptide itself should be present in BactPepDB: on the one hand, some variation between strains of a species can occur; on the other hand, some peptides result from the cleavage of preproteins larger than 80 residues, which are outside the scope of BactPepDB. A careful inspection of the literature reporting peptide identification for each Bactibase entry showed that the peptides not found in BactPepDB include 11 cases for which the preproteins are larger than 80 amino acids, and 37 cases for which no conclusion could be reached, either because the peptide sequence was not elucidated using genomic information or because chromosomal and plasmid encoding could not be distinguished. As a result, only 98 peptides were clearly within the scope of BactPepDB.
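The selection criteria above (length between 10 and 80 residues, standard amino acids only) can be expressed as a simple filter. The following Python sketch assumes inclusive length bounds and the 20 standard one-letter codes; the example sequences are hypothetical and the function name `in_scope` is illustrative.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def in_scope(seq, min_len=10, max_len=80):
    """True if seq has an allowed length and only standard residues.

    The inclusive bounds are an assumption; the text says 'between 10 and 80'.
    """
    seq = seq.upper()
    return min_len <= len(seq) <= max_len and set(seq) <= STANDARD_AA


# Hypothetical sequences, for illustration only.
examples = {
    "too_short": "MKTAY",
    "nonstandard": "MKTAYIAKQRXISFVKSHF",
    "ok": "MKTAYIAKQRQISFVKSHFSRQ",
}
IN_SCOPE = {name: in_scope(seq) for name, seq in examples.items()}
print(IN_SCOPE)

# Bookkeeping with the counts reported in the text.
with_genome = 146       # entries whose genus/species is in BactPepDB
large_preprotein = 11   # preprotein longer than 80 amino acids
inconclusive = 37       # origin of the peptide could not be established
CLEARLY_IN_SCOPE = with_genome - large_preprotein - inconclusive
print(CLEARLY_IN_SCOPE)  # 98
```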

A similarity-based search in BactPepDB, counting as a correct identification any hit in the same species with over 90% identity, identified 56 of them. RefSeq annotations were present for only 34 of these 56. Thus BactPepDB was able to infer new knowledge for 22 of the 98 cases, a gain of 22%. Furthermore, hits at lower sequence identity were found for 22 more peptides. BactPepDB was thus able to retrieve information for 76 of the 98 peptides. Note that not all strains of a species are expected to produce all antimicrobial peptides [see for instance (59–62)]. Overall, such results illustrate that re-annotating complete genomes with a method specialized for SCSs can have added value, at least as a preliminary step to be confronted with additional information.
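The acceptance rule used above (same species, over 90% identity) can be sketched as follows in Python. The identity definition (matches divided by aligned columns of a pre-computed pairwise alignment) is an assumption, since the text does not state how identity was computed; the sequences and function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity of two pre-aligned, equal-length sequences ('-' is a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns where both sequences have a gap.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if not (x == "-" and y == "-")]
    if not columns:
        raise ValueError("alignment has no usable columns")
    matches = sum(x == y and x != "-" for x, y in columns)
    return 100.0 * matches / len(columns)


def classify_hit(query_species, hit_species, identity, threshold=90.0):
    """Apply the rule from the text: same species and identity above the threshold."""
    if query_species == hit_species and identity > threshold:
        return "identified"
    return "weaker hit"


# Hypothetical aligned pair differing at the last residue.
IDENTITY = percent_identity("MKTAYIAKQRQISFVKSHFS", "MKTAYIAKQRQISFVKSHFT")
LABEL = classify_hit("Lactobacillus plantarum", "Lactobacillus plantarum", IDENTITY)
print(IDENTITY, LABEL)  # 95.0 identified
```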

Comparison with BAGEL, a database of predicted bacteriocins

We have also analysed the consistency of BactPepDB with BAGEL (24), a resource predicting bacteriocins from genomic data, over a collection of 15 genomes of different genera: Acaryochloris marina MBIC11017, Achromobacter xylosoxidans A8, Bacillus cereus AH187, Bacillus subtilis BSn5, Enterobacter cloacae SCF1, Escherichia coli W, Geobacillus sp C56 T3, Lactobacillus casei W56, Methanococcus voltae A3, Mycobacterium tuberculosis H37Rv, Mycobacterium tuberculosis RGTB327, Streptococcus pneumoniae AP200, Streptococcus thermophilus CNRZ1066, Vibrio parahaemolyticus RIMD 2210633, and Vibrio vulnificus CMCP6. Over these genomes, BAGEL returned 713 candidates. We also found that 395 of these candidates are longer than 80 amino acids. Of the 213 remaining candidates, only 89 are common to BAGEL and BactPepDB. This difference of 124 candidates is not surprising in itself, since BAGEL relies on Glimmer2 to identify candidates and does not consider the presence of an RBS, whereas BactGeneSHOW does; one can thus expect BactGeneSHOW to be more stringent. Among the 89 candidates identified by both BAGEL and BactPepDB, 55 are annotated in RefSeq (and in BactPepDB), of which seven are known bacteriocins, the others being hypothetical bacteriocins. None of the remaining 124 candidates proposed by BAGEL is annotated in RefSeq. Thus, accepting the RefSeq annotation as a criterion to validate the candidates (note that not all RefSeq entries are biologically confirmed), we find that BactPepDB proposes a narrower set of candidates without discarding any true positive.
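The comparison above is essentially a set computation: restrict the BAGEL candidates to the size range covered by BactPepDB, then split them into shared and BAGEL-only candidates and check each group against RefSeq. The Python sketch below illustrates this with hypothetical identifiers; matching candidates by a shared identifier is an assumption made for the example, and `compare_candidates` is an illustrative name.

```python
def compare_candidates(bagel, bactpepdb, refseq, max_len=80):
    """Compare candidate sets.

    bagel: dict mapping candidate id -> length in amino acids.
    bactpepdb, refseq: sets of candidate ids.
    """
    in_scope = {cid for cid, length in bagel.items() if length <= max_len}
    shared = in_scope & bactpepdb
    bagel_only = in_scope - bactpepdb
    return {
        "in_scope": in_scope,
        "shared": shared,
        "bagel_only": bagel_only,
        "shared_in_refseq": shared & refseq,
        "bagel_only_in_refseq": bagel_only & refseq,
    }


# Hypothetical identifiers and lengths, for illustration only.
bagel = {"c1": 45, "c2": 120, "c3": 60, "c4": 78}
bactpepdb = {"c1", "c4", "x9"}
refseq = {"c1"}

RESULT = compare_candidates(bagel, bactpepdb, refseq)
for name, ids in RESULT.items():
    print(name, sorted(ids))
```

With the counts reported in the text, `shared` would hold 89 candidates and `bagel_only` 124, with 55 of the shared candidates and none of the BAGEL-only candidates annotated in RefSeq.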

Conclusions and Future Directions

BactPepDB is a database of predicted peptides from an exhaustive survey of complete prokaryote genomes. Because BactGeneSHOW is a generic approach to the search for SCSs, applied across the complete spectrum of prokaryotes from archaea to bacteria and the diversity within each category, some of the truly expressed SCSs are expected to go undetected owing to variability in start codons and codon usage. Genome coding specificity, particularly that existing for bacteria and archaea, could be integrated in BactGeneSHOW, but this remains the subject of further work. Even so, our analyses show that BactPepDB already retrieves a large part of the previously annotated biological peptides that fall within the scope of the database. BactPepDB could be improved in several other directions. At present, its scope precludes important sources of prokaryotic information, such as genomes with unusual codons and the incomplete genomes available in RefSeq or other databases, from which it should be possible to learn more about the degree of conservation of candidates. In particular, it could be of interest to add data from the Ensembl Bacteria database (63), as it contains, on average, more strains per species. Another limitation is the inability to detect peptides resulting from the maturation of large proteins, which is presently beyond the scope of BactPepDB.

Accepting these limitations, BactPepDB still appears to contain new knowledge about SCSs compared with previous RefSeq entries. Although it is difficult to assess exactly how many candidate peptides may be expressed under some physiological conditions, or may have a biological activity, BactPepDB provides a rather unique panorama of SCSs over the complete collection of genomes available, both at the level of individual sequences and in terms of their conservation across genera. The nearly 18% of BactPepDB newcomers that are conserved to some extent could be seeds for further investigations. Because small peptides are more difficult to detect by biochemical analyses, BactPepDB is expected to assist the experimental discovery of new bioactive peptides.

Acknowledgements

The authors thank Pierre Nicolas and the MIG team for making available their BactGeneSHOW program, the eBio platform (Université Paris-Sud) for RNA-seq data and analysis, and F. Guyon and J. Muzard for useful discussions.

Funding

INSERM-University Paris Diderot UMR-S 973 and IBiSA (for deployment on the RPBS platform).

Conflict of interest: None declared.

References

1. Vlieghe, Lisowski, Martinez et al. (2010) Synthetic therapeutic peptides: science and market. Drug Discov. Today.
2. Audie, Boyd (2010) The synergistic use of computation, chemistry and biology to discover novel peptide-based drugs: the time is right. Curr. Pharm. Des., 567–582.
3. Pan C.Q., Buxton J.M., Yung S.L. et al. (2006) Design of a long acting peptide functioning as both a glucagon-like peptide-1 receptor agonist and a glucagon receptor antagonist. J. Biol. Chem., 281, 12506–12515.
4. LaBelle J.L., Katz S.G., Bird G.H. et al. (2012) A stapled BIM peptide overcomes apoptotic resistance in hematologic cancers. J. Clin. Invest., 122, 2018–2031.
5. Elmagbari N.O., Egleton R.D., Palian M.M. et al. (2004) Antinociceptive structure-activity studies with enkephalin-based opioid glycopeptides. J. Pharmacol. Exp. Ther., 311, 290–297.
6. Svensen, Walton J.G.A., Bradley (2012) Peptides for cell-selective drug delivery. Trends Pharmacol. Sci., 186–192.
7. Hancock R.E.W., Sahl H.-G. (2006) Antimicrobial and host-defense peptides as new anti-infective therapeutic strategies. Nat. Biotechnol., 1551–1557.
8. Chen, Swem L.R., Swem D.L. et al. (2011) A strategy for antagonizing quorum sensing. Mol. Cell, 199–209.
9. Vetter, Davis J.L., Rash L.D. et al. (2011) Venomics: a new paradigm for natural products-based drug discovery. Amino Acids.
10. Montrose, Yang, Sun et al. (2013) Xentry, a new class of cell-penetrating peptide uniquely equipped for delivery of drugs. Sci Rep., 1661.
11. Landon L.A., Zou, Deutscher S.L. (2004) Is phage display technology on target for developing peptide-based cancer drugs? Curr. Drug Discov. Technol., 113–132.
12. P.S., Mui E.Y.Y. et al. (2009) Phage display screening against a set of targets to establish peptide-based sugar mimetics and molecular docking to predict binding site. Bioorg. Med. Chem., 4825–4832.
13. Lam K.S. (1997) Application of combinatorial library methods in cancer research and drug discovery. Anticancer Drug Des., 145–167.
14. Marani M.M., Ceron M.C.M., Giudicessi S.L. et al. (2009) Screening of one-bead-one-peptide combinatorial library using red fluorescent dyes. Presence of positive and false positive beads. J. Comb. Chem., 146–150.
15. Ricklin, Lambris J.D. (2008) Compstatin: a complement inhibitor on its way to clinical application. Adv. Exp. Med. Biol., 632, 273–292.
16. Seidah N.G., Chrétien (1999) Proprotein and prohormone convertases: a family of subtilases generating diverse bioactive polypeptides. Brain Res., 848.
17. Ibrahim, Nicolas, Bessières et al. (2007) A genome-wide survey of short coding sequences in streptococci. Microbiology, 153, 3631–3644.
18. Warren A.S., Archuleta, Feng W.-C. et al. (2010) Missing genes in the annotation of prokaryotic genomes. BMC Bioinformatics, 131.
19. Pruitt K.D., Tatusova, Maglott D.R. (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., D61–D65.
20. Vuyst L.D., Avonts, Neysens et al. (2004) The lactobin a and amylovorin L471 encoding genes are identical, and their distribution seems to be restricted to the species Lactobacillus amylovorus that is of interest for cereal fermentations. Int. J. Food Microbiol., –106.
21. Fjell C.D., Hancock R.E.W., Cherkasov (2007) Amper: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics, 1148–1155.
22. Wang (2009) APD2: the updated antimicrobial peptide database and its application in peptide design. Nucleic Acids Res., D933–D937.
23. Hammami, Zouhir, Hamida J.B. et al. (2007) Bactibase: a new web-accessible database for bacteriocin characterization. BMC Microbiol.
24. de Jong, van Heel A.J., Kok et al. (2010) BAGEL2: mining for bacteriocins in genomic data. Nucleic Acids Res., W647–W651.
25. Thomas, Karnik, Barai R.S. et al. (2010) CAMP: a useful resource for research on antimicrobial peptides. Nucleic Acids Res., D774–D780.
26. Sundararajan V.S., Gabere M.N., Pretorius et al. (2012) DAMPD: a manually curated antimicrobial peptide database. Nucleic Acids Res., D1108–D1112.
27. Piotto S.P., Sessa, Concilio et al. (2012) YADAMP: yet another database of antimicrobial peptides. Int. J. Antimicrob. Agents, 346–351.
28. Jehl M.-A., Arnold, Rattei (2011) Effective – a database of predicted secreted bacterial proteins. Nucleic Acids Res., D591–D595.
29. Wynendaele, Bronselaer, Nielandt et al. (2013) Quorumpeps database: chemical space, microbial origin and functionality of quorum sensing peptides. Nucleic Acids Res., D655–D659.
30. Choo K.H., Tan T.W., Ranganathan (2005) SPdb – a signal peptide database. BMC Bioinformatics, 249.
31. Novković, Simunić, Bojović et al. (2012) DADP: the database of anuran defense peptides. Bioinformatics, 1406–1407.
32. Whitmore, Chugh J.K., Snook C.F. et al. (2003) The peptaibol database: a sequence and structure resource. J. Pept. Sci., 663–665.
33. Gautam, Singh, Tyagi et al. (2012) CPPsite: a curated database of cell penetrating peptides. Database, 2012: article ID bas015; doi:10.1093/database/bas015.
34. Caboche, Pupin, Leclère et al. (2008) NORINE: a database of nonribosomal peptides. Nucleic Acids Res., D326–D331.
35. Shtatland, Guettler, Kossodo et al. (2007) Pepbank – a database of peptides based on sequence text mining and public peptide data sources. BMC Bioinformatics, 280.
36. Zamyatnin A.A., Borchikov A.S., Vladimirov M.G. et al. (2006) The EROP-Moscow oligopeptide database. Nucleic Acids Res., D261–D266.
37. Vanhee, Reumers, Stricher et al. (2010) PepX: a structural database of non-redundant protein-peptide complexes. Nucleic Acids Res., D545–D551.
38. London, Movshovitz-Attias, Schueler-Furman (2010) The structural basis of peptide-protein binding strategies. Structure, 188–199.
39. Nicolas, Bize, Muri et al. (2002) Mining Bacillus subtilis chromosome heterogeneities using hidden Markov models. Nucleic Acids Res., 1418–1426.
40. Altschul S.F., Gish, Miller et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
41. Wattam A.R., Williams K.P., Snyder E.E. et al. (2009) Analysis of ten Brucella genomes reveals evidence for horizontal gene transfer despite a preferred intracellular lifestyle. J. Bacteriol., 191, 3569–3579.
42. Gevers, Cohan F.M., Lawrence J.G. et al. (2005) Opinion: re-evaluating prokaryotic species. Nat. Rev. Microbiol., 733–739.
43. Ward D.M., Cohan F.M., Bhaya et al. (2008) Genomics, environmental genomics and the issue of microbial species. Heredity (Edinb), 100, 207–219.
44. Konstantinidis K.T., Tiedje J.M. (2005) Towards a genome-based taxonomy for prokaryotes. J. Bacteriol., 187, 6258–6264.
45. Konstantinidis K.T., Tiedje J.M. (2005) Genomic insights that advance the species definition for prokaryotes. Proc. Natl. Acad. Sci. U. S. A., 102, 2567–2572.
46. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195–202.
47. Camproux A.C., Gautier, Tufféry (2004) A Hidden Markov Model derived structural alphabet for proteins. J. Mol. Biol., 339, 591–605.
48. Berman H.M., Westbrook, Feng et al. (2000) The protein data bank. Nucleic Acids Res., 235–242.
49. Cheng, Saigo, Baldi (2006) Large-scale prediction of disulphide bridges using kernel methods, two-dimensional recursive neural networks, and weighted graph matching. Proteins, 617–629.
50. Krogh, Larsson, von Heijne et al. (2001) Predicting transmembrane protein topology with a Hidden Markov Model: application to complete genomes. J. Mol. Biol., 305, 567–580.
51. Petersen T.N., Brunak, von Heijne et al. (2011) SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat. Methods, 785–786.
52. Mooney, Haslam N.J., Pollastri et al. (2012) Towards the improved discovery and design of functional peptides: common features of diverse classes permit generalized prediction of bioactivity. PLoS One, e45012.
53. Lata, Mishra N.K., Raghava G.P.S. (2010) AntiBP2: improved version of antibacterial peptide prediction. BMC Bioinformatics, S19.
54. Sievers, Wilm, Dineen et al. (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol., 539.
55. Crooks G.E., Hon, Chandonia J.-M. et al. (2004) WebLogo: a sequence logo generator. Genome Res., 1188–1190.
56. Toffano-Nioche, Nguyen A.N., Kuchly et al. (2012) Transcriptomic profiling of the oyster pathogen Vibrio splendidus opens a window on the evolutionary dynamics of the small RNA repertoire in the Vibrio genus. RNA, 2201–2219.
57. Raghavan, Sloan D.B., Ochman (2012) Antisense transcription is pervasive but rarely conserved in enteric bacteria. MBio, e000156-12.
58. Toffano-Nioche, Luo, Kuchly et al. (2013) Detection of non-coding RNA in bacteria and archaea using the DETR’PROK Galaxy pipeline. Methods.
59. Stephens S.K., Floriano, Cathcart D.P. et al. (1998) Molecular analysis of the locus responsible for production of plantaricin S, a two-peptide bacteriocin produced by Lactobacillus plantarum LPCO10. Appl. Environ. Microbiol., 1871–1877.
60. Maldonado, Jiménez-Díaz, Ruiz-Barba J.L. (2004) Induction of plantaricin production in Lactobacillus plantarum NC8 after coculture with specific gram-positive bacteria is mediated by an autoinduction mechanism. J. Bacteriol., 186, 1556–1564.
61. Woodruff W.A., Novak, Caufield P.W. (1998) Sequence analysis of mutA and mutM genes involved in the biosynthesis of the lantibiotic mutacin II in Streptococcus mutans. Gene, 206.
62. Ross K.F., Ronson C.W., Tagg J.R. (1993) Isolation and characterization of the lantibiotic salivaricin a and its structural gene sala from Streptococcus salivarius 20P3. Appl. Environ. Microbiol., 2014–2021.