LPS-annotate: complete annotation of compositionally biased regions in the protein knowledgebase

Abstract

Compositional bias (i.e. a skew in the composition of a biological sequence towards a subset of residue types) can occur at a wide variety of scales, from compositional biases of whole genomes, down to short regions in individual protein and gene–DNA sequences that are compositionally biased (CB regions). Such CB regions are made from a subset of residue types that are strewn along the length of the region in an irregular way. Here, we have developed the database server LPS-annotate, for the analysis of such CB regions, and protein disorder in protein sequences. The algorithm defines compositional bias through a thorough search for lowest-probability subsequences (LPSs) (i.e., the least likely sequence regions in terms of composition). Users can (i) initially annotate CB regions in input protein or nucleotide sequences of interest, and then (ii) query a database of greater than 1 500 000 pre-calculated protein-CB regions, for investigation of further functional hypotheses and inferences, about the specific CB regions that were discovered, and their protein disorder propensities. We demonstrate how a user can search for CB regions of similar compositional bias and protein disorder, with a worked example. We show that our annotations substantially augment the CB-region annotations that already exist in the UniProt database, with more comprehensive annotation of more complex CB regions. Our analysis indicates tens of thousands of CB regions that do not comprise globular domains or transmembrane domains, and that do not have a propensity to protein disorder, indicating a large cohort of protein-CB regions of biophysically uncharacterized types. This server and database is a conceptually novel addition to the workbench of tools now available to molecular biologists to generate hypotheses and inferences about the proteins that they are investigating. It can be accessed at http://libaio.biol.mcgill.ca/lps-annotate.html.

Database URL: http://libaio.biol.mcgill.ca/lps-annotate.html

Introduction

Development of tools for automated biological-sequence annotation is imperative, particularly now that more than 1500 complete genomes have been sequenced and assembled. One important problem is the comprehensive annotation of compositionally biased (CB) regions in biological sequences. CB regions are sequence stretches with a large fraction of a small subset of residue types. If the CB regions are biased for multiple amino-acid residue types that are strewn along the sequence in an irregular way, the boundaries of these regions can be difficult to define. A well-known CB case arises in the yeast prions, which tend to contain CB regions made from glutamine and asparagine residue types (1). Other examples include the arginine-/serine-rich regions in some RNA-binding proteins (2), and the proline-rich domain of the transcriptional-complex protein Ssdp1, which is responsible for transactivation and is found in other diverse contexts (3).

A specific type of CB region is the ‘intrinsically disordered’ (ID) protein or domain. ID regions lack a globular 3D structure, and are unfolded in their native states (4,5). Work on disordered-region annotation has been extensive, with several algorithms being developed (5–10). ID proteins can function in signalling and regulation, and are associated with post-translational modifications, such as phosphorylation sites (5,11,12). The link between ID and CB regions, and simple repetitive regions has been demonstrated, with CB regions of a certain degree of disorder having distinct compartmentalizations and functional category tendencies (13–15).

Several algorithms have been derived previously to annotate CB regions, with the primary goal of ‘masking’ such regions before sequence alignment, to avoid incorrect inference of homology [e.g. SEG, (16); and CAST (17)]. To facilitate the automated annotation of all possible CB regions, we have developed a server, called LPS-annotate (LPS stands for Lowest Probability Subsequence; see the algorithm summary below for further details). This algorithm annotates CB regions in both protein and nucleotide sequences. Previously, we reported the development of this algorithm for the exhaustive assignment of CB regions (1,14,18). The chief novelty of this procedure is that CB regions of multiple amino acid residue types can be assigned thoroughly and completely, with clearly optimized boundaries.

Here, we report the development of a server and an online database of annotations based on the latest version of this algorithm. First, the server can be used to apply the LPS algorithm to the annotation of both protein and nucleotide sequences. Second, after determining the biases in a sequence, a database of more than 1 500 000 pre-calculated CB regions and regions of predicted protein disorder (PPD) can be queried for regions of the same type and protein disorder content, for investigation of further functional hypotheses and inferences. In the database portion of the website, CB-region annotations have been pre-calculated for the UniProt/SwissProt database (19). CB annotations are cross-referenced with default SEG annotations of low-complexity regions (16), and also with predictions of disordered regions in proteins [made using DISOPRED2 (20)].

Methods

LPS algorithm

In brief, the LPS algorithm scans along the input sequence in a decreasing series of window sizes, the maximum (W_max) and minimum (W_min) of which are specified by the user. For each residue type x, and for the range of window sizes (W_min ≤ w ≤ W_max), the input sequence is searched for stretches that have compositional bias of the lowest probability (P_min):

\[ P_{\min} = \min_{i,\; W_{\min} \le w \le W_{\max}} P_{\mathrm{bias}}(i, w) \qquad (1) \]

where i is each possible start position for a window w in the sequence, spaced according to the user-specified parameter S (step size). The probability P_bias(i, w) in Equation (1) is given by a binomial distribution:

\[ P_{\mathrm{bias}}(i, w) = \binom{w}{n} \, f_x^{\,n} \, (1 - f_x)^{\,w - n} \qquad (2) \]

where f_x is the proportion of amino-acid type x as given by the database amino-acid composition. The count for x is denoted n in the window w starting at position i. Sequence stretches with P_min are termed lowest-probability subsequences (LPSs).
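The window search described by Equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration for a single residue type, not the server's implementation: the function names are hypothetical, the probability is taken as the standard binomial probability of exactly n residues in a window of length w, and windows are only scored when the residue is enriched (n > f_x · w), which is an assumption made here so that depleted windows are not reported as biased.

```python
from math import comb


def binomial_p(n, w, f_x):
    """Binomial probability of exactly n residues of type x in a window of length w."""
    return comb(w, n) * f_x**n * (1.0 - f_x) ** (w - n)


def lowest_probability_subsequence(seq, residue, f_x, w_min, w_max, step=1):
    """Return (p_min, start, window_length) for one residue type.

    start is a 0-based index. Assumption (not from the paper): only windows
    enriched for the residue (n > f_x * w) are scored.
    """
    best = (1.0, None, None)
    # Scan a decreasing series of window sizes, as described in the text.
    for w in range(min(w_max, len(seq)), w_min - 1, -1):
        # Start positions are spaced by the step size S.
        for i in range(0, len(seq) - w + 1, step):
            n = seq[i:i + w].count(residue)
            if n <= f_x * w:
                continue
            p = binomial_p(n, w, f_x)
            if p < best[0]:
                best = (p, i, w)
    return best


# Toy input: a run of ten glutamines flanked by alanines; 0.04 is an
# illustrative background frequency, not a value taken from UniProt.
toy = "A" * 10 + "Q" * 10 + "A" * 10
p_min, start, length = lowest_probability_subsequence(toy, "Q", 0.04, 5, 30)
print(p_min, start, length)
```

Recounting each window from scratch keeps the sketch short; a production scan would update the count incrementally as the window slides.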

To calculate biases derived from any number of residue types thoroughly for a given protein sequence, the following iterative process is performed. P_min values are calculated for any set of amino acids {xyz…}, by summing up the number of residues over the whole residue-type set. However, biases are only picked in preference over a previously calculated bias made by a smaller number of residue types, if their P_min-values are smaller. The set of residue types contributing to the bias (sorted in decreasing order of their original P_min values), is defined as the ‘CB signature’. The iterative procedure is performed until convergence. Using this procedure, regions that comprise mild bias for multiple residue types can be detected as significantly biased. Further details of the algorithm are given on the help pages of the server [and in (14,18)].
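The iterative extension to several residue types can be illustrated with a greedy sketch. It is a simplification under stated assumptions rather than the published procedure: counts and background frequencies are pooled over the residue set, a residue type is added only if it lowers P_min, and the loop stops when no addition helps. The function names, the enrichment-only filter and the choice to list the most biased residue type first in the signature (as in the 'QH' example, where Q is the initial bias) are all assumptions of this sketch.

```python
from math import comb


def p_min_for_set(seq, residues, freqs, w_min, w_max, step=1):
    """Lowest binomial probability over all windows, pooling a set of residue types."""
    f = sum(freqs[r] for r in residues)  # pooled background frequency
    best = 1.0
    for w in range(min(w_max, len(seq)), w_min - 1, -1):
        for i in range(0, len(seq) - w + 1, step):
            n = sum(1 for c in seq[i:i + w] if c in residues)
            if n > f * w:  # assumption: score enriched windows only
                best = min(best, comb(w, n) * f**n * (1.0 - f) ** (w - n))
    return best


def cb_signature(seq, freqs, w_min, w_max, step=1):
    """Greedily grow a residue set while P_min keeps decreasing."""
    single = {r: p_min_for_set(seq, {r}, freqs, w_min, w_max, step) for r in freqs}
    order = sorted(single, key=single.get)  # most biased residue type first
    signature, best_p = [order[0]], single[order[0]]
    improved = True
    while improved:  # iterate until convergence
        improved = False
        for r in order:
            if r in signature:
                continue
            p = p_min_for_set(seq, set(signature) | {r}, freqs, w_min, w_max, step)
            if p < best_p:  # accept a larger set only if its P_min is smaller
                signature.append(r)
                best_p, improved = p, True
                break
    return "".join(sorted(signature, key=single.get)), best_p


# Illustrative background frequencies, not values from the database.
toy_freqs = {"Q": 0.04, "H": 0.02, "S": 0.07}
toy_seq = "MKTLLVAGSE" + "QQH" * 7 + "DEKLSAGTVM"
print(cb_signature(toy_seq, toy_freqs, w_min=5, w_max=40))
```

On this toy sequence neither Q nor H alone explains the 21-residue stretch as well as the pooled set does, which is the situation the text describes as mild bias for multiple residue types becoming significant together.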

Data analysed

The algorithm was run on the complete UniProt/SwissProt and UniProt/TrEMBL databases from July 2009 (19). The CB regions in the database are given a ranking, with the most biased (smallest P-value) being given a ranking of 1, and others given higher rankings.

Assignments of disordered regions in proteins were made using DISOPRED2 (20), with default settings. Other DISOPRED parameter settings are possible, but the disordered-region predictions annotated here are intended only as a guide, as a prelude to further detailed characterization by users of their proteins of interest. The fraction of each CB region that is predicted to be disordered was calculated from these assignments, and is displayed in the database entries. Also, the mean disorder propensity of each CB region was calculated by averaging the disorder propensity values for all residues in the CB region. Disorder propensity values (P_diso,X) were calculated from the DisProt database of known disordered regions (21,22), for each amino-acid residue type X from the following formula:

Assignment of globular domains was performed using blastp (e-value ≤ 1 x 10⁻⁴) (23) comparisons to the ASTRALSCOP non-redundant database of protein domains made with a threshold of 40% sequence identity (24). Annotations of transmembrane domains were taken from the ‘FT TRANSMEM’ records in the UniProt/SwissProt database (19). Existing UniProt/SwissProt annotations of CB regions were taken from the ‘FT COMPBIAS’ records.
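The two per-region disorder summaries described in this section (the mean disorder propensity and the fraction of a CB region predicted to be disordered) can be sketched as follows. The function names are hypothetical and the per-residue propensity values in the example are placeholders chosen for illustration; they are not the DisProt-derived values used in the database.

```python
def mean_disorder_propensity(region, propensity):
    """Average the per-residue-type propensity over every residue in a CB region."""
    values = [propensity[res] for res in region if res in propensity]
    if not values:
        raise ValueError("no residue in the region has a propensity value")
    return sum(values) / len(values)


def fraction_disordered(start, end, disordered_segments):
    """Fraction of positions start..end (1-based, inclusive) lying in predicted disorder.

    disordered_segments is a list of (start, end) pairs, e.g. parsed from a
    disorder predictor's output; overlapping segments are counted once.
    """
    covered = set()
    for seg_start, seg_end in disordered_segments:
        covered.update(range(max(seg_start, start), min(seg_end, end) + 1))
    return len(covered) / (end - start + 1)


# Placeholder propensities: values above 1.0 stand for disorder-promoting types.
toy_propensity = {"Q": 1.3, "H": 0.9}
print(mean_disorder_propensity("QQHH", toy_propensity))
print(fraction_disordered(10, 19, [(1, 12), (15, 30)]))
```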

Use of the database

The database can be used in a two-step process:

(i) LPS-annotate server: assignment of CB regions in a query sequence (either protein or nucleotide sequences);
(ii) Database of pre-calculated CB annotations for UniProt: searching the database of pre-calculated CB regions for functional inferences (this is, of course, for protein sequences only).

LPS-Annotate server

The LPS-annotate server can annotate both protein and nucleotide sequences for CB regions. Users can paste the input sequence into the query box provided, and select values for W_min, W_max and S (step size). A screenshot of an example of the output of this server is illustrated, for a glutamine-/histidine-rich protein (Figure 1). A help page is provided, which explains the functioning of the server, including recommended values for W_max and W_min. Typically, if a smaller W_max is used, there are two effects on the CB region annotations: (i) longer CB regions are broken up into shorter stretches (of up to approximately the size of W_max); (ii) subsidiary mild biases that can only be detected with longer window sizes are not considered. Thus, it is generally advisable to use the largest W_max (= 500 residues length).

Figure 1.

Screenshot of output of initial server portion of database. An example of the initial LPS-annotate program server output for the example PHO2, from budding yeast (P07269, PHO2_YEAST).


As shown in the Figure 1 example, for each CB region, the server output displays: (i) the protein name; (ii) the number of bias residues (i.e. those residue types that define the bias); (iii) the start and end points of the CB region; (iv) the CB region’s binomial P-value; (v) the CB signature; (vi) the mean disorder propensity for the CB region (calculated as described in ‘Methods’ section); (vii) the CB region subsequences. Other fields in the output are explained in the downloadable Help page. A link to ‘Download’ the data is given at the bottom of the page.

Database of pre-calculated CB annotations for UniProt

We have supplied a database of CB-region annotations for proteins in the June 2009 version of the Uniprot/SWISSPROT protein database (19). These were made using the parameter settings for the LPS algorithm (W_min = 25, W_max = 500 and S = 1). The annotations in the database are cross-referenced with: (i) low-complexity regions identified with SEG (16) (run with default settings); (ii) predictions of disordered regions made with the program DISOPRED2 (run with default settings) (20). It is important to note here that the default settings for the SEG program are designed for sequence masking as a prelude to sequence alignment, not for the annotation of compositional biases, which is the purpose of the presently described algorithm, LPS-annotate.

The database can be searched in three ways: (i) with a UniProt/SwissProt identifier; (ii) with a CB signature; or (iii) with a sequence, through a BLAST search interface (23). The CB-signature search capability is particularly useful for finding regions of similar compositional bias and protein disorder content. Such similar regions may help to infer functional linkages or hypotheses for a sequence that was initially input into the LPS-annotate server. In the output for each database search, a list of CB regions is given in increasing order of binomial P-value (Figure 2). Each CB-region name is a live link to the individual Database entry for that CB region (Figure 3). At the bottom of the page, a ‘Download’ link is provided, so that the user can download the list of similar CB regions (Figure 2).
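For readers working with a downloaded list of annotations, the CB-signature search can be mimicked locally. The sketch below assumes a hypothetical record layout (the `CBRegion` fields are modelled on the entry fields described in the text, not on the actual download format) and treats "similar" as an exact signature match, optionally filtered by mean disorder propensity; the example records are invented placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CBRegion:
    """Hypothetical in-memory record for one pre-calculated CB region."""
    accession: str
    start: int
    end: int
    signature: str
    p_value: float
    mean_disorder: float


def search_by_signature(regions, signature, min_disorder=None):
    """Return regions with the given CB signature, most biased (smallest P) first."""
    hits = [
        r for r in regions
        if r.signature == signature
        and (min_disorder is None or r.mean_disorder >= min_disorder)
    ]
    return sorted(hits, key=lambda r: r.p_value)


# Invented placeholder records, for illustration only.
toy_db = [
    CBRegion("EXAMPLE1", 70, 120, "QH", 1e-12, 1.20),
    CBRegion("EXAMPLE2", 5, 60, "QH", 1e-20, 0.95),
    CBRegion("EXAMPLE3", 200, 260, "SR", 1e-30, 1.10),
]
for hit in search_by_signature(toy_db, "QH", min_disorder=1.0):
    print(hit.accession, hit.start, hit.end, hit.p_value)
```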

Figure 2.

Screenshot of initial output after database search. An example of the initial LPS-annotate Database output for the search for bias type ‘QH’. Each CB region is a live link in the depicted list. The download link for the data is at the bottom of the page.


Figure 3.

Screenshot of example of output from database. An example of a complete LPS-annotate Database entry (as described in the main text, and in the downloadable Database Help page). This is for the QH region from PHO2 of budding yeast.


An example of the individual database entry display for the glutamine/histidine-rich (QH-rich) region in PHO2 from budding yeast, is illustrated (Figure 3). PHO2 is a regulator in phosphate metabolism that contains a homeobox DNA-binding domain, and acts as a derepressor of PHO5, another central regulator. It binds to the upstream activator sequence of PHO5, and the promoters of TRP4, HIS4 and CYC1. The database entry contains the following useful information: (i) the subsequence identifier, a unique identifier for the subsequence in the UniProt/SwissProt sequence that is compositionally biased; (ii) sequence accession number; (iii) the initial bias used to build the CB region (in this case = ‘Q’); (iv) the number of residues in the CB region defining the bias; (v) the start and end points of the CB region; (vi) the binomial P-value for the CB region; (vii) the rank of the CB region in the database; (viii) the CB signature (in this case = ‘QH’); (ix) the mean protein disorder propensity (if >1.0, this indicates that the region on average has a propensity to protein disorder); (x) the proportion of the CB region that is disordered, according to the disordered region assignments made with the DISOPRED2 algorithm (in this case it is 100.0%). Other database entry fields are described on the ‘Help’ page.

Below this list of information are displays of the sequence with the CB region in bold (and the bias-defining residues in red bold) (Figure 3). PPD (predicted using the DISOPRED2 algorithm, see ‘Methods’ section) is indicated by asterisks (Figure 3).

The LPS-annotate database and server are available at http://libaio.biol.mcgill.ca/lps-annotate.html. A link is provided on the web page to download the complete database of annotations for both UniProt/SwissProt and UniProt/TrEMBL (July 2009 versions).

Using the LPS-annotate Database to search for similar CB regions in other proteins

The chief utility of the database is to find proteins with similar CB regions, yielding a list of proteins that can be used for further functional hypotheses and inferences. For example, take the sample sequence PHO2 from budding yeast. First, we can determine the CB regions in PHO2 using the LPS-annotate Program Server; the most obvious biased region in PHO2 is the ‘QH’-rich region, which is predicted to be 100% disordered by the program DISOPRED2. Second, we can either: (i) click on the links given in the LPS-annotate Program Server output to obtain lists of similar biases in the LPS-annotate Database or (ii) type the biases of interest into the query box for the LPS-annotate Database Server, and proceed with downloading from there. The downloadable output comprises a list of similar CB regions in other proteins, including the complete sequence of the CB regions within these other proteins. After download, this list of proteins can be examined further by the user with any bioinformatic analysis of their choosing, e.g. for shared globular protein domains elsewhere in the sequence, sequence motifs, cellular co-localizations and functional linkages [as specified, for example, by the Gene Ontology classification (25)].

Comparison of LPS-annotate Database to existing CB annotations in UniProt

We have substantially augmented the annotations of compositionally biased regions in the UniProt database, which are intentionally limited in the UniProt/SwissProt databases to a few, more specific cases, such as homopolymeric runs, with up to one or two short interruptions in the run (26). Here, we have generated more than 23 000 000 CB-region annotations for the UniProt/TrEMBL database, and more than 1 500 000 CB annotations for UniProt/SwissProt. Original CB-region annotations in UniProt number approximately 43 000 (‘FT COMPBIAS’ records); all of these are for the SwissProt portion of UniProt. We have compared these COMPBIAS feature annotations with our new annotations (Figure 4). For binomial P ≤ 10⁻⁶, the new CB annotations (blue column) overlap ∼20% of the COMPBIAS records (to within five residues at either end point). A further breakdown of these overlapping CB annotations is given in some pie charts in Figure 5.
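The endpoint-tolerance comparison used above can be sketched as follows. This is an illustration under one reading of the criterion, namely that a COMPBIAS record counts as recovered when both its start and its end lie within five residues of the corresponding end points of some LPS-annotate region in the same protein; the function names are hypothetical and the coordinates in the example are invented.

```python
def endpoints_match(a, b, tolerance=5):
    """True if both end points of regions a and b differ by at most `tolerance` residues."""
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance


def fraction_recovered(compbias, lps_regions, tolerance=5):
    """Fraction of COMPBIAS (start, end) records matched by any LPS region in one protein."""
    if not compbias:
        return 0.0
    matched = sum(
        1 for record in compbias
        if any(endpoints_match(record, region, tolerance) for region in lps_regions)
    )
    return matched / len(compbias)


# Invented coordinates for a single protein.
toy_compbias = [(10, 30), (100, 120)]
toy_lps = [(12, 33), (200, 250)]
print(fraction_recovered(toy_compbias, toy_lps))
```

Tightening `tolerance` to 0 or 1 gives the exact-match and off-by-one categories of the kind broken down in Figure 5.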

Figure 4.

Column chart showing the augmentation of existing COMPBIAS annotations in UniProt, using the LPS algorithm. The blue column shows the number of CB regions annotated with the LPS-annotate algorithm, for three different P-value thresholds (10⁻⁶; 10⁻¹²; 10⁻¹⁸). The red columns are the existing UniProt COMPBIAS records that overlap the new LPS-annotate annotations (±5 residues at either end of the regions). The UniProt COMPBIAS records are intentionally limited in the UniProt/SwissProt databases to a few, more specific cases, such as homopolymeric runs, with up to one or two short interruptions in the run (26).


Figure 5.

Comparison of the Uniprot COMPBIAS annotations with annotations by the LPS algorithm. Pie charts showing the detailed breakdown of how the new LPS-annotate CB annotations correspond with the UniProt COMPBIAS annotations, for four different P-value thresholds (10⁻⁴, i.e. all of the LPS-annotate CB annotations; 10⁻⁶; 10⁻¹²; 10⁻¹⁸). These are depicted in Figure parts A, B, C and D respectively. Annotations that are exactly matching are colored blue, those that are off by one at either end are colored red and so on. The UniProt COMPBIAS records are intentionally limited in the UniProt/SwissProt databases to a few, more specific cases, such as homopolymeric runs, with up to one or two short interruptions in the run (26).


The CB-region annotations also do not have a simple correspondence with PPD. After removing CB region annotations corresponding to globular domains and transmembrane domains (Figure 6), there are a large number of CB regions without an overall tendency to protein disorder. For example, in Figure 6, for binomial P ≤ 10⁻⁶, ∼21% of the CB regions (approximately 78 000 in number) have disorder propensity less than 1.0, indicating a cohort of CB regions that are potentially uncharacterized biophysical types, e.g. functional amyloids (27).


Figure 6.

Mean disorder propensity of the CB regions. This is a pie chart for the mean disorder propensity of all CB regions with P < 10⁻⁶, with any CB regions that correspond to globular or transmembrane domains removed. The mean disorder propensity is calculated as described in ‘Methods’ section.


Conclusions

This server and database is a conceptually novel addition to the panoply of tools now available to molecular biologists to generate hypotheses and inferences about the proteins that they are investigating. Furthermore, large-scale analysis of cohorts of proteins with specific compositional biases and disorder propensities is made tractable by our analysis. The database of CB annotations is updatable at regular intervals.

Funding

This research was funded by the National Science & Engineering Research Council, Le Fonds québécois de la recherche sur la nature et les technologies, and McGill University. The open access publication charge is paid by Le Fonds québécois de la recherche sur la nature et les technologies.

Conflict of interest. None declared.

References

1. Harrison, Stajich, et al. Evolution of budding yeast prion-determinant sequences across diverse fungi. J. Mol. Biol. 2007, vol. 368 (pg. 273-282).
2. Long, Caceres. The SR protein family of splicing factors: master regulators of gene expression. Biochem. J. 2009, vol. 417.
3. Neduva, Russell. Proline-rich regions in transcriptional complexes: heading in many directions. Sci. STKE 2007, vol. 2007, pg. pe1.
4. Uversky, Dunker. Understanding protein non-folding. Biochim. Biophys. Acta 2010, vol. 1804 (pg. 1231-1264).
5. Dunker, Silman, Uversky, Sussman. Function and structure of inherently disordered proteins. Curr. Opin. Struct. Biol. 2008 (pg. 756-764).
6. Wang, Liu, et al. Predicting intrinsic disorder in proteins: an overview. Cell Res. 2009 (pg. 929-949).
7. Dosztanyi, Tompa. Prediction of protein disorder. Methods Mol. Biol. 2008, vol. 426 (pg. 103-115).
8. Dosztanyi, Sandor, Tompa, et al. Prediction of protein disorder at the domain level. Curr. Protein Pept. Sci. 2007 (pg. 161-171).
9. Bourhis, Canard, Longhi. Predicting protein disorder and induced folding: from theoretical principles to practical applications. Curr. Protein Pept. Sci. 2007 (pg. 135-149).
10. Ferron, Longhi, Canard, et al. A practical overview of protein disorder prediction methods. Proteins 2006.
11. Gao, Agrawal, Thelen, et al. A new machine learning approach for protein phosphorylation site prediction in plants. Lect. Notes Comput. Sci. 2009, vol. 5462.
12. Iakoucheva, Radivojac, Brown, et al. The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Res. 2004 (pg. 1037-1049).
13. Jorda, Xue, Uversky, et al. Protein tandem repeats - the more perfect, the less structured. FEBS J. 2010, vol. 277 (pg. 2673-2682).
14. Harrison. Exhaustive assignment of compositional bias reveals universally prevalent biased regions: analysis of functional associations in human and Drosophila. BMC Bioinformatics 2006, pg. 441.
15. Simon, Hancock. Tandem and cryptic amino acid repeats accumulate in disordered regions of proteins. Genome Biol. 2009, pg. R59.
16. Wootton, Federhen. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 1996, vol. 266 (pg. 554-571).
17. Promponas, Enright, Tsoka, et al. CAST: an iterative algorithm for the complexity analysis of sequence tracts. Complexity analysis of sequence tracts. Bioinformatics 2000 (pg. 915-922).
18. Harrison, Gerstein. A method to assess compositional bias in biological sequences and its application to prion-like glutamine/asparagine-rich domains in eukaryotic proteomes. Genome Biol. 2003 (pg. R40-R46).
19. Apweiler, Bairoch, et al. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. 2004 (pg. D115-D119).
20. Ward, McGuffin, Bryson, et al. The DISOPRED server for the prediction of protein disorder. Bioinformatics 2004 (pg. 2138-2139).
21. Sickmeier, Hamilton, LeGall, et al. DisProt: the Database of Disordered Proteins. Nucleic Acids Res. 2007 (pg. D786-D793).
22. Vucetic, Obradovic, Vacic, et al. DisProt: a database of protein disorder. Bioinformatics 2005 (pg. 137-140).
23. Altschul, Madden, Schaffer, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997 (pg. 3389-3402).
24. Chandonia, Hon, Walker, et al. The ASTRAL Compendium in 2004. Nucleic Acids Res. 2004 (pg. D189-D192).
25. Gene Ontology Consortium. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004 (pg. D258-D261).
26. UniProt. http://www.uniprot.org/manual/compbias (8 December 2010, date last accessed).
27. Fowler, Koulov, Balch, et al. Functional amyloid–from bacteria to humans. Trends Biochem. Sci. 2007 (pg. 217-224).

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
