Abstract

Molecular databases are essential resources for both experimental and computational biologists. The rapid increase in high-quality genome assemblies has led to a surge in publications describing secondary gene loss events associated with lineage-specific adaptations across diverse vertebrate groups. This growing volume of information underscores the urgent need for organized, searchable, and curated resources that facilitate data discovery, allow detection of broad evolutionary patterns, and support downstream analyses. Currently, no existing database compiles manually curated and validated information on published secondary gene loss events. Here, we introduce the Gene Loss Database (Gene Loss DB), a platform designed to centralize and present these data in an easy-to-search and user-friendly format (https://geneloss.org/). Gene Loss DB compiles gene loss events alongside supporting evidence, including the inferred mechanism of gene loss (exon deletion, gene deletion, loss of function mutation), the type of data used to support inactivation (genomic, transcriptomic, single/multiple individual sequence reads, synteny maps) and, when available, whether the event is shared across all lineages within a taxon. Each entry also includes a short excerpt from the original publication to provide context. This information is structured in the database to be searchable by species, gene, taxa, or by gene ontology terms linked to the gene in question. The initial release of Gene Loss DB focuses on cetaceans, a lineage with numerous gene loss events linked to aquatic adaptations. This first collection comprises 1872 gene loss events identified across 57 cetacean species. In addition, the database includes 1321 gene loss events from other taxa, which were also reported in the same studies and collected simultaneously.

Introduction

The emergence of high-quality genomic data has shed light on the dynamic nature of genomes, which are shaped by a combination of selection, gene duplication, gene loss, and environmental adaptation. In vertebrates, genomic plasticity has been linked to phenotypic diversity, niche adaptation, and trait innovation [1, 2]. This growing wealth of information also uncovers critical knowledge gaps, particularly in the case of secondary gene loss, defined as the loss of ancestral functions in extant lineages [1, 2]. Gene loss is often poorly documented in genomic databases; yet, recent high-quality genome assemblies have highlighted its significance in evolutionary processes [2–4]. Secondary gene loss has been detected across the tree of life, from bacteria [5] to plants [6] and animals [7], with important consequences for phenotypic evolution, such as antimicrobial resistance and pathogenesis in bacteria [5, 8], inflammation in humans [9], and the evolution of the stomach in gnathostomes [10].

Gene loss often occurs due to open reading frame (ORF)-disrupting mutations, deletions, or even chromosomal rearrangements, with partial or total excision of gene sequences [3]. Nonfunctional genes often share sequence similarity with functional homologues, which can lead to their misannotation as protein-coding by automated pipelines. This is more likely in species where supporting evidence, such as transcripts, proteins, or short reads from the same species, is limited, and genome annotation therefore relies on data from closely related species [11]. This persistent gap in genome annotation has fuelled numerous studies, notably in mammals with recently published genomes, focusing on the identification of secondary gene loss events often linked to lineage-specific adaptive traits [1–4, 12–23].

Currently, a search on PubMed with the term ‘gene loss’ yields thousands of results, most of which have been published since 2015, coinciding with the emergence of high-quality genomic data for numerous species. This wealth of information highlights the timely need to organize and catalogue data to maximize searchability and streamline subsequent studies. Despite this, until now, no database has collated curated and validated information on published secondary gene loss events. To find targeted information regarding gene loss episodes in a specific gene family or species, researchers must search through numerous manuscripts addressing gene loss with varying degrees of evidence and, more laboriously, through extensive supplementary material files, making it challenging to quickly and accurately compile data from multiple papers.

To address this issue, we present the Gene Loss Database (Gene Loss DB), a user-friendly, curated, reliable, and efficient tool for navigating gene loss events. In this database, ‘gene loss’ refers to the inactivation of a gene, encompassing both partial and complete gene deletions and loss-of-function mutations that disrupt coding or regulatory regions. Furthermore, Gene Loss DB essentially focuses on secondary gene loss. Consequently, a gene is considered ‘lost’ in a specific taxon if there is evidence that the orthologous gene is present and functional in a sister clade, and likewise in the shared common ancestor [2].

While this first database release highlights events from cetaceans and other mammals, Gene Loss DB’s aim is to progressively expand and encompass a wider range of lineages. It is important to note that Gene Loss DB does not focus on the annotation and/or identification of pseudogenes from non-curated sources, such as public genome databases (e.g. Ensembl and NCBI). In addition, Gene Loss DB sets itself apart from pseudogene detection analysis tools such as PseudoChecker [24] and TOGA [25]. Instead, this database collects data from published manuscripts previously subjected to peer-reviewed scrutiny and validation. Expert researchers (biocurators) extract data from these publications, which is then organized and deposited in the database. Most importantly, this database brings to the spotlight vast amounts of data often shadowed within the supplementary materials and thus with decreased visibility and searchability. This is particularly relevant for publications addressing large datasets e.g. [26, 27]. For example, one paper reports the inactivation of multiple vision-related genes in subterranean and other mammals adapted to low-light environments. While the main text highlights a few genes, namely RBP3, OPN1SW/SWS1, GJ10, ARR3, CRB1, GRK7, GUCA1B, and GUCY2F, the study analysed 213 vision-associated genes, with the full list available in the supplementary material [26]. Another example examines gene losses in the cetacean stem lineage where 11 genes are discussed in the main text, yet 74 gene losses are reported in the supplementary material [27].

To address these challenges, Gene Loss DB aims to (i) aggregate and organize gene loss data that are currently scattered across numerous publications, repositories, and supplementary materials facilitating access to this information; (ii) promote a standardization of gene loss annotations; (iii) promote data-driven discovery by organizing curated data; and (iv) improve data traceability.

For the initial release of the database, we selected the curation and aggregation of data from the Cetacea lineage. This choice was motivated not only by the fact that this taxon is a focus of our research group, but also by the recent availability of numerous cetacean genomes and by the relatively high number of gene loss events in this taxon, many of which have only been identified in recent years. In cetaceans, gene loss has been linked to key morphological and physiological traits associated with aquatic adaptation, e.g. [19, 22, 27, 28]. Despite these changes, cetacean genomes retain a high level of sequence conservation with those of other mammals, which can lead to the frequent misannotation of pseudogenes as intact coding genes in public databases [11]. Several studies have identified such pseudogenes and explored their potential adaptive relevance in cetaceans, including those associated with the loss of fur, e.g. [18, 27, 29, 30], the absence of sebaceous glands [13], alterations in skin structure, modifications in skin immunity, e.g. [16, 31–33], loss of tooth development in Mysticeti species, e.g. [22, 23, 34, 35], loss of taste receptors, loss of visual receptors, e.g. [19, 36, 37], among other cases. However, while many of these losses have been reported in the literature, the data remain scattered across numerous publications.

Here, we consolidate gene loss data for cetaceans from 56 published studies, providing a centralized, curated, searchable database resource—Gene Loss DB.

Methods

Database implementation

The Gene Loss DB frontend data exploration interface and the backend data curation interface for curators were specifically developed for this purpose. Gene Loss DB was built using Laravel, a robust and flexible PHP framework for web application development. The data storage engine employed was MySQL, a widely used and free relational database management system. For the backend administration interface, AdminLTE, a popular open-source admin dashboard template, was used to provide a responsive and customizable UI. To enhance the usability of the web interface, Select2 (available at http://select2.org), a jQuery-based replacement for select boxes, and DataTables, a plugin for enhancing HTML tables, were integrated. For the frontend service, Bootstrap, a popular CSS framework for responsive and mobile-first web development, was utilized. Additionally, data visualization capabilities were implemented using Chart.js, an open-source JavaScript library for creating flexible and interactive charts.

Gene Loss DB also integrates third-party APIs from NCBI, PantherDB, and the Gene Ontology (GO) [38–40] to ensure data consistency and to enrich data retrieval and analysis capabilities.

Data collection, curation, and quality control

Published manuscripts on gene loss were collected through a comprehensive search in PubMed and PubTator3 [41], excluding reviews. The search was conducted using a combination of the following terms: ‘gene loss’, ‘pseudogene’, ‘gene inactivation’, ‘gene disruption’, ‘gene deletion’, ‘cetacea’, and ‘marine mammal’, along with the Boolean operator AND. The retrieved manuscripts were compiled into a non-redundant list, which was then manually reviewed to identify studies describing gene loss events in cetaceans. Manuscripts meeting the inclusion criteria underwent full-text review and were incorporated into the database as annotation jobs. We acknowledge that this dataset is incomplete and may remain so, as keyword-based searches can overlook studies that report gene loss events without explicitly using those specific keywords. Likewise, because we excluded reviews from the initial PubMed/PubTator3 search, reviews that include original data on gene loss are not included (e.g. [42]). In addition, some literature may not be indexed in the databases searched. To minimize the effect of these factors, the final dataset includes additional relevant manuscripts suggested by the curators that were not identified in the initial search.
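The term combination and de-duplication described above can be sketched as follows. This is a minimal illustration only: the text does not state how the terms were paired, so the sketch assumes that each gene-loss term is combined with each taxon term, and the function names are hypothetical.

```python
from itertools import product

# Search terms as listed in the text.
LOSS_TERMS = ["gene loss", "pseudogene", "gene inactivation",
              "gene disruption", "gene deletion"]
TAXON_TERMS = ["cetacea", "marine mammal"]


def build_queries(loss_terms, taxon_terms):
    """Pair every gene-loss term with every taxon term using AND.

    The pairing scheme is an assumption made for illustration.
    """
    return [f'"{loss}" AND "{taxon}"'
            for loss, taxon in product(loss_terms, taxon_terms)]


def merge_non_redundant(result_lists):
    """Merge per-query lists of record IDs, keeping first occurrences only."""
    seen, merged = set(), []
    for ids in result_lists:
        for record_id in ids:
            if record_id not in seen:
                seen.add(record_id)
                merged.append(record_id)
    return merged
```

Each query string would be submitted to the literature search separately, and the per-query results merged into the non-redundant list that is then reviewed manually.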

Gene loss data curation was performed by expert curators who read the selected manuscripts and extracted gene loss information in three steps. In the first step, all structured data is collected by completing the Gene Loss (GLoss) annotation forms, either by providing NCBI identifiers or by selecting the correct option from the provided list (Table 1).

Table 1.

Structured information requested in a GLoss annotation.

Information requested | Curator reply
Identification of the gene reported to be lost | Reply: NCBI gene ID of the reference gene
Identification of the species in which the gene is reported to be lost | Reply: NCBI tax ID of the species in which the gene is lost
Determination of the type of gene loss | Options: full, polymorphic, undetermined
Identification of the gene loss mechanism | Options: gene deletion, exon(s) deletion, loss of function (LOF; frameshift, premature stop, abolishment of canonical splice sites), regulatory region mutation, other (chromosome rearrangements, inversions, or other phenomena)
Identification of the presented evidence to validate gene loss | Options: multiple individual SRA, single individual SRA, genomic, transcriptomic, genomic and synteny maps, genomic and transcriptomic, PCR and Sanger sequencing multiple, PCR and Sanger sequencing single, sequence trace archive
Indicate if the gene loss is shared in all organisms of a specific taxon (‘lineage-specific’) | Options: yes; no
Accession number of the lost gene if available | Reply: accession number (optional)

In step two, curators select short excerpts from the manuscript that provide additional context for the gene loss annotations. These excerpts (statements) are then incorporated into the gene loss annotation and categorized based on the type of information they contain. The categories include ‘Mutational Description’, ‘Functional’, ‘Phenotypic’, ‘Timing of Loss’, ‘Methodology and Validation’, and ‘Other’ (when the statement is deemed important, but does not fall into any of the other categories). In step three, curators can provide critical insights into a specific gene loss event that may not be immediately evident from the manuscript. These observations are recorded in the ‘Curator Observations’ field as unstructured free-text data. Gene Loss DB curators could opt to use the AI tool Coral AI Pro to summarize, identify, and extract key information from the selected manuscripts. Regardless, all information collected with AI assistance was confirmed by the curators independently.

After collecting all data from a specific annotation job, curators submit the full annotation job, containing multiple GLoss annotations, for quality control. Quality control consists of two rounds of data validation. The first round is performed programmatically and checks that all required fields are completed, that stable identifiers such as gene ID and tax ID are used appropriately, and that there are no duplicate annotations within the same job. The second round involves manual validation by trained database curators, who review each annotation job to ensure consistency and completeness (Fig. 1). Only after passing both validation steps is the annotation job completed and published in the database.
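The programmatic validation round could look roughly like the sketch below. It is not the database's actual code (the platform is built on Laravel and MySQL); the class and function names are hypothetical, the allowed options are copied from Table 1, and a duplicate is assumed to mean the same gene ID and tax ID appearing twice within one job.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Allowed options, copied from Table 1.
LOSS_TYPES = {"full", "polymorphic", "undetermined"}
MECHANISMS = {"gene deletion", "exon(s) deletion", "loss of function",
              "regulatory region mutation", "other"}
EVIDENCE = {"multiple individual SRA", "single individual SRA", "genomic",
            "transcriptomic", "genomic and synteny maps",
            "genomic and transcriptomic", "PCR and Sanger sequencing multiple",
            "PCR and Sanger sequencing single", "sequence trace archive"}
LINEAGE_SPECIFIC = {"yes", "no"}


@dataclass
class GLossAnnotation:
    gene_id: str                     # NCBI gene ID of the reference gene
    tax_id: str                      # NCBI tax ID of the species
    loss_type: str
    mechanism: str
    evidence: str
    lineage_specific: str
    accession: Optional[str] = None  # optional, as in Table 1


def _is_ncbi_id(value: str) -> bool:
    """NCBI gene and taxonomy identifiers are plain positive integers."""
    return re.fullmatch(r"[0-9]+", value) is not None


def validate_job(annotations: List[GLossAnnotation]) -> List[str]:
    """Return a list of problems; an empty list means the job passes."""
    problems = []
    seen = set()
    for index, ann in enumerate(annotations, start=1):
        if not _is_ncbi_id(ann.gene_id):
            problems.append(f"annotation {index}: gene ID must be numeric")
        if not _is_ncbi_id(ann.tax_id):
            problems.append(f"annotation {index}: tax ID must be numeric")
        checks = (("loss type", ann.loss_type, LOSS_TYPES),
                  ("mechanism", ann.mechanism, MECHANISMS),
                  ("evidence", ann.evidence, EVIDENCE),
                  ("lineage-specific flag", ann.lineage_specific, LINEAGE_SPECIFIC))
        for name, value, allowed in checks:
            if value not in allowed:
                problems.append(f"annotation {index}: unknown {name} {value!r}")
        # Assumption: a duplicate is the same gene/species pair within one job.
        key = (ann.gene_id, ann.tax_id)
        if key in seen:
            problems.append(f"annotation {index}: duplicate of gene "
                            f"{ann.gene_id} in taxon {ann.tax_id}")
        seen.add(key)
    return problems
```

A job with a non-empty problem list would be returned to the curator before reaching the second, manual validation round.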

Figure 1.

Database curation workflow and validation steps.

Data structure and searchability

The fundamental unit of the Gene Loss DB is a Gene Loss Annotation or GLoss annotation, which documents the loss of a single gene in a specific species. Each GLoss annotation is linked to a reference manuscript and to a Reference Gene, the highest-ranking unit in the database’s organizational structure. A Reference Gene consolidates all GLoss annotations related to that gene, regardless of the species or manuscript in which the loss was reported (Fig. 2A).

Figure 2.

Schematic representation of the Gene Loss DB structure. (A) Hierarchical organization of the overall database. (B) Data structure when explored by annotation job. (C) Data structure when explored by species. (D) Data organization when focused on GO terms.

In most cases, the reference gene corresponds to the coding orthologue in the human genome. However, if the human gene is non-coding, a coding orthologue from another model species may be selected as the reference.

The reference gene is selected using a stable identifier—Gene ID (structured information—Table 1), allowing the automatic retrieval via API from the NCBI gene database of general information linked to the gene, including gene summary, symbol, aliases, and GO terms, as well as paralogues from the Panther Knowledgebase [38].
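As an illustration of identifier-based retrieval, the sketch below queries the public NCBI E-utilities ESummary endpoint for a Gene ID and extracts the symbol, aliases, and summary. It is an assumption that Gene Loss DB uses this particular endpoint; the function names are hypothetical, the field names (`name`, `description`, `otheraliases`, `summary`) follow the ESummary JSON layout for the gene database as we understand it, and GO terms and PANTHER paralogues would need separate calls that are not shown.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(gene_id: str) -> str:
    """Build the ESummary request URL for one NCBI Gene ID."""
    query = urlencode({"db": "gene", "id": gene_id, "retmode": "json"})
    return f"{EUTILS_ESUMMARY}?{query}"


def parse_gene_record(payload: dict, gene_id: str) -> dict:
    """Pick the fields of interest out of an ESummary JSON payload."""
    doc = payload["result"][gene_id]
    aliases = [a.strip() for a in doc.get("otheraliases", "").split(",") if a.strip()]
    return {
        "gene_id": gene_id,
        "symbol": doc.get("name", ""),
        "full_name": doc.get("description", ""),
        "aliases": aliases,
        "summary": doc.get("summary", ""),
    }


def fetch_gene_record(gene_id: str) -> dict:
    """Network call; requires internet access and respects no rate limits."""
    with urlopen(esummary_url(gene_id), timeout=30) as response:
        payload = json.load(response)
    return parse_gene_record(payload, gene_id)
```

Keeping the parsing step separate from the network call makes it possible to test the field extraction against a stored response.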

The Gene Loss DB has implemented a dynamic data structure, cross-linking, and use of unique identifiers to enhance data searchability and user-friendliness. As a result, the data structure may vary depending on the user’s starting point. Yet, regardless of the user’s starting point, each GLoss annotation retains all associated identifiers, ensuring comprehensive traceability. Each GLoss Annotation includes the following automatically linked identifiers upon creation:

  • Reference gene identifier (Gene ID): identifies the reference gene associated with the annotation.

  • GLoss identifier (GL_######): a unique, six-character alphanumeric identifier preceded by ‘GL’, generated automatically by the database.

  • Annotation job identifier (JB_######): a unique, six-character alphanumeric identifier preceded by ‘JB’ linking each GLoss annotation to a reference publication.
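The two generated identifier formats can be checked with a short pattern. This is a sketch under the assumption that the six characters are ASCII letters or digits in either case; the text only says "six-character alphanumeric", and the function name is hypothetical.

```python
import re

# Assumed pattern: prefix, underscore, six ASCII letters or digits.
ID_PATTERN = re.compile(r"(GL|JB)_([A-Za-z0-9]{6})")


def parse_identifier(identifier: str):
    """Return (kind, code) for a GLoss or annotation job identifier."""
    match = ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a valid GL_/JB_ identifier: {identifier!r}")
    kind = "gloss" if match.group(1) == "GL" else "annotation_job"
    return kind, match.group(2)
```

For example, `parse_identifier("JB_00012A")` would be recognised as an annotation job identifier, while a string with fewer than six characters after the underscore raises `ValueError`.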

Exploring the Gene Loss DB by Annotation Job reveals all GLoss annotations associated with a single annotation job (Fig. 2B). When exploring gene loss data by species, all GLoss annotations linked to a specific species are displayed, regardless of their annotation job of origin (Fig. 2C). Finally, organizing the data by GO term presents all GLoss annotations associated with a specific GO term, which is linked to a corresponding term in the reference gene (Fig. 2D).
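The three views described above amount to grouping the same set of GLoss annotations by different keys. The sketch below illustrates this with in-memory records; the class, field, and function names are hypothetical, and the real database presumably resolves these views with relational queries.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class GLoss:
    gloss_id: str  # GL_###### identifier
    job_id: str    # JB_###### identifier of the annotation job
    gene_id: str   # NCBI Gene ID of the reference gene
    tax_id: str    # NCBI tax ID of the species


def group_by(annotations: Iterable[GLoss], key: str) -> Dict[str, List[str]]:
    """Group GLoss IDs by one attribute, e.g. 'job_id' or 'tax_id'."""
    groups = defaultdict(list)
    for ann in annotations:
        groups[getattr(ann, key)].append(ann.gloss_id)
    return dict(groups)


def group_by_go_term(annotations: Iterable[GLoss],
                     gene_to_go: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Group GLoss IDs by the GO terms attached to their reference gene."""
    groups = defaultdict(list)
    for ann in annotations:
        for term in gene_to_go.get(ann.gene_id, ()):
            groups[term].append(ann.gloss_id)
    return dict(groups)
```

Because every annotation carries all three identifiers, any view can be rebuilt from the same records, which is what keeps annotations traceable regardless of the user's starting point.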

Results

Database usage—browsing and targeted search

The Gene Loss DB can be explored either by browsing all available data or through targeted searches. Users can browse the data through icons on the homepage (Fig. 3A). Browsing by selecting the pseudogene icon will return a list of all GLoss annotations in the database. Browsing by selecting the species icon provides a list of all species with at least one GLoss annotation, allowing users to select a species and view all associated GLoss annotations. Browsing by publication displays a list of curated publications in the current version of the database, showing the total number of GLoss annotations extracted from each publication along with direct links to the published article (Supplementary Table 1).

Figure 3.

Overview of the Gene Loss DB and search methods. (A) Homepage with navigation options. (B) Reference gene page displaying GLoss annotations linked to a specific gene. (C) Search results for a GO term query, showing associated genes. (D) Search results for a species query, listing gene loss events reported for the selected species.

Users may also perform targeted searches using the search box. To begin a search, they must first select a category from the drop-down menu, which includes gene, species, GLossID, and GO terms (Fig. 3A). To enhance user experience and facilitate data exploration, the Gene Loss DB supports ‘partial exact matching’, meaning users do not need to enter the full search term to obtain results, but correct spelling is required. When searching by gene, users can enter a gene symbol, gene alias, full gene name, or partial gene name. This returns a list of reference genes, and upon selection, users are redirected to a page containing reference gene details and all linked GLoss annotations (Fig. 3B). Searching by species allows input of a species name, partial species name, common name, order, or infraorder, retrieving GLoss annotations associated with species matching the search terms. Users can also explore gene loss data from a functional perspective using the GO term search, where keywords should correspond to full or partial GO terms or GO term ID numbers (Fig. 3C and D). This returns a list of reference genes linked to the specified GO term, and selecting a gene provides access to its associated GLoss annotations. Finally, searches can also be conducted using GLoss identifiers. By selecting the GLossID search option and entering a specific GLossID, users retrieve the corresponding GLoss annotation directly.
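The behaviour of ‘partial exact matching’ can be approximated as a substring search. The sketch below assumes the match is case-insensitive, which the text does not state, and uses a hypothetical function name; the production system may implement it differently, for instance in SQL.

```python
from typing import Iterable, List


def partial_exact_match(query: str, candidates: Iterable[str]) -> List[str]:
    """Return candidates that contain the query as a contiguous substring.

    Partial terms match, but a misspelt query does not, because no fuzzy
    matching is applied. Case-insensitivity is an assumption.
    """
    needle = query.strip().casefold()
    if not needle:
        return []
    return [c for c in candidates if needle in c.casefold()]
```

With the species names `Delphinapterus leucas` and `Kogia breviceps`, the query `delph` matches the first name only, whereas the misspelling `delf` matches nothing.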

Each GLoss annotation page is structured into several sections to ensure clarity and ease of navigation. At the top, the general information section (Fig. 4A) provides a header displaying the gene symbol and the species in which the gene is reported as lost. Below this, additional details such as cross-links to the reference gene and the corresponding annotation job are included (Fig. 4B). The next sections describe the gene loss event; these combine structured data (Fig. 4C), including the GLossID, species, gene loss mechanism, loss type, supporting evidence, and lineage specificity, with semi-structured data (Fig. 4D) in the form of text excerpts selected by the curator and extracted from the curated manuscript. These excerpts are classified into six types according to the information they contain: (i) phenotypic (excerpts addressing the phenotypic outcome associated with the reported gene loss), (ii) functional (excerpts describing the function of the gene and the corresponding encoded protein), (iii) timing of loss (excerpts indicating the approximate timing of gene loss), (iv) mutation description (excerpts with a general description of the identified ORF-disrupting mutations), (v) methodology and validation (excerpts with a general description of the methods used to identify and validate the reported gene loss event), and (vi) other (text segments selected from the manuscript that the curator deemed essential to provide context to the GLoss annotation and that cannot be classified under the previous types). Following this, the curator observations section (Fig. 4E) contains unstructured data, where expert curators offer specific insights and additional context regarding the gene loss event. This section highlights any critical details that may not be explicitly stated in the manuscript but are relevant for data interpretation. Finally, at the bottom of the page, the related GLosses section (Fig. 4F) lists other instances in which the same gene was reported as lost in the same species but in different curated manuscripts. This feature helps users identify repeated findings across independent studies, further supporting the reliability of gene loss reports. Because all gene losses reported in each publication are curated to ensure complete coverage, the same gene and species may receive multiple independent annotations. While this introduces a degree of redundancy, it also increases the robustness of the database by enabling independent corroboration of findings across diverse sources.

Figure 4.

Example of the general structure of a GLoss annotation page.

Database contents

Although this collection focuses on gene loss in cetacean species, all gene loss events reported in the selected publications were annotated, leading to a spillover into other mammalian orders. Currently, the Gene Loss DB contains curated data from 56 publications analysed within the scope of the cetacean collection (see Supplementary Table 1). This curation effort resulted in 3193 gene loss annotations across 443 genes in 359 mammalian species (including subspecies), with representatives from 22 mammalian orders (Table 2). Not surprisingly, the mammalian order with the highest number of species and GLoss annotations is Artiodactyla, which includes the cetacean infraorder. Following Artiodactyla, we find Carnivora with 209 GLoss annotations from 80 species and Chiroptera with 152 GLoss annotations in 39 species (Table 2). It is important to note that, so far, the only nearly complete collection in the database pertains to cetaceans; GLoss annotations are nevertheless expected to grow to cover the complete mammalian catalogue of gene loss events.

Table 2.

Overview of the curated gene loss data included.

Order | Number of species | Number of GLoss annotations
Artiodactyla | 109 (including 57 Cetacea) | 2058 (including 1872 Cetacea)
Afrosoricida* | 2 (Chrysochloris asiatica, Echinops telfairi) | 49, 17
Rodentia | 23 | 150
Carnivora | 82 | 209
Primates | 26 | 113
Perissodactyla | 14 | 38
Pholidota | 4 | 98
Dermoptera | 1 (Galeopterus variegatus) | 8
Pilosa | 8 | 27
Eulipotyphla | 7 | 73
Chiroptera | 40 | 152
Scandentia | 1 (Tupaia chinensis) | 10
Proboscidea | 6 | 34
Sirenia | 4 | 63
Cingulata | 5 | 41
Tubulidentata | 1 (Orycteropus afer) | 14
Lagomorpha | 10 | 22
Hyracoidea | 2 | 5
Macroscelidea | 1 (Elephantulus edwardii) | 1
Monotremata | 2 | 2
Dasyuromorphia | 1 (Sarcophilus harrisii) | 1
Diprotodontia | 1 (Vombatus ursinus) | 1
* Not included in NCBI taxonomy but a recognized order.

Concerning gene loss mechanisms and evidence, data curation revealed that the most frequent gene loss mechanism, reported in over 53% of the GLoss annotations, was loss-of-function (LOF) mutations. These include frameshift mutations, mutations that alter canonical donor and acceptor splice sites, and premature stop codons. This was followed by gene loss mechanisms included in the ‘Other’ category (circa 19.5%), which include mutations that abolish the start codon, as well as cases in which the exact mechanism of gene loss was not specified by the authors of the original paper. Gene deletions and exon(s) deletions were also among the gene loss mechanisms reported by the authors (circa 16.4% and circa 10.3%, respectively) (Fig. 5). In cases where gene loss is due to multiple mechanisms, such as the combination of LOF mutations and exon deletions, the Gene Loss DB prioritizes gene loss mechanisms that are shared across multiple species. For example, a shared premature stop codon will be given preference over an exon deletion observed in only one species within the same group. If no shared mutations are identified across the species under analysis, preference will be given to the most frequently observed mutation, followed by the first reported mutation that appears in the 5′ region of the canonical isoform of the gene. It is important to note that while one gene loss mechanism is annotated in the database, the gene in question may carry other mutations and forms of gene erosion, as is expected in pseudogenes, e.g. [13, 17, 43].
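The prioritization rule for genes with several reported mechanisms can be sketched as a small decision function. In practice this choice is made by curators, and the sketch makes several assumptions not stated in the text: a mutation counts as shared when the same mechanism is reported at the same position in more than one species, "most frequently observed" is read as the mechanism reported most often, and positions are coordinates counted from the 5′ end of the canonical isoform. All names are hypothetical.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Mutation:
    species: str
    mechanism: str  # e.g. "loss of function", "exon(s) deletion"
    position: int   # coordinate in the canonical isoform, from the 5' end


def pick_mechanism(mutations: List[Mutation]) -> str:
    """Choose the single mechanism to annotate for a gene."""
    if not mutations:
        raise ValueError("at least one mutation is required")

    # Rule 1: prefer a mutation carried by more than one species.
    carriers = defaultdict(set)
    for m in mutations:
        carriers[(m.mechanism, m.position)].add(m.species)
    shared = [(key, species) for key, species in carriers.items()
              if len(species) > 1]
    if shared:
        # Most widely shared first; ties go to the 5'-most position.
        (mechanism, _), _ = min(
            shared, key=lambda item: (-len(item[1]), item[0][1]))
        return mechanism

    # Rule 2: otherwise take the most frequently observed mechanism.
    counts = Counter(m.mechanism for m in mutations)
    top = max(counts.values())
    tied = {mech for mech, n in counts.items() if n == top}

    # Rule 3: break ties with the mutation closest to the 5' end.
    first = min((m for m in mutations if m.mechanism in tied),
                key=lambda m: m.position)
    return first.mechanism
```

Applied to the example in the text, a premature stop codon found at the same position in two species is selected over an exon deletion seen in only one of them.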

Figure 5.

(A) Mutational spectrum of main gene loss mechanisms reported in the current release of Gene Loss DB. (B) Gene loss evidence reported in the current release of Gene Loss DB.

When considering the evidence provided by the authors to support claims of gene loss, genomic evidence was the most frequently mentioned. This refers to the identification of gene loss mechanisms using publicly available genomic data or a genome assembly of the species in question. Often, authors combined genomic evidence with synteny maps and transcriptomic data to further support their findings. In addition to genomic evidence, many authors also provided Sequence Read Archive (SRA) datasets from multiple individuals or from a single individual. Finally, in some cases, authors provided polymerase chain reaction (PCR) and Sanger sequencing data from independent samples to validate the identified mutations (Fig. 5). In cases where the authors present multiple forms of evidence supporting the identification of the gene loss mechanism, curators select the strongest form of evidence for inclusion in the database. This typically includes multiple-individual SRA datasets or other evidence validating the existence of the same mutation in multiple independent samples from the same species.

Gene loss was classified into three main categories: full, polymorphic, and undetermined. A gene loss event was classified as ‘Full’ when the gene in question was lost in all individuals of a specific species, indicating that the ORF-disrupting mutation has reached full fixation in that species. To validate this, curators screened the manuscripts for at least one of the following pieces of evidence: (i) the gene in question presents multiple LOF mutations, or (ii) if a single mutation is reported, the authors did not find evidence that this variation may be polymorphic, and/or (iii) the identified mutations were conserved with those observed in a sister species. A gene loss event was classified as ‘Polymorphic’ when a single ORF-disrupting mutation was present in some individuals of a species but absent in others, indicating that the mutation had not reached full fixation [43, 44]. To confirm this classification, curators validated whether (i) the mutation was observed in a subset of analysed individuals from the same species and/or (ii) the authors explicitly stated that the gene in question was a polymorphic pseudogene in the target species. Some examples of polymorphic pseudogenes included in the current release of the database are OPN1SW in Delphinapterus leucas and Phocoenoides dalli [28], MMP20 in Kogia breviceps [22], and CNGA3 in Eubalaena glacialis [36]. Finally, gene loss was classified as ‘Undetermined’ when the authors explicitly expressed uncertainty about the gene’s coding status and when the manuscript lacked sufficient evidence to fully support the claim of gene loss. 
Examples include cases where (i) the ORF-disrupting mutation was located in the last exon or near the end of the gene, so that truncating mutations are present but do not rule out protein functionality, as for TCHHL1 and FLG2 in Dugong dugon and Trichechus manatus [45] and IL20 in Trichechus manatus [16], or (ii) the authors did not identify any ORF-disrupting mutations but instead found multiple missense mutations affecting critical residues, as for CORT in Pontoporia blainvillei [46]. In the current database collection, a total of 3138 gene loss annotations were classified as full gene loss events, 16 as polymorphic, and 33 as undetermined.
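
The three-way classification can be read as a small decision procedure. The sketch below is a simplified illustration under stated assumptions and is not the curators' actual workflow: the `CurationNotes` fields and the function `classify_gene_loss` are hypothetical, and the order in which the checks are applied is our assumption.

```python
from dataclasses import dataclass


@dataclass
class CurationNotes:
    """Facts a curator might extract from one manuscript (hypothetical fields)."""
    inactivating_event_count: int = 0            # ORF-disrupting mutations or deletions reported
    seen_in_subset_of_individuals: bool = False  # mutation present in only some individuals
    authors_state_polymorphic: bool = False      # explicit statement in the manuscript
    authors_express_uncertainty: bool = False    # coding status left open by the authors


def classify_gene_loss(notes: CurationNotes) -> str:
    """Return 'Full', 'Polymorphic', or 'Undetermined' for one gene in one species."""
    # Undetermined: explicit uncertainty, or no ORF-disrupting event to support the claim.
    if notes.authors_express_uncertainty or notes.inactivating_event_count == 0:
        return "Undetermined"
    # Polymorphic: the disrupting mutation has not reached fixation.
    if notes.seen_in_subset_of_individuals or notes.authors_state_polymorphic:
        return "Polymorphic"
    # Full: one or more disrupting events and no evidence of polymorphism.
    return "Full"


print(classify_gene_loss(CurationNotes(inactivating_event_count=3)))
print(classify_gene_loss(CurationNotes(inactivating_event_count=1,
                                       seen_in_subset_of_individuals=True)))
print(classify_gene_loss(CurationNotes(inactivating_event_count=0)))
```

Conservation of the mutation in a sister species, criterion (iii) for 'Full', is omitted from the sketch because it strengthens a 'Full' call rather than changing the outcome of these checks.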

When analysing the genes reported as lost, an overall review of the database shows that PCSK9 was the gene most frequently reported as lost, with 186 GLoss annotations emerging from two independent large multispecies studies [4, 21] and with a single GLoss annotation in a third study [4]. Since these annotations come from independent sources, redundant gene loss reports for PCSK9 in the same species were detected in 14 species. These duplicate annotations are referred to as ‘related Glosses’ in the database (see Fig. 4). Following PCSK9, the genes with the highest number of gene loss annotations are MTNR1B, CORT, and UCP1, with 67, 63, and 53 GLoss annotations, respectively.
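
Redundant reports of this kind can be detected by grouping annotations by gene and species. The following sketch assumes a flat list of (gene, species, source) tuples; this layout, the function name `find_related_glosses`, and the placeholder species and study labels are hypothetical and do not reflect the database schema.

```python
from collections import defaultdict


def find_related_glosses(annotations):
    """Group annotations by (gene, species) and keep pairs reported by more than one source."""
    sources = defaultdict(set)
    for gene, species, source in annotations:
        sources[(gene, species)].add(source)
    return {key: sorted(refs) for key, refs in sources.items() if len(refs) > 1}


# Placeholder data: the same gene loss reported for one species by two independent studies.
annotations = [
    ("PCSK9", "Species A", "study 1"),
    ("PCSK9", "Species A", "study 2"),
    ("PCSK9", "Species B", "study 1"),
    ("UCP1", "Species A", "study 1"),
]
print(find_related_glosses(annotations))
```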

The cetacean collection

Currently, the cetacean dataset comprises 1872 GLoss annotations referencing the loss of 314 genes in 57 cetacean species, including 15 Mysticeti and 42 Odontoceti species (Fig. 6A). The species with the highest number of GLoss annotations are Tursiops truncatus, Balaenoptera acutorostrata, Physeter macrocephalus, and Orcinus orca, each with over 200 annotations. It is important to note that this high number of reported gene losses in these species primarily reflects the early availability of their genome assemblies in public databases and the extensive research conducted on these organisms, rather than a greater propensity for gene loss. Furthermore, this does not imply that these genes remain intact in other cetaceans, which have not been investigated.

Figure 6. (A) Comparison of genes lost in all Mysticeti and Odontoceti species annotated in the Gene Loss DB. (B) Comparison of genes lost in the four species with the highest number of GLoss annotations. (C) Comparison of lost genes shared between T. truncatus and B. acutorostrata (198) with genes reported in the literature as lineage-specific losses in the cetacean ancestor (120). (D) Schematic representation of gene loss data in the cetacean lineage (Venn diagrams prepared using dataset intersections at molbiotools.com).

A comparative analysis of gene losses across the four species with the most data identified 164 shared lost genes. When focusing on the two species with the highest number of GLoss annotations, T. truncatus (Odontoceti) and B. acutorostrata (Mysticeti), each representing a major cetacean lineage, the number of shared gene losses increases to 198 (Fig. 6B). The occurrence of these gene losses in both lineages suggests that they may have taken place before the divergence of Odontoceti and Mysticeti in the cetacean ancestral Archaeoceti (~50–35 Mya) [47] or depict parallel adaptation scenarios under similar environmental constraints. In the current cetacean data collection, curators aimed to determine whether reference manuscripts reported gene losses in the cetacean ancestor and/or if these losses could be attributed to a single mutational event (lineage-specific) in the cetacean ancestor. Through this analysis, a total of 120 genes were identified as having been lost in the ancestral cetacean lineage (~50 Mya) [47]. When comparing this list with the 198 genes found to be lost in both T. truncatus and B. acutorostrata, we identified an overlap of 110 genes. The observed overlap supports the hypothesis that genes absent in both Mysticeti and Odontoceti were predominantly lost in the cetacean ancestral Archaeoceti, though independent early losses in each lineage cannot be excluded in all cases. Based on this, we infer that an additional 88 genes, previously not classified as lineage-specific losses, are also potentially lost in the cetacean ancestor (Fig. 6C). This brings the total number of putatively lost genes in the cetacean lineage to 208 genes. The remaining 61 and 91 genes appear to have been lost after the divergence of the Mysticeti and Odontoceti lineages, respectively. 
Although the Gene Loss DB is not yet exhaustive, this preliminary analysis suggests a higher rate of gene loss in the ancestral cetacean lineage, consistent with an early phase of accelerated evolution. This observation agrees with previous studies, which also found an initial phase of rapid evolution in stem cetaceans in the early to middle Eocene (50–42 Mya) [48]. Further analysis of the genes lost after the divergence of the Mysticeti and Odontoceti lineages reveals lineage-specific adaptations, such as the loss of genes related to tooth development in Mysticeti, as previously reported [23, 34].
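
The counts behind the estimate of 208 putative ancestral losses follow from simple set arithmetic, reproduced below as a check. Only the set sizes reported in the text are used; the variable names are ours.

```python
# Set sizes reported in the text.
shared_both_lineages = 198  # genes lost in both T. truncatus and B. acutorostrata
ancestral_reported = 120    # genes reported in the literature as lost in the cetacean ancestor
overlap = 110               # genes present in both lists

# Genes shared by both lineages but not previously classified as ancestral losses.
additional_candidates = shared_both_lineages - overlap

# Size of the union of the two lists (inclusion-exclusion).
putative_ancestral_total = shared_both_lineages + ancestral_reported - overlap

print(additional_candidates)     # 88
print(putative_ancestral_total)  # 208
```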

To gain insight into the main biological processes affected by the loss of these genes, a GO term analysis was performed [38, 39]. For this analysis, the 208 genes reported as pseudogenized in cetaceans in Gene Loss DB were compiled and queried against GO databases to test for overrepresentation [38, 39]. The results revealed significant enrichment in several biological processes (Table 3). As expected, many of these processes are linked to specific adaptations of these mammals to the aquatic environment. Examples include the remodelling of melatonin biosynthetic and metabolic processes, which has been linked to the altered circadian rhythm of cetaceans [14], and the remodelling of the skin phenotype, characterized by the loss of fur and sebaceous glands and by modifications in the skin barrier [13, 18, 29, 30]. We also observed a significant enrichment of genes associated with keratinocyte and epidermal development, as well as cell differentiation. These findings support the hypothesis of extensive evolutionary modifications in skin development, which likely occurred in ancestral lineages, consistent with previous reports [4, 13, 16, 18]. Additionally, we observed a significant enrichment of genes related to sensory perception, such as taste, possibly reflecting adaptations to aquatic diets, and of genes related to the visual sensory system, suggesting adaptations to underwater vision [26, 36].
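
As a reading aid for Table 3, the sketch below shows how fold enrichment relates to the observed and expected counts in an overrepresentation test. It is a minimal illustration, not the analysis pipeline: the function names are ours, the reference size passed to `expected_count` is a placeholder, and because the expected counts in Table 3 are rounded, recomputed fold values only approximate the reported ones.

```python
def expected_count(term_size: int, list_size: int, reference_size: int) -> float:
    """Genes expected to carry a GO term by chance in a list of `list_size` genes."""
    return term_size * list_size / reference_size


def fold_enrichment(observed: int, expected: float) -> float:
    """Ratio of observed to expected gene counts for one GO term."""
    return observed / expected


# Placeholder numbers: a term annotating 100 of 20000 reference genes, query list of 200 genes.
print(expected_count(100, 200, 20000))

# (term, observed, expected, reported fold enrichment) taken from Table 3.
rows = [
    ("Keratinization", 7, 0.25, 28.35),
    ("Skin development", 10, 0.38, 26.33),
    ("Response to chemical", 23, 9.33, 2.47),
]
for term, observed, expected, reported in rows:
    recomputed = fold_enrichment(observed, expected)
    print(f"{term}: recomputed {recomputed:.2f}, reported {reported}")
```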

Table 3.

GO term biological process enrichment analysis.

PAN-GO biological process | Homo sapiens | Obs | Expect | Fold enrich | Raw P-value | FDR
Melatonin biosynthetic process | 2 | 2 | 0.02 | >100 | 8.97E-05 | 1.49E-02
Melatonin metabolic process | 2 | 2 | 0.02 | >100 | 8.97E-05 | 1.46E-02
Sensory perception of umami taste | 2 | 2 | 0.02 | >100 | 8.97E-05 | 1.43E-02
Sensory perception of sweet taste | 3 | 3 | 0.03 | >100 | 8.43E-07 | 2.43E-04
Regulation of water loss via skin | 3 | 2 | 0.03 | 70.21 | 2.67E-04 | 3.86E-02
Indole-containing compound biosynthetic process | 3 | 2 | 0.03 | 70.21 | 2.67E-04 | 3.79E-02
Establishment of skin barrier | 3 | 2 | 0.03 | 70.21 | 2.67E-04 | 3.72E-02
Pyroptosis | 5 | 3 | 0.05 | 63.18 | 8.32E-06 | 1.70E-03
Inflammatory response to antigenic stimulus | 8 | 4 | 0.08 | 52.65 | 5.36E-07 | 1.61E-04
Sensory perception of taste | 33 | 9 | 0.31 | 28.72 | 1.66E-11 | 2.58E-08
Keratinization | 26 | 7 | 0.25 | 28.35 | 3.53E-09 | 2.50E-06
Skin development | 40 | 10 | 0.38 | 26.33 | 3.14E-12 | 6.11E-09
Keratinocyte differentiation | 29 | 7 | 0.28 | 25.42 | 8.19E-09 | 5.31E-06
Intermediate filament organization | 63 | 13 | 0.60 | 21.73 | 2.37E-14 | 1.85E-10
Detection of chemical stimulus involved in sensory perception of bitter taste | 30 | 6 | 0.28 | 21.06 | 3.34E-07 | 1.18E-04
Sensory perception of bitter taste | 30 | 6 | 0.28 | 21.06 | 3.34E-07 | 1.13E-04
Detection of chemical stimulus involved in sensory perception of taste | 30 | 6 | 0.28 | 21.06 | 3.34E-07 | 1.08E-04
Intermediate filament cytoskeleton organization | 72 | 13 | 0.68 | 19.01 | 1.49E-13 | 5.80E-10
Epidermal cell differentiation | 39 | 7 | 0.37 | 18.90 | 7.45E-08 | 3.62E-05
Intermediate filament-based process | 73 | 13 | 0.69 | 18.75 | 1.80E-13 | 4.67E-10
Epidermis development | 52 | 9 | 0.49 | 18.23 | 1.35E-09 | 1.17E-06
Detection of chemical stimulus involved in sensory perception | 54 | 8 | 0.51 | 15.60 | 4.11E-08 | 2.28E-05
Detection of stimulus involved in sensory perception | 58 | 8 | 0.55 | 14.53 | 7.33E-08 | 3.80E-05
Epithelial cell differentiation | 91 | 12 | 0.86 | 13.89 | 6.30E-11 | 8.18E-08
Detection of chemical stimulus | 63 | 8 | 0.60 | 13.37 | 1.42E-07 | 5.83E-05
Detection of stimulus | 96 | 11 | 0.91 | 12.07 | 1.87E-09 | 1.46E-06
Sensory perception of chemical stimulus | 116 | 12 | 1.10 | 10.89 | 1.12E-09 | 1.09E-06
Epithelium development | 134 | 13 | 1.27 | 10.22 | 4.98E-10 | 5.53E-07
Sensory perception | 173 | 12 | 1.64 | 7.30 | 1.03E-07 | 4.72E-05
Tissue development | 255 | 13 | 2.42 | 5.37 | 1.05E-06 | 2.92E-04
Nervous system process | 271 | 13 | 2.57 | 5.05 | 2.06E-06 | 4.87E-04
Response to bacterium | 240 | 11 | 2.28 | 4.83 | 1.96E-05 | 3.72E-03
Supramolecular fiber organization | 309 | 13 | 2.93 | 4.43 | 8.61E-06 | 1.72E-03
System process | 391 | 13 | 3.71 | 3.50 | 9.85E-05 | 1.50E-02
Defense response | 467 | 14 | 4.43 | 3.16 | 1.58E-04 | 2.37E-02
Response to external biotic stimulus | 442 | 13 | 4.20 | 3.10 | 3.26E-04 | 4.38E-02
Response to other organism | 442 | 13 | 4.20 | 3.10 | 3.26E-04 | 4.30E-02
Response to biotic stimulus | 448 | 13 | 4.25 | 3.06 | 3.71E-04 | 4.81E-02
Response to chemical | 982 | 23 | 9.33 | 2.47 | 6.19E-05 | 1.05E-02

Conclusions

One of the major challenges in the genomic era is the ability to compile and analyse increasingly large volumes of data. The rapid growth in available genomic sequences has expanded the raw material for identifying evolutionary and functional patterns. At the same time, however, it has exposed important analytical limitations, particularly the frequent misannotation of secondary gene loss by automated annotation pipelines [11]. As a result, gene loss research is gaining momentum not only for its evolutionary relevance but also because it offers valuable insights into biological processes by acting as a source of natural knockouts. Although efforts have been made to streamline the annotation and/or identification of pseudogenes [24, 25], secondary gene loss events have been largely reported and validated in scattered manuscripts, with no centralized database systematically collecting and integrating this information.

The Gene Loss DB addresses this gap by systematically aggregating and organizing gene loss information in a single, user-friendly resource, making it compatible with FAIR principles and Open Science [49]. By centralizing gene loss research, we overcome the archival fragmentation of data, facilitating comparative analyses, improving data discoverability and traceability, and enabling novel connections that might otherwise be overlooked.

While the current collection focuses on cetaceans, the database is expanding to incorporate additional taxonomic groups with relevance to evolutionary and health research. Future collections will include species that serve as natural models for human disease, further bridging the gap between evolutionary genetics and biomedical applications. As more data becomes available, this resource will provide deeper insights into gene function, natural knockouts, and disease mechanisms, reinforcing the importance of gene loss studies in both evolutionary and biomedical sciences.

Acknowledgements

We acknowledge the support of CORAL AI, an instrumental tool that fast-tracked the reading and extraction of data from numerous manuscripts. We also acknowledge NCBI, PantherDB, and Gene Ontology for their APIs, which contributed significantly to the functionality of this project. Their documentation and support were essential in implementing the Gene Loss Database. We dedicate this work to the memory of Bernardo Pinto, whose enthusiasm and contributions during the early stages of this project were deeply valued. His joy and curiosity remain an inspiration to us.

With the publication of this database, we extend an open invitation to the research community to contribute. Researchers are encouraged to identify and communicate relevant articles for curation and inclusion in the Gene Loss Database via our message box on https://geneloss.org/about. Alternatively, researchers can contact us to propose and lead a curation effort for a specific species or group of species within their research focus or interest.

Author contributions

Mónica Lopes-Marques: Conceptualization, Formal analysis, Methodology, Validation, Writing—original draft, preparing final manuscript, and submission. Sergio Fernandes: Software development and implementation of the full database, including back-end and front-end office. Server management and maintaining the database online. Mónica Lopes-Marques, Gonçalo E. Themudo, Raul Valente, Nadia Artilheiro, Inês Amorim, Diogo Oliveira, and Bernardo Pinto: Expert curators for data collection and input into the database, review and validation of data, and critical contributions to the manuscript. Gonçalo E. Themudo, Raquel Ruivo, and L. Filipe C. Castro: Critical feedback and optimization of the database workflow, writing, and review of the manuscript. All authors read and reviewed the final version of the manuscript.

Conflict of interest

The authors declare that they have no conflicts of interest.

Funding

This work was supported by the Fundação para a Ciência e Tecnologia, Portugal (2022.00397.CEECIND/CP1728/CT0006 to M.L.-M., 2023.07615.CEECIND/CP2848/CT0007 to R.R., and CEECINST/00133/2018/CP1510/CT0004 to L.F.C.C.). This research was supported by strategic funding to CIIMAR (UIDB/04423/2020 and UIDP/04423/2020) through national funds provided by FCT—Fundação para a Ciência e a Tecnologia, and Marma-Detox (project no. 334739) funded by the Research Council of Norway.

Data availability

Gene Loss DB does not require user registration and is available online at https://geneloss.org.

References

1. Cañestro C, Albalat R, Irimia M et al. Impact of gene gains, losses and duplication modes on the origin and diversification of vertebrates. Semin Cell Dev Biol. 2013;24:83–94.

2. Albalat R, Cañestro C. Evolution by gene loss. Nat Rev Genet. 2016;17:379–91.

3. Olson MV. When less is more: gene loss as an engine of evolutionary change. Am J Hum Genet. 1999;64:18–23.

4. Sharma V, Hecker N, Roscito JG et al. A genomics approach reveals insights into the importance of gene losses for mammalian adaptations. Nat Commun. 2018;9:1215.

5. Ochman H, Moran NA. Genes lost and genes found: evolution of bacterial pathogenesis and symbiosis. Science. 2001;292:1096–99.

6. Clark JW. Genome evolution in plants and the origins of innovation. New Phytol. 2023;240:2204–09.

7. Guijarro-Clarke C, Holland PWH, Paps J. Widespread patterns of gene loss in the evolution of the animal kingdom. Nat Ecol Evol. 2020;4:519–23.

8. Török ME, Chantratita N, Peacock SJ. Bacterial gene loss as a mechanism for gain of antimicrobial resistance. Curr Opin Microbiol. 2012;15:583–87.

9. Eckhart L, Ballaun C, Hermann M et al. Identification of novel mammalian caspases reveals an important role of gene loss in shaping the human caspase repertoire. Mol Biol Evol. 2008;25:831–41.

10. Castro LF, Gonçalves O, Mazan S et al. Recurrent gene loss correlates with the evolution of stomach phenotypes in gnathostome history. Proc Biol Sci. 2014;281:20132669.

11. Thibaud-Nissen F S.A., Murphy T et al. The NCBI Handbook. 2nd edn. Bethesda, MD: National Center for Biotechnology Information, 2013.

12. Blumer M, Brown T, Freitas MB et al. Gene losses in the common vampire bat illuminate molecular adaptations to blood feeding. Sci Adv. 2022;8:eabm6494.

13. Lopes-Marques M, Machado AM, Alves LQ et al. Complete inactivation of sebum-producing genes parallels the loss of sebaceous glands in cetacea. Mol Biol Evol. 2019;36:1270–80.

14. Lopes-Marques M, Ruivo R, Alves LQC et al. The singularity of cetacea behavior parallels the complete inactivation of melatonin gene modules. Genes. 2019;10:121.

15. Pinto B, Valente R, Caramelo F et al. Decay of skin-specific gene modules in pangolins. J Mol Evol. 2023;91:458–70.

16. Lopes-Marques M, Machado AM, Barbosa S et al. Cetacea are natural knockouts for IL20. Immunogenetics. 2018;70:681–87.

17. Lopes-Marques M, Ruivo R, Fonseca E et al. Unusual loss of chymosin in mammalian lineages parallels neo-natal immune transfer strategies. Mol Phylogenet Evol. 2017;116:78–86.

18. Nery MF, Arroyo JI, Opazo JC. Increased rate of hair keratin gene loss in the cetacean lineage. BMC Genomics. 2014;15:869.

19. Zhu K, Zhou X, Xu S et al. The loss of taste genes in cetaceans. BMC Evol Biol. 2014;14:218.

20. Lang D, Wang X, Liu C et al. Birth-and-death evolution of ribonuclease 9 genes in Cetartiodactyla. Sci China Life Sci. 2023;66:1170–82.

21. van Asch B, Teixeira da Costa LF. Patterns and tempo of PCSK9 pseudogenizations suggest an ancient divergence in mammalian cholesterol homeostasis mechanisms. Genetica. 2021;149:1–19.

22. Meredith RW, Gatesy J, Cheng J et al. Pseudogenization of the tooth gene enamelysin (MMP20) in the common ancestor of extant baleen whales. Proc Biol Sci. 2011;278:993–1002.

23. Meredith RW, Gatesy J, Murphy WJ et al. Molecular decay of the tooth gene Enamelin (ENAM) mirrors the loss of enamel in the fossil record of placental mammals. PLoS Genet. 2009;5:e1000634.

24. Alves LQ, Ruivo R, Fonseca MM et al. PseudoChecker: an integrated online platform for gene inactivation inference. Nucleic Acids Res. 2020;48:W321–31.

25. Kirilenko BM, Munegowda C, Osipova E et al. Integrating gene annotation with orthology inference at scale. Science. 2023;380:eabn3107.

26. Emerling CA. Regressed but not gone: patterns of vision gene loss and retention in subterranean mammals. Integr Comp Biol. 2018;58:441–51.

27. Huelsmann M, Hecker N, Springer MS et al. Genes lost during the transition from land to water in cetaceans highlight genomic changes associated with aquatic adaptations. Sci Adv. 2019;5:eaaw6671.

28. Meredith RW, Gatesy J, Emerling CA et al. Rod monochromacy and the coevolution of cetacean retinal opsins. PLoS Genet. 2013;9:e1003432.

29. Nam K, Lee KW, Chung O et al. Analysis of the FGF gene family provides insights into aquatic adaptation in cetaceans. Sci Rep. 2017;7:40233.

30. Zhang X, Chi H, Li G et al. Parallel independent losses of G-type lysozyme genes in hairless aquatic mammals. Genome Biol Evol. 2021;13:evab201.

31. Holthaus KB, Lachner J, Ebner B et al. Gene duplications and gene loss in the epidermal differentiation complex during the evolutionary land-to-water transition of cetaceans. Sci Rep. 2021;11:12334.

32. Lachner J, Mlitz V, Tschachler E et al. Epidermal cornification is preceded by the expression of a keratinocyte-specific set of pyroptosis-related genes. Sci Rep. 2017;7:17446.

33. Lopes-Marques M, Alves LQ, Fonseca MMC et al. Convergent inactivation of the skin-specific C-C motif chemokine ligand 27 in mammalian evolution. Immunogenetics. 2019;71:363–72.

34. Randall JG, Gatesy J, Springer MS. Molecular evolutionary analyses of tooth genes support sequential loss of enamel and teeth in baleen whales (Mysticeti). Mol Phylogenet Evol. 2022;171:107463.

35. Deméré TA, McGowen MR, Berta A et al. Morphological and molecular evidence for a stepwise evolutionary transition from teeth to baleen in Mysticete whales. Syst Biol. 2008;57:15–37.

36. McGowen MR, Tsagkogeorga G, Williamson J et al. Positive selection and inactivation in the vision and hearing genes of cetaceans. Mol Biol Evol. 2020;37:2069–83.

37. Levenson DH, Dizon A. Genetic evidence for the ancestral loss of short-wavelength-sensitive cone pigments in mysticete and odontocete cetaceans. Proc Biol Sci. 2003;270:673–9.

38. Thomas PD, Ebert D, Muruganujan A et al. PANTHER: making genome-scale phylogenetics accessible to all. Protein Sci. 2022;31:8–22.

39. Mi H, Thomas P. PANTHER pathway: an ontology-based pathway database coupled with data analysis tools. Methods Mol Biol. 2009;563:123–40.

40. Sayers EW, Beck J, Bolton EE et al. Database resources of the National Center for Biotechnology Information in 2025. Nucleic Acids Res. 2025;53:D20–29.

41. Wei C-H, Allot A, Lai P-T et al. PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge. Nucleic Acids Res. 2024;52:W540–46.

42. Espregueira Themudo G, Alves LQ, Machado AM et al. Losing genes: the evolutionary remodeling of cetacea skin. Front Mar Sci. 2020;7:2020.

43. Lopes-Marques M, Serrano C, Cardoso AR et al. GBA3: a polymorphic pseudogene in humans that experienced repeated gene loss during mammalian evolution. Sci Rep. 2020;10:11565.

44. Lopes-Marques M, Peixoto MJ, Cooper DN et al. Polymorphic pseudogenes in the human genome - a comprehensive assessment. Hum Genet. 2024;143:1465–79.

45. Steinbinder J, Sachslehner AP, Holthaus KB et al. Comparative genomics of sirenians reveals evolution of filaggrin and caspase-14 upon adaptation of the epidermis to aquatic life. Sci Rep. 2024;14:9278.

46. Valente R, Alves LQ, Nabais M et al. Convergent Cortistatin losses parallel modifications in circadian rhythmicity and energy homeostasis in Cetacea and other mammalian lineages. Genomics. 2021;113:1064–70.

47. Mancia A. On the revolution of cetacean evolution. Mar Genomics. 2018;41:1–5.

48. Coombs EJ, Felice RN, Clavel J et al. The tempo of cetacean cranial evolution. Curr Biol. 2022;32:2233–2247.e4.

49. Wilkinson MD, Dumontier M, Aalbersberg IJ et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018.

Author notes

Deceased.

Sergio Fernandes and Mónica Lopes-Marques contributed equally.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary data