The Annotation-enriched non-redundant patent sequence databases

Abstract

The EMBL-European Bioinformatics Institute (EMBL-EBI) offers public access to patent sequence data, providing a valuable service to the intellectual property and scientific communities. The non-redundant (NR) patent sequence databases comprise two-level nucleotide and protein sequence clusters (NRNL1, NRNL2, NRPL1 and NRPL2) based on sequence identity (level-1) and patent family (level-2). Annotation from the source entries in these databases is merged and enhanced with additional information from the patent literature and biological context. Corrections in patent publication numbers, kind-codes and patent equivalents significantly improve the data quality. Data are available through various user interfaces including web browser, downloads via FTP, SRS, Dbfetch and EBI-Search. Sequence similarity/homology searches against the databases are available using BLAST, FASTA and PSI-Search. In this article, we describe the data collection and annotation and also outline major changes and improvements introduced since 2009. Apart from data growth, these changes include additional annotation for singleton clusters, the identifier versioning for tracking entry change and the entry mappings between the two-level databases.

Database URL: http://www.ebi.ac.uk/patentdata/nr/

Introduction

Patent data are a valuable resource, not only for the intellectual property world but also for the scientific community (1,2). During the past 15 years, the number of biological sequences appearing in patent documents has been increasing constantly (3). Today, >30 million nucleotide and protein sequences extracted from patent documents are available in the public domain (shown by the black lines in Figure 1). Searching this large amount of patent sequence data has become one of the key approaches in patent-related studies (4,5). Proprietary data from the commercial sector also exist and provide alternative annotations of patent sequence data, such as GENESEQ™ (Thomson Reuters, http://thomsonreuters.com/products_services/science/science_products/a-z/geneseq/), GQ-PAT (GenomeQuest, http://wiki.genomequest.com/index.php/GQ_Pat) and USGENE (SequenceBase, http://www.sequencebase.com/usgene-sequences-database), but these require commercial licenses, which impose usage restrictions on the data.

Figure 1. Data growth of patent sequence data. The left-side Y-axis shows the number of sequence entries; the right-side Y-axis indicates the number of patents and patent families; the X-axis represents the release timeline. The black lines show the increasing number of source biological sequences; other coloured lines illustrate the trends of the NR patent sequence databases following the increase in source data. Note: The number of entries of level-2 clusters (NRNL2 and NRPL2) can decrease due to deletions and merging of patent family assignments and patent corrections, for example, in the cases of Release 10 (Oct 2011) and Release 13 (Oct 2012).

The EMBL-European Bioinformatics Institute (EMBL-EBI) offers free and unrestricted access to patent sequence resources, providing a valuable service to the intellectual property and bioscience communities (6). The two-level non-redundant (NR) patent sequence databases, based on sequence identity and patent family clusters, are comprehensive repositories for patent information on nucleotide and protein sequences provided by the European Patent Office (EPO), the US Patent and Trademark Office (USPTO), the Japanese Patent Office (JPO) and the Korean Intellectual Property Office (KIPO) and include the World Intellectual Property Organization (WIPO) patents from these offices. The NR patent sequences are enriched with biological annotation and additional data from patent documents. These databases, serving as a repository of scientific innovation and inspiration, are an important resource for patent-related searches, especially for determining potential commercial use of biological sequences and their patentability. In this article, we describe the sequence collection and annotation of the NR patent sequence databases, and introduce improvements and development of the databases over the past 3 years.

Sequence Collection and Annotation

The NR patent sequence data sources cover nucleotide and protein sequences in patent applications from the EPO, the JPO, the KIPO and the USPTO. The patent sequence data deposited to ENA (7), GenBank (8) or the DDBJ (9) are exchanged between these databases through the International Nucleotide Sequence Database Collaboration (Figure 2a). Inventors can submit sequences as part of the patent application process using tools such as BISSAP (http://www.epo.org/bissap/), an application developed by the EPO in collaboration with national patent offices and the EMBL-EBI to facilitate the creation of sequence listings (WIPO ST.25 and proposed XML format) for patent applications containing biological sequences.

Figure 2. Data flow for the NR patent sequence databases. (a) Data sources consist of patent sequences from the patent offices of the EPO, the JPO, the KIPO and the USPTO, as well as the patent family data from the OPS. (b) Data collection and annotation. The resulting databases include the sequence clusters level-1 (NRNL1, NRPL1, EPOPNR, JPOPNR, KPOPNR and USPOPNR) and level-2 (NRNL2 and NRPL2), the patent equivalent database and other relevant result files. (c) Data access through FTP, Dbfetch, SRS, EBI-Search and SSS (Sequence Similarity/Homology Search).

The NR patent sequence databases have been created at two levels to remove sequence redundancy by using sequence MD5 (Message-Digest algorithm 5, http://www.faqs.org/rfcs/rfc1321.html) checksums and patent family information, comprising NR patent nucleotides level-1 and -2 and NR patent proteins level-1 and -2. Level-1 sequences are 100% identical over their entire lengths, arising from either the same or different patent families; level-2 sequences are 100% identical over their entire length and belong to the same patent family. Patent family information for source sequences is retrieved from the EPO Open Patent Services (OPS) (10). Level-1 databases include NR nucleotide patent sequence clusters level-1 (NRNL1), NR protein patent sequence clusters level-1 (NRPL1) and NR protein patent sequence clusters from individual patent offices (EPOPNR for the EPO, JPOPNR for the JPO, KPOPNR for the KIPO and USPOPNR for the USPTO). Level-2 databases contain NR nucleotide patent sequence clusters level-2 (NRNL2) and NR protein patent sequence clusters level-2 (NRPL2) (Figure 2b). The method used to remove sequence redundancy is detailed in an article by Li et al. 2010 (6).
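The two clustering levels described above can be sketched as a grouping by checksum. This is only an illustration, not the production pipeline, which is detailed in (6). The record layout, the toy accessions and family labels, and the normalisation step (upper-casing and stripping whitespace before hashing) are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def md5_checksum(sequence: str) -> str:
    """MD5 hex digest of a sequence (assumed normalisation: no whitespace, upper case)."""
    normalised = "".join(sequence.split()).upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest()


def cluster(records):
    """Group (accession, sequence, family) records at both levels.

    Level-1 key: sequence checksum only.
    Level-2 key: (sequence checksum, patent family).
    """
    level1 = defaultdict(list)
    level2 = defaultdict(list)
    for accession, sequence, family in records:
        digest = md5_checksum(sequence)
        level1[digest].append(accession)
        level2[(digest, family)].append(accession)
    return level1, level2


# Toy records (all values invented): accession, sequence, patent family label.
records = [
    ("A1", "ACGTACGT", "F1"),
    ("A2", "acgt acgt", "F1"),
    ("A3", "ACGTACGT", "F2"),
    ("A4", "TTTTCCCC", "F2"),
]
level1, level2 = cluster(records)
print(len(level1), "level-1 clusters;", len(level2), "level-2 clusters")
```

In this toy input, A1, A2 and A3 share one level-1 cluster because their sequences are identical, but A3 falls into a separate level-2 cluster because it belongs to a different family. Sequences with an unknown family would need their own handling, which this sketch does not attempt.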

The patent equivalents database has also been developed to provide patent family information extracted from the OPS service for the sequences collected in this study (Figure 2b). In patents, a right of priority is a time-limited right triggered by the first filing of a patent application. A patent family refers to several patent applications or publications for an individual invention that claim exactly the same priority or priorities; all of these family equivalents are related to each other by common priority numbers and associated priority dates (http://www.epo.org/searching/essentials/patent-families/about.html). The family information in the database covers patent family numbers, patent priority, master publications, patent equivalents, subsequent publication levels and patent classification. The database format is detailed in the user manual (http://www.ebi.ac.uk/patentdata/doc/Family_equivalents_database_v3.pdf).
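As a rough illustration of the family definition given here (publications claiming exactly the same priority or priorities), the following sketch groups publications by their set of priority numbers. The publication and priority identifiers are invented, and in the databases the family assignments come from OPS rather than from a computation like this one.

```python
from collections import defaultdict


def group_into_families(publications):
    """Map each exact set of priority numbers to the publications that claim it."""
    families = defaultdict(list)
    for publication_number, priorities in publications:
        families[frozenset(priorities)].append(publication_number)
    return dict(families)


# Hypothetical publications: (publication number, claimed priority numbers).
publications = [
    ("EP0000001", ["US-P-100"]),
    ("US0000002", ["US-P-100"]),
    ("JP0000003", ["US-P-100", "US-P-101"]),
]
families = group_into_families(publications)
for priorities, members in families.items():
    print(sorted(priorities), "->", members)
```

The third publication claims an additional priority, so under the "exactly the same priorities" definition it is not an equivalent of the first two.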

The annotation of the NR sequences comprises cluster member annotation, patent family information and biological features. The cluster member annotation includes source sequence information, e.g. identifier (ID), molecular type, sequence length, source database, patent number and a general description. The patent family information consists of family number, master publication, patent priority, earliest publication date and the EPO and international classifications. The earliest publication date, which helps to identify relevant prior art, is determined by comparing the patent publication dates of all the members of an NR sequence cluster. The biological features contain information on organisms, coding sequence regions, genes and variations, combined for both contig and singleton members. This combined annotation allows better exploration of the original patent applications for related intellectual property data. It also provides better cross-references to external data resources and improves the biological context at the sequence level. The annotation format is detailed in the user manual (http://www.ebi.ac.uk/patentdata/doc/Non-redundant_databases-user_manual_v3.pdf).
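The earliest publication date of a cluster is simply the minimum over its members' publication dates. A minimal sketch follows, assuming each member is represented as a dictionary with a hypothetical `publication_date` field; the member IDs and dates are invented.

```python
from datetime import date


def earliest_publication_date(members):
    """Return the earliest publication date among cluster members, skipping missing dates."""
    dates = [m["publication_date"] for m in members if m.get("publication_date")]
    return min(dates) if dates else None


# Hypothetical cluster members; one has no recorded publication date.
members = [
    {"id": "M1", "publication_date": date(2003, 5, 14)},
    {"id": "M2", "publication_date": date(2001, 11, 2)},
    {"id": "M3", "publication_date": None},
]
print(earliest_publication_date(members))
```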

Data Growth and Improvements

The NR patent sequence databases are released every 3 or 4 months, usually following EMBL-Bank’s quarterly release cycle. The current release (Release 13, Oct 2012) contains 12 279 969 NRNL1, 14 920 929 NRNL2, 2 580 442 NRPL1 and 3 697 317 NRPL2 entries, which are ∼2.4-, 2.2-, 1.9- and 1.6-fold the size of the first releases of NRNL1, NRNL2, NRPL1 and NRPL2, respectively. These entries cover 6 571 318 proteins and 24 364 832 nucleotides from 184 447 patents (130 538 unique patent families) provided by the patent offices of the EPO, the JPO, the KIPO and the USPTO (Table 1, Figure 1). The data coverage is slightly larger than that of the commercial patent sequence database GENESEQ, which included >27 million sequences from >150 000 patents in Oct 2012 (http://thomsonreuters.com/products_services/science/science_products/a-z/geneseq/).

Table 1. Summary of the NR patent sequences and the patent families in Release 13

Data set	Number of entries	Redundancy before
Patent nucleotides	24 364 832
NRNL1	12 279 969	1.98
NRNL2	14 920 929	2.22
Patent proteins	6 571 318
NRPL1	2 580 442	1.88
NRPL2	3 697 317	1.62
Patents	184 447
Unique patent families	130 538	1.41

Patent publication numbers, sequence kind-codes and patent equivalents are corrected or updated in each release using the latest patent family data from the EPO’s OPS. Across all releases, 43 111 patent numbers and 14 330 sequence kind-codes have been corrected; 102 227 patent numbers have been involved in the patent family assignment. The corrected publication numbers link to the correct full-text patent documents; the corrected publication kind codes and the publication levels indicate the legal status and progress through the patent application process.

The ID mappings between level-1 and level-2 databases have been generated since Release 10 to clearly illustrate how identical sequences from level-1 databases are mapped to level-2 database entries according to their patent family information. Figure 3 has two examples that illustrate how sequences from level-1 nucleotide and protein sequences are clustered into level-2 entries. These mappings offer a useful explanation of the relationship between identical sequences within or outside of a patent family.

Figure 3. Two example entries illustrating the mapping between identical sequences from level-1 to level-2. (a) The NRNL1 entry NRN_AX241249 contains five member sequences, which are 100% identical over their full length but clustered into four NRNL2 entries according to their patent family information: NRN00208E35 (family number 22673211, containing the member sequences AX241249 and DJ381174), NRN00208E36 (family number 27401191, containing the member sequence AX487735), NRN00208E37 (family number 32911719, containing the member sequence AR579342) and NRN00208E38 (DI090734 as member sequence and family number unknown). (b) The NRPL1 entry NRP_AX240833 contains four member sequences that are clustered into three NRPL2 entries according to their patent family information.
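The example in Figure 3a can be written out as a small lookup structure to show what a level-1 to level-2 mapping expresses. The IDs, family numbers and member accessions are taken from the figure caption. The in-memory layout and the `build_mapping` helper are assumptions for illustration and do not reflect the format of the distributed mapping files.

```python
# Level-2 entries of the NRNL1 entry NRN_AX241249, as listed for Figure 3a:
# (NRNL2 ID, patent family number or None if unknown, member accessions).
LEVEL2_ENTRIES = [
    ("NRN00208E35", "22673211", ["AX241249", "DJ381174"]),
    ("NRN00208E36", "27401191", ["AX487735"]),
    ("NRN00208E37", "32911719", ["AR579342"]),
    ("NRN00208E38", None, ["DI090734"]),
]


def build_mapping(level1_id, level2_entries):
    """Return (level-1 ID -> level-2 IDs, member accession -> level-2 ID)."""
    level1_to_level2 = {level1_id: [entry_id for entry_id, _, _ in level2_entries]}
    member_to_level2 = {
        member: entry_id
        for entry_id, _, members in level2_entries
        for member in members
    }
    return level1_to_level2, member_to_level2


level1_to_level2, member_to_level2 = build_mapping("NRN_AX241249", LEVEL2_ENTRIES)
print(level1_to_level2["NRN_AX241249"])
print(member_to_level2["DJ381174"])
```

Five identical member sequences map to four level-2 entries because only AX241249 and DJ381174 share a patent family.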

Members of level-2 clusters in an old release can move to other clusters in a new release. This is due to changes in equivalents assignment in patent families. The ID versioning has been provided since Release 6 for direct tracking of entry history. This functionality is necessary for recovering information from old entries that have moved or have become obsolete in a new release.

Data Access and Usage

Data access to the NR patent sequence resources has become more and more important to the user community as the volume of sequence data increases. The NR patent sequence databases can be accessed through four major routes (Figure 2c) at EMBL-EBI:

The flat files can be downloaded through the databases website (http://www.ebi.ac.uk/patentdata/nr/) and through the FTP site (ftp://ftp.ebi.ac.uk/pub/databases/patentdata/).
The EMBL-like formatted annotation data can be retrieved on a per-accession basis through the Dbfetch/WSDbfetch service (http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/) and also through the SRS server (http://srs.ebi.ac.uk/).
Sequence similarity/homology searches including FASTA (11), BLAST (12,13) and PSI-Search (14) against the databases are available through the web form submissions (http://www.ebi.ac.uk/Tools/sss/) and also through the corresponding EMBL-EBI SOAP/REST web services (15).
Keyword searches can be made using the EBI-Search engine (16) through both a web form (http://www.ebi.ac.uk/ebisearch) and the corresponding SOAP web services.
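For per-accession retrieval, a Dbfetch request is typically expressed as a URL with query parameters. The sketch below only builds such a URL and performs no network access. The database name `nrnl1` and the parameter names `db`, `id`, `format` and `style` are assumptions based on common Dbfetch usage and should be checked against the current service documentation. The entry ID is the one shown in Figure 3.

```python
from urllib.parse import urlencode

# Base address as given in the text, without the trailing slash.
DBFETCH_BASE = "http://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def dbfetch_url(db, entry_id, fmt="default", style="raw"):
    """Build a Dbfetch-style retrieval URL (parameter names are assumptions)."""
    query = urlencode({"db": db, "id": entry_id, "format": fmt, "style": style})
    return f"{DBFETCH_BASE}?{query}"


url = dbfetch_url("nrnl1", "NRN_AX241249")
print(url)
```

The resulting URL could then be passed to any HTTP client. Service addresses and database names may have changed since publication.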

Approximately 10 000 sequence similarity/homology searches were performed against the databases during 2010. This grew to >36 000 searches in 2011, and an estimated ∼37 500 searches will have taken place during 2012. The same trend is seen for data retrieval via Dbfetch/WSDbfetch, which has grown from 450 000 in 2011 to a projected 510 000 for 2012. FTP downloads of these sequence data have also grown, from 394 downloads in 2011 to a projected 540 for 2012.

Discussion and Future Implementation

The NR patent sequence databases are the first publicly available collection of NR patent sequences, at both the sequence and patent-family levels. Other efforts in the public domain have been made to collate NR patent sequence data to improve access and use of these data, such as PatGen (17) and Patome (18). Unfortunately, PatGen is no longer available online; the sequence redundancy in Patome was defined according to the patent number and the sequence ID in the sequence listing. As a result, identical sequences granted with different patent numbers by different patent offices are not classified.

Sequence similarity/homology searching against the NR patent sequence databases has become a fundamental approach in patent-related studies. Searches against NR sequences are faster and more sensitive than the equivalent searches against redundant libraries, and the search results are easier to interpret. Searches against level-1 clusters can result in identical or similar patent sequences; searches against level-2 clusters can result in identical or similar sequences from the same invention. These searches can be used to find the published patents that cite a sequence and the patent families associated with a sequence, to discover the earliest priority data and the equivalents of a patent family, and to retrieve biological annotation extracted from patent documents.

The NR patent sequence databases are an important resource for patent-related searches, especially for determining potential commercial use of biological sequences. The earliest publication dates offer direct tracking of patent-application history, enabling effective searches on prior art. The corrections on the publication numbers and kind codes enhance the data quality, enabling proper cross-referencing to full-text patent documents. These databases are also a repository of scientific innovation and inspiration.

We will continue to make improvements and add new features. For example, we plan to broaden data coverage by including data from other national and regional patent offices, to shorten the release cycle to a monthly schedule, and to integrate cross-references to claimed sequences and provide their claimed status. Currently, users can download the ID history tables to track entry changes, such as status, and entry additions, deletions, merging and unmerging; in the future, an online searchable system will be implemented.

Funding

This work has been supported by the European Molecular Biology Laboratory (EMBL) and the European Patent Office (EPO). Funding for open access charge: EMBL.

Conflict of interest. None declared.

Acknowledgements

The authors want to acknowledge all database administrators, data curators and users at EMBL-EBI, DDBJ, NCBI and EPO, who have offered important support and valuable feedback throughout. Thanks to Andrew Cowley for additional manuscript input and language corrections.

References

1. Seeber F. Patent searches as a complement to literature searches in the life sciences—a ‘how-to’ tutorial, Nat. Protoc., 2007, vol. 2 (pg. 2418-2428)

2. Thangaraj H. Information from patent office could aid replication, Nature, 2007, vol. 447 (pg. 638)

3. Dufresne G, Duval M. Genetic sequences: how are they patented?, Nat. Biotechnol., 2004, vol. 22 (pg. 231-232)

4. Dufresne G, Takács L, Heus HC, et al. Patent searches for genetic sequences: how to retrieve relevant records from patented sequence databases, Nat. Biotechnol., 2002, vol. 20 (pg. 1269-1271)

5. McDowall J. Prioritizing patent sequence search results using annotation-rich data, World Pat. Inform., 2011, vol. 33 (pg. 235-239)

6. Li W, McWilliam H, Richart de la Torre A, et al. Non-redundant patent sequence databases with value-added annotations at two levels, Nucleic Acids Res., 2010, vol. 38 (pg. D52-D56)

7. Leinonen R, Akhtar R, Birney E, et al. The European nucleotide archive, Nucleic Acids Res., 2011, vol. 39 (pg. D28-D31)

8. Benson DA, Karsch-Mizrachi I, Clark K, et al. GenBank, Nucleic Acids Res., 2012, vol. 40 (pg. D48-D53)

9. Kaminuma E, Kosuge T, Kodama Y, et al. DDBJ progress report, Nucleic Acids Res., 2011, vol. 39 (pg. D22-D27)

10. Kallas P. Open patent services, World Pat. Inform., 2006, vol. 28 (pg. 296-304)

11. Pearson WR, Lipman DJ. Improved tools for biological sequence comparison, Proc. Natl. Acad. Sci. USA, 1988, vol. 85 (pg. 2444-2448)

12. Altschul SF, Madden TL, Schäffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res., 1997, vol. 25 (pg. 3389-3402)

13. Lopez R, Silventoinen V, Robinson S, et al. WU-Blast2 server at the European Bioinformatics Institute, Nucleic Acids Res., 2003, vol. 31 (pg. 3795-3798)

14. Li W, McWilliam H, Goujon M, et al. PSI-Search: iterative HOE-reduced profile SSEARCH searching, Bioinformatics, 2012, vol. 28 (pg. 1650-1651)

15. McWilliam H, Valentin F, Goujon M, et al. Web services at the European Bioinformatics Institute-2009, Nucleic Acids Res., 2009, vol. 37 (pg. W6-W10)

16. Valentin F, Squizzato S, Goujon M, et al. Fast and efficient searching of biological data resources—using EB-eye, Brief. Bioinform., 2010, vol. 11 (pg. 375-384)

17. Rouse RJ, Castagnetto J, Niedner RH. PatGen—a consolidated resource for searching genetic patent sequences, Bioinformatics, 2005, vol. 21 (pg. 1707-1708)

18. Lee B, Kim T, Kim SK, et al. Patome: a database server for biological sequence annotation and analysis in issued patents and published patent applications, Nucleic Acids Res., 2007, vol. 35 (pg. D47-D50)

Author notes

Citation details: Li W., Kondratowicz B., McWilliam H., et al. The Annotation-enriched non-redundant patent sequence databases. Database (2013) Vol. 2013: article ID bat005; doi: 10.1093/database/bat005

© The Author(s) 2013. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
