CerealsDB—new tools for the analysis of the wheat genome: update 2020

List of collections and mapping populations that have been genotyped on the Axiom® arrays

Category	Number of varieties/individuals
820K_array	476
35K_breeders_array	1962
35K_relatives_array	278
AvalonxCadenza	185
ParagonxTetraploid BC1F5	189
Doonan_sphaerococcumxCapelle	75
Boden_tilling_samples	141
ChineseSpringxParagon	273
NIAB_INEW_and_ERYCC	235
NIAB_smel	92
OakleyxGatsby	74
ParagonxStarkeshortday	136
ParagonxWatkinsINEW	458
ParagonxBaj	169
ParagonxBecard_kachu	97
ParagonxCimcog47	409
ParagonxCimcog49	369
ParagonxMisr1	95
ParagonxPfau	97
ParagonxSuper152	97
ParagonxWaxwing	95
ParagonxWyalkatchem	88
ShamrockxShango	90
WeebillxCimcog03	133
WeebillxCimcog32	90
WeebillxCimcog56	95
Yumai34xClaire	100
Yumai34xUkrainka	91
Total	6689

Category	Number of varieties/individuals
820K_array	476
35K_breeders_array	1962
35K_relatives_array	278
AvalonxCadenza	185
ParagonxTetraploid BC1F5	189
Doonan_sphaerococcumxCapelle	75
Boden_tilling_samples	141
ChineseSpringxParagon	273
NIAB_INEW_and_ERYCC	235
NIAB_smel	92
OakleyxGatsby	74
ParagonxStarkeshortday	136
ParagonxWatkinsINEW	458
ParagonxBaj	169
ParagonxBecard_kachu	97
ParagonxCimcog47	409
ParagonxCimcog49	369
ParagonxMisr1	95
ParagonxPfau	97
ParagonxSuper152	97
ParagonxWaxwing	95
ParagonxWyalkatchem	88
ShamrockxShango	90
WeebillxCimcog03	133
WeebillxCimcog32	90
WeebillxCimcog56	95
Yumai34xClaire	100
Yumai34xUkrainka	91
Total	6689

A large proportion of the CerealsDB genotyping datasets were produced for 25 bi-parental mapping populations consisting of 3973 genotyped individuals across them. These populations are described in more detail in Table 2.

Table 2.

Information on mapping populations genotyped for CerealsDB

Mapping population	Comment
Avalon x Cadenza	This is a population of doubled-haploid individuals produced within WGIN (www.wgin.org.uk) as a UK reference population. It was derived from F1 progeny of a cross between the cultivars Avalon and Cadenza. The parents were originally chosen to contrast for canopy architecture traits.
ChineseSpringxParagon	This population was developed as SSD to generation F6 produced within WGIN (www.wgin.org.uk). The parents are CS and Par. CS is the international reference genome and Par is the common parent in many UK efforts; both are sequenced. Par is a modern UK spring wheat and is used as the common parent of the landrace bread wheat nested association panel (15), with CSxPar being part of.
ParagonxStarke Short Day	This population was developed as SSD to generation F4 with Par and Starke parents, produced within DFW (www.designing-future-wheat.ac.uk). Starke is a Swedish variety released between 1960 and 1970, which exhibits a stronger than usual delay in heading time when grown under short days. Starke is in the Gediflux Northern European Wheat collection.
Paragonxtetraploid BC1F5	This population was created with Par and tetraploid (durum and dicoccoides as donor parents), as part of the DFW synthetics pillar. Seeds for this population have been deposited at the GRU.
Yumai 34 x Claire	This population was developed as SSD to generation F5 with Yumai 34 and Claire parents. Yumai 34 is a Chinese cultivar shown to contain high-soluble fiber in the HEALTHGRAIN project, Claire is a very successful UK biscuit wheat. Seeds for this population are available from the GRU.
Yumai32 x Ukrainka	This population was developed as SSD to generation F5 with Yumai 34 and Ukrainka parents. Yumai 34 is a Chinese cultivar shown to contain high-soluble fiber in the HEALTHGRAIN project; Ukrainka is a Hungarian variety that is productive and adaptable, with excellent baking quality. (Shewry, 2009). Seeds for this population are available from the GRU.
Paragon x Baj	This population was developed as SSD to generation F4 with Par and Baj parents as part of the IWYP (https://iwyp.org). Baj is a CIMMYT line (GID 5106304), very early maturing, heat- and drought-tolerant and well adapted to eastern Gangetic plains.
ParagonxBecard/Kachu	This population was developed as SSD to generation F5 with Par and Becard/Kachu parents as part of IWYP. Becard/Kachu is a CIMMYT line (GID 6174886) that performed well in trials in Obregon under the wheat yield consortium led by Matthew Reynolds. Kachu, another CIMMYT line for cultivation in northwestern India, is derived from Bacanora. Becard is derived from Weebill.
Paragon x CIMCOG47	This population was developed as SSD to generation F5 with Par and CIMCOG 47 parents as part of IWYP. CIMCOG 47 is CIMMYT line (GID 6000921).
Paragon x CIMCOG49	This population was developed as SSD to generation F5 with Par and CIMCOG 49 parents as part of IWYP. CIMCOG 49 is CIMMYT line (GID 6175024).
Paragon x MISR1	This population was developed as SSD to generation F5 with Par and MISR1 parents as part of IWYP. MISR1 was released by CIMMYT (GID 4902859) for Egypt, Afghanistan and Pakistan and carried Sr25 segment.
Paragon x Pfau	This population was developed as SSD to generation F5 with Par and Pfau parents as part of IWYP. Pfau is an early maturing CIMMYT variety (GID 12989).
Paragon x Super 152	This population was developed as SSD to generation F5 with Par and Super 152 parents as part of IWYP. Early maturing variety that performed very well in CIMMYT ESWYT trials, identified as having potential for wide adaptation in India (GID 5390612).
Paragon x Synth Type	This population was developed as SSD to generation F5 with Par and a CIMMYT synth type (GID 6176523) parents as part of IWYP.
Paragon x Waxwing	This population was developed as SSD to generation F5 with Par and Waxwing parents as part of IWYP. Waxwing is a CIMMYT line (GID 334864).
Paragon x Wyalkatchem	This population was developed as SSD to generation F6 with Par and Wyalkatchem parents as part of IWYP. Wyalkatchem is an Australian variety with CIMMYT parents (GID 3828890) very widely grown until recently.
Weebill x CIMCOG03	This population was developed as SSD to generation F5 with Weebill1 and CIMCOG 03 parents as part of IWYP. CIMCOG 03 is a CIMMYT line (GID 6175208).
Weebill x CIMCOG32	This population was developed as SSD to generation F5 with Weebill1 and CIMCOG 32 parents as part of IWYP. CIMCOG32 is a CIMMYT line (GID 4774392).
Weebill x CIMCOG56	This population was developed as SSD to generation F5 with Weebill1 and CIMCOG 56 parents as part of IWYP. CIMCOG 56 is a CIMMYT line (GID 6179253).

Mapping population	Comment
Avalon x Cadenza	This is a population of doubled-haploid individuals produced within WGIN (www.wgin.org.uk) as a UK reference population. It was derived from F1 progeny of a cross between the cultivars Avalon and Cadenza. The parents were originally chosen to contrast for canopy architecture traits.
ChineseSpringxParagon	This population was developed as SSD to generation F6 produced within WGIN (www.wgin.org.uk). The parents are CS and Par. CS is the international reference genome and Par is the common parent in many UK efforts; both are sequenced. Par is a modern UK spring wheat and is used as the common parent of the landrace bread wheat nested association panel (15), with CSxPar being part of.
ParagonxStarke Short Day	This population was developed as SSD to generation F4 with Par and Starke parents, produced within DFW (www.designing-future-wheat.ac.uk). Starke is a Swedish variety released between 1960 and 1970, which exhibits a stronger than usual delay in heading time when grown under short days. Starke is in the Gediflux Northern European Wheat collection.
Paragonxtetraploid BC1F5	This population was created with Par and tetraploid (durum and dicoccoides as donor parents), as part of the DFW synthetics pillar. Seeds for this population have been deposited at the GRU.
Yumai 34 x Claire	This population was developed as SSD to generation F5 with Yumai 34 and Claire parents. Yumai 34 is a Chinese cultivar shown to contain high-soluble fiber in the HEALTHGRAIN project, Claire is a very successful UK biscuit wheat. Seeds for this population are available from the GRU.
Yumai32 x Ukrainka	This population was developed as SSD to generation F5 with Yumai 34 and Ukrainka parents. Yumai 34 is a Chinese cultivar shown to contain high-soluble fiber in the HEALTHGRAIN project; Ukrainka is a Hungarian variety that is productive and adaptable, with excellent baking quality. (Shewry, 2009). Seeds for this population are available from the GRU.
Paragon x Baj	This population was developed as SSD to generation F4 with Par and Baj parents as part of the IWYP (https://iwyp.org). Baj is a CIMMYT line (GID 5106304), very early maturing, heat- and drought-tolerant and well adapted to eastern Gangetic plains.
ParagonxBecard/Kachu	This population was developed as SSD to generation F5 with Par and Becard/Kachu parents as part of IWYP. Becard/Kachu is a CIMMYT line (GID 6174886) that performed well in trials in Obregon under the wheat yield consortium led by Matthew Reynolds. Kachu, another CIMMYT line for cultivation in northwestern India, is derived from Bacanora. Becard is derived from Weebill.
Paragon x CIMCOG47	This population was developed as SSD to generation F5 with Par and CIMCOG 47 parents as part of IWYP. CIMCOG 47 is CIMMYT line (GID 6000921).
Paragon x CIMCOG49	This population was developed as SSD to generation F5 with Par and CIMCOG 49 parents as part of IWYP. CIMCOG 49 is CIMMYT line (GID 6175024).
Paragon x MISR1	This population was developed as SSD to generation F5 with Par and MISR1 parents as part of IWYP. MISR1 was released by CIMMYT (GID 4902859) for Egypt, Afghanistan and Pakistan and carried Sr25 segment.
Paragon x Pfau	This population was developed as SSD to generation F5 with Par and Pfau parents as part of IWYP. Pfau is an early maturing CIMMYT variety (GID 12989).
Paragon x Super 152	This population was developed as SSD to generation F5 with Par and Super 152 parents as part of IWYP. Early maturing variety that performed very well in CIMMYT ESWYT trials, identified as having potential for wide adaptation in India (GID 5390612).
Paragon x Synth Type	This population was developed as SSD to generation F5 with Par and a CIMMYT synth type (GID 6176523) parents as part of IWYP.
Paragon x Waxwing	This population was developed as SSD to generation F5 with Par and Waxwing parents as part of IWYP. Waxwing is a CIMMYT line (GID 334864).
Paragon x Wyalkatchem	This population was developed as SSD to generation F6 with Par and Wyalkatchem parents as part of IWYP. Wyalkatchem is an Australian variety with CIMMYT parents (GID 3828890) very widely grown until recently.
Weebill x CIMCOG03	This population was developed as SSD to generation F5 with Weebill1 and CIMCOG 03 parents as part of IWYP. CIMCOG 03 is a CIMMYT line (GID 6175208).
Weebill x CIMCOG32	This population was developed as SSD to generation F5 with Weebill1 and CIMCOG 32 parents as part of IWYP. CIMCOG32 is a CIMMYT line (GID 4774392).
Weebill x CIMCOG56	This population was developed as SSD to generation F5 with Weebill1 and CIMCOG 56 parents as part of IWYP. CIMCOG 56 is a CIMMYT line (GID 6179253).

SSD indicates single seed descent; CS, Chinese Spring; Par, Paragon.

Genotyping data have been converted into variant call format (VCF) files for the 820K array, 35K wheat breeders and 35K wheat-relative arrays and are hosted on CerealsDB.

Varietal information

CerealsDB now includes a varietal database providing descriptions on all wheat varieties and related species currently genotyped on the 820K HD array and 35K wheat-relative and breeders’ arrays. These data are especially useful to wheat breeders and consist of information on alternate names for the varieties, species ploidy, country of origin and the source of the material. The varietal database also includes information on the sample type, for example categorization of the variety as a ‘cultivar’—a wheat variety that has been generated via a breeding programme—in contrast to a ‘landrace’ that is locally adapted germplasm that has not been systematically bred. The ‘sample type’ field in the database includes information on specific wheat research or breeding programmes where applicable.

According to the Axiom® genotyping array used, genotype data are stored in the ‘820K array’ or the ‘35K array’ tables in the database. Each table contains 12 columns and there are currently 975 varieties in the 820K array table, compared with 6144 varieties in the 35K array table. The varieties come from a total of 51 or 57 countries, respectively, with UK varieties being the most highly represented in both tables (512 in the 820K array table and 1971 in the 35K array table). The distribution of varieties by country is displayed in Supplementary Figure 2.

The 820K array data come from 855 hexaploid wheat varieties, 65 tetraploid wheat and wheat relatives, 47 diploid species and a single decaploid species (T. ponticum). The majority of the 35K array genotyping data are derived from hexaploid wheat varieties (6080); however, there are also data for 37 tetraploid species and 11 diploid species. We were unable to assign a ploidy level to some species (5 on the 820K array and 16 on the 35K array) and have categorized them as ‘unknown’. There are 820K array data for 86 wheat relative and progenitor species (mainly Aegilops, Thinopyrum and related Triticae) compared with the 35K array that has data for only 11 wheat relatives, all of which are varieties of A. tauschii.

QTL data

As part of Wheat Improvement Strategic Programme and designing future wheat (DFW) programs, field trials were conducted on a set of bi-parental populations (Table 3). The segregating populations are part of the wheat landrace NAM population described in Wingen et al. (15). Population development, KASP genotyping and genetic mapping were conducted as described, with the exception of population Par x Watkins 1 190 094 (PW094), which was genotyped using the Breeders’ 35K Axiom® array (16). Due to the large size of the dataset, genetic mapping was conducted with MSTmap Online (17) (http://www.mstmap.org/) using an LOD 10 threshold and manually separating the linkage groups as markers carry information to which chromosome they belong.

Table 3.

List of bi-parental populations and their characteristics

Population name	Number of RILs	Linked marker loci in genetic map	Genotype
ParW094	109	3088	Axiom®
ParW103	94	172	KASP
ParW308	80	281	KASP
ParW471	92	281	KASP
ParW670	94	185	KASP
ParW740	94	164	KASP
ParW749	94	221	KASP

All populations share ‘Paragon’ as common parent and have a Watkins stabilized bread wheat landrace accession (http://gru-lims-test.nbi.ac.uk/search-browseaccessions.php?idCollection=39) as second parent.

Populations were developed as SSD up to generation F4 as described in (15) and genotyped with KASP markers apart from one exception.

W<number> indicates Watkins accession 1190<number>.

Populations were grown as part of the DFW WP3 multiplication plots in winter sown un-replicated field trials in the 2016/2017 season in 1 m² plots with 20% controls at the John Innes Farm at Bawburgh, Norwich, UK (52.63 degree N, 1.18 degree E), without nitrogen treatment. The populations were phenotyped for plant height (PH) in centimeter as measured from the soil surface to the highest point of the plant and for heading date (HD) in days after 1 May, when in half of the plot, half of the ears had emerged from the flag leaf. QTL mapping was performed for trait data and genotype and map information, using the package QTL (18) (version 1.42.8) from the R software suite (vs. 3.5.2) (R core team 2015) taking the cross-type and generation number of the populations into account. QTL analyses were performed in two steps: putative QTL were identified in an initial single QTL scan and subsequently tested in a final multiple QTL model using a significant threshold calculated from the data distribution.

Genotype and phenotype data are available from http://wisplandracepillar.jic.ac.uk/results_resources.htm, and seed for the populations is available by request.

The QTL database currently contains 65 QTL for 9 wheat phenotypes, consisting of 13 QTL for PH, 7 QTL for heading time, 8 QTL for grain length, 13 QTL for grain section area, 7 QTL for grain width, 5 QTL for grain yield, 1 QTL for HD, 1 QTL for presence of awns and 10 QTL for thousand grain weight.

For each QTL, markers identified at the start and end of the QTL confidence interval and at the QTL peak are given. Crop ontology (19) identifiers were assigned to help further classify data based on agriculture-related information and terminology related to phenotype, breeding, germplasm, pedigree and traits (http://www.cropontology.org/). Two web services were developed at the Earlham Institute to allow access to information on how the genotyping data relate to given markers. Both are part of the Grassroots infrastructure and hence use the Grassroots API and libraries for their operation. The first service is used to import and store the spreadsheets containing the genotyping information into a MongoDB database to store the relationships between the markers, the genotypes and the cross-parental data. The second service accepts queries for searching these data using the marker and/or population names. Clicking on the markers accesses the genotyping data by calling this web service (Supplementary Figure 3).

Phylogenetic trees

For the 820K array genotyping dataset, there were 773 301 (94.35%) SNPs that could be accurately assigned to a location on the International Wheat Genome Sequencing Consortium (IWGSC) RefSeq v1.0 whole genome assembly, compared with 31 926 (90.86%) for the 35K array.

A VCF file was generated for SNPs in accordance with version 4.1 specification with allelic information included. Genotypes were encoded with the Chinese Spring reference allele assigned to 0 and the alternative allele assigned to 1. There were 475 wheat varieties and wheat relative/progenitor species for the 820K genotyping dataset and 1962 wheat varieties that had been genotyped using the 35K array.

The VCF file was then submitted to the SNPhylo analysis pipeline, which excluded 664 280 (85.9%) SNPs due to their monomorphic nature, having a minor allele frequency less than 0.1 or a missing data rate greater than 0.1 (missing data are where a specific variety has failed to genotype IE encoded as −/− in the VCF file) compared with the 35K array genotyping data where 12 739 (39.9%) SNPs were excluded. The number of SNPs for each chromosome for the 820K array data is displayed in Supplementary Table 3 and for the 35K array data in Supplementary Table 4. A total of 58 553 SNPs was used in the 820K analysis and 15 627 in the 35K analysis.

The Newick strings generated by SNPhylo were imported into the Phylocanvas (20) package (version1.1) that allows the phylogenetic trees to be displayed online. The Phylocanvas JavaScript code was embedded into CerealsDB webpages to allow users to interactively explore the dendrograms (Figure 3).

Introgression identification

The assignment of a physical map position to the SNP markers was achieved by BLAST searching the probe sequences to the IWGSC whole genome assembly v1.0 (https://wheat-urgi.versailles.inra.fr/Seq-Repository/Assemblies). The identification of putative introgressions was performed by comparing the genotype calls for each of the 339 hexaploid lines with those of the 116 wheat relatives and progenitors over a 10 SNP window and calculating a percentage match. Analysis of control introgression lines indicated that a relative match of 0.4 or higher was indicative of an introgression in a hexaploid background. This threshold was chosen based on the screening of known introgressions such as 1B/1RS.

The next part of the introgression prediction pipeline takes the predicted scores for each chromosome from the previous step and uses a sliding window of 10 SNPs along the chromosome and calculates the average similarity scores for consecutive 10 SNP windows for each of the varieties, to identify putative regions of introgression along each chromosome. A summary file was then generated for each variety to identify regions of introgression for each chromosome. The summary file reports the chromosomal location and size of putative introgressions above 0.4 for each wheat-relative comparison. The analysis pipeline achieved this by looking for consecutive SNPs that had a relative match of 0.4 or higher and if the next SNP along did not have a relative match of 0.4 or greater, then the preceding SNP was classified as a small introgression. If successive SNPs all had a relative match of 0.4 or greater, then the distance between the first and last SNP with a match ≥0.4 was calculated to identify the approximate size of the introgression. For example, if five consecutive SNPs have a similarity score greater than 0.4, then the distance between the first and the fifth SNPs would be used to calculate the size of the introgression. In this way, it was possible to identify potential introgressions/deletions along each chromosome for every combination of hexaploid wheat variety and wheat relative/progenitor species.

CNV analysis

Copy number variation (CNV) analysis was performed using the Affymetrix CNV Tool software. CEL files from the Axiom® Wheat HD Genotyping Array were processed using Axiom® Analysis Suite as described above. The annotation file was generated using the Affymetrix Annotation Converter. CNV analyses were visualized in BioDiscovery Nexus Copy Number (El Segundo, CA, USA).

Introgression and CNV data can be visualized using the ‘Introgression Plotter’ tools located in the right-hand menu on the axiom genotyping page.

Webtools

An explanation of updated webtools and their current URLs are displayed in Table 4. These webtools can also be accessed from the right-hand menu of the CerealsDB home page. The webtool page contains brief explanations of each webtool: https://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/webtools.php

Table 4.

New webtools and their locations on CerealsDB

Webtool function	URL
Retrieving genotyping data	http://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/extract_genotypes.php
Searching varietal database	http://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/select_varietal_info_SeedStor.php
Searching the QTL database	http://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/select_QTL.php
Retrieving SNPs that align to related species	https://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/FORM_SNPs_vs_related.php
Selecting Axiom® SNPs between two known locations	https://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/axiom_extractor.php
Online dendrograms for Axiom® 820K and 35K arrays	https://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/820K_dendrogram.php https://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/35K_dendrogram.php
Search for putative introgressions/deletions	https://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/search_introgressions.php

Webtool function	URL
Retrieving genotyping data	http://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/extract_genotypes.php
Searching varietal database	http://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/select_varietal_info_SeedStor.php
Searching the QTL database	http://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/select_QTL.php
Retrieving SNPs that align to related species	https://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/FORM_SNPs_vs_related.php
Selecting Axiom® SNPs between two known locations	https://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/axiom_extractor.php
Online dendrograms for Axiom® 820K and 35K arrays	https://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/820K_dendrogram.phphttps://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/35K_dendrogram.php
Search for putative introgressions/deletions	https://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/search_introgressions.php

Retrieving genotyping data

While the previous iteration of CerealsDB included the genotyping data as downloadable Excel spreadsheets, the updated site now has a webtool that allows users to download genotyping information for a specific Axiom® SNP or for a specific wheat variety or wheat relative/progenitor species.

In the first step, the users must decide if they want to get genotype information for a variety or for a SNP. If the latter has been selected, users can search for allelic scores for any SNP on an Axiom® array by selecting one of the two arrays (either 820K or 35K) in the top selection box and providing the Axiom® SNP code (e.g. AX-94381147). Once the SNP code has been submitted, the webtool will list all varieties scored for the selected array with both the variety code generated by the Affymetrix array analysis pipeline and the genotype called. In addition, users can download the data as a tab-delimited text file.

The genotyping data can also be retrieved by searching for a specific variety in the first step mentioned above. This is accomplished by selecting an array or mapping population and then choosing a variety from a drop-down list; alternatively, users can enter a partial variety name and the field will autocomplete if a variety is present on that array (or mapping population). Users can then select genotyping results for probes on all chromosomes or a combination of chromosomes, and the webtool will display the genotype calls for the selected chromosomes, with the option of downloading the data in tab-delimited text format.

Searching varietal database

The varietal database can be searched via a simple webtool that allows users to select varieties genotyped on either the 820K or the 35K Axiom® array. Varieties can be selected from a drop-down list or the name can be typed into the entry field that auto-completes if the variety is present on the array. Alternatively, a general search term can be entered in this field, such as the country of origin or the source of the germplasm and the tool will retrieve varieties annotated with this information. For example, typing ‘Australia’ and searching the 820K array currently retrieves six varieties that were sourced from this country.

CerealsDB varieties have been linked to the Germplasm Resource Unit (GRU), a UK seed bank with a focus on crops, including wheat, brassica and peas. A recently developed API now connects CerealsDB to SeedStor, a database hosted at the GRU (21). This SeedStor API allows users to link any genotyped variety on CerealsDB with the correct GRU accessions providing a direct connection to the SeedStor website (https://www.seedstor.ac.uk/), which contains information like taxonomy, pedigrees, phenotypes and images if available and allows users to directly ordering seed for varieties of interest.

Searching the QTL database

The QTL database, currently holding QTL data identified in bi-parental populations of the DFW landrace NAM panel (wisplandracepillar.jic.ac.uk/results_resources.htm) can be searched online via a web interface. Currently, users can type in one of two phenotypes, ‘Plant height’ or ‘Heading time’, and information on all QTL present for either of these phenotypes is displayed in a table, including information on chromosome location, LOD score, QTL confidence interval start and end markers and the QTL peak marker, along with positions on the genetic map (Figure 1). Clicking on a QTL ID links to a page containing more information on that QTL, including physical positions of markers on the chromosome (based on alignment to the IWGSC assembly RefSeq v1.0) a genetic map of the markers associated with the QTL and links out to the region in EnsemblPlants (22) wheat browser. Ideograms of marker locations on wheat chromosomes were generated using the PhenoGram software (23). Crop ontology information is also provided with links out to the website http://www.cropontology.org/ that hosts agriculture-related information and terminology related to phenotype, breeding, germplasm, pedigree and traits (19).

Figure 1.

Searching the QTL database for a particular phenotype generates a table (a) displaying all QTLs for that phenotype along with chromosomal positions, population information and LOD scores. Clicking on an individual QTL links to QTL maps (b) where markers are shown at the start and end of the confidence interval of the QTL along with the peak marker. An ideogram of the chromosome (c) is also displayed where available showing the QTL highlighted in red and the chromosomal position of important markers are also highlighted.

Figure 2.

Colour-coded glyphs allow users to rapidly scroll through hundreds of results and identify cross-platform SNPs, which is a useful tool for plant breeders that often require ‘legacy’ markers for their breeding programmes.

The QTL database is predicted to be a useful tool for plant breeders looking for novel variation to cross into breeding programmes. For example, a breeder might be interested in the recently identified plant height (PH)-reducing locus Rht24 on chromosome 6A (24, 25) as ‘Rht24 appears in breeding programs from all countries of origin investigated, with increased frequency over the last decades, indicating that wheat breeders have actively selected for this locus.’ (25).

It seems reasonable to hypothesize that users can find different alleles for this locus in a QTL database as it has been speculated (24) that this locus is underlying a meta-QTL for PH on 6A in European elite germplasm (26). In the search for novel alleleic diversity, CerealsDB would be a good place to start, as it hosts information from geographically disparate accessions. In line with this, the QTL database holds QTL data from crosses between bread wheat landraces, from the diverse A E Watkins collection, and the European accession, ‘Paragon’. Searching the database with the key word ‘plant height’ currently brings up 13 PH QTL of which six are located on chromosome 6A. The position on the IWGSC RefSeq v1.0 is given for each QTL for the marker closest to the QTL peak and for the markers bordering the QTL confidence interval. Further markers within the confidence interval are listed and their position on the genome assembly can be retrieved by using the ‘Extract probes’ webtool. The alignment of the six PH QTL along the 6A chromosome and the comparison of their location to the position of the most significantly associated marker S3222505 (located at 420391134 Bp) and the recommended region for marker assisted breeding (25) reveals that the QTL confidence intervals overlap with that region or are near (Supplementary Figure 4). Given that local rearrangements in individual accessions in comparison with the reference sequence are frequent in wheat, QTL positions, which are genetically defined may appear at an unexpected distance to a candidate locus on the wheat reference genome. Therefore, all six QTL are potential candidates for alleleic variation of Rht24. Predicted alleleic effects are 2.6–5.2 centimorgans, with the increasing effect coming from the landrace parents. Depending on the combination with other Rht genes, and on the target environment of the breeding programme, these alleles could be useful to achieve an optimum PH. The usefulness of the QTL database will grow with the number of QTL for different populations, traits and environments deposited here.

Retrieving SNPs that align to related species

A webtool has been developed to find wheat SNP marker sequences on the genome sequences of related species. This webtool allows users to obtain the physical location of an SNP marker on the reference genomes of Brachypodium, oryzae (rice), Sorghum and of wheat (IWGSC RefSeq v1.0 whole genome assembly). This information can be useful for homology-based cloning approaches and rapid identification of candidate genes.

Users must enter an Axiom® SNP ID and can select a range in bps to retrieve all stored Axiom® marker SNPs within that range on either side of the query SNP. The Axiom® SNP marker sequences have been mapped to the reference genomes using BLASTN, and an e-value cut-off can be defined to exclude less reliable alignments. Users can choose a genome reference sequence and the SNP along with upstream and downstream flanking SNPs are displayed with positional information and the e-value score for that BLAST SNP alignment to the reference sequence. Other useful information includes the SNP status, i.e. is it a synonymous or non-synonymous SNP, and a FATHMM score (13) if this has been calculated.

If alternate SNP codes are present, these are also displayed, allowing users to quickly cross-reference for SNPs on other genotyping platforms. Clicking on the Axiom® code displays the sequence surrounding the SNP and links out to UniProt resources (27) if the sequence has been annotated.

Selecting Axiom® SNPs between two known locations

The ability to extract all stored SNPs that are located between two known SNPs of interest has been requested by several CerealsDB users. In response, we have created a webtool that can take two SNP IDs or a single SNP plus a specified distance in bps upstream or downstream of that SNP and the database will return a list of SNPs with the query SNP highlighted in red. Chromosomal location and physical positions are also indicated along with any annotations associated with the surrounding sequence. Many of the CerealsDB SNPs will have a putative annotation based on BLASTN alignments to the nr database. Clicking on the Axiom® probe names allows users to cross-reference this SNP with SNPs on other genotyping platforms that are hosted on CerealsDB.

Searching annotated Axiom® sequences

Located within the Axiom® SNP section of CerealsDB (accessed via the top menu bar) is a tool embedded in the right-hand vertical menu bar (called ‘Search axiom annotations’) that allows users to search for Axiom® probes that have been annotated by BLASTN alignment to the nr database. Users can enter the Axiom® code for a SNP, a GO term (e.g. GO:0048316, which is linked to seed development), a general search term such as ‘rust_resistance’ or a gene name, e.g. ‘spo11’. If the search term is found in the annotation table, the respective Axiom® SNPs are displayed along with the BLAST and GO terms (if available) and a weblink provided to the UniProt website (27) (a database of protein sequences and functional annotation) where more detailed information on the protein can be found.

Details on whether this SNP is available on one of the other genotyping platforms is also displayed using a colour-coded glyph that allows users to quickly scroll through hundreds of results and identify cross-platform SNPs (Figure 2). This is a useful functionality for plant breeders who frequently want to compare ‘legacy’ markers used in their breeding programmes. Clicking on the Axiom® code of a SNP provides more information on that SNP, such as alternative names, the Axiom® probe sequence, chromosomal location and any genetic mapping data, if available.

Figure 3.

A circular dendrogram rendered using phylocanvas shows Watkins lines highlighted in blue. Users can view relationships between wheat varieties genotyped on both the 820K and 35K arrays and display the data in a number of different styles of dendrogram.

Figure 4.

The introgression plotter generates a heatmap and a circos plot in this example for the variety Brompton. The heatmap (a) plots total introgression size for each comparison with relatives ordered vertically by relatedness to bread wheat and chromosomes ordered horizontally. Users can zoom in on the heatmap for more detail. The circos plot (b) has a number of tracks: track 1 (innermost track), putative introgressed/deleted regions; track 2, SNP density for each wheat chromosome; track 3, CNV gain; track 4, CNV loss; track 5, minor allele frequency. The outermost track represents wheat chromosomes (500 = 500 Mbp). The Brompton wheat variety is known to have the 1B/1RS translocation and this can clearly be seen in the circos plot.

Phylogenetic trees

Phylogenetic relationships between wheat varieties and wheat relative and progenitor species were investigated using Axiom® genotype calls. The SNPhylo (28) (version 20160204) pipeline was used to construct phylogenetic trees, separately for the 820K array and the 35K array datasets. These trees can be used to explore the genetic relationship between different accessions, for example, to understand the relationships between elite wheat varieties and wheat relative/progenitor species, revealing potentially interesting insights into the domestication of wheat and the development of modern ‘elite’ wheatvarieties.

It is possible to select one of three types of trees (circular, diagonal, hierarchical) and users can search for names of wheat varieties using the search field. Found names or text patterns will be immediately shown as blue tree nodes, with the names of the nodes given a blue background, which helps to locate varieties of interest in the tree. A subtree containing these selected varieties can be redrawn. Downloadable circular dendrograms for both datasets in high-resolution portable network graphic (png) format are also available and were generated using the R package ggtree (29). A screenshot of the 820K circular dendogram is displayed in Figure 3.

Introgression plotting tools

Two online webtools were created for viewing and interrogating putative introgressions/deletions. The first generates a heatmap of putative introgressions/deletions from all wheat relatives for each hexaploid wheat accession. The heat map is coloured according to the size of the putative introgression/deletions and allows users to determine which wheat relatives are possible donors. All heatmaps were pre-computed in R and exported as png images, which are embedded into the webpage. An associated interactive table details the total number and size of introgressions/deletions (in bps) detected for each relative comparison and allows users to sort and select a comparison to view in more detail.

The second tool generates circos plots to visualize putative introgressions/deletions from a single relative donor. This is combined with other parameters such as CNV, SNP density and genetic diversity measures for a straightforward comparison. An example of a heatmap and circos plot generated by the webtool is displayed in Figure 4, where the known 1B/1RS Rye introgression can clearly be seen in the circos plot when comparing the variety Brompton to wheat relative and progenitor species. The underlying introgression data for the plotter have also been used to confirm the presence of chromatin from the wheat wild relative A. sharonensis that was used to introgress stem rust resistance to bread wheat genomes (30). In Supplementary Figure 5, we can clearly see introgressed regions on chromosomes 1B and 5B for the ‘Pretorius’ wheat variety plotted as both heatmaps and circos plots.

Conclusions

CerealsDB is a highly accessed site and an invaluable resource for both wheat researchers and breeders. All CerealsDB data are publicly available without restrictions and the database is continually updated with curated genotyping data. CerealsDB contains a range of tools for the visualization of wheat genomic datasets and facilitates integration with other databases via the web services that we have developed. Future plans include the extension of web services to offer further functionalities and a continuous increase in the number of genotyped varieties. In particular, we plan an expansion of the web services to include more genotyping and QTL data along with the streamlining of the site navigation to allow visitors to rapidly locate and retrieve the information that they are particularly interested in.

Database architecture

CerealsDB version 3.0 uses a MySQL relational database management system database (version 14.14) running on a Linux server (Ubuntu OS version 18.04) and hosted using the Apache2 web server (version 2.4.7) with Perl, Python and PHP scripts used for all data retrieval and output. Genotyping data are also hosted in a MongoDB database to speed up retrieval of large genotyping datasets. The aforementioned software and architecture were used as they are platform independent and open source.

Web services and availability

The CerealsDB website is freely available and can be accessed at http://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/indexNEW.php. The site is free from IP restrictions and access is not password protected.

List of abbreviations

API indicates application programming interface; BC, Bristol contig; BLAST, basic local alignment search tool; BS, Bristol SNP; CSS, cascading style sheets; DArT, diversity array technology; DFW, designing future wheat; EST, expressed sequence tag; GO, Gene Ontology; HTTP, hypertext transfer protocol; ID, identification; IWGSC, International Wheat Genome Sequencing Consortium; IWYP, International Wheat Yield Partnership; JSON, Javascript object notation; MAS, marker-assisted selection; NCBI, National Center for Biotechnology Information; NGS, next generation sequencing; nr, NCBI non-redundant database; OS, operating system; PHP, hypertext pre-processor; QTL, quantitative trait locus; RDMS, relational database management system; REST, representational state transfer; SNP, single nucleotide polymorphism; SNV, single nucleotide variant; SSD, single seed descent; TGbyS, targeted genotyping by sequencing; URGI, Unité de Recherche Génomique Info; URI, uniform resource identifier; URL, uniform resource locator; VCF, variant call format.

Availability of data and materials

Datasets contained in CerealsDB are available on the site to download as Excel spreadsheets or VCF format. SNP data can be downloaded from the right-hand menu for each of the different genotyping platforms.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

P.A.W., A.M.A., L.W., J.Z. and S.T. wrote the manuscript. P.A.W., G.L.A.B. and M.O.W. built CerealsDB version 3.0; P.A.W., S.T., X.B. and R.D. contributed to the web services; L.W. and S.G. provided the QTL data; A.M.A. and A.B. carried out the SNP validation experiments; D.S. and A.B. collated the varietal data; J.Z. predicted FATHMM scores; and K.J.E. and G.L.A.B. conceived of the study, participated in its design and coordination and helped to draft the manuscript. All authors have read and approved the final manuscript.

Acknowledgements

We are grateful to Jane Coghill and Christy Waterfall from the University of Bristol Genomics facility for NGS sequencing and advice. We are also grateful to the Lady Emily Smyth Agricultural Research Station for providing the funds for next-generation sequencing.

Supplementary data

Supplementary data are available at Database Online.

Funding

Biotechnology and Biological Sciences Research Council, UK (BBS/E/C/000I0210, BBS/E/C/000I0280, BBS/E/J/000PR9782, BBS/E/T/000PR9783).

References

Gupta

P.K.

Rustgi

and

Mir

R.R.

(

2008

)

Array-based high-throughput DNA markers for crop improvement

Heredity

101

–

Wilkinson

P.A.

Winfield

M.O.

Barker

G.L.A.

et al. (

2012

)

CerealsDB 2.0: an integrated resource for plant breeders and scientists

BMC Bioinformatics

219

Wilkinson

P.A.

Winfield

M.O.

Barker

G.L.A.

et al. (

2016

)

CerealsDB 3.0: expansion of resources and data integration

BMC Bioinformatics

256

Winfield

Allen

A.M.

Burridge

A.J.

et al. (

2015

)

High density SNP genotyping array for hexaploid wheat and its secondary and tertiary gene pool

Plant Biotechnol. J.

1195

–

1206

Burridge

A.J.

Wilkinson

P.A.

Winfield

M.O.

et al. (

2018

)

Conversion of array-based single nucleotide polymorphic markers for use in targeted genotyping by sequencing in hexaploid wheat (Triticum aestivum)

Plant Biotechnol. J.

867

–

876

Bian

Tyrrell

and

Davey

R.P.

(

2017

)

The Grassroots life science data infrastructure

https://grassroots.tools

Winfield

M.O.

Wilkinson

P.A.

Allen

A.M.

et al. (

2012

)

Targeted re-sequencing of the allohexaploid wheat exome

Plant Biotechnol. J.

733

–

742

Altschul

S.F.

Gish

Miller

et al. (

1990

)

Basic local alignment search tool

J. Mol. Biol.

215

403

–

410

Binns

Dimmer

Huntley

et al. (

2009

)

QuickGO: a web-based tool for Gene Ontology searching

Bioinformatics

3045

–

3046

10.

Finn

R.D.

Clements

and

Eddy

S.R.

(

2011

)

HMMER web server: interactive sequence similarity searching

Nucleic Acids Res.

W29

–

11.

Suzek

B.E.

Wang

Huang

et al. (

2015

)

UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches

Bioinformatics

926

–

932

12.

Sjolander

Karplus

Brown

et al. (

1996

)

Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology

CABIOS

327

–

345

PubMed

OpenURL Placeholder Text

https://github.com/phylocanvas/phylocanvas

13.

Shihab

H.A.

Gough

Cooper

D.N.

et al. (

2013

)

Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models

Hum. Mutat.

–

14.

Przewieslik-Allen

Burridge

Wilkinson

et al. (

2019

)

Developing a high-throughput SNP-based marker system to facilitate the introgression of traits from Aegilops species into bread wheat (Triticum aestivum)

Front. Plant Sci.

1993

15.

Wingen

West

Leverington-Waite

et al. (

2017

)

Wheat landrace genome diversity

Genetics

205

1657

–

1676

16.

Allen

A.M.

Winfield

M.O.

Burridge

A.J.

et al. (

2017

)

Characterisation of a Wheat Breeders’ Array suitable for high throughput SNP genotyping of global accessions of hexaploid bread wheat (Triticum aestivum)

Plant Biotechnol. J.

390

–

401

17.

Bhat

P.R.

T.J.

et al. (

2008

)

Efficient and accurate construction of genetic linkage maps from the minimum spanning tree of a graph

PLoS Genet.

e1000212

18.

Broman

K.W.

Sen

et al. (

2003

)

R/QTL: QTL mapping in experimental crosses

Bioinformatics

889

–

890

19.

Shrestha

Arnaud

Mauleon

et al. (

2010

)

Multifunctional crop trait ontology for breeders’ data: field book, annotation, data discovery and semantic enrichment of the literature

AoB Plants

2010

plq008

20.

21.

Horler

R.S.P.

Turner

A.S.

Fretter

et al. (

2018

)

SeedStor: a germplasm information management system and public database

Plant Cell Physiol.

(1).

OpenURL Placeholder Text

22.

Bolser

Staines

D.M.

Pritchard

Kersey

(

2016

) Ensembl Plants: Integrating Tools for Visualizing, Mining, and Analyzing Plant Genomics Data. In:

Edwards

(ed).

Methods Mol Biol.

1374

115

–

140

23.

Wolfe

Dudek

Ritchie

M.D.

et al. (

2013

)

Visualizing genomic information across chromosomes with PhenoGram

BioData Min.

24.

Xiuling

Weie

et al. (

2017

)

Molecular mapping of reduced plant height gene Rht24 in bread wheat

Front. Plant Sci.

1379

25.

Würschum

Langer

S.M.

Longin

C.F.H.

et al. (

2017

)

A modern green revolution gene for reduced height in wheat

Plant J.

892

–

903

26.

Griffiths

Simmonds

Leverington

et al. (

2012

)

Meta-QTL analysis of the genetic control of crop height in elite European winter wheat germplasm

Mol. Breed.

159

–

171

Crossref

27.

The UniProt Consortium

(

2015

)

UniProt: a hub for protein information

Nucleic Acids Res.

D204

–

D212

Crossref

PubMed

28.

Lee

T.H.

Guo

Wang

et al. (

2014

)

SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data

BMC Genomics

162

29.

Smith

D.K.

Zhu

et al. (

2017

)

ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data

Methods Ecol. Evol.

–

Crossref

30.

Millet

Steffenson

B.J.

Prins

et al. (

2017

)

Genome targeted introgression of resistance to African stem rust from Aegilops sharonensis into bread wheat

Plant Genome.

(3).

OpenURL Placeholder Text