The non-obese diabetic mouse sequence, annotation and variation resource: an aid for investigating type 1 diabetes

Regions sequenced in the two strains of NOD mouse showing numbers of sequenced bp and numbers of sequenced BACs

Region	Chromosome	Strain	Library	Length bp	Number of BACs
Idd1 (MHC)	17	NOD/ShiLtJ	CHORI-29	4 810 977	29
Idd1 (MHC)	17	NOD/MrkTac	DIL	4 256 209	34
Idd10	3	NOD/MrkTac	DIL	1 195 666	10
Idd16.1	17	NOD/ShiLtJ	CHORI-29	1 908 920	12
Idd18.1	3	NOD/MrkTac	DIL	689 363	5
Idd18.2	3	NOD/MrkTac	DIL	466 430	3
Idd3	3	NOD/MrkTac	DIL	697 603	5
Idd4.1	11	NOD/MrkTac	DIL	1 550 185	11
Idd4.2	11	NOD/MrkTac	DIL	1 462 311	12
Idd4.2Q	11	NOD/ShiLtJ	CHORI-29	3 086 744	18
Idd5.1_CHORI	1	NOD/ShiLtJ	CHORI-29	456 798	2
Idd5.1	1	NOD/MrkTac	DIL	853 859	7
Idd5.3	1	NOD/MrkTac	DIL	3 710 957	32
Idd5.4	1	NOD/MrkTac	DIL	326 027	2
Idd6.1+2	6	NOD/ShiLtJ	CHORI-29	5 657 964	35
Idd6.AM	6	NOD/ShiLtJ	CHORI-29	1 609 414	11
Idd9.1	4	NOD/MrkTac	DIL	2 950 841	25
Idd9.1M	4	NOD/MrkTac	DIL	215 692	2
Idd9.2	4	NOD/MrkTac	DIL	2 864 054	22
Idd9.3	4	NOD/MrkTac	DIL	1 625 541	12
Total from DIL library				22 864 738	182
Total from CHORI-29 library				17 530 817	107
Total				40 395 555	289

Candidate regions Idd6.1 and Idd6.2 were combined for ease of mapping and sequencing because of their proximity in the genome and are referred to as Idd6.1+2. A letter suffix distinguishes regions that were originally given the same name but are located in different parts of the genome. Idd5.1_CHORI is contained wholly within DIL Idd5.1 and was sequenced to establish inter-NOD strain differences.

All sequences have been submitted to the European Nucleotide Archive (ENA) (25), part of the International Nucleotide Sequence Database Collaboration (INSDC) (26), and can also be downloaded from the NOD mouse webpage (27), which also provides a central point for information on the project. Finished clones from the targeted Idd candidate regions are displayed in the NOD clone sequence section of the website (28), where they can be downloaded either as individual clone sequences or as the larger contigs that make up the accession golden path. All the sequence for a specific region can be selected from the relevant chromosome dropdown menu and is also available via the GRC website (29).

Annotation

We have annotated 738 genes across 19 Idd candidate regions in the NOD mouse spanning 31 328 369 bp of finished sequence and 765 genes in the homologous regions in the GRCm38 C57BL/6J reference genome. The difference in the total number of loci is partly due to structural variation between the two mouse strains and to sequence gaps in the Idd regions, which make it difficult to predict the number of missing genes accurately.

Four hundred and eighteen of the genes annotated on the genomic sequence in NOD mouse were coding, 396 of which were known and a further 22 were novel coding loci. One hundred and thirty-seven non-coding loci were annotated, 72 of which were long intergenic non-coding RNAs (lincRNAs), 59 were antisense to a coding gene and 6 were sense intronic to a coding gene. One hundred and eighty-two pseudogenes (135 processed and 40 unprocessed, 3 transcribed processed and 4 transcribed unprocessed) and one nonsense-mediated decay read-through transcript were identified.

Four hundred and thirty of the genes annotated on the GRCm38 C57BL/6J reference genome sequence were coding, 425 of which were known and a further 5 were novel coding loci. One hundred and fifty non-coding loci were annotated, 79 of which were long intergenic non-coding RNAs (lincRNAs), 65 were antisense to a coding gene and 6 were sense intronic to a coding gene. One hundred and eighty-four pseudogenes (147 processed and 26 unprocessed, 3 transcribed processed and 8 transcribed unprocessed) and one nonsense-mediated decay read-through transcript were identified.

The gene content of each annotated Idd region for the NOD mouse and C57BL/6J mouse (B6) is presented in Table 2 and in further detail in Supplementary Data 2.

Table 2.

Gene content for annotated Idd regions in the NOD mouse and C57BL/6J (B6) mouse

Region	Chromosome	Loci (NOD)	Loci (B6)	Coding (NOD)	Coding (B6)	Non-coding (NOD)	Non-coding (B6)	Pseudogenes (NOD)	Pseudogenes (B6)
Idd10*	3	18	22	8	11	2	3	8	8
Idd16.1	17	58	59	40	40	14	14	4	5
Idd18.1	3	4	4	3	3	0	0	1	1
Idd18.2	3	17	17	10	10	4	4	3	3
Idd3	3	18	18	8	8	4	4	6	6
Idd4.1*	11	80	78	60	59	6	6	14	13
Idd4.2	11	70	70	46	46	2	2	22	22
Idd4.2Q	11	64	64	40	40	16	16	8	8
Idd5.1_CHORI	1	11	12	3	3	2	2	6	7
Idd5.1*	1	15	38	5	11	3	12	7	15
Idd5.3	1	23	25	10	10	5	5	8	10
Idd5.4	1	7	7	4	4	3	3	0	0
Idd6.1+2*	6	76	82	36	38	25	26	15	18
Idd6.AM*	6	55	29	22	15	0	0	33	14
Idd9.1*	4	74	76	52	54	18	18	4	4
Idd9.1M	4	4	4	2	2	2	2	0	0
Idd9.2*	4	109	125	49	56	21	23	39	46
Idd9.3	4	35	35	20	20	10	10	5	5
Total		738	765	418	430	137	150	183	185

An asterisk marks Idd regions where the sequence is not contiguous, either because of sequence gaps, which may be indicative of structural variation between the two mouse strains, or, in the case of Idd5.1 and Idd10, because only portions of the defined candidate region were sequenced.

Manual annotation is made available publicly via the Vertebrate Genome Annotation (Vega) website (30) and can be accessed specifically from the mouse Idd regions section (31) (see Figure 1).

Figure 1.

Entry point to the Idd regions in Vega. The regions are represented graphically and shown in the relative position they are found in the C57BL/6J genome. Each region links through to a regional summary. The MHC annotation will be available in the resource by mid-2013.

Differences between strains can be visualised in Vega, where genomic sequence and genes in the candidate loci can be compared; this is a useful way of quickly identifying regions of difference between the two mouse strains (Figure 2). Because the annotation is made available via Vega and/or Ensembl, it is also formatted so that it can be imported into other genome browsers, such as T1DBase’s GBrowse (32), as required by collaborators.

Figure 2.

The NOD and C57BL/6J mouse sequences can be aligned against each other. Homologous genes are connected with lines to help identify them. Blocks of homologous sequence are coloured green, and regions with different sequence or no sequence are coloured light blue. It is clear that there are different intronic sequences present in gene Bcat1 in CHORI-29 (lower panel) with respect to C57BL/6J, possibly resulting in changes to regulatory regions or other functional sequences.

Variation

To gain insight into the differences between the annotated Idd regions of the NOD and C57BL/6J genomes, their sequences were compared and the consequences of the variation were analysed. We found 123 926 SNPs and 18 821 indels across the annotated Idd regions (Table 3). Not all of the Idd regions were defined from congenic mice that used C57BL/6J as the non-NOD strain (some used other strains, such as C57BL/10SnJ), so this comparison is not strictly correct for those regions. However, C57BL/6J is the best complete reference strain for variation work, and if a single strain is to be used, it is the best choice.

Table 3.

Number of changes by type and variation rate in the NOD Idd regions

Region	SNPs	Indels	Length	SNPs/Mb	bp/SNP
Idd10	2833	485	1 531 595	1850	541
Idd16.1	1598	464	1 774 776	900	1111
Idd18.1	253	111	948 181	267	3748
Idd18.2	1506	289	536 172	2809	356
Idd3	3640	583	478 088	7614	131
Idd4.1: 1-1248286	923	315	1 174 414	786	1272
Idd4.2	534	173	1 490 088	358	2790
Idd4.2Q	16 787	2543	2 526 934	6643	151
Idd5.1	3191	466	2 736 539	1166	858
Idd5.1_CHORI	1157	233	400 982	2885	347
Idd5.3	35 745	4311	3 033 670	11 783	85
Idd5.4	599	126	221 559	2704	370
Idd6.1_2	23 206	3528	5 528 888	4197	238
Idd6.AM: 2496-440951	5202	480	425 133	12 236	82
Idd9.1	8601	1634	2 896 193	2970	337
Idd9.1M	1	3	127 081	8	127 081
Idd9.2: 1066933-3054144	13 443	2148	1 861 587	7221	138
Idd9.3	4707	929	1 336 693	3521	284
All	123 926	18 821	29 028 573	4269	234

Fragments with structural variation were removed from Idd4.1, Idd6.AM and Idd9.2. The length shown is that remaining after repeat masking of the region sequence.

Considering SNPs alone, the average variation rate of the Idd regions is one change every 234 bp, ∼2.3 times higher than the mean variation rate of the NOD/ShiLtJ genome (Table 4), calculated from the data of a previous study (33).
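The rates quoted here follow directly from the SNP counts and lengths in Tables 3 and 4. The short Python sketch below reproduces that arithmetic; the function name `variation_rate` is our own and is not part of the published analysis.

```python
def variation_rate(snps: int, length_bp: int) -> tuple[int, int]:
    """Return (SNPs per Mb, bp per SNP), each rounded to the nearest integer."""
    snps_per_mb = round(snps / length_bp * 1_000_000)
    bp_per_snp = round(length_bp / snps)
    return snps_per_mb, bp_per_snp


# SNP counts and lengths copied from Tables 3 and 4.
idd_bac = variation_rate(123_926, 29_028_573)          # all Idd regions, BAC sequence
idd_mgp = variation_rate(102_848, 29_362_570)          # all Idd regions, MGP data
genome_mgp = variation_rate(4_168_714, 2_233_177_854)  # whole NOD/ShiLtJ genome, MGP data

# Fold difference between the Idd regions (BAC) and the genome-wide rate,
# computed from the unrounded SNP densities.
fold = (123_926 / 29_028_573) / (4_168_714 / 2_233_177_854)

print(idd_bac, idd_mgp, genome_mgp)  # (4269, 234) (3503, 285) (1867, 536)
print(round(fold, 1))                # 2.3
```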

Table 4.

Number of SNPs and variation rate in the annotated Idd regions and the whole NOD/ShiLtJ genome

Region	SNPs	Length	SNPs/Mb	bp/SNP
ALL Idd regions (BAC)	123 926	29 028 573	4269	234
ALL Idd regions (MGP)	102 848	29 362 570	3503	285
NOD/ShiLtJ genome (MGP)	4 168 714	2 233 177 854	1867	536

Length for the MGP data refers to the number of confidently mapped bases (see ‘Materials and Methods’ section).

We next analysed whether the BAC-based sequencing project had provided a more comprehensive set of variants than the Mouse Genomes Project (MGP) (34) NOD/ShiLtJ genome sequencing. The MGP found 98 480 high-quality SNPs in confidently mapped positions in the annotated NOD Idd regions, of which 92 770 (94.2%) were confirmed by this study, which in addition identified 5649 unique SNPs (Figure 3). Of these, 4363 were novel, or at least not present in dbSNP when this analysis was performed.

Figure 3.

Comparison of SNP sets in the NOD Idd regions obtained by the BAC sequencing and the MGP.
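A comparison of this kind can be expressed as simple set operations. The sketch below is illustrative only: it assumes each SNP is keyed by a (chromosome, position, alternative allele) tuple, which is our choice rather than a detail given in the text, and the toy coordinates are invented. The final line recomputes the confirmed percentage from the two counts quoted above.

```python
def compare_snp_sets(bac_snps, mgp_snps):
    """Summarise the overlap between two collections of SNP keys."""
    bac, mgp = set(bac_snps), set(mgp_snps)
    shared = bac & mgp
    return {
        "shared": len(shared),
        "bac_only": len(bac - mgp),
        "mgp_only": len(mgp - bac),
        # Percentage of MGP SNPs that are also present in the BAC set.
        "mgp_confirmed_pct": round(100 * len(shared) / len(mgp), 1) if mgp else 0.0,
    }


# Toy data (invented coordinates), keyed as (chromosome, position, alt allele).
bac = [("4", 100, "A"), ("4", 250, "T"), ("4", 900, "G")]
mgp = [("4", 100, "A"), ("4", 250, "T"), ("4", 400, "C"), ("4", 700, "G")]
summary = compare_snp_sets(bac, mgp)
print(summary)

# Confirmed fraction recomputed from the counts reported in the text.
reported_pct = round(100 * 92_770 / 98_480, 1)
print(reported_pct)  # 94.2
```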

To minimize the possible data distortion introduced by sequence gaps, structural variation and so forth, we restricted the study of the effects of variation on the Idd annotation to transcripts that are homologous between the NOD and C57BL/6J genomes. Approximately 4.4% of variants were discovered in exons. Among them, 1428 synonymous coding, 640 non-synonymous coding and 26 codon changes were found across all coding transcripts (Table 5). A similar number of changes involved non-coding transcripts in protein-coding genes (2142), while variation also affected exonic regions of long non-coding RNA (lncRNA) genes (736) and pseudogenes (660). Details of the most significant variation consequences at the gene level for all annotated regions can be seen in Figure 4. See Supplementary Data 3 for detailed SNP consequences for all annotated Idd regions.

Figure 4.

Percentage of variants affecting each genomic element (introns excluded from the chart).

Table 5.

Effects of the variation between homologous transcript sequences of the NOD and C57BL/6J Idd regions on the GRCm38 reference genome annotation

Variation effect	All transcripts	Canonical transcripts
CODON_CHANGE_PLUS_CODON_DELETION	8	2
CODON_CHANGE_PLUS_CODON_INSERTION	2	2
CODON_DELETION	6	3
CODON_INSERTION	10	5
EXON	2142	0
FRAME_SHIFT	7	3
INTRON	189 366	61 717
NON_SYNONYMOUS_CODING	640	326
SPLICE_SITE_ACCEPTOR	3	0
SPLICE_SITE_DONOR	12	6
START_GAINED	68	17
STOP_GAINED	1	1
STOP_LOST	1	0
SYNONYMOUS_CODING	1428	723
SYNONYMOUS_STOP	4	3
UTR_3_PRIME	2372	1192
UTR_5_PRIME	571	230
WITHIN_NON_CODING_GENE	736	609
WITHIN_PSEUDOGENE	660	636

The count of effects varies depending on whether all homologous transcripts in a gene or only the canonical transcript (the one with the longest CDS or the greatest length) are considered. EXON, WITHIN_NON_CODING_GENE and WITHIN_PSEUDOGENE refer to variants affecting exons of non-coding transcripts in coding genes, lncRNA genes and pseudogenes, respectively.
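One possible reading of the canonical-transcript rule in this footnote is sketched below in Python. It assumes that transcripts are ranked first by CDS length, with total transcript length used as the tie-breaker and as the fallback for non-coding genes; this ordering is our interpretation, and the `Transcript` class and transcript names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    name: str
    cds_length: int  # 0 for transcripts without a coding sequence
    length: int      # total transcript length in bp


def canonical(transcripts):
    """Pick the transcript with the longest CDS, then the greatest total length."""
    return max(transcripts, key=lambda t: (t.cds_length, t.length))


# Hypothetical coding gene: the longest CDS wins even if another transcript is longer.
coding_gene = [
    Transcript("tx-a", 900, 2500),
    Transcript("tx-b", 1200, 1800),
    Transcript("tx-c", 0, 4000),
]
# Hypothetical non-coding gene: no CDS, so the longest transcript is chosen.
noncoding_gene = [Transcript("tx-x", 0, 700), Transcript("tx-y", 0, 1500)]

print(canonical(coding_gene).name)     # tx-b
print(canonical(noncoding_gene).name)  # tx-y
```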

Discussion

Sequence

Sequence is currently available for 21 Idd candidate regions in the NOD mouse from two different substrains. Clones from the CHORI-29 and DIL libraries were sequenced across chromosomes 1, 3, 4, 6, 11 and 17. These include the MHC (Idd1), to date the only Idd region identified as essential for the manifestation of T1D (8), which was sequenced in both mouse strains.

The initial construction of the NOD mouse BAC libraries and subsequent mapping of the BAC end-sequences to the GRCm38 C57BL/6J reference genome has facilitated the targeted sequencing of NOD mouse Idd susceptibility loci. Confirmation of a contiguous tile path for targeted CHORI-29 and DIL clones required that the underlying C57BL/6J genome was sufficiently homologous so that the positioning of the NOD BAC end-sequences could be established confidently (15). Although it might be possible to identify the location of specific genes in the NOD genome from just the BAC end-sequence positioning, potential diabetes specific differences are unlikely to have been inferred from the BAC end-sequence alignments alone. Most of the sequenced NOD BAC clones appear to have a co-linear relationship with the GRCm38 C57BL/6J reference genome, although some significant differences in the amount of sequence and/or number of genes present between the GRCm38 C57BL/6J reference genome and the NOD genome have been identified in Idd6.AM (35), Idd4.1 (36) and Idd9.2.

A number of smaller gaps remain in the CHORI-29-derived Idd1 MHC region, Idd6.1+2 and Idd9.1, where it was not possible to define tilepaths across the regions of interest, which again is suggestive of either inter-strain structural variation or a lack of sequencing coverage of appropriate BACs. Other remaining gaps were due to using a targeted-gene sequencing approach.

Using the Illumina platform (37), the Mouse Genomes group has produced a whole genome assembly for the NOD/ShiLtJ mouse genome (33), which used the CHORI-29-derived BAC sequences to calibrate the SNP calling software. However, the error rate in these assemblies could be higher than the genetic differences between the NOD mouse and the C57BL/6J mouse. Furthermore, this assembly was guided by the MGSC37 C57BL/6J reference genome and as such would be biased towards the reference sequence. Although the variation information gained from the mouse genomes in this project has been essential, producing reliable sequence in regions where structural variation exists continues to prove problematic. Thus, the importance of generating finished sequence rather than draft sequence cannot be over-emphasized, as errors in the genomic assembly that arise from draft sequence can lead to false variant calls as well as misassemblies and missing sequence. Although regions of structural variation between C57BL/6J and NOD mouse can be identified clearly by full BAC sequencing, regions that are deleted or expanded in NOD mouse with respect to C57BL/6J are difficult to identify in the next-generation sequencing derived assembly. Supplementary Data 4 illustrates two regions of structural variation between C57BL/6J and NOD mouse, Idd6.AM and Idd4.1, and the clear technical difficulties that such regions cause for current next-generation sequencing and assembly techniques compared with traditionally derived sequencing and assembly methods. The dotplots were created using Dotter (38), whereas the next-generation assembly is viewed using LookSeq (39) via the Mouse Genomes webpage.

However, the NOD/ShiLtJ mouse has recently been re-sequenced on the HiSeq platform with longer read lengths and to a higher depth than previously. Using these data and sequencing data from the ends of large fragments (3, 6 and 40 kb), a completely de novo assembly is being generated, which will form the basis of the NOD/ShiLtJ draft genome sequence. Furthermore, the BAC end-sequences derived from the NOD mouse project will play an important role in scaffolding the new genome assembly, and the finished BAC sequence in the Idd regions will provide high-quality finished genome sequence. As such, the new assembly has the potential to give a more complete overview of the NOD mouse genome.

Annotation

The average number of loci per Mb in the C57BL/6J reference genome, using Ensembl database version 70.38 and assembly version GRCm38, was calculated at 7.7 for protein-coding genes and 11.1 for all genes. The average number of loci per Mb in the NOD Idd regions is 12.9 for protein-coding genes and 23 for all genes, which suggests that the Idd regions are typically more gene dense than the genome average. Similarly, the average genomic span for loci in C57BL/6J was calculated at 44 542 bp for protein-coding genes and 28 816 bp for all biotypes. The average genomic span for loci in the Idd regions is 37 028 bp for protein-coding genes and 23 352 bp for all biotypes, suggesting that genes typically found in Idd regions have a smaller than average genomic span. Most of the Idd regions appear to show close homology, with a similar number of genes present in C57BL/6J and NOD, apart from Idd6.AM and Idd9.2, which are regions with considerable structural variation.

To investigate gene expression differences between the two mice, next-generation-derived RNA-seq data can be aligned and viewed in Vega. This provides a useful way of investigating the transcriptional activity of genes that are not located within the finished NOD BAC sequences, allowing identification of splicing variation, potential differential gene expression and non-coding RNAs between the NOD mouse and the C57BL/6J reference mouse. Furthermore, this feature has also allowed the verification and confirmation of existing annotation (Figures 5 and 6).

Figure 5.

Gene Bhlhe41 (yellow box) in the Idd6.1+2 region from GRCm38 C57BL/6J reference does not have a homolog annotated in NOD owing to a sequence gap (orange box). It is therefore not possible to be confident whether this gene is present and expressed in NOD mouse.

Figure 6.

A higher resolution view of gene Bhlhe41 taken from the Vega genome browser. RNA-Seq data from NOD has been uploaded into the browser and aligned to the GRCm38 C57BL/6J reference. This shows that the gene is clearly expressed in the NOD mouse. On closer inspection, it would appear that there may be evidence of a 3′ overlapping non-coding RNA locus supported by three mouse mRNAs from AK032333.1, AK040945.1 and AK079251.1 as illustrated by the yellow box with the blue outline. The ability to upload RNA-seq data provides a way to investigate gene expression for sequences not yet represented in the NOD Idd regions and could also prove useful in observing differential intergenerational gene expression.

Variation

The average variation rate in the annotated NOD Idd regions that resulted from our BAC-based sequencing was ∼22% greater than the variation rate for the same regions calculated using the NOD/ShiLtJ sequence of the MGP. This is not completely unexpected as we were able to call SNPs in regions that were inaccessible for that project. When the authors calibrated their SNP-calling pipeline using the NOD/ShiLtJ BAC sequences presented in this study as a reference, they found that the density of SNPs in the NOD/ShiLtJ BAC sequence was 2.78-fold higher in inaccessible regions (32).

The Idd region variation rate was found to be 2.3 times higher (or ∼1.9 times higher according to the MGP data) than the mean NOD genome variation rate inferred from the NOD/ShiLtJ genome sequence. This likely reflects the fact that positive selection for functional variation in immune genes is beneficial to the species in regard to host defense (40), and most Idd genes are likely to be immune genes that function in various aspects of disease pathogenesis. In addition, inbred strains of laboratory mice have inherited a mosaic of haplotype blocks with extremely high SNP rates (40 SNPs per 10 kb) that represent ancient divergence within Mus musculus species, making it likely that functional variants of immune genes and their surrounding DNA have evolved separately for hundreds of thousands of years (41).

We have identified consequences of SNPs that affect protein-coding genes for regions that have not previously had published sequence data (Idd5.3 and Idd5.4, Idd6.1 and Idd6.2, Idd4.2Q and Idd16). Looking specifically for non-synonymous SNPs and variations that result in codon deletions or insertions, as such variation would be most likely to affect protein function, we found 49 affected loci. These included sequence polymorphisms in two previously identified candidate genes, Lrmp and Bcat1, associated with Idd6.2 (42). Data generated from this resource have already contributed to important results for a number of Idd regions. Five genes in the Idd4.1 region (Alox15, Alox12e, Psmb6, Pld2 and Cxcl16) have been identified as good candidates for the effects of this region (36). Sequence analysis has also identified likely causative SNPs in Idd5.1 (43), Idd9.3 (44), Idd10 (11) (45) and Idd18.1 (12). Other regions have revealed much greater sequence differences, as is apparent in Idd6.AM (35). This region contains a gene cluster of Ly49 and human killer cell immunoglobulin-like receptor genes, which are known to be involved in autoimmune disease. The region appears to be expanded in the NOD mouse with respect to C57BL/6J, giving the largest known mouse Ly49 haplotype, variation that continues to confound next-generation sequencing and assembly techniques.

It is clear that traditionally derived sequencing and manually generated annotation have played an essential role in helping to identify sequence variation in important Idd candidate disease regions. Although most of the Idd regions appear to be gene rich, regions that are less gene dense could be candidates for investigating the effects of long-distance gene regulation or other mechanisms (2). Furthermore, much of the analysis that has been carried out for the Idd regions has focused primarily on protein-coding genes. However, it is becoming increasingly clear that lncRNAs have an important role to play in the regulation of gene expression, such as assembling chromatin-modifying complexes (46). Thus, sequence differences identified between strains may affect lncRNA secondary structure and consequently lncRNA function. Knockout mouse projects such as the European Conditional Mouse Mutagenesis Program (EUCOMM) could in future investigate phenotypic differences between the C57BL/6J and NOD strains further to help elucidate factors influencing T1D (47). Current knockout mouse resources could be used for non-isogenic targeting of NOD mouse strains. Where this is not possible, the NOD BAC libraries could be used for direct targeting of NOD mouse genes. As T1D susceptibility loci appear to be shared with other immune disorders such as rheumatoid arthritis and Graves’ disease, suggesting shared aetiologies (48), the study of T1D genetics may provide a greater understanding of other autoimmune diseases.

Materials and Methods

Mapping, sequencing and finishing

BAC end-sequences from the DIL and CHORI-29 libraries were mapped to the MGSCv3 C57BL/6J reference mouse build in Ensembl (15). In the regions of interest, a series of minimally overlapping tile paths of BACs was selected. Candidate BAC clones were analysed using HindIII restriction fingerprinting and assembled into contigs in FPC (49). Each BAC had a subclone library prepared, which was sequenced using T7 and SP6 primers on the vector with AB Big Dye Terminator Mix v3.1™, and the data were analysed on AB 3730 automated sequencing instruments at WTSI. These data were assembled and subjected to automated primer walking, before re-assembly using PHRAP (P Green), and then passed into directed manual finishing for completion to phase 3 (50), where the estimated error rate is less than 1/100 000 (33). At the same time as the NOD sequence was produced, the fidelity of the corresponding sequence in the C57BL/6J mouse was checked and assembly errors were corrected where possible in conjunction with the GRC. Sequence progress was monitored via the NOD mouse website that was constructed for this purpose. The finished BAC sequences in this article have been submitted to the INSDC via the ENA at the European Bioinformatics Institute (EBI) (51). See Supplementary Data 1 for accession numbers. Numbers of reads performed for sequencing and finishing are also available here.
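The idea of a minimally overlapping tile path can be illustrated with a standard greedy interval-cover algorithm. The Python sketch below is not the procedure used in this project, which relied on BAC end-sequence mapping and restriction fingerprinting; it only shows how, given clones already placed on reference coordinates, a small covering set and any uncovered gaps could be derived. The clone names and coordinates are hypothetical.

```python
def minimal_tile_path(clones, region_start, region_end):
    """Greedy cover of [region_start, region_end) by clones given as (name, start, end).

    Returns (path, gaps): the selected clone names in order, and the
    uncovered intervals where no clone extends the tile path.
    """
    clones = sorted(clones, key=lambda c: c[1])
    path, gaps = [], []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Among clones starting at or before the covered position,
        # keep the one that reaches furthest to the right.
        while i < len(clones) and clones[i][1] <= covered:
            if best is None or clones[i][2] > best[2]:
                best = clones[i]
            i += 1
        if best is not None and best[2] > covered:
            path.append(best[0])
            covered = best[2]
        elif i < len(clones):
            # Nothing extends coverage: record a gap up to the next clone.
            next_start = min(clones[i][1], region_end)
            gaps.append((covered, next_start))
            covered = next_start
        else:
            # No clones left: the rest of the region is a gap.
            gaps.append((covered, region_end))
            break
    return path, gaps


# Hypothetical clones on reference coordinates (bp).
clones = [("A", 0, 150), ("B", 100, 300), ("C", 120, 280), ("D", 290, 450), ("E", 600, 800)]
path, gaps = minimal_tile_path(clones, 0, 800)
print(path)  # ['A', 'B', 'D', 'E']
print(gaps)  # [(450, 600)]
```

Clone C is skipped because clone B starts within the covered sequence and reaches further, and the interval between D and E is reported as a gap rather than silently bridged.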

Annotation

Finished mouse clones were subjected to automated analysis for similarity searches and ab initio gene predictions in an extended Ensembl analysis pipeline system (52), which is stored in a MySQL database. Interspersed repeats were identified and classified using RepeatMasker (53) and tandem repeats with TRF (54). Manually annotated sequences have been generated using in-house developed software (55) in accordance with the manual annotation guidelines (56). Designation of biotypes (coding, non-coding and pseudogenes) and gene structures was carried out as defined by the standards in GENCODE (57). Known loci that are represented in the mouse genome database (58), RefSeq (59) or UniProt (60) were tagged as ‘known’ in C57BL/6J and NOD. Genes are categorized not only at the gene level but also at the transcript level, definitions of which can be found in Vega (61). Where clear homology existed, gene structures were transferred by using exonerate’s (62) cdna2genome model to align transcripts between the C57BL/6J and NOD mouse strains, and the transferred structures were then verified. Where clear homologs could not be identified, the gene models were built independently and named after the NOD BAC clone they aligned to (see Figure 7). See Supplementary Data 2 for annotation for each Idd region.

Figure 7.

Analysis pipeline for the NOD mouse project. C57BL/6J genomic sequence in Idd regions is annotated before an annotation transfer using exonerate, shown here by the orange arrow. Transcript objects are then manually inspected again in the NOD mouse and further manual annotation carried out where appropriate. Unlike the C57BL/6J annotation, the NOD mouse annotation is only available in Vega.

Variation analysis

Nucleotide sequences of homologous genes in the C57BL/6J mouse and NOD mouse regions were aligned with MAFFT v6.857, and variants were derived from the alignments using an ‘ad-hoc’ Perl script. Variants overlapping simple and tandem repeats in the mouse genome sequence according to the Ensembl database v70 (‘dust’ and ‘TRF’ analyses) were filtered out. Variant consequences were obtained with SnpEff v3.1h (63) based on annotations extracted from the human and vertebrate analysis and annotation internal database taking the GRCm38 C57BL/6J mouse as the reference genome. For the comparison with the MGP derived NOD/ShiLtJ sequence, the confidently mapped genome positions were obtained using SAMtools mpileup (64) with minimum mapping quality of 30, minimum base quality of 30 and read length between 10 and 200. See Supplementary Data 3 for details. The SNPs identified in this study have been deposited in dbSNP under the WTSI_NOD_MOUSE handle.
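The ad-hoc Perl script itself is not reproduced in this article. As an illustration of the general approach only, the Python sketch below walks a pairwise alignment (two gapped sequences of equal length, such as MAFFT would produce) and reports SNPs and indels on reference coordinates, then drops variants that overlap repeat intervals. The function names, the variant tuple layout and the example sequences are our own assumptions.

```python
def call_variants(ref_aln: str, alt_aln: str):
    """Return (ref_pos, type, ref_allele, alt_allele) tuples from a pairwise alignment.

    Positions are 1-based on the ungapped reference. Insertions are
    reported at the reference base that precedes them.
    """
    if len(ref_aln) != len(alt_aln):
        raise ValueError("aligned sequences must have equal length")
    ref_aln, alt_aln = ref_aln.upper(), alt_aln.upper()
    variants = []
    ref_pos = 0  # number of reference bases consumed so far
    i = 0
    while i < len(ref_aln):
        r, a = ref_aln[i], alt_aln[i]
        if r == "-" and a == "-":
            i += 1
        elif r == "-":
            # Insertion: extend over the whole run of reference gaps.
            j = i
            while j < len(ref_aln) and ref_aln[j] == "-":
                j += 1
            inserted = alt_aln[i:j].replace("-", "")
            variants.append((ref_pos, "INS", "", inserted))
            i = j
        elif a == "-":
            # Deletion: extend over the run of gaps in the other sequence.
            j = i
            while j < len(alt_aln) and alt_aln[j] == "-" and ref_aln[j] != "-":
                j += 1
            variants.append((ref_pos + 1, "DEL", ref_aln[i:j], ""))
            ref_pos += j - i
            i = j
        else:
            ref_pos += 1
            if r != a and "N" not in (r, a):
                variants.append((ref_pos, "SNP", r, a))
            i += 1
    return variants


def filter_repeats(variants, repeats):
    """Drop variants whose reference span overlaps any (start, end) repeat interval."""
    kept = []
    for pos, kind, ref, alt in variants:
        end = pos + max(len(ref), 1) - 1
        if not any(start <= end and pos <= stop for start, stop in repeats):
            kept.append((pos, kind, ref, alt))
    return kept


variants = call_variants("ACGT-ACGTAC", "ACCTTAC--AC")
print(variants)  # [(3, 'SNP', 'G', 'C'), (4, 'INS', '', 'T'), (7, 'DEL', 'GT', '')]
filtered = filter_repeats(variants, [(7, 8)])
print(filtered)  # [(3, 'SNP', 'G', 'C'), (4, 'INS', '', 'T')]
```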

RNA-Seq alignments in Vega

NOD BAM files were downloaded from the ENA (25) (NOD_Offspring1Brain—accession ERR033017, NOD_Offspring2Brain—accession ERR032989, NOD_FatherBrain—accession ERR032990, NOD_MotherBrain—accession ERR032991), mapped to the GRCm38 build and finished NOD contigs using default TopHat settings (65), uploaded to the Sanger NGS server and attached to the Vega genome browser using the ‘Attach remote file’ function.

Acknowledgements

The authors thank Thomas Keane for his advice and help with the variation calling and the teams in mapping, sequence production, finishing, annotation and analysis for their contributions.

Funding

This research was conducted as part of the Immune Tolerance Network (ITN), via contract AI 15416, which is jointly funded by the National Institute of Allergy and Infectious Diseases (NIAID), the National Institute of Diabetes, Digestive, and Kidney Disorders (NIDDK), both part of the National Institutes of Health and the Juvenile Diabetes Research Foundation (JDRF). L.S.W. was supported by Wellcome Trust (096388), Juvenile Diabetes Research Foundation International (9-2011-253), and the Cambridge Institute for Medical Research is in receipt of Wellcome Trust Strategic Award (100140). The project was also supported by the Wellcome Trust Sanger Institute. Funding for open access charge: The Wellcome Trust.

Conflict of interest. None declared.

References

Bluestone, Herold, Eisenbarth. Genetics, pathogenesis and clinical interventions in type 1 diabetes. Nature 2010, vol. 464 (pg. 1293-1300)

Pociot, Akolkar, Concannon, et al. Genetics of type 1 diabetes: what's next? Diabetes 2010 (pg. 1561-1571)

Todd. From genome to aetiology in a multifactorial disease, type 1 diabetes. Bioessays 1999 (pg. 164-174)

Wen, Ley, Volchkov, et al. Innate immunity and intestinal microbiota in the development of Type 1 diabetes. Nature 2008, vol. 455 (pg. 1109-1113)

Todd, Wicker. Genetic protection from the inflammatory disease type 1 diabetes in humans and animal models. Immunity 2001 (pg. 387-395)

Wicker, Clark, Fraser, et al. Type 1 diabetes genes and pathways shared by humans and NOD mice. J. Autoimmun. 2005, Suppl.

Makino, Kunimoto, Muraoka, et al. Breeding of a non-obese, diabetic strain of mice. Jikken Dobutsu 1980

Wicker, Todd, Peterson. Genetic control of autoimmune diabetes in the NOD mouse. Annu. Rev. Immunol. 1995 (pg. 179-200)

Gan, Albanese-O'Neill, Haller. Type 1 diabetes: current concepts in epidemiology, pathophysiology, clinical care, and research. Curr. Probl Pediatr Adolesc Health Care 2012 (pg. 269-291)

Chaparro

Dilorenzo

. ,

An update on the use of NOD mice to study autoimmune (Type 1) diabetes

Expert Rev. Clin. Immunol.

2010

, vol.

(pg.

939

955

)

Rainbow

Moule

Fraser

, et al. ,

Evidence that Cd101 is an autoimmune diabetes gene in nonobese diabetic mice

J. Immunol.

2011

, vol.

187

(pg.

325

336

)

Fraser

Dendrou

Healy

, et al. ,

Nonobese diabetic congenic strain analysis of autoimmune diabetes reveals genetic complexity of the Idd18 locus and identifies Vav3 as a candidate gene

J. Immunol.

2010

, vol.

184

(pg.

5075

5084

)

Ridgway

Peterson

Todd

, et al. ,

Gene-gene interactions in the NOD mouse model of type 1 diabetes

Adv. Immunol.

2008

, vol.

100

(pg.

151

175

)

PubMed

OpenURL Placeholder Text

http://www.sanger.ac.uk/research/projects/vertebrategenome/havana/

Rogner

Avner

. ,

Congenic mice: cutting tools for complex immune disorders

Nat. Rev. Immunol.

2003

, vol.

(pg.

243

252

)

Steward

Humphray

Plumb

, et al. ,

Genome-wide end-sequenced BAC resources for the NOD/MrkTac() and NOD/ShiLtJ() mouse genomes

Genomics

2010

, vol.

(pg.

105

110

)

Flicek

Amode

Barrell

, et al. ,

Ensembl 2012

Nucleic Acids Res.

2012

, vol.

(pg.

D84

D90

)

National Institutes of Health

2013

http://www.nih.gov/

Lyons

Hancock

Denny

, et al. ,

The NOD Idd9 genetic interval influences the pathogenicity of insulitis and contains molecular variants of Cd30, Tnfr2, and Cd137

Immunity

2000

, vol.

(pg.

107

115

)

Onengut-Gumuscu

Ewens

Spielman

, et al. ,

A functional polymorphism (1858C/T) in the PTPN22 gene is linked and associated with type I diabetes in multiplex families

Genes Immun.

2004

, vol.

(pg.

678

680

)

Human and Vertebrate Analysis and Annotation team

2013

Shiraki

Kondo

Katayama

, et al. ,

Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage

Proc. Natl Acad. Sci. USA

2003

, vol.

100

(pg.

15776

15781

)

Crossref

ftp://ftp.sanger.ac.uk/pub/NODmouse/NOD_resource_paper_data/

Church

Schneider

Graves

, et al. ,

Modernizing reference genome assemblies

PLoS Biol.

2011

, vol.

pg.

e1001091

NOD mouse ftp site

2013

Wendl

Dear

Hodgson

, et al. ,

Automated sequence preprocessing in a large-scale sequencing environment

Genome Res.

1998

, vol.

(pg.

975

984

)

European Nucleotide Archive

2013

http://www.ebi.ac.uk/ena/

Nakamura

Cochrane

Karsch-Mizrachi

. ,

The International Nucleotide Sequence Database Collaboration

Nucleic Acids Res.

2012

, vol.

(pg.

D33

D37

)

NOD mouse webpage

2013

http://www.sanger.ac.uk/resources/mouse/nod/

NOD mouse sequences

2013

http://www.sanger.ac.uk/cgi-bin/Projects/M_musculus/mouse_NOD_clones_TPF

Genome Reference Consortium (GRC)

2013

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/mouse/

Wilming

Gilbert

Howe

, et al. ,

The vertebrate genome annotation (Vega) database

Nucleic Acids Res.

2008

, vol.

(pg.

D753

D760

)

Vega Idd regions

2012

http://vega.sanger.ac.uk/info/data/mouse_regions.html

Burren

Adlem

Achuthan

, et al. ,

T1DBase: update 2011, organization and presentation of large-scale data sets for type 1 diabetes research

Nucleic Acids Res.

2011

, vol.

(pg.

D997

D1001

)

Keane

Goodstadt

Danecek

, et al. ,

Mouse genomic variation and its effect on phenotypes and gene regulation

Nature

2011

, vol.

477

(pg.

289

294

)

Mouse-Genomes

2012

http://www.sanger.ac.uk/resources/mouse/genomes/

Belanger

Tai

Anderson

, et al. ,

Ly49 cluster sequence analysis in a mouse model of diabetes: an expanded repertoire of activating receptors in the NOD genome

Genes Immun.

2008

, vol.

(pg.

509

521

)

Ivakine

Gulban

Mortin-Toth

, et al. ,

Molecular genetic analysis of the Idd4 locus implicates the IFN response in type 1 diabetes susceptibility in nonobese diabetic mice

J. Immunol.

2006

, vol.

176

(pg.

2976

2990

)

Bentley

Balasubramanian

Swerdlow

, et al. ,

Accurate whole human genome sequencing using reversible terminator chemistry

Nature

2008

, vol.

456

(pg.

)

Sonnhammer

Durbin

. ,

A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis

Gene

1995

, vol.

167

(pg.

GC1

GC10

)

Manske

Kwiatkowski

. ,

LookSeq: a browser-based viewer for deep sequencing data

Genome Res.

2009

, vol.

(pg.

2125

2132

)

Waterston

Lindblad-Toh

Birney

, et al. ,

Initial sequencing and comparative analysis of the mouse genome

Nature

2002

, vol.

420

(pg.

520

562

)

Wade

Kulbokas

III

Kirby

, et al. ,

The mosaic structure of variation in the laboratory mouse genome

Nature

2002

, vol.

420

(pg.

574

578

)

Grimm

Rogner

Avner

. ,

Lrmp and Bcat1 are candidates for the type I diabetes susceptibility locus Idd6

Autoimmunity

2003

, vol.

(pg.

241

246

)

Wicker

Chamberlain

Hunter

, et al. ,

Fine mapping, gene content, comparative sequencing, and expression analyses support Ctla4 and Nramp1 as candidates for Idd5.1 and Idd5.2 in the nonobese diabetic mouse

J. Immunol.

2004

, vol.

173

(pg.

164

173

)

Kachapati

Adams

, et al. ,

The B10 Idd9.3 locus mediates accumulation of functionally superior CD137+ regulatory T cells in the nonobese diabetic type 1 diabetes model

J. Immunol.

2012

, vol.

189

(pg.

5001

5015

)

Penha-Goncalves

Moule

Smink

, et al. ,

Identification of a structurally distinct CD101 molecule encoded in the 950-kb Idd10 region of NOD mice

Diabetes

2003

, vol.

(pg.

1551

1556

)

Mattick

. ,

RNA driving the epigenetic bus

EMBO J.

2012

, vol.

(pg.

515

516

)

Skarnes

Rosen

West

, et al. ,

A conditional knockout resource for the genome-wide study of mouse gene function

Nature

2011

, vol.

474

(pg.

337

342

)

Saleh

Raj

Smyth

, et al. ,

Genetic association analyses of atopic illness and proinflammatory cytokine genes with type 1 diabetes

Diabetes Metab. Res. Rev.

2011

, vol.

(pg.

838

843

)

Soderlund

Humphray

Dunham

, et al. ,

Contigs built with fingerprints, markers, and FPC V4.7

Genome Res.

2000

, vol.

(pg.

1772

1787

)

Bird

Grafham

. ,

BAC finishing strategies

Methods Mol. Biol.

2004

, vol.

255

(pg.

255

277

)

PubMed

OpenURL Placeholder Text