SWine IMputation (SWIM) haplotype reference panel enables nucleotide resolution genetic mapping in pigs
Nov 11, 2023
Communications Biology volume 6, Article number: 577 (2023)
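Genetic mapping to identify genes and alleles associated with, or causing, economically important quantitative trait variation in livestock animals such as pigs is a major goal of animal genetic improvement. Despite recent advances in high-throughput genotyping technologies, the resolution of genetic mapping in pigs remains poor, due in part to the low density of genotyped variant sites. In this study, we overcame this limitation by developing a reference haplotype panel for pigs based on 2259 whole-genome-sequenced animals representing 44 pig breeds. We evaluated software combinations and breed composition to optimize the imputation procedure and achieved an average concordance rate in excess of 96%, a non-reference concordance rate of 88%, and an r² of 0.85. We demonstrated in two case studies that genotype imputation using this resource can dramatically improve the resolution of genetic mapping. A public web server has been developed so that the pig genetics community can make full use of this resource. We expect the resource to facilitate genetic mapping and accelerate genetic improvement in pigs.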
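The domestic pig (Sus scrofa) is an important livestock species and a model organism for biomedical research [1]. Historically, domestication and intense artificial selection have produced many pig breeds that are genetically and phenotypically distinct from each other and from their wild relatives [2, 3, 4]. More recently, high-throughput DNA sequencing and genotyping technologies [5] have facilitated genetic improvement in pigs. For example, hundreds of genome-wide association and quantitative trait locus (QTL) mapping studies have identified numerous genomic regions associated with a variety of production, physiological, and behavioral phenotypes [6]. These studies are important for understanding the genetic and biological basis of economically and biomedically important traits such as growth [7], fertility [8], and disease resistance [9].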
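The resolution of genetic mapping in pigs remains poor, due in part to the low density of single nucleotide polymorphism (SNP) genotyping arrays. One proven and cost-effective approach to overcoming this limit is genotype imputation, which infers genotypes at unobserved polymorphic loci by exploiting linkage disequilibrium. With large haplotype reference panels generated by whole-genome sequencing, imputation has the potential to provide sequence-level genotypes. In livestock animals, where QTL identification and genetic prediction are the two major goals and linkage disequilibrium is extensive, sequence-level genotype imputation has been applied successfully, with relatively few reference haplotypes but reasonable accuracy [12, 13]. For pigs specifically, at least two public imputation servers are available [14, 15]. However, their application has been limited because the reference panels either contain a very limited number of animals or lack good representation of the major commercial breeds [15]. In addition, although many studies have demonstrated improvements in mapping resolution [16] and genomic prediction accuracy [17], none of them is publicly accessible.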
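In this study, we generated whole-genome sequence data from 1530 newly sequenced pigs and combined them with 729 additional animals from public databases to call variants and develop the largest and most diverse reference panel of pig haplotypes to date. This substantial increase in the number of available genomes enabled us to impute SNP array genotypes to whole-genome sequence rapidly and accurately. We evaluated the accuracy of imputation and demonstrated the utility of this haplotype reference panel in genome-wide association mapping. We introduce a new public web server (swimgeno.org) where users can submit array genotypes and retrieve imputed whole-genome sequence-level genotypes. This resource greatly improves access to high-accuracy genotype imputation and potentially facilitates nucleotide-resolution genetic mapping in pigs.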
[…] 0.5) in 435 Durocs, 522 Landraces, 493 Yorkshires, 36 Meishans, 24 European wild boars, and 27 Asian wild boars. c Scatter plot of the first two principal components of the genotype matrix for common (MAF > 0.05) and LD-pruned variants. Points are color-coded according to their reported breed information. A preliminary principal component analysis was performed to visually inspect and remove clear outliers from clusters, which indicated errors in breed information. d Ancestries of pigs were estimated with variable (K = 2, 4, 6) numbers of postulated ancestral populations using the ADMIXTURE software. Estimated ancestries were plotted as stacked bar charts with breeds annotated on the top. In addition to annotations above the bar chart, broad geographical locations are also annotated below the bar chart for K = 6.
[…] 0.005 to construct the haplotype reference panel. To investigate factors that influence imputation accuracy, we considered different combinations of commonly used phasing and imputation software, including SHAPEIT4/IMPUTE5, Beagle5.2/Beagle5.2, and Eagle2.4/Minimac4. We defined imputation accuracy using three metrics: the overall concordance rate between imputed and observed genotypes, the non-reference concordance rate, which summarizes accuracy for non-reference genotypes only, and the squared correlation (r²) between imputed and observed genotypes. We focused on Landrace as the target set because it has the largest number of animals in the dataset. We held out 100 Landrace pigs sequenced at high coverage (>15X) and compared observed genotypes with genotypes imputed from sequencing-based genotypes at the sites on a 50K SNP array (GeneSeek GGP). Regardless of breed composition in a haplotype reference panel of fixed size, SHAPEIT4/IMPUTE5 outperformed Beagle5.2/Beagle5.2 and Eagle2.4/Minimac4 in all three metrics (Fig. 2a–c).
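The three accuracy metrics defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: it assumes genotypes are coded as alternate-allele counts (0, 1, 2), and it assumes that the non-reference concordance rate is computed over genotype pairs in which the observed and imputed calls are not both homozygous reference. The function name `imputation_accuracy` is hypothetical.

```python
import math
from typing import Dict, Sequence


def imputation_accuracy(observed: Sequence[int], imputed: Sequence[int]) -> Dict[str, float]:
    """Compare observed and imputed genotypes coded as alt-allele counts (0/1/2)."""
    if len(observed) != len(imputed) or len(observed) == 0:
        raise ValueError("observed and imputed must be non-empty and of equal length")

    pairs = list(zip(observed, imputed))
    n = len(pairs)

    # Overall concordance: fraction of genotypes imputed exactly right.
    concordance = sum(o == i for o, i in pairs) / n

    # Non-reference concordance (assumed definition): drop pairs where both
    # calls are homozygous reference, then compute concordance on the rest.
    non_ref = [(o, i) for o, i in pairs if not (o == 0 and i == 0)]
    if non_ref:
        non_ref_concordance = sum(o == i for o, i in non_ref) / len(non_ref)
    else:
        non_ref_concordance = math.nan

    # r^2: squared Pearson correlation between observed and imputed genotypes.
    mean_o = sum(observed) / n
    mean_i = sum(imputed) / n
    cov = sum((o - mean_o) * (i - mean_i) for o, i in pairs)
    ss_o = sum((o - mean_o) ** 2 for o in observed)
    ss_i = sum((i - mean_i) ** 2 for i in imputed)
    r2 = (cov * cov) / (ss_o * ss_i) if ss_o > 0 and ss_i > 0 else math.nan

    return {
        "concordance": concordance,
        "non_ref_concordance": non_ref_concordance,
        "r2": r2,
    }


if __name__ == "__main__":
    # Toy genotypes for eight animals at one site (made-up values).
    obs = [0, 0, 0, 0, 1, 1, 2, 2]
    imp = [0, 0, 0, 1, 1, 1, 2, 1]
    print(imputation_accuracy(obs, imp))
```

When most genotypes at a site are homozygous reference, the overall concordance rate is high almost by construction. The non-reference concordance rate and r² are more informative at low allele frequency, because they do not count correctly imputed reference genotypes in the same way.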
SHAPEIT4/IMPUTE5 was therefore chosen for all subsequent analyses.
[…] 94.24%), imputation using the SWIM panel developed in the present study was consistently higher than PHARP within each breed (Fig. 4b). The improvement was much more pronounced for the non-reference concordance rate and r², two metrics that more faithfully reflect accuracy, especially at low frequency (Fig. 4c, d). The difference between SWIM and PHARP could simply reflect a difference in sample size, especially for the breeds evaluated. The final reference haplotype panel consisting of all 2259 animals is expected to achieve a concordance rate in excess of 95.84%, a non-reference concordance rate of 88.26%, and an r² of 0.85.
[…] A) has been suggested as the causative mutation [21] and extensively replicated in multiple genetic backgrounds [23]. Furthermore, mutations in MC4R are strongly associated with early-onset obesity in humans [24], and its role in the regulation of energy homeostasis is well established [25]. Importantly, the putative causal mutation in MC4R has been included in one of the commercially available SNP genotyping arrays, the GeneSeek GGP Porcine 50K SNP Chip (Neogen, Lincoln, NE). However, the same SNP is not present in the more widely used Illumina PorcineSNP60 chip. To see whether genotype imputation could correctly impute the genotypes of this SNP, we excluded the MC4R SNP and imputed whole-genome genotypes in a population of 3769 Duroc pigs genotyped using the GGP Porcine 50K SNP arrays. Remarkably, the concordance rate and r² between the imputed and array MC4R SNP genotypes were 99.71% and 0.9916, respectively. We performed GWAS using array and imputed genotypes; both showed a major peak on chromosome 1 (Fig. 5a, Supplementary Data 3 and 4) and a clear deviation of the P-value distribution from the null (Supplementary Fig. 4a).
Among the imputed SNPs, the top hit (chr1:161511936:T>C, P = 2.98 × 10⁻¹³) explained 2.85% of the total phenotypic variance (Fig. 5a). Under this peak, in a 4-Mb region (158.5–162.5 Mb), there were 7138 variants within 22 genes. Linkage disequilibrium in this region was extensive, with 1050 variants in strong LD (r² > 0.8) with the top hit, including the MC4R SNP (Fig. 5b). The top hit was an intronic SNP in the gene CCBE1 (Fig. 5b). However, the extensive LD in this region makes it difficult to pinpoint a causative mutation from genetic data alone. Additional functional information and genetic data that break the LD are necessary to further fine-map causative genes and mutations. Nevertheless, the ability to identify the putative MC4R causative SNP as one of the top associated variants in a long stretch of high LD clearly demonstrated the improvement in resolution obtained with imputed genotypes. In our analysis, the MC4R SNP was initially removed and would have been invisible without imputation, as would be the case if the Illumina PorcineSNP60 chips were used.
[…] C) is indicated by a gradient of blue color. Locations of genes are indicated in the box below the plot, where blue boxes and gene names with a left arrowhead (<) indicate genes transcribed on the reverse strand, and red boxes and gene names with a right arrowhead (>) indicate genes transcribed from the forward strand. Genes that are not marked do not have gene symbols. Gene locations are based on the Ensembl Release 98 annotation.
[…] T, P = 3.45 × 10⁻³⁹). Remarkably, this variant explained 13.65% of the total phenotypic variance, and the homozygous C/C animals were, on average, 4.01 cm longer than the T/T homozygotes (Fig. 6b, c). BMP2 has been repeatedly shown to be associated with growth traits in pigs. A recent study implicated a regulatory variant upstream of the BMP2 gene and validated its functional impact using reporter genes [26].
This regulatory variant was the third most significant SNP under this peak in our analysis. Whether one or both of these potentially regulatory variants are the causative mutations remains to be determined. Given the strong association, the high MAF of these SNPs, and the less extensive LD in this region, it is unlikely that these regulatory variants were tagging protein-coding and less common variants in the BMP2 gene. In addition to the genetic support from this Yorkshire population, the body-length-increasing C allele was much more prevalent in Landrace than in other breeds. A hallmark of the Landrace breed is its long body; thus, regulatory variation of the BMP2 gene may be a major contributor to the phenotypic differentiation between pig breeds. In contrast, although the SNP chip was able to broadly identify this region, the most significant SNP (chr17:15827832:T>G, P = 1.58 × 10⁻²⁵) in the SNP chip-based GWAS was about 184 kb away from the lead SNP and explained substantially less variance (8.22% versus 13.65%).
[…] T) are indicated by a gradient of blue color. Locations of genes are indicated in the box below the plot, according to the Ensembl Release 98 annotation. All three genes are colored in red and transcribed from the forward strand. The only gene with a symbol in this region is BMP2. c Scatter and box plots of body length (in cm) for the three genotypes of the chr17:15643342:C>T SNP. The lower and upper boundaries of the box are the 25% and 75% quantiles of the data, respectively; the midline is the median, and the whiskers extend to the minimum and maximum. d Allele frequencies of the chr17:15643342:C>T SNP in different breeds.
[…] 54.69") were removed. Variant quality score recalibration (VQSR) on SNPs was performed with truth SNP sets compiled from commercial SNP arrays, including the 50K, 60K, and 80K SNP chips (prior = 15.0) on the Illumina platform and the 660K (prior = 12.0) and SowPro90 (prior = 15.0) SNP chips from the Affymetrix platform.
SNPs were filtered with a truth sensitivity filter level of 99.0. Lacking a truth set for indels, we applied hard filtering to them, excluding indels with QD < 2.0, QUAL < 50.0, FS > 100.0, or ReadPosRankSum < −20.0, as recommended by GATK's best practices. Additionally, we filtered out animals with a missing rate >0.20 or heterozygosity >0.20, and retained bi-allelic sites with a missing rate <0.2 and mean sequencing depth between 5 and 500. Filtering was performed using a combination of VCFtools 0.1.13 [32] and BCFtools 1.13 [33] commands.
[…] 0.5) and low-frequency variants (MAF < 0.05). To understand the genetic structure in the population, we retained variants with MAF > 0.05 and missing rate <0.1 and pruned SNPs for LD so that retained variants had pairwise r² < 0.3 (--indep-pairwise 50 10 0.3) using PLINK 1.9 [35]. Principal component analysis (PCA) was performed on the filtered list of 1,223,882 variants using GCTA 1.93.2 [36] for all individuals. Ancestries were estimated using ADMIXTURE 1.3 [37] on 185 individuals randomly selected according to breed representation in the dataset or at least four individuals per breed. The downsampling was necessary to properly visualize population structure.
[…] 0.1 and MAF < 0.005 were removed. Additionally, variants with a Hardy–Weinberg equilibrium test P-value < 10⁻¹⁰ in all three of the Duroc, Landrace, and Yorkshire populations, tested separately in PLINK, were removed. Only autosomal variants were retained for imputation.
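The indel hard-filtering rule quoted in the variant-calling paragraph above (exclude an indel if QD < 2.0, QUAL < 50.0, FS > 100.0, or ReadPosRankSum < −20.0) can be expressed as a small predicate. The sketch below is illustrative only and is not the authors' pipeline, which applied these thresholds with GATK, VCFtools, and BCFtools. The function name `failed_indel_filters` and the plain-dictionary input are hypothetical, and treating a missing annotation as "does not trigger the filter" is an assumption.

```python
import operator
from typing import List, Mapping, Optional

# Thresholds as stated in the text: an indel is excluded if ANY test is true.
INDEL_HARD_FILTERS = (
    ("QD", operator.lt, 2.0),
    ("QUAL", operator.lt, 50.0),
    ("FS", operator.gt, 100.0),
    ("ReadPosRankSum", operator.lt, -20.0),
)


def failed_indel_filters(annotations: Mapping[str, Optional[float]]) -> List[str]:
    """Return the names of the hard filters that an indel record fails."""
    failed = []
    for name, compare, threshold in INDEL_HARD_FILTERS:
        value = annotations.get(name)
        if value is None:
            # Assumption: a missing annotation does not trigger the filter.
            continue
        if compare(value, threshold):
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Made-up annotation values for two indel records.
    records = [
        {"QD": 12.4, "QUAL": 310.0, "FS": 1.7, "ReadPosRankSum": 0.3},
        {"QD": 1.5, "QUAL": 42.0, "FS": 3.2, "ReadPosRankSum": -0.8},
    ]
    for record in records:
        failed = failed_indel_filters(record)
        print("PASS" if not failed else "FAIL: " + ",".join(failed))
```

Returning the list of failed filters, rather than a single boolean, makes it easy to tabulate which threshold removes the most indels.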