Complex structural variation, phylogeny, and disease associations of the mucin pangenome
Mucins are large glycoproteins that provide hydration and barrier function to epithelial tissues. Although genetically heterogeneous, all mucins harbor a large exon composed of variable number tandem repeats (VNTRs). Short-read sequencing has limited our understanding of mucin VNTR diversity and makes disease association studies challenging. We leverage 296 long-read phased genome assemblies to characterize 14 mucin family members, achieving ≥97% accuracy across 572 haplotypes. Phylogenetic haplogroup analysis reveals extraordinary structural heterozygosity, with MUC4 harboring the greatest allelic diversity (n=240 distinct lengths) and MUC12 the greatest size range (Δ = 55,233 bp;