GeneCAD: Plant Genome Annotation with a DNA Foundation Model
Accurate genome annotation is fundamental to biological discovery, yet identifying gene structures directly from DNA sequence remains a major challenge in complex genomes. We introduce GeneCAD, a sequence-only framework that predicts biologically coherent gene models without requiring species-matched transcriptomic or proteomic evidence. GeneCAD integrates lineage-specific DNA representations from the PlantCAD2 foundation model with a transformer encoder and a chromosome-scale conditional random field (CRF) that enforces structural constraints such as splice phase and feature order. To ensure high-quality supervision, we implement a curation strategy using a sequence-based masked-motif score t