Rare diseases, though individually uncommon, collectively affect millions worldwide. Despite major advances in genome sequencing and in the interpretation of genomic data from affected individuals, for many families the search for a genetic diagnosis remains inconclusive. This diagnostic gap stems from multiple obstacles: genetic variants that fall outside protein-coding regions, limitations in analysis pipelines, missing reference genome annotations, limited functional data, and the sheer diversity of rare conditions.
To address this need, five research centers, including the University of Washington, came together in 2021 to form the Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium. Supported by the National Human Genome Research Institute (NHGRI), GREGoR aims to systematically evaluate the usefulness of emerging sequencing technologies for different disease types, build multi-omic datasets, develop new computational approaches, and rapidly share data with the global scientific community. A new Perspective, published in Nature in November 2025, lays out the progress the GREGoR Consortium has made and outlines the challenges that remain.
What has the consortium accomplished so far? “GREGoR collects and shares data from over 7,500 individuals across ~3,000 families, many of which had previously undergone clinical testing (e.g., exome sequencing) and remained undiagnosed. It makes all generated data publicly available via a shared, globally accessible platform (NIH’s AnVIL), lowering barriers for researchers worldwide to reanalyze, reuse, and build on the data,” lead author Moez Dawood shares. “The work essentially shifts rare-disease research from isolated case studies to a scalable, collaborative, global infrastructure, a ‘foundation’ for future discovery. The consortium model helps tackle the problem that more than half of rare-disease cases remain unsolved even after clinical sequencing.”
But GREGoR is going beyond amassing genomic data: its researchers are evaluating the performance, limitations, and ideal use cases of genomics technologies to guide clinicians and researchers on which methods to use for unsolved rare diseases. Because researchers’ understanding of the human genome is heavily weighted towards protein-coding regions, it can be difficult to determine next steps when exome sequencing finds no explanatory variant in a rare disease patient’s coding genome. To this end, GREGoR has developed computational approaches to extract new diagnoses by reanalyzing existing exome sequencing data. For example, consortium researchers have improved techniques for determining whether one or both alleles of a gene are disrupted (known as “phasing variants”) and developed better tools for detecting pathogenic structural variants such as deletions, duplications, insertions, inversions, and translocations.
Beyond reanalysis of existing data, GREGoR provides a roadmap for genetic testing when exome sequencing is inconclusive, starting with short-read genomic sequencing (srGS). In fact, the consortium’s data suggest that srGS could be more effective than exome sequencing as a first-line diagnostic test because of its higher diagnostic yield. GREGoR is developing tools that infer structural and copy-number variants from srGS data, which would allow srGS to replace both SNP arrays and exomes as a single, cost-effective assay.
Variation in noncoding regions can affect gene regulation, gene expression, and mRNA splicing. To assess each of these possibilities, researchers can apply technologies such as short- and long-read sequencing, RNA sequencing, structural variant discovery, and epigenomic profiling. These types of data can provide critical context about when, where, and how genes are expressed, helping researchers link genetic changes to their functional consequences in disease-relevant tissues. GREGoR has made progress towards demystifying the noncoding genome by identifying noncoding regions that are intolerant of mutation. GREGoR is also evaluating how short- and long-read sequencing compare in diagnostic power and cost, showing, for instance, that targeted long-read methods can uncover structural and repetitive-sequence variants that short-read sequencing often misses.