Most DNA analysis compares your file with the human reference genome, a shared sequence used for coordinates. If your DNA has a base that differs from the reference base at a position, that base is labeled the alternate allele. The label describes a difference from the reference, not how common the base is. In population data, an alternate allele frequency of 0.99 means about 99% of observed alleles in that dataset carry the alternate base.
What the reference genome is
The human reference genome gives every position an address so tools, databases, and papers can point to the same place. It is a representative sequence, not a list of the most common allele in every population. The Genome Reference Consortium (GRC) maintains the current reference assembly and publishes patches and alternate loci for complex regions.
For most positions, the reference lists one base. People usually carry two copies of each autosome, so the reference cannot show that a position is commonly heterozygous, and it cannot represent every population at once.
A short history
The Human Genome Project
The first human reference came out of the Human Genome Project, a public research project that ran from 1990 to 2003. The project announced a draft in June 2000, published draft details in 2001, and announced an essentially complete sequence in April 2003. The 2003 sequence covered about 92% of the genome and still had fewer than 400 gaps. Celera produced a private sequence over a similar period.
The public sequence was assembled from cloned DNA fragments contributed by anonymous donors who gave blood samples after informed consent. NHGRI says most of the original samples came from volunteers recruited in Buffalo, New York. The final reference was not a population average: about 70% came from one anonymous person of blended ancestry, and the remaining 30% came from 19 other anonymous people, mostly of European ancestry.
Builds and version numbers
The reference has been revised many times. You will see names like NCBI36 (hg18), GRCh37 (hg19), and GRCh38 (hg38). The GRC describes GRCh37 as its first major human reference assembly release and lists later patch releases for GRCh37 and GRCh38. The consortium also adds alternate sequences for regions that vary a lot between people.
GRCh37 and GRCh38 are the two builds you are most likely to meet in files. The same variant can have a different position number in each build, so the build matters whenever you compare a file with a database.
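One rough way to confirm the build is to look at the contig lengths a VCF header declares, because chromosome 1 has a different length in each build. The sketch below is a minimal illustration, not a complete build detector. The function name `guess_build` and the lookup table are hypothetical, it only knows the two builds named above, and it assumes the header contains simple `##contig` lines without quoted commas.

```python
import re

# Assumption: chromosome 1 lengths of the two common primary assemblies.
CHR1_LENGTH_TO_BUILD = {
    249_250_621: "GRCh37 (hg19)",
    248_956_422: "GRCh38 (hg38)",
}

# Matches header lines such as: ##contig=<ID=chr1,length=248956422>
CONTIG_RE = re.compile(r"^##contig=<(.*)>\s*$")


def guess_build(header_lines):
    """Return a build name guessed from the chr1 contig length, or None."""
    for line in header_lines:
        match = CONTIG_RE.match(line)
        if not match:
            continue
        # Split "ID=chr1,length=..." into a dict; pieces without "=" are skipped.
        fields = dict(
            part.split("=", 1)
            for part in match.group(1).split(",")
            if "=" in part
        )
        is_chr1 = fields.get("ID") in ("1", "chr1")
        if is_chr1 and fields.get("length", "").isdigit():
            # Unknown lengths fall through to None rather than a guess.
            return CHR1_LENGTH_TO_BUILD.get(int(fields["length"]))
    return None


header = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=248956422>",
]
print(guess_build(header))
```

A `None` result means the header gave no usable hint, not that the file is wrong. Files without contig lines, such as many raw array exports, need the build from the provider's documentation instead.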
Newer references
In 2022 a telomere-to-telomere assembly called T2T-CHM13 filled in regions that earlier builds left as gaps. It was built from a cell line with two identical copies of each chromosome, which made the hard, repetitive regions easier to resolve. In 2023 a draft human pangenome reference added 47 phased diploid assemblies from genetically diverse individuals, so more common variation can be represented directly rather than described only as a difference from one sequence.
Why a common allele can be called alternate
Variant files describe each position with a reference allele, taken from the reference genome, and an alternate allele, which is anything that differs from it. Alternate allele frequency is the fraction of observed alleles in a population dataset that carry the alternate base.
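The definition above comes down to counting alleles. The following sketch assumes a biallelic site and VCF-style genotype strings, where `0` is the reference allele, `1` is the alternate allele, and `.` is a missing call. The function name `alt_allele_frequency` is hypothetical, chosen for illustration.

```python
def alt_allele_frequency(genotypes):
    """Fraction of called alleles that carry the alternate base.

    Assumes a biallelic site: allele "1" is the alternate, "0" the reference.
    Returns None when no allele was called.
    """
    alt_count = 0
    called = 0
    for genotype in genotypes:
        # Treat phased ("|") and unphased ("/") genotypes the same way.
        for allele in genotype.replace("|", "/").split("/"):
            if allele == ".":
                continue  # missing calls are not observed alleles
            called += 1
            if allele == "1":
                alt_count += 1
    if called == 0:
        return None
    return alt_count / called


# 98 people with two alternate copies, 2 people with one of each:
# 198 alternate alleles out of 200 observed.
samples = ["1/1"] * 98 + ["0/1"] * 2
print(alt_allele_frequency(samples))  # 0.99
```

The denominator is the number of observed alleles, not the number of people, which is why each diploid genotype contributes two counts.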
If the reference happens to carry the rarer base at a position, the common base becomes the alternate. An alternate allele frequency of 0.90 means about 90% of observed alleles carry that base. A value of 0.99 means about 99% do. Nothing is wrong with the data. The reference simply lists the less common base at that spot.
This is also why a field labeled MAF (minor allele frequency) should stay at or below 0.5, while an alternate allele frequency can be much higher. MAF always describes the rarer allele, whichever one that is; alternate allele frequency always describes the non-reference allele. If you see 0.99 next to a variant, read it as "about 99% of observed alleles in this dataset differ from the reference here," not as "rare" and not as "harmful." The page on allele frequency and MAF in variant review covers that distinction in more detail.
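For a site with only two alleles, the two numbers are related by simple arithmetic: the minor allele frequency is the smaller of the alternate frequency and one minus it. The sketch below assumes a biallelic site, and the function names are hypothetical.

```python
def minor_allele_frequency(alt_af):
    """Minor allele frequency at a biallelic site, always between 0 and 0.5."""
    if not 0.0 <= alt_af <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    return min(alt_af, 1.0 - alt_af)


def minor_allele_label(alt_af):
    """Say which allele is the rarer one at a biallelic site."""
    if alt_af == 0.5:
        return "tie"
    return "alternate" if alt_af < 0.5 else "reference"


# With an alternate allele frequency of 0.99, the reference base is the rare one.
print(minor_allele_label(0.99))
print(round(minor_allele_frequency(0.99), 4))
```

When the label comes back as `reference`, the rare base is the one in the reference genome, which is the situation this section describes.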
What this means for review
| Situation | What you see | How to read it |
|---|---|---|
| Reference carries the common allele | Low alternate allele frequency for a rare change | The rare alternate deserves a closer look at consequence and ClinVar. |
| Reference carries the rarer allele | High alternate allele frequency, such as 0.99 (about 99% of observed alleles carry the alternate base) | The change is common; the label is positional, not a warning. |
| Different genome build | Same variant, different position number | Confirm the build before comparing with a database. |
| Region with high variation | Reference may use alternate sequences | Frequency and mapping can be less certain there. |
What the reference cannot tell you
The reference genome is a coordinate sequence. It does not say whether an allele is good or bad. Frequency, consequence, ClinVar, predictor scores, and genotype quality still decide what deserves attention.
Related review pages
To turn frequency into a decision, read allele frequency and MAF in variant review. For variant type context, see what a missense variant is and how splice variants are reviewed. For file-type limits, compare SNP arrays with WGS and WES files.
Sources
- NHGRI Human Genome Project fact sheet, for project dates, the 2000 draft, the 2003 sequence, donor blood samples, donor recruitment, and the 70% / 30% composition of the original reference.
- Genome Reference Consortium human genome overview, for GRCh37, GRCh38 patch releases, alternate loci, and current reference maintenance.
- NIH release on the complete gapless human genome sequence, for T2T-CHM13 and the remaining regions filled by the Telomere-to-Telomere consortium.
- Liao et al., A draft human pangenome reference, Nature, 2023, for the 47 phased diploid assemblies and the motivation for a pangenome reference.

