# Your DNA Test Is 5x Less Accurate If You're Not White. Here's How We Fix It.
Polygenic Risk Scores work 4.9x better for European-ancestry patients. We discovered that the size of a model's internal representation — not adversarial training — is what actually controls fairness.
Imagine getting a genetic test that tells you your risk of heart disease. You act on it — adjust your diet, start medication, schedule screenings.
Now imagine that test is 4.9 times less accurate for you than for your neighbor, simply because of your ancestry. You don’t know this. Your doctor doesn’t know this. The test result looks the same.
This is not hypothetical. This is happening right now, to hundreds of millions of people.
## The Scale of the Problem
Polygenic Risk Scores (PRS) predict disease risk by aggregating thousands of genetic variants. They’re used for cancer screening, cardiovascular risk, diabetes prevention, and pharmacogenomics.
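At its core, a PRS is a weighted sum: each variant's effect size (taken from GWAS summary statistics) multiplied by the patient's allele dosage at that variant. A minimal sketch, with made-up variant IDs, weights, and dosages:

```python
# Minimal polygenic risk score: sum of (effect size x allele dosage).
# Effect sizes (betas) come from GWAS summary statistics; dosages are
# the patient's count of risk alleles (0, 1, or 2) at each variant.
# All identifiers and numbers below are invented for illustration.
gwas_betas = {"rs123": 0.12, "rs456": -0.05, "rs789": 0.30}
patient_dosages = {"rs123": 2, "rs456": 1, "rs789": 0}

def polygenic_risk_score(betas, dosages):
    return sum(beta * dosages.get(snp, 0) for snp, beta in betas.items())

print(round(polygenic_risk_score(gwas_betas, patient_dosages), 4))  # 0.19
```

The ancestry problem enters through the betas: effect sizes estimated in one population transfer poorly to populations with different allele frequencies and linkage structure, so the same formula yields far noisier scores outside the training ancestry.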
The datasets that train these models are overwhelmingly European-ancestry. Over 80% of genome-wide association study (GWAS) participants are of European ancestry. The result:
- European-ancestry patients: High PRS accuracy
- African-ancestry patients: 18-20% of European accuracy (Martin et al., Nature Genetics, 2019; Privé et al., AJHG, 2022)
- East Asian, South Asian, Indigenous populations: Similarly degraded
A doctor in Lagos and a doctor in London both order the same polygenic risk score. The London patient gets an accurate prediction. The Lagos patient gets noise with a confidence score attached.
Every PRS provider — consumer and clinical — ships scores that are systematically wrong for non-European patients. Doctors make clinical decisions on these numbers.
## Cross-Ancestry Transfer: The Commercial Proof
Before explaining the mechanism, here is the result that matters most commercially.
Using a dual-stream architecture — one stream for ancestry features, one bottleneck-constrained stream for phenotype — we demonstrated positive cross-ancestry transfer:
- European-to-East-Asian height prediction: +0.077 R² improvement over training on East Asian data alone
- The phenotype stream (d=8 to d=32) achieved 78% stream separation from the ancestry stream, confirming the model genuinely learned ancestry-invariant disease signals
This means a model trained primarily on European data can help non-European patients instead of failing on them — if the bottleneck forces it to learn the right features. That is a product differentiator, not just a compliance checkbox.
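The architecture is easiest to see as shapes. A toy sketch of the dual-stream split, with block-averaging standing in for the learned encoders (in the real system both streams are trained networks; everything here is illustrative):

```python
# Toy dual-stream forward pass (shapes only; hand-rolled stand-ins for
# the learned encoders). The ancestry stream is unconstrained; the
# phenotype stream is squeezed through a d-dimensional bottleneck, and
# only the phenotype stream feeds the risk prediction.
def dual_stream_forward(x, d=8):
    z_ancestry = x[:64]  # unconstrained stream (here: first 64 inputs)
    step = len(x) // d   # bottleneck stream: block-average down to d numbers
    z_phenotype = [sum(x[i * step:(i + 1) * step]) / step for i in range(d)]
    risk = sum(z_phenotype) / d  # risk head sees only the bottleneck stream
    return z_ancestry, z_phenotype, risk

x = [(i * 31) % 3 for i in range(256)]  # toy dosage vector
z_a, z_p, risk = dual_stream_forward(x, d=8)
print(len(z_a), len(z_p))  # 64 8
```

The design choice this illustrates: because the clinically used representation is the narrow one, whatever survives the bottleneck is what drives predictions, and the ancestry stream gives the model somewhere else to put ancestry signal.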
## Why Standard AI Fairness Doesn’t Work
The dominant approach to making AI fair is called adversarial debiasing. The idea: add a component to the model that tries to detect the protected attribute (ancestry, race, sex), then penalize the model whenever it succeeds. A parameter called lambda controls how hard you penalize.
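The mechanism fits in one line. A minimal sketch of the encoder's objective under gradient reversal (`encoder_objective` and the numbers are mine, for illustration only):

```python
# Shape of adversarial debiasing (illustrative pseudo-objective, not
# any project's actual training code). An adversary predicts ancestry
# from the internal representation; via gradient reversal, the encoder
# is rewarded when the adversary fails, scaled by lambda.
def encoder_objective(task_loss, adversary_loss, lam):
    # lam only weights a gradient signal: the encoder can trade a worse
    # adversary_loss for a better task_loss, and nothing in the deployed
    # model records that this pressure ever existed.
    return task_loss - lam * adversary_loss

# Doubling lambda doubles the pressure during training, and only there.
weak = encoder_objective(task_loss=0.40, adversary_loss=0.10, lam=0.5)
strong = encoder_objective(task_loss=0.40, adversary_loss=0.10, lam=1.0)
print(weak < strong or weak > strong)  # the difference lives only in training
```

Note what is absent: lambda appears nowhere in the trained weights or architecture. It shapes gradients during training and then vanishes.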
Hundreds of papers. Millions of GPU-hours. Entire research careers. All devoted to tuning lambda.
Our research tested this systematically across six clinical traits (height, type 2 diabetes, coronary artery disease, BMI, atrial fibrillation, breast cancer) using 1000 Genomes data from five ancestral populations.
The result: lambda, across a 20x range of adversarial strength, produced 2.2 percentage points of variation in bias. Negligible.
Lambda is a gradient signal — a suggestion. The model can ignore it, get stuck, or be overwhelmed by the task it’s trying to learn. Once the model is deployed, lambda is gone. There is no mechanism ensuring fairness persists after training ends.
## What Actually Controls Fairness
We discovered that the critical variable is not adversarial strength. It is the size of the model’s internal representation.
Every AI model compresses its input into an internal summary — a list of numbers. The length of that list is the bottleneck dimensionality, which we call d. In our system, d is an integer: 8, 16, 32, 64, or 128 dimensions.
When d is large (128 dimensions), the model has excess capacity. It stores everything — disease risk AND ancestry. It has room to be biased.
When d is small (8 dimensions), the model must choose. There is not enough room for everything. Given even minimal pressure to prefer disease-relevant signals over ancestry, the model retains what matters clinically and discards the ancestry signal. It physically cannot store the bias.
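Concretely, d is nothing more exotic than the length of the vector the encoder emits. A toy encoder, with a fixed random projection standing in for the learned one:

```python
import random

# Toy encoder: compress a long genotype vector to a d-dimensional
# representation via a fixed random linear projection. In a real model
# the projection is learned; this only illustrates what d means.
random.seed(0)

def make_encoder(input_dim, d):
    weights = [[random.gauss(0, 1) for _ in range(input_dim)] for _ in range(d)]
    def encode(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return encode

genotype = [random.choice([0, 1, 2]) for _ in range(1000)]  # 1,000 toy variants
encode = make_encoder(input_dim=1000, d=8)
z = encode(genotype)
print(len(z))  # the model's entire internal summary: 8 numbers
```

Everything downstream of the encoder sees only those d numbers. That is why shrinking d is a hard constraint rather than a soft penalty: there is no second channel through which discarded information can reach the prediction.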
The numbers:
| Bottleneck d | Ancestry Leakage | Task Accuracy |
|---|---|---|
| d=8 | 32.8% | 95% of baseline |
| d=16 | 38.1% | 97% of baseline |
| d=32 | 46.8% | 98% of baseline |
| d=64 | 60.4% | 99% of baseline |
| d=128 | 79.4% | 100% of baseline |
Dimensionality across a 16x range: 46.6 percentage points of variation in bias.
This is not a training-time suggestion. It is an architectural constraint. The model is incapable of encoding bias because the information capacity of its representation is too small to hold both signals.
## The Auditing Advantage
The difference between lambda and d is not incremental. It is categorical.
With lambda: A regulator asks “prove your model is fair.” You say: “During training, we set lambda to 0.5, used a two-layer MLP adversary with 32 hidden units, warmed up linearly over 20 epochs, ran 3 seeds…” The regulator cannot verify any of this without access to your full training pipeline, data, and compute infrastructure.
With d: A regulator asks “prove your model is fair.” You say: “The model’s representation has 8 dimensions. Here — look at the architecture.” The regulator inspects the model. Sees d=8. Verification complete. 30 seconds. No training data needed. No proprietary code needed. No patient data exposed.
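What that inspection might look like in code. The config format below is hypothetical (invented for this sketch); the point is that d is a static property of the shipped artifact, checkable without touching training data or pipelines:

```python
import json

# Sketch of the audit: read the model's architecture description and
# check the bottleneck width. The JSON schema here is hypothetical --
# any serialized architecture exposes the same information.
model_config = json.loads('{"encoder": {"layers": [512, 64], "bottleneck_dim": 8}}')

def audit_bottleneck(config, max_dim=16):
    d = config["encoder"]["bottleneck_dim"]
    return d <= max_dim, d

ok, d = audit_bottleneck(model_config)
print(ok, d)  # True 8
```

Contrast this with auditing lambda, which would require replaying the training run itself.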
This is the only fairness mechanism that allows external verification without exposing proprietary processes or protected health information.
## The Privacy Dimension
We also addressed a secondary problem: how do you process genomic data without exposing the genome?
Using DCT-II frequency-domain encoding, we compress 80.8 million genetic variants into 128 coefficients. The compression ratio is 727:1. From those 128 numbers, we demonstrated:
- 81.9% ancestry classification accuracy (proving the encoding preserves meaningful signal)
- 0% reconstruction of the original genotype (three independent attacks failed)
- The encoding is irreversible — not because it is encrypted, but because the information is mathematically destroyed
Your genome never leaves your machine. Only 128 numbers travel the network. They cannot be reversed back to your DNA.
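A toy version of the encoding, using my own minimal DCT-II rather than the production implementation, and at a much smaller scale:

```python
import math

# DCT-II of a toy genotype vector, keeping only the first k_keep
# low-frequency coefficients. Scale is illustrative: the real pipeline
# maps ~80.8M variants to 128 coefficients; here 1,024 go to 16.
def dct2_truncated(x, k_keep):
    n = len(x)
    return [sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
            for k in range(k_keep)]

genotype = [(i * 7919) % 3 for i in range(1024)]  # deterministic toy dosages
coeffs = dct2_truncated(genotype, k_keep=16)

# 16 numbers cannot pin down 1,024 dosages: truncation discards the
# high-frequency content outright, so no decoder can reconstruct it.
print(len(genotype), "->", len(coeffs))  # 1024 -> 16
```

The irreversibility is not cryptographic. Keeping 16 coefficients of a 1,024-point transform throws away 1,008 degrees of freedom, and no amount of computation recovers information that was never transmitted.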
## Try It Yourself
One Docker command. Your data stays on your machine.
```shell
docker pull ghcr.io/paragon-dao/paragon-fairness:latest
docker run --rm -v $(pwd)/data:/data \
  ghcr.io/paragon-dao/paragon-fairness:latest \
  train --data /data/your_data.csv \
        --target outcome \
        --protected ancestry
```
The output: a Fairness Certificate — cryptographically signed, machine-verifiable, showing exactly how your model performs across every demographic group at each bottleneck dimension.
Try the interactive demo to see how adjusting d changes fairness in real time, using real data from 2,504 individuals across 5 ancestral populations.
## The Urgency
The EU AI Act takes full effect August 2, 2026. High-risk AI systems — including medical AI — must demonstrate fairness with auditable documentation.
Every genomics company, health AI platform, and medical device manufacturer deploying AI that touches patients needs to answer one question: can you prove your model is fair?
If your answer involves explaining training hyperparameters, regulators will not be satisfied.
If your answer is “the model has d=16 dimensions — here, look,” you are done.
The larger question is not just compliance. It is whether your model actually works for the 80% of the world that current models fail. That is the difference between avoiding a fine and reaching a market.
Read the documentation or contact us to get started.