The PCA Model
Locator’s networks struggle when the genotype data has many more SNPs than samples – for example, whole-genome data with hundreds of thousands or millions of SNPs but only a few hundred individuals. The first layer of the network grows with the number of SNPs, so with millions of SNPs it has hundreds of millions of values to learn. That is far too many for a few hundred samples: the network memorizes the training samples instead of learning a real pattern, and its predictions on new samples get worse as more SNPs are added.
The PCA model fixes this. Before the network sees the genotypes, Locator runs a PCA and keeps only a handful of components. The network then learns from those few components instead of from millions of raw SNPs, so it stays small and accurate no matter how many SNPs you give it.
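To make the size difference concrete, here is a rough count of first-layer weights. It is only a sketch: it assumes a dense first layer with one weight per input per unit and uses the width of 256 from the example configuration on this page, while the SNP and component counts are illustrative values rather than measurements.

```python
def first_layer_weights(n_inputs: int, width: int) -> int:
    """Weights in a dense layer mapping n_inputs values to width units (biases ignored)."""
    return n_inputs * width

width = 256            # layer width, as in the example config
n_snps = 1_000_000     # illustrative whole-genome SNP count
n_components = 64      # illustrative number of PCA components

raw = first_layer_weights(n_snps, width)
reduced = first_layer_weights(n_components, width)

print(f"raw SNP input: {raw:,} weights")      # 256,000,000
print(f"PCA input:     {reduced:,} weights")  # 16,384
```

With only a few hundred samples, the second count is far easier to fit without memorizing the training data.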
When to use it
Turn the PCA model on when you have many more SNPs than samples – roughly, whole-genome data with 100,000 or more SNPs and a few hundred samples. With fewer SNPs the plain network works well and PCA is not needed. The PCA model’s job is to keep accuracy steady as the SNP count grows into the millions, where a plain network gets worse.
It works with normal training, holdouts, k-fold and leave-one-out cross-validation, and ensembles.
Basic usage
Let Locator choose the size (recommended)
Set pca_components to "auto" and Locator decides how many components to
keep, based on the data:
from locator import Locator
config = {
    "out": "wgs_analysis",
    "batch_size": 32,
    "width": 256,
    "nlayers": 8,
    "max_epochs": 500,
    "patience": 100,
    "pca_components": "auto",
}
locator = Locator(config)
genotypes, samples = locator.load_genotypes(zarr="genotypes.zarr")
locator.train(genotypes=genotypes, samples=samples)
Locator prints the number it chose at the start of training and stores it, so every fold of a run uses the same number.
Set the size yourself
Pass a number instead to keep exactly that many components:
config["pca_components"] = 64
Leaving pca_components out, or setting it to None, turns the PCA model off.
That is the default.
How it works
When pca_components is set, the genotypes pass through an extra PCA step
before the rest of the network:
genotype data -> PCA step (keeps a few components) -> rest of the network
A few details:
The PCA is run on the training samples only. Samples held out for testing are never used to build it, so cross-validation stays fair.
The PCA step starts as an exact PCA. It is set up to reproduce the PCA result exactly, and training is then allowed to adjust it. The rest of the network starts from random values, as usual.
Training happens in two stages. In the first stage the PCA step is held fixed while the rest of the network learns. In the second stage the PCA step is allowed to change too, more slowly, so it can adjust to better predict location. Set pca_finetune to False to skip the second stage and keep the PCA step fixed throughout.
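The first two details above can be illustrated with plain NumPy. This is a minimal sketch, not Locator's implementation: it assumes the PCA mean-centers the genotypes without scaling them, and the toy data, the split, and the helper name pca_step are all made up for illustration. It fits the PCA on training samples only, writes it as a linear step (weights plus bias), and checks that this step reproduces the PCA projection exactly before any training would adjust it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy genotype matrix: 40 samples x 500 SNPs, coded 0/1/2 (illustrative data only).
genotypes = rng.integers(0, 3, size=(40, 500)).astype(float)
train, held_out = genotypes[:30], genotypes[30:]

k = 5  # number of components to keep

# Fit the PCA on the training samples only.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:k]              # shape (k, SNPs)

# Express the PCA as a linear step: scores = x @ weights + bias.
weights = components.T           # shape (SNPs, k)
bias = -mean @ weights           # shape (k,)

def pca_step(x):
    """Hypothetical helper: apply the PCA-initialised linear step."""
    return x @ weights + bias

# At initialisation the linear step equals the PCA projection.
assert np.allclose(pca_step(train), (train - mean) @ components.T)

# Held-out samples are projected with the training mean and components;
# they played no part in computing either.
held_out_scores = pca_step(held_out)
print(held_out_scores.shape)     # (10, 5)
```

Because the step is an ordinary linear map, a later training stage can update weights and bias like any other layer, which is what makes the optional second stage possible.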
Choosing how many components to keep
With pca_components="auto", Locator looks at how much variation each
component captures. The first few components capture a lot; after that, each
one adds little. Locator keeps components up to the point where the curve
levels off – the natural cut-off in the data. This is usually a small number,
and in practice it predicts just as well as a much larger hand-picked number.
To see this cut-off yourself before choosing a number:
from locator.pca import scree_elbow
# training genotypes, shape (samples, SNPs)
n_components = scree_elbow(train_genotypes)
The number of components cannot be larger than the number of training samples or the number of SNPs; a larger value raises a clear error.
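For intuition about what a cut-off on this curve looks like, here is one common elbow heuristic in plain NumPy. It is a sketch under stated assumptions and may differ from what scree_elbow actually does: the function names, the chord-distance rule, and the ValueError are all illustrative choices. The heuristic draws a straight line from the first to the last point of the curve and picks the point that falls farthest below it.

```python
import numpy as np

def explained_variance_ratio(x):
    """Share of total variance captured by each principal component."""
    centered = x - x.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def elbow_by_chord_distance(ratios):
    """Count the components that come before the point farthest below
    the straight line joining the first and last points of the curve."""
    y = np.asarray(ratios, dtype=float)
    n = len(y)
    x = np.arange(n, dtype=float)
    chord = y[0] + (y[-1] - y[0]) * x / (n - 1)
    return int(np.argmax(chord - y))

def check_component_count(k, n_train_samples, n_snps):
    """Reject a component count above the limit described in the text."""
    limit = min(n_train_samples, n_snps)
    if k > limit:
        raise ValueError(f"{k} components requested, but at most {limit} are available")
    return k

# Hand-made curve: three strong components, then a flat tail.
ratios = [0.4, 0.3, 0.2, 0.02, 0.02, 0.02, 0.02, 0.02]
k = elbow_by_chord_distance(ratios)
print(check_component_count(k, n_train_samples=300, n_snps=1_000_000))  # 3
```

To apply it to real data, pass the output of explained_variance_ratio(train_genotypes) to elbow_by_chord_distance.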
Settings
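| Setting | Default | What it does |
|---|---|---|
| pca_components | None | Turns the PCA model on or off: "auto" lets Locator choose the number of components, a number keeps exactly that many, and None turns it off. |
| pca_finetune |  | Whether the second training stage adjusts the PCA step. |
|  |  | How fast the PCA step is allowed to change in the second stage. |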
When you cannot use it
The PCA model does not work with:
Bootstrap or jackknife runs. These resample or reorder the SNPs on every replicate, and the PCA step needs a fixed set of SNPs. Passing site_order together with pca_components raises an error.
Windowed analysis. Each window uses its own set of SNPs, so windowed runs reject pca_components.
For these, leave pca_components off.
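If you build configurations programmatically, a small check of your own can catch these combinations before a run starts. The helper below is hypothetical and not part of Locator: its name, its windowed flag, and its use of ValueError are assumptions, and Locator's own errors may be worded or typed differently.

```python
def check_pca_compatibility(config, site_order=None, windowed=False):
    """Hypothetical pre-flight check mirroring the restrictions listed above."""
    if config.get("pca_components") is None:
        return  # PCA model is off, nothing to check
    if site_order is not None:
        raise ValueError(
            "pca_components cannot be combined with site_order "
            "(bootstrap or jackknife resampling)"
        )
    if windowed:
        raise ValueError("pca_components cannot be used in windowed analysis")

# A plain training run with the PCA model on passes the check.
check_pca_compatibility({"pca_components": "auto"})

# A bootstrap-style run with the PCA model on is rejected.
try:
    check_pca_compatibility({"pca_components": 64}, site_order=[2, 0, 1])
except ValueError as err:
    print(f"rejected: {err}")
```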