The PCA Model

Locator’s networks struggle when the genotype data has many more SNPs than samples – for example, whole-genome data with hundreds of thousands or millions of SNPs but only a few hundred individuals. The first layer of the network grows with the number of SNPs, so with millions of SNPs it has hundreds of millions of values to learn. That is far too many for a few hundred samples: the network memorizes the training samples instead of learning a real pattern, and its predictions on new samples get worse as more SNPs are added.

The PCA model fixes this. Before the network sees the genotypes, Locator runs a PCA and keeps only a handful of components. The network then learns from those few components instead of from millions of raw SNPs, so it stays small and accurate no matter how many SNPs you give it.
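To put rough numbers on the difference, the sketch below counts the weights in a dense layer fed by raw SNPs and by PCA components. The layer width of 256 is an assumption chosen for illustration, not a value taken from Locator, and `first_layer_weights` is a hypothetical helper.

```python
def first_layer_weights(n_inputs: int, layer_width: int = 256) -> int:
    """Weights in a dense layer: one per (input, unit) pair. Width 256 is assumed."""
    return n_inputs * layer_width


raw_weights = first_layer_weights(1_000_000)  # one million raw SNPs as inputs
pca_weights = first_layer_weights(64)         # 64 principal components as inputs

print(f"raw SNPs as input:       {raw_weights:,} weights")
print(f"PCA components as input: {pca_weights:,} weights")
```

The PCA step itself still holds one value per SNP per component, but those values start from the PCA solution rather than from random values.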

When to use it

Turn the PCA model on when you have many more SNPs than samples – roughly, whole-genome data with 100,000 or more SNPs and a few hundred samples. With fewer SNPs the plain network works well and PCA is not needed. The PCA model’s job is to keep accuracy steady as the SNP count grows into the millions, where a plain network gets worse.

It works with normal training, holdouts, k-fold and leave-one-out cross-validation, and ensembles.

Basic usage

Set the size yourself

To keep an exact number of components rather than letting Locator choose with "auto", pass a number:

config["pca_components"] = 64

Leaving pca_components out, or setting it to None, turns the PCA model off. That is the default.

How it works

When pca_components is set, the genotypes pass through an extra PCA step before the rest of the network:

genotype data  ->  PCA step (keeps a few components)  ->  rest of the network

A few details:

  • The PCA is run on the training samples only. Samples held out for testing are never used to build it, so cross-validation stays fair.

  • The PCA step starts as an exact PCA: it is initialized to reproduce the PCA result exactly, and training is then allowed to adjust it. The rest of the network starts from random values, as usual.

  • Training happens in two stages. In the first stage the PCA step is held fixed while the rest of the network learns. In the second stage the PCA step is allowed to change too, but more slowly (at the rate set by pca_finetune_lr), so it can adjust to better predict location. Set pca_finetune to False to skip the second stage and keep the PCA step fixed throughout.
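The three details above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not Locator's implementation: the PCA step is written as a plain linear layer, a small linear head stands in for the rest of the network, training is simple gradient descent on squared error, and the data are random toy values. The names `fit_pca_step` and `train_stage` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data: 40 samples, 500 SNPs coded 0/1/2, and 2-D locations.
genotypes = rng.integers(0, 3, size=(40, 500)).astype(float)
locations = rng.normal(size=(40, 2))
train, held_out = np.arange(30), np.arange(30, 40)


def fit_pca_step(train_genotypes, n_components):
    """Return (weights, bias) so that X @ weights + bias gives the PCA scores."""
    mean = train_genotypes.mean(axis=0)
    # Rows of vt are the principal axes of the centered training matrix.
    _, _, vt = np.linalg.svd(train_genotypes - mean, full_matrices=False)
    weights = vt[:n_components].T   # shape (SNPs, n_components)
    bias = -mean @ weights          # folds the centering into the step
    return weights, bias


def train_stage(X, y, W, b, head, steps, lr_head, lr_pca):
    """Gradient descent on mean squared error; lr_pca=0 keeps the PCA step fixed."""
    W, b, head = W.copy(), b.copy(), head.copy()
    for _ in range(steps):
        scores = X @ W + b
        err = (scores @ head - y) * (2.0 / len(X))   # d(loss)/d(prediction)
        grad_scores = err @ head.T                   # d(loss)/d(scores)
        head -= lr_head * (scores.T @ err)
        W -= lr_pca * (X.T @ grad_scores)
        b -= lr_pca * grad_scores.sum(axis=0)
    return W, b, head


# The PCA step is built from the training rows only.
W0, b0 = fit_pca_step(genotypes[train], n_components=5)
held_out_scores = genotypes[held_out] @ W0 + b0   # projected, never used for fitting

X, y = genotypes[train], locations[train]
head0 = rng.normal(scale=0.01, size=(5, 2))       # the head starts from random values

# Stage 1: the PCA step is fixed and only the head learns.
W1, b1, head1 = train_stage(X, y, W0, b0, head0, steps=200, lr_head=0.01, lr_pca=0.0)
# Stage 2: the PCA step is released with a much smaller learning rate.
W2, b2, head2 = train_stage(X, y, W1, b1, head1, steps=200, lr_head=0.01, lr_pca=1e-4)

print("PCA step changed in stage 1:", not np.array_equal(W0, W1))
print("PCA step changed in stage 2:", not np.array_equal(W1, W2))
```

Because the centering is folded into the bias, held-out samples are projected with the training mean and training axes, which is what keeps the held-out data out of the PCA.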

Choosing how many components to keep

With pca_components="auto", Locator looks at how much variation each component captures. The first few components capture a lot; after that, each one adds little. Locator keeps components up to the point where the curve levels off – the natural cut-off in the data. This is usually a small number, and in practice it predicts just as well as a much larger hand-picked number.

To see this cut-off yourself before choosing a number:

from locator.pca import scree_elbow

# training genotypes, shape (samples, SNPs)
n_components = scree_elbow(train_genotypes)

The number of components cannot exceed the smaller of the number of training samples and the number of SNPs; a larger value raises an error.
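The idea of cutting where the curve levels off can be illustrated with a small NumPy sketch. The exact rule used by scree_elbow is not described in this section, so the heuristic below (the point farthest below the straight line joining the first and last points of the curve) is an assumption, and both function names are hypothetical.

```python
import numpy as np


def explained_variance_ratio(train_genotypes):
    """Share of total variance captured by each principal component."""
    centered = train_genotypes - train_genotypes.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


def components_before_elbow(ratios):
    """Count the components that come before the point farthest below the chord.

    The chord is the straight line from the first to the last point of the
    scree curve. This is one common elbow heuristic, assumed for illustration.
    """
    n = len(ratios)
    positions = np.arange(n)
    chord = ratios[0] + (ratios[-1] - ratios[0]) * positions / (n - 1)
    return max(1, int(np.argmax(chord - ratios)))


# Toy matrix with three strong axes of variation plus noise.
rng = np.random.default_rng(0)
signal = 10.0 * rng.normal(size=(50, 3)) @ rng.normal(size=(3, 200))
toy_genotypes = signal + rng.normal(size=(50, 200))

ratios = explained_variance_ratio(toy_genotypes)
n_keep = components_before_elbow(ratios)
print("components kept:", n_keep)
print("first five ratios:", np.round(ratios[:5], 3))
```

On this toy matrix the first three ratios are large and the rest are close to zero, so the cut-off falls right after the third component.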

Settings

| Setting | Default | What it does |
|---|---|---|
| pca_components | None | Turns the PCA model on or off: None is off, a number keeps that many components, and "auto" lets Locator choose. |
| pca_finetune | True | Whether the second training stage adjusts the PCA step. False keeps the PCA step fixed the whole time. |
| pca_finetune_lr | 1e-4 | How fast the PCA step is allowed to change in the second stage. |
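As a quick reference, the three settings can be written out together. This fragment assumes `config` is the same dictionary-like object used under "Basic usage"; how it is created and passed to Locator is not shown in this section, so an empty dict stands in for it here.

```python
# Stand-in for the settings object used under "Basic usage" (an assumption).
config = {}

config["pca_components"] = "auto"   # or an integer such as 64; None turns the PCA model off
config["pca_finetune"] = True       # False keeps the PCA step fixed throughout
config["pca_finetune_lr"] = 1e-4    # how fast the PCA step may change in the second stage
```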

When you cannot use it

The PCA model does not work with:

  • Bootstrap or jackknife runs. These resample or reorder the SNPs on every replicate, and the PCA step needs a fixed set of SNPs. Passing site_order together with pca_components raises an error.

  • Windowed analysis. Each window uses its own set of SNPs, so windowed runs reject pca_components.

For these, leave pca_components off.
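If one script drives several kinds of run, a small guard can switch the PCA model off where it is not supported. This is a hypothetical helper for your own scripts, not part of Locator: the run-type labels and the function name are made up for illustration, and `config` is assumed to behave like a dict.

```python
import copy

# Hypothetical labels for the run types that reject the PCA model.
RUNS_WITHOUT_PCA = {"bootstrap", "jackknife", "windowed"}


def config_for_run(config: dict, run_type: str) -> dict:
    """Return a copy of config with pca_components switched off where unsupported."""
    adjusted = copy.deepcopy(config)
    if run_type in RUNS_WITHOUT_PCA:
        adjusted["pca_components"] = None   # None turns the PCA model off
    return adjusted


base = {"pca_components": 64}
print(config_for_run(base, "jackknife"))   # PCA switched off for this run
print(config_for_run(base, "kfold"))       # PCA left on
```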