The PCA Model
=============

Locator's networks struggle when the genotype data has many more SNPs than
samples -- for example, whole-genome data with hundreds of thousands or
millions of SNPs but only a few hundred individuals. The first layer of the
network grows with the number of SNPs, so with millions of SNPs it has
hundreds of millions of values to learn. That is far too many for a few
hundred samples: the network memorizes the training samples instead of
learning a real pattern, and its predictions on new samples get *worse* as
more SNPs are added.

The PCA model fixes this. Before the network sees the genotypes, Locator runs
a PCA and keeps only a handful of components. The network then learns from
those few components instead of from millions of raw SNPs, so it stays small
and accurate no matter how many SNPs you give it.

.. contents:: Table of Contents
   :local:
   :depth: 2

When to use it
--------------

Turn the PCA model on when you have many more SNPs than samples -- roughly,
whole-genome data with 100,000 or more SNPs and a few hundred samples. With
fewer SNPs the plain network works well and PCA is not needed. The PCA model's
job is to keep accuracy steady as the SNP count grows into the millions, where
a plain network gets worse.

It works with normal training, holdouts, k-fold and leave-one-out
cross-validation, and ensembles.

Basic usage
-----------

Let Locator choose the size (recommended)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Set ``pca_components`` to ``"auto"`` and Locator decides how many components
to keep, based on the data:

.. code-block:: python

    from locator import Locator

    config = {
        "out": "wgs_analysis",
        "batch_size": 32,
        "width": 256,
        "nlayers": 8,
        "max_epochs": 500,
        "patience": 100,
        "pca_components": "auto",
    }

    locator = Locator(config)
    genotypes, samples = locator.load_genotypes(zarr="genotypes.zarr")
    locator.train(genotypes=genotypes, samples=samples)

Locator prints the number it chose at the start of training and stores it, so
every fold of a run uses the same number.

Set the size yourself
~~~~~~~~~~~~~~~~~~~~~~

Pass a number instead to keep exactly that many components:

.. code-block:: python

    config["pca_components"] = 64

Leaving ``pca_components`` out (or set to ``None``) turns the PCA model off.
That is the default.

How it works
------------

When ``pca_components`` is set, the genotypes pass through an extra PCA step
before the rest of the network::

    genotype data -> PCA step (keeps a few components) -> rest of the network

A few details:

* **The PCA is run on the training samples only.** Samples held out for
  testing are never used to build it, so cross-validation stays fair.
* **The PCA step starts as an exact PCA.** It is set up to reproduce the PCA
  result exactly, and training is then allowed to adjust it. The rest of the
  network starts from random values, as usual.
* **Training happens in two stages.** In the first stage the PCA step is held
  fixed while the rest of the network learns. In the second stage the PCA step
  is allowed to change too, more slowly, so it can adjust to better predict
  location. Set ``pca_finetune`` to ``False`` to skip the second stage and
  keep the PCA step fixed throughout.

Choosing how many components to keep
------------------------------------

With ``pca_components="auto"``, Locator looks at how much variation each
component captures. The first few components capture a lot; after that, each
one adds little. Locator keeps components up to the point where the curve
levels off -- the natural cut-off in the data.
This is usually a small number, and in practice it predicts just as well as a
much larger hand-picked number.

To see this cut-off yourself before choosing a number:

.. code-block:: python

    from locator.pca import scree_elbow

    # train_genotypes: the training genotypes, shape (samples, SNPs)
    n_components = scree_elbow(train_genotypes)

The number of components cannot be larger than the number of training samples
or the number of SNPs; a larger value raises a clear error.

Settings
--------

.. list-table::
   :header-rows: 1
   :widths: 25 15 60

   * - Setting
     - Default
     - What it does
   * - ``pca_components``
     - ``None``
     - Turns the PCA model on or off: ``None`` is off, a number keeps that
       many components, and ``"auto"`` lets Locator choose.
   * - ``pca_finetune``
     - ``True``
     - Whether the second training stage adjusts the PCA step. ``False``
       keeps the PCA step fixed the whole time.
   * - ``pca_finetune_lr``
     - ``1e-4``
     - How fast the PCA step is allowed to change in the second stage.

When you cannot use it
----------------------

The PCA model does not work with:

* **Bootstrap or jackknife runs.** These resample or reorder the SNPs on every
  replicate, and the PCA step needs a fixed set of SNPs. Passing
  ``site_order`` together with ``pca_components`` raises an error.
* **Windowed analysis.** Each window uses its own set of SNPs, so windowed
  runs reject ``pca_components``.

For these, leave ``pca_components`` off.
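
Illustrative sketch
-------------------

The ideas from *How it works* and *Choosing how many components to keep* can
be tried out on toy data with plain NumPy. The sketch below is not Locator's
implementation. The helper names ``fit_pca``, ``project`` and ``chord_elbow``
are hypothetical, the data is synthetic, and the elbow rule (the point on the
scree curve farthest below the straight line joining its ends) is a common
heuristic used here as a stand-in; ``scree_elbow`` may work differently. The
first-layer weight count assumes a plain dense layer with ``width`` units.

.. code-block:: python

    import numpy as np


    def fit_pca(train, n_components=None):
        """Fit a PCA on training genotypes of shape (samples, SNPs)."""
        limit = min(train.shape)
        if n_components is not None and n_components > limit:
            raise ValueError(
                f"n_components={n_components} exceeds the limit of {limit} "
                "(number of training samples or SNPs)"
            )
        mean = train.mean(axis=0)
        # Thin SVD of the centered matrix; rows of vt are the principal axes.
        _, s, vt = np.linalg.svd(train - mean, full_matrices=False)
        explained = s**2 / np.sum(s**2)  # share of variance per component
        return mean, vt[:n_components], explained


    def project(genotypes, mean, components):
        """Project genotypes onto axes that were fitted on the training set."""
        return (genotypes - mean) @ components.T


    def chord_elbow(explained):
        """Hypothetical elbow rule: count the components before the point
        that lies farthest below the chord through the curve's end points."""
        n = len(explained)
        if n < 3:
            return n
        x = np.arange(n)
        chord = explained[0] + (explained[-1] - explained[0]) * x / (n - 1)
        return max(1, int(np.argmax(chord - explained)))


    # Synthetic stand-in for genotypes: 3 hidden factors plus noise.
    rng = np.random.default_rng(0)
    n_samples, n_snps, width = 60, 2000, 256
    latent = rng.normal(size=(n_samples, 3))
    loadings = rng.normal(size=(3, n_snps))
    genotypes = latent @ loadings + 0.1 * rng.normal(size=(n_samples, n_snps))

    # Held-out samples play no part in fitting the PCA.
    train, held_out = genotypes[:50], genotypes[50:]
    mean, components, explained = fit_pca(train)
    n_keep = chord_elbow(explained)

    train_pcs = project(train, mean, components[:n_keep])
    held_out_pcs = project(held_out, mean, components[:n_keep])

    print("components kept:", n_keep)
    print("first-layer weights on raw SNPs:  ", n_snps * width)
    print("first-layer weights on components:", n_keep * width)

Because the held-out rows are projected with the mean and axes learned from
the training rows alone, nothing about them leaks into the PCA. The two
weight counts show why the first layer stops growing with the SNP count once
it reads components instead of raw SNPs.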