CLI Guide

This guide covers the command-line interface (CLI) for Locator.

Basic Usage

To fit a model to a dataset and predict locations for validation samples:

locator --vcf data/test_genotypes.vcf.gz --sample_data data/test_sample_data.txt --out out/test/test

This writes four files to the output directory (out/test/), each named with the test prefix given to --out:

  • test_predlocs.txt – predicted locations

  • test_history.txt – training history

  • test_params.json – run parameters

  • test_fitplot.pdf – plot of training history

See all parameters with locator --help.
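
After a run, the four output files listed above can be checked with a small shell helper. This is a minimal sketch: check_locator_outputs is a hypothetical function written for this guide, not part of Locator, and it assumes only the four file suffixes named in the list.

```shell
# Hypothetical helper (not part of Locator): report which of the four
# expected output files are missing for a given --out prefix.
check_locator_outputs() {
    local prefix="$1"
    local missing=0
    local suffix
    for suffix in _predlocs.txt _history.txt _params.json _fitplot.pdf; do
        if [ ! -f "${prefix}${suffix}" ]; then
            echo "missing: ${prefix}${suffix}" >&2
            missing=1
        fi
    done
    # Exit status 0 if all four files exist, 1 otherwise.
    return "$missing"
}

# Example: check_locator_outputs out/test/test
```

The function prints nothing and returns 0 when all four files are present, so it can gate later steps in a script.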

Uncertainty and Windowed Analysis

Fitting separate models to windows across the genome produces multiple predictions per individual, which lets you estimate uncertainty and intragenomic variation in that individual's predicted location. The --windows option generates separate predictions for nonoverlapping windows of size --window_size (default 500,000 bp).

This option requires zarr input for fast chunked array access. For large VCFs, we recommend converting to zarr format first using bio2zarr (installed via pip install locator[fast-vcf]):

# Recommended: bio2zarr (fast, multi-threaded, uses htslib)
# VCFs must be indexed first
bcftools index -t data/test_genotypes.vcf.gz
vcf2zarr convert -p 8 data/test_genotypes.vcf.gz data/test_genotypes.zarr

# Alternative: scikit-allel wrapper (slower, no additional dependencies)
vcf_to_zarr --vcf data/test_genotypes.vcf.gz --zarr data/test_genotypes.zarr
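
If the conversion is scripted, a small helper can pick whichever of the two converters is installed. The sketch below is an assumption about how one might wire this up: pick_zarr_converter is a hypothetical name, and it only checks that a command with the expected name is on PATH.

```shell
# Hypothetical helper: print the name of the first VCF-to-zarr converter
# found on PATH, preferring bio2zarr's vcf2zarr over vcf_to_zarr.
pick_zarr_converter() {
    local tool
    for tool in vcf2zarr vcf_to_zarr; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    echo "no VCF-to-zarr converter found on PATH" >&2
    return 1
}

# Example dispatch, reusing the two command lines shown above:
#   case "$(pick_zarr_converter)" in
#       vcf2zarr)    vcf2zarr convert -p 8 data/test_genotypes.vcf.gz data/test_genotypes.zarr ;;
#       vcf_to_zarr) vcf_to_zarr --vcf data/test_genotypes.vcf.gz --zarr data/test_genotypes.zarr ;;
#   esac
```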

Locator supports zarr files produced by either tool. Once converted, run a windowed analysis with:

mkdir -p out/test_windows/
locator --zarr data/test_genotypes.zarr --sample_data data/test_sample_data.txt --out out/test_windows/ --windows --window_size 250000

This should take around 5 minutes on a GPU. For the analyses of humans, mosquitoes, and malaria parasites described in our paper, we used window sizes that yield 100,000-200,000 SNPs per window.
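
To judge whether a candidate window size lands in that SNP range for your own data, you can count SNPs per window before running Locator. The helper below is a rough sketch: snps_per_window is a hypothetical name, and the window boundaries it uses (window k covers positions k*size+1 through (k+1)*size) are an assumption that may not match Locator's own windowing exactly.

```shell
# Sketch: count SNPs per nonoverlapping window.
# Input (stdin): tab-separated CHROM<TAB>POS lines, one per SNP.
# Output: CHROM<TAB>WINDOW_START<TAB>SNP_COUNT, sorted by chromosome and start.
# Assumption: windows start at position 1 on each chromosome.
snps_per_window() {
    local size="$1"
    awk -v size="$size" 'BEGIN { FS = OFS = "\t" }
        {
            start = int(($2 - 1) / size) * size + 1
            key = $1 OFS start
            n[key]++
        }
        END { for (k in n) print k, n[k] }' |
        sort -t "$(printf '\t')" -k1,1 -k2,2n
}
```

Positions can be extracted from the VCF with bcftools, for example:

```shell
bcftools query -f '%CHROM\t%POS\n' data/test_genotypes.vcf.gz | snps_per_window 250000
```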

Bootstraps

You can also train replicate models on bootstrap samples of the full VCF (sampling SNPs with replacement) with the --bootstrap argument. To fit 5 bootstrap replicates, run:

mkdir -p out/bootstrap
locator --vcf data/test_genotypes.vcf.gz --sample_data data/test_sample_data.txt --out out/bootstrap/test --bootstrap --nboots 5

This is slow because a new model is fit for each replicate, but it should give a good idea of the uncertainty in predicted locations.
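
To make "sampling SNPs with replacement" concrete, the sketch below draws n SNP indices from 1..n with replacement. It is an illustration of the resampling scheme only, not Locator's implementation; bootstrap_indices and its seed argument are hypothetical.

```shell
# Illustration only (not Locator's code): draw n SNP indices from 1..n
# with replacement, so some SNPs appear several times and others not at all.
# Arguments: n (number of SNPs), seed (optional, default 1).
bootstrap_indices() {
    local n="$1" seed="${2:-1}"
    awk -v n="$n" -v seed="$seed" 'BEGIN {
        srand(seed)
        for (i = 0; i < n; i++) print int(rand() * n) + 1
    }'
}

# Example: bootstrap_indices 10 42
```

For large n, a sample drawn this way contains about 63% of the distinct SNPs on average, which is why each replicate sees a different subset of the data.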

Jackknife

Last, a quicker but probably less reliable estimate of uncertainty can be generated with the --jacknife option (note the spelling of the flag). This uses a single trained model and generates predictions while treating a random 5% of sites as missing data. We recommend running bootstraps for “final” predictions, but for a quick look at uncertainty you can run jackknife samples with:

mkdir -p out/jacknife
locator --vcf data/test_genotypes.vcf.gz --sample_data data/test_sample_data.txt --out out/jacknife/test --jacknife --nboots 20

See Also