CLI Guide
=========

This guide covers the command-line interface (CLI) for Locator.

Basic Usage
-----------

To fit a model to a dataset and predict locations for validation samples:

.. code-block:: bash

    locator --vcf data/test_genotypes.vcf.gz --sample_data data/test_sample_data.txt --out out/test/test

This produces four files in the output directory (``out/test/``), each named with the ``test`` prefix given to ``--out``:

* ``test_predlocs.txt`` -- predicted locations
* ``test_history.txt`` -- training history
* ``test_params.json`` -- run parameters
* ``test_fitplot.pdf`` -- plot of training history

See all parameters with ``locator --help``.
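
As a quick sanity check after a run, the sketch below confirms that the four files listed above exist for a given ``--out`` prefix. The helper name ``check_outputs`` is hypothetical and not part of Locator; it assumes only the ``<prefix>_<suffix>`` naming shown in the list.

.. code-block:: bash

    # Hypothetical helper (not part of Locator): report which expected
    # output files are missing for a given --out prefix.
    check_outputs() {
        local prefix="$1" missing=0 suffix
        for suffix in predlocs.txt history.txt params.json fitplot.pdf; do
            if [ ! -f "${prefix}_${suffix}" ]; then
                echo "missing: ${prefix}_${suffix}"
                missing=1
            fi
        done
        return "$missing"
    }

    # Same prefix as the --out argument in the command above
    if check_outputs out/test/test; then
        echo "all outputs present"
    fi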

Uncertainty and Windowed Analysis
---------------------------------

Fitting separate models to windows across the genome yields multiple predictions per individual, which allows estimates of uncertainty and of intragenomic variation in the predicted location. The ``--windows`` option generates separate predictions for nonoverlapping windows of size ``--window_size`` (default 500,000 bp).
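
To make the windowing concrete, the sketch below lists nonoverlapping windows along a single chromosome. It is an illustration only: the chromosome length is made up, and the 0-based, half-open coordinates and the shorter final window are assumptions rather than a description of how Locator indexes windows internally.

.. code-block:: bash

    # Illustration only: print start/end coordinates of nonoverlapping
    # windows of a given size along one chromosome.
    # Assumes 0-based, half-open intervals; the final window may be shorter.
    list_windows() {
        local chrom_length="$1" window_size="$2" start=0 end
        while [ "$start" -lt "$chrom_length" ]; do
            end=$((start + window_size))
            if [ "$end" -gt "$chrom_length" ]; then
                end="$chrom_length"
            fi
            echo "$start $end"
            start="$end"
        done
    }

    # A hypothetical 1.2 Mbp chromosome with the default 500,000 bp windows
    list_windows 1200000 500000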

This option requires zarr input for fast chunked array access, so convert your VCF to zarr format first. We recommend ``bio2zarr`` (installed via ``pip install locator[fast-vcf]``), particularly for large VCFs:

.. code-block:: bash

    # Recommended: bio2zarr (fast, multi-threaded, uses htslib)
    # VCFs must be indexed first
    bcftools index -t data/test_genotypes.vcf.gz
    vcf2zarr convert -p 8 data/test_genotypes.vcf.gz data/test_genotypes.zarr

    # Alternative: scikit-allel wrapper (slower, no additional dependencies)
    vcf_to_zarr --vcf data/test_genotypes.vcf.gz --zarr data/test_genotypes.zarr

Locator supports zarr files produced by either tool. Once converted, run a windowed analysis with:

.. code-block:: bash

    mkdir -p out/test_windows/
    locator --zarr data/test_genotypes.zarr --sample_data data/test_sample_data.txt --out out/test_windows/ --windows --window_size 250000

This should take around 5 minutes on a GPU. For the analyses of humans, mosquitoes, and malaria parasites described in our paper, we chose window sizes that yielded 100,000-200,000 SNPs per window.


Bootstraps
----------

You can also train replicate models on bootstrap samples of the full VCF (sampling SNPs with replacement) with the ``--bootstrap`` argument. To fit 5 bootstrap replicates, run:

.. code-block:: bash

    mkdir -p out/bootstrap
    locator --vcf data/test_genotypes.vcf.gz --sample_data data/test_sample_data.txt --out out/bootstrap/test --bootstrap --nboots 5

This is slow because a new model is fit for each replicate, but it should give a good idea of the uncertainty in predicted locations.
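
The resampling step itself is simple to picture. The sketch below draws SNP indices with replacement using ``awk``. It illustrates the idea only and is not Locator's implementation; the function name, the seed argument, and the toy count of 10 SNPs are hypothetical.

.. code-block:: bash

    # Illustration only (not Locator's code): draw n SNP indices with
    # replacement from 0..n-1, as in one bootstrap replicate.
    bootstrap_indices() {
        local n_snps="$1" seed="$2"
        awk -v n="$n_snps" -v seed="$seed" 'BEGIN {
            srand(seed)
            for (i = 0; i < n; i++) print int(rand() * n)
        }'
    }

    # Tally how often each index was drawn; some indices typically
    # repeat while others are absent from the replicate.
    bootstrap_indices 10 1 | sort -n | uniq -c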

Jackknife
---------

Finally, the ``--jacknife`` option (note the single "k" in the flag name) gives a quicker but likely less reliable estimate of uncertainty. It uses a single trained model and generates predictions while treating a random 5% of sites as missing data. We recommend running bootstraps for "final" predictions, but for a quick look at uncertainty you can run jackknife samples with:

.. code-block:: bash

    mkdir -p out/jacknife
    locator --vcf data/test_genotypes.vcf.gz --sample_data data/test_sample_data.txt --out out/jacknife/test --jacknife --nboots 20
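
The masking step can be pictured with the sketch below, which picks a random 5% of site indices without replacement. It is an illustration only and not Locator's implementation; the function name, the seed argument, rounding the count down, and the toy total of 200 sites are all assumptions.

.. code-block:: bash

    # Illustration only (not Locator's code): choose a random 5% of site
    # indices, without replacement, to treat as missing in one replicate.
    jackknife_mask() {
        local n_sites="$1" seed="$2"
        awk -v n="$n_sites" -v seed="$seed" 'BEGIN {
            srand(seed)
            k = int(n / 20)                    # 5% of sites, rounded down
            for (i = 0; i < n; i++) idx[i] = i
            for (i = 0; i < k; i++) {          # partial Fisher-Yates shuffle
                j = i + int(rand() * (n - i))
                tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp
                print idx[i]
            }
        }'
    }

    # 5% of a hypothetical 200 sites gives 10 masked indices
    jackknife_mask 200 1 | sort -n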

See Also
--------

* :doc:`usage` — Python API guide
* :doc:`parallel_analysis_guide` — Multi-GPU parallel analysis