CLI Guide
=========

This guide covers the command-line interface (CLI) for Locator.

Basic Usage
-----------

To fit a model to a dataset and predict locations for validation samples:

.. code-block:: bash

   locator --vcf data/test_genotypes.vcf.gz \
       --sample_data data/test_sample_data.txt \
       --out out/test/test

This will produce 4 files in the output directory:

* ``test_predlocs.txt`` -- predicted locations
* ``test_history.txt`` -- training history
* ``test_params.json`` -- run parameters
* ``test_fitplot.pdf`` -- plot of training history

See all parameters with ``locator --help``.

Uncertainty and Windowed Analysis
---------------------------------

Fitting separate models to windows across the genome yields multiple predictions per individual, which allows estimates of uncertainty and of intragenomic variation in the predicted location.

The ``--windows`` option generates separate predictions for nonoverlapping windows of size ``--window_size`` (default 500,000 bp). This option requires zarr input for fast chunked array access. To convert a VCF to zarr, we recommend ``bio2zarr`` (installed via ``pip install locator[fast-vcf]``), particularly for large VCFs:

.. code-block:: bash

   # Recommended: bio2zarr (fast, multi-threaded, uses htslib)
   # VCFs must be indexed first
   bcftools index -t data/test_genotypes.vcf.gz
   vcf2zarr convert -p 8 data/test_genotypes.vcf.gz data/test_genotypes.zarr

   # Alternative: scikit-allel wrapper (slower, no additional dependencies)
   vcf_to_zarr --vcf data/test_genotypes.vcf.gz --zarr data/test_genotypes.zarr

Locator supports zarr files produced by either tool. Once converted, run a windowed analysis with:

.. code-block:: bash

   mkdir -p out/test_windows/
   locator --zarr data/test_genotypes.zarr \
       --sample_data data/test_sample_data.txt \
       --out out/test_windows/ \
       --windows --window_size 250000

This should take around 5 minutes on a GPU.

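
To make the idea of nonoverlapping windows concrete, here is a minimal sketch of how a single chromosome could be split into windows of a fixed size. It is illustrative only: the helper name ``list_windows`` and the 1.2 Mb chromosome length are hypothetical, and Locator's own handling of window boundaries (for example, a final partial window) may differ.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helper, not part of Locator: print tab-separated
   # start/end coordinates of nonoverlapping windows.
   # Usage: list_windows CHROM_LENGTH WINDOW_SIZE
   list_windows() {
       local chrom_len=$1 window_size=$2
       local start=0 end
       while [ "$start" -lt "$chrom_len" ]; do
           end=$(( start + window_size ))
           # Clip the last window at the end of the chromosome.
           if [ "$end" -gt "$chrom_len" ]; then
               end=$chrom_len
           fi
           printf '%d\t%d\n' "$start" "$end"
           start=$end
       done
   }

   # Assumed example: a 1.2 Mb chromosome with --window_size 250000
   list_windows 1200000 250000

With these assumed values the sketch prints five windows, the last of which is shorter than the others. Under this scheme each window would correspond to one separately fitted model and one set of predictions.
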
For analyses in humans, mosquitoes, and malaria parasites described in our paper, we used window sizes yielding 100,000-200,000 SNPs.

Bootstraps
----------

You can also train replicate models on bootstrap samples of the full VCF (sampling SNPs with replacement) with the ``--bootstrap`` argument. To fit 5 bootstrap replicates, run:

.. code-block:: bash

   mkdir -p out/bootstrap
   locator --vcf data/test_genotypes.vcf.gz \
       --sample_data data/test_sample_data.txt \
       --out out/bootstrap/test \
       --bootstrap --nboots 5

This is slow, because a new model is fit to each replicate, but it should give a good idea of the uncertainty in predicted locations.

Jackknife
---------

Finally, a quicker but likely less reliable estimate of uncertainty can be generated with the ``--jacknife`` option (note the spelling of the flag). This uses a single trained model and generates predictions while treating a random 5% of sites as missing data. We recommend running bootstraps for "final" predictions, but for a quick look at uncertainty you can run jackknife samples with:

.. code-block:: bash

   mkdir -p out/jacknife
   locator --vcf data/test_genotypes.vcf.gz \
       --sample_data data/test_sample_data.txt \
       --out out/jacknife/test \
       --jacknife --nboots 20

See Also
--------

* :doc:`usage` -- Python API guide
* :doc:`parallel_analysis_guide` -- Multi-GPU parallel analysis