Usage Guide
===========

This guide covers how to use Locator for predicting geographic coordinates
from genotype matrices.

Basic Usage
-----------

Loading Data
~~~~~~~~~~~~

Locator loads genotype data from VCF files, Zarr stores, or pandas DataFrames
(see :doc:`cli` for VCF-to-Zarr conversion):

.. code-block:: python

   from locator import Locator

   # Create a Locator instance with configuration
   config = {
       "out": "my_analysis",
       "batch_size": 32,
       "width": 256,
       "nlayers": 8,
       "dropout_prop": 0.25,
   }

   locator = Locator(config)

   # Load data from various formats:
   #
   # 1. From VCF
   genotypes, samples = locator.load_genotypes(vcf="path/to/genotypes.vcf")
   #
   # 2. From zarr (recommended for large datasets)
   #    See the CLI documentation for VCF-to-Zarr conversion instructions.
   genotypes, samples = locator.load_genotypes(zarr="path/to/genotypes.zarr")
   #
   # 3. From pandas DataFrame
   locator = Locator({
       "out": "my_analysis",
       "genotype_data": genotype_df,  # DataFrame: samples as index, SNPs as columns
       "sample_data": coords_df,      # DataFrame with sampleID, x, y columns
   })
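
The DataFrame route expects two tables. The following is a minimal sketch of
how they might be laid out. The sample IDs, SNP names, and coordinate values
are made up, and the 0/1/2 allele-count coding is an assumption rather than
something this guide specifies:

.. code-block:: python

   import pandas as pd

   # Hypothetical sample IDs, for illustration only.
   sample_ids = ["ind1", "ind2", "ind3"]

   # Genotypes: one row per sample (index), one column per SNP.
   # Assumption: values are alternate-allele counts (0, 1, or 2).
   genotype_df = pd.DataFrame(
       [[0, 1, 2], [1, 1, 0], [2, 0, 1]],
       index=sample_ids,
       columns=["snp1", "snp2", "snp3"],
   )

   # Coordinates: one row per sample with sampleID, x, and y columns.
   coords_df = pd.DataFrame(
       {
           "sampleID": sample_ids,
           "x": [-122.3, -121.9, -120.5],
           "y": [47.6, 46.8, 45.2],
       }
   )

   # Sanity check: every genotyped sample has a coordinate row.
   missing = set(genotype_df.index) - set(coords_df["sampleID"])
   assert not missing, f"samples without coordinates: {missing}"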

Training and Prediction
-----------------------

Train the model and make predictions:

.. code-block:: python

   # Train the model
   history = locator.train(genotypes=genotypes, samples=samples)

   # Make predictions
   predictions = locator.predict(return_df=True)  # Returns DataFrame with sampleID, x, y

Holdout Analysis
----------------

Evaluate model performance by withholding ``k`` samples from training and
predicting their locations:

.. code-block:: python

   # Hold out k samples during training
   locator.train_holdout(
       genotypes=genotypes,
       samples=samples,
       k=10,
   )

   # Get predictions for held-out samples
   holdout_preds = locator.predict_holdout(
       return_df=True,
       plot_summary=True,
   )
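
Held-out predictions are usually summarized as a distance between predicted
and true locations. The sketch below is not part of Locator. It assumes that
``x`` and ``y`` are longitude and latitude in decimal degrees and uses made-up
coordinate pairs to compute great-circle errors with the haversine formula:

.. code-block:: python

   import math
   from statistics import mean, median

   def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
       """Great-circle distance in km between two (lon, lat) points in degrees."""
       phi1, phi2 = math.radians(lat1), math.radians(lat2)
       dphi = phi2 - phi1
       dlmb = math.radians(lon2 - lon1)
       a = (
           math.sin(dphi / 2) ** 2
           + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
       )
       return 2 * radius_km * math.asin(math.sqrt(a))

   # Hypothetical predicted and true coordinates as (x, y) = (lon, lat).
   predicted = [(-122.0, 47.0), (-121.5, 46.5)]
   true = [(-122.0, 47.0), (-121.5, 45.5)]

   errors_km = [
       haversine_km(px, py, tx, ty)
       for (px, py), (tx, ty) in zip(predicted, true)
   ]
   print(f"mean error: {mean(errors_km):.1f} km")
   print(f"median error: {median(errors_km):.1f} km")

If the coordinates are in a projected system instead, a plain Euclidean
distance would be the appropriate replacement.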

Ensemble Models
---------------

Train ensemble models using k-fold cross-validation for improved predictions
with uncertainty estimates:

.. code-block:: python

   # Train 5-fold ensemble and predict with uncertainty
   locator.train_ensemble(genotypes=genotypes, samples=samples, k=5)
   predictions = locator.predict_ensemble(
       genotypes=genotypes, samples=samples, return_std=True,
   )

See :doc:`ensemble_guide` for comprehensive ensemble documentation including
parallel multi-GPU training.
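
To illustrate what an ensemble uncertainty estimate represents, the sketch
below averages per-fold predictions and reports their spread. The data
structure and the ``summarize_ensemble`` helper are hypothetical, and the use
of the population standard deviation is an assumption; Locator's own
aggregation may differ:

.. code-block:: python

   from statistics import fmean, pstdev

   # Hypothetical per-fold predictions: one {sampleID: (x, y)} dict per fold.
   fold_predictions = [
       {"ind1": (10.0, 50.0), "ind2": (12.0, 48.0)},
       {"ind1": (10.4, 50.2), "ind2": (11.6, 48.4)},
       {"ind1": (9.6, 49.8), "ind2": (12.4, 47.6)},
   ]

   def summarize_ensemble(folds):
       """Return {sampleID: (x_mean, y_mean, x_std, y_std)} across folds."""
       summary = {}
       for sample_id in folds[0]:
           xs = [fold[sample_id][0] for fold in folds]
           ys = [fold[sample_id][1] for fold in folds]
           summary[sample_id] = (fmean(xs), fmean(ys), pstdev(xs), pstdev(ys))
       return summary

   ensemble_summary = summarize_ensemble(fold_predictions)
   for sample_id, (x, y, x_std, y_std) in ensemble_summary.items():
       print(f"{sample_id}: x={x:.2f}±{x_std:.2f}, y={y:.2f}±{y_std:.2f}")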

Windowed Analysis
-----------------

Analyze predictions across genomic windows:

.. code-block:: python

   # Run windowed analysis
   window_predictions = locator.run_windows(
       genotypes=genotypes,
       samples=samples,
       window_size=5e5,  # 500 kb windows
       return_df=True,
   )
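
Conceptually, a windowed analysis partitions SNPs by genomic position before
fitting. The sketch below shows one way to do that grouping, assuming
non-overlapping windows that start at position 0 on a single chromosome. The
``assign_windows`` helper and the positions are illustrative and are not
Locator's implementation:

.. code-block:: python

   from collections import defaultdict

   def assign_windows(positions, window_size=5e5):
       """Group SNP indices by the start coordinate of their genomic window."""
       windows = defaultdict(list)
       for idx, pos in enumerate(positions):
           start = int(pos // window_size) * int(window_size)
           windows[start].append(idx)
       return dict(windows)

   # Hypothetical base-pair positions on a single chromosome.
   snp_positions = [1_200, 480_000, 500_000, 750_000, 1_100_000]
   windows = assign_windows(snp_positions)

   for start, snp_indices in sorted(windows.items()):
       print(f"window starting at {start}: SNP indices {snp_indices}")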

Jackknife Analysis
------------------

Assess prediction uncertainty by repeatedly masking a random proportion of
SNPs and re-predicting:

.. code-block:: python

   # Run jackknife analysis
   jacknife_predictions = locator.run_jacknife(
       genotypes=genotypes,
       samples=samples,
       prop=0.05,  # Proportion of SNPs to mask
       n_replicates=100,
       return_df=True,
   )
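
The resampling step can be pictured as drawing a fresh set of SNPs to mask in
each replicate. The sketch below assumes uniform sampling without replacement.
The ``jackknife_masks`` helper is hypothetical and only illustrates how
``prop`` and ``n_replicates`` interact:

.. code-block:: python

   import random

   def jackknife_masks(n_snps, prop=0.05, n_replicates=100, seed=0):
       """Yield, for each replicate, the sorted SNP indices to mask."""
       rng = random.Random(seed)
       n_masked = max(1, round(prop * n_snps))
       for _ in range(n_replicates):
           yield sorted(rng.sample(range(n_snps), n_masked))

   masks = list(jackknife_masks(n_snps=200, prop=0.05, n_replicates=100))
   print(f"{len(masks)} replicates, {len(masks[0])} SNPs masked in each")

The spread of predictions across replicates then serves as the uncertainty
estimate for each sample.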

Using Range Masks
-----------------

Incorporate species range constraints:

.. code-block:: python

   # Configure model with range penalty
   config = {
       "out": "range_constrained",
       "use_range_penalty": True,
       "species_range_shapefile": "path/to/range.shp",
       "resolution": 0.05,
       "penalty_weight": 1.0,
   }

   locator = Locator(config)

Memory-Efficient Data Pipeline
------------------------------

Locator uses an efficient ``tf.data`` pipeline by default. ``IndexSet`` handles
train/test/validation splits using index arrays rather than copying genotype
matrices, providing up to 50% memory savings for large datasets.
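
The idea behind index-based splitting can be shown without any Locator code.
In the sketch below, ``split_indices`` and ``rows`` are hypothetical helpers
and are not the ``IndexSet`` API. Each split stores only row indices, so the
genotype matrix exists once in memory:

.. code-block:: python

   import random

   def split_indices(n_samples, val_frac=0.1, test_frac=0.1, seed=0):
       """Return disjoint train/val/test index lists over range(n_samples)."""
       order = list(range(n_samples))
       random.Random(seed).shuffle(order)
       n_val = int(n_samples * val_frac)
       n_test = int(n_samples * test_frac)
       return {
           "val": order[:n_val],
           "test": order[n_val:n_val + n_test],
           "train": order[n_val + n_test:],
       }

   # One shared genotype matrix (rows = samples); never copied per split.
   genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 0, 2], [1, 2, 1],
                [2, 2, 0], [0, 1, 1], [1, 0, 0], [2, 1, 2], [0, 2, 1]]
   splits = split_indices(len(genotypes))

   def rows(split_name):
       """Lazily look up rows for a split by index instead of storing a copy."""
       return (genotypes[i] for i in splits[split_name])

   print({name: len(idx) for name, idx in splits.items()})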

GPU Configuration
-----------------

Locator includes automatic GPU optimizations that are **enabled by default**.
These provide a 3-5x speedup on large datasets.

Basic GPU configuration:

.. code-block:: python

   # GPU optimizations are enabled by default
   config = {
       "out": "gpu_analysis",
       "gpu_number": 0,  # Use first GPU (optional)
   }

   # To disable GPU entirely
   config = {
       "out": "cpu_analysis",
       "disable_gpu": True,
   }

   # To disable or override specific optimizations
   config = {
       "out": "custom_gpu",
       "use_mixed_precision": False,  # Disable mixed precision
       "gpu_batch_size": 128,         # Use fixed batch size instead of auto
   }

GPU Configuration Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``use_mixed_precision`` (bool, default ``True``)
   Enables FP16 mixed-precision training for approximately 2x speedup on GPUs
   with Tensor Core support (NVIDIA Volta and newer).

``gpu_batch_size`` (``"auto"`` or int, default ``"auto"``)
   Controls training batch size. When set to ``"auto"``, Locator tunes the
   batch size based on available GPU memory. Set to a fixed integer to
   override automatic tuning.

``gpu_memory_mode`` (``"growth"`` or ``"full"``, default ``"growth"``)
   GPU memory allocation strategy. ``"growth"`` allocates memory
   incrementally as needed, which is friendlier to multi-process
   workflows. ``"full"`` pre-allocates all GPU memory for maximum
   throughput.

``enable_xla`` (bool, default ``False``)
   Enables XLA (Accelerated Linear Algebra) JIT compilation. Can improve
   performance for some model architectures, but increases initial
   compilation time.

``gradient_accumulation_steps`` (int, default ``1``)
   Number of batches whose gradients are accumulated before each weight
   update. The effective batch size is the batch size multiplied by this
   value, without the extra GPU memory a larger batch would need. Useful
   when GPU memory is limited but a larger effective batch size is desired.
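
To see why gradient accumulation stands in for a larger batch, consider a toy
one-parameter model. The sketch below is plain Python and unrelated to
Locator's internals. It shows that averaging gradients over equally sized
micro-batches reproduces the full-batch gradient when the loss is a mean over
samples:

.. code-block:: python

   # Toy model: y_hat = w * x with mean squared error loss.
   def mse_gradient(w, batch):
       """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
       return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

   data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
   w = 0.5
   accumulation_steps = 2
   micro_batch_size = len(data) // accumulation_steps

   # Accumulate the gradient of each micro-batch, then average before updating.
   accumulated = 0.0
   for step in range(accumulation_steps):
       start = step * micro_batch_size
       micro_batch = data[start:start + micro_batch_size]
       accumulated += mse_gradient(w, micro_batch)
   accumulated /= accumulation_steps

   full_batch = mse_gradient(w, data)
   print(f"accumulated={accumulated:.6f}, full batch={full_batch:.6f}")

Only one micro-batch needs to be resident at a time, which is where the memory
saving comes from.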

Data Augmentation
-----------------

Enable data augmentation during training:

.. code-block:: python

   config = {
       "out": "augmented",
       "augmentation": {
           "enabled": True,
           "flip_rate": 0.05,  # Rate at which to flip genotypes
       },
   }
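
As a rough picture of what a flip rate does, the sketch below flips each
genotype call independently with probability ``flip_rate``. It assumes 0/1/2
allele-count coding and that a "flip" maps ``g`` to ``2 - g``. Both are
assumptions, and ``flip_genotypes`` is a hypothetical helper rather than
Locator's augmentation code:

.. code-block:: python

   import random

   def flip_genotypes(genotypes, flip_rate=0.05, seed=None):
       """Return a copy with each call flipped (g -> 2 - g) at rate flip_rate."""
       rng = random.Random(seed)
       return [
           [2 - g if rng.random() < flip_rate else g for g in row]
           for row in genotypes
       ]

   original = [[0, 1, 2], [2, 2, 0]]
   augmented = flip_genotypes(original, flip_rate=0.05, seed=1)
   print(augmented)

Note that heterozygous calls (``1``) are unchanged under this definition of a
flip.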

Handling Missing Coordinates
----------------------------

Locator provides three modes for handling samples with missing coordinates:
``separate`` (default), ``exclude``, and ``fail``. See :doc:`na_handling_guide`
for full details and per-method behavior.
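
The mode names suggest the behavior sketched below. This is only one plausible
reading, written as a hypothetical ``split_by_coordinates`` helper with missing
values represented as NaN; :doc:`na_handling_guide` remains the authority on
what each mode actually does:

.. code-block:: python

   import math

   def split_by_coordinates(samples, mode="separate"):
       """Partition samples into (with_coords, without_coords) per mode."""
       def has_coords(sample):
           return not (math.isnan(sample["x"]) or math.isnan(sample["y"]))

       known = [s for s in samples if has_coords(s)]
       unknown = [s for s in samples if not has_coords(s)]

       if mode == "separate":
           # Keep both groups, e.g. train on known and predict unknown.
           return known, unknown
       if mode == "exclude":
           # Drop samples without coordinates entirely.
           return known, []
       if mode == "fail":
           # Refuse to continue if any coordinates are missing.
           if unknown:
               ids = [s["sampleID"] for s in unknown]
               raise ValueError(f"samples with missing coordinates: {ids}")
           return known, []
       raise ValueError(f"unknown mode: {mode!r}")

   samples = [
       {"sampleID": "ind1", "x": 10.0, "y": 50.0},
       {"sampleID": "ind2", "x": float("nan"), "y": float("nan")},
   ]
   known, unknown = split_by_coordinates(samples, mode="separate")
   print(len(known), "with coordinates;", len(unknown), "without")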

Multi-GPU Parallel Analysis
---------------------------

For large-scale analyses with multiple GPUs, Locator provides Ray-based
parallel implementations of its analysis methods. See
:doc:`parallel_analysis_guide` for comprehensive documentation on multi-GPU
analysis.

Next Steps
----------

* See the :doc:`api` reference for detailed information about all available
  functions and classes.
* Explore :doc:`parallel_analysis_guide` for multi-GPU workflows.
* Learn about visualization in :doc:`plotting_guide`.
* Learn how to contribute in :doc:`contributing`.