Ensemble Models Guide
=====================

Ensemble models combine predictions from multiple neural networks, each
trained on a different k-fold split of the data. Averaging across folds
typically improves accuracy, dampens the overfitting of any single model, and
yields per-sample uncertainty estimates from the spread of the fold
predictions. The functionality is added to the ``Locator`` class by
``EnsembleMixin``.

.. contents:: Table of Contents
   :local:
   :depth: 2

Basic Ensemble Training
-----------------------

Sequential Training
~~~~~~~~~~~~~~~~~~~

Train an ensemble using k-fold cross-validation:

.. code-block:: python

   from locator import Locator

   config = {
       "out": "ensemble_analysis",
       "batch_size": 32,
       "width": 256,
       "nlayers": 8,
       "dropout_prop": 0.25,
       "max_epochs": 1000,
       "patience": 100,
   }

   locator = Locator(config)
   genotypes, samples = locator.load_genotypes(vcf="genotypes.vcf.gz")

   ensemble_result = locator.train_ensemble(
       genotypes=genotypes,
       samples=samples,
       k=5,
       save_fold_models=True,
       use_model_manager=True,
       verbose=True,
   )

The returned ``ensemble_result`` contains training histories, model
information, averaged normalization parameters, and fold split details.
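
To make the role of ``k`` concrete, the following standalone sketch partitions
sample indices into ``k`` disjoint validation folds, with each fold's model
training on the remaining samples. It illustrates k-fold splitting in general;
``kfold_indices`` is a hypothetical helper, not part of Locator, and the way
Locator shuffles or assigns samples internally is not assumed here.

.. code-block:: python

   import random

   def kfold_indices(n_samples, k, seed=0):
       """Split range(n_samples) into k (train, validation) index pairs."""
       indices = list(range(n_samples))
       random.Random(seed).shuffle(indices)
       # Taking every k-th index from a different offset gives k disjoint folds.
       folds = [indices[i::k] for i in range(k)]
       splits = []
       for i, val in enumerate(folds):
           # The training set is everything outside the validation fold.
           train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
           splits.append((sorted(train), sorted(val)))
       return splits

   splits = kfold_indices(n_samples=10, k=5)
   for fold, (train, val) in enumerate(splits):
       print(f"fold {fold}: train on {len(train)} samples, validate on {val}")

Every sample lands in exactly one validation fold, so each sample is held out
by exactly one of the ``k`` models.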

Parallel Training (Multi-GPU)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To speed up training by spreading the folds across multiple GPUs, pass the
device IDs as ``gpu_ids``:

.. code-block:: python

   from locator.parallel import parallel_train_ensemble

   ensemble_result = parallel_train_ensemble(
       locator=locator,
       genotypes=genotypes,
       samples=samples,
       k=5,
       gpu_ids=[0, 1, 2, 3],
       save_fold_models=True,
       use_model_manager=True,
       verbose=True,
   )

Pass an empty ``gpu_ids`` list to run in CPU-only mode:

.. code-block:: python

   ensemble_result = parallel_train_ensemble(
       locator=locator,
       genotypes=genotypes,
       samples=samples,
       k=5,
       gpu_ids=[],
       verbose=True,
   )
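
When ``k`` exceeds the number of GPUs, some devices must handle more than one
fold. The sketch below shows one plausible scheduling rule, round-robin
assignment. This is an assumption for illustration only: the scheduling that
``parallel_train_ensemble`` actually uses is not documented here, and
``assign_folds_to_devices`` is a hypothetical helper.

.. code-block:: python

   def assign_folds_to_devices(k, gpu_ids):
       """Map each fold index to a device label, cycling through the GPUs."""
       if not gpu_ids:
           # An empty list means every fold runs on the CPU.
           return {fold: "CPU" for fold in range(k)}
       return {fold: f"GPU:{gpu_ids[fold % len(gpu_ids)]}" for fold in range(k)}

   print(assign_folds_to_devices(k=5, gpu_ids=[0, 1, 2, 3]))
   print(assign_folds_to_devices(k=5, gpu_ids=[]))

Under this rule, five folds on four GPUs leave one GPU with two folds, so the
total wall time would be roughly that of two sequential fold runs.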

Making Predictions
------------------

After training, call ``predict_ensemble`` to get averaged predictions across
folds. Set ``return_std=True`` for per-sample uncertainty and
``include_fold_predictions=True`` for individual fold outputs:

.. code-block:: python

   predictions = locator.predict_ensemble(
       genotypes=genotypes,
       samples=samples,
       return_std=True,
       include_fold_predictions=True,
       save_predictions=True,
   )

The result is a DataFrame with columns:

- ``sampleID``: sample identifier
- ``x``, ``y``: ensemble-mean longitude and latitude
- ``x_std``, ``y_std``: standard deviation across folds
- ``x_fold0``, ``y_fold0``, ...: per-fold predictions (when
  ``include_fold_predictions=True``)
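
The relationship between these columns can be sketched for a single sample
using only the standard library. The coordinates are made-up values, and
``summarize_folds`` is a hypothetical helper. The sketch uses the population
standard deviation; whether Locator uses the population or the sample formula
is an assumption to check against the API reference.

.. code-block:: python

   from statistics import mean, pstdev

   def summarize_folds(sample_id, fold_predictions):
       """Collapse per-fold (x, y) predictions into one output row."""
       xs = [x for x, _ in fold_predictions]
       ys = [y for _, y in fold_predictions]
       row = {
           "sampleID": sample_id,
           "x": mean(xs),          # ensemble-mean longitude
           "y": mean(ys),          # ensemble-mean latitude
           "x_std": pstdev(xs),    # spread across folds (population formula)
           "y_std": pstdev(ys),
       }
       # Per-fold columns follow the x_fold0, y_fold0, ... naming scheme.
       for i, (x, y) in enumerate(fold_predictions):
           row[f"x_fold{i}"] = x
           row[f"y_fold{i}"] = y
       return row

   # Illustrative predictions from a three-fold ensemble (made-up values).
   fold_predictions = [(-122.4, 37.8), (-122.0, 37.6), (-122.2, 37.7)]
   row = summarize_folds("sample_1", fold_predictions)
   print(row)

A large ``x_std`` or ``y_std`` means the fold models disagree about that
sample, which is why these columns serve as an uncertainty estimate.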

Advanced Features
-----------------

Training Optimizations
~~~~~~~~~~~~~~~~~~~~~~

With ``use_mixed_precision=None``, mixed precision is auto-detected and
enabled when a compatible GPU is present:

.. code-block:: python

   ensemble_result = locator.train_ensemble(
       genotypes=genotypes,
       samples=samples,
       k=5,
       use_mixed_precision=None,
       patience_multiplier=1.5,
       verbose=True,
   )

Data Augmentation
~~~~~~~~~~~~~~~~~

Enable data augmentation with ``augment_data=True`` and tune it with
``flip_rate``:

.. code-block:: python

   ensemble_result = locator.train_ensemble(
       genotypes=genotypes,
       samples=samples,
       k=5,
       augment_data=True,
       flip_rate=0.05,
       verbose=True,
   )
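
The exact semantics of ``flip_rate`` are not spelled out in this guide. One
common form of genotype augmentation randomly flips a small fraction of allele
calls, and the sketch below shows that idea on a toy 0/1 matrix. Treat it as
an assumption about what such a parameter could mean, not as Locator's
implementation; ``flip_genotypes`` is a hypothetical helper.

.. code-block:: python

   import random

   def flip_genotypes(genotypes, flip_rate, seed=None):
       """Return a copy of a 0/1 genotype matrix with entries flipped at flip_rate."""
       rng = random.Random(seed)
       return [
           # Each call is flipped independently with probability flip_rate.
           [1 - call if rng.random() < flip_rate else call for call in row]
           for row in genotypes
       ]

   toy_genotypes = [[0, 1, 1, 0], [1, 1, 0, 0]]
   augmented = flip_genotypes(toy_genotypes, flip_rate=0.05, seed=42)
   print(augmented)

At a rate of 0.05, about one call in twenty would change on average, so the
augmented matrix stays close to the original.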

Partial Training Sets
~~~~~~~~~~~~~~~~~~~~~

To train on a subset of the samples, pass their indices as
``training_set_indices``:

.. code-block:: python

   training_indices = [0, 1, 2, 5, 10, 15, 20]

   ensemble_result = locator.train_ensemble(
       genotypes=genotypes,
       samples=samples,
       k=5,
       training_set_indices=training_indices,
       verbose=True,
   )
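
With a small training subset, each fold holds out very few samples. The sketch
below computes fold sizes under the usual even-split rule, in which the first
``n % k`` folds receive one extra sample (the rule used by scikit-learn's
``KFold``, for example). That Locator splits the subset this way is an
assumption, and ``fold_sizes`` is a hypothetical helper.

.. code-block:: python

   def fold_sizes(n_samples, k):
       """Validation-fold sizes when n_samples are split as evenly as possible."""
       base, extra = divmod(n_samples, k)
       # The first `extra` folds absorb the remainder, one sample each.
       return [base + 1 if i < extra else base for i in range(k)]

   training_indices = [0, 1, 2, 5, 10, 15, 20]
   sizes = fold_sizes(len(training_indices), k=5)
   print(sizes)  # [2, 2, 1, 1, 1]

Seven indices with ``k=5`` leave only one or two validation samples per fold,
so a larger subset or a smaller ``k`` may give more stable validation metrics.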

Model Persistence
-----------------

Saving
~~~~~~

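When ``save_fold_models=True`` and ``use_model_manager=True``, the fold models
are written automatically to ``{out}_ensemble/``, where ``{out}`` is the
``out`` value from the config (``ensemble_analysis`` in the examples above):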

::

   ensemble_analysis_ensemble/
       metadata.json
       fold_0_model.json
       fold_0_weights.h5
       fold_0_norm_params.json
       ...

Loading
~~~~~~~

Reload a saved ensemble and predict:

.. code-block:: python

   locator = Locator(config)
   locator.load_ensemble("ensemble_analysis_ensemble")

   predictions = locator.predict_ensemble_from_manager(
       genotypes=genotypes,
       samples=samples,
       save_predictions=True,
   )

Models are loaded one at a time during prediction, so memory usage stays low
even for large ensembles.
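
The memory benefit of processing one model at a time can be sketched with a
running sum: only the accumulated total and the current fold's output are held
at once. The generator stands in for "load a fold model, predict, release it".
Both names are hypothetical, and Locator's internal accumulation may differ.

.. code-block:: python

   def streaming_mean(prediction_batches):
       """Average per-fold predictions while holding one fold's output at a time."""
       total, count = None, 0
       for preds in prediction_batches:
           if total is None:
               total = list(preds)
           else:
               total = [t + p for t, p in zip(total, preds)]
           count += 1
       if count == 0:
           raise ValueError("no fold predictions were provided")
       return [t / count for t in total]

   def fold_outputs():
       # Each yield represents one fold model's predictions (made-up values).
       yield [1.0, 2.0]
       yield [3.0, 4.0]
       yield [5.0, 6.0]

   print(streaming_mean(fold_outputs()))  # [3.0, 4.0]

Because the generator yields lazily, peak memory in this sketch depends on the
size of one fold's output rather than on the number of folds.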

See Also
--------

- :doc:`parallel_analysis_guide` -- Multi-GPU parallel analysis
- :doc:`api` -- Complete API reference
- :doc:`na_handling_guide` -- Handling missing coordinates