Ensemble Models Guide

Ensemble models combine predictions from multiple neural networks, each trained on a different k-fold subset of the data. Averaging across folds tends to improve accuracy and reduce overfitting, and the spread of the fold predictions gives a per-sample uncertainty estimate. The functionality is integrated into the Locator class through the EnsembleMixin.

Basic Ensemble Training 

Sequential Training 

Train an ensemble using k-fold cross-validation:

from locator import Locator

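config = {
    "out": "ensemble_analysis",  # output file name stem
    "batch_size": 32,
    "width": 256,                # units per hidden layer
    "nlayers": 8,                # number of hidden layers
    "dropout_prop": 0.25,        # dropout proportion
    "max_epochs": 1000,
    "patience": 100,             # epochs without improvement before stopping
}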

locator = Locator(config)
genotypes, samples = locator.load_genotypes(vcf="genotypes.vcf.gz")

ensemble_result = locator.train_ensemble(
    genotypes=genotypes,
    samples=samples,
    k=5,
    save_fold_models=True,
    use_model_manager=True,
    verbose=True,
)

The returned ensemble_result contains training histories, model information, averaged normalization parameters, and fold split details.
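
The k-fold split behind an ensemble can be pictured with a minimal sketch: sample indices are partitioned into k parts, and each fold's network trains on k-1 parts while the remaining part is held out. The helper kfold_indices is hypothetical and not part of Locator; how train_ensemble shuffles and assigns samples internally is an assumption here.

```python
import random


def kfold_indices(n_samples, k, seed=0):
    """Return a list of (train_idx, heldout_idx) pairs, one per fold.

    Hypothetical helper for illustration; not part of Locator.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed gives reproducible folds
    # Deal shuffled indices round-robin so fold sizes differ by at most one.
    parts = [indices[i::k] for i in range(k)]
    folds = []
    for i in range(k):
        heldout_idx = sorted(parts[i])
        train_idx = sorted(j for p, part in enumerate(parts) if p != i for j in part)
        folds.append((train_idx, heldout_idx))
    return folds


folds = kfold_indices(n_samples=10, k=5)
for fold, (train_idx, heldout_idx) in enumerate(folds):
    print(f"fold {fold}: {len(train_idx)} train, {len(heldout_idx)} held out")
```

Every sample is held out in exactly one fold, so each network in the ensemble sees a slightly different training set.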

Parallel Training (Multi-GPU)

For faster training across multiple GPUs:

from locator.parallel import parallel_train_ensemble

ensemble_result = parallel_train_ensemble(
    locator=locator,
    genotypes=genotypes,
    samples=samples,
    k=5,
    gpu_ids=[0, 1, 2, 3],
    save_fold_models=True,
    use_model_manager=True,
    verbose=True,
)

Pass an empty gpu_ids list to run in CPU-only mode:

ensemble_result = parallel_train_ensemble(
    locator=locator,
    genotypes=genotypes,
    samples=samples,
    k=5,
    gpu_ids=[],
    verbose=True,
)

Making Predictions 

After training, call predict_ensemble to get averaged predictions across folds. Set return_std=True for per-sample uncertainty and include_fold_predictions=True for individual fold outputs:

predictions = locator.predict_ensemble(
    genotypes=genotypes,
    samples=samples,
    return_std=True,
    include_fold_predictions=True,
    save_predictions=True,
)

The result is a DataFrame with columns:

sampleID: sample identifier
x, y: ensemble-mean longitude and latitude
x_std, y_std: standard deviation across folds (when return_std=True)
x_fold0, y_fold0, …: per-fold predictions (when include_fold_predictions=True)
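
The relationship between the per-fold columns and the summary columns can be sketched with the standard library. The sample IDs and values below are made up, and the use of the population standard deviation is an assumption; the guide does not say which estimator predict_ensemble uses.

```python
from statistics import mean, pstdev

# Hypothetical per-fold longitude predictions for two samples (made-up values).
fold_x = {
    "sampleA": [10.0, 12.0, 11.0, 13.0, 9.0],
    "sampleB": [-3.0, -3.0, -3.0, -3.0, -3.0],
}


def summarize(values):
    """Return (mean, population standard deviation) of one sample's fold predictions."""
    return mean(values), pstdev(values)


summary = {sample: summarize(values) for sample, values in fold_x.items()}
for sample, (x, x_std) in summary.items():
    print(f"{sample}: x={x:.2f} x_std={x_std:.2f}")
```

A sample on which all folds agree, like sampleB, gets a standard deviation of zero, while disagreement between folds shows up as a larger x_std.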

Advanced Features 

Training Optimizations 

Leave use_mixed_precision=None to have mixed precision auto-detected when a compatible GPU is present:

ensemble_result = locator.train_ensemble(
    genotypes=genotypes,
    samples=samples,
    k=5,
    use_mixed_precision=None,
    patience_multiplier=1.5,
    verbose=True,
)

Data Augmentation 

Enable training-time augmentation with augment_data=True and set its rate with flip_rate:

ensemble_result = locator.train_ensemble(
    genotypes=genotypes,
    samples=samples,
    k=5,
    augment_data=True,
    flip_rate=0.05,
    verbose=True,
)

Partial Training Sets 

To train on a subset of the samples, pass their indices as training_set_indices:

training_indices = [0, 1, 2, 5, 10, 15, 20]

ensemble_result = locator.train_ensemble(
    genotypes=genotypes,
    samples=samples,
    k=5,
    training_set_indices=training_indices,
    verbose=True,
)
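
In practice the index list is often easier to derive from sample IDs than to type by hand. The sketch below assumes that training_set_indices holds zero-based positions in the samples sequence returned by load_genotypes; the IDs and the known_location set are hypothetical.

```python
# Hypothetical sample IDs, in the order returned by load_genotypes.
samples = ["s01", "s02", "s03", "s04", "s05", "s06"]

# Hypothetical set of samples chosen for training.
known_location = {"s01", "s03", "s04", "s06"}

# Zero-based positions of the chosen samples within `samples`.
training_indices = [
    i for i, sample_id in enumerate(samples) if sample_id in known_location
]
print(training_indices)  # [0, 2, 3, 5]
```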

Model Persistence 

Saving 

When save_fold_models=True and use_model_manager=True, models are written to {out}_ensemble/ automatically:

ensemble_analysis_ensemble/
    metadata.json
    fold_0_model.json
    fold_0_weights.h5
    fold_0_norm_params.json
    ...
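
Before reloading, it can be useful to confirm that a saved directory is complete. This is a minimal sketch, assuming that folds after fold 0 follow the same fold_N_* naming shown in the listing; both helper functions are hypothetical and not part of Locator.

```python
from pathlib import Path


def expected_ensemble_files(k):
    """List file names for k folds, following the fold_0_* pattern in the listing.

    Assumes later folds use the same naming as fold 0 (not confirmed by the guide).
    """
    names = ["metadata.json"]
    for fold in range(k):
        names += [
            f"fold_{fold}_model.json",
            f"fold_{fold}_weights.h5",
            f"fold_{fold}_norm_params.json",
        ]
    return names


def missing_files(ensemble_dir, k):
    """Return the expected file names that are absent from ensemble_dir."""
    root = Path(ensemble_dir)
    return [name for name in expected_ensemble_files(k) if not (root / name).is_file()]


print(missing_files("ensemble_analysis_ensemble", k=5))
```

An empty list means every expected file is present; any names printed point to folds that were not saved.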

Loading 

Reload a saved ensemble and predict:

locator = Locator(config)
locator.load_ensemble("ensemble_analysis_ensemble")

predictions = locator.predict_ensemble_from_manager(
    genotypes=genotypes,
    samples=samples,
    save_predictions=True,
)

Models are loaded one at a time during prediction, so memory usage stays low even for large ensembles.
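
One way to keep memory flat while still reporting a mean and standard deviation is to update running statistics as each fold's predictions arrive. The sketch below uses Welford's algorithm for this; it illustrates the idea only and is not a description of Locator's internals. The generator load_fold_predictions is a hypothetical stand-in with made-up values.

```python
import math


def load_fold_predictions():
    """Stand-in generator that yields one fold's predictions at a time.

    In a real run each step would load a model, predict, and release it.
    The values here are made up for illustration.
    """
    yield [10.0, -3.0]
    yield [12.0, -3.0]
    yield [11.0, -3.0]


def streaming_mean_std(fold_iter):
    """Welford's algorithm per sample: running mean and population std."""
    count = 0
    means, m2 = [], []
    for preds in fold_iter:
        if count == 0:
            means = [0.0] * len(preds)
            m2 = [0.0] * len(preds)
        count += 1
        for i, value in enumerate(preds):
            delta = value - means[i]
            means[i] += delta / count
            m2[i] += delta * (value - means[i])
    if count == 0:
        raise ValueError("no fold predictions supplied")
    stds = [math.sqrt(s / count) for s in m2]
    return means, stds


means, stds = streaming_mean_std(load_fold_predictions())
print(means, stds)
```

Only the running totals and the current fold's output are held at any moment, so peak memory does not grow with the number of folds.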