Ensemble Models Guide
Ensemble models combine predictions from multiple neural networks, each trained
on a different k-fold split of the data. Averaging across folds typically
improves accuracy and reduces overfitting to any single train/validation split,
and the spread of the fold predictions gives a per-sample uncertainty estimate.
The functionality is integrated into the Locator class through the
EnsembleMixin.
Basic Ensemble Training
Sequential Training
Train an ensemble using k-fold cross-validation:
from locator import Locator

config = {
    "out": "ensemble_analysis",
    "batch_size": 32,
    "width": 256,
    "nlayers": 8,
    "dropout_prop": 0.25,
    "max_epochs": 1000,
    "patience": 100,
}

locator = Locator(config)
genotypes, samples = locator.load_genotypes(vcf="genotypes.vcf.gz")

ensemble_result = locator.train_ensemble(
    genotypes=genotypes,
    samples=samples,
    k=5,
    save_fold_models=True,
    use_model_manager=True,
    verbose=True,
)
The returned ensemble_result contains training histories, model
information, averaged normalization parameters, and fold split details.
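To make the fold split concrete, here is a minimal sketch of a shuffled k-fold partition written with NumPy. It is not Locator's implementation: the helper name kfold_indices and the seed argument are hypothetical, and the library may shuffle or balance folds differently.

```python
import numpy as np


def kfold_indices(n_samples, k, seed=0):
    """Yield (fold, train_idx, val_idx) for a shuffled k-fold split.

    Hypothetical helper for illustration only; not part of Locator.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    # Split the shuffled indices into k nearly equal-sized parts.
    folds = np.array_split(shuffled, k)
    for i, val_idx in enumerate(folds):
        # Every part except the i-th one is used for training.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield i, np.sort(train_idx), np.sort(val_idx)


for fold, train_idx, val_idx in kfold_indices(n_samples=10, k=5):
    print(fold, len(train_idx), len(val_idx))
```

Each sample lands in exactly one validation fold, so with k=5 every fold model is trained on roughly 80% of the samples.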
Parallel Training (Multi-GPU)
For faster training across multiple GPUs:
from locator.parallel import parallel_train_ensemble

ensemble_result = parallel_train_ensemble(
    locator=locator,
    genotypes=genotypes,
    samples=samples,
    k=5,
    gpu_ids=[0, 1, 2, 3],
    save_fold_models=True,
    use_model_manager=True,
    verbose=True,
)
Pass an empty list as gpu_ids for CPU-only mode:
ensemble_result = parallel_train_ensemble(
    locator=locator,
    genotypes=genotypes,
    samples=samples,
    k=5,
    gpu_ids=[],
    verbose=True,
)
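When k is larger than the number of GPUs, some device has to train more than one fold. The sketch below shows one plausible scheme, round-robin assignment with a CPU fallback for an empty list. This is an assumption for illustration: the guide does not say how parallel_train_ensemble schedules folds, and assign_folds_to_devices and its device labels are hypothetical.

```python
def assign_folds_to_devices(k, gpu_ids):
    """Map each fold index to a device label, round-robin over gpu_ids.

    Hypothetical helper; the real scheduler may behave differently.
    """
    if not gpu_ids:
        # Empty list: every fold runs on the CPU.
        return {fold: "cpu" for fold in range(k)}
    return {fold: f"gpu:{gpu_ids[fold % len(gpu_ids)]}" for fold in range(k)}


print(assign_folds_to_devices(5, [0, 1, 2, 3]))
# {0: 'gpu:0', 1: 'gpu:1', 2: 'gpu:2', 3: 'gpu:3', 4: 'gpu:0'}
print(assign_folds_to_devices(5, []))
# {0: 'cpu', 1: 'cpu', 2: 'cpu', 3: 'cpu', 4: 'cpu'}
```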
Making Predictions
After training, call predict_ensemble to get averaged predictions across
folds. Set return_std=True for per-sample uncertainty and
include_fold_predictions=True for individual fold outputs:
predictions = locator.predict_ensemble(
    genotypes=genotypes,
    samples=samples,
    return_std=True,
    include_fold_predictions=True,
    save_predictions=True,
)
The result is a DataFrame with columns:
sampleID: sample identifier
x, y: ensemble-mean longitude and latitude
x_std, y_std: standard deviation across folds
x_fold0, y_fold0, …: per-fold predictions (when include_fold_predictions=True)
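The relationship between the per-fold columns and the summary columns can be sketched with pandas. The data below are made up, summarize_folds is a hypothetical helper, and the use of the population standard deviation (ddof=0) is an assumption; the library may use the sample standard deviation instead.

```python
import numpy as np
import pandas as pd

# Made-up per-fold predictions for two samples and three folds.
fold_preds = pd.DataFrame({
    "sampleID": ["s1", "s2"],
    "x_fold0": [10.0, -5.0], "y_fold0": [50.0, 40.0],
    "x_fold1": [12.0, -5.5], "y_fold1": [51.0, 41.0],
    "x_fold2": [11.0, -4.5], "y_fold2": [49.0, 42.0],
})


def summarize_folds(df):
    """Collapse x_fold*/y_fold* columns into a mean and std per sample."""
    out = df[["sampleID"]].copy()
    for axis in ("x", "y"):
        cols = [c for c in df.columns if c.startswith(f"{axis}_fold")]
        out[axis] = df[cols].mean(axis=1)
        # ddof=0 is an assumption; switch to ddof=1 for the sample std.
        out[f"{axis}_std"] = df[cols].std(axis=1, ddof=0)
    return out


summary = summarize_folds(fold_preds)
print(summary)
```

A large x_std or y_std for a sample means the fold models disagree about its location, which is the uncertainty signal the ensemble provides.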
Advanced Features
Training Optimizations
Passing use_mixed_precision=None lets mixed precision be auto-detected, so it is enabled when a compatible GPU is present:
ensemble_result = locator.train_ensemble(
    genotypes=genotypes,
    samples=samples,
    k=5,
    use_mixed_precision=None,
    patience_multiplier=1.5,
    verbose=True,
)
Data Augmentation
ensemble_result = locator.train_ensemble(
    genotypes=genotypes,
    samples=samples,
    k=5,
    augment_data=True,
    flip_rate=0.05,
    verbose=True,
)
Partial Training Sets
training_indices = [0, 1, 2, 5, 10, 15, 20]

ensemble_result = locator.train_ensemble(
    genotypes=genotypes,
    samples=samples,
    k=5,
    training_set_indices=training_indices,
    verbose=True,
)
Model Persistence
Saving
When save_fold_models=True and use_model_manager=True, models are
written automatically to {out}_ensemble/, where {out} is the "out" value
from the config:
ensemble_analysis_ensemble/
    metadata.json
    fold_0_model.json
    fold_0_weights.h5
    fold_0_norm_params.json
    ...
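Before reloading an ensemble it can be useful to check that the directory is complete. The sketch below does this with the standard library only. It assumes that folds beyond fold_0 follow the same fold_{i}_ naming pattern, which the listing above does not show; missing_fold_files is a hypothetical helper, not part of Locator.

```python
import tempfile
from pathlib import Path

# Per-fold file suffixes taken from the fold_0 entries in the listing.
FOLD_SUFFIXES = ("model.json", "weights.h5", "norm_params.json")


def missing_fold_files(ensemble_dir, k):
    """Return names of expected files that are absent from ensemble_dir."""
    root = Path(ensemble_dir)
    expected = [root / "metadata.json"]
    # Assumption: fold i uses the same naming pattern as fold 0.
    expected += [
        root / f"fold_{i}_{suffix}" for i in range(k) for suffix in FOLD_SUFFIXES
    ]
    return [p.name for p in expected if not p.exists()]


# Demonstration on a throwaway directory that contains fold 0 only.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "ensemble_analysis_ensemble"
    root.mkdir()
    for name in ("metadata.json",) + tuple(f"fold_0_{s}" for s in FOLD_SUFFIXES):
        (root / name).touch()
    print(missing_fold_files(root, k=1))  # []
    print(missing_fold_files(root, k=2))  # the three fold_1_* files
```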
Loading
Reload a saved ensemble and predict:
locator = Locator(config)
locator.load_ensemble("ensemble_analysis_ensemble")

predictions = locator.predict_ensemble_from_manager(
    genotypes=genotypes,
    samples=samples,
    save_predictions=True,
)
Models are loaded one at a time during prediction, so memory usage stays low even for large ensembles.
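One way to average predictions without holding every model or every fold's output in memory is a running mean and variance (Welford's algorithm). The sketch below illustrates that idea only; it is not the library's code, streaming_mean_std is a hypothetical helper, and the generator stands in for loading a fold model, predicting, and releasing it.

```python
import numpy as np


def streaming_mean_std(prediction_iter):
    """Running mean and population std over (n_samples, 2) arrays.

    Uses Welford's update, so only one fold's predictions are held at a time.
    """
    count = 0
    mean = None
    m2 = None
    for preds in prediction_iter:
        preds = np.asarray(preds, dtype=float)
        if mean is None:
            mean = np.zeros_like(preds)
            m2 = np.zeros_like(preds)
        count += 1
        delta = preds - mean
        mean = mean + delta / count
        # Accumulate the sum of squared deviations from the updated mean.
        m2 = m2 + delta * (preds - mean)
    if count == 0:
        raise ValueError("no fold predictions supplied")
    return mean, np.sqrt(m2 / count)


def fake_fold_predictions():
    # Stand-in for: load fold model, predict, free the model.
    yield np.array([[10.0, 50.0], [-5.0, 40.0]])
    yield np.array([[12.0, 51.0], [-5.5, 41.0]])
    yield np.array([[11.0, 49.0], [-4.5, 42.0]])


mean, std = streaming_mean_std(fake_fold_predictions())
print(mean)
print(std)
```

Peak memory in this scheme depends on the size of one model plus two accumulator arrays, not on the number of folds.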
See Also
Parallel Analysis Guide – Multi-GPU parallel analysis
API Reference – Complete API reference
Handling Missing Coordinates Guide – Handling missing coordinates