API Reference
Core Module
Locator
- class Locator(config=None)[source]
Bases: DataLoaderMixin, TrainingMixin, PredictionMixin, AnalysisMixin, EnsembleMixin, PlottingMixin
A class for predicting geographic locations from genetic data.
This class implements a neural network approach to predict sample locations from genetic data. It can handle various input formats including:
- Genotype data:
VCF or VCF.gz files
Zarr format
Pandas DataFrame with samples as index, SNP positions as columns
- Sample location data:
Tab-delimited file
Pandas DataFrame
The model can be configured through a dictionary of parameters passed during initialization. Sample location data can be provided either as a file path or as a pandas DataFrame.
- Variables:
config (dict)
model (keras.Model)
history (keras.callbacks.History)
samples (numpy.ndarray)
sdlat (float)
(float)
(float)
(float)
Example
>>> # Using a file path for sample data
>>> locator = Locator({
...     "out": "analysis_1",
...     "sample_data": "samples.txt",
...     "zarr": "genotypes.zarr"
... })

>>> # Using a DataFrame for sample data
>>> locator = Locator({
...     "out": "analysis_1",
...     "sample_data": sample_df,  # pandas DataFrame
...     "zarr": "genotypes.zarr"
... })

>>> # Using DataFrames for both inputs
>>> # Coordinate DataFrame must have columns: sampleID, x, y
>>> coords_df = pd.DataFrame({
...     "sampleID": ["sample1", "sample2"],
...     "x": [longitude1, longitude2],
...     "y": [latitude1, latitude2]
... })
>>>
>>> # Genotype DataFrame has samples as index, SNP positions as columns
>>> geno_df = pd.DataFrame({
...     1001: [0, 1],  # SNP position 1001
...     2001: [1, 2],  # SNP position 2001
... }, index=["sample1", "sample2"])
>>>
>>> locator = Locator({
...     "out": "analysis_1",
...     "sample_data": coords_df,
...     "genotype_data": geno_df
... })
- __init__(config=None)[source]
Initialize Locator with configuration parameters.
- Parameters:
config (dict, optional) – Configuration dictionary that can include the following keys:
Top-level keys:
sample_data (str or pandas.DataFrame): Path to sample data file or a DataFrame with columns ‘sampleID’, ‘x’, ‘y’.
genotype_data (pandas.DataFrame): DataFrame with samples as index, SNP positions as columns, and genotype counts (0, 1, 2) as values.
zarr (str): Path to Zarr format genotype data.
vcf (str): Path to VCF format genotype data.
out (str): Output root name for all output files.
train_split (float): Proportion of data to use for training.
batch_size (int): Batch size for training.
max_epochs (int): Maximum number of training epochs.
patience (int): Patience for early stopping.
min_mac (int): Minimum minor allele count for SNP filtering.
max_SNPs (int): Maximum number of SNPs to use.
width (int): Width of neural network layers.
nlayers (int): Number of neural network layers.
dropout_prop (float): Dropout proportion.
pca_components (int or “auto”): If set, prepend a PCA-initialized linear projection of this width as the first layer and fine-tune it. Use "auto" to pick the width from the genotype-PCA scree elbow. Recommended when n_SNPs >> n_samples. Default None (disabled).
pca_finetune (bool): Whether to unfreeze the PCA projection for a low-learning-rate fine-tuning phase. Default True. False keeps the projection frozen at its PCA initialization.
pca_finetune_lr (float): Learning rate for the PCA fine-tuning phase. Default 1e-4.
keras_verbose (int): Verbosity level for Keras training.
impute_missing (bool): Whether to impute missing genotypes.
validation_split (float): Proportion of data to use for validation.
learning_rate (float): Learning rate for the optimizer.
min_epochs (int): Minimum number of epochs to train.
patience (int): Number of epochs with no improvement to wait before stopping.
min_delta (float): Minimum change in validation loss to qualify as an improvement.
restore_best_weights (bool): Whether to restore model weights from the epoch with the best validation loss.
prediction_frequency (int): Frequency (in epochs) of making predictions during training.
optimizer_algo (str): Optimizer algorithm to use (“adam” or “adamw”).
weight_decay (float): Weight decay coefficient for AdamW optimizer.
- augmentation (dict): Dictionary of augmentation parameters:
enabled (bool): Whether data augmentation is enabled.
flip_rate (float): Rate at which to randomly flip genotypes during augmentation.
- weight_samples (dict): Dictionary of sample weighting parameters:
enabled (bool): Whether to weight samples by distance.
method (str): Method for weighting samples (“KD”, “histogram”, “df”).
xbins (int): Number of bins for histogram.
ybins (int): Number of bins for histogram.
lam (float): Exponent for weights.
bandwidth (float): Bandwidth for KDE.
weightdf (pandas.DataFrame): DataFrame containing sample weights.
use_range_penalty (bool): Whether to apply a range penalty in the loss function.
penalty_weight (float): Weight assigned to the range penalty term.
species_range_geom (shapely.geometry): Shapely geometry object defining the valid species range.
- na_action (str): How to handle samples without coordinates. Options:
‘separate’ (default): Include all samples, train on known, predict unknown.
‘exclude’: Only use samples with known coordinates.
‘fail’: Raise error if any samples lack coordinates.
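As a rough illustration of how the top-level and nested keys above fit together, the sketch below builds one configuration dictionary. The key names come from the list above; every value is an assumed placeholder chosen for illustration, not a documented default.

```python
# Illustrative configuration only: key names are taken from the parameter
# list above, but every value here is an assumed placeholder, not a default.
config = {
    "out": "analysis_1",
    "sample_data": "samples.txt",
    "zarr": "genotypes.zarr",
    "train_split": 0.9,
    "max_epochs": 500,
    "patience": 50,
    "optimizer_algo": "adamw",
    "weight_decay": 1e-4,
    "augmentation": {"enabled": True, "flip_rate": 0.05},
    "weight_samples": {"enabled": False, "method": "KD"},
    "na_action": "separate",
}

# Light sanity checks against the documented option sets before the
# dictionary is handed to Locator(config).
assert config["optimizer_algo"] in {"adam", "adamw"}
assert config["weight_samples"]["method"] in {"KD", "histogram", "df"}
assert config["na_action"] in {"separate", "exclude", "fail"}
```

Checking the enumerated options up front is a cheap way to catch a misspelled value before any data is loaded.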
- property sample_data: DataFrame
Returns the sample data as a pandas DataFrame.
- Returns:
The sample data DataFrame with columns [‘sampleID’, ‘x’, ‘y’, …].
- Return type:
pd.DataFrame
- Raises:
ValueError – If sample data is not available.
Example
>>> locator = Locator({"sample_data": coords_df})
>>> df = locator.sample_data
- get_sample_status(samples, sample_data=None)[source]
Analyze sample coordinate status.
This method identifies which samples have known geographic coordinates and which have missing (NA) coordinates. This is useful for understanding your data and for methods that need to handle samples with and without coordinates differently.
- Parameters:
samples (numpy.ndarray) – Array of sample IDs from genotype data
sample_data (pandas.DataFrame, optional) – DataFrame with columns ‘sampleID’, ‘x’, ‘y’. If not provided, uses the stored sample data or loads from config.
- Returns:
A dictionary containing:
‘known_indices’ (numpy.ndarray): Array indices of samples with coordinates
‘na_indices’ (numpy.ndarray): Array indices of samples without coordinates
‘known_samples’ (numpy.ndarray): Sample IDs with coordinates
‘na_samples’ (numpy.ndarray): Sample IDs without coordinates
‘n_known’ (int): Count of samples with known coordinates
‘n_na’ (int): Count of samples with NA coordinates
‘total’ (int): Total number of samples
- Return type:
dict
Example
>>> locator = Locator(config)
>>> status = locator.get_sample_status(samples)
>>> print(f"Found {status['n_known']} samples with coordinates")
>>> print(f"Found {status['n_na']} samples without coordinates")
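To make the shape of the returned dictionary concrete, here is a minimal pure-Python sketch of the same bookkeeping. It is not the library's implementation: the helper name `sample_status` and the `coords` mapping are hypothetical, and plain lists stand in for the NumPy arrays described above.

```python
import math

def sample_status(samples, coords):
    """Split sample IDs by whether both coordinates are present.

    `coords` maps sampleID -> (x, y). A missing entry, None, or NaN
    is treated as an unknown (NA) location. Hypothetical helper.
    """
    def is_known(sample_id):
        xy = coords.get(sample_id)
        if xy is None:
            return False
        return all(v is not None and not math.isnan(v) for v in xy)

    known = [i for i, s in enumerate(samples) if is_known(s)]
    na = [i for i, s in enumerate(samples) if not is_known(s)]
    return {
        "known_indices": known,
        "na_indices": na,
        "known_samples": [samples[i] for i in known],
        "na_samples": [samples[i] for i in na],
        "n_known": len(known),
        "n_na": len(na),
        "total": len(samples),
    }

status = sample_status(
    ["s1", "s2", "s3"],
    {"s1": (-122.3, 47.6), "s2": (float("nan"), float("nan"))},
)
print(status["n_known"], status["n_na"])  # 1 2
```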
- check_data(genotypes, samples, verbose=True)[source]
Check data quality and report statistics.
This is a convenience method to help users understand their data before running analyses. It reports the number of samples, SNPs, and identifies samples with missing coordinates.
- Parameters:
genotypes (numpy.ndarray or allel.GenotypeArray) – Genotype data
samples (numpy.ndarray) – Array of sample IDs
verbose (bool) – If True, print detailed statistics. Default: True
- Returns:
Sample status dictionary from get_sample_status()
- Return type:
dict
Example
>>> locator = Locator(config)
>>> genotypes, samples = locator.load_genotypes()
>>> status = locator.check_data(genotypes, samples)
Data Summary
==================================================
Total samples: 231
Samples with coordinates: 211
Samples without coordinates: 20
Total SNPs: 1000
Current NA handling mode: separate
- Will train on samples with known locations
- Can predict on samples without locations
Samples without coordinates (first 10):
- sample_001
- sample_002
...
- create_ensemble_early_stopping(patience_multiplier=1.5)
Create early stopping callback with ensemble-specific settings.
- Parameters:
patience_multiplier – Multiply base patience for ensemble training (ensembles often benefit from longer training)
- Returns:
Configured callback
- Return type:
keras.callbacks.EarlyStopping
- create_ensemble_folds(genotypes, samples, k=5, training_set_indices=None, na_action=None)
Create k-fold splits for ensemble training using IndexSet.
- Parameters:
genotypes – GenotypeArray containing genetic data
samples – Array of sample IDs
k – Number of folds (default: 5)
training_set_indices – Optional array of indices to use for training+validation. If provided, only these samples will be used to create k-folds.
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action
- Returns:
Dictionary with fold information:
‘index_sets’: List of IndexSet objects for each fold
‘fold_indices’: Legacy format dict for backward compatibility
‘sample_status’: Sample status information
- Return type:
dict
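The idea of k non-overlapping folds can be sketched without the library. The example below is an assumption-laden illustration: `make_folds` and its `{"train", "val"}` layout are hypothetical and do not reproduce the IndexSet objects or the legacy ‘fold_indices’ format mentioned above.

```python
import random

def make_folds(indices, k, seed=0):
    """Partition `indices` into k non-overlapping validation folds.

    Hypothetical helper: each fold holds out a disjoint slice for
    validation and trains on everything else.
    """
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = []
    for fold in range(k):
        val = sorted(shuffled[fold::k])  # every k-th index, offset by fold
        val_set = set(val)
        train = sorted(i for i in shuffled if i not in val_set)
        folds.append({"train": train, "val": val})
    return folds

folds = make_folds(range(10), k=5)
for n, fold in enumerate(folds):
    print(n, fold["val"])
```

Because every index lands in exactly one validation slice, each sample is held out once across the ensemble.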
- create_ensemble_lr_scheduler(fold_idx)
Create learning rate scheduler for ensemble training.
Each fold can start with a slightly different learning rate to improve ensemble diversity.
- Parameters:
fold_idx – Current fold index
- Returns:
Configured callback
- Return type:
keras.callbacks.ReduceLROnPlateau
- get_ensemble_batch_size(dataset_size, fold_idx=0)
Determine optimal batch size for ensemble training.
Uses GPUOptimizer to find the best batch size, with caching to avoid recomputing for each fold.
- Parameters:
dataset_size – Size of training dataset
fold_idx – Current fold index (for logging)
- Returns:
Optimal batch size
- Return type:
int
- load_ensemble(ensemble_path)
Load a saved ensemble for prediction.
- Parameters:
ensemble_path – Path to the saved ensemble directory
- Returns:
Ensemble information including models and parameters
- Return type:
dict
- load_genotypes(vcf=None, zarr=None, matrix=None, microsat=None, microsat_min_allele_freq=0.01)
Load genotype data from various input sources.
This method can load genotype data from:
1. A stored DataFrame provided during initialization
2. A VCF file
3. A zarr file (scikit-allel or bio2zarr format)
4. A tab-delimited matrix file
5. A tab-delimited microsatellite genotype table
For windowed analysis, SNP positions must be available either from:
- Column names in the genotype DataFrame
- The zarr file’s variants/POS array
- The VCF file’s POS field (automatically loaded)
- Parameters:
vcf (str, optional) – Path to VCF format genotype data
zarr (str, optional) – Path to zarr format genotype data
matrix (str, optional) – Path to tab-delimited matrix file
microsat (str, optional) – Path to tab-delimited microsatellite genotype table
microsat_min_allele_freq (float, optional) – Drop microsat alleles below this per-locus frequency. Default 0.01.
- Returns:
(genotypes, samples) where:
genotypes is an allel.GenotypeArray of shape (n_sites, n_samples, 2) for VCF/zarr/integer-matrix inputs, or a float32 ndarray of shape (n_sites, n_samples) for continuous-dosage (matrix float / microsat) inputs
samples is a numpy array of sample IDs
- Return type:
tuple
Examples
>>> # Using stored DataFrame from initialization
>>> locator = Locator({
...     "genotype_data": geno_df,  # DataFrame with genotypes
...     "sample_data": coords_df   # DataFrame with coordinates
... })
>>> genotypes, samples = locator.load_genotypes()

>>> # Using zarr file (recommended for windowed analysis)
>>> locator = Locator({"sample_data": coords_df})
>>> genotypes, samples = locator.load_genotypes(zarr="path/to/geno.zarr")

>>> # Using VCF file
>>> genotypes, samples = locator.load_genotypes(vcf="path/to/geno.vcf")

>>> # Using matrix file
>>> genotypes, samples = locator.load_genotypes(matrix="path/to/geno.txt")

>>> # Using microsatellite genotypes
>>> genotypes, samples = locator.load_genotypes(microsat="path/to/microsats.tsv")
- Raises:
ValueError – If no input source is provided or if input format is invalid.
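The two return shapes described above are related: a (n_sites, n_samples, 2) array of diploid calls collapses to a (n_sites, n_samples) matrix of alternate-allele counts (0, 1, 2). The sketch below shows that reduction with nested lists. It assumes biallelic sites coded 0/1 and negative values for missing calls, and the helper name is hypothetical.

```python
def alt_allele_counts(calls):
    """Collapse (n_sites, n_samples, 2) calls to (n_sites, n_samples) counts.

    Alleles are coded 0 (reference) or 1 (alternate). A negative allele
    marks a missing call, which is reported here as None.
    """
    counts = []
    for site in calls:
        row = []
        for a, b in site:
            row.append(None if a < 0 or b < 0 else a + b)
        counts.append(row)
    return counts

calls = [
    [(0, 0), (0, 1), (1, 1)],    # site 0, three samples
    [(0, 1), (-1, -1), (0, 0)],  # site 1, second sample missing
]
print(alt_allele_counts(calls))  # [[0, 1, 2], [1, None, 0]]
```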
- load_model(weights_path)
Load a trained model from saved weights.
This method loads a model from HDF5 weights file and restores the preprocessing parameters needed for making predictions.
- Parameters:
weights_path (str) – Path to the saved HDF5 weights file
- Returns:
Dictionary containing loaded metadata including normalization params
- Return type:
dict
- Raises:
ValueError – If weights file cannot be loaded or is missing metadata.
- predict(boot=0, verbose=True, prediction_genotypes=None, genotypes=None, samples=None, indices=None, return_df=False, save_preds_to_disk=True, site_order=None)
Make predictions for samples with unknown locations.
- Parameters:
boot (int, optional) – Bootstrap replicate number. Defaults to 0.
verbose (bool, optional) – Whether to print validation metrics. Defaults to True.
prediction_genotypes (numpy.ndarray, optional) – DEPRECATED: use the genotypes parameter instead. Override default prediction genotypes. Used for jackknife resampling. Defaults to None.
genotypes (numpy.ndarray, optional) – Full genotype array for creating tf.data dataset. Should be the original unfiltered genotypes. Defaults to None.
samples (numpy.ndarray, optional) – Sample IDs corresponding to genotypes. Defaults to None.
indices (numpy.ndarray, optional) – Indices of samples to predict on. If None, predicts on samples without coordinates (self.pred_indices). Defaults to None.
return_df (bool, optional) – Whether to return predictions as pandas DataFrame. Defaults to False.
save_preds_to_disk (bool, optional) – Whether to save predictions to disk. Defaults to True.
site_order (np.ndarray, optional) – Array of SNP indices for bootstrap resampling. If provided, SNPs will be reordered according to these indices during prediction. Used for bootstrap analyses to ensure consistent resampling between train and predict.
- Returns:
Array of predicted coordinates or DataFrame with x,y coordinates and sampleID columns
- Return type:
numpy.ndarray or pandas.DataFrame
- predict_ensemble(genotypes=None, samples=None, indices=None, include_fold_predictions=False, return_std=False, return_df=True, save_predictions=True)
Make predictions using the ensemble of models.
- Parameters:
genotypes – GenotypeArray for prediction (if None, uses stored data)
samples – Sample IDs (if None, uses stored samples)
indices – Specific indices to predict on (if None, predicts all)
include_fold_predictions – Include individual fold predictions in output
return_std – Return standard deviation across ensemble predictions
return_df – Return results as DataFrame (default: True)
save_predictions – Save predictions to disk (default: True)
- Returns:
Ensemble predictions with optional std
- Return type:
pd.DataFrame or np.ndarray
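How per-fold predictions might be combined can be sketched as follows. This is an assumption, not the documented algorithm: it averages each sample's (x, y) across folds and reports the population standard deviation as the spread. The library may aggregate differently or use a different standard deviation estimator. `combine_folds` is a hypothetical name.

```python
from statistics import mean, pstdev

def combine_folds(fold_predictions):
    """Average per-sample (x, y) predictions across folds.

    `fold_predictions` has one entry per fold; each entry is a list of
    (x, y) tuples in the same sample order.
    """
    combined = []
    for per_sample in zip(*fold_predictions):
        xs = [p[0] for p in per_sample]
        ys = [p[1] for p in per_sample]
        combined.append({
            "x": mean(xs), "y": mean(ys),
            "x_std": pstdev(xs), "y_std": pstdev(ys),
        })
    return combined

preds = combine_folds([
    [(10.0, 50.0), (12.0, 48.0)],  # fold 0
    [(11.0, 51.0), (12.0, 48.0)],  # fold 1
    [(12.0, 52.0), (12.0, 48.0)],  # fold 2
])
print(preds[0]["x"], preds[1]["x_std"])  # 11.0 0.0
```

A large standard deviation for a sample signals that the fold models disagree about its location.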
- predict_ensemble_from_manager(genotypes, samples, indices=None, return_df=True, save_predictions=True)
Make predictions using loaded ensemble with model manager.
This method efficiently loads models on-demand for prediction, reducing memory usage for large ensembles.
- Parameters:
genotypes – GenotypeArray for prediction
samples – Sample IDs
indices – Specific indices to predict on (if None, predicts all)
return_df – Return results as DataFrame (default: True)
save_predictions – Save predictions to disk (default: True)
- Returns:
Ensemble predictions
- Return type:
pd.DataFrame or np.ndarray
- predict_from_weights(weights_path, genotypes, samples, sample_data_file=None, save_preds_to_disk=True, return_df=True)
Convenience method to load weights and make predictions.
This method combines loading a saved model and making predictions in a single call. It handles preprocessing the genotypes using the same parameters that were used during training.
- Parameters:
weights_path (str) – Path to saved HDF5 weights file
genotypes (numpy.ndarray) – Genotype data to predict on
samples (numpy.ndarray) – Sample IDs corresponding to genotypes
sample_data_file (str, optional) – Path to sample data file
save_preds_to_disk (bool) – Whether to save predictions to disk
return_df (bool) – Whether to return predictions as DataFrame
- Returns:
Predictions
- Return type:
numpy.ndarray or pandas.DataFrame
- predict_holdout(verbose=True, return_df=False, save_preds_to_disk=True, plot_summary=True, plot_map=True)
Predict locations for held out samples.
- Parameters:
verbose – Print progress and metrics
return_df – Return predictions as pandas DataFrame
save_preds_to_disk – Save predictions to disk
plot_summary – Display error summary plot in notebook (only if return_df=True)
plot_map – Display map of predictions (only if plot_summary=True)
- Returns:
If return_df is True, returns pandas DataFrame with predictions
Otherwise returns None
- run_bootstraps(genotypes, samples, n_bootstraps=50, return_df=False, save_full_pred_matrix=True, na_action=None)
Run bootstrap analysis by resampling SNPs with replacement.
- Parameters:
genotypes – Array of genotype data
samples – Sample IDs corresponding to genotypes
n_bootstraps – Number of bootstrap replicates to run
return_df – Whether to return DataFrame with all predictions
save_full_pred_matrix – Whether to save full prediction matrix to disk
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action
- Returns:
If return_df=True, returns DataFrame with predictions for each bootstrap, otherwise None
- Return type:
pandas.DataFrame or None
Notes
With na_action=‘separate’: Trains on samples with known locations, can predict on samples with NA locations
With na_action=‘exclude’: Only uses samples with known locations
With na_action=‘fail’: Raises error if any NA samples found
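Resampling SNPs with replacement amounts to drawing a list of site indices and reusing it consistently. The sketch below shows the idea with the standard library only. Both helper names are hypothetical, and the way the library actually builds its `site_order` array is not documented here.

```python
import random

def bootstrap_site_order(n_sites, seed=None):
    """Draw n_sites SNP indices with replacement (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.randrange(n_sites) for _ in range(n_sites)]

def resample_sites(genotype_rows, site_order):
    """Reorder a (n_sites, n_samples) matrix by the given site indices."""
    return [genotype_rows[i] for i in site_order]

rows = [[0, 1], [1, 2], [2, 0], [0, 0]]  # 4 SNPs x 2 samples
order = bootstrap_site_order(len(rows), seed=1)
boot_rows = resample_sites(rows, order)
print(order)
print(boot_rows)
```

Keeping the drawn indices around lets the same resampled site set be applied at both training and prediction time, which is the role the `site_order` parameter plays in `train()` and `predict()`.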
- run_holdouts(genotypes, samples, k=10, n_reps=10, holdout_indices=None, holdout_sample_ids=None, return_df=False, save_full_pred_matrix=True, na_action=None)
Run multiple holdout replicates for cross-validation.
- Parameters:
genotypes – Array of genotype data
samples – Sample IDs corresponding to genotypes
k – Number of samples to hold out in each replicate
n_reps – Number of holdout replicates to run
holdout_indices – Optional list of lists, each containing indices to hold out
holdout_sample_ids – Optional list of sample IDs to hold out. If provided, these specific samples will be held out (overrides k and holdout_indices). Can be a single list (used for all replicates) or list of lists (different samples per replicate).
return_df – Whether to return DataFrame with all predictions
save_full_pred_matrix – Whether to save full prediction matrix to disk
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action
- Returns:
If return_df=True, returns DataFrame with predictions for each holdout replicate containing columns:
- sampleID: Sample identifier
- x_pred: Predicted longitude
- y_pred: Predicted latitude
- rep: Replicate number (0 to n_reps-1)
Note: True locations are not included. Merge with sample metadata to calculate errors.
- Return type:
pandas.DataFrame or None
Notes
With na_action=‘separate’: Currently behaves like ‘exclude’ (holdouts must have known locations). Future versions may support predicting NA samples.
With na_action=‘exclude’: Only uses samples with known locations (current behavior)
With na_action=‘fail’: Raises error if any NA samples found
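The list-of-lists form accepted by `holdout_indices` can be built by hand. The sketch below draws k indices per replicate independently, so replicates may share samples, in contrast to the non-overlapping folds of `run_k_fold_holdouts`. The helper name is hypothetical, and the library's own random selection may differ.

```python
import random

def holdout_replicates(known_indices, k=10, n_reps=10, seed=0):
    """Choose k held-out indices per replicate, independently per replicate."""
    rng = random.Random(seed)
    pool = list(known_indices)
    return [sorted(rng.sample(pool, k)) for _ in range(n_reps)]

reps = holdout_replicates(range(50), k=5, n_reps=3)
for rep, held_out in enumerate(reps):
    print(rep, held_out)
```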
- run_jacknife(genotypes, samples, prop=0.05, return_df=False, save_full_pred_matrix=True, na_action=None)
Run jackknife analysis by dropping SNPs.
- Parameters:
genotypes – Array of genotype data
samples – Sample IDs corresponding to genotypes
prop (float, optional) – Proportion of SNPs to drop in each replicate. Defaults to 0.05.
return_df (bool, optional) – Whether to return DataFrame of all predictions. Defaults to False.
save_full_pred_matrix (bool, optional) – Whether to save the full prediction matrix. Defaults to True.
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action
- Returns:
If return_df=True, returns DataFrame containing all predictions, with columns named ‘x_0’, ‘y_0’, ‘x_1’, ‘y_1’, etc. for each jackknife replicate. Row index contains sample IDs.
- Return type:
pandas.DataFrame or None
Notes
With na_action=‘separate’: Trains on samples with known locations, can predict on samples with NA locations
With na_action=‘exclude’: Only uses samples with known locations
With na_action=‘fail’: Raises error if any NA samples found
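Dropping a proportion of SNPs per replicate can be sketched as sampling indices without replacement and keeping the rest. This is an illustration under assumptions: the helper name is hypothetical, and the rounding of `n_sites * prop` is a guess at how the library sizes each drop.

```python
import random

def jackknife_keep_indices(n_sites, prop=0.05, seed=None):
    """Return sorted indices of the SNPs kept after dropping a fraction `prop`."""
    n_drop = int(round(n_sites * prop))
    dropped = set(random.Random(seed).sample(range(n_sites), n_drop))
    return [i for i in range(n_sites) if i not in dropped]

kept = jackknife_keep_indices(1000, prop=0.05, seed=0)
print(len(kept))  # 950
```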
- run_jacknife_holdouts(genotypes, samples, k=10, prop=0.05, n_boots=50, holdout_indices=None, return_df=False, save_full_pred_matrix=True, na_action=None)
Run jackknife analysis on holdout samples.
- Parameters:
genotypes – Array of genotype data
samples – Sample IDs corresponding to genotypes
k – Number of samples to hold out
prop – Proportion of SNPs to drop in each jackknife replicate
n_boots – Number of jackknife replicates
holdout_indices – Optional specific indices to hold out
return_df – Whether to return DataFrame with all predictions
save_full_pred_matrix – Whether to save full prediction matrix to disk
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action
- Returns:
If return_df=True, returns DataFrame with predictions for each jackknife replicate containing columns:
- sampleID: Sample identifier
- x_pred: Predicted longitude
- y_pred: Predicted latitude
- boot: Jackknife replicate number (0 to n_boots-1)
Note: True locations are not included. Merge with sample metadata to calculate errors.
- Return type:
pandas.DataFrame or None
Notes
With na_action=‘separate’: Currently behaves like ‘exclude’ (holdouts must have known locations). Future versions may support predicting NA samples.
With na_action=‘exclude’: Only uses samples with known locations (current behavior)
With na_action=‘fail’: Raises error if any NA samples found
- run_k_fold_holdouts(genotypes, samples, k=10, return_df=False, save_full_pred_matrix=True, verbose=True, na_action=None)
Run true k-fold cross-validation with nonoverlapping holdout sets.
- Parameters:
genotypes – Array of genotype data
samples – Sample IDs corresponding to genotypes
k – Number of folds (holdout sets)
return_df – Whether to return DataFrame with all predictions
save_full_pred_matrix – Whether to save full prediction matrix to disk
verbose – Whether to show training progress and intermediate output
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action
- Returns:
If return_df=True, returns DataFrame with one prediction per held-out sample containing columns:
- sampleID: Sample identifier
- x_pred: Predicted longitude
- y_pred: Predicted latitude
Note: True locations are not included. To calculate prediction errors, merge the returned DataFrame with your sample metadata using the sampleID column.
- Return type:
pandas.DataFrame or None
Notes
With na_action=‘separate’: Currently behaves like ‘exclude’ (k-fold requires known locations). Future versions may support predicting NA samples.
With na_action=‘exclude’: Only uses samples with known locations (current behavior)
With na_action=‘fail’: Raises error if any NA samples found
Example
>>> # Run k-fold cross-validation
>>> predictions = locator.run_k_fold_holdouts(genotypes, samples, k=10, return_df=True)
>>>
>>> # Merge with true locations to calculate errors
>>> sample_data = pd.read_csv('samples.tsv', sep='\t')
>>> merged = predictions.merge(sample_data[['sampleID', 'x', 'y']], on='sampleID')
>>> merged['error_km'] = np.sqrt(
...     (merged['x'] - merged['x_pred'])**2 +
...     (merged['y'] - merged['y_pred'])**2
... ) * 111.32  # Convert degrees to km
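The degree-to-kilometre scaling in the example above is a flat approximation. A degree of longitude covers less ground away from the equator, so errors are overstated at high latitudes. A great-circle distance is one alternative, sketched here on the assumption that x is longitude and y is latitude in decimal degrees, with a mean Earth radius of 6371 km. The function name is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree of longitude at the equator versus at 60 degrees north.
print(round(haversine_km(0, 0, 1, 0), 1))  # 111.2
print(haversine_km(0, 60, 1, 60))          # roughly half the equatorial value
```

Applied row by row to the merged DataFrame, this would replace the `* 111.32` step with a distance that accounts for latitude.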
- run_leave_one_out(genotypes, samples, return_df=True, save_full_pred_matrix=True, na_action=None)
Perform leave-one-out cross-validation: for each sample with a known location, train without it and predict its location.
This is a convenience wrapper around run_k_fold_holdouts with k equal to the number of samples with known locations.
- Parameters:
genotypes – Array of genotype data
samples – Sample IDs corresponding to genotypes
return_df – Whether to return DataFrame with all predictions
save_full_pred_matrix – Whether to save full prediction matrix to disk
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action
- Returns:
DataFrame with predictions for each left-out sample
- Return type:
pandas.DataFrame or None
- run_windows(genotypes, samples, window_start=0, window_size=500000.0, window_stop=None, respect_chromosomes=True, return_df=False, save_full_pred_matrix=True, na_action=None)
Run windowed prediction analysis.
- Parameters:
genotypes – GenotypeArray containing genetic data
samples – Array of sample IDs
window_start – Start position for windows (default: 0)
window_size – Size of windows in base pairs (default: 500kb)
window_stop – Stop position for windows (default: None)
respect_chromosomes – Whether to respect chromosome boundaries when creating windows (default: True). If True, windows will not span chromosome boundaries. Requires chromosome information from VCF/Zarr input.
return_df – Whether to return DataFrame with all predictions
save_full_pred_matrix – Whether to save full prediction matrix to disk
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action
- Returns:
If return_df=True, returns DataFrame with predictions for each window, otherwise None
- Return type:
pandas.DataFrame or None
Notes
With na_action=‘separate’: Trains on samples with known locations, can predict on samples with NA locations
With na_action=‘exclude’: Only uses samples with known locations
With na_action=‘fail’: Raises error if any NA samples found
Warning
When respect_chromosomes=False, window analysis treats all SNP positions as continuous along a single coordinate axis. If your data contains multiple chromosomes, windows may span across chromosome boundaries. Use respect_chromosomes=True (default) for biologically meaningful windows.
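The effect of respecting chromosome boundaries can be illustrated with a small sketch that builds fixed-width windows separately for each chromosome. Everything here is an assumption for illustration: the helper name, the mapping of chromosome names to position lists, and the choice to skip windows that contain no SNPs.

```python
def make_windows(positions_by_chrom, window_size=500_000,
                 window_start=0, window_stop=None):
    """Return (chrom, start, end) windows that never cross a chromosome.

    `positions_by_chrom` maps chromosome name -> sorted SNP positions.
    Windows are half-open: start <= position < end.
    """
    windows = []
    for chrom, positions in positions_by_chrom.items():
        if not positions:
            continue
        stop = positions[-1] if window_stop is None else window_stop
        start = window_start
        while start <= stop:
            end = start + window_size
            # Assumed behaviour: keep only windows with at least one SNP.
            if any(start <= p < end for p in positions):
                windows.append((chrom, start, end))
            start = end
    return windows

wins = make_windows({"chr1": [100, 600_000], "chr2": [250_000]})
print(wins)
# [('chr1', 0, 500000), ('chr1', 500000, 1000000), ('chr2', 0, 500000)]
```

Restarting the coordinate at `window_start` for every chromosome is what keeps a window from mixing SNPs that merely share a position number on different chromosomes.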
- run_windows_holdouts(genotypes, samples, k=10, window_start=0, window_size=500000.0, window_stop=None, respect_chromosomes=True, holdout_indices=None, holdout_sample_ids=None, return_df=False, save_full_pred_matrix=True, na_action=None)
Run windowed analysis on holdout samples.
- Parameters:
genotypes – Array of genotype data
samples – Sample IDs corresponding to genotypes
k – Number of samples to hold out
window_start – Start position for windows
window_size – Size of windows in base pairs
window_stop – Stop position for windows
respect_chromosomes – Whether to respect chromosome boundaries when creating windows (default: True). If True, windows will not span chromosome boundaries. Requires chromosome information from VCF/Zarr input.
holdout_indices – Optional specific indices to hold out
holdout_sample_ids – Optional list of sample IDs to hold out. If provided, these specific samples will be held out (overrides k and holdout_indices).
return_df – Whether to return DataFrame with all predictions
save_full_pred_matrix – Whether to save full prediction matrix to disk
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action
- Returns:
If return_df=True, returns DataFrame with predictions for each window, otherwise None
- Return type:
pandas.DataFrame or None
Notes
With na_action=‘separate’: Currently behaves like ‘exclude’ (holdouts must have known locations). Future versions may support predicting NA samples.
With na_action=‘exclude’: Only uses samples with known locations (current behavior)
With na_action=‘fail’: Raises error if any NA samples found
Warning
When respect_chromosomes=False, window analysis treats all SNP positions as continuous along a single coordinate axis. If your data contains multiple chromosomes, windows may span across chromosome boundaries. Use respect_chromosomes=True (default) for biologically meaningful windows.
- set_sample_weights(wdict)
Set sample weights for training.
- Parameters:
wdict (dict) – Dictionary returned by utils.weight_samples() containing sample weights.
- setup_ensemble_gpu_optimization(use_mixed_precision=None)
Setup GPU optimizations for ensemble training.
- Parameters:
use_mixed_precision – Whether to use mixed precision training. If None, uses config value or auto-detects based on GPU.
- Returns:
Whether mixed precision was enabled
- Return type:
bool
- sort_samples(samples=None, sample_data_file=None, reorder=True)
Sort samples and match with location data.
Matches samples with their location data and ensures consistent ordering between genotype and location data.
- Parameters:
samples (numpy.ndarray) – Array of sample IDs from the genotype data
sample_data_file (str, optional) – Override path to tab-delimited file with columns ‘sampleID’, ‘x’, ‘y’. If not provided, uses stored sample data.
reorder (bool) – If True, automatically reorder metadata to match genotype order. If False, raise error on order mismatch (default: True)
- Returns:
(sample_data DataFrame, locs array of shape (n_samples, 2))
- Return type:
tuple
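The matching step can be pictured as a lookup keyed on sampleID followed by a reorder into genotype order. The sketch below uses plain dictionaries instead of a DataFrame and a hypothetical helper name. Raising on samples absent from the metadata is an assumption made to keep the example short; the library's handling of such samples is governed by na_action.

```python
def align_metadata(samples, metadata_rows, reorder=True):
    """Return metadata rows ordered to match `samples`, plus (x, y) pairs.

    `metadata_rows` is a list of dicts with keys 'sampleID', 'x', 'y'.
    """
    by_id = {row["sampleID"]: row for row in metadata_rows}
    missing = [s for s in samples if s not in by_id]
    if missing:
        # Assumed behaviour for this sketch only.
        raise ValueError(f"samples missing from metadata: {missing}")
    wanted = set(samples)
    current = [r["sampleID"] for r in metadata_rows if r["sampleID"] in wanted]
    if current != list(samples) and not reorder:
        raise ValueError("metadata order does not match genotype sample order")
    ordered = [by_id[s] for s in samples]
    locs = [(r["x"], r["y"]) for r in ordered]
    return ordered, locs

rows = [
    {"sampleID": "b", "x": 2.0, "y": 20.0},
    {"sampleID": "a", "x": 1.0, "y": 10.0},
]
ordered, locs = align_metadata(["a", "b"], rows)
print(locs)  # [(1.0, 10.0), (2.0, 20.0)]
```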
- train(*, genotypes, samples, sample_data_file=None, boot=None, train_gen=None, test_gen=None, pred_gen=None, train_locs=None, test_locs=None, setup_only=False, na_action=None, site_order=None)
Train the Locator model on genotype and location data.
This method trains the neural network model to predict geographic locations from genetic data. It supports both standard training and advanced workflows such as bootstrapping, by accepting pre-processed genotype and location arrays. The model is configured using the parameters provided at initialization.
- Parameters:
genotypes (allel.GenotypeArray or np.ndarray) – Genotype data for all samples. Should be of shape (n_sites, n_samples, ploidy).
samples (np.ndarray) – Array of sample IDs corresponding to the genotype data.
sample_data_file (str, optional) – Path to a tab-delimited file with columns ‘sampleID’, ‘x’, ‘y’ for sample locations. Used if not provided in config or as a DataFrame.
boot (int, optional) – Bootstrap replicate number. Used for bootstrapping analyses. Defaults to None.
train_gen (np.ndarray, optional) – Pre-processed training genotype data. Used for bootstrapping. If None, will be generated from genotypes. Defaults to None.
test_gen (np.ndarray, optional) – Pre-processed test genotype data. Used for bootstrapping. If None, will be generated from genotypes. Defaults to None.
pred_gen (np.ndarray, optional) – Pre-processed prediction genotype data. Used for bootstrapping. If None, will be generated from genotypes. Defaults to None.
train_locs (np.ndarray, optional) – Pre-processed training locations. Used for bootstrapping. If None, will be generated from sample data. Defaults to None.
test_locs (np.ndarray, optional) – Pre-processed test locations. Used for bootstrapping. If None, will be generated from sample data. Defaults to None.
setup_only (bool, optional) – If True, only sets up the model and data without training. Defaults to False.
na_action (str, optional) – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action. Defaults to None.
site_order (np.ndarray, optional) – Array of SNP indices for bootstrap resampling. If provided, SNPs will be reordered according to these indices during training. Used for bootstrap analyses to resample SNPs with replacement.
- Returns:
The Keras training history object if training is performed, or None if setup_only is True.
- Return type:
keras.callbacks.History or None
- Raises:
ValueError – If required sample data is missing or improperly formatted.
Example
>>> # Standard training
>>> loc = Locator({"out": "analysis", "sample_data": "samples.txt", "zarr": "genotypes.zarr"})
>>> genotypes, samples = loc.load_genotypes(zarr="genotypes.zarr")
>>> history = loc.train(genotypes=genotypes, samples=samples)

>>> # Bootstrapping with pre-processed data
>>> history = loc.train(
...     genotypes=None,
...     samples=samples,
...     boot=1,
...     train_gen=boot_train_gen,
...     test_gen=boot_test_gen,
...     pred_gen=boot_pred_gen,
...     train_locs=boot_train_locs,
...     test_locs=boot_test_locs
... )
- train_ensemble(genotypes, samples, k=5, training_set_indices=None, na_action=None, augment_data=False, flip_rate=0.05, save_fold_models=True, verbose=True, use_model_manager=True, use_mixed_precision=None, patience_multiplier=1.0)
Train an ensemble of k models using k-fold cross-validation.
This method trains k models, each on a different k-fold split of the data. It uses the modern tf.data pipeline for memory efficiency and supports all standard Locator features including NA handling and data augmentation.
- Parameters:
genotypes – GenotypeArray containing genetic data
samples – Array of sample IDs
k – Number of folds/models in ensemble (default: 5)
training_set_indices – Optional array of indices to restrict training
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’)
augment_data – Whether to apply data augmentation (default: False)
flip_rate – Rate for genotype flipping augmentation (default: 0.05)
save_fold_models – Whether to save individual fold models (default: True)
verbose – Whether to show training progress (default: True)
use_model_manager – Whether to use model manager for saving (default: True)
use_mixed_precision – Whether to use mixed precision training (default: None, auto-detect)
patience_multiplier – Multiply patience for ensemble training (default: 1.0)
- Returns:
Dictionary containing:
‘histories’: List of training histories for each fold
‘models’: List of trained model configurations
‘normalization_params’: Averaged normalization parameters
‘fold_info’: Information about fold splits
- Return type:
dict
- train_holdout(genotypes=None, samples=None, k=10, holdout_indices=None, filtered_genotypes=None)
Train the model while holding out samples with known locations.
- Parameters:
genotypes – Array of genotype data. Required unless filtered_genotypes is provided.
samples – Sample IDs corresponding to genotypes
k – Number of samples to hold out (ignored if holdout_indices provided)
holdout_indices – Optional specific indices of samples to hold out
filtered_genotypes – Pre-filtered allele count array. If provided, skips internal filter_snps call and avoids loading the full genotype array. Used by parallel dispatch to share one filtered copy across all workers.
- Returns:
keras.callbacks.History object from model training
- train_window(genotypes, samples, window_snp_indices, index_set, normalized_locs)
Train the model for a specific genomic window using efficient tf.data pipeline.
This is an internal method used by run_windows_holdouts to train models on specific genomic windows without creating intermediate arrays.
- Parameters:
genotypes – Full genotype array (not filtered)
samples – Sample IDs
window_snp_indices – Indices of SNPs in this window
index_set – Pre-computed IndexSet with train/test/holdout splits
normalized_locs – Pre-normalized location coordinates
- Returns:
keras.callbacks.History object from model training
Ensemble Functionality
The ensemble functionality is integrated into the main Locator class through the EnsembleMixin.
EnsembleMixin
- class EnsembleMixin[source]
Bases:
object
Mixin class providing ensemble functionality for Locator.
- create_ensemble_folds(genotypes, samples, k=5, training_set_indices=None, na_action=None)[source]
Create k-fold splits for ensemble training using IndexSet.
- Parameters:
genotypes – GenotypeArray containing genetic data
samples – Array of sample IDs
k – Number of folds (default: 5)
training_set_indices – Optional array of indices to use for training+validation. If provided, only these samples will be used to create k-folds.
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action
- Returns:
Dictionary with fold information:
‘index_sets’: List of IndexSet objects for each fold
‘fold_indices’: Legacy format dict for backward compatibility
‘sample_status’: Sample status information
- Return type:
dict
- train_ensemble(genotypes, samples, k=5, training_set_indices=None, na_action=None, augment_data=False, flip_rate=0.05, save_fold_models=True, verbose=True, use_model_manager=True, use_mixed_precision=None, patience_multiplier=1.0)[source]
Train an ensemble of k models using k-fold cross-validation.
This method trains k models, each on a different k-fold split of the data. It uses the modern tf.data pipeline for memory efficiency and supports all standard Locator features including NA handling and data augmentation.
- Parameters:
genotypes – GenotypeArray containing genetic data
samples – Array of sample IDs
k – Number of folds/models in ensemble (default: 5)
training_set_indices – Optional array of indices to restrict training
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’)
augment_data – Whether to apply data augmentation (default: False)
flip_rate – Rate for genotype flipping augmentation (default: 0.05)
save_fold_models – Whether to save individual fold models (default: True)
verbose – Whether to show training progress (default: True)
use_model_manager – Whether to use model manager for saving (default: True)
use_mixed_precision – Whether to use mixed precision training (default: None, auto-detect)
patience_multiplier – Multiply patience for ensemble training (default: 1.0)
- Returns:
Dictionary containing:
‘histories’: List of training histories for each fold
‘models’: List of trained model configurations
‘normalization_params’: Averaged normalization parameters
‘fold_info’: Information about fold splits
- Return type:
dict
- predict_ensemble(genotypes=None, samples=None, indices=None, include_fold_predictions=False, return_std=False, return_df=True, save_predictions=True)[source]
Make predictions using the ensemble of models.
- Parameters:
genotypes – GenotypeArray for prediction (if None, uses stored data)
samples – Sample IDs (if None, uses stored samples)
indices – Specific indices to predict on (if None, predicts all)
include_fold_predictions – Include individual fold predictions in output
return_std – Return standard deviation across ensemble predictions
return_df – Return results as DataFrame (default: True)
save_predictions – Save predictions to disk (default: True)
- Returns:
Ensemble predictions with optional standard deviation
- Return type:
pd.DataFrame or np.ndarray
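As a rough illustration of how fold predictions can be combined, the sketch below averages per-fold (x, y) predictions and reports the spread across folds. It assumes the ensemble prediction is the plain mean of the fold predictions, which this reference does not confirm. combine_fold_predictions is a hypothetical helper written in pure Python, not part of Locator.

```python
from statistics import mean, pstdev


def combine_fold_predictions(fold_predictions):
    """Average per-fold (x, y) predictions for each sample.

    fold_predictions: list of k lists, each holding one (x, y) tuple per sample.
    Returns one dict per sample with the mean and population std per coordinate.
    Hypothetical helper; averaging is an assumption, not Locator's confirmed rule.
    """
    n_samples = len(fold_predictions[0])
    combined = []
    for i in range(n_samples):
        xs = [fold[i][0] for fold in fold_predictions]
        ys = [fold[i][1] for fold in fold_predictions]
        combined.append({
            "x": mean(xs), "y": mean(ys),
            "x_std": pstdev(xs), "y_std": pstdev(ys),
        })
    return combined


# Three folds, two samples each
folds = [
    [(10.0, 50.0), (12.0, 48.0)],
    [(11.0, 51.0), (12.0, 49.0)],
    [(12.0, 52.0), (12.0, 50.0)],
]
result = combine_fold_predictions(folds)
```

A large per-sample standard deviation in such a summary would suggest that the folds disagree about where that sample belongs.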
- load_ensemble(ensemble_path)[source]
Load a saved ensemble for prediction.
- Parameters:
ensemble_path – Path to the saved ensemble directory
- Returns:
Ensemble information including models and parameters
- Return type:
dict
- predict_ensemble_from_manager(genotypes, samples, indices=None, return_df=True, save_predictions=True)[source]
Make predictions using loaded ensemble with model manager.
This method efficiently loads models on-demand for prediction, reducing memory usage for large ensembles.
- Parameters:
genotypes – GenotypeArray for prediction
samples – Sample IDs
indices – Specific indices to predict on (if None, predicts all)
return_df – Return results as DataFrame (default: True)
save_predictions – Save predictions to disk (default: True)
- Returns:
Ensemble predictions
- Return type:
pd.DataFrame or np.ndarray
- setup_ensemble_gpu_optimization(use_mixed_precision=None)[source]
Setup GPU optimizations for ensemble training.
- Parameters:
use_mixed_precision – Whether to use mixed precision training. If None, uses config value or auto-detects based on GPU.
- Returns:
Whether mixed precision was enabled
- Return type:
bool
- get_ensemble_batch_size(dataset_size, fold_idx=0)[source]
Determine optimal batch size for ensemble training.
Uses GPUOptimizer to find the best batch size, with caching to avoid recomputing for each fold.
- Parameters:
dataset_size – Size of training dataset
fold_idx – Current fold index (for logging)
- Returns:
Optimal batch size
- Return type:
int
- create_ensemble_early_stopping(patience_multiplier=1.5)[source]
Create early stopping callback with ensemble-specific settings.
- Parameters:
patience_multiplier – Multiply base patience for ensemble training (ensembles often benefit from longer training)
- Returns:
Configured callback
- Return type:
keras.callbacks.EarlyStopping
- create_ensemble_lr_scheduler(fold_idx)[source]
Create learning rate scheduler for ensemble training.
Each fold can start with a slightly different learning rate to improve ensemble diversity.
- Parameters:
fold_idx – Current fold index
- Returns:
Configured callback
- Return type:
keras.callbacks.ReduceLROnPlateau
EnsembleModelManager
- class EnsembleModelManager(base_path: str)[source]
Bases:
object
Manages multiple models for ensemble predictions.
This class handles:
- Saving and loading ensemble models with metadata
- Lazy loading of model weights
- Efficient storage of normalization parameters
- Model versioning and validation
- __init__(base_path: str)[source]
Initialize model manager.
- Parameters:
base_path – Base path for saving/loading models
- save_ensemble(models_info: List[Dict], ensemble_metadata: Dict | None = None) None[source]
Save ensemble models and metadata.
- Parameters:
models_info – List of model info dictionaries from training
ensemble_metadata – Optional metadata about the ensemble
- load_ensemble(model_builder_fn=None) List[Dict][source]
Load ensemble models and metadata.
- Parameters:
model_builder_fn – Function to build model architecture
- Returns:
List of model info dictionaries
- get_model(fold: int, n_features: int) Model[source]
Get a specific model, loading if necessary.
- Parameters:
fold – Fold index
n_features – Number of features for model construction
- Returns:
Loaded model
- get_normalization_params(fold: int) NormalizationParams[source]
Get normalization parameters for a specific fold.
- Parameters:
fold – Fold index
- Returns:
NormalizationParams instance
- get_averaged_normalization_params() NormalizationParams[source]
Get averaged normalization parameters across all folds.
- Returns:
Averaged NormalizationParams
Parallel Ensemble Training
The parallel ensemble training function is available when Ray is installed:
from locator.parallel import parallel_train_ensemble
- parallel_train_ensemble(locator, genotypes, samples, k=5, gpu_ids=[0, 1], gpu_fraction=1.0, training_set_indices=None, na_action=None, augment_data=False, flip_rate=0.05, save_fold_models=True, use_model_manager=True, use_mixed_precision=None, patience_multiplier=1.0, verbose=True)
Train ensemble models in parallel across multiple GPUs using Ray.
- Parameters:
locator – Locator instance with configuration
genotypes – GenotypeArray containing genetic data
samples – Array of sample IDs
k – Number of folds/models in ensemble (default: 5)
gpu_ids – List of GPU IDs to use (default: [0, 1])
gpu_fraction – Fraction of GPU memory per worker (default: 1.0)
training_set_indices – Optional indices to restrict training
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’)
augment_data – Whether to apply data augmentation
flip_rate – Rate for genotype flipping augmentation
save_fold_models – Whether to save individual fold models
use_model_manager – Whether to use model manager for storage
use_mixed_precision – Whether to use mixed precision training
patience_multiplier – Multiply patience for ensemble training
verbose – Whether to show training progress
- Returns:
dict containing histories, models, normalization_params, fold_info
Note
This function requires Ray to be installed. Install with pip install locator[ray].
Models Module
- create_network(input_shape: int, width: int = 256, n_layers: int = 8, dropout_prop: float = 0.25, pca_components: int | None = None, optimizer_config: dict | None = None, loss_fn: callable | None = None) Model[source]
Create a neural network model for geographic location prediction.
- Parameters:
input_shape (int) – Number of input features (SNPs).
width (int, optional) – Width of the dense layers, defaults to 256.
n_layers (int, optional) – Total number of dense layers (excluding final layers), defaults to 8.
dropout_prop (float, optional) – Dropout proportion for middle dropout layer, defaults to 0.25.
pca_components (int, optional) – If set, prepend a linear projection layer named “pca_projection” of this width as the first layer. The caller is responsible for initializing its weights with PCA loadings. Defaults to None (no projection layer).
optimizer_config (dict, optional) – Configuration for the optimizer. Should be a dict containing keys: “algo” (str): “adam” or “adamw”; “learning_rate” (float); “weight_decay” (float, only used for “adamw”). Defaults to None (uses Adam with default settings).
loss_fn (callable, optional) – Loss function to use. Defaults to None, in which case euclidean_distance_loss is used.
- Returns:
Compiled Keras model ready for training.
- Return type:
keras.Model
Example
>>> model = create_network(input_shape=1000)
>>> model.summary()
Data Module
This module contains the memory-efficient data pipeline components.
IndexSet
- class IndexSet(indices: Dict[str, ndarray], total_samples: int, na_mask: ndarray | None = None)[source]
Bases:
object
Container for dataset indices that avoids copying data.
This class stores indices for different data splits (train/val/test) to enable memory-efficient data access without creating copies of large genotype arrays.
- Variables:
indices (Dictionary mapping split names to numpy arrays of indices)
total_samples (Total number of samples in the dataset)
na_mask (Optional boolean mask indicating samples without coordinates)
- classmethod random_split(n: int, splits: Dict[str, float] | None = None, seed: int | None = None, na_mask: ndarray | None = None, na_action: str = 'separate') IndexSet[source]
Create random train/val/test splits.
- Parameters:
n – Total number of samples
splits – Dictionary mapping split names to proportions (must sum to ≤ 1.0). Default: {“train”: 0.8, “val”: 0.1, “test”: 0.1}
seed – Random seed for reproducibility
na_mask – Boolean mask indicating samples without coordinates
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’)
- Returns:
IndexSet with random splits
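The proportional splitting described above can be sketched without any Locator dependency. This is only an illustration of the idea: random_split_indices is a hypothetical helper, it ignores na_mask and na_action entirely, and truncating each split size with int() is an assumption about rounding.

```python
import random


def random_split_indices(n, splits=None, seed=None):
    """Shuffle range(n) and cut it into named splits by proportion.

    Hypothetical helper; NA handling (na_mask / na_action) is not modelled.
    """
    if splits is None:
        splits = {"train": 0.8, "val": 0.1, "test": 0.1}
    if sum(splits.values()) > 1.0 + 1e-9:
        raise ValueError("split proportions must sum to <= 1.0")
    order = list(range(n))
    random.Random(seed).shuffle(order)  # seeded for reproducibility
    out, start = {}, 0
    for name, prop in splits.items():
        size = int(n * prop)  # assumption: sizes are truncated, not rounded
        out[name] = sorted(order[start:start + size])
        start += size
    return out


parts = random_split_indices(100, seed=42)
```

Because only index lists are produced, the genotype array itself never needs to be copied to form the splits.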
- classmethod from_k_fold(n: int, k: int, fold: int, seed: int | None = None, na_mask: ndarray | None = None) IndexSet[source]
Create train/test split for k-fold cross-validation.
- Parameters:
n – Total number of samples
k – Number of folds
fold – Which fold to use as test set (0-indexed)
seed – Random seed for reproducibility
na_mask – Boolean mask indicating samples without coordinates
- Returns:
IndexSet with train and test splits
- classmethod from_groups(groups: ndarray, test_groups: List[int | str], na_mask: ndarray | None = None) IndexSet[source]
Create train/test split based on group membership.
Useful for spatial or temporal cross-validation where you want to hold out entire groups (e.g., geographic regions).
- Parameters:
groups – Array of group labels for each sample
test_groups – List of group labels to use as test set
na_mask – Boolean mask indicating samples without coordinates
- Returns:
IndexSet with train and test splits
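Group-based holdout can be illustrated with a few lines of plain Python. The sketch below sends every sample whose group label is in test_groups to the test split and everything else to train. split_by_groups is a hypothetical helper and does not model na_mask.

```python
def split_by_groups(groups, test_groups):
    """Return train/test index lists based on group membership.

    Hypothetical helper; samples whose label is in test_groups are held out.
    """
    held_out = set(test_groups)
    train = [i for i, g in enumerate(groups) if g not in held_out]
    test = [i for i, g in enumerate(groups) if g in held_out]
    return {"train": train, "test": test}


# Hold out one whole geographic region
regions = ["north", "north", "south", "east", "south", "east"]
split = split_by_groups(regions, test_groups=["south"])
```

Holding out whole regions this way tests how well a model extrapolates to unsampled areas, which a random split would not reveal.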
- classmethod from_manual(train: ndarray, test: ndarray | None = None, val: ndarray | None = None, predict: ndarray | None = None, total_samples: int | None = None) IndexSet[source]
Create IndexSet from manually specified indices.
- Parameters:
train – Training indices
test – Test indices
val – Validation indices
predict – Prediction indices (samples without labels)
total_samples – Total number of samples (inferred if not provided)
- Returns:
IndexSet with specified splits
- classmethod k_fold_split(n: int, k: int, seed: int | None = None, na_mask: ndarray | None = None) List[IndexSet][source]
Create all k-fold cross-validation splits at once.
This method generates k IndexSet objects, one for each fold, suitable for ensemble training or cross-validation.
- Parameters:
n – Total number of samples
k – Number of folds
seed – Random seed for reproducibility
na_mask – Boolean mask indicating samples to exclude from k-fold (e.g., samples without coordinates or not in training set)
- Returns:
List of k IndexSet objects, one for each fold
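A minimal sketch of generating all k folds at once, using only the standard library. k_fold_indices is a hypothetical helper; the striding used to form near-equal folds and the omission of na_mask are assumptions made for brevity, not a description of how IndexSet is implemented.

```python
import random


def k_fold_indices(n, k, seed=None):
    """Return k dicts, each with 'train' and 'test' index lists.

    Hypothetical helper: every index appears in exactly one test fold.
    """
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # near-equal fold sizes
    splits = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        splits.append({"train": train, "test": test})
    return splits


splits = k_fold_indices(n=10, k=5, seed=0)
```

Each of the k dicts could then drive one model of an ensemble, with the test fold serving as that model's held-out data.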
Data Pipeline Functions
- make_tf_dataset(coordinates: ndarray, index_set: IndexSet, split: str, batch_size: int = 256, sample_weights: ndarray | None = None, training: bool = True, shuffle: bool = True, drop_remainder: bool | None = None, prefetch: bool = True) DatasetV2[source]
Create an index-based tf.data pipeline for training or validation.
The pipeline carries only sample indices and their coordinates, a few kilobytes per batch. Genotypes are gathered on the GPU inside IndexedGenotypeModel, so the genotype matrix never enters this pipeline and there is no per-epoch host-to-device genotype traffic.
- Parameters:
coordinates – Full coordinate array of shape (n_samples, 2).
index_set – IndexSet containing the train/val/test/predict splits.
split – Which split to use (‘train’, ‘val’, ‘test’, ‘predict’).
batch_size – Batch size for the dataset.
sample_weights – Optional per-sample weights, aligned to the split’s index order (length must equal the split size).
training – Whether this is for training (enables shuffling).
shuffle – Whether to shuffle the split each epoch (only when training).
drop_remainder – Whether to drop the final partial batch (defaults to the value of training).
prefetch – Whether to prefetch batches.
- Returns:
A tf.data.Dataset yielding (sample_index, coordinate) batches, or (sample_index, coordinate, sample_weight) when weights are given.
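The idea of an index-based pipeline can be shown with a plain Python generator. This is a sketch of the concept only, not the tf.data implementation: index_batches is a hypothetical helper that yields index and coordinate batches and leaves the genotype lookup to the consumer.

```python
import random


def index_batches(coordinates, split_indices, batch_size=256, shuffle=True,
                  drop_remainder=False, seed=None):
    """Yield (sample_indices, coordinates) batches for one split.

    Only indices and their (x, y) pairs are emitted; genotypes are expected
    to be looked up elsewhere using the yielded indices.
    Hypothetical helper, not part of Locator.
    """
    order = list(split_indices)
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        if drop_remainder and len(batch) < batch_size:
            break  # skip the final partial batch
        yield batch, [coordinates[i] for i in batch]


coords = [(float(i), float(-i)) for i in range(10)]
batches = list(index_batches(coords, split_indices=[0, 2, 4, 6, 8],
                             batch_size=2, shuffle=False))
```

Each batch holds a handful of integers and floats, so its size is independent of the number of SNPs.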
Preprocessing Functions
- filter_snps(genotypes, min_mac: int = 1, max_snps: int | None = None, impute: bool = False, verbose: bool = False) Tuple[ndarray, FilterStats][source]
Filter SNPs based on criteria and return statistics.
- Parameters:
genotypes – GenotypeArray to filter
min_mac – Minimum minor allele count for filtering
max_snps – Maximum number of SNPs to retain
impute – Whether to impute missing data
verbose – Whether to print progress messages
- Returns:
Tuple of (filtered allele counts array, FilterStats)
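To make the min_mac and max_snps parameters concrete, here is a small sketch of minor allele count filtering on diploid alternate allele counts. filter_by_mac is a hypothetical helper; the threshold comparison (>=), the random subsetting to max_snps, and the absence of missing data are all assumptions and may differ from what filter_snps actually does.

```python
import random


def filter_by_mac(alt_counts_by_site, min_mac=1, max_snps=None, seed=None):
    """Keep sites whose minor allele count is at least min_mac.

    alt_counts_by_site: list of sites, each a list of per-sample alternate
    allele counts (0, 1 or 2 for diploids, no missing values).
    Returns (kept_site_indices, filtered_rows). Hypothetical helper.
    """
    kept = []
    for site_idx, row in enumerate(alt_counts_by_site):
        alt = sum(row)
        total = 2 * len(row)         # diploid: two alleles per sample
        mac = min(alt, total - alt)  # minor allele count
        if mac >= min_mac:
            kept.append(site_idx)
    if max_snps is not None and len(kept) > max_snps:
        # assumption: excess SNPs are dropped by random subsampling
        kept = sorted(random.Random(seed).sample(kept, max_snps))
    return kept, [alt_counts_by_site[i] for i in kept]


sites = [
    [0, 0, 0, 0],  # monomorphic, mac = 0
    [0, 1, 0, 0],  # singleton, mac = 1
    [1, 2, 0, 1],  # mac = 4
    [2, 2, 2, 2],  # fixed for the alternate allele, mac = 0
]
kept, filtered = filter_by_mac(sites, min_mac=2)
```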
- normalize_locs(locs: ndarray) Tuple[float, float, float, float, ndarray, ndarray][source]
Normalize location coordinates.
- Parameters:
locs – Array of shape (n_samples, 2) containing longitude and latitude
- Returns:
Tuple of (meanlong, sdlong, meanlat, sdlat, unnormedlocs, normedlocs)
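The return tuple suggests a per-axis standardization. The sketch below shows one way to do this, assuming a z-score with the population standard deviation and no missing coordinates; both are assumptions, and zscore_locs and denormalize are hypothetical helpers rather than Locator functions.

```python
from statistics import mean, pstdev


def zscore_locs(locs):
    """Standardize (longitude, latitude) pairs to zero mean and unit variance.

    Hypothetical helper mirroring the documented return order.
    """
    longs = [p[0] for p in locs]
    lats = [p[1] for p in locs]
    meanlong, sdlong = mean(longs), pstdev(longs)
    meanlat, sdlat = mean(lats), pstdev(lats)
    normed = [((x - meanlong) / sdlong, (y - meanlat) / sdlat) for x, y in locs]
    return meanlong, sdlong, meanlat, sdlat, locs, normed


def denormalize(pred, meanlong, sdlong, meanlat, sdlat):
    """Map a normalized prediction back to coordinate units."""
    return pred[0] * sdlong + meanlong, pred[1] * sdlat + meanlat


locs = [(10.0, 40.0), (12.0, 44.0), (14.0, 48.0)]
meanlong, sdlong, meanlat, sdlat, unnormed, normed = zscore_locs(locs)
restored = denormalize(normed[0], meanlong, sdlong, meanlat, sdlat)
```

Keeping the four statistics is what allows predictions made in normalized space to be converted back to longitude and latitude.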
- impute_missing(genotypes, alt_counts: ndarray | None = None) ndarray[source]
Replace missing data with binomial draws from allele frequency.
- Parameters:
genotypes – GenotypeArray with missing data
alt_counts – Optional precomputed per-site alt allele counts of shape (n_sites,). When provided, the internal count_alleles() call is skipped. Used by filter_snps to reuse counts from its numba kernel.
- Returns:
Allele counts array with imputed values
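A minimal sketch of imputation by binomial draws, assuming diploid data with missing calls marked as None. impute_from_frequency is a hypothetical helper: it estimates the alternate allele frequency per site from the observed samples and draws Binomial(2, p) for each missing entry, which is one reading of the description above rather than the library's exact procedure.

```python
import random


def impute_from_frequency(alt_counts_by_site, seed=None):
    """Fill missing entries (None) with Binomial(2, p) draws per site.

    p is the alternate allele frequency estimated from the non-missing
    samples at that site (diploid data assumed). Hypothetical helper.
    """
    rng = random.Random(seed)
    imputed = []
    for row in alt_counts_by_site:
        observed = [g for g in row if g is not None]
        p = sum(observed) / (2 * len(observed)) if observed else 0.0
        new_row = [
            g if g is not None
            else sum(rng.random() < p for _ in range(2))  # two Bernoulli(p) draws
            for g in row
        ]
        imputed.append(new_row)
    return imputed


sites = [[0, 1, None, 2], [None, 0, 0, 0], [2, 2, None, 2]]
filled = impute_from_frequency(sites, seed=1)
```

Sites that are monomorphic among the observed samples are always filled with the observed genotype, since p is then 0 or 1.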
Data Classes
- class FilterStats(n_samples_original: int, n_samples_filtered: int, n_snps_original: int, n_snps_filtered: int, mac_threshold: int, samples_removed_na: list[str] = None, n_biallelic_filtered: int = 0, n_mac_filtered: int = 0, n_random_subset: int = 0)[source]
Track what was filtered and why.
Sample Weights Module
- weight_samples(method: str, trainlocs: ndarray | None = None, trainsamps: ndarray | None = None, weightdf: DataFrame | None = None, xbins: int | None = None, ybins: int | None = None, lam: float | None = None, bandwidth: float | None = None, cache_bandwidth: bool = True, n_bandwidths: int = 100) Dict[str, Any][source]
Calculate weights for training data based on the specified method.
- Parameters:
method – Method for calculating weights (‘KD’, ‘histogram’, or ‘load’)
trainlocs – Training locations (required for KD and histogram methods)
trainsamps – Training sample IDs
weightdf – DataFrame containing pre-calculated sample weights
xbins – Number of bins in x direction for histogram method
ybins – Number of bins in y direction for histogram method
lam – Exponent for KDE weights
bandwidth – Bandwidth for KDE (if None, will be calculated)
cache_bandwidth – Whether to use bandwidth caching for KDE
n_bandwidths – Number of bandwidth values to test if calculating
- Returns:
Dictionary containing:
‘method’: weighting method used
‘sample_weights’: array of weights
‘sample_weights_df’: DataFrame with sampleID and weights
method-specific parameters
GPU Optimizer Module
- class GPUOptimizer[source]
Utilities for optimizing GPU performance in TensorFlow.
- static setup_mixed_precision()[source]
Enable mixed precision training, which can give up to about a 2x speedup on modern GPUs.
- Returns:
True if mixed precision was enabled successfully
- Return type:
bool
- static get_optimal_batch_size(model: Model, input_shape: Tuple[int, ...], target_memory_usage: float = 0.9, min_batch_size: int = 32, max_batch_size: int = 2048, dataset_size: int | None = None, verbose: bool = True) int[source]
Dynamically determine optimal batch size for GPU memory.
- Parameters:
model – Keras model to optimize for
input_shape – Shape of single input sample (excluding batch dimension)
target_memory_usage – Target GPU memory usage (0.0-1.0)
min_batch_size – Minimum batch size to test
max_batch_size – Maximum batch size to test
dataset_size – Size of the dataset (if provided, limits max batch size)
- Returns:
Optimal batch size for current GPU
- Return type:
int
Internal Modules (Implementation Details)
These modules contain the implementation of Locator functionality. Users typically interact with these through the main Locator class.
Loaders Module
- class DataLoaderMixin[source]
Mixin class providing data loading functionality for Locator.
- load_genotypes(vcf=None, zarr=None, matrix=None, microsat=None, microsat_min_allele_freq=0.01)[source]
Load genotype data from various input sources.
This method can load genotype data from:
1. A stored DataFrame provided during initialization
2. A VCF file
3. A zarr file (scikit-allel or bio2zarr format)
4. A tab-delimited matrix file
5. A tab-delimited microsatellite genotype table
For windowed analysis, SNP positions must be available from one of:
- Column names in the genotype DataFrame
- The zarr file’s variants/POS array
- The VCF file’s POS field (automatically loaded)
- Parameters:
vcf (str, optional) – Path to VCF format genotype data
zarr (str, optional) – Path to zarr format genotype data
matrix (str, optional) – Path to tab-delimited matrix file
microsat (str, optional) – Path to tab-delimited microsatellite genotype table
microsat_min_allele_freq (float, optional) – Drop microsat alleles below this per-locus frequency. Default 0.01.
- Returns:
(genotypes, samples) where:
genotypes is an allel.GenotypeArray of shape (n_sites, n_samples, 2) for VCF/zarr/integer-matrix inputs, or a float32 ndarray of shape (n_sites, n_samples) for continuous-dosage (matrix float / microsat) inputs
samples is a numpy array of sample IDs
- Return type:
tuple
Examples
>>> # Using stored DataFrame from initialization
>>> locator = Locator({
...     "genotype_data": geno_df,  # DataFrame with genotypes
...     "sample_data": coords_df   # DataFrame with coordinates
... })
>>> genotypes, samples = locator.load_genotypes()
>>> # Using zarr file (recommended for windowed analysis)
>>> locator = Locator({"sample_data": coords_df})
>>> genotypes, samples = locator.load_genotypes(zarr="path/to/geno.zarr")
>>> # Using VCF file
>>> genotypes, samples = locator.load_genotypes(vcf="path/to/geno.vcf")
>>> # Using matrix file
>>> genotypes, samples = locator.load_genotypes(matrix="path/to/geno.txt")
>>> # Using microsatellite genotypes
>>> genotypes, samples = locator.load_genotypes(microsat="path/to/microsats.tsv")
- Raises:
ValueError – If no input source is provided or if input format is invalid.
- sort_samples(samples=None, sample_data_file=None, reorder=True)[source]
Sort samples and match with location data.
Matches samples with their location data and ensures consistent ordering between genotype and location data.
- Parameters:
samples (numpy.ndarray) – Array of sample IDs from the genotype data
sample_data_file (str, optional) – Override path to tab-delimited file with columns ‘sampleID’, ‘x’, ‘y’. If not provided, uses stored sample data.
reorder (bool) – If True, automatically reorder metadata to match genotype order. If False, raise error on order mismatch (default: True)
- Returns:
(sample_data DataFrame, locs array of shape (n_samples, 2))
- Return type:
tuple
Training Module
- class TrainingMixin[source]
Mixin class providing training functionality for Locator.
- set_sample_weights(wdict)[source]
Set sample weights for training.
- Parameters:
wdict (dict) – Dictionary returned by utils.weight_samples() containing sample weights.
- train(*, genotypes, samples, sample_data_file=None, boot=None, train_gen=None, test_gen=None, pred_gen=None, train_locs=None, test_locs=None, setup_only=False, na_action=None, site_order=None)[source]
Train the Locator model on genotype and location data.
This method trains the neural network model to predict geographic locations from genetic data. It supports both standard training and advanced workflows such as bootstrapping, by accepting pre-processed genotype and location arrays. The model is configured using the parameters provided at initialization.
- Parameters:
genotypes (allel.GenotypeArray or np.ndarray) – Genotype data for all samples. Should be of shape (n_sites, n_samples, ploidy).
samples (np.ndarray) – Array of sample IDs corresponding to the genotype data.
sample_data_file (str, optional) – Path to a tab-delimited file with columns ‘sampleID’, ‘x’, ‘y’ for sample locations. Used if not provided in config or as a DataFrame.
boot (int, optional) – Bootstrap replicate number. Used for bootstrapping analyses. Defaults to None.
train_gen (np.ndarray, optional) – Pre-processed training genotype data. Used for bootstrapping. If None, will be generated from genotypes. Defaults to None.
test_gen (np.ndarray, optional) – Pre-processed test genotype data. Used for bootstrapping. If None, will be generated from genotypes. Defaults to None.
pred_gen (np.ndarray, optional) – Pre-processed prediction genotype data. Used for bootstrapping. If None, will be generated from genotypes. Defaults to None.
train_locs (np.ndarray, optional) – Pre-processed training locations. Used for bootstrapping. If None, will be generated from sample data. Defaults to None.
test_locs (np.ndarray, optional) – Pre-processed test locations. Used for bootstrapping. If None, will be generated from sample data. Defaults to None.
setup_only (bool, optional) – If True, only sets up the model and data without training. Defaults to False.
na_action (str, optional) – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action. Defaults to None.
site_order (np.ndarray, optional) – Array of SNP indices for bootstrap resampling. If provided, SNPs will be reordered according to these indices during training. Used for bootstrap analyses to resample SNPs with replacement.
- Returns:
The Keras training history object if training is performed, or None if setup_only is True.
- Return type:
keras.callbacks.History or None
- Raises:
ValueError – If required sample data is missing or improperly formatted.
Example
>>> # Standard training
>>> loc = Locator({"out": "analysis", "sample_data": "samples.txt", "zarr": "genotypes.zarr"})
>>> genotypes, samples = loc.load_genotypes(zarr="genotypes.zarr")
>>> history = loc.train(genotypes=genotypes, samples=samples)
>>> # Bootstrapping with pre-processed data
>>> history = loc.train(
...     genotypes=None,
...     samples=samples,
...     boot=1,
...     train_gen=boot_train_gen,
...     test_gen=boot_test_gen,
...     pred_gen=boot_pred_gen,
...     train_locs=boot_train_locs,
...     test_locs=boot_test_locs
... )
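The site_order parameter is described as an array of SNP indices resampled with replacement. A minimal sketch of building such an array and applying it to a site-major genotype matrix is shown below. bootstrap_site_order and reorder_sites are hypothetical helpers written with the standard library; how Locator generates or consumes site_order internally is not shown in this reference.

```python
import random


def bootstrap_site_order(n_sites, seed=None):
    """Draw n_sites SNP indices with replacement (one bootstrap replicate)."""
    rng = random.Random(seed)
    return [rng.randrange(n_sites) for _ in range(n_sites)]


def reorder_sites(rows_by_site, site_order):
    """Build the resampled genotype matrix by indexing sites with site_order."""
    return [rows_by_site[i] for i in site_order]


sites = [[0, 1], [2, 0], [1, 1], [0, 0]]  # 4 sites x 2 samples
order = bootstrap_site_order(len(sites), seed=3)
resampled = reorder_sites(sites, order)
```

The same index array would need to be reused at prediction time so that the model sees SNPs in the order it was trained on.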
- train_holdout(genotypes=None, samples=None, k=10, holdout_indices=None, filtered_genotypes=None)[source]
Train the model while holding out samples with known locations.
- Parameters:
genotypes – Array of genotype data. Required unless filtered_genotypes is provided.
samples – Sample IDs corresponding to genotypes
k – Number of samples to hold out (ignored if holdout_indices provided)
holdout_indices – Optional specific indices of samples to hold out
filtered_genotypes – Pre-filtered allele count array. If provided, skips internal filter_snps call and avoids loading the full genotype array. Used by parallel dispatch to share one filtered copy across all workers.
- Returns:
keras.callbacks.History object from model training
- train_window(genotypes, samples, window_snp_indices, index_set, normalized_locs)[source]
Train the model for a specific genomic window using efficient tf.data pipeline.
This is an internal method used by run_windows_holdouts to train models on specific genomic windows without creating intermediate arrays.
- Parameters:
genotypes – Full genotype array (not filtered)
samples – Sample IDs
window_snp_indices – Indices of SNPs in this window
index_set – Pre-computed IndexSet with train/test/holdout splits
normalized_locs – Pre-normalized location coordinates
- Returns:
keras.callbacks.History object from model training
Prediction Module
- class PredictionMixin[source]
Mixin class providing prediction functionality for Locator.
- predict(boot=0, verbose=True, prediction_genotypes=None, genotypes=None, samples=None, indices=None, return_df=False, save_preds_to_disk=True, site_order=None)[source]
Make predictions for samples with unknown locations.
- Parameters:
boot (int, optional) – Bootstrap replicate number. Defaults to 0.
verbose (bool, optional) – Whether to print validation metrics. Defaults to True.
prediction_genotypes (numpy.ndarray, optional) – DEPRECATED: use the genotypes parameter instead. Overrides the default prediction genotypes. Used for jackknife resampling. Defaults to None.
genotypes (numpy.ndarray, optional) – Full genotype array for creating tf.data dataset. Should be the original unfiltered genotypes. Defaults to None.
samples (numpy.ndarray, optional) – Sample IDs corresponding to genotypes. Defaults to None.
indices (numpy.ndarray, optional) – Indices of samples to predict on. If None, predicts on samples without coordinates (self.pred_indices). Defaults to None.
return_df (bool, optional) – Whether to return predictions as pandas DataFrame. Defaults to False.
save_preds_to_disk (bool, optional) – Whether to save predictions to disk. Defaults to True.
site_order (np.ndarray, optional) – Array of SNP indices for bootstrap resampling. If provided, SNPs will be reordered according to these indices during prediction. Used for bootstrap analyses to ensure consistent resampling between train and predict.
- Returns:
Array of predicted coordinates, or DataFrame with x, y coordinates and sampleID columns
- Return type:
numpy.ndarray or pandas.DataFrame
- load_model(weights_path)[source]
Load a trained model from saved weights.
This method loads a model from HDF5 weights file and restores the preprocessing parameters needed for making predictions.
- Parameters:
weights_path (str) – Path to the saved HDF5 weights file
- Returns:
Dictionary containing loaded metadata including normalization params
- Return type:
dict
- Raises:
ValueError – If weights file cannot be loaded or is missing metadata.
- predict_from_weights(weights_path, genotypes, samples, sample_data_file=None, save_preds_to_disk=True, return_df=True)[source]
Convenience method to load weights and make predictions.
This method combines loading a saved model and making predictions in a single call. It handles preprocessing the genotypes using the same parameters that were used during training.
- Parameters:
weights_path (str) – Path to saved HDF5 weights file
genotypes (numpy.ndarray) – Genotype data to predict on
samples (numpy.ndarray) – Sample IDs corresponding to genotypes
sample_data_file (str, optional) – Path to sample data file
save_preds_to_disk (bool) – Whether to save predictions to disk
return_df (bool) – Whether to return predictions as DataFrame
- Returns:
Predictions
- Return type:
numpy.ndarray or pandas.DataFrame
- predict_holdout(verbose=True, return_df=False, save_preds_to_disk=True, plot_summary=True, plot_map=True)[source]
Predict locations for held out samples.
- Parameters:
verbose – Print progress and metrics
return_df – Return predictions as pandas DataFrame
save_preds_to_disk – Save predictions to disk
plot_summary – Display error summary plot in notebook (only if return_df=True)
plot_map – Display map of predictions (only if plot_summary=True)
- Returns:
If return_df is True, returns pandas DataFrame with predictions
Otherwise returns None
Analysis Module
- class AnalysisMixin[source]
Mixin class providing analysis functionality for Locator.
Parallel Analysis Module
This module provides Ray-based parallel implementations of analysis methods for multi-GPU execution.
- parallel_k_fold_holdouts(*args, **kwargs)
- parallel_leave_one_out(*args, **kwargs)
- parallel_holdouts(*args, **kwargs)
- parallel_windows_holdouts(*args, **kwargs)
Plotting Module
This module provides visualization functions for Locator predictions and analyses.
Standalone Functions
- plot_predictions(predictions, locator, out_prefix, samples=None, n_samples=9, n_cols=3, plot_map=False, width=5, height=4, dpi=300, n_levels=3, show=None)[source]
Plot locator predictions from jackknife, bootstrap, or windows analyses.
This function visualizes predictions from any of locator’s prediction methods that generate multiple predictions per sample. It creates a grid of subplots, one per sample, showing the distribution of predictions as KDE contours.
The function expects prediction data with:
A ‘sampleID’ column
Multiple prediction columns (‘x_0’, ‘x_1’… and ‘y_0’, ‘y_1’…)
For each sample, the plot shows:
KDE contours of predictions (blue lines)
True location if known (red star)
All training sample locations (gray circles)
- Parameters:
predictions (pandas.DataFrame or str) – DataFrame or path to predictions file. Output from any of:
locator.run_jacknife(return_df=True)
locator.run_bootstraps(return_df=True)
locator.run_windows(return_df=True)
locator (Locator) – Locator instance containing training data configuration
out_prefix (str) – Prefix for output files. Plot saved as {out_prefix}_predictions.pdf
samples (list, optional) – List of sample IDs to plot. If None, randomly selects n_samples
n_samples (int) – Number of samples to plot if samples not specified. Default: 9
n_cols (int) – Number of columns in plot grid. Default: 3
plot_map (bool) – Whether to plot on a geographic map (requires cartopy). Default: False
width (float) – Width of each subplot in inches. Default: 5
height (float) – Height of each subplot in inches. Default: 4
dpi (int) – DPI resolution for output figure. Default: 300
n_levels (int) – Number of KDE contour levels to plot. Default: 3
show (bool or None) – Whether to display plot. None=auto-detect environment. Default: None
- Returns:
Saves plot to file and optionally displays it
- Return type:
None
Examples
For jackknife analysis:
predictions = locator.run_jacknife(genotypes, samples, return_df=True)
plot_predictions(predictions, locator, "jacknife_example")
For bootstrap analysis:
predictions = locator.run_bootstraps(genotypes, samples, return_df=True)
plot_predictions(predictions, locator, "bootstrap_example")
For windows analysis:
predictions = locator.run_windows(genotypes, samples, return_df=True)
plot_predictions(predictions, locator, "windows_example")
Plot specific samples:
plot_predictions(predictions, locator, "selected", samples=['HG001', 'HG002', 'HG003'])
Note
Requires matplotlib and scipy for KDE calculation
If plot_map=True, requires cartopy for geographic projections
Automatically adjusts plot limits based on prediction ranges
KDE may fail for samples with very few predictions
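The expected input layout (one 'sampleID' column plus paired 'x_0', 'y_0', 'x_1', 'y_1', … columns) can be illustrated without the library. The following is a minimal sketch, assuming one row is available as a plain dict; the helper name replicate_points is hypothetical and is not part of Locator.

```python
import re


def replicate_points(row):
    """Collect (x_i, y_i) pairs from one wide-format prediction row.

    Hypothetical helper for illustration only; not a Locator function.
    """
    indices = []
    for key in row:
        match = re.fullmatch(r"x_(\d+)", key)
        # Keep an index only when the matching y column is present too
        if match and f"y_{match.group(1)}" in row:
            indices.append(int(match.group(1)))
    return [(row[f"x_{i}"], row[f"y_{i}"]) for i in sorted(indices)]


# One sample with two replicate predictions
row = {"sampleID": "HG001", "x_0": 10.1, "y_0": 45.2, "x_1": 10.4, "y_1": 45.0}
points = replicate_points(row)
```

Each sample's list of points is what a density contour would be fitted to; a sample with only one or two points gives little to estimate from, which is consistent with the note that KDE may fail for samples with very few predictions.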
- plot_error_summary(predictions, sample_data, out_prefix=None, plot_map=True, width=20, height=10, dpi=300, use_geodesic=True, include_training_locs=True, show=None, return_merged=False)[source]
Plot summary of prediction errors from holdout analysis.
Creates a comprehensive error visualization with two panels:
Map/Scatter panel: Shows true locations colored by prediction error, with lines connecting true and predicted locations
Histogram panel: Distribution of errors with summary statistics
This function is designed for analyzing results from holdout methods like:
run_holdouts()
run_k_fold_holdouts()
run_leave_one_out()
- Parameters:
predictions (pandas.DataFrame) – DataFrame with columns:
sampleID: Sample identifiers
x_pred: Predicted longitude
y_pred: Predicted latitude
sample_data (pandas.DataFrame or str) – DataFrame or path to TSV file with columns:
sampleID: Sample identifiers (must match predictions)
x: True longitude
y: True latitude
out_prefix (str, optional) – Prefix for output files. If provided, saves as {out_prefix}_error_summary.png (or .html for interactive). Default: None
plot_map (bool) – Whether to plot on a geographic map using cartopy projection. If False, uses regular scatter plot. Default: True
width (float) – Figure width in inches. Default: 20
height (float) – Figure height in inches. Default: 10
dpi (int) – Figure resolution in dots per inch. Default: 300
use_geodesic (bool) – If True, calculate geodesic distances in kilometers. If False, use Euclidean distances in coordinate units. Default: True
include_training_locs (bool) – Whether to plot training locations (gray circles) and use their extent for map bounds. Default: True
show (bool or None) – Whether to display plot. None=auto-detect environment, True=always show, False=never show. Default: None
return_merged (bool) – If True, return the internal merged DataFrame used for plotting. Default: False
- Returns:
None by default (the plot is saved to file and optionally displayed).
If return_merged is True, returns the internal merged DataFrame containing prediction errors and true locations.
- Raises:
ValueError – If predictions or sample_data are empty, have missing columns, or have no matching samples
Examples
Basic usage with k-fold results:
predictions = locator.run_k_fold_holdouts(genotypes, samples, return_df=True)
plot_error_summary(predictions, "samples.tsv", "kfold_errors")
With DataFrame input and Euclidean distances:
plot_error_summary(predictions, sample_df, out_prefix="holdout_errors", use_geodesic=False)
Without map projection:
plot_error_summary(predictions, sample_df, plot_map=False, width=10, height=5)
Return merged DataFrame:
merged = plot_error_summary(predictions, sample_df, return_merged=True)
Note
Summary statistics shown: mean, median, max error, R² for x and y
Training locations help visualize geographic sampling bias
Geodesic distances account for Earth’s curvature
Map projection requires cartopy to be installed
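To make the use_geodesic distinction concrete, the sketch below contrasts a great-circle distance in kilometers with a Euclidean distance in coordinate units. It uses the haversine formula on a spherical Earth, which is a simpler stand-in and an assumption of this example; the geodesic calculation used by plot_error_summary is not shown in this reference and may differ (for example, by using an ellipsoidal model). The function names and the radius constant are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def euclidean(x1, y1, x2, y2):
    """Straight-line distance in coordinate units (degrees for lon/lat)."""
    return math.hypot(x2 - x1, y2 - y1)


# One degree of longitude at the equator and at 60 degrees north
d_equator = haversine_km(0.0, 0.0, 1.0, 0.0)
d_60n = haversine_km(0.0, 60.0, 1.0, 60.0)
d_units = euclidean(0.0, 60.0, 1.0, 60.0)
```

The same one-degree offset is about half as long on the ground at 60° N as at the equator, while the Euclidean value is 1.0 in both cases. This is why errors in coordinate units are hard to compare across latitudes.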
- plot_sample_weights(locator, out_prefix=None, plot_map=True, width=5, height=3, dpi=300, show=None)[source]
Plot sample weights assigned to training locations.
Visualizes the geographic distribution of sample weights used during training. This is useful for understanding which regions are upweighted or downweighted based on sampling density.
Sample weights are typically computed using:
Kernel density (KD) method: Upweights samples in sparse regions
Histogram binning method: Based on 2D histogram counts
The plot uses a log-scale color mapping to better show weight variations.
- Parameters:
locator (Locator) – Locator instance that has been trained with sample weighting enabled. Must have computed sample_weights attribute.
out_prefix (str, optional) – Prefix for output files. If provided, saves as {out_prefix}_sample_weights.png. Default: None
plot_map (bool) – Whether to plot on a geographic map using cartopy projection. If False, uses regular scatter plot with equal aspect ratio. Default: True
width (float) – Figure width in inches. Default: 5
height (float) – Figure height in inches. Default: 3
dpi (int) – Figure resolution in dots per inch. Default: 300
show (bool or None) – Whether to display plot. None=auto-detect environment, True=always show, False=never show. Default: None
- Returns:
Saves plot to file and optionally displays it
- Return type:
None
- Raises:
ValueError – If locator doesn’t have computed sample weights, or if required data is missing
Examples
After training with KDE weighting:
config = {
    "weight_samples": {
        "enabled": True,
        "method": "KD"
    }
}
locator = Locator(config)
locator.train(genotypes, samples)
plot_sample_weights(locator, "kde_weights")
With histogram binning weights:
config = {
    "weight_samples": {
        "enabled": True,
        "method": "hist",
        "xbins": 20,
        "ybins": 20
    }
}
locator = Locator(config)
locator.train(genotypes, samples)
plot_sample_weights(locator, "hist_weights", plot_map=False)
Note
Requires that locator was trained with weight_samples enabled
Log scale coloring helps visualize large weight variations
Higher weights (yellow) indicate undersampled regions
Lower weights (purple) indicate oversampled regions
Map projection requires cartopy to be installed
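The idea behind histogram-based weighting, that samples in crowded bins are downweighted and samples in sparse bins are upweighted, can be sketched as follows. This is an illustration under stated assumptions, not Locator's implementation: the inverse-count rule, the rescaling to a mean weight of 1, and the function name histogram_weights are all choices made for this example, and options such as lam and bandwidth are not modeled.

```python
from collections import Counter


def histogram_weights(xs, ys, xbins=10, ybins=10):
    """Inverse-count weights from a 2D histogram, scaled to mean 1.

    Illustrative only; the weighting formula is an assumption.
    """
    def bin_index(value, lo, hi, nbins):
        if hi == lo:
            return 0
        # Clamp so the maximum value falls in the last bin
        return min(int((value - lo) / (hi - lo) * nbins), nbins - 1)

    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    cells = [
        (bin_index(x, x_lo, x_hi, xbins), bin_index(y, y_lo, y_hi, ybins))
        for x, y in zip(xs, ys)
    ]
    counts = Counter(cells)
    raw = [1.0 / counts[cell] for cell in cells]
    scale = len(raw) / sum(raw)  # rescale so the weights average to 1
    return [w * scale for w in raw]


# Three clustered samples and one isolated sample
xs = [0.0, 0.1, 0.2, 9.0]
ys = [0.0, 0.1, 0.2, 9.0]
weights = histogram_weights(xs, ys, xbins=2, ybins=2)
```

Here the isolated sample receives a weight three times that of each clustered sample, matching the description above of higher weights for undersampled regions.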
- kde_predict(x_coords, y_coords, xlim=(0, 50), ylim=(0, 50), n_points=100)[source]
Calculate kernel density estimate of predictions.
This is a helper function used internally by plot_predictions() to compute kernel density estimates for visualizing prediction uncertainty.
- Parameters:
x_coords (array-like) – Array of x coordinates (longitude values)
y_coords (array-like) – Array of y coordinates (latitude values)
xlim (tuple) – Tuple of (min, max) x values for grid. Default: (0, 50)
ylim (tuple) – Tuple of (min, max) y values for grid. Default: (0, 50)
n_points (int) – Number of points for density estimation grid. Default: 100
- Returns:
A 3-tuple containing:
x_grid (numpy.ndarray): X coordinates of the mesh grid
y_grid (numpy.ndarray): Y coordinates of the mesh grid
density (numpy.ndarray): Density values at each grid point
Returns (None, None, None) if KDE calculation fails.
- Return type:
tuple
Note
The function uses scipy.stats.gaussian_kde for density estimation. Grid limits should match the geographic extent of your predictions.
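For readers unfamiliar with grid-based density estimation, the sketch below shows the general shape of the computation in pure Python. It is not kde_predict and not scipy.stats.gaussian_kde: it uses a fixed, isotropic bandwidth (an assumption of this example, whereas scipy derives the bandwidth from the data), and it returns the two grid axes as 1D lists rather than 2D mesh grids. The name kde_grid and the bandwidth parameter are hypothetical.

```python
import math


def kde_grid(x_coords, y_coords, xlim=(0, 50), ylim=(0, 50),
             n_points=100, bandwidth=1.0):
    """Fixed-bandwidth Gaussian KDE evaluated on a regular grid.

    Returns (xs, ys, density), where density[j][i] is the estimate at
    (xs[i], ys[j]), or (None, None, None) if the input is unusable.
    """
    if len(x_coords) != len(y_coords) or len(x_coords) < 2 or n_points < 2:
        return None, None, None
    step_x = (xlim[1] - xlim[0]) / (n_points - 1)
    step_y = (ylim[1] - ylim[0]) / (n_points - 1)
    xs = [xlim[0] + i * step_x for i in range(n_points)]
    ys = [ylim[0] + j * step_y for j in range(n_points)]
    # Normalization for a 2D isotropic Gaussian kernel averaged over points
    norm = 1.0 / (2 * math.pi * bandwidth ** 2 * len(x_coords))
    density = [
        [
            norm * sum(
                math.exp(-((gx - px) ** 2 + (gy - py) ** 2)
                         / (2 * bandwidth ** 2))
                for px, py in zip(x_coords, y_coords)
            )
            for gx in xs
        ]
        for gy in ys
    ]
    return xs, ys, density


# Three predictions clustered around (25, 25), on an 11 x 11 grid
xs, ys, density = kde_grid([24.0, 25.0, 26.0], [25.0, 25.0, 25.0], n_points=11)
peak = max(max(row) for row in density)
```

With the default limits, a grid that does not cover the predictions would produce near-zero density everywhere, which illustrates why the grid limits should match the geographic extent of the predictions.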
PlottingMixin Class
- class PlottingMixin[source]
Bases:
object
Mixin class providing plotting functionality for Locator.
This mixin is inherited by the main Locator class to provide visualization methods for training history and Jupyter notebook integration.
- _repr_html_: Generate rich HTML representation for Jupyter notebooks
Configuration Options
This section provides an overview of the available configuration options.
Default Configuration
The default configuration for Locator includes:
{
    # Data parameters
    "train_split": 0.9,
    "batch_size": 32,
    "min_mac": 2,
    "max_SNPs": None,
    "impute_missing": False,
    # Network architecture
    "width": 256,
    "nlayers": 8,
    "dropout_prop": 0.25,
    # Training parameters
    "max_epochs": 5000,
    "patience": 100,
    "learning_rate": 0.001,
    "min_epochs": 10,
    "min_delta": 1e-4,
    "restore_best_weights": True,
    # Optimizer parameters
    "optimizer_algo": "adam",
    "weight_decay": 0.004,
    # Output control
    "keras_verbose": 1,
    "prediction_frequency": 1,
    # Validation
    "validation_split": 0.1,
    # Data augmentation
    "augmentation": {
        "enabled": False,
        "flip_rate": 0.05,
    },
    # Sample weighting
    "weight_samples": {
        "enabled": False,
        "method": "KD",
        "xbins": 10,
        "ybins": 10,
        "lam": 1.0,
        "bandwidth": None,
        "weightdf": None,
    },
    # Range penalty
    "use_range_penalty": False,
    "species_range_shapefile": None,
    "resolution": 0.05,
    "penalty_weight": 1.0,
    "out": "locator",
    # NA handling
    "na_action": "separate",
    # GPU optimization (enabled by default)
    "use_mixed_precision": True,
    "gpu_batch_size": "auto",
    "gradient_accumulation_steps": 1,
    "gpu_memory_mode": "growth",
    "enable_xla": False,
    # Performance optimization
    "optimize_tf_parallelism": True,
    "holdout_no_intermediate_saves": True,
    "save_fold_models": True,
    # Verbosity control
    "verbose_splits": False,
    "verbose_batch_size": False,
}
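Because some options are nested (for example weight_samples and augmentation), a user configuration that sets only one nested key should still keep the remaining nested defaults. How Locator combines user settings with these defaults is not described here; the sketch below shows one common approach, a recursive merge, using a small subset of the defaults. The name merge_config is hypothetical.

```python
import copy

# Subset of the defaults listed above, for illustration
DEFAULTS = {
    "train_split": 0.9,
    "batch_size": 32,
    "weight_samples": {"enabled": False, "method": "KD", "xbins": 10, "ybins": 10},
}


def merge_config(defaults, overrides):
    """Return a new dict: defaults updated recursively with overrides."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


config = merge_config(
    DEFAULTS,
    {"batch_size": 64, "weight_samples": {"enabled": True}},
)
```

After the merge, weight_samples is enabled while method, xbins, and ybins keep their default values, and the DEFAULTS dictionary itself is left unmodified.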
Input Formats
Genotype Data
Supported input formats for genotype data:
VCF files (.vcf or .vcf.gz)
Zarr format (recommended for large datasets)
Pandas DataFrame with:
Samples as index
SNP positions as columns
Genotype counts (0, 1, 2) as values
Sample Data
Required format for sample coordinate data:
Tab-delimited file or DataFrame with columns:
sampleID: Sample identifier
x: Longitude
y: Latitude
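A quick way to check that a sample file has the required layout before passing it to Locator is to parse it with the standard library. The sketch below is an independent illustration, not Locator's loader; the function name read_sample_data is hypothetical, and it assumes every coordinate is numeric (rows with missing locations would need separate handling).

```python
import csv
import io

REQUIRED_COLUMNS = ("sampleID", "x", "y")


def read_sample_data(handle):
    """Parse tab-delimited sample data into {sampleID: (x, y)}."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    # x is longitude and y is latitude, per the format described above
    return {row["sampleID"]: (float(row["x"]), float(row["y"])) for row in reader}


# An in-memory stand-in for a file such as samples.txt
text = "sampleID\tx\ty\nsample1\t-122.3\t47.6\nsample2\t-71.1\t42.4\n"
coords = read_sample_data(io.StringIO(text))
```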
Output Formats
Prediction Results
Default output files:
{out}_predlocs.txt: Main predictions
{out}_history.txt: Training history
{out}_fitplot.pdf: Training plots
{out}.weights.h5: Model weights
For special analyses:
{out}_bootstrap_predlocs.csv: Bootstrap results
{out}_jacknife_predlocs.csv: Jacknife results
{out}_windows_predlocs.csv: Windowed analysis results
{out}_holdout_predlocs.csv: Holdout analysis results
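The naming patterns above can be expanded for a given "out" prefix to see which files to expect after a run. This is a small convenience sketch built only from the patterns listed in this section; the function name expected_outputs and the dictionary keys are hypothetical.

```python
def expected_outputs(out):
    """Map a short label to each output filename for the given prefix."""
    return {
        "predictions": f"{out}_predlocs.txt",
        "history": f"{out}_history.txt",
        "fitplot": f"{out}_fitplot.pdf",
        "weights": f"{out}.weights.h5",
        "bootstrap": f"{out}_bootstrap_predlocs.csv",
        "jacknife": f"{out}_jacknife_predlocs.csv",  # spelling as used by the library
        "windows": f"{out}_windows_predlocs.csv",
        "holdout": f"{out}_holdout_predlocs.csv",
    }


paths = expected_outputs("analysis_1")
```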