API Reference

Core Module

setup_gpu(gpu_number=None)[source]

Configure GPU settings for optimal usage.

Parameters:: gpu_number (int or str, optional) – GPU index to use (0-based). If None, the first available GPU is used.
Returns:: True if a GPU is available and successfully configured, otherwise False.
Return type:: bool

Locator

class Locator(config=None)[source]

Bases: DataLoaderMixin, TrainingMixin, PredictionMixin, AnalysisMixin, EnsembleMixin, PlottingMixin

A class for predicting geographic locations from genetic data.

This class implements a neural network approach to predict sample locations from genetic data. It can handle various input formats including:

Genotype data:
- VCF or VCF.gz files
- Zarr format
- Pandas DataFrame with samples as index, SNP positions as columns
Sample location data:
- Tab-delimited file
- Pandas DataFrame

The model can be configured through a dictionary of parameters passed during initialization. Sample location data can be provided either as a file path or as a pandas DataFrame.

Variables:

config (dict)
model (keras.Model)
history (keras.callbacks.History)
samples (numpy.ndarray)
sdlat (float)
(float)
(float)
(float)

Example

>>> # Using a file path for sample data
>>> locator = Locator({
...     "out": "analysis_1",
...     "sample_data": "samples.txt",
...     "zarr": "genotypes.zarr"
... })

>>> # Using a DataFrame for sample data
>>> locator = Locator({
...     "out": "analysis_1",
...     "sample_data": sample_df,  # pandas DataFrame
...     "zarr": "genotypes.zarr"
... })

>>> # Using DataFrames for both inputs
>>> # Coordinate DataFrame must have columns: sampleID, x, y
>>> coords_df = pd.DataFrame({
...     "sampleID": ["sample1", "sample2"],
...     "x": [longitude1, longitude2],
...     "y": [latitude1, latitude2]
... })
>>>
>>> # Genotype DataFrame has samples as index, SNP positions as columns
>>> geno_df = pd.DataFrame({
...     1001: [0, 1],    # SNP position 1001
...     2001: [1, 2],    # SNP position 2001
... }, index=["sample1", "sample2"])
>>>
>>> locator = Locator({
...     "out": "analysis_1",
...     "sample_data": coords_df,
...     "genotype_data": geno_df
... })

__init__(config=None)[source]

Initialize Locator with configuration parameters.

Parameters:: config (dict, optional) – Configuration dictionary that can include the following keys:

Top-level keys:

sample_data (str or pandas.DataFrame): Path to sample data file or a DataFrame with columns ‘sampleID’, ‘x’, ‘y’.
genotype_data (pandas.DataFrame): DataFrame with samples as index, SNP positions as columns, and genotype counts (0, 1, 2) as values.
zarr (str): Path to Zarr format genotype data.
vcf (str): Path to VCF format genotype data.
out (str): Output root name for all output files.
train_split (float): Proportion of data to use for training.
batch_size (int): Batch size for training.
max_epochs (int): Maximum number of training epochs.
patience (int): Patience for early stopping.
min_mac (int): Minimum minor allele count for SNP filtering.
max_SNPs (int): Maximum number of SNPs to use.
width (int): Width of neural network layers.
nlayers (int): Number of neural network layers.
dropout_prop (float): Dropout proportion.
pca_components (int or “auto”): If set, prepend a PCA-initialized linear projection of this width as the first layer and fine-tune it. Use "auto" to pick the width from the genotype-PCA scree elbow. Recommended when n_SNPs >> n_samples. Default None (disabled).
pca_finetune (bool): Whether to unfreeze the PCA projection for a low-learning-rate fine-tuning phase. Default True. False keeps the projection frozen at its PCA initialization.
pca_finetune_lr (float): Learning rate for the PCA fine-tuning phase. Default 1e-4.
keras_verbose (int): Verbosity level for Keras training.
impute_missing (bool): Whether to impute missing genotypes.
validation_split (float): Proportion of data to use for validation.
learning_rate (float): Learning rate for the optimizer.
min_epochs (int): Minimum number of epochs to train.
patience (int): Number of epochs with no improvement to wait before stopping.
min_delta (float): Minimum change in validation loss to qualify as an improvement.
restore_best_weights (bool): Whether to restore model weights from the epoch with the best validation loss.
prediction_frequency (int): Frequency (in epochs) of making predictions during training.
optimizer_algo (str): Optimizer algorithm to use (“adam” or “adamw”).
weight_decay (float): Weight decay coefficient for AdamW optimizer.
augmentation (dict): Dictionary of augmentation parameters:
- enabled (bool): Whether data augmentation is enabled.
- flip_rate (float): Rate at which to randomly flip genotypes during augmentation.
weight_samples (dict): Dictionary of sample weighting parameters:
- enabled (bool): Whether to weight samples by distance.
- method (str): Method for weighting samples (“KD”, “histogram”, “df”).
- xbins (int): Number of bins for histogram.
- ybins (int): Number of bins for histogram.
- lam (float): Exponent for weights.
- bandwidth (float): Bandwidth for KDE.
- weightdf (pandas.DataFrame): DataFrame containing sample weights.
use_range_penalty (bool): Whether to apply a range penalty in the loss function.
penalty_weight (float): Weight assigned to the range penalty term.
species_range_geom (shapely.geometry): Shapely geometry object defining the valid species range.
na_action (str): How to handle samples without coordinates. Options:
- ‘separate’ (default): Include all samples, train on known, predict unknown.
- ‘exclude’: Only use samples with known coordinates.
- ‘fail’: Raise error if any samples lack coordinates.

property sample_data: DataFrame

Returns the sample data as a pandas DataFrame.

Returns:: The sample data DataFrame with columns [‘sampleID’, ‘x’, ‘y’, …].
Return type:: pd.DataFrame
Raises:: ValueError – If sample data is not available.

Example

>>> locator = Locator({"sample_data": coords_df})
>>> df = locator.sample_data

get_sample_status(samples, sample_data=None)[source]

Analyze sample coordinate status.

This method identifies which samples have known geographic coordinates and which have missing (NA) coordinates. This is useful for understanding your data and for methods that need to handle samples with and without coordinates differently.

Parameters:

samples (numpy.ndarray) – Array of sample IDs from genotype data
sample_data (pandas.DataFrame, optional) – DataFrame with columns ‘sampleID’, ‘x’, ‘y’. If not provided, uses the stored sample data or loads from config.

Returns:

A dictionary containing:

‘known_indices’ (numpy.ndarray): Array indices of samples with coordinates
‘na_indices’ (numpy.ndarray): Array indices of samples without coordinates
‘known_samples’ (numpy.ndarray): Sample IDs with coordinates
‘na_samples’ (numpy.ndarray): Sample IDs without coordinates
‘n_known’ (int): Count of samples with known coordinates
‘n_na’ (int): Count of samples with NA coordinates
‘total’ (int): Total number of samples

Return type:

dict

Example

>>> locator = Locator(config)
>>> status = locator.get_sample_status(samples)
>>> print(f"Found {status['n_known']} samples with coordinates")
>>> print(f"Found {status['n_na']} samples without coordinates")
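To make the shape of the returned dictionary concrete, here is a minimal stand-alone sketch of the same bookkeeping. It is not the library's implementation: sketch_sample_status is a hypothetical helper, it uses plain lists instead of numpy arrays, and it assumes coordinates arrive as a dict mapping sample ID to an (x, y) tuple, with absent IDs or NaN values treated as NA.

```python
import math


def sketch_sample_status(samples, coords):
    """Hypothetical helper: split sample IDs by whether coordinates are known.

    samples: list of sample IDs in genotype order.
    coords:  dict mapping sample ID -> (x, y); absent IDs or NaN values are NA.
    """
    def has_coords(sample_id):
        xy = coords.get(sample_id)
        if xy is None:
            return False
        return all(v is not None and not math.isnan(v) for v in xy)

    known_idx = [i for i, s in enumerate(samples) if has_coords(s)]
    na_idx = [i for i, s in enumerate(samples) if not has_coords(s)]
    return {
        "known_indices": known_idx,
        "na_indices": na_idx,
        "known_samples": [samples[i] for i in known_idx],
        "na_samples": [samples[i] for i in na_idx],
        "n_known": len(known_idx),
        "n_na": len(na_idx),
        "total": len(samples),
    }


samples = ["s1", "s2", "s3"]
coords = {"s1": (-122.3, 47.6), "s2": (float("nan"), float("nan"))}
status = sketch_sample_status(samples, coords)
print(f"{status['n_known']} known, {status['n_na']} without coordinates")
```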

check_data(genotypes, samples, verbose=True)[source]

Check data quality and report statistics.

This is a convenience method to help users understand their data before running analyses. It reports the number of samples, SNPs, and identifies samples with missing coordinates.

Parameters:

genotypes (numpy.ndarray or allel.GenotypeArray) – Genotype data
samples (numpy.ndarray) – Array of sample IDs
verbose (bool) – If True, print detailed statistics. Default: True

Returns:

Sample status dictionary from get_sample_status()

Return type:

dict

Example

>>> locator = Locator(config)
>>> genotypes, samples = locator.load_genotypes()
>>> status = locator.check_data(genotypes, samples)
Data Summary
==================================================
Total samples: 231
Samples with coordinates: 211
Samples without coordinates: 20
Total SNPs: 1000

Current NA handling mode: separate
- Will train on samples with known locations
- Can predict on samples without locations

Samples without coordinates (first 10):
  - sample_001
  - sample_002
  ...

create_ensemble_early_stopping(patience_multiplier=1.5)

Create early stopping callback with ensemble-specific settings.

Parameters:: patience_multiplier – Multiply base patience for ensemble training (ensembles often benefit from longer training)
Returns:: Configured callback
Return type:: keras.callbacks.EarlyStopping

create_ensemble_folds(genotypes, samples, k=5, training_set_indices=None, na_action=None)

Create k-fold splits for ensemble training using IndexSet.

Parameters:

genotypes – GenotypeArray containing genetic data
samples – Array of sample IDs
k – Number of folds (default: 5)
training_set_indices – Optional array of indices to use for training+validation. If provided, only these samples will be used to create k-folds.
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

Dictionary with fold information:

‘index_sets’: List of IndexSet objects for each fold
‘fold_indices’: Legacy format dict for backward compatibility
‘sample_status’: Sample status information

Return type:

dict
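The idea behind k-fold splitting can be sketched without the library. The helper below is hypothetical and uses plain index lists rather than IndexSet objects; the shuffling and the way indices are dealt into folds are assumptions made for illustration, not a description of how create_ensemble_folds works internally.

```python
import random


def sketch_k_folds(n_samples, k=5, seed=0):
    """Hypothetical helper: shuffle indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at offset i.
    folds = [sorted(indices[i::k]) for i in range(k)]
    splits = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_samples) if i not in held]
        splits.append({"train": train, "val": held_out})
    return splits


splits = sketch_k_folds(n_samples=23, k=5)
for fold_idx, split in enumerate(splits):
    print(fold_idx, len(split["train"]), len(split["val"]))
```

Each sample lands in exactly one validation fold, so every model in the ensemble is validated on data the other folds trained on.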

create_ensemble_lr_scheduler(fold_idx)

Create learning rate scheduler for ensemble training.

Each fold can start with a slightly different learning rate to improve ensemble diversity.

Parameters:: fold_idx – Current fold index
Returns:: Configured callback
Return type:: keras.callbacks.ReduceLROnPlateau

get_ensemble_batch_size(dataset_size, fold_idx=0)

Determine optimal batch size for ensemble training.

Uses GPUOptimizer to find the best batch size, with caching to avoid recomputing for each fold.

Parameters:

dataset_size – Size of training dataset
fold_idx – Current fold index (for logging)

Returns:

Optimal batch size

Return type:

int

load_ensemble(ensemble_path)

Load a saved ensemble for prediction.

Parameters:: ensemble_path – Path to the saved ensemble directory
Returns:: Ensemble information including models and parameters
Return type:: dict

load_genotypes(vcf=None, zarr=None, matrix=None, microsat=None, microsat_min_allele_freq=0.01, gl=None, bam_list=None, gl_mode='dosage', gl_missing_threshold=0.4, gl_min_maf=0.01, gl_max_missing_frac=0.1)

Load genotype data from various input sources.

This method can load genotype data from:

1. A stored DataFrame provided during initialization
2. A VCF file
3. A zarr file (scikit-allel or bio2zarr format)
4. A tab-delimited matrix file
5. A tab-delimited microsatellite genotype table
6. An ANGSD beagle GL file paired with a BAM list

For windowed analysis, SNP positions must be available from one of:

- Column names in the genotype DataFrame
- The zarr file’s variants/POS array
- The VCF file’s POS field (automatically loaded)

Parameters:

vcf (str, optional) – Path to VCF format genotype data
zarr (str, optional) – Path to zarr format genotype data
matrix (str, optional) – Path to tab-delimited matrix file
microsat (str, optional) – Path to tab-delimited microsatellite genotype table
microsat_min_allele_freq (float, optional) – Drop microsat alleles below this per-locus frequency. Default 0.01.
gl (str, optional) – Path to ANGSD -doGlf 2 beagle.gz file
bam_list (str, optional) – Path to BAM file list used in ANGSD run (one path per line). Required when gl is provided. Sample IDs are derived from Path(bam).stem.
gl_mode (str) – GL encoding mode, one of "dosage" (default) or "full_gl". "dosage" returns one expected-dosage value per site per sample; "full_gl" returns all three AA/AB/BB GL probabilities as separate rows.
gl_missing_threshold (float) – GL site filter; a sample at a site is missing if max(GL_AA, GL_AB, GL_BB) < gl_missing_threshold. Ignored unless gl is set. Default 0.4.
gl_min_maf (float) – GL site filter; drop sites whose mean-dosage MAF falls below this. Ignored unless gl is set. Default 0.01.
gl_max_missing_frac (float) – GL site filter; drop sites whose fraction of missing samples exceeds this. Ignored unless gl is set. Default 0.10.

Returns:

(genotypes, samples) where:

genotypes is an allel.GenotypeArray of shape (n_sites, n_samples, 2) for VCF/zarr/integer-matrix inputs, or a float32 ndarray of shape (n_sites, n_samples) for continuous-dosage inputs (matrix float, microsat, or GL)
samples is a numpy array of sample IDs

Return type:

tuple

Examples

>>> # Using stored DataFrame from initialization
>>> locator = Locator({
...     "genotype_data": geno_df,  # DataFrame with genotypes
...     "sample_data": coords_df   # DataFrame with coordinates
... })
>>> genotypes, samples = locator.load_genotypes()

>>> # Using zarr file (recommended for windowed analysis)
>>> locator = Locator({"sample_data": coords_df})
>>> genotypes, samples = locator.load_genotypes(zarr="path/to/geno.zarr")

>>> # Using VCF file
>>> genotypes, samples = locator.load_genotypes(vcf="path/to/geno.vcf")

>>> # Using matrix file
>>> genotypes, samples = locator.load_genotypes(matrix="path/to/geno.txt")

>>> # Using microsatellite genotypes
>>> genotypes, samples = locator.load_genotypes(microsat="path/to/microsats.tsv")

>>> # Using ANGSD genotype-likelihood file
>>> genotypes, samples = locator.load_genotypes(
...     gl="output.beagle.gz", bam_list="bams.txt", gl_mode="dosage"
... )

Raises:: ValueError – If no input source is provided or if the input format is invalid.
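The gl_missing_threshold rule and the "dosage" encoding described above can be illustrated for a single sample at a single site. This is a sketch under stated assumptions, not the loader's code: sketch_gl_dosage is a hypothetical helper, the dosage is assumed to count copies of the B allele, and the likelihoods are renormalized to sum to 1 before taking the expectation.

```python
def sketch_gl_dosage(gl_aa, gl_ab, gl_bb, missing_threshold=0.4):
    """Hypothetical helper: expected dosage for one sample at one site.

    Returns None when the call is treated as missing, i.e. when the largest
    of the three genotype likelihoods is below missing_threshold.
    """
    if max(gl_aa, gl_ab, gl_bb) < missing_threshold:
        return None
    total = gl_aa + gl_ab + gl_bb
    if total <= 0:
        return None
    # Assumed convention: expected count of the B allele,
    # 0*P(AA) + 1*P(AB) + 2*P(BB), after renormalizing.
    return (gl_ab + 2.0 * gl_bb) / total


confident_het = sketch_gl_dosage(0.05, 0.90, 0.05)   # clearly heterozygous
uninformative = sketch_gl_dosage(0.34, 0.33, 0.33)   # max GL below 0.4
confident_hom = sketch_gl_dosage(0.0, 0.0, 1.0)      # homozygous BB
print(confident_het, uninformative, confident_hom)
```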

load_model(weights_path)

Load a trained model from saved weights.

This method loads a model from HDF5 weights file and restores the preprocessing parameters needed for making predictions.

Parameters:: weights_path (str) – Path to the saved HDF5 weights file
Returns:: Dictionary containing loaded metadata including normalization params
Return type:: dict
Raises:: ValueError – If weights file cannot be loaded or is missing metadata.

predict(boot=0, verbose=True, prediction_genotypes=None, genotypes=None, samples=None, indices=None, return_df=False, save_preds_to_disk=True, site_order=None)

Make predictions for samples with unknown locations.

Parameters:

boot (int, optional) – Bootstrap replicate number. Defaults to 0.
verbose (bool, optional) – Whether to print validation metrics. Defaults to True.
prediction_genotypes (numpy.ndarray, optional) – DEPRECATED: use the genotypes parameter instead. Overrides the default prediction genotypes; used for jackknife resampling. Defaults to None.
genotypes (numpy.ndarray, optional) – Full genotype array for creating tf.data dataset. Should be the original unfiltered genotypes. Defaults to None.
samples (numpy.ndarray, optional) – Sample IDs corresponding to genotypes. Defaults to None.
indices (numpy.ndarray, optional) – Indices of samples to predict on. If None, predicts on samples without coordinates (self.pred_indices). Defaults to None.
return_df (bool, optional) – Whether to return predictions as pandas DataFrame. Defaults to False.
save_preds_to_disk (bool, optional) – Whether to save predictions to disk. Defaults to True.
site_order (np.ndarray, optional) – Array of SNP indices for bootstrap resampling. If provided, SNPs will be reordered according to these indices during prediction. Used for bootstrap analyses to ensure consistent resampling between train and predict.

Returns:

Array of predicted coordinates, or a DataFrame with x, y coordinates and sampleID columns

Return type:

numpy.ndarray or pandas.DataFrame

predict_ensemble(genotypes=None, samples=None, indices=None, include_fold_predictions=False, return_std=False, return_df=True, save_predictions=True)

Make predictions using the ensemble of models.

Parameters:

genotypes – GenotypeArray for prediction (if None, uses stored data)
samples – Sample IDs (if None, uses stored samples)
indices – Specific indices to predict on (if None, predicts all)
include_fold_predictions – Include individual fold predictions in output
return_std – Return standard deviation across ensemble predictions
return_df – Return results as DataFrame (default: True)
save_predictions – Save predictions to disk (default: True)

Returns:

Ensemble predictions with optional std

Return type:

pd.DataFrame or np.ndarray

predict_ensemble_from_manager(genotypes, samples, indices=None, return_df=True, save_predictions=True)

Make predictions using loaded ensemble with model manager.

This method efficiently loads models on-demand for prediction, reducing memory usage for large ensembles.

Parameters:

genotypes – GenotypeArray for prediction
samples – Sample IDs
indices – Specific indices to predict on (if None, predicts all)
return_df – Return results as DataFrame (default: True)
save_predictions – Save predictions to disk (default: True)

Returns:

Ensemble predictions

Return type:

pd.DataFrame or np.ndarray

predict_from_weights(weights_path, genotypes, samples, sample_data_file=None, save_preds_to_disk=True, return_df=True)

Convenience method to load weights and make predictions.

This method combines loading a saved model and making predictions in a single call. It handles preprocessing the genotypes using the same parameters that were used during training.

Parameters:

weights_path (str) – Path to saved HDF5 weights file
genotypes (numpy.ndarray) – Genotype data to predict on
samples (numpy.ndarray) – Sample IDs corresponding to genotypes
sample_data_file (str, optional) – Path to sample data file
save_preds_to_disk (bool) – Whether to save predictions to disk
return_df (bool) – Whether to return predictions as DataFrame

Returns:

Predictions

Return type:

numpy.ndarray or pandas.DataFrame

predict_holdout(verbose=True, return_df=False, save_preds_to_disk=True, plot_summary=True, plot_map=True)

Predict locations for held out samples.

Parameters:

verbose – Print progress and metrics
return_df – Return predictions as pandas DataFrame
save_preds_to_disk – Save predictions to disk
plot_summary – Display error summary plot in notebook (only if return_df=True)
plot_map – Display map of predictions (only if plot_summary=True)

Returns:

If return_df is True, returns pandas DataFrame with predictions
Otherwise returns None

run_bootstraps(genotypes, samples, n_bootstraps=50, return_df=False, save_full_pred_matrix=True, na_action=None)

Run bootstrap analysis by resampling SNPs with replacement.

Parameters:

genotypes – Array of genotype data
samples – Sample IDs corresponding to genotypes
n_bootstraps – Number of bootstrap replicates to run
return_df – Whether to return DataFrame with all predictions
save_full_pred_matrix – Whether to save full prediction matrix to disk
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

If return_df=True, returns DataFrame with predictions for each bootstrap, otherwise None

Return type:

pandas.DataFrame or None

Notes

With na_action=’separate’: Trains on samples with known locations, can predict on samples with NA locations
With na_action=’exclude’: Only uses samples with known locations
With na_action=’fail’: Raises error if any NA samples found

run_holdouts(genotypes, samples, k=10, n_reps=10, holdout_indices=None, holdout_sample_ids=None, return_df=False, save_full_pred_matrix=True, na_action=None)

Run multiple holdout replicates for cross-validation.

Parameters:

genotypes – Array of genotype data
samples – Sample IDs corresponding to genotypes
k – Number of samples to hold out in each replicate
n_reps – Number of holdout replicates to run
holdout_indices – Optional list of lists, each containing indices to hold out
holdout_sample_ids – Optional list of sample IDs to hold out. If provided, these specific samples will be held out (overrides k and holdout_indices). Can be a single list (used for all replicates) or list of lists (different samples per replicate).
return_df – Whether to return DataFrame with all predictions
save_full_pred_matrix – Whether to save full prediction matrix to disk
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

If return_df=True, returns DataFrame with predictions for each holdout replicate containing columns:

- sampleID: Sample identifier
- x_pred: Predicted longitude
- y_pred: Predicted latitude
- rep: Replicate number (0 to n_reps-1)

Note: True locations are not included. Merge with sample metadata to calculate errors.

Return type:

pandas.DataFrame or None

Notes

With na_action=’separate’: Currently behaves like ‘exclude’ (holdouts must have known locations). Future versions may support predicting NA samples.
With na_action=’exclude’: Only uses samples with known locations (current behavior)
With na_action=’fail’: Raises error if any NA samples found
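The holdout_indices argument is described as a list of lists, one inner list per replicate. A minimal way to build such a structure is sketched below; sketch_holdout_indices is a hypothetical helper, and drawing each replicate independently from the samples with known locations is an assumption for illustration, not a statement about how run_holdouts selects holdouts by default.

```python
import random


def sketch_holdout_indices(known_indices, k=10, n_reps=10, seed=0):
    """Hypothetical helper: draw k holdout indices for each of n_reps replicates."""
    rng = random.Random(seed)
    # Sampling is without replacement within a replicate, so no index
    # repeats inside one inner list; replicates may overlap each other.
    return [sorted(rng.sample(known_indices, k)) for _ in range(n_reps)]


known_indices = list(range(40))  # indices of samples with known coordinates
holdout_indices = sketch_holdout_indices(known_indices, k=5, n_reps=3)
print(holdout_indices)
```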

run_jacknife(genotypes, samples, prop=0.05, return_df=False, save_full_pred_matrix=True, na_action=None)

Run jackknife analysis by dropping a proportion of SNPs in each replicate.

Parameters:

genotypes – Array of genotype data
samples – Sample IDs corresponding to genotypes
prop (float, optional) – Proportion of SNPs to drop in each replicate. Defaults to 0.05.
return_df (bool, optional) – Whether to return DataFrame of all predictions. Defaults to False.
save_full_pred_matrix (bool, optional) – Whether to save the full prediction matrix. Defaults to True.
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

If return_df=True, returns DataFrame containing all predictions, with columns named ‘x_0’, ‘y_0’, ‘x_1’, ‘y_1’, etc. for each jackknife replicate. Row index contains sample IDs.

Return type:

pandas.DataFrame or None

Notes

With na_action=’separate’: Trains on samples with known locations, can predict on samples with NA locations
With na_action=’exclude’: Only uses samples with known locations
With na_action=’fail’: Raises error if any NA samples found
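One jackknife replicate can be pictured as choosing a random subset of SNPs to drop and keeping the rest. The sketch below shows only that selection step; sketch_jackknife_keep is a hypothetical helper, and rounding prop * n_sites to the nearest integer is an assumption rather than documented behavior.

```python
import random


def sketch_jackknife_keep(n_sites, prop=0.05, seed=None):
    """Hypothetical helper: indices of SNPs kept after dropping a proportion."""
    rng = random.Random(seed)
    n_drop = int(round(prop * n_sites))
    dropped = set(rng.sample(range(n_sites), n_drop))
    # Preserve the original site order among the SNPs that remain.
    return [i for i in range(n_sites) if i not in dropped]


kept = sketch_jackknife_keep(n_sites=1000, prop=0.05, seed=42)
print(len(kept))
```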

run_jacknife_holdouts(genotypes, samples, k=10, prop=0.05, n_boots=50, holdout_indices=None, return_df=False, save_full_pred_matrix=True, na_action=None)

Run jackknife analysis on holdout samples.

Parameters:

genotypes – Array of genotype data
samples – Sample IDs corresponding to genotypes
k – Number of samples to hold out
prop – Proportion of SNPs to drop in each jackknife replicate
n_boots – Number of jackknife replicates
holdout_indices – Optional specific indices to hold out
return_df – Whether to return DataFrame with all predictions
save_full_pred_matrix – Whether to save full prediction matrix to disk
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

If return_df=True, returns DataFrame with predictions for each jackknife replicate containing columns:

- sampleID: Sample identifier
- x_pred: Predicted longitude
- y_pred: Predicted latitude
- boot: Jackknife replicate number (0 to n_boots-1)

Note: True locations are not included. Merge with sample metadata to calculate errors.

Return type:

pandas.DataFrame or None

Notes

With na_action=’separate’: Currently behaves like ‘exclude’ (holdouts must have known locations). Future versions may support predicting NA samples.
With na_action=’exclude’: Only uses samples with known locations (current behavior)
With na_action=’fail’: Raises error if any NA samples found

run_k_fold_holdouts(genotypes, samples, k=10, return_df=False, save_full_pred_matrix=True, verbose=True, na_action=None)

Run true k-fold cross-validation with nonoverlapping holdout sets.

Parameters:

genotypes – Array of genotype data
samples – Sample IDs corresponding to genotypes
k – Number of folds (holdout sets)
return_df – Whether to return DataFrame with all predictions
save_full_pred_matrix – Whether to save full prediction matrix to disk
verbose – Whether to show training progress and intermediate output
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

If return_df=True, returns DataFrame with one prediction per held-out sample containing columns:

- sampleID: Sample identifier
- x_pred: Predicted longitude
- y_pred: Predicted latitude

Note: True locations are not included. To calculate prediction errors, merge the returned DataFrame with your sample metadata using the sampleID column.

Return type:

pandas.DataFrame or None

Notes

With na_action=’separate’: Currently behaves like ‘exclude’ (k-fold requires known locations). Future versions may support predicting NA samples.
With na_action=’exclude’: Only uses samples with known locations (current behavior)
With na_action=’fail’: Raises error if any NA samples found

Example

>>> # Run k-fold cross-validation
>>> predictions = locator.run_k_fold_holdouts(genotypes, samples, k=10, return_df=True)
>>>
>>> # Merge with true locations to calculate errors
>>> sample_data = pd.read_csv('samples.tsv', sep='\t')
>>> merged = predictions.merge(sample_data[['sampleID', 'x', 'y']], on='sampleID')
>>> merged['error_km'] = np.sqrt(
...     (merged['x'] - merged['x_pred'])**2 +
...     (merged['y'] - merged['y_pred'])**2
... ) * 111.32  # Rough conversion: ~111.32 km per degree holds for latitude; longitude degrees shrink away from the equator
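Multiplying a distance in degrees by a constant treats longitude and latitude degrees as equal in length, which overstates east-west error away from the equator. If that matters for your data, a great-circle distance is a common alternative. The sketch below is a standard haversine formula, assuming x is longitude and y is latitude in decimal degrees and a spherical Earth with mean radius 6371.0088 km; haversine_km is a name introduced here, not part of the Locator API.

```python
import math


def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0088):
    """Great-circle distance in km between two (longitude, latitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(a))


# One degree of longitude at the equator versus at 60 degrees north.
at_equator = haversine_km(0.0, 0.0, 1.0, 0.0)
at_60_north = haversine_km(0.0, 60.0, 1.0, 60.0)
print(round(at_equator, 1), round(at_60_north, 1))
```

Applied to the merged DataFrame, the function would be called once per row with (x, y) as the true location and (x_pred, y_pred) as the prediction.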

run_leave_one_out(genotypes, samples, return_df=True, save_full_pred_matrix=True, na_action=None)

Perform leave-one-out cross-validation: for each sample with a known location, train without it and predict its location.

This is a convenience wrapper around run_k_fold_holdouts with k equal to the number of samples with known locations.

Parameters:

genotypes – Array of genotype data
samples – Sample IDs corresponding to genotypes
return_df – Whether to return DataFrame with all predictions
save_full_pred_matrix – Whether to save full prediction matrix to disk
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

DataFrame with predictions for each left-out sample

Return type:

pandas.DataFrame or None

run_windows(genotypes, samples, window_start=0, window_size=500000.0, window_stop=None, respect_chromosomes=True, return_df=False, save_full_pred_matrix=True, na_action=None)

Run windowed prediction analysis.

Parameters:

genotypes – GenotypeArray containing genetic data
samples – Array of sample IDs
window_start – Start position for windows (default: 0)
window_size – Size of windows in base pairs (default: 500kb)
window_stop – Stop position for windows (default: None)
respect_chromosomes – Whether to respect chromosome boundaries when creating windows (default: True). If True, windows will not span chromosome boundaries. Requires chromosome information from VCF/Zarr input.
return_df – Whether to return DataFrame with all predictions
save_full_pred_matrix – Whether to save full prediction matrix to disk
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

If return_df=True, returns DataFrame with predictions for each window, otherwise None

Return type:

pandas.DataFrame or None

Notes

With na_action=’separate’: Trains on samples with known locations, can predict on samples with NA locations
With na_action=’exclude’: Only uses samples with known locations
With na_action=’fail’: Raises error if any NA samples found

Warning

When respect_chromosomes=False, window analysis treats all SNP positions as continuous along a single coordinate axis. If your data contains multiple chromosomes, windows may span across chromosome boundaries. Use respect_chromosomes=True (default) for biologically meaningful windows.
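To show what window_start, window_size, and window_stop control, the sketch below bins SNP positions on a single coordinate axis, which corresponds to one chromosome or to respect_chromosomes=False. It is an illustration only: sketch_windows is a hypothetical helper, and the half-open intervals, the clipping of the last window, and the default stop of one past the largest position are all assumptions.

```python
def sketch_windows(positions, window_start=0, window_size=500_000, window_stop=None):
    """Hypothetical helper: list (start, end, site_indices) for half-open windows."""
    if window_stop is None:
        window_stop = max(positions) + 1
    windows = []
    start = window_start
    while start < window_stop:
        end = min(start + window_size, window_stop)
        # Collect the indices of SNPs whose position falls in [start, end).
        idx = [i for i, pos in enumerate(positions) if start <= pos < end]
        windows.append((start, end, idx))
        start = end
    return windows


positions = [1_200, 480_000, 510_000, 1_250_000]  # made-up SNP positions (bp)
windows = sketch_windows(positions)
for start, end, idx in windows:
    print(start, end, idx)
```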

run_windows_holdouts(genotypes, samples, k=10, window_start=0, window_size=500000.0, window_stop=None, respect_chromosomes=True, holdout_indices=None, holdout_sample_ids=None, return_df=False, save_full_pred_matrix=True, na_action=None)

Run windowed analysis on holdout samples.

Parameters:

genotypes – Array of genotype data
samples – Sample IDs corresponding to genotypes
k – Number of samples to hold out
window_start – Start position for windows
window_size – Size of windows in base pairs
window_stop – Stop position for windows
respect_chromosomes – Whether to respect chromosome boundaries when creating windows (default: True). If True, windows will not span chromosome boundaries. Requires chromosome information from VCF/Zarr input.
holdout_indices – Optional specific indices to hold out
holdout_sample_ids – Optional list of sample IDs to hold out. If provided, these specific samples will be held out (overrides k and holdout_indices).
return_df – Whether to return DataFrame with all predictions
save_full_pred_matrix – Whether to save full prediction matrix to disk
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

If return_df=True, returns DataFrame with predictions for each window, otherwise None

Return type:

pandas.DataFrame or None

Notes

With na_action=’separate’: Currently behaves like ‘exclude’ (holdouts must have known locations). Future versions may support predicting NA samples.
With na_action=’exclude’: Only uses samples with known locations (current behavior)
With na_action=’fail’: Raises error if any NA samples found

Warning

When respect_chromosomes=False, window analysis treats all SNP positions as continuous along a single coordinate axis. If your data contains multiple chromosomes, windows may span across chromosome boundaries. Use respect_chromosomes=True (default) for biologically meaningful windows.

set_sample_weights(wdict): Set sample weights for training. :param wdict: Dictionary returned by utils.weight_samples() containing sample weights. :type wdict: dict

setup_ensemble_gpu_optimization(use_mixed_precision=None)

Setup GPU optimizations for ensemble training.

Parameters:: use_mixed_precision – Whether to use mixed precision training. If None, uses config value or auto-detects based on GPU.
Returns:: Whether mixed precision was enabled
Return type:: bool

sort_samples(samples=None, sample_data_file=None, reorder=True)

Sort samples and match with location data.

Matches samples with their location data and ensures consistent ordering between genotype and location data.

Parameters:

samples (numpy.ndarray) – Array of sample IDs from the genotype data
sample_data_file (str, optional) – Override path to tab-delimited file with columns ‘sampleID’, ‘x’, ‘y’. If not provided, uses stored sample data.
reorder (bool) – If True, automatically reorder metadata to match genotype order. If False, raise error on order mismatch (default: True)

Returns:

(sample_data DataFrame, locs array of shape (n_samples, 2))

Return type:

tuple
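The effect of the reorder flag can be sketched with plain Python structures. This is not the library's code: sketch_match_locations is a hypothetical helper, metadata is represented as (sampleID, x, y) tuples instead of a DataFrame, and the sketch assumes every genotype sample has exactly one metadata row.

```python
def sketch_match_locations(samples, metadata_rows, reorder=True):
    """Hypothetical helper: align (sampleID, x, y) rows to genotype sample order."""
    meta_ids = [row[0] for row in metadata_rows]
    if sorted(meta_ids) != sorted(samples):
        raise ValueError("sample IDs in metadata and genotypes do not match")
    if meta_ids != list(samples) and not reorder:
        raise ValueError("metadata order differs from genotype order")
    by_id = {row[0]: row for row in metadata_rows}
    # Reorder metadata so row i describes genotype sample i.
    ordered = [by_id[s] for s in samples]
    locs = [[x, y] for _, x, y in ordered]
    return ordered, locs


samples = ["b", "a"]                                   # genotype order
metadata_rows = [("a", 1.0, 2.0), ("b", 3.0, 4.0)]     # file order
ordered, locs = sketch_match_locations(samples, metadata_rows)
print(ordered, locs)
```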

train(*, genotypes, samples, sample_data_file=None, boot=None, train_gen=None, test_gen=None, pred_gen=None, train_locs=None, test_locs=None, setup_only=False, na_action=None, site_order=None)

Train the Locator model on genotype and location data.

This method trains the neural network model to predict geographic locations from genetic data. It supports both standard training and advanced workflows such as bootstrapping, by accepting pre-processed genotype and location arrays. The model is configured using the parameters provided at initialization.

Parameters:

genotypes (allel.GenotypeArray or np.ndarray) – Genotype data for all samples. Should be of shape (n_sites, n_samples, ploidy).
samples (np.ndarray) – Array of sample IDs corresponding to the genotype data.
sample_data_file (str, optional) – Path to a tab-delimited file with columns ‘sampleID’, ‘x’, ‘y’ for sample locations. Used if not provided in config or as a DataFrame.
boot (int, optional) – Bootstrap replicate number. Used for bootstrapping analyses. Defaults to None.
train_gen (np.ndarray, optional) – Pre-processed training genotype data. Used for bootstrapping. If None, will be generated from genotypes. Defaults to None.
test_gen (np.ndarray, optional) – Pre-processed test genotype data. Used for bootstrapping. If None, will be generated from genotypes. Defaults to None.
pred_gen (np.ndarray, optional) – Pre-processed prediction genotype data. Used for bootstrapping. If None, will be generated from genotypes. Defaults to None.
train_locs (np.ndarray, optional) – Pre-processed training locations. Used for bootstrapping. If None, will be generated from sample data. Defaults to None.
test_locs (np.ndarray, optional) – Pre-processed test locations. Used for bootstrapping. If None, will be generated from sample data. Defaults to None.
setup_only (bool, optional) – If True, only sets up the model and data without training. Defaults to False.
na_action (str, optional) – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action. Defaults to None.
site_order (np.ndarray, optional) – Array of SNP indices for bootstrap resampling. If provided, SNPs will be reordered according to these indices during training. Used for bootstrap analyses to resample SNPs with replacement.

Returns:

The Keras training history object if training is performed, or None if setup_only is True.

Return type:

keras.callbacks.History or None

Raises:

ValueError – If required sample data is missing or improperly formatted.

Example

>>> # Standard training
>>> loc = Locator({"out": "analysis", "sample_data": "samples.txt", "zarr": "genotypes.zarr"})
>>> genotypes, samples = loc.load_genotypes(zarr="genotypes.zarr")
>>> history = loc.train(genotypes=genotypes, samples=samples)

>>> # Bootstrapping with pre-processed data
>>> history = loc.train(
...     genotypes=None,
...     samples=samples,
...     boot=1,
...     train_gen=boot_train_gen,
...     test_gen=boot_test_gen,
...     pred_gen=boot_pred_gen,
...     train_locs=boot_train_locs,
...     test_locs=boot_test_locs
... )
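
For the site_order argument, a bootstrap replicate amounts to drawing SNP indices with replacement. The sketch below shows one way to build such an index list using only the standard library; make_site_order is a hypothetical helper, and in practice the result would be converted to a NumPy array before being passed to train.

```python
import random

def make_site_order(n_sites, seed=None):
    """Draw n_sites SNP indices with replacement (one bootstrap replicate)."""
    rng = random.Random(seed)
    return [rng.randrange(n_sites) for _ in range(n_sites)]

site_order = make_site_order(n_sites=8, seed=1)

# Reordering a per-site list by these indices repeats some SNPs and drops others.
sites = ["snp%d" % i for i in range(8)]
resampled = [sites[i] for i in site_order]
```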

train_ensemble(genotypes, samples, k=5, training_set_indices=None, na_action=None, augment_data=False, flip_rate=0.05, save_fold_models=True, verbose=True, use_model_manager=True, use_mixed_precision=None, patience_multiplier=1.0)

Train an ensemble of k models using k-fold cross-validation.

This method trains k models, each on a different k-fold split of the data. It uses the modern tf.data pipeline for memory efficiency and supports all standard Locator features including NA handling and data augmentation.

Parameters:

genotypes – GenotypeArray containing genetic data
samples – Array of sample IDs
k – Number of folds/models in ensemble (default: 5)
training_set_indices – Optional array of indices to restrict training
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’)
augment_data – Whether to apply data augmentation (default: False)
flip_rate – Rate for genotype flipping augmentation (default: 0.05)
save_fold_models – Whether to save individual fold models (default: True)
verbose – Whether to show training progress (default: True)
use_model_manager – Whether to use model manager for saving (default: True)
use_mixed_precision – Whether to use mixed precision training (default: None, auto-detect)
patience_multiplier – Multiply patience for ensemble training (default: 1.0)

Returns:

Dictionary containing:

‘histories’: List of training histories for each fold
‘models’: List of trained model configurations
‘normalization_params’: Averaged normalization parameters
‘fold_info’: Information about fold splits

Return type:

dict

train_holdout(genotypes=None, samples=None, k=10, holdout_indices=None, filtered_genotypes=None)

Train the model while holding out samples with known locations.

Parameters:

genotypes – Array of genotype data. Required unless filtered_genotypes is provided.
samples – Sample IDs corresponding to genotypes
k – Number of samples to hold out (ignored if holdout_indices provided)
holdout_indices – Optional specific indices of samples to hold out
filtered_genotypes – Pre-filtered allele count array. If provided, skips internal filter_snps call and avoids loading the full genotype array. Used by parallel dispatch to share one filtered copy across all workers.

Return type:

keras.callbacks.History object from model training

train_window(genotypes, samples, window_snp_indices, index_set, normalized_locs)

Train the model for a specific genomic window using efficient tf.data pipeline.

This is an internal method used by run_windows_holdouts to train models on specific genomic windows without creating intermediate arrays.

Parameters:

genotypes – Full genotype array (not filtered)
samples – Sample IDs
window_snp_indices – Indices of SNPs in this window
index_set – Pre-computed IndexSet with train/test/holdout splits
normalized_locs – Pre-normalized location coordinates

Return type:

keras.callbacks.History object from model training

Ensemble Functionality

The ensemble functionality is integrated into the main Locator class through the EnsembleMixin.

EnsembleMixin

class EnsembleMixin[source]

Bases: object

Mixin class providing ensemble functionality for Locator.

create_ensemble_folds(genotypes, samples, k=5, training_set_indices=None, na_action=None)[source]

Create k-fold splits for ensemble training using IndexSet.

Parameters:

genotypes – GenotypeArray containing genetic data
samples – Array of sample IDs
k – Number of folds (default: 5)
training_set_indices – Optional array of indices to use for training+validation. If provided, only these samples will be used to create k-folds.
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

Dictionary with fold information:

‘index_sets’: List of IndexSet objects for each fold
‘fold_indices’: Legacy format dict for backward compatibility
‘sample_status’: Sample status information

Return type:

dict

train_ensemble(genotypes, samples, k=5, training_set_indices=None, na_action=None, augment_data=False, flip_rate=0.05, save_fold_models=True, verbose=True, use_model_manager=True, use_mixed_precision=None, patience_multiplier=1.0)[source]

Train an ensemble of k models using k-fold cross-validation.

This method trains k models, each on a different k-fold split of the data. It uses the modern tf.data pipeline for memory efficiency and supports all standard Locator features including NA handling and data augmentation.

Parameters:

genotypes – GenotypeArray containing genetic data
samples – Array of sample IDs
k – Number of folds/models in ensemble (default: 5)
training_set_indices – Optional array of indices to restrict training
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’)
augment_data – Whether to apply data augmentation (default: False)
flip_rate – Rate for genotype flipping augmentation (default: 0.05)
save_fold_models – Whether to save individual fold models (default: True)
verbose – Whether to show training progress (default: True)
use_model_manager – Whether to use model manager for saving (default: True)
use_mixed_precision – Whether to use mixed precision training (default: None, auto-detect)
patience_multiplier – Multiply patience for ensemble training (default: 1.0)

Returns:

Dictionary containing:

‘histories’: List of training histories for each fold
‘models’: List of trained model configurations
‘normalization_params’: Averaged normalization parameters
‘fold_info’: Information about fold splits

Return type:

dict

predict_ensemble(genotypes=None, samples=None, indices=None, include_fold_predictions=False, return_std=False, return_df=True, save_predictions=True)[source]

Make predictions using the ensemble of models.

Parameters:

genotypes – GenotypeArray for prediction (if None, uses stored data)
samples – Sample IDs (if None, uses stored samples)
indices – Specific indices to predict on (if None, predicts all)
include_fold_predictions – Include individual fold predictions in output
return_std – Return standard deviation across ensemble predictions
return_df – Return results as DataFrame (default: True)
save_predictions – Save predictions to disk (default: True)

Returns:

Ensemble predictions with optional std

Return type:

pd.DataFrame or np.ndarray
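
As an illustration of what an ensemble prediction with a spread estimate can look like, the sketch below averages per-fold coordinates for each sample. It assumes a plain mean across folds and a population standard deviation; the source does not state how predict_ensemble combines folds, and combine_fold_predictions is a hypothetical name.

```python
from statistics import mean, pstdev

def combine_fold_predictions(fold_predictions):
    """Average per-fold (x, y) predictions for each sample.

    fold_predictions[f][i] is the (x, y) prediction of fold f for sample i.
    Returns (means, stds), each a list of (x, y) tuples with one entry per sample.
    """
    n_samples = len(fold_predictions[0])
    means, stds = [], []
    for i in range(n_samples):
        xs = [fold[i][0] for fold in fold_predictions]
        ys = [fold[i][1] for fold in fold_predictions]
        means.append((mean(xs), mean(ys)))
        # Assumption: population standard deviation across folds.
        stds.append((pstdev(xs), pstdev(ys)))
    return means, stds

fold_preds = [
    [(10.0, 50.0), (20.0, 60.0)],  # fold 0
    [(12.0, 52.0), (20.0, 60.0)],  # fold 1
]
means, stds = combine_fold_predictions(fold_preds)
```

A sample on which the folds agree gets a standard deviation of zero, so the spread can serve as a rough per-sample uncertainty indicator.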

load_ensemble(ensemble_path)[source]

Load a saved ensemble for prediction.

Parameters:: ensemble_path – Path to the saved ensemble directory
Returns:: Ensemble information including models and parameters
Return type:: dict

predict_ensemble_from_manager(genotypes, samples, indices=None, return_df=True, save_predictions=True)[source]

Make predictions using loaded ensemble with model manager.

This method efficiently loads models on-demand for prediction, reducing memory usage for large ensembles.

Parameters:

genotypes – GenotypeArray for prediction
samples – Sample IDs
indices – Specific indices to predict on (if None, predicts all)
return_df – Return results as DataFrame (default: True)
save_predictions – Save predictions to disk (default: True)

Returns:

Ensemble predictions

Return type:

pd.DataFrame or np.ndarray

setup_ensemble_gpu_optimization(use_mixed_precision=None)[source]

Setup GPU optimizations for ensemble training.

Parameters:: use_mixed_precision – Whether to use mixed precision training. If None, uses config value or auto-detects based on GPU.
Returns:: Whether mixed precision was enabled
Return type:: bool

get_ensemble_batch_size(dataset_size, fold_idx=0)[source]

Determine optimal batch size for ensemble training.

Uses GPUOptimizer to find the best batch size, with caching to avoid recomputing for each fold.

Parameters:

dataset_size – Size of training dataset
fold_idx – Current fold index (for logging)

Returns:

Optimal batch size

Return type:

int

create_ensemble_early_stopping(patience_multiplier=1.5)[source]

Create early stopping callback with ensemble-specific settings.

Parameters:: patience_multiplier – Multiply base patience for ensemble training (ensembles often benefit from longer training)
Returns:: Configured callback
Return type:: keras.callbacks.EarlyStopping

create_ensemble_lr_scheduler(fold_idx)[source]

Create learning rate scheduler for ensemble training.

Each fold can start with a slightly different learning rate to improve ensemble diversity.

Parameters:: fold_idx – Current fold index
Returns:: Configured callback
Return type:: keras.callbacks.ReduceLROnPlateau

EnsembleModelManager

class EnsembleModelManager(base_path: str)[source]

Bases: object

Manages multiple models for ensemble predictions.

This class handles:

- Saving and loading ensemble models with metadata
- Lazy loading of model weights
- Efficient storage of normalization parameters
- Model versioning and validation

__init__(base_path: str)[source]

Initialize model manager.

Parameters:: base_path – Base path for saving/loading models

save_ensemble(models_info: List[Dict], ensemble_metadata: Dict | None = None) → None[source]

Save ensemble models and metadata.

Parameters:

models_info – List of model info dictionaries from training
ensemble_metadata – Optional metadata about the ensemble

load_ensemble(model_builder_fn=None) → List[Dict][source]

Load ensemble models and metadata.

Parameters:: model_builder_fn – Function to build model architecture
Return type:: List of model info dictionaries

get_model(fold: int, n_features: int) → Model[source]

Get a specific model, loading if necessary.

Parameters:

fold – Fold index
n_features – Number of features for model construction

Return type:

Loaded model

get_normalization_params(fold: int) → NormalizationParams[source]

Get normalization parameters for a specific fold.

Parameters:: fold – Fold index
Return type:: NormalizationParams instance

get_averaged_normalization_params() → NormalizationParams[source]

Get averaged normalization parameters across all folds.

Return type:: Averaged NormalizationParams

save_predictions(predictions: DataFrame, prediction_type: str = 'ensemble') → None[source]

Save predictions to disk.

Parameters:

predictions – DataFrame with predictions
prediction_type – Type of predictions (e.g., “ensemble”, “fold_0”)

clear_cache() → None[source]: Clear loaded models from memory.
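
The lazy-loading behaviour described for this class follows a common caching pattern, sketched below in generic Python. LazyModelCache and fake_loader are hypothetical stand-ins, not part of the library; the real manager reads model weights from disk and takes additional arguments.

```python
class LazyModelCache:
    """Load per-fold models on first use and keep them until cleared."""

    def __init__(self, loader):
        self._loader = loader  # callable: fold index -> model object
        self._cache = {}

    def get_model(self, fold):
        # Only call the loader the first time a fold is requested.
        if fold not in self._cache:
            self._cache[fold] = self._loader(fold)
        return self._cache[fold]

    def clear_cache(self):
        # Drop all loaded models so their memory can be reclaimed.
        self._cache.clear()

load_calls = []

def fake_loader(fold):
    # Stand-in for reading model weights from disk.
    load_calls.append(fold)
    return {"fold": fold}

cache = LazyModelCache(fake_loader)
first = cache.get_model(0)
second = cache.get_model(0)  # served from the cache, no second load
```

Calling clear_cache() between batches of predictions trades reload time for lower peak memory, which matters most for large ensembles.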

Parallel Ensemble Training

The parallel ensemble training function is available when Ray is installed:

from locator.parallel import parallel_train_ensemble

parallel_train_ensemble(locator, genotypes, samples, k=5, gpu_ids=[0, 1], gpu_fraction=1.0, training_set_indices=None, na_action=None, augment_data=False, flip_rate=0.05, save_fold_models=True, use_model_manager=True, use_mixed_precision=None, patience_multiplier=1.0, verbose=True)

Train ensemble models in parallel across multiple GPUs using Ray.

Parameters:

locator – Locator instance with configuration
genotypes – GenotypeArray containing genetic data
samples – Array of sample IDs
k – Number of folds/models in ensemble (default: 5)
gpu_ids – List of GPU IDs to use (default: [0, 1])
gpu_fraction – Fraction of GPU memory per worker (default: 1.0)
training_set_indices – Optional indices to restrict training
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’)
augment_data – Whether to apply data augmentation
flip_rate – Rate for genotype flipping augmentation
save_fold_models – Whether to save individual fold models
use_model_manager – Whether to use model manager for storage
use_mixed_precision – Whether to use mixed precision training
patience_multiplier – Multiply patience for ensemble training
verbose – Whether to show training progress

Returns:

dict containing histories, models, normalization_params, fold_info

Note

This function requires Ray to be installed. Install with pip install locator[ray].

Models Module

create_network(input_shape: int, width: int = 256, n_layers: int = 8, dropout_prop: float = 0.25, pca_components: int | None = None, optimizer_config: dict | None = None, loss_fn: callable | None = None) → Model[source]

Create a neural network model for geographic location prediction.

Parameters:

input_shape (int) – Number of input features (SNPs).
width (int, optional) – Width of the dense layers, defaults to 256.
n_layers (int, optional) – Total number of dense layers (excluding final layers), defaults to 8.
dropout_prop (float, optional) – Dropout proportion for middle dropout layer, defaults to 0.25.
pca_components (int, optional) – If set, prepend a linear projection layer named “pca_projection” of this width as the first layer. The caller is responsible for initializing its weights with PCA loadings. Defaults to None (no projection layer).
optimizer_config (dict, optional) – Configuration for the optimizer. Should be a dict containing keys: “algo” (str): “adam” or “adamw”; “learning_rate” (float); “weight_decay” (float, only used for “adamw”). Defaults to None (uses Adam with default settings).
loss_fn (callable, optional) – Loss function to use. Defaults to None, in which case euclidean_distance_loss is used.

Returns:

Compiled Keras model ready for training.

Return type:

keras.Model

Example

>>> model = create_network(input_shape=1000)
>>> model.summary()

loss_with_range_penalty(y_true, y_pred, mask_tensor, transform, resolution, penalty_weight=1.0)[source]

rasterize_species_range(shapefile_path, resolution=0.1)[source]

Data Module

This module contains the memory-efficient data pipeline components.

IndexSet

class IndexSet(indices: Dict[str, ndarray], total_samples: int, na_mask: ndarray | None = None)[source]

Bases: object

Container for dataset indices that avoids copying data.

This class stores indices for different data splits (train/val/test) to enable memory-efficient data access without creating copies of large genotype arrays.

Variables:

indices (Dictionary mapping split names to numpy arrays of indices)
total_samples (Total number of samples in the dataset)
na_mask (Optional boolean mask indicating samples without coordinates)

indices: Dict[str, ndarray]

total_samples: int

na_mask: ndarray | None = None

__post_init__()[source]: Validate the IndexSet after initialization.

property train: ndarray: Get training indices (backward compatibility).

property val: ndarray: Get validation indices (backward compatibility).

property test: ndarray: Get test indices (backward compatibility).

property hold: ndarray: Get holdout/prediction indices (backward compatibility).

get_split(name: str) → ndarray[source]: Get indices for a named split.

split_sizes() → Dict[str, int][source]: Get the size of each split.

classmethod random_split(n: int, splits: Dict[str, float] | None = None, seed: int | None = None, na_mask: ndarray | None = None, na_action: str = 'separate') → IndexSet[source]

Create random train/val/test splits.

Parameters:

n – Total number of samples
splits – Dictionary mapping split names to proportions (must sum to ≤ 1.0). Default: {“train”: 0.8, “val”: 0.1, “test”: 0.1}
seed – Random seed for reproducibility
na_mask – Boolean mask indicating samples without coordinates
na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’)

Return type:

IndexSet with random splits
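
A proportional random split can be sketched as follows. This is an illustration only: it returns a plain dictionary rather than an IndexSet, ignores na_mask and na_action, and rounding split sizes down is an assumption not confirmed by the source. The function name random_split_sketch is hypothetical.

```python
import random

def random_split_sketch(n, splits=None, seed=None):
    """Partition range(n) into named, non-overlapping index lists."""
    if splits is None:
        splits = {"train": 0.8, "val": 0.1, "test": 0.1}
    if sum(splits.values()) > 1.0 + 1e-9:
        raise ValueError("split proportions must sum to at most 1.0")
    order = list(range(n))
    random.Random(seed).shuffle(order)
    result, start = {}, 0
    for name, fraction in splits.items():
        size = int(n * fraction)  # Assumption: sizes are rounded down.
        result[name] = sorted(order[start:start + size])
        start += size
    return result

parts = random_split_sketch(100, seed=0)
```

Because only indices are stored, each split costs a few bytes per sample regardless of how many SNPs the genotype array holds.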

classmethod from_k_fold(n: int, k: int, fold: int, seed: int | None = None, na_mask: ndarray | None = None) → IndexSet[source]

Create train/test split for k-fold cross-validation.

Parameters:

n – Total number of samples
k – Number of folds
fold – Which fold to use as test set (0-indexed)
seed – Random seed for reproducibility
na_mask – Boolean mask indicating samples without coordinates

Return type:

IndexSet with train and test splits

classmethod from_groups(groups: ndarray, test_groups: List[int | str], na_mask: ndarray | None = None) → IndexSet[source]

Create train/test split based on group membership.

Useful for spatial or temporal cross-validation where you want to hold out entire groups (e.g., geographic regions).

Parameters:

groups – Array of group labels for each sample
test_groups – List of group labels to use as test set
na_mask – Boolean mask indicating samples without coordinates

Return type:

IndexSet with train and test splits
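
Group-based splitting is easiest to see on a toy example. The sketch below returns plain index lists instead of an IndexSet and omits na_mask handling; split_by_groups is a hypothetical name.

```python
def split_by_groups(groups, test_groups):
    """Return train/test index lists; whole groups go to the test split."""
    held_out = set(test_groups)
    test = [i for i, g in enumerate(groups) if g in held_out]
    train = [i for i, g in enumerate(groups) if g not in held_out]
    return {"train": train, "test": test}

groups = ["north", "south", "north", "east", "south"]
split = split_by_groups(groups, test_groups=["south"])
```

Every sample labelled "south" lands in the test split, so the model is evaluated on a region it never saw during training.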

classmethod from_manual(train: ndarray, test: ndarray | None = None, val: ndarray | None = None, predict: ndarray | None = None, total_samples: int | None = None) → IndexSet[source]

Create IndexSet from manually specified indices.

Parameters:

train – Training indices
test – Test indices
val – Validation indices
predict – Prediction indices (samples without labels)
total_samples – Total number of samples (inferred if not provided)

Return type:

IndexSet with specified splits

classmethod k_fold_split(n: int, k: int, seed: int | None = None, na_mask: ndarray | None = None) → List[IndexSet][source]

Create all k-fold cross-validation splits at once.

This method generates k IndexSet objects, one for each fold, suitable for ensemble training or cross-validation.

Parameters:

n – Total number of samples
k – Number of folds
seed – Random seed for reproducibility
na_mask – Boolean mask indicating samples to exclude from k-fold (e.g., samples without coordinates or not in training set)

Return type:

List of k IndexSet objects, one for each fold
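
The structure of the k splits can be sketched in a few lines. The version below is an assumption-laden illustration: it returns dictionaries rather than IndexSet objects, ignores na_mask, and assigns samples to folds round-robin after shuffling, which may differ from the library's fold assignment. k_fold_indices is a hypothetical name.

```python
import random

def k_fold_indices(n, k, seed=None):
    """Return k dicts, each with 'train' and 'test' index lists."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    # Deal shuffled indices round-robin so fold sizes differ by at most one.
    buckets = [order[i::k] for i in range(k)]
    folds = []
    for i in range(k):
        test = sorted(buckets[i])
        train = sorted(j for b in buckets[:i] + buckets[i + 1:] for j in b)
        folds.append({"train": train, "test": test})
    return folds

folds = k_fold_indices(n=10, k=3, seed=42)
```

Each sample appears in exactly one test split, so an ensemble trained this way produces one out-of-fold prediction per sample.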

__init__(indices: Dict[str, ndarray], total_samples: int, na_mask: ndarray | None = None) → None

Data Pipeline Functions

make_tf_dataset(coordinates: ndarray, index_set: IndexSet, split: str, batch_size: int = 256, sample_weights: ndarray | None = None, training: bool = True, shuffle: bool = True, drop_remainder: bool | None = None, prefetch: bool = True, repeat: bool = False) → DatasetV2[source]

Create an index-based tf.data pipeline for training or validation.

The pipeline carries only sample indices and their coordinates – a few kilobytes per batch. Genotypes are gathered on the GPU inside IndexedGenotypeModel, so the genotype matrix never enters this pipeline and there is no per-epoch host-to-device genotype traffic.

Parameters:

coordinates – Full coordinate array of shape (n_samples, 2).
index_set – IndexSet containing the train/val/test/predict splits.
split – Which split to use (‘train’, ‘val’, ‘test’, ‘predict’).
batch_size – Batch size for the dataset.
sample_weights – Optional per-sample weights, aligned to the split’s index order (length must equal the split size).
training – Whether this is for training (enables shuffling).
shuffle – Whether to shuffle the split each epoch (only when training).
drop_remainder – Whether to drop the final partial batch (defaults to the value of training).
prefetch – Whether to prefetch batches.
repeat – Whether to repeat the dataset indefinitely. Pair with a steps_per_epoch on model.fit so one iterator serves the whole run instead of being rebuilt every epoch (the per-epoch iterator rebuild dominates wall time for tiny splits). Reshuffles each cycle.

Returns:

A tf.data.Dataset yielding (sample_index, coordinate) batches,
or (sample_index, coordinate, sample_weight) when weights are given.
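
The interaction between batch_size, drop_remainder, and repeat is easier to follow with a pure-Python stand-in. The sketch below does not use tf.data; index_batches and steps_per_epoch are hypothetical helpers that mimic how a split's indices are batched and how many steps one pass contains.

```python
import math
import random

def index_batches(indices, batch_size, shuffle=True, drop_remainder=True, seed=None):
    """Yield lists of sample indices for one pass over a split."""
    order = list(indices)
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        if drop_remainder and len(batch) < batch_size:
            break  # discard the final partial batch
        yield batch

def steps_per_epoch(n_samples, batch_size, drop_remainder=True):
    """Number of batches in one pass over the split."""
    if drop_remainder:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)

train_indices = list(range(10))
batches = list(index_batches(train_indices, batch_size=4, seed=0))
```

When repeat=True the dataset never ends on its own, so a value like steps_per_epoch(len(train_indices), batch_size) is what would be passed to model.fit to mark epoch boundaries.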

Preprocessing Functions

filter_snps(genotypes, min_mac: int = 1, max_snps: int | None = None, impute: bool = False, verbose: bool = False) → Tuple[ndarray, FilterStats][source]

Filter SNPs based on criteria and return statistics.

Parameters:

genotypes – GenotypeArray to filter
min_mac – Minimum minor allele count for filtering
max_snps – Maximum number of SNPs to retain
impute – Whether to impute missing data
verbose – Whether to print progress messages

Return type:

Tuple of (filtered allele counts array, FilterStats)
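
Minor allele count filtering can be illustrated on nested lists shaped like a small genotype array. The sketch assumes biallelic sites coded 0/1 with -1 for missing alleles (the scikit-allel convention) and keeps sites whose count is at least min_mac; whether the library's comparison is inclusive is not stated in the source. Both function names are hypothetical.

```python
def minor_allele_counts(genotypes):
    """Per-site minor allele count; -1 marks a missing allele."""
    counts = []
    for site in genotypes:
        alleles = [a for call in site for a in call if a >= 0]
        alt = sum(1 for a in alleles if a == 1)
        ref = sum(1 for a in alleles if a == 0)
        counts.append(min(alt, ref))
    return counts

def keep_sites(genotypes, min_mac=1):
    """Indices of sites whose minor allele count is at least min_mac."""
    macs = minor_allele_counts(genotypes)
    return [i for i, mac in enumerate(macs) if mac >= min_mac]

# Shape (n_sites, n_samples, ploidy): 3 sites, 3 diploid samples.
genotypes = [
    [[0, 0], [0, 0], [0, 0]],    # monomorphic: minor allele count 0
    [[0, 1], [0, 0], [-1, -1]],  # one alt allele: minor allele count 1
    [[1, 1], [0, 1], [0, 0]],    # three alt and three ref alleles: count 3
]
kept = keep_sites(genotypes, min_mac=2)
```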

normalize_locs(locs: ndarray) → Tuple[float, float, float, float, ndarray, ndarray][source]

Normalize location coordinates.

Parameters:: locs – Array of shape (n_samples, 2) containing longitude and latitude
Return type:: Tuple of (meanlong, sdlong, meanlat, sdlat, unnormedlocs, normedlocs)
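
The returned tuple corresponds to a per-axis z-score. A minimal standard-library sketch, assuming the population standard deviation and no missing coordinates (the library may handle NaN values differently), is shown below; normalize_locs_sketch is a hypothetical name.

```python
from statistics import mean, pstdev

def normalize_locs_sketch(locs):
    """Z-score longitude (column 0) and latitude (column 1)."""
    longs = [p[0] for p in locs]
    lats = [p[1] for p in locs]
    meanlong, sdlong = mean(longs), pstdev(longs)
    meanlat, sdlat = mean(lats), pstdev(lats)
    normed = [((x - meanlong) / sdlong, (y - meanlat) / sdlat) for x, y in locs]
    # Same ordering as the documented return tuple.
    return meanlong, sdlong, meanlat, sdlat, list(locs), normed

locs = [(10.0, 50.0), (12.0, 54.0), (14.0, 58.0)]
meanlong, sdlong, meanlat, sdlat, unnormed, normed = normalize_locs_sketch(locs)
```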

impute_missing(genotypes, alt_counts: ndarray | None = None) → ndarray[source]

Replace missing data with binomial draws from allele frequency.

Parameters:

genotypes – GenotypeArray with missing data
alt_counts – Optional precomputed per-site alt allele counts of shape (n_sites,). When provided, the internal count_alleles() call is skipped; filter_snps uses this to reuse the counts from its numba kernel.

Return type:

Allele counts array with imputed values
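
Imputation by binomial draws can be sketched as follows, assuming diploid samples and a per-site alt allele frequency estimated from the non-missing calls. The input here is a nested list of alt allele counts with None for missing values, which is a simplification of the GenotypeArray the real function accepts; impute_alt_counts is a hypothetical name.

```python
import random

def impute_alt_counts(alt_counts, seed=None):
    """Fill missing per-sample alt allele counts (None) at each site.

    alt_counts[site][sample] is 0, 1, or 2, or None if the call is missing.
    """
    rng = random.Random(seed)
    imputed = []
    for site in alt_counts:
        called = [c for c in site if c is not None]
        # Alt allele frequency among called diploid genotypes.
        freq = sum(called) / (2 * len(called)) if called else 0.0
        row = []
        for c in site:
            if c is None:
                # Binomial(2, freq) as two independent Bernoulli draws.
                c = sum(1 for _ in range(2) if rng.random() < freq)
            row.append(c)
        imputed.append(row)
    return imputed

counts = [
    [0, 0, None, 0],  # frequency 0: the missing value becomes 0
    [2, 2, None, 2],  # frequency 1: the missing value becomes 2
    [0, 1, None, 2],  # frequency 0.5: the missing value is 0, 1, or 2
]
filled = impute_alt_counts(counts, seed=3)
```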

Data Classes

class FilterStats(n_samples_original: int, n_samples_filtered: int, n_snps_original: int, n_snps_filtered: int, mac_threshold: int, samples_removed_na: list[str] = None, n_biallelic_filtered: int = 0, n_mac_filtered: int = 0, n_random_subset: int = 0)[source]

Track what was filtered and why.

n_samples_original: int

n_samples_filtered: int

n_snps_original: int

n_snps_filtered: int

mac_threshold: int

samples_removed_na: list[str] = None

n_biallelic_filtered: int = 0

n_mac_filtered: int = 0

n_random_subset: int = 0

__init__(n_samples_original: int, n_samples_filtered: int, n_snps_original: int, n_snps_filtered: int, mac_threshold: int, samples_removed_na: list[str] = None, n_biallelic_filtered: int = 0, n_mac_filtered: int = 0, n_random_subset: int = 0) → None

class NormalizationParams(meanlong: float, sdlong: float, meanlat: float, sdlat: float)[source]

Store normalization parameters for coordinates.

meanlong: float

sdlong: float

meanlat: float

sdlat: float

apply(locs: ndarray) → ndarray[source]: Apply normalization to coordinates.

reverse(normalized_locs: ndarray) → ndarray[source]: Reverse normalization to get original coordinates.

__init__(meanlong: float, sdlong: float, meanlat: float, sdlat: float) → None
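
The apply and reverse methods are inverses of each other, which is what lets predictions made in normalized space be mapped back to longitude and latitude. The sketch below mirrors the documented fields in a hypothetical NormParamsSketch dataclass operating on plain tuples rather than NumPy arrays; the formulas are the standard z-score and its inverse, assumed rather than taken from the source.

```python
from dataclasses import dataclass

@dataclass
class NormParamsSketch:
    meanlong: float
    sdlong: float
    meanlat: float
    sdlat: float

    def apply(self, locs):
        """Map (long, lat) pairs to normalized coordinates."""
        return [((x - self.meanlong) / self.sdlong,
                 (y - self.meanlat) / self.sdlat) for x, y in locs]

    def reverse(self, normalized_locs):
        """Map normalized coordinates back to (long, lat) pairs."""
        return [(x * self.sdlong + self.meanlong,
                 y * self.sdlat + self.meanlat) for x, y in normalized_locs]

params = NormParamsSketch(meanlong=12.0, sdlong=2.0, meanlat=54.0, sdlat=4.0)
original = [(10.0, 50.0), (15.0, 56.0)]
restored = params.reverse(params.apply(original))
```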

Sample Weights Module

Calculate weights for training data based on the specified method.

Parameters:

method – Method for calculating weights (‘KD’, ‘histogram’, or ‘load’)
trainlocs – Training locations (required for KD and histogram methods)
trainsamps – Training sample IDs
weightdf – DataFrame containing pre-calculated sample weights
xbins – Number of bins in x direction for histogram method
ybins – Number of bins in y direction for histogram method
lam – Exponent for KDE weights
bandwidth – Bandwidth for KDE (if None, will be calculated)
cache_bandwidth – Whether to use bandwidth caching for KDE
n_bandwidths – Number of bandwidth values to test if calculating

Returns:

Dictionary containing:

‘method’: weighting method used
‘sample_weights’: array of weights
‘sample_weights_df’: DataFrame with sampleID and weights
method-specific parameters

Return type:

dict
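
To give a feel for the histogram method, the sketch below weights each sample by the inverse of the number of samples sharing its 2D bin and rescales the weights to a mean of 1. The source does not specify the exact formula, so both the inverse-count rule and the rescaling are assumptions, and histogram_weights is a hypothetical name.

```python
from collections import Counter

def histogram_weights(locs, xbins=10, ybins=10):
    """Inverse bin-count weights for (x, y) locations, scaled to mean 1."""
    xs = [p[0] for p in locs]
    ys = [p[1] for p in locs]
    xmin, xmax, ymin, ymax = min(xs), max(xs), min(ys), max(ys)

    def bin_index(value, low, high, nbins):
        if high == low:
            return 0
        # Clamp so the maximum value falls in the last bin.
        return min(int((value - low) / (high - low) * nbins), nbins - 1)

    cells = [(bin_index(x, xmin, xmax, xbins), bin_index(y, ymin, ymax, ybins))
             for x, y in locs]
    counts = Counter(cells)
    raw = [1.0 / counts[c] for c in cells]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

# Three samples clustered in one cell and one isolated sample.
locs = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0)]
weights = histogram_weights(locs, xbins=3, ybins=3)
```

The isolated sample receives a larger weight than each clustered one, which is the intended effect of density-based weighting: densely sampled regions do not dominate the loss.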

GPU Optimizer Module

class GPUOptimizer[source]

Utilities for optimizing GPU performance in TensorFlow.

static setup_mixed_precision()[source]

Enable mixed precision training, which can give up to a 2x speedup on modern GPUs.

Returns:: True if mixed precision was enabled successfully
Return type:: bool

static get_optimal_batch_size(model: Model, input_shape: Tuple[int, ...], target_memory_usage: float = 0.9, min_batch_size: int = 32, max_batch_size: int = 2048, dataset_size: int | None = None, verbose: bool = True) → int[source]

Dynamically determine optimal batch size for GPU memory.

Parameters:

model – Keras model to optimize for
input_shape – Shape of single input sample (excluding batch dimension)
target_memory_usage – Target GPU memory usage (0.0-1.0)
min_batch_size – Minimum batch size to test
max_batch_size – Maximum batch size to test
dataset_size – Size of the dataset (if provided, limits max batch size)

Returns:

Optimal batch size for current GPU

Return type:

int

static optimize_gpu_memory(mode: str = 'growth', memory_limit: int | None = None)[source]

Configure GPU memory allocation strategy.

Parameters:

mode – Memory allocation mode (‘growth’, ‘preallocate’, ‘limit’)
memory_limit – Memory limit in MB (only used with mode=’limit’)

static enable_xla_compilation()[source]

Enable XLA compilation for additional performance.

Note: This is experimental and may not work with all operations.

Internal Modules (Implementation Details)

These modules contain the implementation of Locator functionality. Users typically interact with these through the main Locator class.

Loaders Module

class DataLoaderMixin[source]

Mixin class providing data loading functionality for Locator.

load_genotypes(vcf=None, zarr=None, matrix=None, microsat=None, microsat_min_allele_freq=0.01, gl=None, bam_list=None, gl_mode='dosage', gl_missing_threshold=0.4, gl_min_maf=0.01, gl_max_missing_frac=0.1)[source]

Load genotype data from various input sources.

This method can load genotype data from:

1. A stored DataFrame provided during initialization
2. A VCF file
3. A zarr file (scikit-allel or bio2zarr format)
4. A tab-delimited matrix file
5. A tab-delimited microsatellite genotype table
6. An ANGSD beagle GL file paired with a BAM list

For windowed analysis, SNP positions must be available either from:

- Column names in the genotype DataFrame
- The zarr file’s variants/POS array
- The VCF file’s POS field (automatically loaded)

Parameters:

vcf (str, optional) – Path to VCF format genotype data
zarr (str, optional) – Path to zarr format genotype data
matrix (str, optional) – Path to tab-delimited matrix file
microsat (str, optional) – Path to tab-delimited microsatellite genotype table
microsat_min_allele_freq (float, optional) – Drop microsat alleles below this per-locus frequency. Default 0.01.
gl (str, optional) – Path to ANGSD -doGlf 2 beagle.gz file
bam_list (str, optional) – Path to BAM file list used in ANGSD run (one path per line). Required when gl is provided. Sample IDs are derived from Path(bam).stem.
gl_mode (str) – GL encoding mode, one of "dosage" (default) or "full_gl". "dosage" returns one expected-dosage value per site per sample; "full_gl" returns all three AA/AB/BB GL probabilities as separate rows.
gl_missing_threshold (float) – GL site filter; a sample at a site is missing if max(GL_AA, GL_AB, GL_BB) < gl_missing_threshold. Ignored unless gl is set. Default 0.4.
gl_min_maf (float) – GL site filter; drop sites whose mean-dosage MAF falls below this. Ignored unless gl is set. Default 0.01.
gl_max_missing_frac (float) – GL site filter; drop sites whose fraction of missing samples exceeds this. Ignored unless gl is set. Default 0.10.

Returns:

(genotypes, samples) where:

genotypes is an allel.GenotypeArray of shape (n_sites, n_samples, 2) for VCF/zarr/integer-matrix inputs, or a float32 ndarray of shape (n_sites, n_samples) for continuous-dosage inputs (matrix float, microsat, or GL)
samples is a numpy array of sample IDs

Return type:

tuple

Examples

>>> # Using stored DataFrame from initialization
>>> locator = Locator({
...     "genotype_data": geno_df,  # DataFrame with genotypes
...     "sample_data": coords_df   # DataFrame with coordinates
... })
>>> genotypes, samples = locator.load_genotypes()

>>> # Using zarr file (recommended for windowed analysis)
>>> locator = Locator({"sample_data": coords_df})
>>> genotypes, samples = locator.load_genotypes(zarr="path/to/geno.zarr")

>>> # Using VCF file
>>> genotypes, samples = locator.load_genotypes(vcf="path/to/geno.vcf")

>>> # Using matrix file
>>> genotypes, samples = locator.load_genotypes(matrix="path/to/geno.txt")

>>> # Using microsatellite genotypes
>>> genotypes, samples = locator.load_genotypes(microsat="path/to/microsats.tsv")

>>> # Using ANGSD genotype-likelihood file
>>> genotypes, samples = locator.load_genotypes(
...     gl="output.beagle.gz", bam_list="bams.txt", gl_mode="dosage"
... )

Raises:: ValueError – If no input source is provided or if input format is invalid.
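
For the "dosage" GL mode, the expected dosage of a sample at a site is the probability-weighted count of the B allele, and the missingness rule follows the gl_missing_threshold description above. The sketch below works on one triple of likelihoods; expected_dosage is a hypothetical helper, and renormalizing the three values is a defensive assumption rather than documented behaviour.

```python
def expected_dosage(gl_aa, gl_ab, gl_bb, missing_threshold=0.4):
    """Expected count of the B allele, or None if the call counts as missing."""
    if max(gl_aa, gl_ab, gl_bb) < missing_threshold:
        return None
    total = gl_aa + gl_ab + gl_bb
    # Normalize in case the three likelihoods do not sum exactly to 1.
    return (gl_ab + 2.0 * gl_bb) / total

confident = expected_dosage(0.05, 0.90, 0.05)  # a clear heterozygote
flat = expected_dosage(1 / 3, 1 / 3, 1 / 3)    # uninformative likelihoods
```

A flat triple of one third each never reaches the default threshold of 0.4, so such a call is treated as missing, while a confident heterozygote yields a dosage close to 1.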

sort_samples(samples=None, sample_data_file=None, reorder=True)[source]

Sort samples and match with location data.

Matches samples with their location data and ensures consistent ordering between genotype and location data.

Parameters:

samples (numpy.ndarray) – Array of sample IDs from the genotype data
sample_data_file (str, optional) – Override path to tab-delimited file with columns ‘sampleID’, ‘x’, ‘y’. If not provided, uses stored sample data.
reorder (bool) – If True, automatically reorder metadata to match genotype order. If False, raise error on order mismatch (default: True)

Returns:

(sample_data DataFrame, locs array of shape (n_samples, 2))

Return type:

tuple

Training Module

class TrainingMixin[source]

Mixin class providing training functionality for Locator.

set_sample_weights(wdict)[source]

Set sample weights for training.

Parameters:: wdict (dict) – Dictionary returned by utils.weight_samples() containing sample weights.

train(*, genotypes, samples, sample_data_file=None, boot=None, train_gen=None, test_gen=None, pred_gen=None, train_locs=None, test_locs=None, setup_only=False, na_action=None, site_order=None)[source]

Train the Locator model on genotype and location data.

This method trains the neural network model to predict geographic locations from genetic data. It supports both standard training and advanced workflows such as bootstrapping, by accepting pre-processed genotype and location arrays. The model is configured using the parameters provided at initialization.

Parameters:

genotypes (allel.GenotypeArray or np.ndarray) – Genotype data for all samples. Should be of shape (n_sites, n_samples, ploidy).
samples (np.ndarray) – Array of sample IDs corresponding to the genotype data.
sample_data_file (str, optional) – Path to a tab-delimited file with columns ‘sampleID’, ‘x’, ‘y’ for sample locations. Used if not provided in config or as a DataFrame.
boot (int, optional) – Bootstrap replicate number. Used for bootstrapping analyses. Defaults to None.
train_gen (np.ndarray, optional) – Pre-processed training genotype data. Used for bootstrapping. If None, will be generated from genotypes. Defaults to None.
test_gen (np.ndarray, optional) – Pre-processed test genotype data. Used for bootstrapping. If None, will be generated from genotypes. Defaults to None.
pred_gen (np.ndarray, optional) – Pre-processed prediction genotype data. Used for bootstrapping. If None, will be generated from genotypes. Defaults to None.
train_locs (np.ndarray, optional) – Pre-processed training locations. Used for bootstrapping. If None, will be generated from sample data. Defaults to None.
test_locs (np.ndarray, optional) – Pre-processed test locations. Used for bootstrapping. If None, will be generated from sample data. Defaults to None.
setup_only (bool, optional) – If True, only sets up the model and data without training. Defaults to False.
na_action (str, optional) – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action. Defaults to None.
site_order (np.ndarray, optional) – Array of SNP indices for bootstrap resampling. If provided, SNPs will be reordered according to these indices during training. Used for bootstrap analyses to resample SNPs with replacement.

Returns:

The Keras training history object if training is performed, or None if setup_only is True.

Return type:

keras.callbacks.History or None

Raises:

ValueError – If required sample data is missing or improperly formatted.

Example

>>> # Standard training
>>> loc = Locator({"out": "analysis", "sample_data": "samples.txt", "zarr": "genotypes.zarr"})
>>> genotypes, samples = loc.load_genotypes(zarr="genotypes.zarr")
>>> history = loc.train(genotypes=genotypes, samples=samples)

>>> # Bootstrapping with pre-processed data
>>> history = loc.train(
...     genotypes=None,
...     samples=samples,
...     boot=1,
...     train_gen=boot_train_gen,
...     test_gen=boot_test_gen,
...     pred_gen=boot_pred_gen,
...     train_locs=boot_train_locs,
...     test_locs=boot_test_locs
... )
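
The site_order argument is described above as an array of SNP indices resampled with replacement. A minimal sketch of how such an array might be built with NumPy follows. The helper name make_site_order and the toy array sizes are illustrative assumptions, not part of Locator.

```python
import numpy as np

def make_site_order(n_sites, seed=None):
    """Draw n_sites SNP indices with replacement (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_sites, size=n_sites, replace=True)

# Toy genotype array of shape (n_sites, n_samples, ploidy)
genotypes = np.arange(5 * 3 * 2).reshape(5, 3, 2)
site_order = make_site_order(genotypes.shape[0], seed=42)

# Indexing the first axis resamples sites and leaves samples untouched
resampled = genotypes[site_order]
```

Under these assumptions, the same site_order array would be passed to both train and predict so that the two steps see the same resampled sites.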

train_holdout(genotypes=None, samples=None, k=10, holdout_indices=None, filtered_genotypes=None)[source]

Train the model while holding out samples with known locations.

Parameters:

genotypes – Array of genotype data. Required unless filtered_genotypes is provided.
samples – Sample IDs corresponding to genotypes
k – Number of samples to hold out (ignored if holdout_indices provided)
holdout_indices – Optional specific indices of samples to hold out
filtered_genotypes – Pre-filtered allele count array. If provided, skips internal filter_snps call and avoids loading the full genotype array. Used by parallel dispatch to share one filtered copy across all workers.

Returns:

History object from model training.

Return type:

keras.callbacks.History
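
When specific samples should be held out, holdout_indices can be built ahead of time. The sketch below uses a hypothetical helper, pick_holdouts, and an illustrative mask marking which samples have known locations. How Locator itself chooses the k samples when holdout_indices is None is not documented here, so this is only one plausible approach.

```python
import numpy as np

def pick_holdouts(has_location, k=10, seed=None):
    """Choose up to k indices among samples with known locations (hypothetical helper)."""
    candidates = np.flatnonzero(has_location)
    rng = np.random.default_rng(seed)
    k = min(k, candidates.size)
    return np.sort(rng.choice(candidates, size=k, replace=False))

samples = np.array(["s1", "s2", "s3", "s4", "s5", "s6"])
has_location = np.array([True, True, False, True, True, False])

holdout_indices = pick_holdouts(has_location, k=2, seed=0)
held_out_ids = samples[holdout_indices]
```

The resulting array could then be supplied through the holdout_indices argument, in which case k is ignored.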

train_window(genotypes, samples, window_snp_indices, index_set, normalized_locs)[source]

Train the model for a specific genomic window using an efficient tf.data pipeline.

This is an internal method used by run_windows_holdouts to train models on specific genomic windows without creating intermediate arrays.

Parameters:

genotypes – Full genotype array (not filtered)
samples – Sample IDs
window_snp_indices – Indices of SNPs in this window
index_set – Pre-computed IndexSet with train/test/holdout splits
normalized_locs – Pre-normalized location coordinates

Returns:

History object from model training.

Return type:

keras.callbacks.History

Prediction Module

class PredictionMixin[source]

Mixin class providing prediction functionality for Locator.

predict(boot=0, verbose=True, prediction_genotypes=None, genotypes=None, samples=None, indices=None, return_df=False, save_preds_to_disk=True, site_order=None)[source]

Make predictions for samples with unknown locations.

Parameters:

boot (int, optional) – Bootstrap replicate number. Defaults to 0.
verbose (bool, optional) – Whether to print validation metrics. Defaults to True.
prediction_genotypes (numpy.ndarray, optional) – Deprecated; use the genotypes parameter instead. Overrides the default prediction genotypes. Used for jackknife resampling. Defaults to None.
genotypes (numpy.ndarray, optional) – Full genotype array for creating tf.data dataset. Should be the original unfiltered genotypes. Defaults to None.
samples (numpy.ndarray, optional) – Sample IDs corresponding to genotypes. Defaults to None.
indices (numpy.ndarray, optional) – Indices of samples to predict on. If None, predicts on samples without coordinates (self.pred_indices). Defaults to None.
return_df (bool, optional) – Whether to return predictions as pandas DataFrame. Defaults to False.
save_preds_to_disk (bool, optional) – Whether to save predictions to disk. Defaults to True.
site_order (np.ndarray, optional) – Array of SNP indices for bootstrap resampling. If provided, SNPs will be reordered according to these indices during prediction. Used for bootstrap analyses to ensure consistent resampling between train and predict.

Returns:

Array of predicted coordinates, or a DataFrame with x, y coordinates and sampleID columns.

Return type:

numpy.ndarray or pandas.DataFrame

load_model(weights_path)[source]

Load a trained model from saved weights.

This method loads a model from an HDF5 weights file and restores the preprocessing parameters needed for making predictions.

Parameters: weights_path (str) – Path to the saved HDF5 weights file.
Returns: Dictionary containing loaded metadata, including normalization parameters.
Return type: dict
Raises: ValueError – If the weights file cannot be loaded or is missing metadata.

predict_from_weights(weights_path, genotypes, samples, sample_data_file=None, save_preds_to_disk=True, return_df=True)[source]

Convenience method to load weights and make predictions.

This method combines loading a saved model and making predictions in a single call. It handles preprocessing the genotypes using the same parameters that were used during training.

Parameters:

weights_path (str) – Path to saved HDF5 weights file
genotypes (numpy.ndarray) – Genotype data to predict on
samples (numpy.ndarray) – Sample IDs corresponding to genotypes
sample_data_file (str, optional) – Path to sample data file
save_preds_to_disk (bool) – Whether to save predictions to disk
return_df (bool) – Whether to return predictions as DataFrame

Returns:

Predictions, as an array or, if return_df is True, as a DataFrame.

Return type:

numpy.ndarray or pandas.DataFrame

predict_holdout(verbose=True, return_df=False, save_preds_to_disk=True, plot_summary=True, plot_map=True)[source]

Predict locations for held out samples.

Parameters:

verbose – Print progress and metrics
return_df – Return predictions as pandas DataFrame
save_preds_to_disk – Save predictions to disk
plot_summary – Display error summary plot in notebook (only if return_df=True)
plot_map – Display map of predictions (only if plot_summary=True)

Returns:

If return_df is True, returns pandas DataFrame with predictions
Otherwise returns None

Analysis Module

class AnalysisMixin[source]: Mixin class providing analysis functionality for Locator.

Parallel Analysis Module

This module provides Ray-based parallel implementations of analysis methods for multi-GPU execution.

parallel_k_fold_holdouts(*args, **kwargs)

parallel_leave_one_out(*args, **kwargs)

parallel_holdouts(*args, **kwargs)

parallel_windows_holdouts(*args, **kwargs)

Plotting Module

This module provides visualization functions for Locator predictions and analyses.

Standalone Functions

plot_predictions(predictions, locator, out_prefix, samples=None, n_samples=9, n_cols=3, plot_map=False, width=5, height=4, dpi=300, n_levels=3, show=None)[source]

Plot locator predictions from jackknife, bootstrap, or windows analyses.

This function visualizes predictions from any of locator’s prediction methods that generate multiple predictions per sample. It creates a grid of subplots, one per sample, showing the distribution of predictions as KDE contours.

The function expects prediction data with:

A ‘sampleID’ column
Multiple prediction columns (‘x_0’, ‘x_1’… and ‘y_0’, ‘y_1’…)

For each sample, the plot shows:

KDE contours of predictions (blue lines)
True location if known (red star)
All training sample locations (gray circles)

Parameters:

predictions (pandas.DataFrame or str) –
DataFrame or path to predictions file. Output from any of:
- locator.run_jacknife(return_df=True)
- locator.run_bootstraps(return_df=True)
- locator.run_windows(return_df=True)
locator (Locator) – Locator instance containing training data configuration
out_prefix (str) – Prefix for output files. Plot saved as {out_prefix}_predictions.pdf
samples (list, optional) – List of sample IDs to plot. If None, randomly selects n_samples
n_samples (int) – Number of samples to plot if samples not specified. Default: 9
n_cols (int) – Number of columns in plot grid. Default: 3
plot_map (bool) – Whether to plot on a geographic map (requires cartopy). Default: False
width (float) – Width of each subplot in inches. Default: 5
height (float) – Height of each subplot in inches. Default: 4
dpi (int) – DPI resolution for output figure. Default: 300
n_levels (int) – Number of KDE contour levels to plot. Default: 3
show (bool or None) – Whether to display plot. None=auto-detect environment. Default: None

Returns:

Saves plot to file and optionally displays it.

Return type:

None

Examples

For jackknife analysis:

predictions = locator.run_jacknife(genotypes, samples, return_df=True)
plot_predictions(predictions, locator, "jacknife_example")

For bootstrap analysis:

predictions = locator.run_bootstraps(genotypes, samples, return_df=True)
plot_predictions(predictions, locator, "bootstrap_example")

For windows analysis:

predictions = locator.run_windows(genotypes, samples, return_df=True)
plot_predictions(predictions, locator, "windows_example")

Plot specific samples:

plot_predictions(predictions, locator, "selected",
               samples=['HG001', 'HG002', 'HG003'])

Note

Requires matplotlib and scipy for KDE calculation
If plot_map=True, requires cartopy for geographic projections
Automatically adjusts plot limits based on prediction ranges
KDE may fail for samples with very few predictions
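
The expected input layout (a 'sampleID' column plus paired 'x_i' / 'y_i' columns) can be illustrated with a small pandas sketch. The helpers to_wide and replicate_points are hypothetical, and the interleaved column order is an assumption; only the column names come from the description above.

```python
import numpy as np
import pandas as pd

def to_wide(sample_ids, x_preds, y_preds):
    """Build a wide table with sampleID, x_0.., y_0.. columns (hypothetical helper).

    x_preds and y_preds have shape (n_samples, n_replicates).
    """
    x_preds = np.asarray(x_preds)
    y_preds = np.asarray(y_preds)
    data = {"sampleID": list(sample_ids)}
    for i in range(x_preds.shape[1]):
        data[f"x_{i}"] = x_preds[:, i]
        data[f"y_{i}"] = y_preds[:, i]
    return pd.DataFrame(data)

def replicate_points(df, sample_id):
    """Return (x, y) arrays holding every replicate prediction for one sample."""
    row = df.loc[df["sampleID"] == sample_id].iloc[0]
    n = sum(c.startswith("x_") for c in df.columns)
    x = np.array([row[f"x_{i}"] for i in range(n)], dtype=float)
    y = np.array([row[f"y_{i}"] for i in range(n)], dtype=float)
    return x, y

wide = to_wide(
    ["A", "B"],
    [[1.0, 1.2, 0.9], [5.0, 5.1, 4.8]],
    [[2.0, 2.1, 1.9], [7.0, 7.2, 6.9]],
)
x_a, y_a = replicate_points(wide, "A")
```

Each sample's (x, y) replicate cloud is what a per-sample KDE contour would be drawn from.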

plot_error_summary(predictions, sample_data, out_prefix=None, plot_map=True, width=20, height=10, dpi=300, use_geodesic=True, include_training_locs=True, show=None, return_merged=False)[source]

Plot summary of prediction errors from holdout analysis.

Creates a comprehensive error visualization with two panels:

Map/Scatter panel: Shows true locations colored by prediction error, with lines connecting true and predicted locations
Histogram panel: Distribution of errors with summary statistics

This function is designed for analyzing results from holdout methods like:

run_holdouts()
run_k_fold_holdouts()
run_leave_one_out()

Parameters:

predictions (pandas.DataFrame) – DataFrame with columns:
- sampleID: Sample identifiers
- x_pred: Predicted longitude
- y_pred: Predicted latitude
sample_data (pandas.DataFrame or str) – DataFrame or path to TSV file with columns:
- sampleID: Sample identifiers (must match predictions)
- x: True longitude
- y: True latitude
out_prefix (str, optional) – Prefix for output files. If provided, saves as {out_prefix}_error_summary.png (or .html for interactive). Default: None
plot_map (bool) – Whether to plot on a geographic map using cartopy projection. If False, uses regular scatter plot. Default: True
width (float) – Figure width in inches. Default: 20
height (float) – Figure height in inches. Default: 10
dpi (int) – Figure resolution in dots per inch. Default: 300
use_geodesic (bool) – If True, calculate geodesic distances in kilometers. If False, use Euclidean distances in coordinate units. Default: True
include_training_locs (bool) – Whether to plot training locations (gray circles) and use their extent for map bounds. Default: True
show (bool or None) – Whether to display plot. None=auto-detect environment, True=always show, False=never show. Default: None
return_merged (bool) – If True, return the internal merged DataFrame used for plotting. Default: False

Returns:

None by default; the plot is saved to file and optionally displayed.
If return_merged is True, returns the internal merged DataFrame containing prediction errors and true locations.

Raises:

ValueError – If predictions or sample_data are empty, have missing columns, or have no matching samples.

Examples

Basic usage with k-fold results:

predictions = locator.run_k_fold_holdouts(genotypes, samples, return_df=True)
plot_error_summary(predictions, "samples.tsv", "kfold_errors")

With DataFrame input and Euclidean distances:

plot_error_summary(predictions, sample_df,
                 out_prefix="holdout_errors",
                 use_geodesic=False)

Without map projection:

plot_error_summary(predictions, sample_df,
                 plot_map=False,
                 width=10, height=5)

Return merged DataFrame:

merged = plot_error_summary(predictions, sample_df, return_merged=True)

Note

Summary statistics shown: mean, median, max error, R² for x and y
Training locations help visualize geographic sampling bias
Geodesic distances account for Earth’s curvature
Map projection requires cartopy to be installed
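
To make the use_geodesic distinction concrete, the sketch below compares a great-circle distance in kilometers with a Euclidean distance in coordinate units. It uses the haversine formula on a spherical Earth, which is a stand-in chosen for illustration; the library's own geodesic calculation may use an ellipsoidal model and is not shown here. The function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, spherical approximation

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (longitude, latitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def euclidean(x1, y1, x2, y2):
    """Straight-line distance in coordinate units (degrees for lon/lat input)."""
    return math.hypot(x2 - x1, y2 - y1)

# One degree of longitude at the equator versus at 60 degrees north
d_equator = haversine_km(0.0, 0.0, 1.0, 0.0)
d_60n = haversine_km(0.0, 60.0, 1.0, 60.0)
d_euclid = euclidean(0.0, 60.0, 1.0, 60.0)
```

The Euclidean value is 1 degree in both cases, while the great-circle distance at 60 degrees north is roughly half that at the equator, which is why errors in kilometers are easier to compare across latitudes.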

plot_sample_weights(locator, out_prefix=None, plot_map=True, width=5, height=3, dpi=300, show=None)[source]

Plot sample weights assigned to training locations.

Visualizes the geographic distribution of sample weights used during training. This is useful for understanding which regions are upweighted or downweighted based on sampling density.

Sample weights are typically computed using:

Kernel density (KD) method: Upweights samples in sparse regions
Histogram binning method: Based on 2D histogram counts

The plot uses a log-scale color mapping to better show weight variations.

Parameters:

locator (Locator) – Locator instance that has been trained with sample weighting enabled. Must have computed sample_weights attribute.
out_prefix (str, optional) – Prefix for output files. If provided, saves as {out_prefix}_sample_weights.png. Default: None
plot_map (bool) – Whether to plot on a geographic map using cartopy projection. If False, uses regular scatter plot with equal aspect ratio. Default: True
width (float) – Figure width in inches. Default: 5
height (float) – Figure height in inches. Default: 3
dpi (int) – Figure resolution in dots per inch. Default: 300
show (bool or None) – Whether to display plot. None=auto-detect environment, True=always show, False=never show. Default: None

Returns:

Saves plot to file and optionally displays it.

Return type:

None

Raises:

ValueError – If locator doesn’t have computed sample weights, or if required data is missing.

Examples

After training with KDE weighting:

config = {
    "weight_samples": {
        "enabled": True,
        "method": "KD"
    }
}
locator = Locator(config)
locator.train(genotypes=genotypes, samples=samples)  # train() arguments are keyword-only
plot_sample_weights(locator, "kde_weights")

With histogram binning weights:

config = {
    "weight_samples": {
        "enabled": True,
        "method": "hist",
        "xbins": 20,
        "ybins": 20
    }
}
locator = Locator(config)
locator.train(genotypes=genotypes, samples=samples)  # train() arguments are keyword-only
plot_sample_weights(locator, "hist_weights", plot_map=False)

Note

Requires that locator was trained with weight_samples enabled
Log scale coloring helps visualize large weight variations
Higher weights (yellow) indicate undersampled regions
Lower weights (purple) indicate oversampled regions
Map projection requires cartopy to be installed
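
The idea behind histogram-based weighting, where samples in crowded cells are downweighted, can be sketched as follows. This is an illustrative inverse-count scheme scaled to mean 1, not Locator's actual formula; the function name is hypothetical, and only the xbins and ybins names are borrowed from the configuration.

```python
import numpy as np

def histogram_weights(x, y, xbins=10, ybins=10):
    """Inverse-count weights from a 2D histogram, scaled to mean 1 (illustrative only)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Interior bin edges; np.digitize then yields bin indices 0..bins-1
    x_edges = np.linspace(x.min(), x.max(), xbins + 1)[1:-1]
    y_edges = np.linspace(y.min(), y.max(), ybins + 1)[1:-1]
    ix = np.digitize(x, x_edges)
    iy = np.digitize(y, y_edges)
    counts = np.zeros((xbins, ybins))
    np.add.at(counts, (ix, iy), 1)  # number of samples in each occupied cell
    weights = 1.0 / counts[ix, iy]
    return weights / weights.mean()

# Three samples share one cell; one isolated sample sits far away
x = [0.0, 0.1, 0.2, 10.0]
y = [0.0, 0.1, 0.2, 10.0]
w = histogram_weights(x, y, xbins=5, ybins=5)
```

In this toy case the isolated sample receives three times the weight of each clustered sample.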

kde_predict(x_coords, y_coords, xlim=(0, 50), ylim=(0, 50), n_points=100)[source]

Calculate kernel density estimate of predictions.

This is a helper function used internally by plot_predictions() to compute kernel density estimates for visualizing prediction uncertainty.

Parameters:

x_coords (array-like) – Array of x coordinates (longitude values)
y_coords (array-like) – Array of y coordinates (latitude values)
xlim (tuple) – Tuple of (min, max) x values for grid. Default: (0, 50)
ylim (tuple) – Tuple of (min, max) y values for grid. Default: (0, 50)
n_points (int) – Number of points for density estimation grid. Default: 100

Returns:

A 3-tuple containing:

x_grid (numpy.ndarray): X coordinates of the mesh grid
y_grid (numpy.ndarray): Y coordinates of the mesh grid
density (numpy.ndarray): Density values at each grid point

Returns (None, None, None) if KDE calculation fails.

Return type:

tuple

Note

The function uses scipy.stats.gaussian_kde for density estimation. Grid limits should match the geographic extent of your predictions.
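
A minimal sketch of this kind of grid evaluation with scipy.stats.gaussian_kde is given below. It mirrors the documented inputs and return shape but is not the library's source; the name kde_on_grid and the choice of which exceptions to catch are assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_on_grid(x_coords, y_coords, xlim=(0, 50), ylim=(0, 50), n_points=100):
    """Evaluate a 2D Gaussian KDE on a regular grid (illustrative sketch)."""
    try:
        x_grid, y_grid = np.meshgrid(
            np.linspace(xlim[0], xlim[1], n_points),
            np.linspace(ylim[0], ylim[1], n_points),
        )
        # gaussian_kde expects data of shape (n_dims, n_observations)
        kde = gaussian_kde(np.vstack([x_coords, y_coords]))
        grid_points = np.vstack([x_grid.ravel(), y_grid.ravel()])
        density = kde(grid_points).reshape(x_grid.shape)
        return x_grid, y_grid, density
    except (np.linalg.LinAlgError, ValueError):
        # Degenerate input (for example too few or collinear points) can fail
        return None, None, None

# Synthetic predictions scattered around longitude 10, latitude 20
rng = np.random.default_rng(0)
xs = rng.normal(10.0, 1.0, 200)
ys = rng.normal(20.0, 1.0, 200)
x_grid, y_grid, density = kde_on_grid(xs, ys, xlim=(0, 20), ylim=(10, 30), n_points=50)
```

The returned arrays have the shape that matplotlib's contour functions accept, for example ax.contour(x_grid, y_grid, density, levels=3).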

PlottingMixin Class

class PlottingMixin[source]

Bases: object

Mixin class providing plotting functionality for Locator.

This mixin is inherited by the main Locator class to provide visualization methods for training history and Jupyter notebook integration.

_repr_html_: Generate rich HTML representation for Jupyter notebooks

Configuration Options

This section provides an overview of the available configuration options.

Default Configuration

The default configuration for Locator includes:

{
    # Data parameters
    "train_split": 0.9,
    "batch_size": 32,
    "min_mac": 2,
    "max_SNPs": None,
    "impute_missing": False,

    # Network architecture
    "width": 256,
    "nlayers": 8,
    "dropout_prop": 0.25,

    # Training parameters
    "max_epochs": 5000,
    "patience": 100,
    "learning_rate": 0.001,
    "min_epochs": 10,
    "min_delta": 1e-4,
    "restore_best_weights": True,

    # Optimizer parameters
    "optimizer_algo": "adam",
    "weight_decay": 0.004,

    # Output control
    "keras_verbose": 1,
    "prediction_frequency": 1,

    # Validation
    "validation_split": 0.1,

    # Data augmentation
    "augmentation": {
        "enabled": False,
        "flip_rate": 0.05,
    },

    # Sample weighting
    "weight_samples": {
        "enabled": False,
        "method": "KD",
        "xbins": 10,
        "ybins": 10,
        "lam": 1.0,
        "bandwidth": None,
        "weightdf": None,
    },

    # Range penalty
    "use_range_penalty": False,
    "species_range_shapefile": None,
    "resolution": 0.05,
    "penalty_weight": 1.0,
    "out": "locator",

    # NA handling
    "na_action": "separate",

    # GPU optimization (enabled by default)
    "use_mixed_precision": True,
    "gpu_batch_size": "auto",
    "gradient_accumulation_steps": 1,
    "gpu_memory_mode": "growth",
    "enable_xla": False,

    # Performance optimization
    "optimize_tf_parallelism": True,
    "holdout_no_intermediate_saves": True,
    "save_fold_models": True,

    # Verbosity control
    "verbose_splits": False,
    "verbose_batch_size": False,
}
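
Because several options are nested dictionaries (for example weight_samples), it can help to see how user settings might be overlaid on defaults. The sketch below performs a recursive merge; whether Locator merges nested dictionaries this way or replaces them wholesale is not stated in this reference, so treat the behavior and the merge_config name as assumptions.

```python
import copy

def merge_config(defaults, overrides):
    """Recursively overlay overrides on defaults without mutating either (illustrative)."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged

# A small subset of the defaults listed above
defaults = {
    "train_split": 0.9,
    "batch_size": 32,
    "weight_samples": {"enabled": False, "method": "KD", "xbins": 10, "ybins": 10},
}
user = {"batch_size": 64, "weight_samples": {"enabled": True, "method": "hist"}}

config = merge_config(defaults, user)
```

With a recursive merge, the user only needs to name the nested keys that change; xbins and ybins keep their default values here.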

Input Formats

Genotype Data

Supported input formats for genotype data:

VCF files (.vcf or .vcf.gz)
Zarr format (recommended for large datasets)
Pandas DataFrame with:
- Samples as index
- SNP positions as columns
- Genotype counts (0, 1, 2) as values
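
The relationship between a genotype call array of shape (n_sites, n_samples, ploidy) and the samples-by-SNPs count layout can be sketched with NumPy. The function name is hypothetical, and encoding missing calls as -1 is an assumption made for this example.

```python
import numpy as np

def alt_allele_counts(genotypes):
    """Collapse (n_sites, n_samples, ploidy) calls to non-reference allele counts.

    Returns a float array of shape (n_samples, n_sites). Entries with any
    missing call (encoded here as -1, an assumption) become NaN.
    """
    g = np.asarray(genotypes)
    missing = (g < 0).any(axis=-1)
    counts = (g > 0).sum(axis=-1).astype(float)
    counts[missing] = np.nan
    return counts.T  # samples as rows, sites as columns

# Two sites, three diploid samples; the last call at the second site is missing
calls = np.array([
    [[0, 0], [0, 1], [1, 1]],
    [[0, 1], [1, 1], [-1, -1]],
])
counts = alt_allele_counts(calls)
```

Wrapping the result as pd.DataFrame(counts, index=sample_ids, columns=positions) would give the samples-as-index, positions-as-columns layout described above.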

Sample Data

Required format for sample coordinate data:

Tab-delimited file or DataFrame with columns:
- sampleID: Sample identifier
- x: Longitude
- y: Latitude
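
Since predictions are made for samples without coordinates, a sample file typically mixes rows with and without x and y values. The sketch below parses such a table with the standard library and separates the two groups. The function name and the use of "NA" for missing coordinates are assumptions for illustration.

```python
import csv
import io
import math

def split_by_coordinates(tsv_text):
    """Parse sampleID/x/y rows; return (known, unknown), where unknown lack coordinates."""
    known, unknown = {}, []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        try:
            x, y = float(row["x"]), float(row["y"])
        except (TypeError, ValueError):
            # Non-numeric or absent fields (for example "NA") count as missing
            x = y = math.nan
        if math.isnan(x) or math.isnan(y):
            unknown.append(row["sampleID"])
        else:
            known[row["sampleID"]] = (x, y)
    return known, unknown

tsv = "sampleID\tx\ty\nA\t-122.3\t47.6\nB\tNA\tNA\nC\t2.35\t48.86\n"
known, unknown = split_by_coordinates(tsv)
```

Here samples A and C would serve as training data, while B is a candidate for prediction.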

Output Formats

Prediction Results

Default output files:

{out}_predlocs.txt: Main predictions
{out}_history.txt: Training history
{out}_fitplot.pdf: Training plots
{out}.weights.h5: Model weights

For special analyses:

{out}_bootstrap_predlocs.csv: Bootstrap results
{out}_jacknife_predlocs.csv: Jackknife results
{out}_windows_predlocs.csv: Windowed analysis results
{out}_holdout_predlocs.csv: Holdout analysis results