API Reference

Core Module

setup_gpu(gpu_number=None)

Configure GPU settings for optimal usage.

Parameters:

gpu_number (int or str, optional) – GPU index to use (0-based). If None, the first available GPU is used.

Returns:

True if a GPU is available and successfully configured, otherwise False.

Return type:

bool

Locator

class Locator(config=None)

Bases: DataLoaderMixin, TrainingMixin, PredictionMixin, AnalysisMixin, EnsembleMixin, PlottingMixin

A class for predicting geographic locations from genetic data.

This class implements a neural network approach to predict sample locations from genetic data. It can handle various input formats including:

  • Genotype data:
    • VCF or VCF.gz files

    • Zarr format

    • Pandas DataFrame with samples as index, SNP positions as columns

  • Sample location data:
    • Tab-delimited file

    • Pandas DataFrame

The model can be configured through a dictionary of parameters passed during initialization. Sample location data can be provided either as a file path or as a pandas DataFrame.

Variables:
  • config (dict)

  • model (keras.Model)

  • history (keras.callbacks.History)

  • samples (numpy.ndarray)

  • sdlat (float)

  • (float)

  • (float)

  • (float)

Example

>>> # Using a file path for sample data
>>> locator = Locator({
...     "out": "analysis_1",
...     "sample_data": "samples.txt",
...     "zarr": "genotypes.zarr"
... })
>>> # Using a DataFrame for sample data
>>> locator = Locator({
...     "out": "analysis_1",
...     "sample_data": sample_df,  # pandas DataFrame
...     "zarr": "genotypes.zarr"
... })
>>> # Using DataFrames for both inputs
>>> import pandas as pd
>>> # Coordinate DataFrame must have columns: sampleID, x, y
>>> coords_df = pd.DataFrame({
...     "sampleID": ["sample1", "sample2"],
...     "x": [longitude1, longitude2],
...     "y": [latitude1, latitude2]
... })
>>>
>>> # Genotype DataFrame has samples as index, SNP positions as columns
>>> geno_df = pd.DataFrame({
...     1001: [0, 1],    # SNP position 1001
...     2001: [1, 2],    # SNP position 2001
... }, index=["sample1", "sample2"])
>>>
>>> locator = Locator({
...     "out": "analysis_1",
...     "sample_data": coords_df,
...     "genotype_data": geno_df
... })
__init__(config=None)

Initialize Locator with configuration parameters.

Parameters:

config (dict, optional) – Configuration dictionary that can include the following keys:

Top-level keys:

  • sample_data (str or pandas.DataFrame): Path to sample data file or a DataFrame with columns ‘sampleID’, ‘x’, ‘y’.

  • genotype_data (pandas.DataFrame): DataFrame with samples as index, SNP positions as columns, and genotype counts (0, 1, 2) as values.

  • zarr (str): Path to Zarr format genotype data.

  • vcf (str): Path to VCF format genotype data.

  • out (str): Output root name for all output files.

  • train_split (float): Proportion of data to use for training.

  • batch_size (int): Batch size for training.

  • max_epochs (int): Maximum number of training epochs.

  • min_mac (int): Minimum minor allele count for SNP filtering.

  • max_SNPs (int): Maximum number of SNPs to use.

  • width (int): Width of neural network layers.

  • nlayers (int): Number of neural network layers.

  • dropout_prop (float): Dropout proportion.

  • pca_components (int or “auto”): If set, prepend a PCA-initialized linear projection of this width as the first layer and fine-tune it. Use "auto" to pick the width from the genotype-PCA scree elbow. Recommended when n_SNPs >> n_samples. Default None (disabled).

  • pca_finetune (bool): Whether to unfreeze the PCA projection for a low-learning-rate fine-tuning phase. Default True. False keeps the projection frozen at its PCA initialization.

  • pca_finetune_lr (float): Learning rate for the PCA fine-tuning phase. Default 1e-4.

  • keras_verbose (int): Verbosity level for Keras training.

  • impute_missing (bool): Whether to impute missing genotypes.

  • validation_split (float): Proportion of data to use for validation.

  • learning_rate (float): Learning rate for the optimizer.

  • min_epochs (int): Minimum number of epochs to train.

  • patience (int): Number of epochs with no improvement to wait before stopping.

  • min_delta (float): Minimum change in validation loss to qualify as an improvement.

  • restore_best_weights (bool): Whether to restore model weights from the epoch with the best validation loss.

  • prediction_frequency (int): Frequency (in epochs) of making predictions during training.

  • optimizer_algo (str): Optimizer algorithm to use (“adam” or “adamw”).

  • weight_decay (float): Weight decay coefficient for AdamW optimizer.

  • augmentation (dict): Dictionary of augmentation parameters:
    • enabled (bool): Whether data augmentation is enabled.

    • flip_rate (float): Rate at which to randomly flip genotypes during augmentation.

  • weight_samples (dict): Dictionary of sample weighting parameters:
    • enabled (bool): Whether to weight samples by distance.

    • method (str): Method for weighting samples (“KD”, “histogram”, “df”).

    • xbins (int): Number of bins for histogram.

    • ybins (int): Number of bins for histogram.

    • lam (float): Exponent for weights.

    • bandwidth (float): Bandwidth for KDE.

    • weightdf (pandas.DataFrame): DataFrame containing sample weights.

  • use_range_penalty (bool): Whether to apply a range penalty in the loss function.

  • penalty_weight (float): Weight assigned to the range penalty term.

  • species_range_geom (shapely.geometry): Shapely geometry object defining the valid species range.

  • na_action (str): How to handle samples without coordinates. Options:
    • ‘separate’ (default): Include all samples, train on known, predict unknown.

    • ‘exclude’: Only use samples with known coordinates.

    • ‘fail’: Raise error if any samples lack coordinates.
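A configuration that combines several of the keys above might look like the following sketch. The values are illustrative assumptions, not documented defaults, and `check_na_action` is a hypothetical helper (not part of the library) that only mirrors the three na_action options listed above.

```python
# Illustrative values only; these are not the library's documented defaults.
config = {
    "out": "analysis_1",
    "sample_data": "samples.txt",
    "zarr": "genotypes.zarr",
    "train_split": 0.9,
    "max_epochs": 500,
    "patience": 50,
    "optimizer_algo": "adamw",
    "weight_decay": 1e-4,
    "augmentation": {"enabled": True, "flip_rate": 0.05},
    "weight_samples": {"enabled": False},
    "na_action": "separate",
}

# The three options listed for na_action in this reference.
NA_ACTIONS = ("separate", "exclude", "fail")


def check_na_action(cfg):
    """Hypothetical helper: return the na_action value or raise ValueError."""
    action = cfg.get("na_action", "separate")  # 'separate' is the documented default
    if action not in NA_ACTIONS:
        raise ValueError(f"na_action must be one of {NA_ACTIONS}, got {action!r}")
    return action


print(check_na_action(config))
```

Checking nested keys such as augmentation and weight_samples before constructing the Locator may catch typos early, since a plain dictionary does not validate itself.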

property sample_data: DataFrame

Returns the sample data as a pandas DataFrame.

Returns:

The sample data DataFrame with columns [‘sampleID’, ‘x’, ‘y’, …].

Return type:

pd.DataFrame

Raises:

ValueError – If sample data is not available.

Example

>>> locator = Locator({"sample_data": coords_df})
>>> df = locator.sample_data
get_sample_status(samples, sample_data=None)

Analyze sample coordinate status.

This method identifies which samples have known geographic coordinates and which have missing (NA) coordinates. This is useful for understanding your data and for methods that need to handle samples with and without coordinates differently.

Parameters:
  • samples (numpy.ndarray) – Array of sample IDs from genotype data

  • sample_data (pandas.DataFrame, optional) – DataFrame with columns ‘sampleID’, ‘x’, ‘y’. If not provided, uses the stored sample data or loads from config.

Returns:

A dictionary containing:

  • ‘known_indices’ (numpy.ndarray): Array indices of samples with coordinates

  • ‘na_indices’ (numpy.ndarray): Array indices of samples without coordinates

  • ‘known_samples’ (numpy.ndarray): Sample IDs with coordinates

  • ‘na_samples’ (numpy.ndarray): Sample IDs without coordinates

  • ‘n_known’ (int): Count of samples with known coordinates

  • ‘n_na’ (int): Count of samples with NA coordinates

  • ‘total’ (int): Total number of samples

Return type:

dict

Example

>>> locator = Locator(config)
>>> status = locator.get_sample_status(samples)
>>> print(f"Found {status['n_known']} samples with coordinates")
>>> print(f"Found {status['n_na']} samples without coordinates")
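The structure of the returned dictionary can be illustrated with a small stand-alone sketch. `sample_status_sketch` is a hypothetical function written for this illustration, not the library's implementation; it assumes that a sample counts as NA when it is absent from the table or when either x or y is missing.

```python
import numpy as np
import pandas as pd


def sample_status_sketch(samples, sample_data):
    """Hypothetical re-creation of the documented return structure."""
    # Align the coordinate table to the genotype sample order.
    coords = sample_data.set_index("sampleID").reindex(samples)
    # Assumption: absent samples or NaN in x or y count as NA.
    has_coords = coords[["x", "y"]].notna().all(axis=1).to_numpy()
    known_idx = np.flatnonzero(has_coords)
    na_idx = np.flatnonzero(~has_coords)
    return {
        "known_indices": known_idx,
        "na_indices": na_idx,
        "known_samples": samples[known_idx],
        "na_samples": samples[na_idx],
        "n_known": int(known_idx.size),
        "n_na": int(na_idx.size),
        "total": int(samples.size),
    }


samples = np.array(["s1", "s2", "s3"])
sample_data = pd.DataFrame(
    {"sampleID": ["s1", "s2", "s3"], "x": [10.0, np.nan, 12.5], "y": [45.0, np.nan, 44.0]}
)
status = sample_status_sketch(samples, sample_data)
print(status["n_known"], status["n_na"])
```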
check_data(genotypes, samples, verbose=True)

Check data quality and report statistics.

This is a convenience method to help users understand their data before running analyses. It reports the number of samples, SNPs, and identifies samples with missing coordinates.

Parameters:
  • genotypes (numpy.ndarray or allel.GenotypeArray) – Genotype data

  • samples (numpy.ndarray) – Array of sample IDs

  • verbose (bool) – If True, print detailed statistics. Default: True

Returns:

Sample status dictionary from get_sample_status()

Return type:

dict

Example:

>>> locator = Locator(config)
>>> genotypes, samples = locator.load_genotypes()
>>> status = locator.check_data(genotypes, samples)
Data Summary
==================================================
Total samples: 231
Samples with coordinates: 211
Samples without coordinates: 20
Total SNPs: 1000

Current NA handling mode: separate
- Will train on samples with known locations
- Can predict on samples without locations

Samples without coordinates (first 10):
  - sample_001
  - sample_002
  ...
create_ensemble_early_stopping(patience_multiplier=1.5)

Create early stopping callback with ensemble-specific settings.

Parameters:

patience_multiplier – Multiply base patience for ensemble training (ensembles often benefit from longer training)

Returns:

Configured callback

Return type:

keras.callbacks.EarlyStopping

create_ensemble_folds(genotypes, samples, k=5, training_set_indices=None, na_action=None)

Create k-fold splits for ensemble training using IndexSet.

Parameters:
  • genotypes – GenotypeArray containing genetic data

  • samples – Array of sample IDs

  • k – Number of folds (default: 5)

  • training_set_indices – Optional array of indices to use for training+validation. If provided, only these samples will be used to create k-folds.

  • na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

Dictionary with fold information:

  • ‘index_sets’: List of IndexSet objects for each fold

  • ‘fold_indices’: Legacy format dict for backward compatibility

  • ‘sample_status’: Sample status information

Return type:

dict
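The general idea of k-fold splitting can be sketched with NumPy alone. This is a minimal illustration under stated assumptions: `kfold_indices` is a hypothetical helper, it returns plain dictionaries rather than the IndexSet objects mentioned above, and the library's actual shuffling and fold assignment may differ.

```python
import numpy as np


def kfold_indices(indices, k=5, seed=0):
    """Split `indices` into k disjoint validation folds (illustrative only)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(indices)
    val_folds = np.array_split(shuffled, k)
    folds = []
    for val in val_folds:
        train = np.setdiff1d(shuffled, val)  # every index not in this fold
        folds.append({"train": train, "val": np.sort(val)})
    return folds


folds = kfold_indices(np.arange(23), k=5)
for i, fold in enumerate(folds):
    print(i, len(fold["train"]), len(fold["val"]))
```

Passing only the indices of samples with known coordinates would correspond to restricting the folds via training_set_indices.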

create_ensemble_lr_scheduler(fold_idx)

Create learning rate scheduler for ensemble training.

Each fold can start with a slightly different learning rate to improve ensemble diversity.

Parameters:

fold_idx – Current fold index

Returns:

Configured callback

Return type:

keras.callbacks.ReduceLROnPlateau

get_ensemble_batch_size(dataset_size, fold_idx=0)

Determine optimal batch size for ensemble training.

Uses GPUOptimizer to find the best batch size, with caching to avoid recomputing for each fold.

Parameters:
  • dataset_size – Size of training dataset

  • fold_idx – Current fold index (for logging)

Returns:

Optimal batch size

Return type:

int

load_ensemble(ensemble_path)

Load a saved ensemble for prediction.

Parameters:

ensemble_path – Path to the saved ensemble directory

Returns:

Ensemble information including models and parameters

Return type:

dict

load_genotypes(vcf=None, zarr=None, matrix=None, microsat=None, microsat_min_allele_freq=0.01)

Load genotype data from various input sources.

This method can load genotype data from:

  1. A stored DataFrame provided during initialization

  2. A VCF file

  3. A zarr file (scikit-allel or bio2zarr format)

  4. A tab-delimited matrix file

  5. A tab-delimited microsatellite genotype table

For windowed analysis, SNP positions must be available from one of:

  • Column names in the genotype DataFrame

  • The zarr file’s variants/POS array

  • The VCF file’s POS field (automatically loaded)

Parameters:
  • vcf (str, optional) – Path to VCF format genotype data

  • zarr (str, optional) – Path to zarr format genotype data

  • matrix (str, optional) – Path to tab-delimited matrix file

  • microsat (str, optional) – Path to tab-delimited microsatellite genotype table

  • microsat_min_allele_freq (float, optional) – Drop microsat alleles below this per-locus frequency. Default 0.01.

Returns:

(genotypes, samples) where:

  • genotypes is an allel.GenotypeArray of shape (n_sites, n_samples, 2) for VCF/zarr/integer-matrix inputs, or a float32 ndarray of shape (n_sites, n_samples) for continuous-dosage (matrix float / microsat) inputs

  • samples is a numpy array of sample IDs

Return type:

tuple

Examples

>>> # Using stored DataFrame from initialization
>>> locator = Locator({
...     "genotype_data": geno_df,  # DataFrame with genotypes
...     "sample_data": coords_df   # DataFrame with coordinates
... })
>>> genotypes, samples = locator.load_genotypes()
>>> # Using zarr file (recommended for windowed analysis)
>>> locator = Locator({"sample_data": coords_df})
>>> genotypes, samples = locator.load_genotypes(zarr="path/to/geno.zarr")
>>> # Using VCF file
>>> genotypes, samples = locator.load_genotypes(vcf="path/to/geno.vcf")
>>> # Using matrix file
>>> genotypes, samples = locator.load_genotypes(matrix="path/to/geno.txt")
>>> # Using microsatellite genotypes
>>> genotypes, samples = locator.load_genotypes(microsat="path/to/microsats.tsv")
Raises:

ValueError – If no input source is provided or if input format is invalid.
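The relationship between the (n_sites, n_samples, 2) layout and the 0/1/2 genotype counts used by the DataFrame input can be sketched with plain NumPy. The toy array below is an assumption for illustration; it follows the scikit-allel convention of marking missing calls with -1 and does not call the library itself.

```python
import numpy as np

# Toy diploid genotypes: 3 sites x 2 samples x 2 alleles; -1 marks a missing call.
genotypes = np.array(
    [
        [[0, 0], [0, 1]],
        [[1, 1], [-1, -1]],
        [[0, 1], [1, 1]],
    ],
    dtype=np.int8,
)

missing = (genotypes < 0).any(axis=2)            # shape (n_sites, n_samples)
alt_counts = (genotypes > 0).sum(axis=2)         # non-reference alleles per call
alt_counts = np.where(missing, -1, alt_counts)   # keep missing calls flagged

# Transpose to samples x sites, the orientation of the genotype DataFrame input.
sample_by_site = alt_counts.T
print(sample_by_site)
```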

load_model(weights_path)

Load a trained model from saved weights.

This method loads a model from HDF5 weights file and restores the preprocessing parameters needed for making predictions.

Parameters:

weights_path (str) – Path to the saved HDF5 weights file

Returns:

dict

Return type:

Dictionary containing loaded metadata including normalization params

Raises:

ValueError – If weights file cannot be loaded or is missing metadata.

predict(boot=0, verbose=True, prediction_genotypes=None, genotypes=None, samples=None, indices=None, return_df=False, save_preds_to_disk=True, site_order=None)

Make predictions for samples with unknown locations.

Parameters:
  • boot (int, optional) – Bootstrap replicate number. Defaults to 0.

  • verbose (bool, optional) – Whether to print validation metrics. Defaults to True.

  • prediction_genotypes (numpy.ndarray, optional) – Deprecated; use the genotypes parameter instead. Overrides the default prediction genotypes. Used for jackknife resampling. Defaults to None.

  • genotypes (numpy.ndarray, optional) – Full genotype array for creating tf.data dataset. Should be the original unfiltered genotypes. Defaults to None.

  • samples (numpy.ndarray, optional) – Sample IDs corresponding to genotypes. Defaults to None.

  • indices (numpy.ndarray, optional) – Indices of samples to predict on. If None, predicts on samples without coordinates (self.pred_indices). Defaults to None.

  • return_df (bool, optional) – Whether to return predictions as pandas DataFrame. Defaults to False.

  • save_preds_to_disk (bool, optional) – Whether to save predictions to disk. Defaults to True.

  • site_order (np.ndarray, optional) – Array of SNP indices for bootstrap resampling. If provided, SNPs will be reordered according to these indices during prediction. Used for bootstrap analyses to ensure consistent resampling between train and predict.

Returns:

Array of predicted coordinates, or a DataFrame with x, y coordinates and sampleID columns

Return type:

numpy.ndarray or pandas.DataFrame

predict_ensemble(genotypes=None, samples=None, indices=None, include_fold_predictions=False, return_std=False, return_df=True, save_predictions=True)

Make predictions using the ensemble of models.

Parameters:
  • genotypes – GenotypeArray for prediction (if None, uses stored data)

  • samples – Sample IDs (if None, uses stored samples)

  • indices – Specific indices to predict on (if None, predicts all)

  • include_fold_predictions – Include individual fold predictions in output

  • return_std – Return standard deviation across ensemble predictions

  • return_df – Return results as DataFrame (default: True)

  • save_predictions – Save predictions to disk (default: True)

Returns:

Ensemble predictions with optional std

Return type:

pd.DataFrame or np.ndarray

predict_ensemble_from_manager(genotypes, samples, indices=None, return_df=True, save_predictions=True)

Make predictions using loaded ensemble with model manager.

This method efficiently loads models on-demand for prediction, reducing memory usage for large ensembles.

Parameters:
  • genotypes – GenotypeArray for prediction

  • samples – Sample IDs

  • indices – Specific indices to predict on (if None, predicts all)

  • return_df – Return results as DataFrame (default: True)

  • save_predictions – Save predictions to disk (default: True)

Returns:

Ensemble predictions

Return type:

pd.DataFrame or np.ndarray

predict_from_weights(weights_path, genotypes, samples, sample_data_file=None, save_preds_to_disk=True, return_df=True)

Convenience method to load weights and make predictions.

This method combines loading a saved model and making predictions in a single call. It handles preprocessing the genotypes using the same parameters that were used during training.

Parameters:
  • weights_path (str) – Path to saved HDF5 weights file

  • genotypes (numpy.ndarray) – Genotype data to predict on

  • samples (numpy.ndarray) – Sample IDs corresponding to genotypes

  • sample_data_file (str, optional) – Path to sample data file

  • save_preds_to_disk (bool) – Whether to save predictions to disk

  • return_df (bool) – Whether to return predictions as DataFrame

Returns:

Predictions

Return type:

numpy.ndarray or pandas.DataFrame

predict_holdout(verbose=True, return_df=False, save_preds_to_disk=True, plot_summary=True, plot_map=True)

Predict locations for held out samples.

Parameters:
  • verbose – Print progress and metrics

  • return_df – Return predictions as pandas DataFrame

  • save_preds_to_disk – Save predictions to disk

  • plot_summary – Display error summary plot in notebook (only if return_df=True)

  • plot_map – Display map of predictions (only if plot_summary=True)

Returns:

  • If return_df is True, returns pandas DataFrame with predictions

  • Otherwise returns None

run_bootstraps(genotypes, samples, n_bootstraps=50, return_df=False, save_full_pred_matrix=True, na_action=None)

Run bootstrap analysis by resampling SNPs with replacement.

Parameters:
  • genotypes – Array of genotype data

  • samples – Sample IDs corresponding to genotypes

  • n_bootstraps – Number of bootstrap replicates to run

  • return_df – Whether to return DataFrame with all predictions

  • save_full_pred_matrix – Whether to save full prediction matrix to disk

  • na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

If return_df=True, returns DataFrame with predictions for each bootstrap, otherwise None

Return type:

pandas.DataFrame or None

Notes

  • With na_action=‘separate’: Trains on samples with known locations, can predict on samples with NA locations

  • With na_action=‘exclude’: Only uses samples with known locations

  • With na_action=‘fail’: Raises error if any NA samples found
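Resampling SNPs with replacement, as described for run_bootstraps, amounts to drawing a site_order index array and indexing the genotype matrix with it. The sketch below uses toy data and is an assumption about the general technique, not the library's internal code; the same site_order would be supplied to both train and predict so that the two steps see the same resampled sites.

```python
import numpy as np

rng = np.random.default_rng(42)
n_sites, n_samples = 6, 4
genotypes = rng.integers(0, 3, size=(n_sites, n_samples))  # toy allele counts

# Draw site indices with replacement; some sites repeat, others drop out.
site_order = rng.choice(n_sites, size=n_sites, replace=True)
boot_genotypes = genotypes[site_order]

print(site_order)
print(boot_genotypes.shape)
```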

run_holdouts(genotypes, samples, k=10, n_reps=10, holdout_indices=None, holdout_sample_ids=None, return_df=False, save_full_pred_matrix=True, na_action=None)

Run multiple holdout replicates for cross-validation.

Parameters:
  • genotypes – Array of genotype data

  • samples – Sample IDs corresponding to genotypes

  • k – Number of samples to hold out in each replicate

  • n_reps – Number of holdout replicates to run

  • holdout_indices – Optional list of lists, each containing indices to hold out

  • holdout_sample_ids – Optional list of sample IDs to hold out. If provided, these specific samples will be held out (overrides k and holdout_indices). Can be a single list (used for all replicates) or list of lists (different samples per replicate).

  • return_df – Whether to return DataFrame with all predictions

  • save_full_pred_matrix – Whether to save full prediction matrix to disk

  • na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

If return_df=True, a DataFrame with predictions for each holdout replicate, containing columns:

  • sampleID: Sample identifier

  • x_pred: Predicted longitude

  • y_pred: Predicted latitude

  • rep: Replicate number (0 to n_reps-1)

Otherwise None.

Note: True locations are not included. Merge with sample metadata to calculate errors.

Return type:

pandas.DataFrame or None

Notes

  • With na_action=‘separate’: Currently behaves like ‘exclude’ (holdouts must have known locations). Future versions may support predicting NA samples.

  • With na_action=‘exclude’: Only uses samples with known locations (current behavior)

  • With na_action=‘fail’: Raises error if any NA samples found
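A list of lists suitable for holdout_indices can be prepared ahead of time when reproducible holdout sets are wanted. In this sketch `draw_holdouts` is a hypothetical helper, and the assumption is that each replicate holds out k distinct samples drawn from those with known coordinates; unlike k-fold splits, replicates may overlap.

```python
import numpy as np


def draw_holdouts(known_indices, k=10, n_reps=10, seed=0):
    """Hypothetical helper: one sorted list of k held-out indices per replicate."""
    rng = np.random.default_rng(seed)
    return [
        np.sort(rng.choice(known_indices, size=k, replace=False)).tolist()
        for _ in range(n_reps)
    ]


known_indices = np.arange(50)  # stand-in for indices of samples with coordinates
holdout_indices = draw_holdouts(known_indices, k=5, n_reps=3)
for rep, held_out in enumerate(holdout_indices):
    print(rep, held_out)
```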

run_jacknife(genotypes, samples, prop=0.05, return_df=False, save_full_pred_matrix=True, na_action=None)

Run jackknife analysis by dropping SNPs.

Parameters:
  • genotypes – Array of genotype data

  • samples – Sample IDs corresponding to genotypes

  • prop (float, optional) – Proportion of SNPs to drop in each replicate. Defaults to 0.05.

  • return_df (bool, optional) – Whether to return DataFrame of all predictions. Defaults to False.

  • save_full_pred_matrix (bool, optional) – Whether to save the full prediction matrix. Defaults to True.

  • na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

If return_df=True, a DataFrame containing all predictions, with columns named ‘x_0’, ‘y_0’, ‘x_1’, ‘y_1’, etc. for each jackknife replicate. Row index contains sample IDs. Otherwise None.

Return type:

pandas.DataFrame or None

Notes

  • With na_action=‘separate’: Trains on samples with known locations, can predict on samples with NA locations

  • With na_action=‘exclude’: Only uses samples with known locations

  • With na_action=‘fail’: Raises error if any NA samples found
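Dropping a proportion prop of SNPs for a single jackknife replicate can be sketched as follows. This is an illustration on toy data; whether the library removes the selected sites or masks them is not stated in this reference, so removal is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites, n_samples, prop = 200, 5, 0.05
genotypes = rng.integers(0, 3, size=(n_sites, n_samples))  # toy allele counts

# Choose which sites to drop in this replicate, without replacement.
n_drop = int(round(prop * n_sites))
dropped = rng.choice(n_sites, size=n_drop, replace=False)

keep = np.ones(n_sites, dtype=bool)
keep[dropped] = False
jack_genotypes = genotypes[keep]

print(n_drop, jack_genotypes.shape)
```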

run_jacknife_holdouts(genotypes, samples, k=10, prop=0.05, n_boots=50, holdout_indices=None, return_df=False, save_full_pred_matrix=True, na_action=None)

Run jackknife analysis on holdout samples.

Parameters:
  • genotypes – Array of genotype data

  • samples – Sample IDs corresponding to genotypes

  • k – Number of samples to hold out

  • prop – Proportion of SNPs to drop in each jackknife replicate

  • n_boots – Number of jackknife replicates

  • holdout_indices – Optional specific indices to hold out

  • return_df – Whether to return DataFrame with all predictions

  • save_full_pred_matrix – Whether to save full prediction matrix to disk

  • na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

If return_df=True, a DataFrame with predictions for each jackknife replicate, containing columns:

  • sampleID: Sample identifier

  • x_pred: Predicted longitude

  • y_pred: Predicted latitude

  • boot: Jackknife replicate number (0 to n_boots-1)

Otherwise None.

Note: True locations are not included. Merge with sample metadata to calculate errors.

Return type:

pandas.DataFrame or None

Notes

  • With na_action=‘separate’: Currently behaves like ‘exclude’ (holdouts must have known locations). Future versions may support predicting NA samples.

  • With na_action=‘exclude’: Only uses samples with known locations (current behavior)

  • With na_action=‘fail’: Raises error if any NA samples found
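A long-format result with the columns listed above can be summarized per sample with a pandas groupby. The DataFrame below is a toy stand-in built by hand for illustration, not real output from the method.

```python
import pandas as pd

# Toy stand-in with the documented columns: sampleID, x_pred, y_pred, boot.
preds = pd.DataFrame(
    {
        "sampleID": ["s1", "s2", "s1", "s2"],
        "x_pred": [10.0, 20.0, 12.0, 22.0],
        "y_pred": [50.0, 40.0, 52.0, 42.0],
        "boot": [0, 0, 1, 1],
    }
)

# Mean and sample standard deviation of the predictions across replicates.
per_sample = preds.groupby("sampleID")[["x_pred", "y_pred"]].agg(["mean", "std"])
print(per_sample)
```

The spread across replicates gives a rough per-sample indication of how sensitive the prediction is to the particular SNPs dropped.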

run_k_fold_holdouts(genotypes, samples, k=10, return_df=False, save_full_pred_matrix=True, verbose=True, na_action=None)

Run true k-fold cross-validation with nonoverlapping holdout sets.

Parameters:
  • genotypes – Array of genotype data

  • samples – Sample IDs corresponding to genotypes

  • k – Number of folds (holdout sets)

  • return_df – Whether to return DataFrame with all predictions

  • save_full_pred_matrix – Whether to save full prediction matrix to disk

  • verbose – Whether to show training progress and intermediate output

  • na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

If return_df=True, a DataFrame with one prediction per held-out sample, containing columns:

  • sampleID: Sample identifier

  • x_pred: Predicted longitude

  • y_pred: Predicted latitude

Otherwise None.

Note: True locations are not included. To calculate prediction errors, merge the returned DataFrame with your sample metadata using the sampleID column.

Return type:

pandas.DataFrame or None

Notes

  • With na_action=‘separate’: Currently behaves like ‘exclude’ (k-fold requires known locations). Future versions may support predicting NA samples.

  • With na_action=‘exclude’: Only uses samples with known locations (current behavior)

  • With na_action=‘fail’: Raises error if any NA samples found

Example

>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Run k-fold cross-validation
>>> predictions = locator.run_k_fold_holdouts(genotypes, samples, k=10, return_df=True)
>>>
>>> # Merge with true locations to calculate errors
>>> sample_data = pd.read_csv('samples.tsv', sep='\t')
>>> merged = predictions.merge(sample_data[['sampleID', 'x', 'y']], on='sampleID')
>>> # Rough conversion: Euclidean distance in degrees times ~111.32 km per degree.
>>> # This overstates east-west distances away from the equator.
>>> merged['error_km'] = np.sqrt(
...     (merged['x'] - merged['x_pred'])**2 +
...     (merged['y'] - merged['y_pred'])**2
... ) * 111.32
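The factor of 111.32 km per degree in the example treats a degree of longitude and a degree of latitude as equal lengths, which only holds near the equator. Where that matters, a great-circle distance may be a better error measure. The sketch below is a standard haversine formula; `haversine_km` is a helper defined here for illustration (not part of the library), and it assumes x is longitude and y is latitude in decimal degrees on a spherical Earth of radius 6371 km.

```python
import numpy as np


def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance in km between points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))


# One degree of longitude at the equator versus at 60 degrees north.
print(haversine_km(0.0, 0.0, 1.0, 0.0))
print(haversine_km(0.0, 60.0, 1.0, 60.0))
```

Because the function is built from NumPy ufuncs, it also accepts pandas columns, for example haversine_km(merged['x'], merged['y'], merged['x_pred'], merged['y_pred']).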
run_leave_one_out(genotypes, samples, return_df=True, save_full_pred_matrix=True, na_action=None)

Perform leave-one-out cross-validation: for each sample with a known location, train without it and predict its location.

This is a convenience wrapper around run_k_fold_holdouts with k equal to the number of samples with known locations.

Parameters:
  • genotypes – Array of genotype data

  • samples – Sample IDs corresponding to genotypes

  • return_df – Whether to return DataFrame with all predictions

  • save_full_pred_matrix – Whether to save full prediction matrix to disk

  • na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

DataFrame with predictions for each left-out sample

Return type:

pandas.DataFrame or None

run_windows(genotypes, samples, window_start=0, window_size=500000.0, window_stop=None, respect_chromosomes=True, return_df=False, save_full_pred_matrix=True, na_action=None)

Run windowed prediction analysis.

Parameters:
  • genotypes – GenotypeArray containing genetic data

  • samples – Array of sample IDs

  • window_start – Start position for windows (default: 0)

  • window_size – Size of windows in base pairs (default: 500kb)

  • window_stop – Stop position for windows (default: None)

  • respect_chromosomes – Whether to respect chromosome boundaries when creating windows (default: True). If True, windows will not span chromosome boundaries. Requires chromosome information from VCF/Zarr input.

  • return_df – Whether to return DataFrame with all predictions

  • save_full_pred_matrix – Whether to save full prediction matrix to disk

  • na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

If return_df=True, returns DataFrame with predictions for each window, otherwise None

Return type:

pandas.DataFrame or None

Notes

  • With na_action=‘separate’: Trains on samples with known locations, can predict on samples with NA locations

  • With na_action=‘exclude’: Only uses samples with known locations

  • With na_action=‘fail’: Raises error if any NA samples found

Warning

When respect_chromosomes=False, window analysis treats all SNP positions as continuous along a single coordinate axis. If your data contains multiple chromosomes, windows may span across chromosome boundaries. Use respect_chromosomes=True (default) for biologically meaningful windows.
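The effect of respecting chromosome boundaries can be illustrated by building window masks per chromosome. `window_masks` is a hypothetical helper written for this sketch; the library's own windowing logic, including how it treats empty windows and window_stop, may differ.

```python
import numpy as np


def window_masks(chrom, pos, window_size=500_000, window_start=0, window_stop=None):
    """Yield (chromosome, start, stop, mask) for each non-empty window."""
    chrom = np.asarray(chrom)
    pos = np.asarray(pos)
    for c in np.unique(chrom):
        on_chrom = chrom == c
        # Without an explicit stop, run to the last site on this chromosome.
        stop = window_stop if window_stop is not None else pos[on_chrom].max() + 1
        for start in np.arange(window_start, stop, window_size):
            mask = on_chrom & (pos >= start) & (pos < start + window_size)
            if mask.any():
                yield c, int(start), int(start + window_size), mask


chrom = np.array(["chr1", "chr1", "chr1", "chr2", "chr2"])
pos = np.array([100, 400_000, 900_000, 50, 600_000])
for c, start, stop, mask in window_masks(chrom, pos):
    print(c, start, stop, int(mask.sum()))
```

Each mask selects the sites of one window on one chromosome, so no window mixes SNPs from different chromosomes even when their positions overlap numerically.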

run_windows_holdouts(genotypes, samples, k=10, window_start=0, window_size=500000.0, window_stop=None, respect_chromosomes=True, holdout_indices=None, holdout_sample_ids=None, return_df=False, save_full_pred_matrix=True, na_action=None)

Run windowed analysis on holdout samples.

Parameters:
  • genotypes – Array of genotype data

  • samples – Sample IDs corresponding to genotypes

  • k – Number of samples to hold out

  • window_start – Start position for windows

  • window_size – Size of windows in base pairs

  • window_stop – Stop position for windows

  • respect_chromosomes – Whether to respect chromosome boundaries when creating windows (default: True). If True, windows will not span chromosome boundaries. Requires chromosome information from VCF/Zarr input.

  • holdout_indices – Optional specific indices to hold out

  • holdout_sample_ids – Optional list of sample IDs to hold out. If provided, these specific samples will be held out (overrides k and holdout_indices).

  • return_df – Whether to return DataFrame with all predictions

  • save_full_pred_matrix – Whether to save full prediction matrix to disk

  • na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

If return_df=True, returns DataFrame with predictions for each window, otherwise None

Return type:

pandas.DataFrame or None

Notes

  • With na_action=‘separate’: Currently behaves like ‘exclude’ (holdouts must have known locations). Future versions may support predicting NA samples.

  • With na_action=‘exclude’: Only uses samples with known locations (current behavior)

  • With na_action=‘fail’: Raises error if any NA samples found

Warning

When respect_chromosomes=False, window analysis treats all SNP positions as continuous along a single coordinate axis. If your data contains multiple chromosomes, windows may span across chromosome boundaries. Use respect_chromosomes=True (default) for biologically meaningful windows.

set_sample_weights(wdict)

Set sample weights for training.

Parameters:

wdict (dict) – Dictionary returned by utils.weight_samples() containing sample weights.

setup_ensemble_gpu_optimization(use_mixed_precision=None)

Setup GPU optimizations for ensemble training.

Parameters:

use_mixed_precision – Whether to use mixed precision training. If None, uses config value or auto-detects based on GPU.

Returns:

Whether mixed precision was enabled

Return type:

bool

sort_samples(samples=None, sample_data_file=None, reorder=True)

Sort samples and match with location data.

Matches samples with their location data and ensures consistent ordering between genotype and location data.

Parameters:
  • samples (numpy.ndarray) – Array of sample IDs from the genotype data

  • sample_data_file (str, optional) – Override path to tab-delimited file with columns ‘sampleID’, ‘x’, ‘y’. If not provided, uses stored sample data.

  • reorder (bool) – If True, automatically reorder metadata to match genotype order. If False, raise error on order mismatch (default: True)

Returns:

(sample_data DataFrame, locs array of shape (n_samples, 2))

Return type:

tuple
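Reordering metadata to follow the genotype sample order can be sketched with pandas label-based indexing. This stand-alone example uses toy data and is not the library's implementation; raising an error when a genotype sample is absent from the metadata is an assumption made for the sketch.

```python
import numpy as np
import pandas as pd

samples = np.array(["s3", "s1", "s2"])  # order of the genotype data
sample_data = pd.DataFrame(
    {"sampleID": ["s1", "s2", "s3"], "x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]}
)

# Assumption for this sketch: every genotype sample must appear in the metadata.
missing = set(samples) - set(sample_data["sampleID"])
if missing:
    raise ValueError(f"samples missing from metadata: {sorted(missing)}")

# Reorder metadata rows to follow the genotype sample order.
ordered = sample_data.set_index("sampleID").loc[samples].reset_index()
locs = ordered[["x", "y"]].to_numpy()  # shape (n_samples, 2)

print(ordered["sampleID"].tolist(), locs.shape)
```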

train(*, genotypes, samples, sample_data_file=None, boot=None, train_gen=None, test_gen=None, pred_gen=None, train_locs=None, test_locs=None, setup_only=False, na_action=None, site_order=None)

Train the Locator model on genotype and location data.

This method trains the neural network model to predict geographic locations from genetic data. It supports both standard training and advanced workflows such as bootstrapping, by accepting pre-processed genotype and location arrays. The model is configured using the parameters provided at initialization.

Parameters:
  • genotypes (allel.GenotypeArray or np.ndarray) – Genotype data for all samples. Should be of shape (n_sites, n_samples, ploidy).

  • samples (np.ndarray) – Array of sample IDs corresponding to the genotype data.

  • sample_data_file (str, optional) – Path to a tab-delimited file with columns ‘sampleID’, ‘x’, ‘y’ for sample locations. Used if not provided in config or as a DataFrame.

  • boot (int, optional) – Bootstrap replicate number. Used for bootstrapping analyses. Defaults to None.

  • train_gen (np.ndarray, optional) – Pre-processed training genotype data. Used for bootstrapping. If None, will be generated from genotypes. Defaults to None.

  • test_gen (np.ndarray, optional) – Pre-processed test genotype data. Used for bootstrapping. If None, will be generated from genotypes. Defaults to None.

  • pred_gen (np.ndarray, optional) – Pre-processed prediction genotype data. Used for bootstrapping. If None, will be generated from genotypes. Defaults to None.

  • train_locs (np.ndarray, optional) – Pre-processed training locations. Used for bootstrapping. If None, will be generated from sample data. Defaults to None.

  • test_locs (np.ndarray, optional) – Pre-processed test locations. Used for bootstrapping. If None, will be generated from sample data. Defaults to None.

  • setup_only (bool, optional) – If True, only sets up the model and data without training. Defaults to False.

  • na_action (str, optional) – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action. Defaults to None.

  • site_order (np.ndarray, optional) – Array of SNP indices for bootstrap resampling. If provided, SNPs will be reordered according to these indices during training. Used for bootstrap analyses to resample SNPs with replacement.

Returns:

The Keras training history object if training is performed, or None if setup_only is True.

Return type:

keras.callbacks.History or None

Raises:

ValueError – If required sample data is missing or improperly formatted.

Example

>>> # Standard training
>>> loc = Locator({"out": "analysis", "sample_data": "samples.txt", "zarr": "genotypes.zarr"})
>>> genotypes, samples = loc.load_genotypes(zarr="genotypes.zarr")
>>> history = loc.train(genotypes=genotypes, samples=samples)
>>> # Bootstrapping with pre-processed data
>>> history = loc.train(
...     genotypes=None,
...     samples=samples,
...     boot=1,
...     train_gen=boot_train_gen,
...     test_gen=boot_test_gen,
...     pred_gen=boot_pred_gen,
...     train_locs=boot_train_locs,
...     test_locs=boot_test_locs
... )
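
The site_order argument is described as an array of SNP indices drawn with replacement. A minimal sketch of how such an index list could be built is shown below, using only the standard library. `make_site_order` is a hypothetical helper, not part of Locator; in practice a NumPy array would be passed, and the toy `sites` list stands in for the rows of a genotype matrix.

```python
import random


def make_site_order(n_sites, seed=None):
    """Draw n_sites SNP indices with replacement (bootstrap over sites)."""
    rng = random.Random(seed)
    return [rng.randrange(n_sites) for _ in range(n_sites)]


site_order = make_site_order(n_sites=8, seed=1)

# Reorder a toy per-site list the way genotype rows would be resampled.
sites = [f"snp{i}" for i in range(8)]
resampled = [sites[i] for i in site_order]
```

Because sampling is with replacement, some SNPs typically appear more than once in `resampled` and others not at all.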
train_ensemble(genotypes, samples, k=5, training_set_indices=None, na_action=None, augment_data=False, flip_rate=0.05, save_fold_models=True, verbose=True, use_model_manager=True, use_mixed_precision=None, patience_multiplier=1.0)

Train an ensemble of k models using k-fold cross-validation.

This method trains k models, each on a different k-fold split of the data. It uses the modern tf.data pipeline for memory efficiency and supports all standard Locator features including NA handling and data augmentation.

Parameters:
  • genotypes – GenotypeArray containing genetic data

  • samples – Array of sample IDs

  • k – Number of folds/models in ensemble (default: 5)

  • training_set_indices – Optional array of indices to restrict training

  • na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’)

  • augment_data – Whether to apply data augmentation (default: False)

  • flip_rate – Rate for genotype flipping augmentation (default: 0.05)

  • save_fold_models – Whether to save individual fold models (default: True)

  • verbose – Whether to show training progress (default: True)

  • use_model_manager – Whether to use model manager for saving (default: True)

  • use_mixed_precision – Whether to use mixed precision training (default: None, auto-detect)

  • patience_multiplier – Multiply patience for ensemble training (default: 1.0)

Returns:

Dictionary containing:

  • ‘histories’: List of training histories for each fold

  • ‘models’: List of trained model configurations

  • ‘normalization_params’: Averaged normalization parameters

  • ‘fold_info’: Information about fold splits

Return type:

dict

train_holdout(genotypes=None, samples=None, k=10, holdout_indices=None, filtered_genotypes=None)

Train the model while holding out samples with known locations.

Parameters:
  • genotypes – Array of genotype data. Required unless filtered_genotypes is provided.

  • samples – Sample IDs corresponding to genotypes

  • k – Number of samples to hold out (ignored if holdout_indices provided)

  • holdout_indices – Optional specific indices of samples to hold out

  • filtered_genotypes – Pre-filtered allele count array. If provided, skips internal filter_snps call and avoids loading the full genotype array. Used by parallel dispatch to share one filtered copy across all workers.

Return type:

keras.callbacks.History object from model training
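
The interplay between k and holdout_indices can be sketched as follows. This is an illustration under assumptions, not the method's actual code: `choose_holdout` is a hypothetical helper, and uniform random sampling of the k held-out samples is assumed.

```python
import random


def choose_holdout(n_samples, k=10, holdout_indices=None, seed=None):
    """Return (train, holdout) index lists; explicit indices take priority over k."""
    if holdout_indices is not None:
        holdout = sorted(set(holdout_indices))
    else:
        if k > n_samples:
            raise ValueError("k cannot exceed the number of samples")
        holdout = sorted(random.Random(seed).sample(range(n_samples), k))
    held = set(holdout)
    train = [i for i in range(n_samples) if i not in held]
    return train, holdout


train, holdout = choose_holdout(20, k=5, seed=0)
explicit_train, explicit_holdout = choose_holdout(6, k=3, holdout_indices=[4, 1])
```

In the second call k is ignored because explicit indices are supplied, matching the parameter description above.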

train_window(genotypes, samples, window_snp_indices, index_set, normalized_locs)

Train the model for a specific genomic window using efficient tf.data pipeline.

This is an internal method used by run_windows_holdouts to train models on specific genomic windows without creating intermediate arrays.

Parameters:
  • genotypes – Full genotype array (not filtered)

  • samples – Sample IDs

  • window_snp_indices – Indices of SNPs in this window

  • index_set – Pre-computed IndexSet with train/test/holdout splits

  • normalized_locs – Pre-normalized location coordinates

Return type:

keras.callbacks.History object from model training

Ensemble Functionality

The ensemble functionality is integrated into the main Locator class through the EnsembleMixin.

EnsembleMixin

class EnsembleMixin[source]

Bases: object

Mixin class providing ensemble functionality for Locator.

create_ensemble_folds(genotypes, samples, k=5, training_set_indices=None, na_action=None)[source]

Create k-fold splits for ensemble training using IndexSet.

Parameters:
  • genotypes – GenotypeArray containing genetic data

  • samples – Array of sample IDs

  • k – Number of folds (default: 5)

  • training_set_indices – Optional array of indices to use for training+validation. If provided, only these samples will be used to create k-folds.

  • na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action

Returns:

Dictionary with fold information:

  • ‘index_sets’: List of IndexSet objects for each fold

  • ‘fold_indices’: Legacy format dict for backward compatibility

  • ‘sample_status’: Sample status information

Return type:

dict

train_ensemble(genotypes, samples, k=5, training_set_indices=None, na_action=None, augment_data=False, flip_rate=0.05, save_fold_models=True, verbose=True, use_model_manager=True, use_mixed_precision=None, patience_multiplier=1.0)[source]

Train an ensemble of k models using k-fold cross-validation.

This method trains k models, each on a different k-fold split of the data. It uses the modern tf.data pipeline for memory efficiency and supports all standard Locator features including NA handling and data augmentation.

Parameters:
  • genotypes – GenotypeArray containing genetic data

  • samples – Array of sample IDs

  • k – Number of folds/models in ensemble (default: 5)

  • training_set_indices – Optional array of indices to restrict training

  • na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’)

  • augment_data – Whether to apply data augmentation (default: False)

  • flip_rate – Rate for genotype flipping augmentation (default: 0.05)

  • save_fold_models – Whether to save individual fold models (default: True)

  • verbose – Whether to show training progress (default: True)

  • use_model_manager – Whether to use model manager for saving (default: True)

  • use_mixed_precision – Whether to use mixed precision training (default: None, auto-detect)

  • patience_multiplier – Multiply patience for ensemble training (default: 1.0)

Returns:

Dictionary containing:

  • ‘histories’: List of training histories for each fold

  • ‘models’: List of trained model configurations

  • ‘normalization_params’: Averaged normalization parameters

  • ‘fold_info’: Information about fold splits

Return type:

dict

predict_ensemble(genotypes=None, samples=None, indices=None, include_fold_predictions=False, return_std=False, return_df=True, save_predictions=True)[source]

Make predictions using the ensemble of models.

Parameters:
  • genotypes – GenotypeArray for prediction (if None, uses stored data)

  • samples – Sample IDs (if None, uses stored samples)

  • indices – Specific indices to predict on (if None, predicts all)

  • include_fold_predictions – Include individual fold predictions in output

  • return_std – Return standard deviation across ensemble predictions

  • return_df – Return results as DataFrame (default: True)

  • save_predictions – Save predictions to disk (default: True)

Returns:

Ensemble predictions with optional std

Return type:

pd.DataFrame or np.ndarray
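
As a rough picture of what an ensemble prediction with a standard deviation looks like, the sketch below combines per-fold predictions for each sample. It assumes the ensemble estimate is the per-sample mean across folds and that the spread is a population standard deviation; the reference does not state either, and `combine_fold_predictions` is a hypothetical name.

```python
from statistics import fmean, pstdev


def combine_fold_predictions(fold_predictions):
    """fold_predictions[f][i] is the (x, y) prediction of fold f for sample i."""
    n_samples = len(fold_predictions[0])
    combined = []
    for i in range(n_samples):
        xs = [fold[i][0] for fold in fold_predictions]
        ys = [fold[i][1] for fold in fold_predictions]
        combined.append(
            {"x": fmean(xs), "y": fmean(ys), "x_std": pstdev(xs), "y_std": pstdev(ys)}
        )
    return combined


# Two folds, one sample: the folds disagree by 2 units in x and 4 units in y.
folds = [[(1.0, 2.0)], [(3.0, 6.0)]]
result = combine_fold_predictions(folds)
```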

load_ensemble(ensemble_path)[source]

Load a saved ensemble for prediction.

Parameters:

ensemble_path – Path to the saved ensemble directory

Returns:

Ensemble information including models and parameters

Return type:

dict

predict_ensemble_from_manager(genotypes, samples, indices=None, return_df=True, save_predictions=True)[source]

Make predictions using loaded ensemble with model manager.

This method efficiently loads models on-demand for prediction, reducing memory usage for large ensembles.

Parameters:
  • genotypes – GenotypeArray for prediction

  • samples – Sample IDs

  • indices – Specific indices to predict on (if None, predicts all)

  • return_df – Return results as DataFrame (default: True)

  • save_predictions – Save predictions to disk (default: True)

Returns:

Ensemble predictions

Return type:

pd.DataFrame or np.ndarray

setup_ensemble_gpu_optimization(use_mixed_precision=None)[source]

Setup GPU optimizations for ensemble training.

Parameters:

use_mixed_precision – Whether to use mixed precision training. If None, uses config value or auto-detects based on GPU.

Returns:

Whether mixed precision was enabled

Return type:

bool

get_ensemble_batch_size(dataset_size, fold_idx=0)[source]

Determine optimal batch size for ensemble training.

Uses GPUOptimizer to find the best batch size, with caching to avoid recomputing for each fold.

Parameters:
  • dataset_size – Size of training dataset

  • fold_idx – Current fold index (for logging)

Returns:

Optimal batch size

Return type:

int

create_ensemble_early_stopping(patience_multiplier=1.5)[source]

Create early stopping callback with ensemble-specific settings.

Parameters:

patience_multiplier – Multiply base patience for ensemble training (ensembles often benefit from longer training)

Returns:

Configured callback

Return type:

keras.callbacks.EarlyStopping

create_ensemble_lr_scheduler(fold_idx)[source]

Create learning rate scheduler for ensemble training.

Each fold can start with a slightly different learning rate to improve ensemble diversity.

Parameters:

fold_idx – Current fold index

Returns:

Configured callback

Return type:

keras.callbacks.ReduceLROnPlateau

EnsembleModelManager

class EnsembleModelManager(base_path: str)[source]

Bases: object

Manages multiple models for ensemble predictions.

This class handles:

  • Saving and loading ensemble models with metadata

  • Lazy loading of model weights

  • Efficient storage of normalization parameters

  • Model versioning and validation

__init__(base_path: str)[source]

Initialize model manager.

Parameters:

base_path – Base path for saving/loading models

save_ensemble(models_info: List[Dict], ensemble_metadata: Dict | None = None) None[source]

Save ensemble models and metadata.

Parameters:
  • models_info – List of model info dictionaries from training

  • ensemble_metadata – Optional metadata about the ensemble

load_ensemble(model_builder_fn=None) List[Dict][source]

Load ensemble models and metadata.

Parameters:

model_builder_fn – Function to build model architecture

Return type:

List of model info dictionaries

get_model(fold: int, n_features: int) Model[source]

Get a specific model, loading if necessary.

Parameters:
  • fold – Fold index

  • n_features – Number of features for model construction

Return type:

Loaded model

get_normalization_params(fold: int) NormalizationParams[source]

Get normalization parameters for a specific fold.

Parameters:

fold – Fold index

Return type:

NormalizationParams instance

get_averaged_normalization_params() NormalizationParams[source]

Get averaged normalization parameters across all folds.

Return type:

Averaged NormalizationParams

save_predictions(predictions: DataFrame, prediction_type: str = 'ensemble') None[source]

Save predictions to disk.

Parameters:
  • predictions – DataFrame with predictions

  • prediction_type – Type of predictions (e.g., “ensemble”, “fold_0”)

clear_cache() None[source]

Clear loaded models from memory.
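
The lazy-loading pattern behind get_model and clear_cache can be sketched in a few lines. This is a generic illustration, not the EnsembleModelManager source: `LazyModelCache` and `fake_loader` are hypothetical, and a string stands in for a Keras model.

```python
class LazyModelCache:
    """Load each fold's model on first request and keep it until the cache is cleared."""

    def __init__(self, loader):
        self._loader = loader  # callable: fold index -> model object
        self._cache = {}

    def get_model(self, fold):
        if fold not in self._cache:
            self._cache[fold] = self._loader(fold)
        return self._cache[fold]

    def clear_cache(self):
        self._cache.clear()


load_calls = []


def fake_loader(fold):
    load_calls.append(fold)  # record every real load
    return f"model-for-fold-{fold}"


cache = LazyModelCache(fake_loader)
first = cache.get_model(0)
second = cache.get_model(0)  # served from the cache, no second load
cache.clear_cache()
third = cache.get_model(0)  # loaded again after clearing
```

Only models that are actually requested occupy memory, which is the benefit the class description attributes to lazy loading.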

Parallel Ensemble Training

The parallel ensemble training function is available when Ray is installed:

from locator.parallel import parallel_train_ensemble
parallel_train_ensemble(locator, genotypes, samples, k=5, gpu_ids=[0, 1], gpu_fraction=1.0, training_set_indices=None, na_action=None, augment_data=False, flip_rate=0.05, save_fold_models=True, use_model_manager=True, use_mixed_precision=None, patience_multiplier=1.0, verbose=True)

Train ensemble models in parallel across multiple GPUs using Ray.

Parameters:
  • locator – Locator instance with configuration

  • genotypes – GenotypeArray containing genetic data

  • samples – Array of sample IDs

  • k – Number of folds/models in ensemble (default: 5)

  • gpu_ids – List of GPU IDs to use (default: [0, 1])

  • gpu_fraction – Fraction of GPU memory per worker (default: 1.0)

  • training_set_indices – Optional indices to restrict training

  • na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’)

  • augment_data – Whether to apply data augmentation

  • flip_rate – Rate for genotype flipping augmentation

  • save_fold_models – Whether to save individual fold models

  • use_model_manager – Whether to use model manager for storage

  • use_mixed_precision – Whether to use mixed precision training

  • patience_multiplier – Multiply patience for ensemble training

  • verbose – Whether to show training progress

Returns:

dict containing histories, models, normalization_params, fold_info

Note

This function requires Ray to be installed. Install with pip install locator[ray].
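
Since the parallel path depends on an optional package, calling code may want to check for Ray before importing `locator.parallel`. The sketch below uses only the standard library; `ray_available` and `pick_training_mode` are hypothetical helpers, and falling back to the sequential `train_ensemble` method is one possible design, not a documented behaviour.

```python
import importlib.util


def ray_available():
    """Return True if the optional 'ray' package can be imported."""
    return importlib.util.find_spec("ray") is not None


def pick_training_mode():
    """Choose the parallel path only when Ray is installed."""
    return "parallel" if ray_available() else "sequential"


mode = pick_training_mode()
```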

Models Module

create_network(input_shape: int, width: int = 256, n_layers: int = 8, dropout_prop: float = 0.25, pca_components: int | None = None, optimizer_config: dict | None = None, loss_fn: callable | None = None) Model[source]

Create a neural network model for geographic location prediction.

Parameters:
  • input_shape (int) – Number of input features (SNPs).

  • width (int, optional) – Width of the dense layers, defaults to 256.

  • n_layers (int, optional) – Total number of dense layers (excluding final layers), defaults to 8.

  • dropout_prop (float, optional) – Dropout proportion for middle dropout layer, defaults to 0.25.

  • pca_components (int, optional) – If set, prepend a linear projection layer named “pca_projection” of this width as the first layer. The caller is responsible for initializing its weights with PCA loadings. Defaults to None (no projection layer).

  • optimizer_config (dict, optional) – Configuration for the optimizer. Should be a dict containing keys: “algo” (str): “adam” or “adamw”; “learning_rate” (float); “weight_decay” (float, only used for “adamw”). Defaults to None (uses Adam with default settings).

  • loss_fn (callable, optional) – Loss function to use. Defaults to None, in which case euclidean_distance_loss is used.

Returns:

Compiled Keras model ready for training.

Return type:

keras.Model

Example

>>> model = create_network(input_shape=1000)
>>> model.summary()
loss_with_range_penalty(y_true, y_pred, mask_tensor, transform, resolution, penalty_weight=1.0)[source]
rasterize_species_range(shapefile_path, resolution=0.1)[source]

Data Module

This module contains the memory-efficient data pipeline components.

IndexSet

class IndexSet(indices: Dict[str, ndarray], total_samples: int, na_mask: ndarray | None = None)[source]

Bases: object

Container for dataset indices that avoids copying data.

This class stores indices for different data splits (train/val/test) to enable memory-efficient data access without creating copies of large genotype arrays.

Variables:
  • indices (Dictionary mapping split names to numpy arrays of indices)

  • total_samples (Total number of samples in the dataset)

  • na_mask (Optional boolean mask indicating samples without coordinates)

indices: Dict[str, ndarray]
total_samples: int
na_mask: ndarray | None = None
__post_init__()[source]

Validate the IndexSet after initialization.

property train: ndarray

Get training indices (backward compatibility).

property val: ndarray

Get validation indices (backward compatibility).

property test: ndarray

Get test indices (backward compatibility).

property hold: ndarray

Get holdout/prediction indices (backward compatibility).

get_split(name: str) ndarray[source]

Get indices for a named split.

split_sizes() Dict[str, int][source]

Get the size of each split.

classmethod random_split(n: int, splits: Dict[str, float] | None = None, seed: int | None = None, na_mask: ndarray | None = None, na_action: str = 'separate') IndexSet[source]

Create random train/val/test splits.

Parameters:
  • n – Total number of samples

  • splits – Dictionary mapping split names to proportions (must sum to ≤ 1.0). Default: {“train”: 0.8, “val”: 0.1, “test”: 0.1}

  • seed – Random seed for reproducibility

  • na_mask – Boolean mask indicating samples without coordinates

  • na_action – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’)

Return type:

IndexSet with random splits
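
A pure-Python sketch of proportion-based splitting is given below to make the splits argument concrete. It is not the IndexSet implementation: `random_split_sketch` is a hypothetical function, split sizes are assumed to be truncated with int(), and NA handling is omitted.

```python
import random


def random_split_sketch(n, splits=None, seed=None):
    """Shuffle indices 0..n-1 and cut them into named, non-overlapping splits."""
    if splits is None:
        splits = {"train": 0.8, "val": 0.1, "test": 0.1}
    if sum(splits.values()) > 1.0 + 1e-9:
        raise ValueError("split proportions must sum to at most 1.0")
    order = list(range(n))
    random.Random(seed).shuffle(order)
    result, start = {}, 0
    for name, fraction in splits.items():
        size = int(n * fraction)
        result[name] = sorted(order[start:start + size])
        start += size
    return result


parts = random_split_sketch(100, seed=42)
```

When the proportions sum to less than 1.0, the remaining indices are simply left unassigned in this sketch.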

classmethod from_k_fold(n: int, k: int, fold: int, seed: int | None = None, na_mask: ndarray | None = None) IndexSet[source]

Create train/test split for k-fold cross-validation.

Parameters:
  • n – Total number of samples

  • k – Number of folds

  • fold – Which fold to use as test set (0-indexed)

  • seed – Random seed for reproducibility

  • na_mask – Boolean mask indicating samples without coordinates

Return type:

IndexSet with train and test splits

classmethod from_groups(groups: ndarray, test_groups: List[int | str], na_mask: ndarray | None = None) IndexSet[source]

Create train/test split based on group membership.

Useful for spatial or temporal cross-validation where you want to hold out entire groups (e.g., geographic regions).

Parameters:
  • groups – Array of group labels for each sample

  • test_groups – List of group labels to use as test set

  • na_mask – Boolean mask indicating samples without coordinates

Return type:

IndexSet with train and test splits
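
Group-based splitting reduces to a membership test per sample. The sketch below shows the idea with plain lists; `split_by_groups` is a hypothetical stand-in for the classmethod and ignores na_mask.

```python
def split_by_groups(groups, test_groups):
    """Send every sample whose group label is in test_groups to the test split."""
    held_out = set(test_groups)
    train = [i for i, g in enumerate(groups) if g not in held_out]
    test = [i for i, g in enumerate(groups) if g in held_out]
    return {"train": train, "test": test}


# Hold out every sample from the "south" region.
groups = ["north", "south", "north", "east", "south"]
split = split_by_groups(groups, test_groups=["south"])
```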

classmethod from_manual(train: ndarray, test: ndarray | None = None, val: ndarray | None = None, predict: ndarray | None = None, total_samples: int | None = None) IndexSet[source]

Create IndexSet from manually specified indices.

Parameters:
  • train – Training indices

  • test – Test indices

  • val – Validation indices

  • predict – Prediction indices (samples without labels)

  • total_samples – Total number of samples (inferred if not provided)

Return type:

IndexSet with specified splits

classmethod k_fold_split(n: int, k: int, seed: int | None = None, na_mask: ndarray | None = None) List[IndexSet][source]

Create all k-fold cross-validation splits at once.

This method generates k IndexSet objects, one for each fold, suitable for ensemble training or cross-validation.

Parameters:
  • n – Total number of samples

  • k – Number of folds

  • seed – Random seed for reproducibility

  • na_mask – Boolean mask indicating samples to exclude from k-fold (e.g., samples without coordinates or not in training set)

Return type:

List of k IndexSet objects, one for each fold
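
The following sketch illustrates k-fold index generation with the standard library. It is an assumption-laden illustration rather than the IndexSet code: `k_fold_indices` is hypothetical, folds are formed by striding over a shuffled index list, and na_mask is not handled.

```python
import random


def k_fold_indices(n, k, seed=None):
    """Return k dicts, each using a different fold as the test split."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k nearly equal parts
    splits = []
    for fold in range(k):
        test = sorted(folds[fold])
        train = sorted(i for j, part in enumerate(folds) if j != fold for i in part)
        splits.append({"train": train, "test": test})
    return splits


splits = k_fold_indices(n=10, k=5, seed=0)
```

Every sample lands in exactly one test split, so the k trained models together cover the whole dataset.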

__init__(indices: Dict[str, ndarray], total_samples: int, na_mask: ndarray | None = None) None

Data Pipeline Functions

make_tf_dataset(coordinates: ndarray, index_set: IndexSet, split: str, batch_size: int = 256, sample_weights: ndarray | None = None, training: bool = True, shuffle: bool = True, drop_remainder: bool | None = None, prefetch: bool = True) DatasetV2[source]

Create an index-based tf.data pipeline for training or validation.

The pipeline carries only sample indices and their coordinates – a few kilobytes per batch. Genotypes are gathered on the GPU inside IndexedGenotypeModel, so the genotype matrix never enters this pipeline and there is no per-epoch host-to-device genotype traffic.

Parameters:
  • coordinates – Full coordinate array of shape (n_samples, 2).

  • index_set – IndexSet containing the train/val/test/predict splits.

  • split – Which split to use (‘train’, ‘val’, ‘test’, ‘predict’).

  • batch_size – Batch size for the dataset.

  • sample_weights – Optional per-sample weights, aligned to the split’s index order (length must equal the split size).

  • training – Whether this is for training (enables shuffling).

  • shuffle – Whether to shuffle the split each epoch (only when training).

  • drop_remainder – Whether to drop the final partial batch (defaults to the value of training).

  • prefetch – Whether to prefetch batches.

Returns:

A tf.data.Dataset yielding (sample_index, coordinate) batches, or (sample_index, coordinate, sample_weight) batches when weights are given.
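
The index-based design can be mimicked without TensorFlow: batches carry only sample indices and coordinates, never genotypes. The generator below is a conceptual sketch, with `index_batches` as a hypothetical name; it does not reproduce prefetching, sample weights, or the GPU-side genotype gather.

```python
import random


def index_batches(coordinates, split_indices, batch_size=256,
                  shuffle=True, drop_remainder=False, seed=None):
    """Yield (indices, coordinates) batches for one split."""
    order = list(split_indices)
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        if drop_remainder and len(batch) < batch_size:
            break  # skip the final partial batch
        yield batch, [coordinates[i] for i in batch]


coordinates = [(float(i), 2.0 * i) for i in range(10)]
train_indices = [0, 2, 4, 6, 8]
batches = list(index_batches(coordinates, train_indices, batch_size=2, shuffle=False))
full_only = list(index_batches(coordinates, train_indices, batch_size=2,
                               shuffle=False, drop_remainder=True))
```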

Preprocessing Functions

filter_snps(genotypes, min_mac: int = 1, max_snps: int | None = None, impute: bool = False, verbose: bool = False) Tuple[ndarray, FilterStats][source]

Filter SNPs based on criteria and return statistics.

Parameters:
  • genotypes – GenotypeArray to filter

  • min_mac – Minimum minor allele count for filtering

  • max_snps – Maximum number of SNPs to retain

  • impute – Whether to impute missing data

  • verbose – Whether to print progress messages

Return type:

Tuple of (filtered allele counts array, FilterStats)
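
To make the minor allele count (MAC) criterion concrete, here is a small sketch over nested lists. It rests on assumptions not confirmed by the reference: sites are biallelic, alleles are coded 0 (reference), 1 (alternate) and -1 (missing), and a site is kept when its MAC is at least min_mac. `filter_by_mac` is a hypothetical name, and max_snps and imputation are omitted.

```python
def filter_by_mac(genotypes, min_mac=1):
    """genotypes[site][sample] is a pair of alleles; returns per-sample alt counts."""
    kept = []
    for site in genotypes:
        called = [a for sample in site for a in sample if a >= 0]
        alt = sum(1 for a in called if a == 1)
        ref = len(called) - alt
        mac = min(alt, ref)  # count of the rarer allele
        if mac >= min_mac:
            kept.append([sum(1 for a in sample if a == 1) for sample in site])
    return kept


genotypes = [
    [[0, 0], [0, 0], [0, 0]],  # monomorphic, MAC 0
    [[0, 1], [0, 0], [1, 1]],  # MAC 3
    [[1, 1], [1, 1], [0, 1]],  # MAC 1
]
loose = filter_by_mac(genotypes, min_mac=1)
strict = filter_by_mac(genotypes, min_mac=2)
```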

normalize_locs(locs: ndarray) Tuple[float, float, float, float, ndarray, ndarray][source]

Normalize location coordinates.

Parameters:

locs – Array of shape (n_samples, 2) containing longitude and latitude

Return type:

Tuple of (meanlong, sdlong, meanlat, sdlat, unnormedlocs, normedlocs)
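
The returned tuple suggests a z-score normalization per coordinate. The sketch below reproduces that shape with the standard library. It assumes a population standard deviation and no missing coordinates, neither of which the reference specifies; `normalize_locs_sketch` is a hypothetical name.

```python
from statistics import fmean, pstdev


def normalize_locs_sketch(locs):
    """Return (meanlong, sdlong, meanlat, sdlat, unnormedlocs, normedlocs)."""
    longs = [x for x, _ in locs]
    lats = [y for _, y in locs]
    meanlong, sdlong = fmean(longs), pstdev(longs)
    meanlat, sdlat = fmean(lats), pstdev(lats)
    normed = [((x - meanlong) / sdlong, (y - meanlat) / sdlat) for x, y in locs]
    return meanlong, sdlong, meanlat, sdlat, list(locs), normed


locs = [(0.0, 10.0), (2.0, 14.0)]
meanlong, sdlong, meanlat, sdlat, unnormed, normed = normalize_locs_sketch(locs)
```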

impute_missing(genotypes, alt_counts: ndarray | None = None) ndarray[source]

Replace missing data with binomial draws from allele frequency.

Parameters:
  • genotypes – GenotypeArray with missing data

  • alt_counts – Optional precomputed per-site alt allele counts of shape (n_sites,). When provided, the internal count_alleles() call is skipped — used by filter_snps to reuse counts from its numba kernel.

Return type:

Allele counts array with imputed values
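
Imputation by binomial draws can be sketched as follows: for each site, estimate the alternate allele frequency from called genotypes, then fill each missing diploid genotype with two Bernoulli draws at that frequency. The coding (0, 1, -1 for missing), the diploid assumption, and the name `impute_missing_sketch` are assumptions made for illustration.

```python
import random


def impute_missing_sketch(genotypes, seed=None):
    """Return per-sample alt allele counts with missing genotypes imputed."""
    rng = random.Random(seed)
    imputed = []
    for site in genotypes:
        called = [s for s in site if all(a >= 0 for a in s)]
        n_alleles = 2 * len(called)
        alt = sum(a for s in called for a in s)
        freq = alt / n_alleles if n_alleles else 0.0
        row = []
        for s in site:
            if all(a >= 0 for a in s):
                row.append(sum(s))
            else:
                # Binomial(2, freq) draw as two independent Bernoulli trials.
                row.append(sum(1 for _ in range(2) if rng.random() < freq))
        imputed.append(row)
    return imputed


genotypes = [
    [[0, 1], [-1, -1], [1, 1]],  # alt frequency 0.75
    [[0, 0], [-1, -1], [0, 0]],  # alt frequency 0.0
    [[1, 1], [-1, -1], [1, 1]],  # alt frequency 1.0
]
counts = impute_missing_sketch(genotypes, seed=3)
```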

Data Classes

class FilterStats(n_samples_original: int, n_samples_filtered: int, n_snps_original: int, n_snps_filtered: int, mac_threshold: int, samples_removed_na: list[str] = None, n_biallelic_filtered: int = 0, n_mac_filtered: int = 0, n_random_subset: int = 0)[source]

Track what was filtered and why.

n_samples_original: int
n_samples_filtered: int
n_snps_original: int
n_snps_filtered: int
mac_threshold: int
samples_removed_na: list[str] = None
n_biallelic_filtered: int = 0
n_mac_filtered: int = 0
n_random_subset: int = 0
__init__(n_samples_original: int, n_samples_filtered: int, n_snps_original: int, n_snps_filtered: int, mac_threshold: int, samples_removed_na: list[str] = None, n_biallelic_filtered: int = 0, n_mac_filtered: int = 0, n_random_subset: int = 0) None
class NormalizationParams(meanlong: float, sdlong: float, meanlat: float, sdlat: float)[source]

Store normalization parameters for coordinates.

meanlong: float
sdlong: float
meanlat: float
sdlat: float
apply(locs: ndarray) ndarray[source]

Apply normalization to coordinates.

reverse(normalized_locs: ndarray) ndarray[source]

Reverse normalization to get original coordinates.

__init__(meanlong: float, sdlong: float, meanlat: float, sdlat: float) None
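
The apply and reverse methods are inverses of each other. A minimal stand-in dataclass, assuming a plain z-score transform on (longitude, latitude) pairs, shows the round trip; `NormParamsSketch` is hypothetical and works on lists of tuples rather than NumPy arrays.

```python
from dataclasses import dataclass


@dataclass
class NormParamsSketch:
    meanlong: float
    sdlong: float
    meanlat: float
    sdlat: float

    def apply(self, locs):
        """Normalize (longitude, latitude) pairs."""
        return [((x - self.meanlong) / self.sdlong, (y - self.meanlat) / self.sdlat)
                for x, y in locs]

    def reverse(self, normalized_locs):
        """Map normalized pairs back to original coordinates."""
        return [(x * self.sdlong + self.meanlong, y * self.sdlat + self.meanlat)
                for x, y in normalized_locs]


params = NormParamsSketch(meanlong=-100.0, sdlong=5.0, meanlat=40.0, sdlat=2.0)
original = [(-95.0, 42.0), (-110.0, 39.0)]
normalized = params.apply(original)
restored = params.reverse(normalized)
```

Model outputs live in the normalized space, so a reverse step of this kind is what turns predictions back into longitude and latitude.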

Sample Weights Module

weight_samples(method: str, trainlocs: ndarray | None = None, trainsamps: ndarray | None = None, weightdf: DataFrame | None = None, xbins: int | None = None, ybins: int | None = None, lam: float | None = None, bandwidth: float | None = None, cache_bandwidth: bool = True, n_bandwidths: int = 100) Dict[str, Any][source]

Calculate weights for training data based on the specified method.

Parameters:
  • method – Method for calculating weights (‘KD’, ‘histogram’, or ‘load’)

  • trainlocs – Training locations (required for KD and histogram methods)

  • trainsamps – Training sample IDs

  • weightdf – DataFrame containing pre-calculated sample weights

  • xbins – Number of bins in x direction for histogram method

  • ybins – Number of bins in y direction for histogram method

  • lam – Exponent for KDE weights

  • bandwidth – Bandwidth for KDE (if None, will be calculated)

  • cache_bandwidth – Whether to use bandwidth caching for KDE

  • n_bandwidths – Number of bandwidth values to test if calculating

Returns:

Dictionary containing:

  • ‘method’: weighting method used

  • ‘sample_weights’: array of weights

  • ‘sample_weights_df’: DataFrame with sampleID and weights

  • method-specific parameters
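
The reference does not give the weighting formula, so the sketch below shows one plausible reading of the histogram method: bin training locations on an xbins by ybins grid and weight each sample by the inverse of its bin count, rescaled to mean 1. The formula, the rescaling, and the name `histogram_weights` are all assumptions.

```python
from collections import Counter


def histogram_weights(locs, xbins=2, ybins=2):
    """Inverse bin-count weights so that sparsely sampled areas count more."""
    xs = [x for x, _ in locs]
    ys = [y for _, y in locs]

    def bin_index(value, lo, hi, nbins):
        if hi == lo:
            return 0
        return min(int((value - lo) / (hi - lo) * nbins), nbins - 1)

    bins = [(bin_index(x, min(xs), max(xs), xbins),
             bin_index(y, min(ys), max(ys), ybins)) for x, y in locs]
    counts = Counter(bins)
    raw = [1.0 / counts[b] for b in bins]
    mean_raw = sum(raw) / len(raw)
    return [w / mean_raw for w in raw]


# Three clustered samples and one isolated sample.
locs = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (10.0, 10.0)]
weights = histogram_weights(locs)
```

The isolated sample receives the largest weight, which is the general purpose of location-based sample weighting.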

GPU Optimizer Module

class GPUOptimizer[source]

Utilities for optimizing GPU performance in TensorFlow.

static setup_mixed_precision()[source]

Enable mixed precision training, which can give a speedup of up to about 2x on modern GPUs.

Returns:

True if mixed precision was enabled successfully

Return type:

bool

static get_optimal_batch_size(model: Model, input_shape: Tuple[int, ...], target_memory_usage: float = 0.9, min_batch_size: int = 32, max_batch_size: int = 2048, dataset_size: int | None = None, verbose: bool = True) int[source]

Dynamically determine optimal batch size for GPU memory.

Parameters:
  • model – Keras model to optimize for

  • input_shape – Shape of single input sample (excluding batch dimension)

  • target_memory_usage – Target GPU memory usage (0.0-1.0)

  • min_batch_size – Minimum batch size to test

  • max_batch_size – Maximum batch size to test

  • dataset_size – Size of the dataset (if provided, limits max batch size)

Returns:

Optimal batch size for current GPU

Return type:

int

static optimize_gpu_memory(mode: str = 'growth', memory_limit: int | None = None)[source]

Configure GPU memory allocation strategy.

Parameters:
  • mode – Memory allocation mode (‘growth’, ‘preallocate’, ‘limit’)

  • memory_limit – Memory limit in MB (only used with mode=’limit’)

static enable_xla_compilation()[source]

Enable XLA compilation for additional performance.

Note: This is experimental and may not work with all operations.

Internal Modules (Implementation Details)

These modules contain the implementation of Locator functionality. Users typically interact with these through the main Locator class.

Loaders Module

class DataLoaderMixin[source]

Mixin class providing data loading functionality for Locator.

load_genotypes(vcf=None, zarr=None, matrix=None, microsat=None, microsat_min_allele_freq=0.01)[source]

Load genotype data from various input sources.

This method can load genotype data from:

  1. A stored DataFrame provided during initialization

  2. A VCF file

  3. A zarr file (scikit-allel or bio2zarr format)

  4. A tab-delimited matrix file

  5. A tab-delimited microsatellite genotype table

For windowed analysis, SNP positions must be available either from:

  • Column names in the genotype DataFrame

  • The zarr file’s variants/POS array

  • The VCF file’s POS field (automatically loaded)

Parameters:
  • vcf (str, optional) – Path to VCF format genotype data

  • zarr (str, optional) – Path to zarr format genotype data

  • matrix (str, optional) – Path to tab-delimited matrix file

  • microsat (str, optional) – Path to tab-delimited microsatellite genotype table

  • microsat_min_allele_freq (float, optional) – Drop microsat alleles below this per-locus frequency. Default 0.01.

Returns:

(genotypes, samples) where:

  • genotypes is an allel.GenotypeArray of shape (n_sites, n_samples, 2) for VCF/zarr/integer-matrix inputs, or a float32 ndarray of shape (n_sites, n_samples) for continuous-dosage (matrix float / microsat) inputs

  • samples is a numpy array of sample IDs

Return type:

tuple

Examples

>>> # Using stored DataFrame from initialization
>>> locator = Locator({
...     "genotype_data": geno_df,  # DataFrame with genotypes
...     "sample_data": coords_df   # DataFrame with coordinates
... })
>>> genotypes, samples = locator.load_genotypes()
>>> # Using zarr file (recommended for windowed analysis)
>>> locator = Locator({"sample_data": coords_df})
>>> genotypes, samples = locator.load_genotypes(zarr="path/to/geno.zarr")
>>> # Using VCF file
>>> genotypes, samples = locator.load_genotypes(vcf="path/to/geno.vcf")
>>> # Using matrix file
>>> genotypes, samples = locator.load_genotypes(matrix="path/to/geno.txt")
>>> # Using microsatellite genotypes
>>> genotypes, samples = locator.load_genotypes(microsat="path/to/microsats.tsv")
Raises:

ValueError – If no input source is provided or if input format is invalid.

sort_samples(samples=None, sample_data_file=None, reorder=True)[source]

Sort samples and match with location data.

Matches samples with their location data and ensures consistent ordering between genotype and location data.

Parameters:
  • samples (numpy.ndarray) – Array of sample IDs from the genotype data

  • sample_data_file (str, optional) – Override path to tab-delimited file with columns ‘sampleID’, ‘x’, ‘y’. If not provided, uses stored sample data.

  • reorder (bool) – If True, automatically reorder metadata to match genotype order. If False, raise error on order mismatch (default: True)

Returns:

(sample_data DataFrame, locs array of shape (n_samples, 2))

Return type:

tuple

Training Module

class TrainingMixin[source]

Mixin class providing training functionality for Locator.

set_sample_weights(wdict)[source]

Set sample weights for training.

Parameters:

wdict (dict) – Dictionary returned by utils.weight_samples() containing sample weights.

train(*, genotypes, samples, sample_data_file=None, boot=None, train_gen=None, test_gen=None, pred_gen=None, train_locs=None, test_locs=None, setup_only=False, na_action=None, site_order=None)[source]

Train the Locator model on genotype and location data.

This method trains the neural network model to predict geographic locations from genetic data. It supports both standard training and advanced workflows such as bootstrapping, by accepting pre-processed genotype and location arrays. The model is configured using the parameters provided at initialization.

Parameters:
  • genotypes (allel.GenotypeArray or np.ndarray) – Genotype data for all samples. Should be of shape (n_sites, n_samples, ploidy).

  • samples (np.ndarray) – Array of sample IDs corresponding to the genotype data.

  • sample_data_file (str, optional) – Path to a tab-delimited file with columns ‘sampleID’, ‘x’, ‘y’ for sample locations. Used if not provided in config or as a DataFrame.

  • boot (int, optional) – Bootstrap replicate number. Used for bootstrapping analyses. Defaults to None.

  • train_gen (np.ndarray, optional) – Pre-processed training genotype data. Used for bootstrapping. If None, will be generated from genotypes. Defaults to None.

  • test_gen (np.ndarray, optional) – Pre-processed test genotype data. Used for bootstrapping. If None, will be generated from genotypes. Defaults to None.

  • pred_gen (np.ndarray, optional) – Pre-processed prediction genotype data. Used for bootstrapping. If None, will be generated from genotypes. Defaults to None.

  • train_locs (np.ndarray, optional) – Pre-processed training locations. Used for bootstrapping. If None, will be generated from sample data. Defaults to None.

  • test_locs (np.ndarray, optional) – Pre-processed test locations. Used for bootstrapping. If None, will be generated from sample data. Defaults to None.

  • setup_only (bool, optional) – If True, only sets up the model and data without training. Defaults to False.

  • na_action (str, optional) – How to handle NA samples (‘separate’, ‘exclude’, ‘fail’). If None, uses self.na_action. Defaults to None.

  • site_order (np.ndarray, optional) – Array of SNP indices for bootstrap resampling. If provided, SNPs will be reordered according to these indices during training. Used for bootstrap analyses to resample SNPs with replacement.

Returns:

The Keras training history object if training is performed, or None if setup_only is True.

Return type:

keras.callbacks.History or None

Raises:

ValueError – If required sample data is missing or improperly formatted.

Example

>>> # Standard training
>>> loc = Locator({"out": "analysis", "sample_data": "samples.txt", "zarr": "genotypes.zarr"})
>>> genotypes, samples = loc.load_genotypes(zarr="genotypes.zarr")
>>> history = loc.train(genotypes=genotypes, samples=samples)
>>> # Bootstrapping with pre-processed data
>>> history = loc.train(
...     genotypes=None,
...     samples=samples,
...     boot=1,
...     train_gen=boot_train_gen,
...     test_gen=boot_test_gen,
...     pred_gen=boot_pred_gen,
...     train_locs=boot_train_locs,
...     test_locs=boot_test_locs
... )

train_holdout(genotypes=None, samples=None, k=10, holdout_indices=None, filtered_genotypes=None)[source]

Train the model while holding out samples with known locations.

Parameters:
  • genotypes – Array of genotype data. Required unless filtered_genotypes is provided.

  • samples – Sample IDs corresponding to genotypes

  • k – Number of samples to hold out (ignored if holdout_indices provided)

  • holdout_indices – Optional specific indices of samples to hold out

  • filtered_genotypes – Pre-filtered allele count array. If provided, skips internal filter_snps call and avoids loading the full genotype array. Used by parallel dispatch to share one filtered copy across all workers.

Returns:

keras.callbacks.History object from model training
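The relationship between k and holdout_indices described above can be sketched as follows. This is a minimal illustration, not the library's implementation: choose_holdout_indices and its seed argument are hypothetical, and the sketch assumes holdouts are drawn uniformly at random from the samples with known locations.

```python
import random


def choose_holdout_indices(known_indices, k=10, holdout_indices=None, seed=None):
    """Return sorted indices of samples to hold out (hypothetical helper)."""
    if holdout_indices is not None:
        # Explicit indices take precedence, so k is ignored.
        return sorted(holdout_indices)
    if k > len(known_indices):
        raise ValueError("k exceeds the number of samples with known locations")
    rng = random.Random(seed)
    return sorted(rng.sample(list(known_indices), k))


known = list(range(50))  # indices of samples that have coordinates
random_holdouts = choose_holdout_indices(known, k=10, seed=1)
explicit_holdouts = choose_holdout_indices(known, k=10, holdout_indices=[7, 3, 42])
```

Passing explicit indices makes a holdout run reproducible across sessions, while a fixed seed gives the same effect for the random case in this sketch.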

train_window(genotypes, samples, window_snp_indices, index_set, normalized_locs)[source]

Train the model for a specific genomic window using an efficient tf.data pipeline.

This is an internal method used by run_windows_holdouts to train models on specific genomic windows without creating intermediate arrays.

Parameters:
  • genotypes – Full genotype array (not filtered)

  • samples – Sample IDs

  • window_snp_indices – Indices of SNPs in this window

  • index_set – Pre-computed IndexSet with train/test/holdout splits

  • normalized_locs – Pre-normalized location coordinates

Returns:

keras.callbacks.History object from model training

Prediction Module

class PredictionMixin[source]

Mixin class providing prediction functionality for Locator.

predict(boot=0, verbose=True, prediction_genotypes=None, genotypes=None, samples=None, indices=None, return_df=False, save_preds_to_disk=True, site_order=None)[source]

Make predictions for samples with unknown locations.

Parameters:
  • boot (int, optional) – Bootstrap replicate number. Defaults to 0.

  • verbose (bool, optional) – Whether to print validation metrics. Defaults to True.

  • prediction_genotypes (numpy.ndarray, optional) – Deprecated; use the genotypes parameter instead. Overrides the default prediction genotypes. Used for jackknife resampling. Defaults to None.

  • genotypes (numpy.ndarray, optional) – Full genotype array for creating tf.data dataset. Should be the original unfiltered genotypes. Defaults to None.

  • samples (numpy.ndarray, optional) – Sample IDs corresponding to genotypes. Defaults to None.

  • indices (numpy.ndarray, optional) – Indices of samples to predict on. If None, predicts on samples without coordinates (self.pred_indices). Defaults to None.

  • return_df (bool, optional) – Whether to return predictions as pandas DataFrame. Defaults to False.

  • save_preds_to_disk (bool, optional) – Whether to save predictions to disk. Defaults to True.

  • site_order (np.ndarray, optional) – Array of SNP indices for bootstrap resampling. If provided, SNPs will be reordered according to these indices during prediction. Used for bootstrap analyses to ensure consistent resampling between train and predict.

Returns:

Array of predicted coordinates, or a DataFrame with x, y coordinates and sampleID columns if return_df is True.

Return type:

numpy.ndarray or pandas.DataFrame
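The site_order argument can be illustrated with a small sketch of bootstrap resampling over SNPs. It rests on several assumptions and uses only the standard library: make_site_order and reorder_sites are hypothetical helpers, and genotypes are represented as plain lists with one row per sample rather than as NumPy arrays.

```python
import random


def make_site_order(n_snps, seed=None):
    """Draw n_snps SNP indices with replacement (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.randrange(n_snps) for _ in range(n_snps)]


def reorder_sites(genotype_rows, site_order):
    """Reorder each sample's SNP columns according to site_order."""
    return [[row[j] for j in site_order] for row in genotype_rows]


# Two samples, four SNPs (genotype counts 0, 1, 2).
genotypes = [[0, 1, 2, 0],
             [2, 2, 0, 1]]
site_order = make_site_order(n_snps=4, seed=0)

# Using the same site_order for training and prediction keeps the
# resampled columns aligned between the two steps.
train_view = reorder_sites(genotypes, site_order)
predict_view = reorder_sites(genotypes, site_order)
```

The point of the sketch is the pairing: whatever order is passed to train for a given replicate should also be passed to predict for that replicate.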

load_model(weights_path)[source]

Load a trained model from saved weights.

This method loads a model from an HDF5 weights file and restores the preprocessing parameters needed for making predictions.

Parameters:

weights_path (str) – Path to the saved HDF5 weights file

Returns:

Dictionary containing loaded metadata, including normalization parameters.

Return type:

dict

Raises:

ValueError – If the weights file cannot be loaded or is missing metadata.

predict_from_weights(weights_path, genotypes, samples, sample_data_file=None, save_preds_to_disk=True, return_df=True)[source]

Convenience method to load weights and make predictions.

This method combines loading a saved model and making predictions in a single call. It handles preprocessing the genotypes using the same parameters that were used during training.

Parameters:
  • weights_path (str) – Path to saved HDF5 weights file

  • genotypes (numpy.ndarray) – Genotype data to predict on

  • samples (numpy.ndarray) – Sample IDs corresponding to genotypes

  • sample_data_file (str, optional) – Path to sample data file

  • save_preds_to_disk (bool) – Whether to save predictions to disk

  • return_df (bool) – Whether to return predictions as DataFrame

Returns:

Predictions, as a DataFrame if return_df is True and as an array otherwise.

Return type:

numpy.ndarray or pandas.DataFrame

predict_holdout(verbose=True, return_df=False, save_preds_to_disk=True, plot_summary=True, plot_map=True)[source]

Predict locations for held-out samples.

Parameters:
  • verbose – Print progress and metrics

  • return_df – Return predictions as pandas DataFrame

  • save_preds_to_disk – Save predictions to disk

  • plot_summary – Display error summary plot in notebook (only if return_df=True)

  • plot_map – Display map of predictions (only if plot_summary=True)

Returns:

  • If return_df is True, returns pandas DataFrame with predictions

  • Otherwise returns None

Analysis Module

class AnalysisMixin[source]

Mixin class providing analysis functionality for Locator.

Parallel Analysis Module

This module provides Ray-based parallel implementations of analysis methods for multi-GPU execution.

parallel_k_fold_holdouts(*args, **kwargs)
parallel_leave_one_out(*args, **kwargs)
parallel_holdouts(*args, **kwargs)
parallel_windows_holdouts(*args, **kwargs)

Plotting Module

This module provides visualization functions for Locator predictions and analyses.

Standalone Functions

plot_predictions(predictions, locator, out_prefix, samples=None, n_samples=9, n_cols=3, plot_map=False, width=5, height=4, dpi=300, n_levels=3, show=None)[source]

Plot locator predictions from jackknife, bootstrap, or windows analyses.

This function visualizes predictions from any of locator’s prediction methods that generate multiple predictions per sample. It creates a grid of subplots, one per sample, showing the distribution of predictions as KDE contours.

The function expects prediction data with:

  • A ‘sampleID’ column

  • Multiple prediction columns (‘x_0’, ‘x_1’… and ‘y_0’, ‘y_1’…)

For each sample, the plot shows:

  • KDE contours of predictions (blue lines)

  • True location if known (red star)

  • All training sample locations (gray circles)

Parameters:
  • predictions (pandas.DataFrame or str) –

    DataFrame or path to predictions file. Output from any of:

    • locator.run_jacknife(return_df=True)

    • locator.run_bootstraps(return_df=True)

    • locator.run_windows(return_df=True)

  • locator (Locator) – Locator instance containing training data configuration

  • out_prefix (str) – Prefix for output files. Plot saved as {out_prefix}_predictions.pdf

  • samples (list, optional) – List of sample IDs to plot. If None, randomly selects n_samples

  • n_samples (int) – Number of samples to plot if samples not specified. Default: 9

  • n_cols (int) – Number of columns in plot grid. Default: 3

  • plot_map (bool) – Whether to plot on a geographic map (requires cartopy). Default: False

  • width (float) – Width of each subplot in inches. Default: 5

  • height (float) – Height of each subplot in inches. Default: 4

  • dpi (int) – DPI resolution for output figure. Default: 300

  • n_levels (int) – Number of KDE contour levels to plot. Default: 3

  • show (bool or None) – Whether to display plot. None=auto-detect environment. Default: None

Returns:

None. Saves the plot to file and optionally displays it.

Return type:

None

Examples

For jackknife analysis:

predictions = locator.run_jacknife(genotypes, samples, return_df=True)
plot_predictions(predictions, locator, "jacknife_example")

For bootstrap analysis:

predictions = locator.run_bootstraps(genotypes, samples, return_df=True)
plot_predictions(predictions, locator, "bootstrap_example")

For windows analysis:

predictions = locator.run_windows(genotypes, samples, return_df=True)
plot_predictions(predictions, locator, "windows_example")

Plot specific samples:

plot_predictions(predictions, locator, "selected",
               samples=['HG001', 'HG002', 'HG003'])

Note

  • Requires matplotlib and scipy for KDE calculation

  • If plot_map=True, requires cartopy for geographic projections

  • Automatically adjusts plot limits based on prediction ranges

  • KDE may fail for samples with very few predictions
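The expected column layout ('sampleID' plus 'x_0', 'y_0', 'x_1', 'y_1', and so on) can be unpacked into per-sample coordinate pairs as sketched below. The helper collect_replicates is hypothetical and operates on plain dictionaries, one per row, rather than on a DataFrame; the sample values are invented for illustration.

```python
def collect_replicates(row):
    """Return [(x_0, y_0), (x_1, y_1), ...] for one prediction row."""
    pairs = []
    i = 0
    # Walk the numbered columns until the next replicate is absent.
    while f"x_{i}" in row and f"y_{i}" in row:
        pairs.append((row[f"x_{i}"], row[f"y_{i}"]))
        i += 1
    return pairs


rows = [
    {"sampleID": "HG001", "x_0": 10.0, "y_0": 45.0, "x_1": 10.5, "y_1": 44.8},
    {"sampleID": "HG002", "x_0": -3.2, "y_0": 40.1, "x_1": -3.0, "y_1": 40.4},
]
by_sample = {row["sampleID"]: collect_replicates(row) for row in rows}
```

With pandas, rows of this shape can be obtained from a predictions DataFrame via df.to_dict("records"). A sample with only one or two pairs is the situation in which the KDE is likely to fail.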

plot_error_summary(predictions, sample_data, out_prefix=None, plot_map=True, width=20, height=10, dpi=300, use_geodesic=True, include_training_locs=True, show=None, return_merged=False)[source]

Plot summary of prediction errors from holdout analysis.

Creates a comprehensive error visualization with two panels:

  1. Map/Scatter panel: Shows true locations colored by prediction error, with lines connecting true and predicted locations

  2. Histogram panel: Distribution of errors with summary statistics

This function is designed for analyzing results from holdout methods like:

  • run_holdouts()

  • run_k_fold_holdouts()

  • run_leave_one_out()

Parameters:
  • predictions (pandas.DataFrame) –

    DataFrame with columns:

    • sampleID: Sample identifiers

    • x_pred: Predicted longitude

    • y_pred: Predicted latitude

  • sample_data (pandas.DataFrame or str) –

    DataFrame or path to TSV file with columns:

    • sampleID: Sample identifiers (must match predictions)

    • x: True longitude

    • y: True latitude

  • out_prefix (str, optional) – Prefix for output files. If provided, saves as {out_prefix}_error_summary.png (or .html for interactive). Default: None

  • plot_map (bool) – Whether to plot on a geographic map using cartopy projection. If False, uses regular scatter plot. Default: True

  • width (float) – Figure width in inches. Default: 20

  • height (float) – Figure height in inches. Default: 10

  • dpi (int) – Figure resolution in dots per inch. Default: 300

  • use_geodesic (bool) – If True, calculate geodesic distances in kilometers. If False, use Euclidean distances in coordinate units. Default: True

  • include_training_locs (bool) – Whether to plot training locations (gray circles) and use their extent for map bounds. Default: True

  • show (bool or None) – Whether to display plot. None=auto-detect environment, True=always show, False=never show. Default: None

  • return_merged (bool) – If True, return the internal merged DataFrame used for plotting. Default: False

Returns:

  • None by default. The plot is saved to file if out_prefix is provided and is optionally displayed.

  • If return_merged is True, returns the internal merged DataFrame containing prediction errors and true locations.

Raises:

ValueError – If predictions or sample_data are empty, have missing columns, or have no matching samples.

Examples

Basic usage with k-fold results:

predictions = locator.run_k_fold_holdouts(genotypes, samples, return_df=True)
plot_error_summary(predictions, "samples.tsv", "kfold_errors")

With DataFrame input and Euclidean distances:

plot_error_summary(predictions, sample_df,
                 out_prefix="holdout_errors",
                 use_geodesic=False)

Without map projection:

plot_error_summary(predictions, sample_df,
                 plot_map=False,
                 width=10, height=5)

Return merged DataFrame:

merged = plot_error_summary(predictions, sample_df, return_merged=True)

Note

  • Summary statistics shown: mean, median, max error, R² for x and y

  • Training locations help visualize geographic sampling bias

  • Geodesic distances account for Earth’s curvature

  • Map projection requires cartopy to be installed
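The difference between the two distance modes can be sketched with the standard library. This is an illustration only: it uses the haversine great-circle formula on a spherical Earth as a stand-in for a geodesic distance, which may differ from the calculation the library performs, and the function names and sample coordinates are hypothetical.

```python
import math
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def euclidean(x1, y1, x2, y2):
    """Straight-line distance in coordinate units (degrees for lon/lat)."""
    return math.hypot(x2 - x1, y2 - y1)


# (true_x, true_y, x_pred, y_pred) for three hypothetical holdout samples
pairs = [(0.0, 0.0, 1.0, 0.0), (10.0, 50.0, 10.0, 51.0), (-5.0, 40.0, -5.0, 40.0)]
errors_km = [haversine_km(*p) for p in pairs]
errors_deg = [euclidean(*p) for p in pairs]
summary = {"mean": mean(errors_km), "median": median(errors_km), "max": max(errors_km)}
```

Euclidean errors stay in degrees, so one degree of longitude counts the same at every latitude, whereas the kilometer-based distance shrinks east-west offsets toward the poles.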

plot_sample_weights(locator, out_prefix=None, plot_map=True, width=5, height=3, dpi=300, show=None)[source]

Plot sample weights assigned to training locations.

Visualizes the geographic distribution of sample weights used during training. This is useful for understanding which regions are upweighted or downweighted based on sampling density.

Sample weights are typically computed using:

  • Kernel density (KD) method: Upweights samples in sparse regions

  • Histogram binning method: Based on 2D histogram counts

The plot uses a log-scale color mapping to better show weight variations.

Parameters:
  • locator (Locator) – Locator instance that has been trained with sample weighting enabled. Must have computed sample_weights attribute.

  • out_prefix (str, optional) – Prefix for output files. If provided, saves as {out_prefix}_sample_weights.png. Default: None

  • plot_map (bool) – Whether to plot on a geographic map using cartopy projection. If False, uses regular scatter plot with equal aspect ratio. Default: True

  • width (float) – Figure width in inches. Default: 5

  • height (float) – Figure height in inches. Default: 3

  • dpi (int) – Figure resolution in dots per inch. Default: 300

  • show (bool or None) – Whether to display plot. None=auto-detect environment, True=always show, False=never show. Default: None

Returns:

None. Saves the plot to file and optionally displays it.

Return type:

None

Raises:

ValueError – If locator doesn’t have computed sample weights, or if required data is missing.

Examples

After training with KDE weighting:

config = {
    "weight_samples": {
        "enabled": True,
        "method": "KD"
    }
}
locator = Locator(config)
locator.train(genotypes, samples)
plot_sample_weights(locator, "kde_weights")

With histogram binning weights:

config = {
    "weight_samples": {
        "enabled": True,
        "method": "hist",
        "xbins": 20,
        "ybins": 20
    }
}
locator = Locator(config)
locator.train(genotypes, samples)
plot_sample_weights(locator, "hist_weights", plot_map=False)

Note

  • Requires that locator was trained with weight_samples enabled

  • Log scale coloring helps visualize large weight variations

  • Higher weights (yellow) indicate undersampled regions

  • Lower weights (purple) indicate oversampled regions

  • Map projection requires cartopy to be installed
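One way to picture the histogram binning method is as inverse-count weighting over a 2D grid. The sketch below is an assumption about the general idea, not the library's formula: hist_weights, the inverse-count rule, and the normalization to a mean weight of 1 are all illustrative choices.

```python
from collections import Counter


def hist_weights(xs, ys, xbins=10, ybins=10):
    """Weight each sample by the inverse of its 2D histogram bin count."""
    def bin_index(v, lo, hi, nbins):
        if hi == lo:
            return 0
        # Clamp so that the maximum value falls in the last bin.
        return min(int((v - lo) / (hi - lo) * nbins), nbins - 1)

    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    bins = [(bin_index(x, x_lo, x_hi, xbins), bin_index(y, y_lo, y_hi, ybins))
            for x, y in zip(xs, ys)]
    counts = Counter(bins)
    raw = [1.0 / counts[b] for b in bins]
    scale = len(raw) / sum(raw)  # normalize so the weights average to 1
    return [w * scale for w in raw]


# Three samples clustered near (0, 0) and one isolated sample at (10, 10).
xs = [0.0, 0.1, 0.2, 10.0]
ys = [0.0, 0.1, 0.2, 10.0]
weights = hist_weights(xs, ys, xbins=2, ybins=2)
```

In this toy case the isolated sample receives three times the weight of each clustered sample, which matches the qualitative description of upweighting undersampled regions.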

kde_predict(x_coords, y_coords, xlim=(0, 50), ylim=(0, 50), n_points=100)[source]

Calculate kernel density estimate of predictions.

This is a helper function used internally by plot_predictions() to compute kernel density estimates for visualizing prediction uncertainty.

Parameters:
  • x_coords (array-like) – Array of x coordinates (longitude values)

  • y_coords (array-like) – Array of y coordinates (latitude values)

  • xlim (tuple) – Tuple of (min, max) x values for grid. Default: (0, 50)

  • ylim (tuple) – Tuple of (min, max) y values for grid. Default: (0, 50)

  • n_points (int) – Number of points for density estimation grid. Default: 100

Returns:

A 3-tuple containing:

  • x_grid (numpy.ndarray): X coordinates of the mesh grid

  • y_grid (numpy.ndarray): Y coordinates of the mesh grid

  • density (numpy.ndarray): Density values at each grid point

Returns (None, None, None) if KDE calculation fails.

Return type:

tuple

Note

The function uses scipy.stats.gaussian_kde for density estimation. Grid limits should match the geographic extent of your predictions.
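The return contract described above (three grids, or (None, None, None) on failure) can be sketched without SciPy. The version below swaps in a fixed-bandwidth isotropic Gaussian kernel written in pure Python; it is not scipy.stats.gaussian_kde, which selects its bandwidth from the data, and kde_grid and bandwidth are hypothetical names. It returns nested lists rather than NumPy arrays.

```python
import math


def kde_grid(x_coords, y_coords, xlim=(0, 50), ylim=(0, 50), n_points=100, bandwidth=1.0):
    """Evaluate a fixed-bandwidth Gaussian KDE on an n_points x n_points grid."""
    if len(x_coords) != len(y_coords) or len(x_coords) < 2 or bandwidth <= 0 or n_points < 2:
        return None, None, None  # mirror the documented failure value

    def axis(lo, hi):
        step = (hi - lo) / (n_points - 1)
        return [lo + i * step for i in range(n_points)]

    xs, ys = axis(*xlim), axis(*ylim)
    x_grid = [[x for _ in ys] for x in xs]  # x varies along the first axis
    y_grid = [[y for y in ys] for _ in xs]  # y varies along the second axis
    norm = 1.0 / (2 * math.pi * bandwidth ** 2 * len(x_coords))
    density = [
        [norm * sum(math.exp(-((gx - px) ** 2 + (gy - py) ** 2) / (2 * bandwidth ** 2))
                    for px, py in zip(x_coords, y_coords))
         for gy in ys]
        for gx in xs
    ]
    return x_grid, y_grid, density


x_grid, y_grid, density = kde_grid([10, 11, 12], [20, 21, 19],
                                   xlim=(5, 15), ylim=(15, 25), n_points=21)
failed = kde_grid([10], [20])  # too few points for a density estimate
```

The example also shows why the grid limits matter: with the default limits of (0, 50) on both axes, predictions outside that range would contribute almost nothing to the plotted density.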

PlottingMixin Class

class PlottingMixin[source]

Bases: object

Mixin class providing plotting functionality for Locator.

This mixin is inherited by the main Locator class to provide visualization methods for training history and Jupyter notebook integration.

_repr_html_: Generate rich HTML representation for Jupyter notebooks

Configuration Options

This section provides an overview of the available configuration options.

Default Configuration

The default configuration for Locator includes:

{
    # Data parameters
    "train_split": 0.9,
    "batch_size": 32,
    "min_mac": 2,
    "max_SNPs": None,
    "impute_missing": False,

    # Network architecture
    "width": 256,
    "nlayers": 8,
    "dropout_prop": 0.25,

    # Training parameters
    "max_epochs": 5000,
    "patience": 100,
    "learning_rate": 0.001,
    "min_epochs": 10,
    "min_delta": 1e-4,
    "restore_best_weights": True,

    # Optimizer parameters
    "optimizer_algo": "adam",
    "weight_decay": 0.004,

    # Output control
    "keras_verbose": 1,
    "prediction_frequency": 1,

    # Validation
    "validation_split": 0.1,

    # Data augmentation
    "augmentation": {
        "enabled": False,
        "flip_rate": 0.05,
    },

    # Sample weighting
    "weight_samples": {
        "enabled": False,
        "method": "KD",
        "xbins": 10,
        "ybins": 10,
        "lam": 1.0,
        "bandwidth": None,
        "weightdf": None,
    },

    # Range penalty
    "use_range_penalty": False,
    "species_range_shapefile": None,
    "resolution": 0.05,
    "penalty_weight": 1.0,
    "out": "locator",

    # NA handling
    "na_action": "separate",

    # GPU optimization (enabled by default)
    "use_mixed_precision": True,
    "gpu_batch_size": "auto",
    "gradient_accumulation_steps": 1,
    "gpu_memory_mode": "growth",
    "enable_xla": False,

    # Performance optimization
    "optimize_tf_parallelism": True,
    "holdout_no_intermediate_saves": True,
    "save_fold_models": True,

    # Verbosity control
    "verbose_splits": False,
    "verbose_batch_size": False,
}
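Nested options such as weight_samples and augmentation raise the question of how a partial user dictionary combines with the defaults. The sketch below shows one plausible approach, a recursive merge. It is an assumption for illustration: merge_config is a hypothetical helper, and the library's actual merging behavior is not documented in this section.

```python
import copy

DEFAULTS = {  # a small excerpt of the defaults listed above
    "train_split": 0.9,
    "batch_size": 32,
    "weight_samples": {"enabled": False, "method": "KD", "xbins": 10, "ybins": 10},
}


def merge_config(defaults, overrides):
    """Return a new dict with overrides applied recursively onto defaults."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


config = merge_config(DEFAULTS, {"batch_size": 64, "weight_samples": {"enabled": True}})
```

If nested sections were instead replaced wholesale, a user passing only {"enabled": True} would lose the method and bin settings, so it is worth checking the resulting configuration when overriding a nested option.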

Input Formats

Genotype Data

Supported input formats for genotype data:

  1. VCF files (.vcf or .vcf.gz)

  2. Zarr format (recommended for large datasets)

  3. Pandas DataFrame with:

    • Samples as index

    • SNP positions as columns

    • Genotype counts (0, 1, 2) as values

Sample Data

Required format for sample coordinate data:

  • Tab-delimited file or DataFrame with columns:

    • sampleID: Sample identifier

    • x: Longitude

    • y: Latitude
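A tab-delimited file in this format can be checked before it is handed to Locator. The sketch below uses only the standard library; read_sample_data is a hypothetical helper, and it assumes that samples with unknown locations carry NA (or another non-numeric value) in the x and y columns.

```python
import csv
import io

REQUIRED_COLUMNS = ("sampleID", "x", "y")


def read_sample_data(handle):
    """Parse tab-delimited sample data into known and unknown-location samples."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"sample data is missing columns: {missing}")
    known, unknown = {}, []
    for row in reader:
        try:
            coords = (float(row["x"]), float(row["y"]))
        except (TypeError, ValueError):
            coords = None  # e.g. "NA" or an empty field
        if coords is None or any(c != c for c in coords):  # c != c detects NaN
            unknown.append(row["sampleID"])
        else:
            known[row["sampleID"]] = coords
    return known, unknown


text = "sampleID\tx\ty\nsample1\t-122.3\t47.6\nsample2\tNA\tNA\n"
known, unknown = read_sample_data(io.StringIO(text))
```

For a file on disk, the same function can be called with open("samples.txt", newline="") in place of the StringIO object.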

Output Formats

Prediction Results

Default output files:

  • {out}_predlocs.txt: Main predictions

  • {out}_history.txt: Training history

  • {out}_fitplot.pdf: Training plots

  • {out}.weights.h5: Model weights

For special analyses:

  • {out}_bootstrap_predlocs.csv: Bootstrap results

  • {out}_jacknife_predlocs.csv: Jackknife results

  • {out}_windows_predlocs.csv: Windowed analysis results

  • {out}_holdout_predlocs.csv: Holdout analysis results