Analysis#
The tfscreen.tfmodel module provides a hierarchical Bayesian model that
infers per-genotype and operator occupancy
(θ) from bacterial growth data and direct binding measurements.
Standard Workflow#
A complete analysis run consists of the following steps. The example commands
assume all input files are in the current working directory and use the
default output-file naming convention (tfs_configure_*, tfs_fit_model_*,
etc.). The example run.srun
shows a complete Slurm script for a cluster run.
Step 1: Configure Model (tfs-configure-model)#
Validates the input data, maps categorical labels to numerical indices, selects model components, and writes three configuration files:
{out_prefix}_config.yaml— main configuration read by all downstream steps{out_prefix}_priors.csv— prior distribution settings for all parameters{out_prefix}_guesses.csv— initial-value guesses for array parameters
binding_df (direct binding measurements) is the only required argument.
growth_df is optional; omitting it configures a binding-only model.
tfs-configure-model binding.csv \
--growth_df growth.csv \
--out_prefix tfs_configure
Key component flags (see Model Components for the full list):
--condition_growth_model— how growth rate depends on TF occupancy (default:linear)--growth_transition_model— pre-to-selection phase lag model (default:instant)--activity_model— per-genotype TF activity prior (default:horseshoe_geno)--theta_model— operator occupancy parameterisation (default:hill_geno)--dk_geno_model— pleiotropic growth-effect prior (default:hierarchical_geno)--transformation_model— multi-plasmid congression correction (default:empirical)
Step 2: Pre-fit Calibration (tfs-prefit-calibration)#
Runs a fast MAP fit on a simplified version of the model to calibrate the priors
for the condition_growth and growth_transition components. The
calibration fit is restricted to the intersection of genotypes and titrant
conditions present in both the growth and binding data.
After convergence the script updates the production {out_prefix}_priors.csv
and {out_prefix}_guesses.csv files in place (a .bak backup is written
first), giving the full production fit a warm start. Diagnostic PDFs (one per
genotype) and a {out_prefix}_calib_stats.json file are also written.
tfs-prefit-calibration tfs_configure_config.yaml \
--seed 42 \
--convergence_tolerance 0.00001
Step 3: Fit Model (tfs-fit-model)#
Performs the main parameter estimation. Three inference methods are available
via --analysis_method:
map — Maximum A Posteriori optimisation (Adam). Fast; produces a point estimate. Use
tfs-sample-posteriorafterwards to obtain uncertainty estimates via a Laplace approximation.svi (default) — Stochastic Variational Inference. Automatically runs a short MAP pre-pass (
--pre_map_num_epoch) before the full variational fit. Produces a full approximate posterior; posterior samples are drawn after convergence.nuts — No-U-Turn Sampler (exact MCMC). Slowest; most accurate.
The example below matches the MAP configuration used in the example run.srun:
tfs-fit-model \
tfs_configure_config.yaml \
--seed 42 \
--analysis_method map \
--adam_step_size 1e-6 \
--convergence_check_interval 100 \
--convergence_window 50 \
--checkpoint_interval 100 \
--max_num_epochs 100000000 \
--pre_map_num_epoch 100000 \
--convergence_tolerance 0.0005 \
--patience 5
Key outputs (with default --out_prefix tfs_fit_model):
tfs_fit_model_checkpoint.pkl— optimizer checkpoint; resume with--checkpoint_filetfs_fit_model_params.npz— MAP/SVI parameter point estimates
Step 4: Sample Posterior (tfs-sample-posterior)#
Draws posterior samples from a checkpoint produced by tfs-fit-model. The
checkpoint type is detected automatically:
MAP checkpoint — constructs a Laplace (Hessian-based) Gaussian approximation at the MAP point, then draws samples.
SVI checkpoint — draws directly from the fitted variational distribution (resumes with 0 additional optimisation epochs).
NUTS checkpoint — reconstructs posteriors from the saved MCMC samples.
tfs-sample-posterior \
tfs_configure_config.yaml \
tfs_fit_model_checkpoint.pkl \
--num_posterior_samples 1000 \
--sampling_batch_size 10 \
--seed 42
Output (default --out_prefix tfs_posterior):
tfs_posterior.h5— posterior samples for all latent variables; passed to the prediction steps below.
Step 5: Extract Parameters (tfs-extract-params)#
Extracts interpretable parameter summaries from a posterior .h5 file and
writes one CSV per parameter group.
tfs-extract-params \
tfs_configure_config.yaml \
tfs_posterior.h5
Outputs (default --out_prefix tfs_params):
tfs_params_activity.csv— per-genotype TF activity A with posterior quantilestfs_params_theta.csv— inferred occupancy parameters (Hill Kd, n, etc.)Additional CSVs depending on the components selected during configuration.
Step 6: Predict Growth (tfs-predict-growth)#
Predicts ln(CFU) from the fitted model. By default, predictions are produced at every (genotype, replicate, condition, titrant_name, titrant_conc, time) combination present in the training data.
tfs-predict-growth \
tfs_configure_config.yaml \
tfs_posterior.h5
Output (default --out_prefix tfs_growth_pred):
tfs_growth_pred.csv— one row per prediction point; quantile columns (median,lower_95,upper_95, etc.) plusin_training_data.
To add predictions at novel genotypes or concentrations, use
--genotypes_file or --titrant_concs_file (plain-text files, one value
per line). Pass --only_files to skip training-data combinations and predict
only at the file-specified inputs.
Step 7: Predict Theta (tfs-predict-theta)#
Predicts operator occupancy θ as a function of titrant concentration.
tfs-predict-theta \
tfs_configure_config.yaml \
tfs_posterior.h5 \
--genotypes_file predict_genotypes.txt
Output (default --out_prefix tfs_theta_pred):
tfs_theta_pred.csv— one row per (genotype, titrant_name, titrant_conc) with posterior quantile columns and anin_training_dataflag.
predict_genotypes.txt is a plain-text file with one genotype per line
(e.g. M42I/K84L or wt). Genotypes not seen during training can be
predicted using the mutation-additivity model when the chosen theta_model
supports it (e.g. hill_mut).
Step 8: Categorise Response (tfs-cat-response)#
Fits categorical response curve models to the θ-vs-titrant output of
tfs-predict-theta and selects the best-fitting model per
(genotype, titrant_name) pair by AIC weight.
tfs-cat-response \
tfs_theta_pred.csv \
--workers 8
Output (default --out_prefix tfs_cat_response):
tfs_cat_response.csv— one row per (genotype, titrant_name) withbest_model, AIC weights, and fitted parameters for every model.
—
Model Components#
tfs-configure-model accepts --<axis>_model flags to select from a
registry of pluggable sub-models for each aspect of the generative process.
Condition Growth (--condition_growth_model)#
Maps operator occupancy to per-condition growth rates: k = b + A·m·θ.
linear (default) — shared hierarchical prior for m and b across conditions
power — power-law relationship; incompatible with
--theta_rescale_model logitsaturation — saturating (Michaelis-Menten-like) relationship; incompatible with
logit
Growth Transition (--growth_transition_model)#
Models the lag phase when bacteria switch from pre-selection to selection medium.
instant (default) — no lag; genotypes immediately adopt the new growth rate
memory — occupancy-dependent lag time
baranyi — Baranyi–Roberts lag model
baranyi_k — Baranyi model parameterised through growth rate k
baranyi_tau — Baranyi model parameterised through lag time τ
two_pop — two-population lag model
Initial Population (--ln_cfu0_model)#
Models the starting genotype frequencies in each replicate.
hierarchical (default) — shared global prior on ln(CFU0)
hierarchical_factored — factored hierarchical prior
Pleiotropic Growth Effect (--dk_geno_model)#
Models the growth-rate offset attributable to each genotype independent of TF occupancy (dk_geno).
hierarchical_geno (default) — per-genotype effects drawn from a global prior
fixed — no genotype-specific pleiotropic effects
Activity (--activity_model)#
Models the per-genotype scalar A that multiplies the occupancy contribution to growth.
horseshoe_geno (default) — sparse horseshoe prior over genotypes
hierarchical_geno — standard hierarchical prior over genotypes
horseshoe_mut — sparse horseshoe prior, decomposed by mutation
hierarchical_mut — hierarchical prior, decomposed by mutation
fixed — all genotypes share the wildtype activity (A = 1)
Occupancy (--theta_model)#
Parameterises fractional operator occupancy θ as a function of titrant concentration.
hill_geno (default) — Hill equation with per-genotype Kd and n
categorical_geno — independent θ at each titrant concentration
hill_mut — Hill equation with per-mutation additive effects on Kd and n
thermo.* — thermodynamic partition-function models (see
configure_model_cli.pydocstring for the full set of registry keys)
Transformation Correction (--transformation_model)#
Corrects for congression (multiple plasmids entering one cell during transformation).
empirical (default) — empirical correction curve
logit_norm — logit-normal correction
single — no correction (assumes exactly one plasmid per cell)
Theta Rescale (--theta_rescale_model)#
Rescales θ before it enters the condition-growth model.
passthrough (default) — identity; θ ∈ [0, 1]
logit — maps θ → log(θ/(1−*θ*)); expands dynamic range at extremes. Incompatible with the
powerandsaturationgrowth models.
Noise Models#
Additional noise components (default zero for all):
--theta_growth_noise_model:zero(default),beta,logit_normal--theta_binding_noise_model:zero(default),beta--growth_noise_model:zero(default),normal_kt(learns a global growth-rate noise term σ_k in quadrature with ln_cfu_std)
—
Model Naming Conventions#
Model component names follow a set of conventions that encode what each component does.
Level of Parameterisation#
Models that infer one parameter per genotype carry a _geno suffix; models that
decompose effects at the mutation level carry a _mut suffix. Models with no
natural per-mutation alternative (e.g. fixed) have no suffix.
Examples: hierarchical_geno, horseshoe_mut, hill_geno, hill_mut.
Thermodynamic Theta Models#
Operator-occupancy (θ) models derived from an explicit partition function use a three-part dot-separated name:
thermo.{MODEL}.{PRIOR}
MODEL encodes the partition-function topology with four underscore-separated fields:
Field |
Meaning |
|---|---|
|
Oligomeric state (e.g. |
|
Number of conformational states |
|
Number of independent equilibrium constants |
|
Unfolded state: |
A trailing letter (a, b, …) disambiguates topologically distinct models
that share the same O/C/K/U counts. These letters carry no ordering; a simply
means “first registered variant.”
Currently implemented models:
O2_C4_K3_U0_a— four-state lac-repressor homodimer (no unfolding)O2_C4_K3_U1_a— same with an explicit folding/unfolding equilibriumO2_C12_K5_U0_a— full MWC two-state homodimer (no unfolding)O2_C12_K5_U1_a— same with an explicit folding/unfolding equilibrium
PRIOR describes how the equilibrium constants are parameterised:
Name |
Description |
|---|---|
|
Independent normal prior on each log-K |
|
Priors informed by estimated ΔΔG values |
|
Neural network predicting per-conformation ΔΔG values |
|
Neural network predicting log-K values directly (planned) |
Full example names: thermo.O2_C4_K3_U0_a.PK, thermo.O2_C12_K5_U1_a.PnnC.