Analysis#

The tfscreen.tfmodel module provides a hierarchical Bayesian model that infers per-genotype and operator occupancy (θ) from bacterial growth data and direct binding measurements.

Standard Workflow#

A complete analysis run consists of the following steps. The example commands assume all input files are in the current working directory and use the default output-file naming convention (tfs_configure_*, tfs_fit_model_*, etc.). The example run.srun shows a complete Slurm script for a cluster run.

Step 1: Configure Model (tfs-configure-model)#

Validates the input data, maps categorical labels to numerical indices, selects model components, and writes three configuration files:

  • {out_prefix}_config.yaml — main configuration read by all downstream steps

  • {out_prefix}_priors.csv — prior distribution settings for all parameters

  • {out_prefix}_guesses.csv — initial-value guesses for array parameters

binding_df (direct binding measurements) is the only required argument. growth_df is optional; omitting it configures a binding-only model.

tfs-configure-model binding.csv \
    --growth_df growth.csv \
    --out_prefix tfs_configure

Key component flags (see Model Components for the full list):

  • --condition_growth_model — how growth rate depends on TF occupancy (default: linear)

  • --growth_transition_model — pre-to-selection phase lag model (default: instant)

  • --activity_model — per-genotype TF activity prior (default: horseshoe_geno)

  • --theta_model — operator occupancy parameterisation (default: hill_geno)

  • --dk_geno_model — pleiotropic growth-effect prior (default: hierarchical_geno)

  • --transformation_model — multi-plasmid congression correction (default: empirical)

Step 2: Pre-fit Calibration (tfs-prefit-calibration)#

Runs a fast MAP fit on a simplified version of the model to calibrate the priors for the condition_growth and growth_transition components. The calibration fit is restricted to the intersection of genotypes and titrant conditions present in both the growth and binding data.

After convergence the script updates the production {out_prefix}_priors.csv and {out_prefix}_guesses.csv files in place (a .bak backup is written first), giving the full production fit a warm start. Diagnostic PDFs (one per genotype) and a {out_prefix}_calib_stats.json file are also written.

tfs-prefit-calibration tfs_configure_config.yaml \
    --seed 42 \
    --convergence_tolerance 0.00001

Step 3: Fit Model (tfs-fit-model)#

Performs the main parameter estimation. Three inference methods are available via --analysis_method:

  • map — Maximum A Posteriori optimisation (Adam). Fast; produces a point estimate. Use tfs-sample-posterior afterwards to obtain uncertainty estimates via a Laplace approximation.

  • svi (default) — Stochastic Variational Inference. Automatically runs a short MAP pre-pass (--pre_map_num_epoch) before the full variational fit. Produces a full approximate posterior; posterior samples are drawn after convergence.

  • nuts — No-U-Turn Sampler (exact MCMC). Slowest; most accurate.

The example below matches the MAP configuration used in the example run.srun:

tfs-fit-model \
    tfs_configure_config.yaml \
    --seed 42 \
    --analysis_method map \
    --adam_step_size 1e-6 \
    --convergence_check_interval 100 \
    --convergence_window 50 \
    --checkpoint_interval 100 \
    --max_num_epochs 100000000 \
    --pre_map_num_epoch 100000 \
    --convergence_tolerance 0.0005 \
    --patience 5

Key outputs (with default --out_prefix tfs_fit_model):

  • tfs_fit_model_checkpoint.pkl — optimizer checkpoint; resume with --checkpoint_file

  • tfs_fit_model_params.npz — MAP/SVI parameter point estimates

Step 4: Sample Posterior (tfs-sample-posterior)#

Draws posterior samples from a checkpoint produced by tfs-fit-model. The checkpoint type is detected automatically:

  • MAP checkpoint — constructs a Laplace (Hessian-based) Gaussian approximation at the MAP point, then draws samples.

  • SVI checkpoint — draws directly from the fitted variational distribution (resumes with 0 additional optimisation epochs).

  • NUTS checkpoint — reconstructs posteriors from the saved MCMC samples.

tfs-sample-posterior \
    tfs_configure_config.yaml \
    tfs_fit_model_checkpoint.pkl \
    --num_posterior_samples 1000 \
    --sampling_batch_size 10 \
    --seed 42

Output (default --out_prefix tfs_posterior):

  • tfs_posterior.h5 — posterior samples for all latent variables; passed to the prediction steps below.

Step 5: Extract Parameters (tfs-extract-params)#

Extracts interpretable parameter summaries from a posterior .h5 file and writes one CSV per parameter group.

tfs-extract-params \
    tfs_configure_config.yaml \
    tfs_posterior.h5

Outputs (default --out_prefix tfs_params):

  • tfs_params_activity.csv — per-genotype TF activity A with posterior quantiles

  • tfs_params_theta.csv — inferred occupancy parameters (Hill Kd, n, etc.)

  • Additional CSVs depending on the components selected during configuration.

Step 6: Predict Growth (tfs-predict-growth)#

Predicts ln(CFU) from the fitted model. By default, predictions are produced at every (genotype, replicate, condition, titrant_name, titrant_conc, time) combination present in the training data.

tfs-predict-growth \
    tfs_configure_config.yaml \
    tfs_posterior.h5

Output (default --out_prefix tfs_growth_pred):

  • tfs_growth_pred.csv — one row per prediction point; quantile columns (median, lower_95, upper_95, etc.) plus in_training_data.

To add predictions at novel genotypes or concentrations, use --genotypes_file or --titrant_concs_file (plain-text files, one value per line). Pass --only_files to skip training-data combinations and predict only at the file-specified inputs.

Step 7: Predict Theta (tfs-predict-theta)#

Predicts operator occupancy θ as a function of titrant concentration.

tfs-predict-theta \
    tfs_configure_config.yaml \
    tfs_posterior.h5 \
    --genotypes_file predict_genotypes.txt

Output (default --out_prefix tfs_theta_pred):

  • tfs_theta_pred.csv — one row per (genotype, titrant_name, titrant_conc) with posterior quantile columns and an in_training_data flag.

predict_genotypes.txt is a plain-text file with one genotype per line (e.g. M42I/K84L or wt). Genotypes not seen during training can be predicted using the mutation-additivity model when the chosen theta_model supports it (e.g. hill_mut).

Step 8: Categorise Response (tfs-cat-response)#

Fits categorical response curve models to the θ-vs-titrant output of tfs-predict-theta and selects the best-fitting model per (genotype, titrant_name) pair by AIC weight.

tfs-cat-response \
    tfs_theta_pred.csv \
    --workers 8

Output (default --out_prefix tfs_cat_response):

  • tfs_cat_response.csv — one row per (genotype, titrant_name) with best_model, AIC weights, and fitted parameters for every model.

Model Components#

tfs-configure-model accepts --<axis>_model flags to select from a registry of pluggable sub-models for each aspect of the generative process.

Condition Growth (--condition_growth_model)#

Maps operator occupancy to per-condition growth rates: k = b + A·m·θ.

  • linear (default) — shared hierarchical prior for m and b across conditions

  • power — power-law relationship; incompatible with --theta_rescale_model logit

  • saturation — saturating (Michaelis-Menten-like) relationship; incompatible with logit

Growth Transition (--growth_transition_model)#

Models the lag phase when bacteria switch from pre-selection to selection medium.

  • instant (default) — no lag; genotypes immediately adopt the new growth rate

  • memory — occupancy-dependent lag time

  • baranyi — Baranyi–Roberts lag model

  • baranyi_k — Baranyi model parameterised through growth rate k

  • baranyi_tau — Baranyi model parameterised through lag time τ

  • two_pop — two-population lag model

Initial Population (--ln_cfu0_model)#

Models the starting genotype frequencies in each replicate.

  • hierarchical (default) — shared global prior on ln(CFU0)

  • hierarchical_factored — factored hierarchical prior

Pleiotropic Growth Effect (--dk_geno_model)#

Models the growth-rate offset attributable to each genotype independent of TF occupancy (dk_geno).

  • hierarchical_geno (default) — per-genotype effects drawn from a global prior

  • fixed — no genotype-specific pleiotropic effects

Activity (--activity_model)#

Models the per-genotype scalar A that multiplies the occupancy contribution to growth.

  • horseshoe_geno (default) — sparse horseshoe prior over genotypes

  • hierarchical_geno — standard hierarchical prior over genotypes

  • horseshoe_mut — sparse horseshoe prior, decomposed by mutation

  • hierarchical_mut — hierarchical prior, decomposed by mutation

  • fixed — all genotypes share the wildtype activity (A = 1)

Occupancy (--theta_model)#

Parameterises fractional operator occupancy θ as a function of titrant concentration.

  • hill_geno (default) — Hill equation with per-genotype Kd and n

  • categorical_geno — independent θ at each titrant concentration

  • hill_mut — Hill equation with per-mutation additive effects on Kd and n

  • thermo.* — thermodynamic partition-function models (see configure_model_cli.py docstring for the full set of registry keys)

Transformation Correction (--transformation_model)#

Corrects for congression (multiple plasmids entering one cell during transformation).

  • empirical (default) — empirical correction curve

  • logit_norm — logit-normal correction

  • single — no correction (assumes exactly one plasmid per cell)

Theta Rescale (--theta_rescale_model)#

Rescales θ before it enters the condition-growth model.

  • passthrough (default) — identity; θ ∈ [0, 1]

  • logit — maps θ → log(θ/(1−*θ*)); expands dynamic range at extremes. Incompatible with the power and saturation growth models.

Noise Models#

Additional noise components (default zero for all):

  • --theta_growth_noise_model: zero (default), beta, logit_normal

  • --theta_binding_noise_model: zero (default), beta

  • --growth_noise_model: zero (default), normal_kt (learns a global growth-rate noise term σ_k in quadrature with ln_cfu_std)

Model Naming Conventions#

Model component names follow a set of conventions that encode what each component does.

Level of Parameterisation#

Models that infer one parameter per genotype carry a _geno suffix; models that decompose effects at the mutation level carry a _mut suffix. Models with no natural per-mutation alternative (e.g. fixed) have no suffix.

Examples: hierarchical_geno, horseshoe_mut, hill_geno, hill_mut.

Thermodynamic Theta Models#

Operator-occupancy (θ) models derived from an explicit partition function use a three-part dot-separated name:

thermo.{MODEL}.{PRIOR}

MODEL encodes the partition-function topology with four underscore-separated fields:

Field

Meaning

O

Oligomeric state (e.g. O2 = homodimer)

C

Number of conformational states

K

Number of independent equilibrium constants

U

Unfolded state: U0 = folded only, U1 = folding equilibrium

A trailing letter (a, b, …) disambiguates topologically distinct models that share the same O/C/K/U counts. These letters carry no ordering; a simply means “first registered variant.”

Currently implemented models:

  • O2_C4_K3_U0_a — four-state lac-repressor homodimer (no unfolding)

  • O2_C4_K3_U1_a — same with an explicit folding/unfolding equilibrium

  • O2_C12_K5_U0_a — full MWC two-state homodimer (no unfolding)

  • O2_C12_K5_U1_a — same with an explicit folding/unfolding equilibrium

PRIOR describes how the equilibrium constants are parameterised:

Name

Description

PK

Independent normal prior on each log-K

PddG

Priors informed by estimated ΔΔG values

PnnC

Neural network predicting per-conformation ΔΔG values

PnnK

Neural network predicting log-K values directly (planned)

Full example names: thermo.O2_C4_K3_U0_a.PK, thermo.O2_C12_K5_U1_a.PnnC.