========
Analysis
========

The ``tfscreen.tfmodel`` module provides a hierarchical Bayesian model that
infers per-genotype TF activity (*A*) and operator occupancy (*θ*) from
bacterial growth data and direct binding measurements.

Standard Workflow
=================

A complete analysis run consists of the following steps. The example commands
assume all input files are in the current working directory and use the default
output-file naming convention (``tfs_configure_*``, ``tfs_fit_model_*``, etc.).
The :download:`example run.srun <../../examples/tfmodel/run.srun>` shows a
complete Slurm script for a cluster run.

Step 1: Configure Model (``tfs-configure-model``)
--------------------------------------------------

Validates the input data, maps categorical labels to numerical indices, selects
model components, and writes three configuration files:

* ``{out_prefix}_config.yaml`` — main configuration read by all downstream steps
* ``{out_prefix}_priors.csv`` — prior distribution settings for all parameters
* ``{out_prefix}_guesses.csv`` — initial-value guesses for array parameters

``binding_df`` (direct binding measurements) is the only required argument.
``growth_df`` is optional; omitting it configures a binding-only model.

.. code-block:: bash

   tfs-configure-model binding.csv \
       --growth_df growth.csv \
       --out_prefix tfs_configure

Key component flags (see :ref:`model-components` for the full list):

* ``--condition_growth_model`` — how growth rate depends on TF occupancy
  (default: ``linear``)
* ``--growth_transition_model`` — pre-to-selection phase lag model
  (default: ``instant``)
* ``--activity_model`` — per-genotype TF activity prior
  (default: ``horseshoe_geno``)
* ``--theta_model`` — operator occupancy parameterisation
  (default: ``hill_geno``)
* ``--dk_geno_model`` — pleiotropic growth-effect prior
  (default: ``hierarchical_geno``)
* ``--transformation_model`` — multi-plasmid congression correction
  (default: ``empirical``)

Step 2: Pre-fit Calibration (``tfs-prefit-calibration``)
---------------------------------------------------------

Runs a fast MAP fit on a simplified version of the model to calibrate the
priors for the ``condition_growth`` and ``growth_transition`` components. The
calibration fit is restricted to the intersection of genotypes and titrant
conditions present in both the growth and binding data.

After convergence, the script updates the production
``{out_prefix}_priors.csv`` and ``{out_prefix}_guesses.csv`` files in place
(a ``.bak`` backup is written first), giving the full production fit a warm
start. Diagnostic PDFs (one per genotype) and a
``{out_prefix}_calib_stats.json`` file are also written.

.. code-block:: bash

   tfs-prefit-calibration tfs_configure_config.yaml \
       --seed 42 \
       --convergence_tolerance 0.00001

Step 3: Fit Model (``tfs-fit-model``)
--------------------------------------

Performs the main parameter estimation. Three inference methods are available
via ``--analysis_method``:

* **map** — Maximum A Posteriori optimisation (Adam). Fast; produces a point
  estimate. Use ``tfs-sample-posterior`` afterwards to obtain uncertainty
  estimates via a Laplace approximation.
* **svi** (default) — Stochastic Variational Inference. Automatically runs a
  short MAP pre-pass (``--pre_map_num_epoch``) before the full variational fit.
  Produces a full approximate posterior; posterior samples are drawn after
  convergence.
* **nuts** — No-U-Turn Sampler (exact MCMC). Slowest; most accurate.

The example below matches the MAP configuration used in the example
``run.srun``:

.. code-block:: bash

   tfs-fit-model \
       tfs_configure_config.yaml \
       --seed 42 \
       --analysis_method map \
       --adam_step_size 1e-6 \
       --convergence_check_interval 100 \
       --convergence_window 50 \
       --checkpoint_interval 100 \
       --max_num_epochs 100000000 \
       --pre_map_num_epoch 100000 \
       --convergence_tolerance 0.0005 \
       --patience 5

Key outputs (with default ``--out_prefix tfs_fit_model``):

* ``tfs_fit_model_checkpoint.pkl`` — optimizer checkpoint; resume with
  ``--checkpoint_file``
* ``tfs_fit_model_params.npz`` — MAP/SVI parameter point estimates

Step 4: Sample Posterior (``tfs-sample-posterior``)
----------------------------------------------------

Draws posterior samples from a checkpoint produced by ``tfs-fit-model``. The
checkpoint type is detected automatically:

* **MAP checkpoint** — constructs a Laplace (Hessian-based) Gaussian
  approximation at the MAP point, then draws samples.
* **SVI checkpoint** — draws directly from the fitted variational distribution
  (resumes with 0 additional optimisation epochs).
* **NUTS checkpoint** — reconstructs posteriors from the saved MCMC samples.

.. code-block:: bash

   tfs-sample-posterior \
       tfs_configure_config.yaml \
       tfs_fit_model_checkpoint.pkl \
       --num_posterior_samples 1000 \
       --sampling_batch_size 10 \
       --seed 42

Output (default ``--out_prefix tfs_posterior``):

* ``tfs_posterior.h5`` — posterior samples for all latent variables; passed to
  the prediction steps below.

Step 5: Extract Parameters (``tfs-extract-params``)
----------------------------------------------------

Extracts interpretable parameter summaries from a posterior ``.h5`` file and
writes one CSV per parameter group.

.. code-block:: bash

   tfs-extract-params \
       tfs_configure_config.yaml \
       tfs_posterior.h5

Outputs (default ``--out_prefix tfs_params``):

* ``tfs_params_activity.csv`` — per-genotype TF activity *A* with posterior
  quantiles
* ``tfs_params_theta.csv`` — inferred occupancy parameters (Hill *Kd*, *n*,
  etc.)
* Additional CSVs depending on the components selected during configuration.

Step 6: Predict Growth (``tfs-predict-growth``)
------------------------------------------------

Predicts ln(CFU) from the fitted model. By default, predictions are produced at
every (genotype, replicate, condition, titrant_name, titrant_conc, time)
combination present in the training data.

.. code-block:: bash

   tfs-predict-growth \
       tfs_configure_config.yaml \
       tfs_posterior.h5

Output (default ``--out_prefix tfs_growth_pred``):

* ``tfs_growth_pred.csv`` — one row per prediction point; quantile columns
  (``median``, ``lower_95``, ``upper_95``, etc.) plus ``in_training_data``.

To add predictions at novel genotypes or concentrations, use
``--genotypes_file`` or ``--titrant_concs_file`` (plain-text files, one value
per line). Pass ``--only_files`` to skip training-data combinations and predict
only at the file-specified inputs.

Step 7: Predict Theta (``tfs-predict-theta``)
----------------------------------------------

Predicts operator occupancy *θ* as a function of titrant concentration.

.. code-block:: bash

   tfs-predict-theta \
       tfs_configure_config.yaml \
       tfs_posterior.h5 \
       --genotypes_file predict_genotypes.txt

Output (default ``--out_prefix tfs_theta_pred``):

* ``tfs_theta_pred.csv`` — one row per (genotype, titrant_name, titrant_conc)
  with posterior quantile columns and an ``in_training_data`` flag.

``predict_genotypes.txt`` is a plain-text file with one genotype per line
(e.g. ``M42I/K84L`` or ``wt``). Genotypes not seen during training can be
predicted using the mutation-additivity model when the chosen ``theta_model``
supports it (e.g. ``hill_mut``).

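
The plain-text inputs described in Steps 6 and 7 (one value per line) can be
generated with a few lines of Python. The following is a minimal sketch, not
part of ``tfscreen``: the helper ``write_value_file``, the file name
``predict_titrant_concs.txt``, and the concentration values are all
hypothetical, and the expected concentration units are not stated here, so
check them against the training data before use.

.. code-block:: python

   from pathlib import Path


   def write_value_file(path, values):
       """Write one value per line to a plain-text file and return its path."""
       path = Path(path)
       path.write_text("\n".join(str(v) for v in values) + "\n")
       return path


   # Genotype labels follow the examples in the text above.
   genotypes = ["wt", "M42I/K84L"]

   # Hypothetical concentrations, for illustration only.
   titrant_concs = [0.0, 0.001, 0.01, 0.1, 1.0]

   write_value_file("predict_genotypes.txt", genotypes)
   write_value_file("predict_titrant_concs.txt", titrant_concs)

The resulting files can then be passed via ``--genotypes_file`` and
``--titrant_concs_file``.
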
Step 8: Categorise Response (``tfs-cat-response``)
---------------------------------------------------

Fits categorical response curve models to the *θ*-vs-titrant output of
``tfs-predict-theta`` and selects the best-fitting model per
(genotype, titrant_name) pair by AIC weight.

.. code-block:: bash

   tfs-cat-response \
       tfs_theta_pred.csv \
       --workers 8

Output (default ``--out_prefix tfs_cat_response``):

* ``tfs_cat_response.csv`` — one row per (genotype, titrant_name) with
  ``best_model``, AIC weights, and fitted parameters for every model.

----

.. _model-components:

Model Components
================

``tfs-configure-model`` accepts ``--{component}_model`` flags to select from a
registry of pluggable sub-models for each aspect of the generative process.

Condition Growth (``--condition_growth_model``)
-----------------------------------------------

Maps operator occupancy to per-condition growth rates: *k = b + A·m·θ*.

* **linear** (default) — shared hierarchical prior for *m* and *b* across
  conditions
* **power** — power-law relationship; incompatible with
  ``--theta_rescale_model logit``
* **saturation** — saturating (Michaelis-Menten-like) relationship;
  incompatible with ``--theta_rescale_model logit``

Growth Transition (``--growth_transition_model``)
--------------------------------------------------

Models the lag phase when bacteria switch from pre-selection to selection
medium.

* **instant** (default) — no lag; genotypes immediately adopt the new growth
  rate
* **memory** — occupancy-dependent lag time
* **baranyi** — Baranyi–Roberts lag model
* **baranyi_k** — Baranyi model parameterised through growth rate *k*
* **baranyi_tau** — Baranyi model parameterised through lag time *τ*
* **two_pop** — two-population lag model

Initial Population (``--ln_cfu0_model``)
-----------------------------------------

Models the starting genotype frequencies in each replicate.

* **hierarchical** (default) — shared global prior on ln(CFU\ :sub:`0`)
* **hierarchical_factored** — factored hierarchical prior

Pleiotropic Growth Effect (``--dk_geno_model``)
------------------------------------------------

Models the growth-rate offset attributable to each genotype independent of TF
occupancy (*dk_geno*).

* **hierarchical_geno** (default) — per-genotype effects drawn from a global
  prior
* **fixed** — no genotype-specific pleiotropic effects

Activity (``--activity_model``)
---------------------------------

Models the per-genotype scalar *A* that multiplies the occupancy contribution
to growth.

* **horseshoe_geno** (default) — sparse horseshoe prior over genotypes
* **hierarchical_geno** — standard hierarchical prior over genotypes
* **horseshoe_mut** — sparse horseshoe prior, decomposed by mutation
* **hierarchical_mut** — hierarchical prior, decomposed by mutation
* **fixed** — all genotypes share the wildtype activity (*A* = 1)

Occupancy (``--theta_model``)
------------------------------

Parameterises fractional operator occupancy *θ* as a function of titrant
concentration.

* **hill_geno** (default) — Hill equation with per-genotype *Kd* and *n*
* **categorical_geno** — independent *θ* at each titrant concentration
* **hill_mut** — Hill equation with per-mutation additive effects on *Kd* and
  *n*
* **thermo.\*** — thermodynamic partition-function models (see the
  ``configure_model_cli.py`` docstring for the full set of registry keys)

Transformation Correction (``--transformation_model``)
-------------------------------------------------------

Corrects for congression (multiple plasmids entering one cell during
transformation).

* **empirical** (default) — empirical correction curve
* **logit_norm** — logit-normal correction
* **single** — no correction (assumes exactly one plasmid per cell)

Theta Rescale (``--theta_rescale_model``)
------------------------------------------

Rescales *θ* before it enters the condition-growth model.

* **passthrough** (default) — identity; *θ* ∈ [0, 1]
* **logit** — maps *θ* → log(*θ*/(1−*θ*)); expands dynamic range at extremes.
  Incompatible with the ``power`` and ``saturation`` growth models.

Noise Models
------------

Additional noise components (default ``zero`` for all):

* ``--theta_growth_noise_model``: ``zero`` (default), ``beta``,
  ``logit_normal``
* ``--theta_binding_noise_model``: ``zero`` (default), ``beta``
* ``--growth_noise_model``: ``zero`` (default), ``normal_kt`` (learns a global
  growth-rate noise term *σ_k* in quadrature with *ln_cfu_std*)

----

.. _model-naming:

Model Naming Conventions
========================

Model component names follow a set of conventions that encode what each
component does.

Level of Parameterisation
--------------------------

Models that infer one parameter per *genotype* carry a ``_geno`` suffix; models
that decompose effects at the *mutation* level carry a ``_mut`` suffix. Models
with no natural per-mutation alternative (e.g. ``fixed``) have no suffix.

Examples: ``hierarchical_geno``, ``horseshoe_mut``, ``hill_geno``,
``hill_mut``.

Thermodynamic Theta Models
---------------------------

Operator-occupancy (*θ*) models derived from an explicit partition function use
a three-part dot-separated name::

   thermo.{MODEL}.{PRIOR}

*MODEL* encodes the partition-function topology with four underscore-separated
fields:

.. list-table::
   :header-rows: 1
   :widths: 10 60

   * - Field
     - Meaning
   * - ``O``
     - Oligomeric state (e.g. ``O2`` = homodimer)
   * - ``C``
     - Number of conformational states
   * - ``K``
     - Number of independent equilibrium constants
   * - ``U``
     - Unfolded state: ``U0`` = folded only, ``U1`` = folding equilibrium

A trailing letter (``a``, ``b``, …) disambiguates topologically distinct models
that share the same O/C/K/U counts. These letters carry no ordering; ``a``
simply means "first registered variant."

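
The field layout above can be illustrated with a small parser. This is a
sketch for reading the names, not the lookup ``tfscreen`` performs internally.
The function ``parse_topology`` and the dictionary keys are hypothetical, and
the pattern assumes a single lowercase variant letter joined by an underscore
and a ``U`` field limited to ``0`` or ``1``.

.. code-block:: python

   import re

   # Four underscore-separated fields followed by the variant letter.
   _TOPOLOGY_RE = re.compile(r"^O(\d+)_C(\d+)_K(\d+)_U([01])_([a-z])$")


   def parse_topology(model):
       """Split a MODEL string such as 'O2_C12_K5_U1_a' into its fields."""
       match = _TOPOLOGY_RE.match(model)
       if match is None:
           raise ValueError(f"not a recognised topology name: {model!r}")
       oligomer, n_conf, n_k, unfolded, variant = match.groups()
       return {
           "oligomeric_state": int(oligomer),
           "n_conformations": int(n_conf),
           "n_equilibrium_constants": int(n_k),
           "has_unfolded_state": unfolded == "1",
           "variant": variant,
       }


   print(parse_topology("O2_C12_K5_U1_a"))
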

Currently implemented models:

* ``O2_C4_K3_U0_a`` — four-state lac-repressor homodimer (no unfolding)
* ``O2_C4_K3_U1_a`` — same with an explicit folding/unfolding equilibrium
* ``O2_C12_K5_U0_a`` — full MWC two-state homodimer (no unfolding)
* ``O2_C12_K5_U1_a`` — same with an explicit folding/unfolding equilibrium

*PRIOR* describes how the equilibrium constants are parameterised:

.. list-table::
   :header-rows: 1
   :widths: 10 60

   * - Name
     - Description
   * - ``PK``
     - Independent normal prior on each log-*K*
   * - ``PddG``
     - Priors informed by estimated ΔΔG values
   * - ``PnnC``
     - Neural network predicting per-conformation ΔΔG values
   * - ``PnnK``
     - Neural network predicting log-*K* values directly (planned)

Full example names: ``thermo.O2_C4_K3_U0_a.PK``,
``thermo.O2_C12_K5_U1_a.PnnC``.
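
A full name can be checked against the lists above before it is passed to
``--theta_model``. The sketch below is illustrative only: ``split_thermo_name``
and the two sets are hypothetical and simply transcribe this page, whereas the
authoritative set of registry keys is the one in the ``configure_model_cli.py``
docstring. Not every MODEL/PRIOR combination is necessarily registered, and
``PnnK`` is listed as planned.

.. code-block:: python

   # Names transcribed from the lists on this page (not read from the registry).
   KNOWN_MODELS = {
       "O2_C4_K3_U0_a",
       "O2_C4_K3_U1_a",
       "O2_C12_K5_U0_a",
       "O2_C12_K5_U1_a",
   }
   KNOWN_PRIORS = {"PK", "PddG", "PnnC", "PnnK"}


   def split_thermo_name(name):
       """Return (MODEL, PRIOR) for a 'thermo.{MODEL}.{PRIOR}' name."""
       parts = name.split(".")
       if len(parts) != 3 or parts[0] != "thermo":
           raise ValueError(f"expected 'thermo.MODEL.PRIOR', got {name!r}")
       _, model, prior = parts
       if model not in KNOWN_MODELS:
           raise ValueError(f"unknown MODEL field: {model!r}")
       if prior not in KNOWN_PRIORS:
           raise ValueError(f"unknown PRIOR field: {prior!r}")
       return model, prior


   model, prior = split_thermo_name("thermo.O2_C4_K3_U0_a.PK")
   print(model, prior)
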