=================== LigandMPNN Features =================== Some tfscreen theta components (e.g. ``thermo.O2_C4_K3_U0_a.PnnC``) use per-position log-probability features from `LigandMPNN `_ as structural inputs. These features are computed once — outside the main tfscreen environment — and stored in an NPZ file that is read at analysis time. .. warning:: The LigandMPNN conda environment is **incompatible** with the main tfscreen environment. LigandMPNN pins NumPy 1.23.5 and PyTorch, which conflict with the JAX/NumPyro stack that the rest of tfscreen requires. The feature generator script is intentionally standalone (no tfscreen installation needed) so it can run in the LigandMPNN environment without conflict. Do not attempt to run any other tfscreen commands from the LigandMPNN environment. Overview -------- The feature generation pipeline is a two-step, two-environment workflow: 1. **LigandMPNN environment** — run ``tfs-generate-ligandmpnn-features`` to score each PDB structure and write an NPZ feature file. 2. **tfscreen environment** — run the hierarchical model as usual; point the relevant theta component at the NPZ file via the run config YAML. The NPZ file is a permanent, reusable artifact. You only need to regenerate it if you change the input PDB files or the LigandMPNN model weights. Installation ------------ Follow the `LigandMPNN installation instructions `_ to create a dedicated conda environment (LigandMPNN requires Python 3.11 and specific PyTorch pinning). **Do not install tfscreen** in that environment — the two packages have incompatible NumPy and JAX dependencies. The feature generator is a standalone script that requires only ``numpy`` and ``PyYAML``, both of which are already available in any environment that can run LigandMPNN. Copy or clone the tfscreen repository and run the script directly:: conda activate ligandmpnn_env python /path/to/tfscreen/scripts/generate_ligandmpnn_features.py ... Preparing Inputs ---------------- structures YAML ^^^^^^^^^^^^^^^ Create a YAML file that maps short structure names to PDB file paths. The names must match what the downstream theta component expects. For ``lac_dimer_nn_mut`` the required keys are ``H``, ``HD``, ``L``, and ``LE2``, representing the four thermodynamic states: .. code-block:: yaml H: /path/to/H_apo.pdb HD: /path/to/HD_dna_bound.pdb L: /path/to/L_allosteric.pdb LE2: /path/to/LE2_iptg_bound.pdb Each PDB must contain the full protein chain(s) in the conformation appropriate for that state. LigandMPNN scores every amino acid position in the PDB, so make sure the chain contains only the residues you intend to score. PDB residue numbering ^^^^^^^^^^^^^^^^^^^^^ Mutation labels used by the theta component follow the convention ``{wt_aa}{PDB_resnum}{mut_aa}`` (e.g. ``A42G``). The residue numbers must match the ``ATOM`` record numbers in the PDB files — not a zero-based index. Verify that your PDB files use consistent numbering across all four structures. Running the Feature Generator ----------------------------- Activate the LigandMPNN environment, then run:: python /path/to/tfscreen/scripts/generate_ligandmpnn_features.py structures.yaml \ --out features.npz \ --ligandmpnn_dir /path/to/LigandMPNN \ [--model_type ligand_mpnn] \ [--checkpoint /path/to/weights.pt] \ [--num_batches 10] \ [--seed 42] **Required arguments** ``structures.yaml`` Path to the YAML file mapping structure names to PDB paths (see above). ``--out features.npz`` Output file. A single NPZ is written containing one array per structure name and one residue-number index array per structure. ``--ligandmpnn_dir`` Path to the root of the LigandMPNN repository (the directory that contains ``score.py``). **Optional arguments** ``--model_type`` LigandMPNN model variant. Default: ``ligand_mpnn``. Choices: ``ligand_mpnn``, ``protein_mpnn``, ``per_residue_label_membrane_mpnn``, ``global_label_membrane_mpnn``, ``soluble_mpnn``. ``--checkpoint`` Path to a custom model-weights ``.pt`` file. If omitted, LigandMPNN uses its built-in default weights for the selected model type. ``--num_batches`` Number of random decoding-order batches to average. More batches reduce variance in the log-probability estimates. Default: ``10``; at least ``10`` is recommended. ``--seed`` Random seed passed to LigandMPNN. Default: ``42``. NPZ file contents ^^^^^^^^^^^^^^^^^ For each structure name ``{name}`` the NPZ contains: ``{name}`` ``float32`` array of shape ``(L, 20)``: mean log P(AA | structure, context) averaged over all decoding-order batches. Columns follow the LigandMPNN / ProteinMPNN alphabet ``ACDEFGHIKLMNPQRSTVWY`` (indices 0–19). ``{name}_residue_nums`` ``int32`` array of shape ``(L,)``: PDB residue numbers for each row, used to look up positions by mutation label. Using the NPZ in a Run Config ----------------------------- Point the ``ligandmpnn_features`` field in your run config YAML at the NPZ file. The exact field name depends on the theta component; for ``lac_dimer_nn_mut`` it is typically set under the theta component block. Refer to the component's ``get_hyperparameters()`` for the full list of required config keys.