LigandMPNN Features#

Some tfscreen theta components (e.g. thermo.O2_C4_K3_U0_a.PnnC) use per-position log-probability features from LigandMPNN as structural inputs. These features are computed once — outside the main tfscreen environment — and stored in an NPZ file that is read at analysis time.

Warning

The LigandMPNN conda environment is incompatible with the main tfscreen environment. LigandMPNN pins NumPy 1.23.5 and PyTorch, which conflict with the JAX/NumPyro stack that the rest of tfscreen requires. The feature generator script is intentionally standalone (no tfscreen installation needed) so it can run in the LigandMPNN environment without conflict. Do not attempt to run any other tfscreen commands from the LigandMPNN environment.

Overview#

The feature generation pipeline is a two-step, two-environment workflow:

LigandMPNN environment — run tfs-generate-ligandmpnn-features to score each PDB structure and write an NPZ feature file.
tfscreen environment — run the hierarchical model as usual; point the relevant theta component at the NPZ file via the run config YAML.

The NPZ file is a permanent, reusable artifact. You only need to regenerate it if you change the input PDB files or the LigandMPNN model weights.

Installation#

Follow the LigandMPNN installation instructions to create a dedicated conda environment (LigandMPNN requires Python 3.11 and specific PyTorch pinning). Do not install tfscreen in that environment — the two packages have incompatible NumPy and JAX dependencies.

The feature generator is a standalone script that requires only numpy and PyYAML, both of which are already available in any environment that can run LigandMPNN. Copy or clone the tfscreen repository and run the script directly:

conda activate ligandmpnn_env
python /path/to/tfscreen/scripts/generate_ligandmpnn_features.py ...

Preparing Inputs#

structures YAML#

Create a YAML file that maps short structure names to PDB file paths. The names must match what the downstream theta component expects. For lac_dimer_nn_mut the required keys are H, HD, L, and LE2, representing the four thermodynamic states:

H:   /path/to/H_apo.pdb
HD:  /path/to/HD_dna_bound.pdb
L:   /path/to/L_allosteric.pdb
LE2: /path/to/LE2_iptg_bound.pdb

Each PDB must contain the full protein chain(s) in the conformation appropriate for that state. LigandMPNN scores every amino acid position in the PDB, so make sure the chain contains only the residues you intend to score.

PDB residue numbering#

Mutation labels used by the theta component follow the convention {wt_aa}{PDB_resnum}{mut_aa} (e.g. A42G). The residue numbers must match the ATOM record numbers in the PDB files — not a zero-based index. Verify that your PDB files use consistent numbering across all four structures.

Running the Feature Generator#

Activate the LigandMPNN environment, then run:

python /path/to/tfscreen/scripts/generate_ligandmpnn_features.py structures.yaml \
    --out features.npz \
    --ligandmpnn_dir /path/to/LigandMPNN \
    [--model_type ligand_mpnn] \
    [--checkpoint /path/to/weights.pt] \
    [--num_batches 10] \
    [--seed 42]

Required arguments

structures.yaml: Path to the YAML file mapping structure names to PDB paths (see above).
--out features.npz: Output file. A single NPZ is written containing one array per structure name and one residue-number index array per structure.
--ligandmpnn_dir: Path to the root of the LigandMPNN repository (the directory that contains score.py).

Optional arguments

--model_type: LigandMPNN model variant. Default: ligand_mpnn. Choices: ligand_mpnn, protein_mpnn, per_residue_label_membrane_mpnn, global_label_membrane_mpnn, soluble_mpnn.
--checkpoint: Path to a custom model-weights .pt file. If omitted, LigandMPNN uses its built-in default weights for the selected model type.
--num_batches: Number of random decoding-order batches to average. More batches reduce variance in the log-probability estimates. Default: 10; at least 10 is recommended.
--seed: Random seed passed to LigandMPNN. Default: 42.

NPZ file contents#

For each structure name {name} the NPZ contains:

{name}: float32 array of shape (L, 20): mean log P(AA | structure, context) averaged over all decoding-order batches. Columns follow the LigandMPNN / ProteinMPNN alphabet ACDEFGHIKLMNPQRSTVWY (indices 0–19).
{name}_residue_nums: int32 array of shape (L,): PDB residue numbers for each row, used to look up positions by mutation label.

Using the NPZ in a Run Config#

Point the ligandmpnn_features field in your run config YAML at the NPZ file. The exact field name depends on the theta component; for lac_dimer_nn_mut it is typically set under the theta component block. Refer to the component’s get_hyperparameters() for the full list of required config keys.