Fragment Growing Validation Dataset

Here, we provide datasets for validating computational methods for fragment growing. The datasets are derived from the PDBbind Refined Set, whose structural criteria make it a high-quality basis for structure-based validation.[1] The two collections are summarized below. The associated publication describes the dataset generation in detail.[2]

Self-Growing Set

All ligands in the PDBbind Refined Set[1] were cut at random single bonds to produce fragments. These fragments were then filtered using the Rule of Three.[3] Fragments also had to have at least three heavy atoms. Furthermore, they had to fulfil structural criteria to ensure that they were, for example, not completely solvent-exposed.
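The descriptor-based part of this filter can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the descriptors have already been computed by a cheminformatics toolkit, the `FragmentDescriptors` type and function names are hypothetical, and the cutoffs are the core Rule of Three criteria as published in [3] (molecular weight below 300 Da, cLogP, hydrogen bond donors, and hydrogen bond acceptors each at most 3). The structural criteria, such as solvent exposure, are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FragmentDescriptors:
    """Precomputed descriptors of one fragment (hypothetical container)."""
    heavy_atoms: int
    mol_weight: float        # in Da
    clogp: float
    h_bond_donors: int
    h_bond_acceptors: int


def passes_rule_of_three(frag: FragmentDescriptors) -> bool:
    # Core Rule of Three criteria from reference [3].
    return (
        frag.mol_weight < 300.0
        and frag.clogp <= 3.0
        and frag.h_bond_donors <= 3
        and frag.h_bond_acceptors <= 3
    )


def keep_fragment(frag: FragmentDescriptors, min_heavy_atoms: int = 3) -> bool:
    # A fragment is kept only if it is large enough and passes the Rule of Three.
    return frag.heavy_atoms >= min_heavy_atoms and passes_rule_of_three(frag)
```

For example, a fragment with 9 heavy atoms, a molecular weight of 122.1 Da, a cLogP of 1.9, one donor, and two acceptors is kept, whereas a two-atom fragment is rejected by the heavy atom minimum.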

The dataset can be downloaded at https://amd.zbh.uni-hamburg.de/download/fgvd/v1.1/self_growing_set.zip.

Cross-Growing Set

We mined the PDBbind Refined Set[1] for pairs of ligands that are bound to the same pocket in different PDB structures and that could conceptually be grown from one another by extending or modifying one part of one of the ligands. The ligands of all pockets were compared to each other in a pairwise maximum common substructure search to find a common core. The variable parts, in other words the potential fragments, were filtered according to the same rules as in the self-growing set. We ensured equivalent binding modes by requiring the position and direction of the exit bonds to be similar.
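One way to express the exit bond comparison is sketched below. It is an assumption-laden illustration rather than the published procedure: an exit bond is represented as a pair of 3D coordinates (core atom, first fragment atom), position similarity is measured as the distance between the core atoms, direction similarity as the angle between the bond vectors, and both thresholds (`max_distance`, `max_angle_deg`) are hypothetical placeholder values. The actual criteria are described in the associated publication.[2]

```python
import math


def _sub(a, b):
    # Component-wise difference of two 3D points.
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _norm(v):
    # Euclidean length of a 3D vector.
    return math.sqrt(sum(c * c for c in v))


def exit_bonds_similar(bond_a, bond_b, max_distance=1.0, max_angle_deg=30.0):
    """Compare two exit bonds given as ((x, y, z) start, (x, y, z) end).

    Thresholds are illustrative assumptions, not values from the source.
    """
    (a_start, a_end), (b_start, b_end) = bond_a, bond_b

    # Position: how far apart the two core-side atoms are (in Angstrom).
    offset = _norm(_sub(a_start, b_start))

    # Direction: angle between the two bond vectors.
    dir_a, dir_b = _sub(a_end, a_start), _sub(b_end, b_start)
    len_a, len_b = _norm(dir_a), _norm(dir_b)
    if len_a == 0.0 or len_b == 0.0:
        raise ValueError("exit bond has zero length")
    cos_angle = sum(x * y for x, y in zip(dir_a, dir_b)) / (len_a * len_b)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    angle = math.degrees(math.acos(cos_angle))

    return offset <= max_distance and angle <= max_angle_deg
```

Two nearly parallel exit bonds whose core atoms lie a few tenths of an Ångström apart pass this check, while a bond rotated by 90° or shifted by several Ångström does not.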

The dataset can be downloaded at https://amd.zbh.uni-hamburg.de/download/fgvd/v1.1/cross_growing_set.zip.

Ensemble Validation

RMSD-clustered ensembles of binding sites from the PDB were generated for all test cases of the cross-growing set using SIENA[4] in the "Docking" configuration. This configuration implies, for example, binding site sequence identity. The output was limited to five binding site conformations using the built-in all-atom clustering. The SIENA query binding site was the input binding site of the cross-growing test case in question, not the reference binding site. A test case was included in the ensemble flexibility subset only if its ensemble contained at least two binding site conformations. Note that the reference binding site containing the ligand is not part of the binding site ensemble.

Water Replacement

We generated a subset of cross-growing test cases by checking whether water molecules were replaced upon fragment growing. To this end, we calculated van der Waals (vdW) radius overlaps between the water molecules in the binding site used for growing and the ligand to be grown. If the overlap between the ligand to be grown and a water molecule exceeded a 60% vdW overlap threshold, we considered the water molecule replaced by the grown ligand. Search points were generated for the replaced water molecules and for the ligand. If a search point generated for a water molecule was within 2 Å of a search point of the same type generated for the ligand, then the search point of the water was used as a query for the water replacement growing.
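The final selection step, matching water search points to ligand search points, can be sketched as follows. This is a minimal illustration under stated assumptions: the `SearchPoint` type, its `kind` labels (for example "donor" or "acceptor"), and the function name are hypothetical, and the search points are assumed to have been generated beforehand. Only the 2 Å cutoff and the same-type requirement are taken from the description above.

```python
import math
from typing import List, NamedTuple, Tuple


class SearchPoint(NamedTuple):
    """A typed point in the binding site (hypothetical representation)."""
    kind: str                          # interaction type label, e.g. "donor"
    xyz: Tuple[float, float, float]    # coordinates in Angstrom


def water_replacement_queries(
    water_points: List[SearchPoint],
    ligand_points: List[SearchPoint],
    max_distance: float = 2.0,
) -> List[SearchPoint]:
    """Return water search points matched by a same-type ligand search point."""
    queries = []
    for wp in water_points:
        # Keep the water point if any ligand point of the same type is close enough.
        if any(
            lp.kind == wp.kind and math.dist(wp.xyz, lp.xyz) <= max_distance
            for lp in ligand_points
        ):
            queries.append(wp)
    return queries
```

A water search point without a same-type ligand search point within the cutoff yields no query, even if a ligand search point of a different type is nearby.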

[1] Liu, Z.; Su, M.; Han, L.; Liu, J.; Yang, Q.; Li, Y.; Wang, R. Forging the Basis for Developing Protein-Ligand Interaction Scoring Functions. Acc Chem Res 2017, 50 (2), 302-309. DOI: https://doi.org/10.1021/acs.accounts.6b00491
[2] Penner, P.; Martiny, V.; Gohier, A.; Gastreich, M.; Ducrot, P.; Brown, D.; Rarey, M. Shape-Based Descriptors for Efficient Structure-Based Fragment Growing. J Chem Inf Model 2020, 60 (12), 6269-6281. DOI: https://doi.org/10.1021/acs.jcim.0c00920
[3] Congreve, M.; Carr, R.; Murray, C.; Jhoti, H. A 'Rule of Three' for Fragment-Based Lead Discovery? Drug Discov Today 2003, 8 (19), 876-877. DOI: https://doi.org/10.1016/s1359-6446(03)02831-9
[4] Bietz, S.; Rarey, M. SIENA: Efficient Compilation of Selective Protein Binding Site Ensembles. J Chem Inf Model 2016, 56 (1), 248-259. DOI: https://doi.org/10.1021/acs.jcim.5b00588