Fragment Growing Validation Dataset
The FGVD is based on the PDBBind Refined Set[1], which was chosen because the structural considerations behind it make it a useful dataset for structure-based validation. Below are short summaries of the two separate collections that make up the full data set. The full data set generation procedure can be found in the associated publication: Shape-Based Descriptors for Efficient Structure-Based Fragment Growing
Self-Growing Set
[Download]
All ligands in the PDBBind Refined Set[1] were cut at random single bonds to produce fragments. These fragments were then filtered using the "Rule of Three"[2]. Fragments also had to have at least 3 heavy atoms to be considered as such. Furthermore, fragments were filtered according to structural criteria found in the Supporting Information to ensure that they were, for example, not completely solvent-exposed.
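The property filter described above can be sketched as follows. This is a minimal illustration, not the procedure used to build the data set: it assumes the descriptors have already been computed by a cheminformatics toolkit, it uses the four core "Rule of Three" criteria as commonly stated (molecular weight below 300 Da, at most 3 hydrogen bond donors, at most 3 acceptors, cLogP at most 3), and the names FragmentDescriptors, passes_rule_of_three and is_valid_fragment are hypothetical. The structural criteria from the Supporting Information are not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FragmentDescriptors:
    """Precomputed descriptors for one candidate fragment (hypothetical container)."""
    heavy_atoms: int
    mol_weight: float       # molecular weight in Da
    h_bond_donors: int
    h_bond_acceptors: int
    clogp: float


def passes_rule_of_three(frag: FragmentDescriptors) -> bool:
    """Check the four core 'Rule of Three' criteria (thresholds are assumptions)."""
    return (
        frag.mol_weight < 300.0
        and frag.h_bond_donors <= 3
        and frag.h_bond_acceptors <= 3
        and frag.clogp <= 3.0
    )


def is_valid_fragment(frag: FragmentDescriptors, min_heavy_atoms: int = 3) -> bool:
    """Combine the heavy atom minimum with the 'Rule of Three' check."""
    return frag.heavy_atoms >= min_heavy_atoms and passes_rule_of_three(frag)
```

A fragment with 8 heavy atoms and moderate descriptor values would pass, while a two-atom fragment or one heavier than 300 Da would be rejected.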
Cross-Growing Set
[Download]
The PDBBind Refined Set[1] was mined for pairs of ligands that bind the same pocket in different PDB structures and that could conceptually be grown from one another by extending or modifying one part of one of the ligands. The ligands of all pockets were compared to each other in a pairwise maximum common substructure search to find a common core. The variable parts, in other words the potential fragments, were filtered according to the same rules as in the self-growing set. Furthermore, to ensure the binding mode of the two ligands was equivalent, the position and direction of the exit bonds were compared.
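The exit bond comparison can be illustrated with a short sketch. It assumes that both ligands are already in a common coordinate frame (for example after superposing the two binding sites) and that an exit bond is given by the coordinates of its core atom and its first fragment atom. The function name exit_bonds_match and the tolerances of 1.0 Å and 30 degrees are hypothetical; the actual criteria are described in the associated publication.

```python
import math
from typing import Sequence

Vec3 = Sequence[float]


def _sub(a: Vec3, b: Vec3) -> tuple:
    """Component-wise difference a - b."""
    return tuple(x - y for x, y in zip(a, b))


def _norm(v: Vec3) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def exit_bonds_match(core_a: Vec3, frag_a: Vec3, core_b: Vec3, frag_b: Vec3,
                     max_distance: float = 1.0, max_angle_deg: float = 30.0) -> bool:
    """Compare position and direction of two exit bonds (tolerances are assumptions)."""
    # Position: distance between the core atoms the exit bonds start from.
    position_offset = _norm(_sub(core_a, core_b))

    # Direction: angle between the two core -> fragment bond vectors.
    dir_a, dir_b = _sub(frag_a, core_a), _sub(frag_b, core_b)
    len_a, len_b = _norm(dir_a), _norm(dir_b)
    if len_a == 0.0 or len_b == 0.0:
        raise ValueError("exit bond atoms must not coincide")
    cos_angle = sum(x * y for x, y in zip(dir_a, dir_b)) / (len_a * len_b)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding errors
    angle_deg = math.degrees(math.acos(cos_angle))

    return position_offset <= max_distance and angle_deg <= max_angle_deg
```

Checking both quantities matters because two exit bonds can start at nearly the same position yet point into different sub-pockets, in which case the grown fragments would not be comparable.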
Ensemble Validation
[Download]
RMSD-clustered ensembles of binding sites from the PDB were generated for all test cases of the cross-growing set using SIENA[3]. SIENA was run in the "docking" configuration, which implies, for example, binding site sequence identity. SIENA output was limited to five binding site conformations using the built-in all-atom clustering. The SIENA query binding site was the input binding site of the cross-growing test case in question, not the reference binding site. A minimum of two binding site conformations was necessary for a test case to be included in the ensemble flexibility subset. Note that the reference binding site containing the ligand to be grown is excluded from the binding site ensemble.
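The inclusion rule for the ensemble subset can be sketched as below. This is only an illustration under stated assumptions: the SIENA output is represented as a plain list of conformation identifiers that has already been limited to five entries by SIENA itself, and the minimum of two conformations is assumed to be checked after the reference binding site has been removed. The function name select_ensemble and the identifiers in the usage note are hypothetical.

```python
from typing import List, Optional


def select_ensemble(siena_conformations: List[str], reference_id: str,
                    min_conformations: int = 2) -> Optional[List[str]]:
    """Return the binding site ensemble of a test case, or None if it is excluded.

    Assumes siena_conformations is the (already clustered) SIENA output.
    """
    # The reference binding site holds the ligand to be grown and is left out.
    ensemble = [c for c in siena_conformations if c != reference_id]

    # Assumption: the minimum size is checked after removing the reference.
    if len(ensemble) < min_conformations:
        return None
    return ensemble
```

For example, select_ensemble(["site_a", "site_b", "site_ref"], "site_ref") would keep the test case with two conformations, whereas an output containing only "site_a" and "site_ref" would be dropped.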
Water Replacement
[Download]
A subset of cross-growing test cases was extracted by checking whether a water was replaced in the course of the growing. This was detected by calculating van der Waals (vdW) radii overlaps between waters in the binding site that was used for growing and the ligand to be grown. If the ligand to be grown and a water exceeded a 60% vdW overlap threshold, the water was considered to have been replaced by the ligand to be grown. Search points were generated for replaced waters and the ligand. If a search point that was generated by a water molecule was within 2 Å of a search point of the same type being generated by the ligand, then the search point of the water was used as a query for the water replacement growing.
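The final matching step between water and ligand search points can be sketched as follows. The sketch assumes that search points are available as a type label plus 3D coordinates; the SearchPoint class, the type labels and the function name water_query_points are hypothetical, and the generation of search points and the vdW overlap calculation are not shown.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SearchPoint:
    """A typed search point in the binding site (hypothetical representation)."""
    kind: str                          # interaction type label, e.g. "donor"
    xyz: Tuple[float, float, float]    # coordinates in Å


def water_query_points(water_points: List[SearchPoint],
                       ligand_points: List[SearchPoint],
                       max_distance: float = 2.0) -> List[SearchPoint]:
    """Keep water search points that have a same-type ligand point within max_distance."""
    return [
        w for w in water_points
        if any(w.kind == l.kind and math.dist(w.xyz, l.xyz) <= max_distance
               for l in ligand_points)
    ]
```

Requiring the same type as well as proximity means a water search point is only used as a query when the ligand offers an equivalent search point at nearly the same place.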
References:
[1] Liu, Z., Su, M., Han, L., Liu, J., Yang, Q., Li, Y., & Wang, R. (2017). Forging the Basis for Developing Protein–Ligand Interaction Scoring Functions. Accounts of Chemical Research, 50(2), 302–309. https://doi.org/10.1021/acs.accounts.6b00491
[2] Congreve, M., Carr, R., Murray, C., & Jhoti, H. (2003). A “Rule of Three” for fragment-based lead discovery? Drug Discovery Today, 8(19), 876–877. https://doi.org/10.1016/S1359-6446(03)02831-9
[3] Bietz, S., & Rarey, M. (2016). SIENA: Efficient Compilation of Selective Protein Binding Site Ensembles. Journal of Chemical Information and Modeling, 56(1), 248–259. https://doi.org/10.1021/acs.jcim.5b00588