iRAISE Datasets

Structure-based computational target prediction methods identify potential protein targets for a bioactive compound. Methods based on protein-ligand docking face several challenges, the greatest of which is probably ranking the true targets within a large dataset of protein structures. Yet no standard datasets for evaluation exist, which makes it infeasible to compare methods objectively or to demonstrate improvements.

We composed two datasets and evaluation strategies for a meaningful evaluation of new target prediction methods. The small dataset consists of three target classes and is intended for detailed proof-of-concept and selectivity studies. The large dataset is based on the sc-PDB[1] and consists of 7992 protein structures and 72 drug-like ligands from DrugBank, allowing statistical evaluation with performance metrics on a drug-like chemical space. Both datasets build on openly available resources. The composition of the datasets, the setup of screening experiments, and the evaluation strategy are described in the corresponding publication.[2] There, performance metrics capable of assessing early enrichment, such as AUC, BEDROC, and NSLR, are proposed.
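As a minimal sketch of how two of these metrics could be computed for a single ligand, the following Python code takes a list of booleans ordered from the best-scored to the worst-scored protein structure, where True marks a known target. The function names and the default alpha of 20 are assumptions chosen for illustration and are not taken from the publication. BEDROC is written in the commonly used form by Truchot and Bayly, normalized between its minimum and maximum; NSLR is omitted, see [2] for its definition.

```python
import math


def roc_auc(ranked_labels):
    """ROC AUC of a ranked list (best score first); True marks a true target."""
    n_pos = sum(1 for is_target in ranked_labels if is_target)
    n_neg = len(ranked_labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one true target and one non-target")
    non_targets_seen = 0
    inverted_pairs = 0  # pairs in which a non-target ranks ahead of a true target
    for is_target in ranked_labels:
        if is_target:
            inverted_pairs += non_targets_seen
        else:
            non_targets_seen += 1
    return 1.0 - inverted_pairs / (n_pos * n_neg)


def bedroc(ranked_labels, alpha=20.0):
    """BEDROC of a ranked list; alpha=20.0 is an assumed default, not from [2]."""
    n_total = len(ranked_labels)
    ranks = [r for r, is_target in enumerate(ranked_labels, start=1) if is_target]
    n_pos = len(ranks)
    if n_pos == 0 or n_pos == n_total:
        raise ValueError("need at least one true target and one non-target")
    ratio = n_pos / n_total
    # Exponentially weighted sum: early ranks contribute most.
    weighted_sum = sum(math.exp(-alpha * r / n_total) for r in ranks)
    # Expected contribution of one true target under a uniformly random ranking.
    random_mean = (1.0 - math.exp(-alpha)) / (n_total * (math.exp(alpha / n_total) - 1.0))
    rie = weighted_sum / (n_pos * random_mean)
    # Robust initial enhancement for the best and the worst possible ranking.
    rie_max = (1.0 - math.exp(-alpha * ratio)) / (ratio * (1.0 - math.exp(-alpha)))
    rie_min = (1.0 - math.exp(alpha * ratio)) / (ratio * (1.0 - math.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)
```

With this normalization, a ranking that places all true targets first gives a BEDROC of 1 and a ranking that places them all last gives 0. Larger alpha values put more weight on the top of the list.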

The datasets were used for evaluating our inverse screening method iRAISE.[3] The small dataset reveals limitations in distinguishing between sequentially related protein structures. The large dataset simulates real target identification scenarios. iRAISE achieves excellent or good enrichment for 55% of the drug molecules, a median AUC of 0.67, and RMSD values below 2.0 Å for 74%. In 59 of 72 cases, it ranked the first true target within the top 2% of the protein dataset of approximately 8000 structures.
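The "first true target in the top 2%" criterion can be illustrated with a short Python sketch. The function names, the use of PDB codes as identifiers, and rounding the cutoff up to a whole rank are assumptions made here; the exact procedure used for iRAISE is described in [2] and [3].

```python
import math


def first_true_target_rank(ranked_structures, true_targets):
    """Return the 1-based rank of the first true target, or None if none is found."""
    for rank, structure_id in enumerate(ranked_structures, start=1):
        if structure_id in true_targets:
            return rank
    return None


def hit_in_top_fraction(ranked_structures, true_targets, fraction=0.02):
    """Check whether the first true target lies within the top fraction of the ranking.

    Rounding the cutoff up (math.ceil) is an assumption of this sketch.
    """
    cutoff = math.ceil(fraction * len(ranked_structures))
    rank = first_true_target_rank(ranked_structures, true_targets)
    return rank is not None and rank <= cutoff
```

Applied to each of the 72 ligands in turn, counting the ligands for which `hit_in_top_fraction` returns True would give a figure comparable to the 59 of 72 reported above, provided the cutoff convention matches.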

The Supporting Information of the dataset publication[2] contains the list of PDB codes, DrugBank IDs, and true positives for the large dataset.

Both datasets can be downloaded at https://www.zbh.uni-hamburg.de/forschung/amd/datasets/iraise-datasets/dataset-iraise-tar.gz.
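The internal layout of the archive is not described here, so a reasonable first step after downloading is to inspect its contents before extracting anything. The sketch below assumes the file has been saved locally and is a (possibly compressed) tar archive; the function name and the local path are hypothetical.

```python
import tarfile


def list_archive_members(archive_path):
    """Return the names of all regular files in a tar archive.

    Mode "r:*" lets tarfile detect the compression (e.g. gzip) automatically.
    """
    with tarfile.open(archive_path, mode="r:*") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]
```

For example, `list_archive_members("dataset-iraise-tar.gz")` would return the file names contained in the downloaded archive, assuming it was saved under that name in the working directory.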

[1] Kellenberger, E.; Muller, P.; Schalon, C.; Bret, G.; Foata, N.; Rognan, D. sc-PDB: An Annotated Database of Druggable Binding Sites from the Protein Data Bank. J Chem Inf Model 2006, 46 (2), 717-727. DOI: https://doi.org/10.1021/ci050372x
[2] Schomburg, K. T.; Rarey, M. Benchmark Data Sets for Structure-Based Computational Target Prediction. J Chem Inf Model 2014, 54 (8), 2261-2274. DOI: https://doi.org/10.1021/ci500131x
[3] Schomburg, K. T.; Bietz, S.; Briem, H.; Henzler, A. M.; Urbaczek, S.; Rarey, M. Facing the Challenges of Structure-Based Target Prediction by Inverse Virtual Screening. J Chem Inf Model 2014, 54 (6), 1676-1686. DOI: https://doi.org/10.1021/ci500130e