iRAISE Datasets
Structure-based computational target prediction methods identify potential protein targets for a bioactive compound. Methods based on protein-ligand docking still face many challenges, the greatest of which is probably ranking the true targets within a large data set of protein structures. Currently, no standard data sets for evaluation exist, which makes it cumbersome to compare methods and to demonstrate improvements. Therefore, we composed two data sets and evaluation strategies for a meaningful evaluation of new target prediction methods: a small data set consisting of three target classes for detailed proof-of-concept and selectivity studies, and a large data set based on the sc-PDB consisting of 7992 protein structures and 72 drug-like ligands from Drugbank, which allows statistical evaluation with performance metrics on a drug-like chemical space. Both data sets are built from openly available resources. The composition of the data sets, the setup of screening experiments, and the evaluation strategy are described in Schomburg and Rarey (2014). Performance metrics capable of measuring the early recognition of enrichment, such as AUC, BEDROC, and NSLR, are proposed. The data sets are used to evaluate our new inverse screening method iRAISE. The small data set reveals the method's capability and limitations in selectively distinguishing between rather similar protein structures. The large data set simulates real target identification scenarios. iRAISE achieves excellent or good enrichment in 55% of the cases, a median AUC of 0.67, and RMSDs below 2.0 Å for 74%, and it ranked the first true target within the top 2% of the protein data set of about 8000 structures in 59 out of 72 cases.
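As an illustration of the ranking-based metrics mentioned above, the following minimal sketch computes a ROC AUC and a BEDROC value from the rank positions of the true targets in a ranked list of protein structures. It is not the evaluation code used by Schomburg and Rarey (2014): the function names, the example ranks, and the choice alpha = 20 are assumptions made for illustration, the BEDROC expression is the commonly used formulation for untied ranks, and NSLR is not covered.

```python
import math


def roc_auc_from_ranks(true_target_ranks, n_structures):
    """ROC AUC from 1-based ranks of true targets (hypothetical helper).

    Assumes a strict ranking without ties.
    """
    ranks = sorted(true_target_ranks)
    n_true = len(ranks)
    n_false = n_structures - n_true
    if n_true == 0 or n_false <= 0:
        raise ValueError("need at least one true and one false target")
    # For the i-th true target (1-based), r - i false targets are ranked ahead.
    false_ahead = sum(r - i for i, r in enumerate(ranks, start=1))
    return 1.0 - false_ahead / (n_true * n_false)


def bedroc_from_ranks(true_target_ranks, n_structures, alpha=20.0):
    """BEDROC from 1-based ranks; alpha = 20 is an assumed default."""
    n_true = len(true_target_ranks)
    if not 0 < n_true < n_structures:
        raise ValueError("need at least one true and one false target")
    ratio = n_true / n_structures
    # Robust initial enhancement: observed exponential weight of the true
    # targets divided by the weight expected for uniformly random ranks.
    observed = sum(
        math.exp(-alpha * r / n_structures) for r in true_target_ranks
    ) / n_true
    expected = (1.0 / n_structures) * (1.0 - math.exp(-alpha)) / (
        math.exp(alpha / n_structures) - 1.0
    )
    rie = observed / expected
    # Rescale so that the value lies between 0 (worst) and 1 (best).
    scale = ratio * math.sinh(alpha / 2.0) / (
        math.cosh(alpha / 2.0) - math.cosh(alpha / 2.0 - alpha * ratio)
    )
    offset = 1.0 / (1.0 - math.exp(alpha * (1.0 - ratio)))
    return rie * scale + offset


# Made-up example: one query ligand whose three true targets were ranked
# 1st, 40th, and 300th among 7992 structures.
example_ranks = [1, 40, 300]
print("AUC:   ", roc_auc_from_ranks(example_ranks, 7992))
print("BEDROC:", bedroc_from_ranks(example_ranks, 7992))
```

AUC weights all rank positions equally, whereas BEDROC weights the top of the list exponentially, so two rankings with similar AUC can differ clearly in BEDROC when the true targets appear early in one of them.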
Download
The supporting information of Schomburg and Rarey (2014) contains the list of PDB codes, the Drugbank ligand IDs, and the list of true positives of the large data set. Both data sets, zipped into one file, can be downloaded here.