SMARTS Dataset
SMiles ARbitrary Target Specification (SMARTS) is a language for specifying chemical patterns, such as substructures, in molecules [1]. To evaluate algorithms that search for chemical patterns in molecules, we present a collection of SMARTS expressions extracted from various literature sources [2-13] and a collection of SMARTS-molecule pairs created from the ZINC database. In addition, a test case comprising a highly symmetric SMARTS-SMILES pair and a subset of the ZINC lead-like database is provided. If you use this set or any subset, please cite:
Ehrlich, H.-C., Rarey, M.: Systematic benchmark of substructure search in molecular graphs - from Ullmann to VF2. J Cheminf 2012, DOI: 10.1186/1758-2946-4-13 (Open Access)
and the original sources accordingly.
SMARTS
The following table includes the literature references and links to the files containing the corresponding SMARTS expressions:
Name (Original first author) | Reference | Download |
Hann et al. | [2] | hann.smarts |
Walters et al. | [3] | walters.smarts |
Olah et al. | [5] | olah.smarts |
Maass et al. | [6] | maass.smarts |
Abolmaali et al. | [4] | abolmaali.smarts |
Degen et al. | [7] | brics.smarts |
Ahmed et al. | [8] | ahmed.smarts |
Daylight | [9] | daylight.smarts |
Agrafiotis et al. | [10] | agrafiotis.smarts |
Enoch et al. | [11] | enoch.smarts |
Baell et al. | [12] | pains.smarts |
Kenny et al. | [13] | kenny.smarts |
Note that the original paper [11] contains patterns in SLN notation. The conversion to SMARTS was performed by R. Guha using Cactvs [14]. For further information on the conversion, see Rajarshi Guha's blog entry http://blog.rguha.net/?p=850.
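As a minimal sketch of how the .smarts files listed above might be read, the following Python snippet parses lines into (pattern, label) pairs. The file layout is an assumption, not something documented here: one SMARTS per line, optionally followed by whitespace and a free-text label. The function name parse_smarts_lines and the example patterns are hypothetical and serve only as illustration. Because '#' is a legal SMARTS character, the sketch does not treat it as a comment marker.

```python
from typing import Iterable, List, Tuple


def parse_smarts_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (pattern, label) pairs from an iterable of text lines.

    Assumed layout (not confirmed by the dataset description):
    one SMARTS per line, optionally followed by whitespace and a label.
    """
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # SMARTS strings contain no whitespace, so the first token is the pattern.
        parts = line.split(None, 1)
        pattern = parts[0]
        label = parts[1].strip() if len(parts) > 1 else ""
        entries.append((pattern, label))
    return entries


# Hypothetical example content, not taken from any of the files above.
example = ["[OX2H]\thydroxyl", "", "c1ccccc1 benzene ring", "[#6]=[#8]"]
entries = parse_smarts_lines(example)
for pattern, label in entries:
    print(pattern, "|", label)
```

To read a downloaded file, pass an open file handle instead of the example list, for instance `parse_smarts_lines(open("hann.smarts"))`, and adjust the parsing if the actual layout differs.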
BENCHMARK SETS
The following sets contain the literature SMARTS files, different versions of a subset of the PAINS [12] SMARTS, sets of SMARTS-SMILES pairs to evaluate the influence of substructure and molecule size on algorithmic runtime, the first 100k molecules from the ZINC lead-like database [15] as of 12 February 2011 to represent a small database, and a phenyl ring/fullerene pair file as a worst-case test for symmetry.
v2.0:
An extension of the benchmark set with the SMARTS published in Kenny P, Montanari C, Prokopczyk I: ClogPalk: a method for predicting alkane/water partition coefficient. Journal of Computer-Aided Molecular Design, Springer Netherlands, 2013, 27, 389-402 [13]. See the changelog for a detailed overview of the changes.
Download v2.0.zip
v1.1:
A minor revision of the benchmark set that adds SMARTS expressions that were missing and corrects ones that were incorrect. See the changelog for a detailed overview of the changes. Thanks to Andrew Dalke, who provided us with many hints for improving the dataset.
Download v1.1.zip
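The phenyl ring/fullerene pair above is described as a worst case for symmetry. The following Python sketch illustrates why symmetry inflates the work of a substructure search: it counts all embeddings of a six-membered ring in small ring systems using plain backtracking. This is an illustrative toy, not Ullmann, VF2, or any implementation from the benchmark; it ignores element types, bond orders, and aromaticity, and all names (make_graph, count_embeddings) are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Graph = Dict[int, Set[int]]


def make_graph(n: int, edges: Iterable[Tuple[int, int]]) -> Graph:
    """Build an undirected adjacency map over atoms 0..n-1."""
    adj: Graph = {i: set() for i in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def count_embeddings(pattern: Graph, target: Graph) -> int:
    """Count injective mappings of pattern atoms onto target atoms
    that preserve every pattern bond (simple backtracking)."""
    order = sorted(pattern)
    mapping: Dict[int, int] = {}
    used: Set[int] = set()

    def extend(k: int) -> int:
        if k == len(order):
            return 1  # all pattern atoms are mapped
        p = order[k]
        total = 0
        for t in target:
            if t in used:
                continue
            # Every already-mapped neighbour of p must map to a neighbour of t.
            if all(mapping[q] in target[t] for q in pattern[p] if q in mapping):
                mapping[p] = t
                used.add(t)
                total += extend(k + 1)
                del mapping[p]
                used.discard(t)
        return total

    return extend(0)


# Six-membered ring (topology only).
ring = make_graph(6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)])
# Two fused six-membered rings sharing the bond 4-5 (naphthalene-like topology).
fused = make_graph(10, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
                        (4, 6), (6, 7), (7, 8), (8, 9), (9, 5)])

print(count_embeddings(ring, ring))   # 12: six rotations times two directions
print(count_embeddings(ring, fused))  # 24: 12 embeddings per ring
```

Each additional symmetric ring in the target multiplies the number of valid embeddings, so a search that enumerates all matches on a highly symmetric target has to visit many equivalent solutions.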
References:
[1] Daylight Theory Manual: http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
[2] Hann M, Hudson B, Lewell X, Lifely R, Miller L, Ramsden N: Strategic pooling of compounds for high-throughput screening. J Chem Inf Comput Sci, 1999, 39(5):897–902. [http://pubs.acs.org/doi/abs/10.1021/ci990423o]
[3] Walters W, Murcko MA: Prediction of ‘drug-likeness’. Adv Drug Delivery Rev, 2002, 54(3):255–271. [http://www.sciencedirect.com/science/article/pii/S0169409X02000030]. [Computational Methods for the Prediction of ADME and Toxicity]
[4] Abolmaali SFB, Wegner JK, Zell A: The compressed feature matrix - a fast method for feature based substructure search. J Mol Model, 2003, 9:235–241. DOI:10.1007/s00894-003-0126-0
[5] Olah M, Bologa C, Oprea TI: An automated PLS search for biologically relevant QSAR descriptors. J Comput Aided Mol Des, 2004, 18:437–449. DOI:10.1007/s10822-004-4060-8
[6] Maass P, Schulz-Gasch T, Stahl M, Rarey M: Recore: a fast and versatile method for scaffold hopping based on small molecule crystal structure conformations. J Chem Inf Model, 2007, 47(2):390–399. [http://pubs.acs.org/doi/abs/10.1021/ci060094h]. [PMID: 17305328]
[7] Degen J, Wegscheid-Gerlach C, Zaliani A, Rarey M: On the art of compiling and using ‘drug-like’ chemical fragment spaces. ChemMedChem, 2008, 3:1503–1507. DOI:10.1002/cmdc.200800178
[8] Ahmed HEA, Vogt M, Bajorath J: Design and evaluation of bonded atom pair descriptors. J Chem Inf Model 2010, 50:487-499. DOI:10.1021/ci900512g
[9] Daylight SMARTS examples; Daylight Chemical Information Systems, Inc. Laguna Niguel, CA; http://www.daylight.com/dayhtml_tutorials/languages/smarts/smarts_examples.html. Accessed May 25, 2010.
[10] Agrafiotis DK, Gibbs AC, Zhu F, Izrailev S, Martin E: Conformational sampling of bioactive molecules: a comparative study. J Chem Inf Model, 2007, 47(3):1067–1086. [http://pubs.acs.org/doi/abs/10.1021/ci6005454].
[11] Enoch SJ, Madden JC, Cronin MTD: Identification of mechanisms of toxic action for skin sensitisation using a SMARTS pattern based approach. SAR QSAR Environ Res, 2008, 19(5-6):555–578.
[12] Baell JB, Holloway GA: New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays. J Med Chem, 2010, 53(7):2719–2740. DOI:10.1021/jm901137j
[13] Kenny P, Montanari C, Prokopczyk I: ClogPalk: a method for predicting alkane/water partition coefficient. J Comput Aided Mol Des, 2013, 27:389–402. DOI:10.1007/s10822-013-9655-5
[14] Ihlenfeldt WD, Takahashi Y, Abe H, Sasaki S: Computation and management of chemical properties in CACTVS: An extensible networked approach toward modularity and compatibility. J Chem Inf Comput Sci 1994, 34:109–116. DOI:10.1021/ci00017a013
[15] ZINC Database: http://zinc.docking.org