SMARTS Dataset
SMILES ARbitrary Target Specification (SMARTS) is a language for formulating chemical patterns, such as substructures in molecules.[1] To evaluate algorithms that search for chemical patterns in molecules, we present a collection of SMARTS expressions extracted from various literature sources[2-13] and a collection of SMARTS-molecule pairs created from the ZINC database. We also provide a test case consisting of a highly symmetric SMARTS-SMILES pair, as well as a subset of the ZINC Lead-Like database.
If you use this set or any subset, please cite:
- Ehrlich, H. C.; Rarey, M. Systematic Benchmark of Substructure Search in Molecular Graphs - From Ullmann to VF2. J Cheminform 2012, 4 (1), 13. DOI: https://doi.org/10.1186/1758-2946-4-13
- and the original reference(s)
SMARTS Files
In the following, the literature references and download links to the files containing the corresponding SMARTS expressions are listed.
- Hann et al. [2]
  hann.smarts (https://www.zbh.uni-hamburg.de/forschung/amd/datasets/smarts-dataset/hann-smarts-tar.gz)
- Walters et al. [3]
  walters.smarts (https://www.zbh.uni-hamburg.de/forschung/amd/datasets/smarts-dataset/walters-smarts-tar.gz)
- Olah et al. [5]
  olah.smarts (https://www.zbh.uni-hamburg.de/forschung/amd/datasets/smarts-dataset/olah-smarts-tar.gz)
- Maass et al. [6]
  maass.smarts (https://www.zbh.uni-hamburg.de/forschung/amd/datasets/smarts-dataset/maass-smarts-tar.gz)
- Abolmaali et al. [4]
  abolmaali.smarts (https://www.zbh.uni-hamburg.de/forschung/amd/datasets/smarts-dataset/abolmaali-smarts-tar.gz)
- Degen et al. [7]
  brics.smarts (https://www.zbh.uni-hamburg.de/forschung/amd/datasets/smarts-dataset/brics-smarts-tar.gz)
- Ahmed et al. [8]
  ahmed.smarts (https://www.zbh.uni-hamburg.de/forschung/amd/datasets/smarts-dataset/ahmed-smarts-tar.gz)
- Daylight [9]
  daylight.smarts (https://www.zbh.uni-hamburg.de/forschung/amd/datasets/smarts-dataset/daylight-smarts-tar.gz)
- Agrafiotis et al. [10]
  agrafiotis.smarts (https://www.zbh.uni-hamburg.de/forschung/amd/datasets/smarts-dataset/agrafiotis-smarts-tar.gz)
- Enoch et al. [11]
  enoch.smarts (https://www.zbh.uni-hamburg.de/forschung/amd/datasets/smarts-dataset/enoch-smarts-tar.gz)
- Baell et al. [12]
  pains.smarts (https://www.zbh.uni-hamburg.de/forschung/amd/datasets/smarts-dataset/pains-smarts-tar.gz)
- Kenny et al. [13]
  kenny.smarts (https://www.zbh.uni-hamburg.de/forschung/amd/datasets/smarts-dataset/kenny-smarts-tar.gz)
Note that the original publication by Baell and colleagues[12] contains patterns in SLN notation. A conversion into SMARTS was performed by Rajarshi Guha using CACTVS.[14]
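Once unpacked, the SMARTS files listed above could be read with something like the following minimal sketch. The internal layout of the files is not described on this page, so the layout assumed here (one pattern per line, optionally followed by whitespace and a free-text label) is an assumption, and the helper name parse_smarts_lines is hypothetical.

```python
from typing import Iterable, List, Tuple


def parse_smarts_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (pattern, label) pairs; label is '' when a line has none.

    ASSUMPTION: one SMARTS pattern per line, optionally followed by
    whitespace and a label. Blank lines are skipped.
    """
    entries: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # SMARTS patterns contain no whitespace, so the first
        # whitespace run separates the pattern from the label.
        parts = line.split(None, 1)
        pattern = parts[0]
        label = parts[1].strip() if len(parts) > 1 else ""
        entries.append((pattern, label))
    return entries


# Illustrative input only; these lines are not taken from the dataset.
sample = ["[#6]=O carbonyl", "", "c1ccccc1\tbenzene ring", "[NX3;H2]"]
parsed = parse_smarts_lines(sample)
```

For a real file, the same function can be fed an open file handle, for example parse_smarts_lines(open("hann.smarts")), provided the file actually follows the assumed layout.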
Benchmark Sets
The following sets contain the literature-derived SMARTS files; various versions of a subset of the PAINS[12] SMARTS patterns; sets of SMARTS-SMILES pairs for evaluating the influence of substructure and molecule size on algorithmic runtime; the first 100,000 molecules from the ZINC Lead-Like database[15] as of 12 February 2011, representing a small database; and a phenyl ring-fullerene pair file as a worst-case test for symmetry handling.
- Version 2.0:
  An extension of the benchmark set with the SMARTS expressions published by Kenny et al.[13] See the changelog (https://www.zbh.uni-hamburg.de/forschung/amd/datasets/smarts-dataset/changelog-2-0.docx) for a detailed overview of the changes.
  Download v2.0.zip at https://www.zbh.uni-hamburg.de/forschung/amd/datasets/smarts-dataset/v2-0.zip
- Version 1.1:
  A minor revision of the benchmark set that adds missing SMARTS expressions and corrects incorrect ones. See the changelog (https://www.zbh.uni-hamburg.de/forschung/amd/datasets/smarts-dataset/changelog-v1-1.docx) for a detailed overview of the changes. We thank Andrew Dalke, who provided us with many hints for improving the dataset.
  Download v1.1.zip at https://www.zbh.uni-hamburg.de/forschung/amd/datasets/smarts-dataset/v1-1.zip
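To illustrate why a highly symmetric pair such as the phenyl ring-fullerene case stresses a substructure search, the sketch below counts how many ways a six-membered ring maps onto small ring systems. It is a plain backtracking matcher on unlabeled graphs, written for illustration only; it is not the Ullmann or VF2 implementation evaluated in the benchmark, it ignores atom and bond types, and the names cycle_graph and count_embeddings are hypothetical.

```python
from typing import Dict, List

Graph = Dict[int, List[int]]  # adjacency list: node -> neighbours


def cycle_graph(n: int) -> Graph:
    """Return an n-membered ring as an adjacency list."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def count_embeddings(pattern: Graph, target: Graph) -> int:
    """Count injective node mappings that preserve every pattern edge."""
    order = list(pattern)
    mapping: Dict[int, int] = {}
    used = set()

    def extend(depth: int) -> int:
        if depth == len(order):
            return 1  # every pattern node has been placed
        p = order[depth]
        total = 0
        for t in target:
            if t in used:
                continue
            # Each already-mapped neighbour of p must land on a neighbour of t.
            if all(mapping[q] in target[t] for q in pattern[p] if q in mapping):
                mapping[p] = t
                used.add(t)
                total += extend(depth + 1)
                del mapping[p]
                used.discard(t)
        return total

    return extend(0)


ring6 = cycle_graph(6)

# Two fused six-membered rings (naphthalene-like skeleton): a 10-membered
# perimeter plus one extra bond between atoms 4 and 9.
fused = cycle_graph(10)
fused[4].append(9)
fused[9].append(4)

print(count_embeddings(ring6, ring6))  # 12: 6 rotations x 2 reflections
print(count_embeddings(ring6, fused))  # 24: 12 mappings per ring, 2 rings
```

Every ring in the target contributes 12 equivalent mappings, so a matcher that enumerates all mappings without symmetry handling does a large amount of redundant work on targets built from many rings.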
[1] Daylight Theory Manual. https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
[2] Hann, M.; Hudson, B.; Lewell, X.; Lifely, R.; Miller, L.; Ramsden, N. Strategic Pooling of Compounds for High-Throughput Screening. J Chem Inf Comput Sci 1999, 39 (5), 897-902. DOI: https://doi.org/10.1021/ci990423o
[3] Walters, W. P.; Murcko, M. A. Prediction of 'Drug-Likeness'. Adv Drug Deliv Rev 2002, 54 (3), 255-271. DOI: https://doi.org/10.1016/s0169-409x(02)00003-0
[4] Abolmaali, S. F.; Wegner, J. K.; Zell, A. The Compressed Feature Matrix - A Fast Method for Feature Based Substructure Search. J Mol Model 2003, 9 (4), 235-241. DOI: https://doi.org/10.1007/s00894-003-0126-0
[5] Olah, M.; Bologa, C.; Oprea, T. I. An Automated PLS Search for Biologically Relevant QSAR Descriptors. J Comput Aided Mol Des 2004, 18 (7-9), 437-449. DOI: https://doi.org/10.1007/s10822-004-4060-8
[6] Maass, P.; Schulz-Gasch, T.; Stahl, M.; Rarey, M. Recore: A Fast and Versatile Method for Scaffold Hopping Based on Small Molecule Crystal Structure Conformations. J Chem Inf Model 2007, 47 (2), 390-399. DOI: https://doi.org/10.1021/ci060094h
[7] Degen, J.; Wegscheid-Gerlach, C.; Zaliani, A.; Rarey, M. On the Art of Compiling and Using 'Drug-Like' Chemical Fragment Spaces. ChemMedChem 2008, 3 (10), 1503-1507. DOI: https://doi.org/10.1002/cmdc.200800178
[8] Ahmed, H. E.; Vogt, M.; Bajorath, J. Design and Evaluation of Bonded Atom Pair Descriptors. J Chem Inf Model 2010, 50 (4), 487-499. DOI: https://doi.org/10.1021/ci900512g
[9] Daylight SMARTS Examples. https://www.daylight.com/dayhtml_tutorials/languages/smarts/smarts_examples.html (last accessed May 25, 2010)
[10] Agrafiotis, D. K.; Gibbs, A. C.; Zhu, F.; Izrailev, S.; Martin, E. Conformational Sampling of Bioactive Molecules: A Comparative Study. J Chem Inf Model 2007, 47 (3), 1067-1086. DOI: https://doi.org/10.1021/ci6005454
[11] Enoch, S. J.; Madden, J. C.; Cronin, M. T. Identification of Mechanisms of Toxic Action for Skin Sensitisation Using a SMARTS Pattern Based Approach. SAR QSAR Environ Res 2008, 19 (5-6), 555-578. DOI: https://doi.org/10.1080/10629360802348985
[12] Baell, J. B.; Holloway, G. A. New Substructure Filters for Removal of Pan Assay Interference Compounds (PAINS) from Screening Libraries and for Their Exclusion in Bioassays. J Med Chem 2010, 53 (7), 2719-2740. DOI: https://doi.org/10.1021/jm901137j
[13] Kenny, P. W.; Montanari, C. A.; Prokopczyk, I. M. ClogP(alk): A Method for Predicting Alkane/Water Partition Coefficient. J Comput Aided Mol Des 2013, 27 (5), 389-402. DOI: https://doi.org/10.1007/s10822-013-9655-5
[14] Ihlenfeldt, W. D.; Takahashi, Y.; Abe, H.; Sasaki, S. Computation and Management of Chemical Properties in CACTVS: An Extensible Networked Approach toward Modularity and Compatibility. J Chem Inf Comput Sci 1994, 34 (1), 109-116. DOI: https://doi.org/10.1021/ci00017a013
[15] ZINC Database. https://zinc.docking.org/