Deep-Sea Protein Structure Dataset

Predicting molecular adaptations of proteins is a key challenge in protein engineering. In particular, proteins from extremophile organisms often exhibit desirable properties, such as tolerance to extremely high temperature and/or pressure. A promising source of such proteins is the deep sea, the largest extreme environment on Earth. In recent years, large-scale metagenomic projects have made a growing amount of protein data from this environment available. Not surprisingly, there is considerable interest in systematically analyzing the currently available data.

We compiled a dataset of 1281 experimental protein structures from 25 deep-sea organisms available in the Protein Data Bank (PDB) and paired them with orthologous proteins, referred to here as decoys. This dataset is one of the first to provide protein structure pairs for building data-driven methods and for analyzing structural adaptations of proteins to the extreme environmental conditions of the deep sea. We thoroughly removed redundancy and split the dataset into cross-validation folds for easy use in machine learning. We also annotated the protein pairs with the environmental preferences of the deep-sea and decoy source organisms. In this way, proteins from thermophilic, mesophilic, and piezophilic organisms can be compared directly. The final dataset contains 501 deep-sea protein chains and 8200 decoy protein chains from 20 different deep-sea organisms and 1379 decoy organisms, forming 17,148 chain pairs. Further details and a machine learning-based analysis of the dataset can be found in the corresponding publication.[1]
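To illustrate why redundancy matters when building cross-validation folds, the following is a minimal Python sketch of a group-aware split, in which all chain pairs sharing a group label end up in the same fold. It is not the procedure used to build the published folds; the function name `assign_group_folds`, the group labels, and the greedy balancing strategy are assumptions made for illustration only.

```python
from collections import defaultdict


def assign_group_folds(pair_groups, n_folds=5):
    """Assign every group to exactly one fold, balancing pairs per fold.

    pair_groups: one (hypothetical) group label per chain pair, e.g. a
    redundancy cluster identifier. Returns a list of fold indices that is
    aligned with pair_groups.
    """
    if n_folds < 1:
        raise ValueError("n_folds must be at least 1")

    # Count how many chain pairs each group contributes.
    group_sizes = defaultdict(int)
    for group in pair_groups:
        group_sizes[group] += 1

    fold_sizes = [0] * n_folds
    group_to_fold = {}
    # Greedy balancing: place the largest groups first, always into the
    # fold that currently holds the fewest pairs.
    ordered = sorted(group_sizes.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for group, size in ordered:
        target = min(range(n_folds), key=lambda i: fold_sizes[i])
        group_to_fold[group] = target
        fold_sizes[target] += size

    return [group_to_fold[group] for group in pair_groups]


# Toy example with made-up group labels (not taken from the dataset).
toy_groups = ["c1", "c1", "c2", "c3", "c3", "c3", "c4"]
toy_folds = assign_group_folds(toy_groups, n_folds=3)
```

Keeping related pairs in one fold prevents near-identical proteins from appearing in both the training and the test split, which would otherwise inflate performance estimates. The archive already ships with predefined folds, so a split like this is only relevant when experimenting with alternative partitions.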

The dataset can be downloaded at https://www.zbh.uni-hamburg.de/forschung/amd/datasets/deep-sea-protein-structure/deep-sea-proteins-1.zip.
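
As a starting point, the archive can be fetched and inspected with the Python standard library. The sketch below is a hedged example: the helper names `download_archive` and `summarize_archive` and the local file name are assumptions, and because the internal layout of the archive is not described here, the code only lists member names and sizes rather than parsing any files.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL as given on this page.
DATASET_URL = (
    "https://www.zbh.uni-hamburg.de/forschung/amd/datasets/"
    "deep-sea-protein-structure/deep-sea-proteins-1.zip"
)


def download_archive(url=DATASET_URL, target="deep-sea-proteins-1.zip"):
    """Download the archive unless a file already exists at the target path."""
    target = Path(target)
    if not target.exists():
        urllib.request.urlretrieve(url, str(target))
    return target


def summarize_archive(zip_path, limit=20):
    """Return (member name, uncompressed size in bytes) for the first entries."""
    with zipfile.ZipFile(zip_path) as archive:
        infos = archive.infolist()
        return [(info.filename, info.file_size) for info in infos[:limit]]


def main():
    """Download the dataset and print a short listing of its contents."""
    path = download_archive()
    for name, size in summarize_archive(path):
        print(f"{size:>12}  {name}")
```

Calling `main()` downloads the archive into the current working directory on the first run and reuses the local copy afterwards.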

[1] Sieg, J.; Sandmeier, C. C.; Lieske, J.; Meents, A.; Lemmen, C.; Streit, W. R.; Rarey, M. Analyzing Structural Features of Proteins from Deep-Sea Organisms. Proteins 2022, 90 (8), 1521-1537. DOI: https://doi.org/10.1002/prot.26337