Structator

Structator: fast index-based search for RNA sequence-structure patterns

Meyer, F., Kurtz, S., Backofen, R., Will, S., Beckstette, M. (2011). Structator: fast index-based search for RNA sequence-structure patterns. BMC Bioinformatics 12:214.

Background: The secondary structure of RNA molecules is intimately related to their function and is often more conserved than the sequence. Hence, the important task of searching databases for RNAs requires matching sequence-structure patterns. Unfortunately, current tools for this task have, in the best case, a running time that is only linear in the size of sequence databases. Furthermore, established index data structures for fast sequence matching, like suffix trees or arrays, cannot benefit from the complementarity constraints introduced by the secondary structure of RNAs.

Results: We present a novel method and readily applicable software for time-efficient matching of RNA sequence-structure patterns in sequence databases. Our approach is based on affix arrays, a recently introduced index data structure, preprocessed from the target database. Affix arrays support bidirectional pattern search, which is required for efficiently handling the structural constraints of the pattern. Structural patterns like stem loops can be matched inside out, such that the loop region is matched first and then the pairing bases on the boundaries are matched consecutively. This makes it possible to exploit base pairing information for search space reduction and leads to an expected running time that is sublinear in the size of the sequence database. The incorporation of a new chaining approach in the search of RNA sequence-structure patterns enables the description of molecules folding into complex secondary structures with multiple ordered patterns. The chaining approach removes spurious matches from the set of intermediate results, in particular those of patterns with little specificity. In benchmark experiments on the Rfam database, our method runs up to two orders of magnitude faster than previous methods.
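The inside-out matching order described above can be illustrated with a small sketch. It is not Structator's implementation: it scans the text with plain string search instead of an affix array, so it runs in linear rather than sublinear time. The function name `match_stem_loop` and the pairing table `PAIRS` are hypothetical names chosen for this example, and the pairing rules (Watson-Crick plus G-U wobble) are an assumption.

```python
# Allowed base pairs: Watson-Crick plus G-U wobble (assumed rule set for illustration).
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def match_stem_loop(text, loop, stem_len):
    """Return start positions of stem loops with `stem_len` pairs around `loop`.

    The loop is located first; pairing bases are then checked from the
    innermost pair outward, stopping at the first non-complementary pair.
    """
    hits = []
    start = text.find(loop)
    while start != -1:
        left = start - 1            # position just left of the loop
        right = start + len(loop)   # position just right of the loop
        paired = 0
        while (paired < stem_len and left >= 0 and right < len(text)
               and (text[left], text[right]) in PAIRS):
            left -= 1
            right += 1
            paired += 1
        if paired == stem_len:
            hits.append(left + 1)   # leftmost base of the stem
        start = text.find(loop, start + 1)
    return hits
```

For example, `match_stem_loop("UUGGCGAAAGCCAA", "GAAA", 3)` reports the stem loop `GGC GAAA GCC` starting at position 2. A candidate is discarded as soon as one pair fails to be complementary, which is the property an index-based search can exploit to prune the search space early.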

Conclusions: The presented method's sublinear expected running time makes it well suited for RNA sequence-structure pattern matching in large sequence databases. RNA molecules containing several stem loop substructures can be described by multiple sequence-structure patterns and their matches are efficiently handled by a novel chaining method. Beyond our algorithmic contributions, we provide with Structator a complete and robust open-source software solution for index-based search of RNA sequence-structure patterns.
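The idea of chaining matches of multiple ordered patterns can be sketched with a generic quadratic-time dynamic program. This is a textbook-style illustration under stated assumptions, not the chaining algorithm shipped with Structator: the `Match` record, its fields, and the function `best_chain` are hypothetical, and matches are assumed to carry a numeric weight.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    pattern: int   # index of the pattern in the descriptor order
    start: int     # first sequence position of the match
    end: int       # last sequence position of the match (inclusive)
    weight: float  # score contributed by this match


def best_chain(matches: List[Match]) -> List[Match]:
    """Return a highest-weight chain in which pattern order and sequence
    positions both strictly increase and matches do not overlap."""
    # Any valid predecessor ends before its successor starts, so sorting
    # by start position guarantees predecessors are processed first.
    ordered = sorted(matches, key=lambda m: (m.start, m.end))
    n = len(ordered)
    if n == 0:
        return []
    score = [m.weight for m in ordered]   # best chain weight ending at i
    prev = [-1] * n                       # predecessor index for traceback
    for i in range(n):
        for j in range(i):
            a, b = ordered[j], ordered[i]
            if a.pattern < b.pattern and a.end < b.start:
                if score[j] + b.weight > score[i]:
                    score[i] = score[j] + b.weight
                    prev[i] = j
    # Trace back from the best-scoring chain end.
    i = max(range(n), key=score.__getitem__)
    chain = []
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]
```

Given matches of three patterns where some hits appear out of order, only the matches that fit into the best ordered chain are kept. This mirrors how chaining filters out spurious matches of patterns with little specificity.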

Availability

Structator is available under the GNU General Public License Version 3.
Please select a suitable file to download.

Structator1.1-sources.tar.gz: Source code
Structator1.1-linux-gnu.i386.tar.gz: Linux 32-bit version for IA32-based systems
Structator1.1-linux-gnu.amd64.tar.gz: Linux 64-bit version for AMD64/EM64T-based systems
Structator1.1-macOS.tar.gz: Mac OS version

Version history

Version 1.1: includes a new deterministic quadratic-time global chaining algorithm, which can be enabled with the option -allglobal
Version 1.02: contains contributions from: Albrecht, B., Heun, V. (2012). Space Efficient Modifications to Structator - a Fast Index-Based Search Tool for RNA Sequence-Structure Patterns. In 11th International Symposium on Experimental Algorithms
Version 1.01: minor bug fixes
Version 1.0: first release

Developers

Fernando Meyer, meyer(at)zbh.uni-hamburg.de

Application examples

Searching a subset of Rfam release 10.0 for RNA family CTV_rep_sig (Rfam Acc. RF00193) by building high-scoring global chains of matches

Files required
Data: RFAM10_8MB.fa
Secondary structure descriptor (SSD): RF00193.pat
Alphabet: rna.alphab
Watson-Crick and wobble complementarity rules: dna_rna.comp

Sample command for the index construction
./afconstruct RFAM10_8MB.fa -alph rna.alphab -a -s indexname1

Sample command for the search
./afsearch indexname1 -pat RF00193.pat -comp dna_rna.comp -a -global -minlen 5

Searching human chromosome 20 for RNA gene HAR1F (Rfam Acc. RF00635) by building high-scoring local chains of matches

Files required
Data: ftp://ftp.ensembl.org/pub/release-60/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.60.dna.chromosome.20.fa.gz
The file can be unpacked with gunzip.

SSD: RF00635.pat
Alphabet: dna.alphab
Watson-Crick and wobble complementarity rules: dna_rna.comp

Sample command for the index construction
./afconstruct Homo_sapiens.GRCh37.60.dna.chromosome.20.fa -alph dna.alphab -a -s indexname2

Sample command for the search
./afsearch indexname2 -pat RF00635.pat -comp dna_rna.comp -a -local -wf 10 -show


Secondary structure descriptors for 42 highly structured Rfam 10 families

SSDs.tar.gz