Fast index based algorithms for matching position specific scoring matrices


In biological sequence analysis, position specific scoring matrices (PSSMs) are widely used to represent sequence motifs. We present a new non-heuristic algorithm, called ESAsearch, to efficiently find matches of such matrices in large databases. Our approach preprocesses the search space, e.g. a complete genome or a set of protein sequences, and builds an enhanced suffix array which is stored on file. The enhanced suffix array only requires 9 bytes per input symbol, and allows to search a database with a PSSM in sublinear expected time. Since ESAsearch benefits from small alphabets, we present a variant operating on sequences recoded according to a reduced alphabet. We also address the problem of non-comparable PSSM-scores by developing a method which allows to efficiently compute a matrix similarity threshold for a PSSM, given an E-value or a P-value. Our method is based on dynamic programming. In contrast to other methods it employs lazy evaluation of the dynamic programming matrix: it only evaluates those matrix entries that are necessary to derive the sought similarity threshold.

The PoSSuMsearch program, included in the PoSSuM software distribution implements the algorithms Simplesearch, Lookaheadsearch, ESAsearch and Lazydistrib. A user can search for PSSMs in enhanced suffix arrays built by mkvtree from the VMATCH package (also included in the PoSSuM software distribution), as well as on plain sequence data in Fasta, GENBANK, EMBL, or SPROT format. The search algorithm can be chosen from the command line.

PSSMs are specified in a simple plain text format, where one file may contain multiple PSSMs. The alphabet a PSSM refers to, and alphabet character to PSSM column assignments can be specified on a per-PSSM basis for most flexible alphabet support. All implemented algorithms support alphabet transformations. PSSMs can contain integer as well as floating point scores. To prevent rounding errors for integer based PSSMs, PoSSuMsearch uses integer arithmetics for these, resulting in an additional speedup on most CPU architectures. Searching on the reverse strand of nucleotide sequences is implemented by PSSM transformation according to Watson-Crick base pairing. Hence it is sufficient to build the enhanced suffix array for one strand only. This can then be used to search both strands.

The cutoff can be specified as p-value, E-value, MSS (matrix similarity score), or raw score threshold. If only the best matches with the highest scores need to be known, then PoSSuMsearch can be asked to report only the k highest scoring matches without even specifying an explicit cutoff. To do so, the search algorithms dynamically adapt the threshold during the search. When using p- or E-values, the score threshold is determined by either the lazy dynamic programming algorithm introduced in this contribution, or read from file that stores the complete precalculated probability distribution. Background distributions can be specified arbitrarily by the user, or determined from a given sequence database. We provide a tool, PoSSuMdist, to generate a compressed file containing the complete precalculated probability distributions for a set of PSSMs.

PSSM matches can be sorted by specifying a list of sort keys, like p-value, match score, sequence number, and so on. The output formats of PoSSuMsearch print out all available information about a match, either in a human readable format, tab delimited, or in machine readable, XML-based CisML. PoSSuMsearch as well as PoSSuMdist support multi-threading for a further reduction of running time on multi CPU machines.


The PoSSuM software package, including the program PoSSuMsearch implementing the ESAsearch algorithm, is available free of charge for non-commercial research institutions. Software download.

If you use PoSSuM in your own research, please cite:

Michael Beckstette, Robert Homann, Robert Giegerich, Stefan Kurtz
Fast index based algorithms for matching position specific scoring matrices
in BMC Bioinformatics 7:389, 2006
Download Publication

Michael Beckstette, Dirk Strothmann, Robert homann, Robert Giegerich, Stefan Kurtz
PoSSuMsearch: Fast and Sensitive Matching of Position Specific Scoring Matrices using Enhanced Suffix Arrays
in Proceedings of the German Conference on Bioinformatics 2004
GI Lecture Notes in Informatics,vol.: P-53, pages 53-64, ISBN 3-88579-382-2
Download Publication