Tallymer
Tallymer is a collection of flexible and memory-efficient programs for k-mer counting and indexing of large sequence sets.
Background
Unlike previous methods, Tallymer is based on enhanced suffix arrays, which gives much greater flexibility in the choice of the k-mer size. Tallymer can process large data sets of several billion bases. We used it in a variety of applications to study the genomes of maize and other plant species. In particular, Tallymer was used to index a set of whole-genome shotgun sequences from maize (B73) with a total size of 10^9 bp.
Tallymer was effective in a variety of applications to aid genome annotation in maize, despite limitations imposed by the relatively low coverage of sequence available.
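To illustrate why a suffix-array index leaves the choice of k open, here is a minimal Python sketch. It is not Tallymer's implementation: the function names suffix_array and count_kmers are hypothetical, the suffix array is built naively by sorting, and no LCP table is used. It relies only on the fact that suffixes sharing their first k characters are adjacent in a sorted suffix array, so one index can be scanned for any k.

```python
def suffix_array(seq):
    """Return the start positions of all suffixes of seq in lexicographic order.

    Naive construction by sorting, for illustration only; it does not scale
    to genome-sized input.
    """
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def count_kmers(seq, sa, k):
    """Count every k-mer of seq with a single scan over the suffix array sa.

    Suffixes that share their first k characters form one contiguous run in
    the suffix array, so the length of each run is the k-mer's occurrence count.
    """
    counts = {}
    current, run = None, 0
    for pos in sa:
        if pos + k > len(seq):
            continue  # this suffix is shorter than k and contains no k-mer
        kmer = seq[pos:pos + k]
        if kmer == current:
            run += 1
        else:
            if current is not None:
                counts[current] = run
            current, run = kmer, 1
    if current is not None:
        counts[current] = run
    return counts


seq = "ACGTACGTAC"
sa = suffix_array(seq)        # built once
print(count_kmers(seq, sa, 2))  # {'AC': 3, 'CG': 2, 'GT': 2, 'TA': 2}
print(count_kmers(seq, sa, 4))  # {'ACGT': 2, 'CGTA': 2, 'GTAC': 2, 'TACG': 1}
```

The same array sa serves both calls; only the scan depends on k, whereas a hash-based counter would have to be rebuilt for each k-mer size.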
A manual can be found here.
Availability
Tallymer is available as part of the GenomeTools software (version 1.2.2 and higher).
The Perl scripts for post-processing the Tallymer output, developed by Apurva Narechania (apurva(at)cshl.org), are available here:
Script for analyzing copy numbers
Script for counting repeats
Dan Bolser (dan.bolser(at)gmail.com) has developed a Tallymer-based pipeline to annotate repeats in a FASTA database.
The scripts comprising the pipeline can be found here:
http://github.com/dbolser/PGSC/tree/master/kmer-filter
Developers
The Tallymer software was written by Stefan Kurtz (kurtz(at)zbh.uni-hamburg.de). The Perl scripts above were developed by Apurva Narechania and Dan Bolser.
Publication
S. Kurtz, A. Narechania, J.C. Stein and D. Ware:
A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes
BMC Genomics, 9:517 (2008)
Contact
Stefan Kurtz
Center for Bioinformatics, University of Hamburg
Bundesstr. 43, 20146 Hamburg, Germany
Phone +49 40 42838 7311, Fax. +49 40 42838 7312