Interpretation
MetaGenomeThreader
Result Interpretation (short DNA sequences)
The test results for short DNA sequences demonstrate that the MetaGenomeThreader works as intended (cf. the interpretation for long DNA sequences). Tab. 2.4 shows which coding DNA sequences of the test data set were identified (column 5); column 4 marks regions that were only partially covered. Column 3 gives the boundaries within which ReadSim simulated the DNA sequences. All coding DNA sequence regions were covered by simulated DNA sequences. In the end, 2 of 11 coding DNA sequences (18%) could not be identified even though simulated DNA sequences existed for them.
| species (cf. Tab. 1.1) | coding DNA sequence regions | sequence coverage (from - to) | partial | identified |
|---|---|---|---|---|
| Candidatus Pelagibacter | -3 891...-4 832 | +4 031...+6 955 | x | yes |
| | -4 835...-6 247 | | | yes |
| | -6 249...-6 434 | | | no |
| | -6 444...-7 310 | | x | yes |
| Vibrio cholerae | +794 241...+795 817 | +795 003...+797 940 | x | yes |
| | +795 485...+795 817 | | | yes |
| | +795 839...+797 692 | | | no |
| | +797 707...+798 654 | | x | yes |
| Pyrococcus horikoshii | -171 622...-172 662 | +172 001...+173 993 | x | yes |
| | -172 610...-173 815 | | | yes |
| | +173 822...+174 907 | | x | no |
Tab. 2.4: Test data set (column 2), sequence coverage (column 3) and the result (column 5) of the MetaGenomeThreader PCS identification (cf. the summary of the test results for long DNA sequences).
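The partial marks in Tab. 2.4 can be reproduced from the coordinates alone. The following is a minimal sketch, assuming that a region counts as partially covered when it overlaps the simulated interval without lying entirely inside it. The function name `coverage_status` is hypothetical and not part of the MetaGenomeThreader; strand signs are ignored because only the extent of each interval matters here.

```python
def coverage_status(region, covered):
    """Classify a coding region against one simulated coverage interval.

    Both arguments are (start, end) tuples. Strand signs are dropped
    (assumption: only the absolute positions matter for coverage).
    """
    r_start, r_end = sorted(abs(p) for p in region)
    c_start, c_end = sorted(abs(p) for p in covered)
    if r_end < c_start or r_start > c_end:
        return "none"      # no overlap at all
    if c_start <= r_start and r_end <= c_end:
        return "full"      # region lies entirely inside the interval
    return "partial"       # region sticks out on at least one side


# Candidatus Pelagibacter values taken from Tab. 2.4
covered = (4031, 6955)
regions = [(-3891, -4832), (-4835, -6247), (-6249, -6434), (-6444, -7310)]
statuses = [coverage_status(r, covered) for r in regions]
print(statuses)  # ['partial', 'full', 'full', 'partial']
```

Under this assumption the first and last region of each species come out as partial, which matches the two marks per species in column 4.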
The quality of the PCS prediction is very good: 63 of 81 PCSs could be assigned to the correct coding DNA sequence (78% of the identified PCSs; cf. Tab. 2.5). The first column of Tab. 2.5 gives the target protein name, and the second column gives the protein name returned by a BLAST of the identified PCSs against the Swissprot and TrEMBL databases. Only the first (best) hit of each BLAST result is given in column 2. Almost all protein sequences were identified correctly; the exception is the YajC protein of Vibrio cholerae, for which only a hypothetical protein was found, although this hit may still correspond to the YajC protein searched for. Column 3 gives the number of different PCSs in the MetaGenomeThreader result that can be assigned to the target protein.
The BLAST hits did not always span the whole sequence of the identified PCS. Columns 5 and 6 of Tab. 2.5 therefore give the length of the BLAST hit protein sequence relative to the protein sequence of the identified PCS, and the sequence identity between the protein sub-sequence of the PCS and the BLAST hit protein sequence. To evaluate the quality of the BLAST hits, column 4 gives the number of PCSs within each percentage interval. For the SecD protein of Vibrio cholerae, one hit showed no acceptable similarity to the target sequence. This corresponds to 1 of the 29 protein sequences of the SecD BLAST hits (3%) and 1 of 63 PCSs (1.6%) for which the target protein could not be recovered.
Conclusion: Even when the target protein sequences are only sparsely covered by PCS protein sequences, the target proteins can be detected with high precision by a BLAST of the PCS protein sequences against a protein database.
| target protein | protein of the BLAST hits (Swissprot / TrEMBL-DB) | number of PCS's | DB subsequence length | sequence identity | protein sequence identity |
|---|---|---|---|---|---|
| tRNA isopentenyltransferase | tRNA delta(2)-isopentenyl-pyrophosphate transferase | 1 | 100% | 1 x 60-69% | 100% |
| probable periplasmic serine protease DO-like precursor | probable periplasmic serine protease DO-like | 2 | 2 x 80-89% | 100% | 100% |
| probable integral membrane proteinase | probable integral membrane proteinase | 2 | 1 x 70-79%; 1 x 80-89% | 1 x 90-99%; 100% | 100% |
| queuine tRNA-ribosyltransferase | queuine tRNA-ribosyltransferase | 3 | 1 x 40-49%; 2 x 50-59% | 1 x 70-79%; 2 x 90-99% | 63%; 81%; 100% |
| preprotein translocase subunit YajC | putative uncharacterized protein | 3 | 2 x 50-59%; 1 x 90-99% | 1 x 70-79%; 2 x 100% | 3 x 100% |
| protein export protein SecD | protein-export membrane protein SecD (putative uncharacterized protein) | 29 | 1 x 30-39%; 2 x 40-49%; 1 x 50-59%; 4 x 60-69%; 7 x 70-79%; 4 x 80-89%; 7 x 90-99%; 1 x 100%; 2 x > 100% | (56%); 1 x 50-59%; 4 x 70-79%; 6 x 80-89%; 5 x 90-99%; 13 x 100% | (32%); 89%; 27 x 100% |
| protein export protein SecF | protein-export membrane protein SecF | 1 | 97% | 100% | 100% |
| DNA primase small subunit | DNA primase small subunit | 12 | 1 x 40-49%; 2 x 50-59%; 1 x 80-89%; 1 x 90-99%; 7 x 100% | 1 x 70-79%; 1 x 80-89%; 3 x 90-99%; 7 x 100% | 100% |
| DNA primase large subunit | DNA primase large subunit | 10 | 1 x 40-49%; 1 x 50-59%; 2 x 90-99%; 5 x 100%; 1 x > 100% | 2 x 70-79%; 8 x 100% | 100% |
Tab. 2.5: Results of a BLAST of the identified PCS's against a protein database
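The interval notation used in Tab. 2.5 (for example "2 x 50-59%") can be produced by grouping per-PCS percentages into 10% bins. The sketch below illustrates this under stated assumptions: the helper names `interval_label` and `bin_percentages` are hypothetical, the input values are made up for illustration and are not measurements from the test, and exactly 100% and values above 100% are given their own labels as in the table.

```python
from collections import Counter


def interval_label(percent):
    """Map a percentage to the interval label style used in Tab. 2.5."""
    if percent > 100:
        return "> 100%"
    if percent == 100:
        return "100%"
    low = int(percent // 10) * 10
    return f"{low}-{low + 9}%"


def bin_percentages(values):
    """Count values per interval; labels appear in order of first occurrence."""
    counts = Counter(interval_label(v) for v in values)
    return [f"{n} x {label}" for label, n in counts.items()]


# Hypothetical per-PCS values, chosen only to illustrate the notation
example = [45.0, 52.5, 58.0, 100, 104.2]
print(bin_percentages(example))
# ['1 x 40-49%', '2 x 50-59%', '1 x 100%', '1 x > 100%']
```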
The 18 falsely predicted PCSs (22% of the identified PCSs) result from non-optimal detection of the correct reading frame (cf. Tab. 2.6). In 10 cases (56%), two or more PCSs of identical length existed, and the wrong reading frame was chosen because of the order in which reading frames are selected (+3, +2, +1, -1, -2, -3). In the remaining 8 cases (44%), the correct reading frame could not be predicted because the DNA sequence contained one or more stop codons in the correct reading frame while another frame offered a PCS of equal or greater length without a stop codon. Longer PCSs without stop codons are scored higher, so the wrong PCS was predicted.
The first example below shows a PCS that was falsely predicted because of the order of reading frame detection; the second shows a PCS that was falsely predicted because of stop codons in the correct reading frame of the DNA sequence.
PFLSHRSTLTIFGIASIKSLIDPNPFSSSLAFSL - output PCS
SEKAKELLKGFGSINDFMDAIPKIVSVERCDKK - reading frame: +1
SEKAKELLKGFGSINDFMDAIPKIVSVERCDKK - reading frame: -2
SEKAKELLKGFGSINDFMDAIPKIVSV----ER
SEKAKELLKGFGSINDFMDAIPKIVSVDDVIER - Pyrococcus horikoshii; DNA primase large subunit
MFFFENADILLPPSLIERNVHLWATIFVGAL - output PCS
MFFFENADILLPPSLIERNVHLWATIFVGAL - reading frame: -3
QSADEYGRPQVNISLD*RRRQQDVSVLEKE - reading frame: +2 and stop codon
SADEYGRPQVNISLD - Vibrio cholerae; Protein-export membrane protein SecD
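Both error types can be reproduced with a simplified frame selection. The sketch below is not the MetaGenomeThreader implementation: it assumes that the score of a frame is just the length of its longest stop-free stretch, that frames are examined in the order +3, +2, +1, -1, -2, -3 named in the text, and that ties go to the frame examined first. The function names and the frame numbering (+1 is offset 0 on the given strand, -1 is offset 0 on the reverse complement) are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

FRAME_ORDER = (3, 2, 1, -1, -2, -3)  # selection order quoted in the text


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


def pick_pcs(dna):
    """Return (frame, peptide) for the longest stop-free stretch.

    Simplified scoring: length only. The strict '>' means that on a tie
    the frame examined earlier in FRAME_ORDER is kept.
    """
    frames = six_frames(dna)
    best_frame, best_pcs = None, ""
    for frame in FRAME_ORDER:
        longest = max(frames[frame].split("*"), key=len)
        if len(longest) > len(best_pcs):
            best_frame, best_pcs = frame, longest
    return best_frame, best_pcs


dna = "GCTGCTGCTGCTG"
print(six_frames(dna)[1])  # AAAA
print(pick_pcs(dna))       # (2, 'LLLL')
```

In this toy input, frames +1 and +2 both yield four residues. Because +2 is examined before +1, it wins the tie even if +1 were the true coding frame, which mirrors the order-of-detection error. A stop codon in the true frame splits its translation into shorter stretches, so a longer stop-free stretch in another frame wins under the same length-based scoring.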
| target protein | not correct PCS | type of error of the reading frame detection | result of the correct PCS's |
|---|---|---|---|
| probable periplasmic serine protease DO-like precursor | 2 | 1 x order of detection; 1 x stop codon | probable periplasmic serine protease DO-like |
| queuine tRNA-ribosyltransferase | 1 | 1 x order of detection | queuine tRNA-ribosyltransferase |
| protein export protein SecD | 9 | 5 x order of detection; 4 x stop codon | protein export protein SecD (1 x not-identical protein sequence) |
| DNA primase small subunit | 1 | 1 x order of detection | DNA primase small subunit |
| DNA primase large subunit | 5 | 2 x order of detection; 3 x stop codon | DNA primase large subunit |
Tab. 2.6: Error types and number of errors for the incorrectly identified short DNA sequences
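The percentages quoted in the text follow directly from the counts in Tab. 2.6 and the total of 81 identified PCSs. The short check below recomputes them; the dictionary layout and the helper `pct` are illustrative choices, and percentages are rounded to whole numbers as in the text.

```python
# (order-of-detection errors, stop-codon errors) per target protein, Tab. 2.6
errors = {
    "probable periplasmic serine protease DO-like precursor": (1, 1),
    "queuine tRNA-ribosyltransferase": (1, 0),
    "protein export protein SecD": (5, 4),
    "DNA primase small subunit": (1, 0),
    "DNA primase large subunit": (2, 3),
}
total_pcs = 81  # identified PCSs, as stated in the text


def pct(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


order_errors = sum(order for order, _ in errors.values())
stop_errors = sum(stop for _, stop in errors.values())
false_pcs = order_errors + stop_errors
correct_pcs = total_pcs - false_pcs

print(false_pcs, pct(false_pcs, total_pcs))      # 18 22
print(correct_pcs, pct(correct_pcs, total_pcs))  # 63 78
print(order_errors, pct(order_errors, false_pcs))  # 10 56
print(stop_errors, pct(stop_errors, false_pcs))    # 8 44
```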