Interpretation
MetaGenomeThreader
Result Interpretation (long DNA Sequences)
The test results demonstrated the general functionality of the MetaGenomeThreader. Tab. 1.4 shows the boundaries of the calculated PCS's (column 5) for the test data set (column 2). Columns 3 and 4 show which sequence regions were partially covered (p) or not covered (n) by the sequences simulated with ReadSim. It therefore became clear that some coding DNA sequences could not be identified because simulated DNA sequence data were lacking. In the end, 3 coding DNA sequences could not be identified despite existing simulated DNA sequences (7 were not identified in total, but 4 of these 7 were not covered).
species (cp. Tab. 1.1) | coding DNA sequence region | sequence coverage (from - to) | partial (p) / not covered (n) | identified PCS's
Candidatus Pelagibacter | -3 891...-4 832; -4 835...-6 247; -6 249...-6 434; -6 444...-7 310 | +4 082...+6 147 | p; n; n | -; -; -; -
Vibrio cholerae | +794 241...+795 380; +795 485...+795 817; +795 839...+797 692; +797 707...+798 654 | +795 149...+797 162 | p; n | -; +795 467...+795 814; +795 824...+797 095; -
Pyrococcus horikoshii | -171 622...-172 662; -172 610...-173 815; +173 822...+174 907 | +172 005...+173 275 | p; p; p | -172 003...-172 662; -172 607...-173 269; -
Tab. 1.4: Test data set (column 2), sequence coverage (column 3), coverage status (column 4) and identified PCS's (column 5) (cp. the summary of the test results for short DNA sequences, here).
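The coverage classes used in Tab. 1.4 could be derived from the region boundaries and the covered range roughly as follows. This is a minimal sketch, not the MetaGenomeThreader or ReadSim code: the function name `classify_coverage`, the label "full" for completely covered regions, the use of inclusive coordinates and the decision to ignore the strand sign are all assumptions made for illustration. The coordinates are the Vibrio cholerae values from the table.

```python
def classify_coverage(region, covered):
    """Classify a gene region against the covered range.

    Both arguments are (start, end) tuples with inclusive coordinates.
    The strand sign is ignored (assumption). Returns (label, covered_bases),
    where label is "n" (not covered), "p" (partially covered) or
    "full" (completely covered; hypothetical label, not used in Tab. 1.4).
    """
    start, end = sorted(abs(x) for x in region)
    cov_start, cov_end = sorted(abs(x) for x in covered)
    # Number of bases shared by the two inclusive intervals (0 if disjoint).
    overlap = max(0, min(end, cov_end) - max(start, cov_start) + 1)
    if overlap == 0:
        return "n", 0
    if overlap == end - start + 1:
        return "full", overlap
    return "p", overlap


# Vibrio cholerae values taken from Tab. 1.4.
covered = (795149, 797162)
regions = [
    (794241, 795380),
    (795485, 795817),
    (795839, 797692),
    (797707, 798654),
]
results = [classify_coverage(r, covered) for r in regions]
for region, (label, bases) in zip(regions, results):
    print(region, label, bases)
```

Under these assumptions the first region overlaps the covered range by only a little over 230 bases, which appears consistent with the small covered stretch discussed for Vibrio cholerae.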
Verification was required to explain why 3 PCS's were not identified. The PCS of Vibrio cholerae could probably not be predicted because only 230 DNA bases were covered by simulated DNA sequences. The program could not predict any coding DNA sequences for Candidatus Pelagibacter. The statistics and the taxonomic classification for Candidatus Pelagibacter already showed that the available DNA sequence data were not of adequate quality (cp. Tab. 1.2 and Tab. 1.3) to predict the coding DNA sequences of the test data set. Tab. 1.5 presents a BLAST of the protein sequences from the test data set against a database of non-redundant protein sequences (nr-DB). It gives the protein sequence identity values and the values for amino acid exchanges with a positive score, using a BLOSUM62 matrix. For Candidatus Pelagibacter these values are considerably lower than those for Vibrio cholerae and Pyrococcus horikoshii.
Conclusion: The data used as the basis for the PCS calculations were not as good for Candidatus Pelagibacter as for Vibrio cholerae and Pyrococcus horikoshii. For Candidatus Pelagibacter, only 3 PCS's could be predicted, and these relate to the serine protease with its higher values (cp. Tab. 1.5). These 3 predicted PCS's were not calculated correctly because the reading frame was detected incorrectly.
target protein | dominant species of the PCS identification (cp. Tab. 1.2 - statistic section (long DNA sequences)) | sequence identity (in %) | positive AA exchanges (in %)
tRNA isopentenyltransferase | Rhodopseudomonas palustris HaA2 | 35 | 55
tRNA isopentenyltransferase | Rhodopseudomonas palustris BisB5 | 34 | 53
probable periplasmic serine protease DO-like precursor | Rhodopseudomonas palustris HaA2 | 43 | 64
probable periplasmic serine protease DO-like precursor | Rhodopseudomonas palustris BisB5 | 43 | 64
preprotein translocase subunit YajC | Vibrio cholerae O395 | ca. 100 | ca. 100
preprotein translocase subunit YajC | Vibrio cholerae vulnificus YJ016 | 84 | 90
protein export protein SecD | Vibrio cholerae O395 | ca. 100 | ca. 100
protein export protein SecD | Vibrio cholerae vulnificus YJ016 | 91 | 95
DNA primase small subunit | Pyrococcus furiosus | 79 | 90
DNA primase small subunit | Pyrococcus abyssi | 83 | 91
DNA primase large subunit | Pyrococcus furiosus | 68 | 87
DNA primase large subunit | Pyrococcus abyssi | 75 | 91
Tab. 1.5: Comparison of the identity (column 3) and the positives (column 4) in the amino acid alignments of the target proteins against a non-redundant protein database for all three species.
The following table (cp. Tab. 1.6) compares each target protein with the protein that was assigned to it in the MetaGenomeThreader result. In almost all cases the correct protein was assigned. In the case of the overlapping genes of Pyrococcus horikoshii and the closely spaced genes of Vibrio cholerae, the overlapping and neighbouring genes also appear in the BLAST hits. The correctly identified genes can then be recognized only by the number of DNA bases of the respective assigned genes.
target protein | species for the PCS identification (cp. Tab. 1.2) | BLAST hits and the assigned proteins (number of the DNA bases in brackets)
periplasmic serine protease | Rhodop. palustris HaA2 | peptidase S1C, DO
periplasmic serine protease | Rhodop. palustris BisB5 | peptidase S1C, DO
preprotein translocase subunit YajC | Vibrio cholerae O395 | queuine tRNA-ribosyltransferase (ca. 133); conserved hypothetical protein (ca. 330); protein-export membrane protein SecD (ca. 750)
preprotein translocase subunit YajC | Vibrio cholerae vulnificus YJ016 | preprotein translocase, YajC subunit (ca. 330); preprotein translocase subunit SecD (ca. 510)
protein export protein SecD | Vibrio cholerae O395 | queuine tRNA-ribosyltransferase (ca. 133); conserved hypothetical protein (ca. 330); protein-export membrane protein SecD (ca. 1 350)
protein export protein SecD | Vibrio cholerae vulnificus YJ016 | preprotein translocase, YajC subunit (ca. 330); preprotein translocase subunit SecD (ca. 1 300)
DNA primase small subunit | Pyrococcus furiosus | DNA-primase, putative (ca. 660); hypothetical protein (ca. 50)
DNA primase small subunit | Pyrococcus abyssi | DNA primase (ca. 660); eukaryotic-type DNA primase, large subunit (ca. 290)
DNA primase large subunit | Pyrococcus furiosus | hypothetical protein (ca. 570)
DNA primase large subunit | Pyrococcus abyssi | DNA primase (ca. 250); eukaryotic-type DNA primase, large subunit (ca. 550)
Tab. 1.6: Comparison of the target proteins and the proteins of the BLAST hits of the MetaGenomeThreader result.
Conclusion: The general functionality of the MetaGenomeThreader could be demonstrated. However, two aspects should be taken into consideration. First, the sequence data used as the data basis have to be good enough for the prediction of PCS's. Second, one should bear in mind that the detection of the reading frame can be a key problem area. E.g., if there are PCS's of the same length, the reading frame is chosen in the following order: +3, +2, +1, -1, -2, -3.
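The tie-break described above can be illustrated with a short sketch. It is not the MetaGenomeThreader implementation: the function name `choose_pcs`, the representation of a candidate as a (frame, length) pair, the rule that a longer PCS is preferred before the frame order is consulted, and the example lengths are all assumptions. Only the frame order itself is taken from the text.

```python
# Frame order for breaking ties between PCS's of the same length (from the text).
FRAME_PRIORITY = [+3, +2, +1, -1, -2, -3]


def choose_pcs(candidates):
    """Pick one candidate from a list of (frame, length) pairs.

    Assumption: the longest candidate wins; candidates of equal length are
    ranked by their position in FRAME_PRIORITY (earlier position wins).
    """
    return min(
        candidates,
        key=lambda c: (-c[1], FRAME_PRIORITY.index(c[0])),
    )


# Hypothetical candidates: two frames tie at 660 bases, one is shorter.
candidates = [(-1, 660), (+2, 660), (+1, 600)]
best = choose_pcs(candidates)
print(best)
```

With such a fixed order, a tie is always resolved in favour of the frame listed first, regardless of which frame is biologically correct. This is one way an incorrect reading frame can end up being selected.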