- Open Access
Phylogenetic placement of metagenomic reads using the minimum evolution principle
© Filipski et al; licensee BioMed Central Ltd. 2015
- Published: 15 January 2015
A central problem of computational metagenomics is determining the correct placement into an existing phylogenetic tree of individual reads (nucleotide sequences of varying lengths, ranging from hundreds to thousands of bases) obtained using next-generation sequencing of DNA samples from a mixture of known and unknown species. Correct placement allows us to easily identify or classify the sequences in the sample as to taxonomic position or function.
Here we propose a novel method (PhyClass), based on the Minimum Evolution (ME) phylogenetic inference criterion, for determining the appropriate phylogenetic position of each read. Without using heuristics, the new approach efficiently finds the optimal placement of the unknown read in a reference phylogenetic tree given a sequence alignment for the taxa in the tree. In short, the total resulting branch length for the tree is computed for every possible placement of the unknown read and the placement that gives the smallest value for this total is the best (optimal) choice. By taking advantage of computational efficiencies and mathematical formulations, we are able to find the true optimal ME placement for each read in the phylogenetic tree. Using computer simulations, we assessed the accuracy of the new approach for different read lengths over a variety of data sets and phylogenetic trees. We found the accuracy of the new method to be good and comparable to existing Maximum Likelihood (ML) approaches.
In particular, we found that the consensus assignments based on ME and ML approaches are more often correct than the assignments of either method individually. This is true even when the statistical support for read assignments was low, which is inevitable given that individual reads are often short and come from only one gene.
- Pairwise Distance
- Correct Placement
- Reference Tree
- Minimum Evolution
- True Alignment
Rapid and inexpensive sequencing methods yielding short reads have become common for analyzing mixed-species biological samples [1–8]. Phylogenetic and taxonomic classification of species present may be done by extracting and amplifying fragments of a distinctive gene such as one for ribosomal RNA from the sample and comparing the results to reference samples [9, 10]. Early methods for identification of metagenomic reads were based on BLAST [11, 12], but these approaches do not define the best phylogenetic placement [e.g., [13, 14]]. Consequently, rigorous phylogenetic methods under the maximum likelihood (ML) principle have been developed for ascertaining the phylogenetic placement of a sequence read in a given species tree [15–17]. However, an approach using the Minimum Evolution (ME) principle is not yet available for classifying metagenomic reads, which is important because, as we show below, it is possible to develop a method that does not require heuristics for classifying reads when using a matrix of pairwise distances. Furthermore, methods employing different optimality principles (ML and ME) can be useful in molecular phylogenetics to assess the robustness of inferences to underlying biases of individual methods.
Therefore, we have developed a distance-based approach (called PhyClass) under the Minimum Evolution (ME) principle for classifying metagenomic reads. We have also implemented an efficient procedure for producing bootstrap statistical support for the assignment of any read to any position in the reference species tree. In the following, we first describe the new method and then assess its absolute accuracy by using computer simulations. We also compare the performance of PhyClass with the most accurate existing Maximum Likelihood based placement program, EPA. EPA is based on the RAxML package and has been shown to perform as well as or better than other methods for this purpose [15, 16]. We also discuss the usefulness of concordance between PhyClass and EPA inferences in identifying correct assignments even when the statistical support for the read assignment is low.
Details of the Algorithm
In the PhyClass method, the input consists of a set of partial sequences of a specific gene/genomic segment (reads), a reference tree topology (T) describing the phylogenetic relationships among some set of n species, and a multiple sequence alignment (MSA) of the relevant genes or genomic segments for these n species. For a given read r, the goal is to find a best-fit placement for r in the tree T. This may be done by minimizing the cost, defined as the sum of branch lengths $S_{b,r}$ of the tree containing read r attached to branch b of T, over all possible placements b. Under the ME principle, the configuration where $S_{b,r}$ is the smallest is the best placement. In this calculation, we use a matrix D of distances among all n+1 sequences (reference alignment plus the given read). For all placements of a read, it is only necessary to recalculate pairwise distances between the given read and the n reference sequences, with the pairwise distances among the reference sequences calculated once at the start of the analysis. Furthermore, the calculation of $S_{b,r}$ for all placements of the same read in a fixed reference tree can be done efficiently, because calculating the change between $S_{b,r}$ and $S_{b',r}$, where b and b' are adjacent branches of T, requires only a limited calculation involving local branch lengths [19, 20]. Therefore, no approximate heuristics are necessary to efficiently apply the ME principle by evaluating all possible topological locations for read r. Different distance measures may be used for the ME computations.
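The exhaustive search described above can be illustrated with a small brute-force sketch. It is not the PhyClass implementation: it refits every candidate tree from scratch instead of using the incremental update between adjacent branches, and it assumes ordinary least-squares branch lengths as the estimator. The function names, the edge-list tree encoding, and the toy distances are all hypothetical.

```python
def leaf_side(edges, cut, leaves):
    """Leaves reachable from cut[0] once the branch `cut` is removed."""
    adj = {}
    for u, v in edges:
        if (u, v) != cut:
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
    seen, stack = {cut[0]}, [cut[0]]
    while stack:
        for nxt in adj.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen & leaves

def solve(a, b):
    """Gauss-Jordan elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def tree_length(edges, dist):
    """Sum of least-squares branch lengths for a fixed topology (assumed estimator)."""
    leaves = {x for pair in dist for x in pair}
    sides = [leaf_side(edges, e, leaves) for e in edges]
    pairs = list(dist)
    # A branch lies on the path between i and j iff it separates i from j.
    rows = [[1.0 if (i in s) != (j in s) else 0.0 for s in sides] for i, j in pairs]
    k = len(edges)
    ata = [[sum(r[x] * r[y] for r in rows) for y in range(k)] for x in range(k)]
    atd = [sum(r[x] * dist[p] for r, p in zip(rows, pairs)) for x in range(k)]
    return sum(solve(ata, atd))

def best_placement(edges, dist, read="read"):
    """Try the read on every branch and keep the smallest total branch length."""
    scores = {}
    for u, v in edges:
        rest = [e for e in edges if e != (u, v)]
        candidate = rest + [(u, "_w"), ("_w", v), ("_w", read)]
        scores[(u, v)] = tree_length(candidate, dist)
    return min(scores, key=scores.get), scores

# Toy reference tree ((A,B),(C,D)) with internal nodes x and y.
edges = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]
# Additive distances generated with the read attached to the branch leading to A.
dist = {("A", "B"): 3, ("A", "C"): 4, ("A", "D"): 4, ("A", "read"): 2,
        ("B", "C"): 3, ("B", "D"): 3, ("B", "read"): 3,
        ("C", "D"): 2, ("C", "read"): 4, ("D", "read"): 4}
best, scores = best_placement(edges, dist)
print(best, scores[best])
```

Refitting every candidate is only practical for toy trees; the local update between adjacent branches mentioned in the text is what makes the exhaustive search affordable on trees with hundreds of taxa.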
Accuracy of the Method Using Complete Sequences
We began by establishing a baseline profile, where we assumed that the original full length sequence was available (2000 base pairs) along with the true MSA as query. In this way, we established the maximum possible accuracy one could achieve if the metagenomics sequence extraction process produced complete full-length error-free sequences that could be aligned perfectly. We used each individual sequence in the reference tree of 500 sequences as a read (query) for metagenomic analysis by first removing that sequence from the tree and then evaluating the ability of PhyClass to replace it at the correct topological position in the tree of 499 remaining taxa. Under these conditions, 66% (± 1.7%) of the original perfect sequences could be re-assigned correctly (fc = 66%).
Effect of Alignment
As the true alignments and perfect sequences are never available in a real metagenomics context, we next examined the performance when using HMMER for aligning query reads to sequences in the reference MSA. For added realism, we introduced three types of errors into the query sequences: 1% of the bases were mutated to another base, another 1% of the bases were deleted, and 1% of the bases were duplicated (stutter) [26, 27]. The resulting accuracy was lower than in the ideal case (fc = 49.4 ± 1.6%; Figure 2a).
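A minimal sketch of such an error model follows. It assumes each base is hit independently with probability 1% per error type, which is one possible reading of the description; the paper does not say whether exact counts or per-base probabilities were used. The function name and parameters are hypothetical.

```python
import random

def add_read_errors(seq, rate=0.01, rng=None):
    """Apply substitution, deletion and duplication errors, each at `rate` per base."""
    rng = rng if rng is not None else random.Random()
    out = []
    for base in seq:
        u = rng.random()
        if u < rate:
            # Substitution: replace with one of the other three bases.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        elif u < 2 * rate:
            # Deletion: drop the base.
            continue
        elif u < 3 * rate:
            # Duplication (stutter): emit the base twice.
            out.append(base * 2)
        else:
            out.append(base)
    return "".join(out)

noisy = add_read_errors("ACGT" * 500, rng=random.Random(1))
```

Passing a seeded `random.Random` instance keeps the simulated errors reproducible across runs.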
It is important to realize that the true metagenomics accuracy will generally be higher than the fc we report here (for a given sequence length), because in our evaluations the species corresponding to our queries were never allowed to appear in the reference tree. That is, we removed the query sequence from the reference tree and alignment before using PhyClass. Otherwise, we expect to be able to correctly assign the query to its source data due to the small evolutionary distance between the query and the correct reference sequences. This was confirmed by our analyses, where the fc for known sequences (2000 bp queries with simulated read error and HMMER alignment) was 97.2% for the simulated dataset. Therefore, the metagenomics read assignment accuracy will be considerably higher when the proportion of sequences from known species is high. Nevertheless, in order to make the tests as challenging as possible, we introduced read errors as described above, applied HMMER instead of using true alignment information, and conducted analyses after deleting the true target species from the tree in all subsequent analyses reported below. In this sense, our results are worst-case scenarios. A similar protocol was used in the study by Berger et al.
Anatomy of Misplacements
In order to better understand the anatomy of misplacements, we analyzed the proximity of the erroneous assignments to the correct location by recording two measures: the number of intervening nodes separating the correct branch and the branch assigned by PhyClass (node separation, SN), and the evolutionary distance, in terms of mean number of substitutions per site, spanned by the correct and incorrect placements (branch length separation, SB). For correct placements, both measures are 0. We found that 54% of the incorrectly placed reads were assigned to a branch immediately adjacent to the correct branch (SN = 1) and another 40% were assigned to branches that were just two nodes away (SN = 2; Figure 2b). Thus, approximately 94% of these placements were made no more than two nodes from their correct position in the tree.
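The node separation measure can be sketched as a breadth-first search over the reference tree. This is an illustrative reading of the definition (adjacent branches share one node, so SN = 1), not code from PhyClass; the edge-list encoding and the function name are assumptions.

```python
from collections import deque

def node_separation(edges, correct, assigned):
    """Number of nodes lying between two branches of a tree (0 if identical)."""
    if set(correct) == set(assigned):
        return 0
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Start from both endpoints of the correct branch at depth 0.
    depth = {node: 0 for node in correct}
    queue = deque(correct)
    while queue:
        node = queue.popleft()
        if node in assigned:
            # Nodes on the connecting path = edges between the endpoints + 1.
            return depth[node] + 1
        for nxt in adj[node]:
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    raise ValueError("branches are not in the same tree")

# Toy tree ((A,B),(C,D)) with internal nodes x and y.
toy_edges = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]
```

In the toy tree, the branches leading to A and B meet at node x, so they are one node apart, while the branches leading to A and C are separated by both x and y.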
In ordinary (non-testing) use, evolutionary distance information can also be useful. For example, if the unknown read is placed into the tree at a long distance from the nearest reference node, then we know that the item has been placed into a relatively isolated phylogenetic position with respect to the reference samples. This information, which is readily available as all branch lengths in the tree are automatically computed, may be useful as an indicator of reads from anomalous species.
Analysis of Partial Sequences
We hypothesized that an additional factor affecting accuracy is the evolutionary distance measure used, although it was not clear a priori what the effect would be, as more sophisticated evolutionary models reduce distance estimation bias but also increase the variance of the estimate. For the above results, we used p-distance (the fraction of sites that differ between two sequences), which has a relatively small estimation variance and is computable for all sequence pairs, unlike more sophisticated distance measures that often fail (e.g. Jukes-Cantor distance for sequence pairs that differ at more than 75% of sites). It is known that p-distance is a good approximation to more sophisticated model-based distances over short evolutionary distances, and it is generally these distances we are most concerned with when making difficult differentiations among closely-placed nodes on short branches. Use of p-distance in an ME context is further supported by several studies which found that, unless sequence lengths are very great, the simple p-distance generally gives better results in phylogenetic inference than more complicated distance measures [31, 32]. In agreement with these results, empirical tests using PhyClass on our data sets show that p-distance performed slightly better than more sophisticated distances in terms of placement accuracy. For shorter sequences, p-distance provided improved accuracy over a very sophisticated distance measure (Figure 3d) based on the Maximum Composite Likelihood method [33, 34] with a gamma model to account for rate variation among sites, which matched the Tamura-Nei model used to generate the sequence data according to the tree.
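The contrast between the two distance measures can be made concrete with a short sketch. The Jukes-Cantor correction used here is the standard formula d = -(3/4) ln(1 - 4p/3), which has no finite value once p reaches 0.75; the function names and the choice to skip gapped or ambiguous sites are assumptions made for illustration.

```python
import math

def p_distance(a, b):
    """Fraction of compared sites that differ; gapped or ambiguous sites are skipped."""
    sites = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in sites) / len(sites)

def jukes_cantor(p):
    """Jukes-Cantor distance from a p-distance, or None where it is undefined."""
    if p >= 0.75:
        return None  # the logarithm's argument is zero or negative
    return -0.75 * math.log(1 - 4 * p / 3)

p = p_distance("ACGTACGTAC", "ACGTACGTAA")   # one difference in ten sites
d = jukes_cantor(p)
```

For small p the two values are close, which is the regime the text identifies as most relevant when distinguishing closely placed nodes on short branches.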
Significance of placements
We assess the statistical significance of the placement using the bootstrap resampling procedure, where the n pairwise distances between the given read and the n reference sequences in the MSA are obtained by using multinomially sampled counts of the 16 (4×4) possible nucleotide pairs for each read-reference sequence comparison. Therefore, the pairwise distances in D corresponding to the read-reference pairs are updated in each bootstrap replicate based on the multinomial counts, while the rest of the pairwise distances, the ½ n×(n-1) distances between reference sequence pairs, remain the same in every bootstrap replicate, because the topology of the reference tree is assumed to be fixed in each replicate. We used p-distances, as bootstrapping for these distances is fast to compute using the multinomial sampling.
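One replicate of this resampling can be sketched as follows. The sketch assumes that each read-reference pair is resampled independently from its own table of nucleotide-pair counts, as the description suggests, and it emulates the multinomial draw with weighted sampling from the standard library. The function names are hypothetical.

```python
import random
from collections import Counter

def pair_counts(read, ref):
    """Counts of the (read base, reference base) pairs over comparable sites."""
    return Counter((x, y) for x, y in zip(read, ref) if x in "ACGT" and y in "ACGT")

def bootstrap_p_distance(counts, rng):
    """One bootstrap replicate of the p-distance from multinomially resampled counts."""
    pairs = list(counts)
    total = sum(counts.values())
    # Drawing `total` sites with probabilities proportional to the observed
    # counts is a multinomial sample over the (at most 16) pair categories.
    draw = Counter(rng.choices(pairs, weights=[counts[p] for p in pairs], k=total))
    return sum(c for (x, y), c in draw.items() if x != y) / total

rng = random.Random(7)
counts = pair_counts("ACGTACGTACGTACGT", "ACGTACGAACGTTCGT")
replicates = [bootstrap_p_distance(counts, rng) for _ in range(100)]
```

Only the n read-reference distances are redrawn in each replicate; the reference-reference part of D can be reused unchanged.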
Additional Data Sets
Table 1. Characteristics of test data sets. Columns include mean branch length.
Comparison with EPA
In order to assess the value of PhyClass in the context of the state of the art, we compared its accuracy with that of EPA, which is widely used. EPA chooses the optimal placement using a heuristic approximation of the total Maximum Likelihood (ML) of the candidate trees. It is based on the widely-used ML tree inference program RAxML. The accuracy of EPA was consistently about 4 percentage points higher than that of PhyClass over all query lengths for the model S500R data set, and higher by a similar amount on the empirical data sets (Figure 3a). This is consistent with the slightly better performance of likelihood-based methods over ME methods.
Although our goal here is to evaluate the accuracy of the PhyClass method rather than to optimize its running time, especially where optimizations would involve heuristic shortcuts that may lead to sub-optimal placement, we made some observations about timings. Currently, our prototype PhyClass classifier, exclusive of the one-time initial alignment using HMMER, takes around 6 seconds per query to compute one placement on the S500R data set. EPA with the “slow” option (which we used for all accuracy tests because it provides the highest accuracy) took around 14 seconds per query. (EPA also provides a suboptimal “fast” option, which uses about 0.3 seconds/query after an initial setup time of about 4 minutes for the entire set of queries.) There are several clear opportunities for time optimizations in PhyClass that would not compromise the current strictly optimal ME calculation. In our prototype version of PhyClass, each query is processed completely independently, and most of the time is spent reading and writing text files, setting up data structures, and formatting them for output. This is appropriate in test mode, because removing the true source taxon from the tree and the reference set to see whether it is re-inserted at the same point changes the data with each query. Most of this overhead can be eliminated in more typical use by running a large number of queries in batch mode, so that most computation (e.g. pairwise distances among reference sequences) is factored out of the per-query workload into a setup phase and held in computer memory rather than written to external files. An analysis of the 6-second runtime of PhyClass on the S500R data set shows that more than half is consumed reading and writing text files, with the remainder divided between distance and sum-of-branch-length calculations, the former of which can be moved to a set-up phase instead of being done on a per-query basis.
EPA already incorporates optimizations to pre-compute and retain in memory as much as possible.
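The batch-mode idea of moving reference distances into a setup phase might look like the sketch below. It is not taken from PhyClass; the class name, method names, and the plain p-distance helper are hypothetical, and sequences are assumed to be aligned and of equal length.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of differing sites between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

class ReferenceDistances:
    """Computes reference-reference distances once and reuses them per query."""

    def __init__(self, refs):
        self.refs = dict(refs)
        # Setup phase: n(n-1)/2 distances, held in memory for all queries.
        self.between_refs = {
            (a, b): p_distance(self.refs[a], self.refs[b])
            for a, b in combinations(sorted(self.refs), 2)
        }

    def matrix_for(self, read, label="read"):
        """Per-query phase: add only the n read-reference distances."""
        dist = dict(self.between_refs)
        for name, seq in self.refs.items():
            dist[(label, name)] = p_distance(read, seq)
        return dist

cache = ReferenceDistances({"A": "ACGT", "B": "ACGA", "C": "TCGA"})
matrix = cache.matrix_for("ACGT")
```

With n reference sequences, each query then costs n distance calculations instead of n(n+1)/2.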
PhyClass and EPA used together
Using sequence simulation, we evaluated the use of the Minimum Evolution principle to place aligned fragmented DNA sequences representing metagenomic reads into phylogenetically correct positions in an existing tree of known sequences. Our tests showed the accuracy of the new method to be comparable to, but in most cases slightly less than that of the best existing Maximum Likelihood program to deal with the same problem. Consensus assignments based on both approaches together were found to be often more correct than either method individually, even when the statistical support for assignments was low.
We used four primary data sets for testing, three empirical and one simulated. The empirical ones (D150, D218, and D714) were from the EPA study. Inferred ML trees and reference sequences were downloaded from the site associated with that report.
We simulated data set S500R by sampling 500 terminal taxa, including bacteria, archaea, and eukaryota, from a 1610-taxon tree that represented all major groups from the tree of life. 2000-bp DNA alignments for the tree were simulated with Seq-Gen, without indels, using parameters calculated from the Silva rRNA dataset. Using the model testing feature of MEGA, the best model for these sequences was calculated to be the GTR + Γ + I with an alpha value of 0.7, five categories for the gamma rate distribution, and 0.8% invariant sites. A nominal evolutionary rate of 0.22 substitutions/Gy was used to produce an alignment with similar pairwise distances to the Silva empirical data [38, 40]. Variation was added to the rates by independently varying each branch’s rate in the tree over a uniform distribution from 0.11 substitutions/Gy to 0.33 substitutions/Gy (plus or minus 50% of the nominal value). Table 1 shows some empirical characteristics of these data sets.
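The branch-rate variation step can be sketched as below, under the assumption that each branch of a timed tree has a duration in Gy and that its length in substitutions per site is the sampled rate multiplied by that duration. The function name, the dictionary encoding, and the example durations are hypothetical.

```python
import random

def vary_branch_rates(durations_gy, nominal=0.22, spread=0.5, rng=None):
    """Convert branch durations (Gy) to lengths using independent uniform rates."""
    rng = rng if rng is not None else random.Random()
    low, high = nominal * (1 - spread), nominal * (1 + spread)
    # One independent rate per branch, uniform on [low, high] substitutions/Gy.
    return {branch: rng.uniform(low, high) * t for branch, t in durations_gy.items()}

durations = {"b1": 0.5, "b2": 1.2, "b3": 2.0}   # illustrative values only
lengths = vary_branch_rates(durations, rng=random.Random(3))
```

With the default arguments the sampled rates span 0.11 to 0.33 substitutions/Gy, matching the range stated above.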
The EPA program was obtained from the developers’ site. EPA provides a “slow” and a “fast” placement option. According to the developers, the “slow” option is more accurate. We confirmed this with a test sample and therefore used EPA “slow” mode for all accuracy comparisons.
All placement accuracy statistics for EPA and PhyClass were calculated by first deleting from the tree and the reference sequence alignment the reference sequence from which the query was extracted. The placement of that query was then counted as correct if it was inserted into the same branch to which the original sequence had been attached. This is the same testing strategy as used in Berger et al.
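This leave-one-out protocol can be sketched as follows for a strictly bifurcating unrooted tree stored as an edge list. The placement method is passed in as a callback, so the sketch does not depend on any particular classifier; the function names and the toy tree are hypothetical.

```python
def prune_leaf(edges, leaf):
    """Remove `leaf` and return (reduced_edges, branch_it_was_attached_to)."""
    attach = next(v if u == leaf else u for u, v in edges if leaf in (u, v))
    kept = [e for e in edges if leaf not in e]
    touching = [e for e in kept if attach in e]
    others = [e for e in kept if attach not in e]
    # The attachment node now has degree 2, so its two branches merge into one.
    a, b = [u if v == attach else v for u, v in touching]
    return others + [(a, b)], (a, b)

def fraction_correct(edges, leaves, place):
    """Fraction of leaves that `place(reduced_edges, leaf)` puts back correctly."""
    hits = 0
    for leaf in leaves:
        reduced, truth = prune_leaf(edges, leaf)
        guess = place(reduced, leaf)
        hits += set(guess) == set(truth)
    return hits / len(leaves)

# Toy tree ((A,B),(C,D)) with internal nodes x and y.
tree = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]
reduced, truth = prune_leaf(tree, "A")
oracle = lambda red, leaf: prune_leaf(tree, leaf)[1]   # always-correct placer
fc = fraction_correct(tree, ["A", "B", "C", "D"], oracle)
```

A real evaluation would replace `oracle` with a function that aligns the query and runs the placement method on the reduced tree and alignment.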
We conjectured that alignment method is an important factor in placement accuracy. We tried many alignment methods while developing PhyClass and found HMMER in profile alignment mode (http://hmmer.janelia.org/) to be faster and more accurate than the other programs tested, including Muscle and CLUSTAL. The EPA developers recommended HMMER as well. Since the testing method involved extracting queries from known reference sequences, we had the true alignments available for comparison. We found, unexpectedly, that the average difference in the percentage of exact placements, over all data sets and categories, between HMMER profile alignment and the true alignment was only 3.2 percentage points, suggesting that not much is to be gained, at least in terms of accuracy, from alignment improvements.
This work was supported by funding from National Institutes of Health (NIH; HG002096-12 to SK and HG006039-02 to A.F.). O.M. was also supported by a training program (NIH, R25GM099650). We thank Dr. Rosa Krajmalnic-Brown for helpful conversations on metagenomic requirements.
Publication charges for this article have been funded from research grants from National Institutes of Health (NIH; HG002096-12) and HiCi-1434-117-1 from KAU.
This article has been published as part of BMC Genomics Volume 16 Supplement 1, 2015: Selected articles from the 2nd International Genomic Medical Conference (IGMC 2013): Genomics. The full contents of the supplement are available online at http://0-www.biomedcentral.com.brum.beds.ac.uk/bmcgenomics/supplements/16/S1
- Ronaghi M, Uhlén M, Nyrén P: A Sequencing Method Based on Real-Time Pyrophosphate. Science. 1998, 281 (5375): 363-365.
- Ronaghi M: Pyrosequencing Sheds Light on DNA Sequencing. Genome Research. 2001, 11 (1): 3-11. 10.1101/gr.11.1.3.
- Riesenfeld CS, Schloss PD, Handelsman J: Metagenomics: genomic analysis of microbial communities. Annu Rev Genet. 2004, 38: 525-552. 10.1146/annurev.genet.38.072902.091216.
- Krause L, Diaz NN, Goesmann A, Kelley S, Nattkemper TW, Rohwer F, Edwards RA, Stoye J: Phylogenetic classification of short environmental DNA fragments. Nucl Acids Res. 2008, gkn038.
- Tringe SG, Rubin EM: Metagenomics: DNA sequencing of environmental samples. Nat Rev Genet. 2005, 6 (11): 805-814. 10.1038/nrg1709.
- Jung JY, Lee SH, Kim JM, Park MS, Bae J-W, Hahn Y, Madsen EL, Jeon CO: Metagenomic Analysis of Kimchi, a Traditional Korean Fermented Food. Applied and Environmental Microbiology. 2011, 77 (7): 2264-2274. 10.1128/AEM.02157-10.
- Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, et al: Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004, 304 (5667): 66-74. 10.1126/science.1093857.
- Balzer S, Malde K, Lanzén A, Sharma A, Jonassen I: Characteristics of 454 pyrosequencing data—enabling realistic simulation with flowsim. Bioinformatics. 2010, 26 (18): i420-i425. 10.1093/bioinformatics/btq365.
- Woese CR, Fox GE: Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proceedings of the National Academy of Sciences. 1977, 74 (11): 5088-5090. 10.1073/pnas.74.11.5088.
- Cole JR, Chai B, Farris RJ, Wang Q, Kulam SA, McGarrell DM, Garrity GM, Tiedje JM: The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis. Nucleic Acids Research. 2005, 33 (suppl 1): D294-D296.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410. 10.1016/S0022-2836(05)80360-2.
- Huson DH, Auch AF, Qi J, Schuster SC: MEGAN analysis of metagenomic data. Genome Res. 2007, 17 (3): 377-386. 10.1101/gr.5969107.
- Koski LB, Golding GB: The closest BLAST hit is often not the nearest neighbor. J Mol Evol. 2001, 52 (6): 540-542. 10.1007/s002390010184.
- Clemente J, Jansson J, Valiente G: Flexible taxonomic assignment of ambiguous sequencing reads. BMC Bioinformatics. 2011, 12 (1): 8. 10.1186/1471-2105-12-8.
- Berger SA, Krompass D, Stamatakis A: Performance, Accuracy, and Web Server for Evolutionary Placement of Short Sequence Reads under Maximum Likelihood. Systematic Biology. 2011, 60 (3): 291-302. 10.1093/sysbio/syr010.
- Matsen F, Kodner R, Armbrust EV: pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics. 2010, 11 (1): 538. 10.1186/1471-2105-11-538.
- Stark M, Berger S, Stamatakis A, von Mering C: MLTreeMap - accurate Maximum Likelihood placement of environmental DNA sequences into taxonomic and functional reference phylogenies. BMC Genomics. 2010, 11 (1): 461. 10.1186/1471-2164-11-461.
- Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006, 22 (21): 2688-2690. 10.1093/bioinformatics/btl446.
- Rzhetsky A, Nei M: Theoretical Foundation of the Minimum-Evolution Method of Phylogenetic Inference. Mol Biol Evol. 1993, 10 (5): 1073-1095.
- Rzhetsky A, Kumar S, Nei M: Four-cluster analysis: a simple method to test phylogenetic hypotheses. Mol Biol Evol. 1995, 12 (1): 163-167. 10.1093/oxfordjournals.molbev.a040185.
- Rambaut A, Grassly NC: Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Computer applications in the biosciences: CABIOS. 1997, 13 (3): 235-238.
- Kumar S, Filipski AJ, Battistuzzi FU, Kosakovsky Pond SL, Tamura K: Statistics and Truth in Phylogenomics. Mol Biol Evol. 2011.
- Gontcharov AA, Marin B, Melkonian M: Are Combined Analyses Better Than Single Gene Phylogenies? A Case Study Using SSU rDNA and rbcL Sequence Comparisons in the Zygnematophyceae (Streptophyta). Mol Biol Evol. 2004, 21 (3): 612-624.
- Castresana J: Topological variation in single-gene phylogenetic trees. Genome Biol. 2007, 8 (6): 216.
- Eddy SR: Accelerated Profile HMM Searches. PLoS Comput Biol. 2011, 7 (10).
- Pop M: Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics. 2009, 10 (4): 354-366. 10.1093/bib/bbp026.
- Fernandes F, da Fonseca PG, Russo LM, Oliveira AL, Freitas AT: Efficient alignment of pyrosequencing reads for re-sequencing applications. BMC Bioinformatics. 2011, 12: 163. 10.1186/1471-2105-12-163.
- Nei M, Kumar S: Molecular evolution and phylogenetics. 2000, Oxford; New York: Oxford University Press.
- Jukes TH, Cantor CR: Evolution of Protein Molecules. Mammalian Protein Molecules. Edited by: Munro HN. 1969, New York: Academic Press, 21-132.
- Nei M: Phylogenetic analysis in molecular evolutionary genetics. Annual Review of Genetics. 1996, 30 (1): 371-403. 10.1146/annurev.genet.30.1.371.
- Takahashi K, Nei M: Efficiencies of fast algorithms of phylogenetic inference under the criteria of maximum parsimony, minimum evolution, and maximum likelihood when a large number of sequences are used. Mol Biol Evol. 2000, 17 (8): 1251-1258. 10.1093/oxfordjournals.molbev.a026408.
- Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4 (4): 406-425.
- Tamura K, Nei M, Kumar S: Prospects for inferring very large phylogenies by using the neighbor-joining method. Proc Natl Acad Sci U S A. 2004, 101 (30): 11030-11035. 10.1073/pnas.0404206101.
- Tamura K, Dudley J, Nei M, Kumar S: MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol. 2007, 24 (8): 1596-1599. 10.1093/molbev/msm092.
- Yang Z: Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. Journal of Molecular Evolution. 1994, 39 (3): 306-314. 10.1007/BF00160154.
- Kuhner MK, Felsenstein J: A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol Biol Evol. 1994, 11 (3): 459-468.
- Hedges SB, Kumar S: The timetree of life. 2009, Oxford; New York: Oxford University Press.
- Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glockner FO: SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007, 35 (21): 7188-7196. 10.1093/nar/gkm864.
- Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S: MEGA5: Molecular Evolutionary Genetics Analysis Using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods. Mol Biol Evol. 2011, 28 (10): 2731-2739. 10.1093/molbev/msr121.
- Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO: The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Research. 2013, 41 (D1): D590-D596. 10.1093/nar/gks1219.
- Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004, 5: 113. 10.1186/1471-2105-5-113.
- Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, et al: Clustal W and Clustal X version 2.0. Bioinformatics. 2007, 23 (21): 2947-2948. 10.1093/bioinformatics/btm404.
- Kumar S, Stecher G, Peterson D, Tamura K: MEGA-CC: computing core of molecular evolutionary genetics analysis program for automated and iterative data analysis. Bioinformatics. 2012, 28 (20): 2685-2686. 10.1093/bioinformatics/bts507.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.