ESPRIT: Estimating Species Richness Using Large Collections of 16S rRNA Pyrosequences

ESPRIT-Tree: Hierarchical Clustering Analysis of Millions of 16S rRNA Pyrosequences in Quasi-linear Time

Y. Cai* and Y. Sun* (equal contribution)

Nucleic Acids Research, vol. 39, no. 14, e95, 2011.

Taxonomy independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding OTUs, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online-learning based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree constructed using a pseudometric, and then recursively refine a clustering structure in these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota dataset with over one million sequences. The new algorithm exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard average-linkage based hierarchical clustering algorithm.

o Manuscript

o Software (We are still optimizing the algorithm. However, the beta version of the algorithm is available upon request)

o User manual

o Simulation data and code used in the benchmark study