A fundamental question in metagenomics is the estimation of microbial diversity, often described in terms of Operational Taxonomic Units (OTUs). Pyrosequencing-based 16S ribosomal RNA surveys are the most common approach for OTU identification. However, noise in PCR amplification and sequencing often causes over-estimation of the number of OTUs, by up to orders of magnitude.

To address this problem, new methods such as "SLP" were developed, and the current gold standard is flowgram clustering as implemented in "PyroNoise", "Denoiser" and "AmpliconNoise". Although these methods remove much more noise than earlier approaches, they still produce many spurious OTUs (in some cases >50%). In addition, both SLP and flowgram-based methods are computationally intensive, which prevents their use on very large datasets.

Compared to these methods, CD-HIT-OTU, the new tool introduced here, has comparable accuracy in identifying true OTUs and produces far fewer spurious OTUs. CD-HIT-OTU is also ~2-4 orders of magnitude faster, especially for longer reads. It was developed from our ultra-fast clustering program CD-HIT with several unique algorithms and is available from http://weizhong-lab.ucsd.edu/cd-hit-otu. It is also available as a web server through WebMGA, a collection of web servers we implemented for metagenomic analysis, at http://weizhong-lab.ucsd.edu/metagenomic-analysis/server/cd-hit-otu.

CD-HIT-OTU has the following steps: (1) Low-quality reads are filtered out and extra-long tails are trimmed. (2) Filtered reads are clustered at 100% identity using CD-HIT-DUP. (3) Chimeric reads are identified. (4) Secondary clusters are recruited into primary clusters. (5) Noise sequences in clusters of size x or below are removed, where x is calculated statistically. (6) The remaining representative reads from non-chimeric clusters are clustered into OTUs at a user-specified OTU cutoff (e.g. 97% identity at the species level).
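
As a rough illustration of steps (2), (5) and (6), the following Python sketch dereplicates reads, drops small clusters, and greedily groups the remaining representatives at an identity cutoff. It is not the CD-HIT-OTU implementation: the function names are hypothetical, the noise threshold x is passed in as a fixed number rather than calculated statistically, identity is a toy position-by-position comparison instead of a real alignment, and chimera detection (3) and secondary-cluster recruitment (4) are omitted.

```python
from collections import Counter


def dereplicate(reads):
    """Group identical reads; return (sequence, count) pairs, most abundant first."""
    counts = Counter(reads)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def identity(a, b):
    """Toy identity: matching positions divided by the longer length (no alignment)."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def drop_small_clusters(clusters, x):
    """Remove clusters of size x or below (x is an assumed, fixed threshold here)."""
    return [(seq, n) for seq, n in clusters if n > x]


def greedy_otus(clusters, cutoff=0.97):
    """Assign each representative to the first OTU seed it matches, else start a new OTU."""
    otus = []  # each entry is [seed_sequence, total_read_count]
    for seq, n in clusters:
        for otu in otus:
            if identity(seq, otu[0]) >= cutoff:
                otu[1] += n
                break
        else:
            otus.append([seq, n])
    return otus


# Made-up example data: two real sequences, one near-identical variant, one singleton.
base = "ACGT" * 25            # 100 bases
variant = "C" + base[1:]      # one mismatch, 99% identical to base
other = "TTGCA" * 20          # 100 bases, unrelated to base
noise = "G" * 100             # seen once, treated as noise

reads = [base] * 5 + [variant] * 3 + [other] * 4 + [noise]
clusters = drop_small_clusters(dereplicate(reads), x=1)
otus = greedy_otus(clusters, cutoff=0.97)
print([n for _, n in otus])   # prints [8, 4]
```

Processing clusters from most to least abundant means that the largest clusters become OTU seeds, so a low-abundance variant is absorbed into an existing OTU rather than founding a new one.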

CD-HIT-OTU is currently maintained by Dr. Sitao Wu (siw006@ucsd.edu) in Dr. Li's group (http://weizhong-lab.ucsd.edu/).


(Oct 2011) The CD-HIT-OTU wiki has been updated.
(Oct 2011) CD-HIT-OTU 0.0.1 for Illumina rRNA data is released.
(September 2011) CD-HIT-OTU 0.0.2 for 454 rRNA data is released.
(July 2011) A new CD-HIT-OTU is available for processing Illumina 16S reads (single or paired) for very accurate OTU identification. Please contact us for the beta software.
(May 2011) The CD-HIT-OTU web site was created on May 12, 2011.