CD-HIT is a widely used program for clustering and comparing protein or nucleotide sequences. CD-HIT was originally developed by Dr. Weizhong Li in Dr. Adam Godzik's lab at the Burnham Institute (now Sanford-Burnham Medical Research Institute).
CD-HIT is very fast and can handle extremely large databases. It significantly reduces the computational and manual effort in many sequence analysis tasks, and it aids in understanding the structure of a dataset and correcting the bias within it.
The CD-HIT package includes CD-HIT, CD-HIT-2D, CD-HIT-EST, CD-HIT-EST-2D, CD-HIT-454, CD-HIT-PARA, PSI-CD-HIT, CD-HIT-OTU and over a dozen scripts.
- CD-HIT (CD-HIT-EST) groups similar proteins (DNA sequences) into clusters that meet a user-defined similarity threshold.
- CD-HIT-2D (CD-HIT-EST-2D) compares two datasets and identifies the sequences in db2 that are similar to sequences in db1 above a threshold.
- CD-HIT-454 identifies natural and artificial duplicates from pyrosequencing reads.
- CD-HIT-OTU clusters rRNA tags into operational taxonomic units (OTUs).
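
To make the idea of threshold-based clustering and two-set comparison concrete, here is a minimal Python sketch. It is not CD-HIT's code: the function names (`toy_identity`, `greedy_cluster`, `compare_two_sets`) are hypothetical, and the identity measure is a deliberately crude, ungapped stand-in for CD-HIT's alignment-based sequence identity. It only illustrates the general pattern of processing sequences from longest to shortest and assigning each one to the first representative it matches above the threshold.

```python
from typing import Dict, List


def toy_identity(a: str, b: str) -> float:
    """Toy measure (an assumption, not CD-HIT's method): fraction of matching
    positions over the shorter sequence, ungapped and left-aligned."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / shorter


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Map each representative ID to the IDs of the sequences in its cluster."""
    clusters: Dict[str, List[str]] = {}
    # Longest sequences are visited first, so they become representatives.
    for seq_id in sorted(seqs, key=lambda k: len(seqs[k]), reverse=True):
        for rep_id in clusters:
            if toy_identity(seqs[seq_id], seqs[rep_id]) >= threshold:
                clusters[rep_id].append(seq_id)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[seq_id] = [seq_id]
    return clusters


def compare_two_sets(db1: Dict[str, str], db2: Dict[str, str],
                     threshold: float) -> Dict[str, str]:
    """For each db2 sequence, report the first db1 sequence it matches, if any."""
    hits: Dict[str, str] = {}
    for id2, seq2 in db2.items():
        for id1, seq1 in db1.items():
            if toy_identity(seq2, seq1) >= threshold:
                hits[id2] = id1
                break
    return hits
```

For example, `greedy_cluster(seqs, 0.85)` returns a dictionary keyed by representative ID, and `compare_two_sets(db1, db2, 0.85)` returns only the db2 sequences that found a match in db1. The real programs operate on FASTA files and use a far more efficient and accurate identity calculation.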
CD-HIT is currently maintained by Dr. Li's group (http://weizhong-lab.ucsd.edu/) with support from the National Center for Research Resources (Grant # 1R01RR025030).
NEWS
(June 2011) cd-hit-otu is a special cd-hit extension for clustering rRNA tags into OTUs. It is very fast and very accurate.
(July 2010) cdhit@GoogleCode is a new Google Code project created for releasing the latest development version of CDHIT. New minor versions will usually be released as soon as bug fixes or improvements become available.
(Oct. 2009) CDHIT-454 is a new program to identify exact duplicates and near-identical duplicates in pyrosequencing reads: CDHIT-454 (web server), CDHIT-454 (standalone).
(September 2009) The CD-HIT web server is now available for running cd-hit or downloading pre-calculated clusters.
(December 2006) I made some major updates, including several very useful new clustering options such as alignment coverage control and switching between local and global sequence identity. Please check the newest release and give it a try. Weizhong Li.
(February 2006) I recently developed several new programs based on CD-HIT's algorithm: CD-HIT-2D, CD-HIT-EST and CD-HIT-EST-2D. CD-HIT-2D compares two protein sets and reports similar matches between them. CD-HIT-EST and CD-HIT-EST-2D are the nucleotide versions. Weizhong Li.
