cd-hit-454 is a program that identifies artificial duplicates in raw 454 sequencing reads, including both exact duplicates and nearly identical duplicates.

cdhit-cluster-consensus is a companion program for cd-hit-454 that generates consensus sequences from clusters of duplicated sequences. It takes the clusters output by cd-hit-454, together with the sequences in FASTA or FASTQ format, as input, builds alignments, and calculates the consensus sequences.
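As a rough illustration of the consensus step described above, the following Python sketch reads a cluster listing and takes a majority vote over each alignment column. It is not the code of the consensus program, which is written in Dao and builds the alignments itself. The function names `parse_clstr` and `column_consensus` are hypothetical, the cluster text is assumed to follow the usual CD-HIT `.clstr` layout (a `>Cluster N` header followed by one member line per read), and the reads of a cluster are assumed to be already aligned and gap-padded with `-`.

```python
from collections import Counter


def parse_clstr(lines):
    """Parse CD-HIT style .clstr text into {cluster_id: [read_name, ...]}.

    Assumed layout (not taken from this page):
        >Cluster 0
        0   250nt, >read1... *
        1   248nt, >read2... at +/99.60%
    """
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = []
        elif current is not None:
            # The read name sits between the first '>' and the '...' marker.
            name = line.split(">", 1)[1].split("...", 1)[0]
            clusters[current].append(name)
    return clusters


def column_consensus(aligned):
    """Majority-vote consensus over equal-length, gap-padded sequences.

    This is a simplified stand-in for the real consensus calculation:
    each column contributes its most frequent symbol, and columns where
    the gap character wins the vote are dropped.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    consensus = []
    for column in zip(*aligned):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != "-":
            consensus.append(symbol)
    return "".join(consensus)
```

To use the sketch, pass the lines of a cluster file to `parse_clstr` to obtain the read names in each cluster, look those reads up in the FASTA or FASTQ input, align them with a tool of your choice, and hand the aligned strings to `column_consensus`.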

cd-hit-454 is implemented in C++, and cdhit-cluster-consensus is implemented in Dao. The Dao interpreter is also distributed with this program.

The new versions (released since January 26, 2011) include an option to output sequence alignments of clusters.

The new versions (released since July 19, 2010) have no dependency on clustalw, and are much faster than the previous versions.



The CD-HIT-454 main program is now merged into the CD-HIT package, which is available from:

The consensus program is distributed below.

Source code download: cdhit-cluster-consensus-2011-02-17.tgz

1. Beifang Niu, Limin Fu, Shulei Sun and Weizhong Li. Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics 2010 11:187 doi:10.1186/1471-2105-11-187.