cdhit_454: Identify artificial duplicates from metagenomic samples

454 pyrosequencing reads contain artificial duplicates, which might lead to misleading conclusions. cdhit-454 is a fast program to identify exact and nearly identical duplicates: reads that begin at the same position but may vary in length or bear mismatches. cdhit-454 can process a dataset in ~10 minutes. It also provides a consensus sequence for each group of duplicates. Duplicated reads observed in a pyrosequencing run include not only artificial duplicates but also natural duplicates, that is, reads from the same origin that occur at the same genome position by chance. Therefore, it is important not only to identify the duplicates, but also to know how they break down into artificial and natural duplicates.
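To make the duplicate definition above concrete, here is a minimal Python sketch that groups reads sharing the same start, tolerating length differences and a few mismatches. It is an illustration only, not the cdhit-454 algorithm: the function names and the `max_mismatch_rate` threshold are hypothetical, and the comparison is a plain position-by-position prefix check that ignores insertions and deletions.

```python
def prefix_mismatches(a, b):
    """Count mismatches over the shared prefix of two reads (no indels)."""
    return sum(1 for x, y in zip(a, b) if x != y)


def group_duplicates(reads, max_mismatch_rate=0.04):
    """Greedily group reads that look like duplicates of a longer read.

    Assumption (not from cdhit-454): two reads are duplicates when the
    shorter one matches the start of the longer one with at most
    max_mismatch_rate * overlap mismatches.
    Returns a list of groups, each a list of indices into `reads`.
    """
    # Process longest reads first so each group is led by its longest member.
    order = sorted(range(len(reads)), key=lambda i: -len(reads[i]))
    groups = []  # each entry: (representative index, member indices)
    for i in order:
        for rep, members in groups:
            overlap = min(len(reads[i]), len(reads[rep]))
            allowed = max_mismatch_rate * overlap
            if overlap and prefix_mismatches(reads[i], reads[rep]) <= allowed:
                members.append(i)
                break
        else:
            # No existing group accepted this read, so it starts a new one.
            groups.append((i, [i]))
    return [members for _, members in groups]


reads = ["ACGTACGTAC", "ACGTAC", "TTTTTTTT", "ACGTACGTAC"]
print(group_duplicates(reads))
```

In this toy input the second and fourth reads join the group led by the first read, while the third read stays alone. A real tool would also need to handle indels, which this sketch deliberately leaves out.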
In the cdhit-454 package, we provide a tool, meta-simulation, to estimate the number of natural duplicates under a hypothetical metagenomic sample type defined by users. Users provide the number of reads and the size and abundance of genomes in a hypothetical sample (see template) to run this program, and can then decide whether to retain or remove duplicates in their projects.
cdhit-454 is available from
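As a rough illustration of how natural duplicates can be estimated from the number of reads and the size and abundance of genomes, the Python sketch below simulates read start positions and counts collisions. It is not the meta-simulation tool: the function names are hypothetical, and the model assumes that a read comes from a genome with probability proportional to size times abundance, that start positions are uniform along the genome, and that both strands are equally likely.

```python
import random
from collections import Counter


def simulate_natural_duplicates(num_reads, genomes, seed=0):
    """Count reads that share (genome, strand, start) with an earlier read.

    genomes: list of (size_bp, relative_abundance) tuples (hypothetical input
    format). Assumes uniform start positions and equal strand probability.
    """
    rng = random.Random(seed)
    # Assumption: larger and more abundant genomes contribute more reads.
    weights = [size * abundance for size, abundance in genomes]
    picks = rng.choices(range(len(genomes)), weights=weights, k=num_reads)
    starts = Counter()
    for g in picks:
        size = genomes[g][0]
        starts[(g, rng.choice("+-"), rng.randrange(size))] += 1
    # Every read beyond the first at a given start is a natural duplicate.
    return sum(count - 1 for count in starts.values())


def expected_natural_duplicates(num_reads, genomes):
    """Expected duplicate count under the same model, without simulation."""
    total = sum(size * abundance for size, abundance in genomes)
    expected_distinct = 0.0
    for size, abundance in genomes:
        slots = 2 * size  # start positions on both strands
        p = (size * abundance / total) / slots  # chance a read hits one slot
        expected_distinct += slots * (1 - (1 - p) ** num_reads)
    return num_reads - expected_distinct


# Hypothetical sample: two genomes, the smaller one ten times more abundant.
sample = [(4_000_000, 1.0), (2_000_000, 10.0)]
print(simulate_natural_duplicates(100_000, sample))
print(expected_natural_duplicates(100_000, sample))
```

Comparing such an estimate with the number of duplicates actually observed gives a rough sense of how many are likely artificial. Real read sampling is less uniform than assumed here, so the numbers are indicative only.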

  1. Beifang Niu, Limin Fu, Shulei Sun and Weizhong Li. Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics 2010 11:187 doi:10.1186/
