454 pyrosequencing reads contain artificial duplicates,
which might lead to misleading conclusions.
cdhit-454 is a fast program that identifies exact and nearly
identical duplicates: reads that begin at the same position but may vary
in length or contain mismatches.
cdhit-454 can process a dataset in ~10 minutes.
It also provides a consensus sequence for each group of duplicates.
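
The duplicate definition above (same start position, possibly different lengths, a few mismatches) can be illustrated with a minimal Python sketch. This is not the cdhit-454 implementation; the function names, the greedy longest-first grouping, the majority-vote consensus, and the max_mismatch_frac threshold are all assumptions made for illustration.

```python
from collections import Counter


def is_duplicate(rep, read, max_mismatch_frac=0.04):
    """Return True if `read` looks like a duplicate of `rep`.

    Both reads are assumed to start at the same position, so they are
    compared base by base over the length of the shorter one.
    The default threshold is a hypothetical value, not a cdhit-454 default.
    """
    overlap = min(len(rep), len(read))
    if overlap == 0:
        return False
    mismatches = sum(1 for a, b in zip(rep, read) if a != b)
    return mismatches <= max_mismatch_frac * overlap


def group_duplicates(reads, max_mismatch_frac=0.04):
    """Greedily group reads; the longest read seen first represents each group."""
    groups = []  # groups[i][0] is the representative of group i
    for read in sorted(reads, key=len, reverse=True):
        for group in groups:
            if is_duplicate(group[0], read, max_mismatch_frac):
                group.append(read)
                break
        else:
            groups.append([read])
    return groups


def consensus(group):
    """Column-wise majority vote over reads aligned at their first base."""
    length = max(len(read) for read in group)
    bases = []
    for i in range(length):
        column = [read[i] for read in group if len(read) > i]
        bases.append(Counter(column).most_common(1)[0][0])
    return "".join(bases)
```

For example, group_duplicates(["ACGTACGTAC", "ACGTACG", "TTTTGGGGCC"], 0.0) places the first two reads in one group, because the shorter read is an exact prefix of the longer one, and leaves the third read in a group of its own.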
Duplicated reads observed in a pyrosequencing run include
not only artificial duplicates but also natural duplicates - reads from the
same origin that occur at the same genome position by chance.
Therefore, it is important not only to identify the duplicates,
but also to know how many of them are artificial
and how many are natural.
In the cdhit-454 package, we provide a tool, meta-simulation,
to estimate the number of natural
duplicates in a hypothetical, user-defined metagenomic sample.
Users provide the number of reads and the sizes and abundances of the genomes in a
hypothetical sample (see template),
run this program, and can then
decide whether to retain or remove duplicates in their projects.
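
The idea behind such an estimate can be sketched as a small Monte Carlo simulation in Python. This is not the meta-simulation tool itself; the function name, the input format, and the sampling model (reads drawn in proportion to genome size times abundance, with a uniform start position and a random strand) are assumptions made for illustration.

```python
import random
from collections import Counter


def simulate_natural_duplicates(num_reads, genomes, seed=0):
    """Count reads that share a start position with an earlier read by chance.

    genomes: list of (genome_size_bp, relative_abundance) tuples (assumed format).
    """
    rng = random.Random(seed)
    sizes = [size for size, _ in genomes]
    # Assumption: the chance of sampling a genome is proportional to
    # its size multiplied by its abundance.
    weights = [size * abundance for size, abundance in genomes]
    starts = Counter()
    for genome_index in rng.choices(range(len(genomes)), weights=weights, k=num_reads):
        position = rng.randrange(sizes[genome_index])  # uniform start position
        strand = rng.choice("+-")                      # random strand
        starts[(genome_index, position, strand)] += 1
    # Every read beyond the first at a given start counts as a natural duplicate.
    return sum(count - 1 for count in starts.values())
```

A call such as simulate_natural_duplicates(500000, [(4600000, 10.0), (2000000, 1.0)]) would return the number of chance duplicates for one simulated sample; the genome sizes and abundances in this call are made-up values. Comparing that figure with the number of duplicates actually found in a run gives a rough sense of how many are likely to be artificial.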
cdhit-454 is available from http://cd-hit.org