Identify duplicated paired-end (PE) reads (cd-hit-dup-PE)

cd-hit (http://cd-hit.org) is a widely used program for clustering and comparing large sets of biological sequences. cd-hit-dup is a program in the cd-hit package for identifying duplicated reads, especially Illumina paired-end (PE) or single-end (SE) reads.


Inputs:
  1. R1 reads in FASTA format (required); may be gzip-compressed (.gz)
  2. R2 reads in FASTA format (required); may be gzip-compressed (.gz)
Outputs:
  output.zip is produced; it contains a README file describing the output files and their formats
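
Before uploading, it can help to confirm that the R1 and R2 files really are paired. The following is a minimal sketch, not part of cd-hit-dup: the helper names (open_maybe_gzip, fasta_ids, check_paired) are hypothetical, and it assumes that paired files simply contain the same number of FASTA records in the same order.

```python
import gzip


def open_maybe_gzip(path):
    """Open a FASTA file as text, transparently handling a .gz extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def fasta_ids(path):
    """Yield the identifier (first word after '>') of each FASTA record."""
    with open_maybe_gzip(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                yield fields[0] if fields else ""


def check_paired(r1_path, r2_path):
    """Return the number of read pairs; raise ValueError if the counts differ.

    Assumption: R1 and R2 are paired when both files hold the same number
    of records. Identifier suffixes such as /1 and /2 are not compared.
    """
    n1 = sum(1 for _ in fasta_ids(r1_path))
    n2 = sum(1 for _ in fasta_ids(r2_path))
    if n1 != n2:
        raise ValueError(f"R1 has {n1} records but R2 has {n2}")
    return n1
```

Calling check_paired("R1.fasta.gz", "R2.fasta.gz") on hypothetical file names returns the pair count, or raises an error if one file has lost records.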
Sequence file to upload (required):
Sequence file for R2 reads (required):
Email (optional):
Parameters
    -u        Length of prefix to be used in the analysis (default 0, for full/maximum length);
    -m        Match length (true/false, default true);
    -e        Maximum number of mismatches allowed;
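
To make the three parameters concrete, here is a small illustrative sketch of how they could combine into a duplicate test. It is not the cd-hit-dup implementation: the function names are hypothetical, and treating a read pair as duplicated only when both mates pass the test, with the mismatch limit applied to each mate separately, is an assumption made for this example.

```python
def within_mismatches(a, b, max_mismatches):
    """True if equal-length strings a and b differ at no more than max_mismatches positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


def is_duplicate(read_a, read_b, prefix_len=0, match_length=True, max_mismatches=0):
    """Illustrative duplicate test mirroring -u, -m and -e (assumed semantics)."""
    # -m: when true, reads of different lengths are never duplicates.
    if match_length and len(read_a) != len(read_b):
        return False
    # -u: 0 means use the full length; otherwise compare only the prefix.
    if prefix_len > 0:
        read_a, read_b = read_a[:prefix_len], read_b[:prefix_len]
    # If lengths still differ (-m false), compare the shared leading portion.
    n = min(len(read_a), len(read_b))
    # -e: tolerate up to max_mismatches differing positions.
    return within_mismatches(read_a[:n], read_b[:n], max_mismatches)


def is_duplicate_pair(pair_a, pair_b, **params):
    """Assumption: a PE pair is duplicated only if both R1 and R2 are duplicates."""
    return (is_duplicate(pair_a[0], pair_b[0], **params)
            and is_duplicate(pair_a[1], pair_b[1], **params))
```

For instance, is_duplicate("ACGTACGT", "ACGTTTTT", prefix_len=4) is true because only the first four bases are compared, whereas the same call with the default prefix_len=0 is false.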
EXAMPLE
Sequence file to upload (required): input.fasta
Email (optional): you@example.com
Parameters -u 50 -e 2
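
For readers who want to run a comparable job locally rather than through this form, the sketch below assembles a command line from the same parameters. The -u, -m and -e flags are the ones described on this page; the -i, -i2, -o and -o2 flags for the input and output files are an assumption based on the standalone cd-hit-dup program and should be checked against its usage message. The function name and file names are hypothetical, and the command is only built, not executed.

```python
def build_cd_hit_dup_args(r1, r2, out1, out2,
                          prefix_len=None, match_length=None, max_mismatches=None):
    """Build an argument list for a local cd-hit-dup run (flags partly assumed)."""
    # Assumed flags: -i/-i2 for the R1/R2 inputs, -o/-o2 for the outputs.
    args = ["cd-hit-dup", "-i", r1, "-i2", r2, "-o", out1, "-o2", out2]
    # -u: prefix length; omitted to keep the default (0, full length).
    if prefix_len is not None:
        args += ["-u", str(prefix_len)]
    # -m: match length, passed as the literal words true/false.
    if match_length is not None:
        args += ["-m", "true" if match_length else "false"]
    # -e: maximum number of mismatches allowed.
    if max_mismatches is not None:
        args += ["-e", str(max_mismatches)]
    return args


# Mirrors the example parameters above: -u 50 -e 2
example_args = build_cd_hit_dup_args(
    "R1.fasta", "R2.fasta", "R1.nodup.fasta", "R2.nodup.fasta",
    prefix_len=50, max_mismatches=2,
)
```

The resulting list can be passed to subprocess.run once the flags have been verified against the installed version.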
Program/Database References
1. "Clustering of highly homologous sequences to reduce the size of large protein databases", Weizhong Li, Lukasz Jaroszewski and Adam Godzik. Bioinformatics (2001) 17:282-283.
2. "Tolerating some redundancy significantly speeds up clustering of large protein databases", Weizhong Li, Lukasz Jaroszewski and Adam Godzik. Bioinformatics (2002) 18:77-82.
3. "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li and Adam Godzik. Bioinformatics (2006) 22:1658-1659.
4. "CD-HIT Suite: a web server for clustering and comparing biological sequences", Ying Huang, Beifang Niu, Ying Gao, Limin Fu and Weizhong Li. Bioinformatics (2010) 26:680-682.
Program/Database Version
Program: cd-hit-dup
Database: N/A