psi-cd-hit-DNA, clustering of very long sequences (genomes or contigs) by psi-cd-hit through blastn or megablast

cd-hit (http://cd-hit.org/) is a very widely used program for clustering and comparing large sets of protein sequences. cd-hit is very fast and can handle extremely large databases. cd-hit helps to significantly reduce the computational and manual efforts in many sequence analysis tasks and aids in understanding the data structure and correct the bias within a dataset.

cd-hit and cd-hit-est can not cluster sequences at low sequence identity cutoffs (e.g. 30% for cd-hit), they can not cluster very long sequences either (e.g. genome sized DNAs). Psi-cd-hit is a script that uses the similar greedy incremental clustering strategy like cd-hit, but it uses BLAST to calculate the sequence similarities.

This server is desinged for clustering proteins at low sequence identity cutoffs.

NOTE: We would suggest to use h-cd-hit or h3-cd-hit to clustering proteins at low sequence identity cutoffs, unless your input sequences are already clustered.
Inputs:
DNA FASTA file (required), can be in .gz format
Outputs:
output.zip will be produced with a README file describing the output files and format
Sequence file to upload (required):
Email (optional):
Parameters
show description
    -c  sequence identity threshold, default 0.3
    -ce clustering threshold (blast expect), default -1, 
        it means by default it doesn't use expect threshold, 
        but with positive value, the program cluster seqs if similarities
        meet either identity threshold or expect threshold 
    -G  (1/0) use global identity? default 1
        two sequences Long (i.e. representative) and Short (redunant) may have multiple
        alignment fragments (i.e. HSPs), see:
        seq1  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   Long sequence
                   ||||||||||||||||||             /////////////          i.e. representative
                   ||||||||||||||||||            /////////////                sequence
                   ||||||||HSP 1 ||||           ////HSP 2 ///
                   ||||||||||||||||||          /////////////
                   ||||||||||||||||||         /////////////
        seq2    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx          Short sequence
                   <<  length 1    >>        <<   len 2 >>               i.e. redundant
                <<<<<<<<<<<< length of short sequence >>>>>>>>>>>>>>          sequence

                          total identical letters from all co-linear and non-overlapping HSPs
        Glogal identity = -------------------------------------------------------------------
                                        length of short sequence
        Local identity  = identity of the top high score HSP
        if you prefer to use -G 0, it is suggested that you also
        use -aS, -aL, such as -aS 0.8, to prevent very short matches.
    -aL	alignment coverage for the longer sequence, default 0.0
 	if set to 0.9, the alignment must covers 90% of the sequence
    -aS	alignment coverage for the shorter sequence, default 0.0
 	if set to 0.9, the alignment must covers 90% of the sequence
    -g  (1/0), default 0
        by cd-hit's default algorithm, a sequence is clustered to the first 
        cluster that meet the threshold (fast cluster). If set to 1, the program
        will cluster it into the most similar cluster that meet the threshold
        (accurate but slow mode)
        but either 1 or 0 won't change the representatives of final clusters

EXAMPLE
Sequence file to upload (required): input.fasta
Email (optional): you@example.com
Parameters -c 0.3
show description
    -c  sequence identity threshold, default 0.3
    -ce clustering threshold (blast expect), default -1, 
        it means by default it doesn't use expect threshold, 
        but with positive value, the program cluster seqs if similarities
        meet either identity threshold or expect threshold 
    -G  (1/0) use global identity? default 1
        two sequences Long (i.e. representative) and Short (redunant) may have multiple
        alignment fragments (i.e. HSPs), see:
        seq1  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   Long sequence
                   ||||||||||||||||||             /////////////          i.e. representative
                   ||||||||||||||||||            /////////////                sequence
                   ||||||||HSP 1 ||||           ////HSP 2 ///
                   ||||||||||||||||||          /////////////
                   ||||||||||||||||||         /////////////
        seq2    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx          Short sequence
                   <<  length 1    >>        <<   len 2 >>               i.e. redundant
                <<<<<<<<<<<< length of short sequence >>>>>>>>>>>>>>          sequence

                          total identical letters from all co-linear and non-overlapping HSPs
        Glogal identity = -------------------------------------------------------------------
                                        length of short sequence
        Local identity  = identity of the top high score HSP
        if you prefer to use -G 0, it is suggested that you also
        use -aS, -aL, such as -aS 0.8, to prevent very short matches.
    -aL	alignment coverage for the longer sequence, default 0.0
 	if set to 0.9, the alignment must covers 90% of the sequence
    -aS	alignment coverage for the shorter sequence, default 0.0
 	if set to 0.9, the alignment must covers 90% of the sequence
    -g  (1/0), default 0
        by cd-hit's default algorithm, a sequence is clustered to the first 
        cluster that meet the threshold (fast cluster). If set to 1, the program
        will cluster it into the most similar cluster that meet the threshold
        (accurate but slow mode)
        but either 1 or 0 won't change the representatives of final clusters

Show an example
Submitting......
Program/Database References
1. "Clustering of highly homologous sequences to reduce the size of large protein database", Weizhong Li, Lukasz Jaroszewski and Adam Godzik Bioinformatics (2001) 17:282-283.
2. "Tolerating some redundancy significantly speeds up clustering of large protein databases", Weizhong Li, Lukasz Jaroszewski and Adam Godzik Bioinformatics (2002) 18:77-82.
3. "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li and Adam Godzik Bioinformatics (2006) 22:1658-1659.
4. "CD-HIT Suite: a web server for clustering and comparing biological sequences", Ying Huang, Beifang Niu, Ying Gao, Limin Fu and Weizhong Li Bioinformatics (2010) 26:680-682.
Program/Database Version
Program: psi-cd-hit beta
Database: N/A