CD-HIT Suite: Biological Sequence Clustering and Comparison


The CD-HIT package can perform various jobs, such as clustering a protein database, clustering a DNA/RNA database, comparing two databases (protein or DNA/RNA), and generating protein families. More information is available at the CD-HIT home page.

  • cd-hit
    • CD-HIT clusters proteins that meet a similarity threshold, usually a sequence identity. Each cluster has one representative sequence. The input is a protein dataset in FASTA format. The output is a FASTA file of representative sequences and a text file listing the clusters.

  • cd-hit-est
    • CD-HIT-EST clusters nucleotide sequences that meet a similarity threshold, usually a sequence identity. The input is a DNA/RNA dataset in FASTA format. The output is a FASTA file of representative sequences and a text file listing the clusters. It cannot be used for very long sequences, such as full genomes.

  • h-cd-hit
    • Multiple CD-HIT runs. Proteins are first clustered at a high identity (such as 90%), and the resulting non-redundant sequences are then clustered at a lower identity (such as 60%). A third clustering step can be performed at an even lower identity. A multi-step run is more efficient and more accurate than a single run at the final, low identity.

  • h-cd-hit-est
    • Multiple CD-HIT-EST runs: the same multi-step approach as h-cd-hit, applied to nucleotide sequences.

  • cd-hit-2d
    • CD-HIT-2D compares two protein datasets (db1, db2). It identifies the sequences in db2 that are similar to sequences in db1 at a given threshold. The input is two protein datasets (db1, db2) in FASTA format. The output is two files: a FASTA file of the proteins in db2 that are not similar to db1, and a text file that lists the similar sequences between db1 and db2.

  • cd-hit-est-2d
    • CD-HIT-EST-2D works like CD-HIT-2D but compares two nucleotide datasets. For the same reason as CD-HIT-EST, CD-HIT-EST-2D is not suitable for very long sequences.

  • result
    • Retrieve the results of your previously submitted jobs, download sample datasets, or view sample results.

  • calculated clusters
    • Use our FTP site to download pre-calculated sequence clusters for some popular databases, such as NR, Swiss-Prot, and PDB.

  • server usage
    • We are glad if this server can help your research. Although cd-hit is very fast, clustering is still computationally intensive, so we currently limit the file upload size to 100MB. Because cd-hit runs slower with a low identity cutoff, we further limit the upload size to 20MB if the clustering cutoff is below 60% identity for protein sequences. We intend to increase these limits when we upgrade our computer cluster. We also provide pre-calculated clusters on our FTP site. If you need to cluster a larger dataset, please contact us.

      We recommend that you download the zipped file after the job has finished. Jobs older than 90 days are deleted to save disk space.

      Thank you for your understanding.

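For readers who run the standalone programs instead of the web server, the multi-step clustering described under h-cd-hit can be sketched as a chain of cd-hit calls, each one clustering the representatives from the previous step. The sketch below only builds the command lines; it does not run them, and it does not merge the per-step cluster listings into one hierarchy. The flags `-i`, `-o`, `-c`, and `-n` and the word-size table come from the standalone CD-HIT user guide, not from this page, and the file names and the function names `protein_word_size` and `hierarchical_commands` are hypothetical.

```python
import shlex


def protein_word_size(identity):
    """Pick the -n word size for a protein identity threshold.

    Assumption: the ranges follow the standalone CD-HIT user guide
    (5 for 0.7-1.0, 4 for 0.6-0.7, 3 for 0.5-0.6, 2 for 0.4-0.5).
    """
    if not 0.4 <= identity <= 1.0:
        raise ValueError("this sketch only covers thresholds from 0.4 to 1.0")
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    return 2


def hierarchical_commands(input_fasta, identities, prefix="clustered"):
    """Build one cd-hit command per step, from high identity to low.

    Each step takes the representative FASTA file written by the
    previous step as its input.
    """
    identities = list(identities)
    if identities != sorted(identities, reverse=True):
        raise ValueError("identities must go from high to low")
    commands = []
    current = input_fasta
    for identity in identities:
        output = f"{prefix}{round(identity * 100)}"  # hypothetical naming scheme
        commands.append([
            "cd-hit",
            "-i", current,
            "-o", output,
            "-c", f"{identity:.2f}",
            "-n", str(protein_word_size(identity)),
        ])
        current = output  # next step clusters these representatives
    return commands


if __name__ == "__main__":
    for command in hierarchical_commands("proteins.fasta", [0.9, 0.6]):
        print(shlex.join(command))
```

Returning argument lists instead of shell strings keeps the sketch easy to inspect and lets a caller pass each list to `subprocess.run` unchanged if the programs are installed.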

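The text file of clusters produced by cd-hit and cd-hit-est can be read with a few lines of code. The sketch below is a minimal parser, assuming the usual cluster file layout: a `>Cluster N` header followed by one line per member, where the representative is marked with `*` and other members end with `at <identity>%`. This layout is not described on this page, so treat it as an assumption and check it against your own output. The function name `parse_clstr` and the sequence names in the example are hypothetical.

```python
import re

# Assumed member line layout, for example:
#   "0<TAB>2799aa, >seqA... *"            (representative)
#   "1<TAB>2214aa, >seqB... at 80.00%"    (other member)
MEMBER_LINE = re.compile(r"^\d+\s+\d+(?:aa|nt),\s+>(.+?)\.\.\.\s+(\*|at\s+.+)$")


def parse_clstr(lines):
    """Return a list of clusters, each with its representative and members."""
    clusters = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            # A header line opens a new, initially empty cluster.
            clusters.append({"representative": None, "members": []})
            continue
        match = MEMBER_LINE.match(line)
        if match is None or not clusters:
            continue  # skip blank or unrecognised lines
        name, tail = match.groups()
        clusters[-1]["members"].append(name)
        if tail == "*":
            clusters[-1]["representative"] = name
    return clusters


if __name__ == "__main__":
    example = [
        ">Cluster 0",
        "0\t2799aa, >seqA... *",
        ">Cluster 1",
        "0\t2214aa, >seqB... at 80.00%",
        "1\t2215aa, >seqC... *",
    ]
    for number, cluster in enumerate(parse_clstr(example)):
        print(number, cluster["representative"], len(cluster["members"]))
```

Note that sequence names in the cluster file are typically truncated, so matching them back to the original FASTA headers may need a prefix comparison instead of an exact match.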

References:
  1. Ying Huang, Beifang Niu, Ying Gao, Limin Fu and Weizhong Li. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics, 2010(26): 680-682.
  2. Weizhong Li and Adam Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 2006(22): 1658-1659.
  3. Weizhong Li, Lukasz Jaroszewski and Adam Godzik. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 2002(18): 77-82.
  4. Weizhong Li, Lukasz Jaroszewski and Adam Godzik. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 2001(17): 282-283.
Contact @Weizhong Li