The CD-HIT package can perform various jobs, such as clustering a protein database,
clustering a DNA/RNA database, comparing two databases (protein or DNA/RNA), and
generating protein families.
More information is available at the CD-HIT home page.
News! Gclust server is now available. Gclust is a new tool developed for clustering very long microbial genomic sequences.
- cd-hit
- CD-HIT clusters proteins that meet a similarity threshold, usually
a sequence identity. Each cluster has one representative sequence. The input is a protein
dataset in FASTA format. The output is a FASTA file of representative sequences
and a text file listing the clusters.
- cd-hit-est
- CD-HIT-EST clusters nucleotide sequences that meet a similarity threshold, usually
a sequence identity. The input is a DNA/RNA dataset in FASTA format.
The output is a FASTA file of representative sequences
and a text file listing the clusters. It cannot be used for very long sequences, such as full genomes.
- h-cd-hit
- Multiple CD-HIT runs. Proteins are first clustered at a high identity (like 90%); the non-redundant sequences
are then clustered at a lower identity (like 60%). A third clustering step can be performed at an even lower identity.
A multi-step run is more efficient and more accurate than a single run at the final identity.
- h-cd-hit-est
- Multiple CD-HIT-EST runs.
- cd-hit-2d
- CD-HIT-2D compares two protein datasets (db1, db2). It identifies the sequences in db2 that
are similar to db1 at a certain threshold. The inputs are two protein datasets (db1, db2) in
FASTA format, and the outputs are two files: a FASTA file of the proteins in db2 that are not similar
to db1, and a text file that lists similar sequences between db1 and db2.
- cd-hit-est-2d
- Like CD-HIT-2D, CD-HIT-EST-2D compares two nucleotide datasets.
For the same reason as CD-HIT-EST, CD-HIT-EST-2D is not suitable for very long sequences.
- result
- Retrieve the results of your previously submitted jobs, download sample datasets, or view sample results.
- calculated clusters
- Use our FTP site to download pre-calculated sequence
clusters for some popular databases, such as NR, Swissprot, and PDB.
- server usage
- We are glad if this server can help your research. Although cd-hit is very fast, clustering is
still very computationally intensive. We currently limit the file upload size to 100MB. Because cd-hit runs slower with
a low identity cutoff, we further limit the upload size to 20MB if the clustering cutoff is <60% identity
for protein sequences.
We intend to increase the limit when we upgrade our computer cluster.
We also provide pre-calculated clusters on our FTP site. If you need to cluster a larger dataset, please
contact us.
We recommend that you download the zipped file after the job has finished. We delete jobs older than 90 days to save disk space.
Thank you for your understanding.
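
The upload limits described under server usage can be checked before submitting a job. The following is a minimal sketch, not the server's own code: the function names are hypothetical, and it assumes that "MB" means decimal megabytes (the page does not say) and that the identity cutoff is given as a fraction between 0 and 1.

```python
# Assumption: decimal megabytes; the page does not define "MB".
MB = 1000 * 1000

GENERAL_LIMIT_MB = 100       # stated limit for any upload
LOW_IDENTITY_LIMIT_MB = 20   # stated limit for protein jobs below 60% identity
LOW_IDENTITY_CUTOFF = 0.60   # 60% identity expressed as a fraction


def upload_limit_bytes(identity_cutoff: float, is_protein: bool) -> int:
    """Return the largest upload, in bytes, allowed for a job (hypothetical helper)."""
    if is_protein and identity_cutoff < LOW_IDENTITY_CUTOFF:
        return LOW_IDENTITY_LIMIT_MB * MB
    return GENERAL_LIMIT_MB * MB


def upload_allowed(size_bytes: int, identity_cutoff: float, is_protein: bool) -> bool:
    """Return True if a file of size_bytes fits under the limit for this job."""
    return size_bytes <= upload_limit_bytes(identity_cutoff, is_protein)
```

For example, `upload_allowed(50 * MB, 0.5, True)` is false because a protein job at 50% identity falls under the 20MB limit, while the same file at 90% identity would be accepted.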
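
The multi-step run described under h-cd-hit can be sketched as a chain of command lines in which each run takes the previous run's representative sequences as input. This is only an illustration under assumptions, not the server's pipeline: it assumes a locally installed command-line `cd-hit` with its `-i`, `-o`, `-c`, and `-n` options, the word sizes follow the ranges suggested in the CD-HIT user's guide, the function names and output file naming are hypothetical, and any step that merges the per-run cluster lists is left out. The commands are only built, not executed.

```python
from typing import List


def word_size(identity: float) -> int:
    """Pick the cd-hit word size (-n) for a protein identity cutoff."""
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    if identity >= 0.4:
        return 2
    raise ValueError("identity cutoff below 0.4 is not covered by this sketch")


def hierarchical_commands(fasta: str, cutoffs: List[float]) -> List[List[str]]:
    """Build one cd-hit command per cutoff, chaining outputs to inputs.

    The output naming scheme (<fasta>.nr<percent>) is hypothetical.
    """
    commands = []
    current_input = fasta
    for cutoff in cutoffs:
        output = f"{fasta}.nr{round(cutoff * 100)}"
        commands.append([
            "cd-hit",
            "-i", current_input,           # input FASTA file
            "-o", output,                  # representative sequences
            "-c", str(cutoff),             # sequence identity threshold
            "-n", str(word_size(cutoff)),  # word size matching the threshold
        ])
        current_input = output             # next run clusters the representatives
    return commands


if __name__ == "__main__":
    for cmd in hierarchical_commands("proteins.fasta", [0.9, 0.6]):
        print(" ".join(cmd))
```

Each later run sees only the representatives of the earlier one, which is why the lower-identity step works on a smaller dataset.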