CD-HIT Suite: Biological Sequence Clustering and Comparison

Please input your job ID to retrieve the running result:
Results older than 180 days will be removed from the server to save disk space!
Samples
Sample protein dataset:
Download the data, or view the results of h-cd-hit clustering at three identity levels: 90%, 60% and 30%.

Sample protein dataset with annotation term:
This dataset contains proteins annotated with their COG family. Each defline in the FASTA file looks like ">AF0017_1||COG1250", where the COG family ID is appended after ">sequence_name||". Download the data, or view the results of cd-hit clustering at 60% identity.
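The defline convention described above can be illustrated with a minimal Python sketch. The function name `parse_defline` is hypothetical and not part of CD-HIT; the sketch assumes that everything after the first "||" on the defline is the annotation term, and that a defline without "||" simply has no term.

```python
def parse_defline(defline):
    """Split a FASTA defline of the form '>sequence_name||term'.

    Hypothetical helper for illustration only; not part of CD-HIT.
    Returns (sequence_name, term); term is None when no '||' is present.
    """
    text = defline.strip()
    if text.startswith(">"):
        text = text[1:]  # drop the leading '>' of the defline
    # Assumption: the annotation term is everything after the first '||'.
    name, sep, term = text.partition("||")
    return name, (term if sep else None)


name, term = parse_defline(">AF0017_1||COG1250")
print(name, term)
```

For the sample defline this yields the sequence name "AF0017_1" and the COG family ID "COG1250".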

Sample DNA dataset:
Download the data, or view the results of cd-hit-est clustering at 95% identity.

Sample DNA dataset with annotation term:
This dataset contains microbial rRNAs annotated with a Taxonomy ID at genus rank. Each defline in the FASTA file looks like ">NC_009925__1405528_1405648_5S||genus_taxid_15597", where the Taxonomy ID is appended after ">sequence_name||". Download the data, or view the results of cd-hit-est clustering at 95% identity.
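Before submitting an annotated dataset, it can be useful to check how many sequences carry each annotation term. The following is a rough Python sketch under the same assumption as the format described above (the term follows "||" on the defline). The function name `count_annotation_terms` and every record except the first defline are made up for illustration.

```python
from collections import Counter


def count_annotation_terms(fasta_lines):
    """Count how often each annotation term appears in FASTA deflines.

    Hypothetical helper for illustration only; not part of CD-HIT.
    Deflines without '||' are skipped.
    """
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a defline
        _, sep, term = line[1:].strip().partition("||")
        if sep:
            counts[term] += 1
    return counts


# Only the first defline comes from the sample dataset; the rest are invented.
example = [
    ">NC_009925__1405528_1405648_5S||genus_taxid_15597",
    "ACGT",
    ">hypothetical_seq_2||genus_taxid_15597",
    "ACGTAC",
    ">hypothetical_seq_3",
    "ACG",
]
counts = count_annotation_terms(example)
print(counts)
```

In this toy input, two deflines carry the term "genus_taxid_15597" and the unannotated third record is ignored.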



Reference:
  1. Ying Huang, Beifang Niu, Ying Gao, Limin Fu and Weizhong Li. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics, 2010(26): 680-682.
  2. Weizhong Li and Adam Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 2006(22): 1658-1659.
  3. Weizhong Li, Lukasz Jaroszewski and Adam Godzik. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 2002(18): 77-82.
  4. Weizhong Li, Lukasz Jaroszewski and Adam Godzik. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 2001(17): 282-283.
Contact @Weizhong Li