News (July 2010)
cdhit@GoogleCode
is a new Google Code project created for releasing the latest development version
of CDHIT. New minor versions will usually be released as soon as bug fixes
or improvements become available.
News (Oct. 2009)
CDHIT-454
is a new program that identifies exact and near-identical duplicates in pyrosequencing reads: CDHIT-454 (web server),
CDHIT-454 (standalone).
News (September 2009)
The CD-HIT web server is now available
for running cd-hit or downloading pre-calculated clusters.
News (December 2006)
I made some major updates, including several very useful new
options for clustering, such as alignment coverage control and
switching between local and global sequence identity.
Please check the newest release and give it a try. Weizhong Li.
News (February 2006)
I recently developed several new programs based on CD-HIT's algorithm:
CD-HIT-2D, CD-HIT-EST and CD-HIT-EST-2D. CD-HIT-2D compares two protein
sets and reports similar matches between them.
CD-HIT-EST and CD-HIT-EST-2D are nucleotide versions. Weizhong Li.
CD-HIT is a program for clustering large protein databases at a high
sequence identity threshold.
The program removes redundant sequences and generates a database of only the
representatives.
It can be applied in protein family classification, domain analysis,
organizing large protein databases, improving performance of database search,
and much more.
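As a rough illustration of how redundancy removal of this kind can work, the sketch below implements greedy incremental clustering: sequences are processed from longest to shortest, and each one either joins the first existing representative it matches at or above the identity threshold or becomes a new representative. This is a minimal sketch under stated assumptions, not CD-HIT's actual code. The names `identity` and `greedy_cluster` are hypothetical, the example sequences are made up, and the identity measure is a simple stand-in built on Python's `difflib` rather than the short-word filtering and alignment that CD-HIT uses.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Stand-in identity (an assumption, not CD-HIT's measure):
    residues in matching blocks divided by the length of the shorter sequence."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs, threshold=0.9):
    """Return {representative_id: [member_ids]} for a dict of id -> sequence."""
    clusters = {}
    # Longest first, so every representative is the longest member of its cluster.
    for seq_id in sorted(seqs, key=lambda k: len(seqs[k]), reverse=True):
        for rep_id in clusters:
            if identity(seqs[rep_id], seqs[seq_id]) >= threshold:
                clusters[rep_id].append(seq_id)  # redundant: join this cluster
                break
        else:
            clusters[seq_id] = [seq_id]  # no match: new representative
    return clusters


# Hypothetical toy input: seqB is seqA without its last residue.
example = {
    "seqA": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "seqB": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEV",
    "seqC": "GSSHHHHHHSSGLVPRGSHMASMTGGQQMGR",
}
clusters = greedy_cluster(example, threshold=0.9)
representatives = sorted(clusters)
```

Here the keys of `clusters` play the role of the non-redundant output database, while the member lists record which sequences each representative stands for.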
The program is written by
CD-HI is the first version; CD-HIT is a modification of CD-HI. CD-HIT is
much faster than CD-HI, but users have to tolerate a very
small amount of redundant sequence in the output database. Since the
amount of redundancy is so small, we suggest that users use only CD-HIT for
all applications. I am only maintaining CD-HIT now.
The CD-HIT manual and downloads are available from
bioinformatics.org. If you have a special request,
discuss it with the author.
If you find CD-HIT useful, please cite:
1. "Clustering of highly homologous sequences to reduce the
size of large protein database",
Weizhong Li, Lukasz Jaroszewski & Adam Godzik
Bioinformatics, (2001) 17:282-283.
2. "Tolerating some redundancy significantly speeds up clustering of large
protein databases",
Weizhong Li, Lukasz Jaroszewski & Adam Godzik
Bioinformatics, (2002) 18:77-82.
3. "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences",
Weizhong Li & Adam Godzik
Bioinformatics, (2006) 22:1658-9.
Who is using CD-HIT
The CD-HIT program is currently used by hundreds of research and
educational groups, including some of the world's best-known
institutions such as UniProt, PDB, EBI, and TIGR.
UniProt
is the world's most comprehensive catalog of information on proteins.
In UniProt, the CD-HIT program is used to generate the UniRef reference data sets,
UniRef90 and UniRef50.
CD-HIT is also used at the
PDB to
treat redundant sequences.
Related resources:
NRDB90 and nrdb90.pl,
a nonredundant sequence database and the Perl script used to generate it.
RSDB,
Representative protein Sequence DataBases.