CD-HI

Cluster Database at High Identity

http://bioinformatics.burnham-inst.org/cd-hi


CD-HI is a program for clustering large protein databases at a high sequence identity threshold. It can efficiently cluster very large databases, such as the non-redundant (NR) database, on an ordinary computer such as a PC.

The program removes redundant sequences and generates a database containing only the representatives, so the output database is much smaller. Using a clustered database not only saves time in database searching and result parsing, but can also increase search sensitivity.
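
To make the idea of keeping only representatives concrete, here is a minimal toy sketch in Python. It is not CD-HI's actual algorithm or code. It assumes a simple greedy scheme (longest sequences first) and uses a crude, ungapped position-by-position comparison as a stand-in for real sequence identity. The names toy_identity and pick_representatives are hypothetical and exist only for this illustration.

```python
def toy_identity(rep, seq):
    """Fraction of positions in seq that match rep when both are
    compared left-aligned with no gaps (a crude stand-in for a real
    alignment-based identity; an assumption of this sketch)."""
    if not seq:
        return 0.0
    matches = sum(1 for a, b in zip(rep, seq) if a == b)
    return matches / len(seq)


def pick_representatives(seqs, threshold=0.9):
    """Greedy selection: visit sequences from longest to shortest and
    keep one as a representative only if it is not similar enough to
    any representative already kept."""
    reps = []
    ordered = sorted(seqs.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in ordered:
        if not any(toy_identity(r, seq) >= threshold for _, r in reps):
            reps.append((name, seq))
    return reps


# Made-up example sequences, for illustration only.
example = {
    "a": "MKTAYIAKQR",
    "b": "MKTAYIAKQK",  # differs from "a" at one position
    "c": "GSHMLEDPVA",  # unrelated to "a"
}

representatives = pick_representatives(example, threshold=0.8)
print([name for name, _ in representatives])
```

In this toy run, sequence "b" is treated as redundant with "a" and dropped, while "c" is kept as its own representative, so the output set is smaller than the input.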

The program was written by

                                      Weizhong Li
                                      UCSD, San Diego Supercomputer Center
                                      La Jolla, CA, 92093
                                      Email liwz@sdsc.edu

                 at
                                      Adam Godzik's lab
                                      The Burnham Institute
                                      La Jolla, CA, 92037
                                      Email adam@burnham-inst.org

This program is free to download.