CD-HIT clusters large sequence sets and removes identical or highly similar sequences from them. It is often used to produce a non-redundant sequence set for further analysis. CD-HIT recognizes FASTA and FASTQ sequence formats.
Version on CSC's Servers
The setup command for CD-HIT on the Puhti cluster is:
module load biokit
After the setup command, the server recognizes CD-HIT commands. The CD-HIT package has many programs. The most notable are:
|cd-hit||Clustering and redundancy removal tool for protein sequences|
|cd-hit-est||Clustering and redundancy removal tool for nucleic acid sequences (only for sequences that do not contain introns)|
|cd-hit-2d||Tool to compare two protein sequence sets|
|cd-hit-est-2d||Tool to compare two nucleic acid sequence sets|
|cd-hit-454||Tool to identify artificial duplicates in raw 454 sequencing reads|
|psi-cd-hit||Tool to cluster proteins at identity cutoffs below 40%|
|cd-hit-lap||Tool to identify overlapping reads|
|cd-hit-dup||Tool to identify duplicates in single or paired Illumina reads|
A full list of programs can be found in the CD-HIT user guide.
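To confirm that the setup command made the programs available, you can check that they are found on the PATH. The following is a minimal sketch, not part of the CD-HIT package: the function name missing_tools is hypothetical, and the program names passed to it are taken from the table above.

```shell
# Print the names of the given programs that are NOT found on the PATH.
# Prints one missing name per line; prints nothing if all are available.
# (missing_tools is a hypothetical helper, not a CD-HIT command.)
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Check some of the CD-HIT programs named in the table above.
missing_tools cd-hit cd-hit-est cd-hit-2d cd-hit-est-2d cd-hit-454
```

If the command prints program names, the setup command has probably not been run in the current session.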
You can list the command line options of a CD-HIT program with the option -help. For example:
cd-hit -help
A simple analysis of a protein sequence set can be run, for example, with the command:
cd-hit -i my_proteins.fasta -o reduced_set.fasta -c 0.95
The sample command above produces two result files:
- reduced_set.fasta contains the pruned sequence set. In this case, if two sequences are more than 95% identical, only the longer one is included in the results.
- reduced_set.fasta.clstr describes the clustering of the sequences whose similarity exceeds the given threshold value (in this case 95%).
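The cluster file can be summarized with standard shell tools. The sketch below assumes the usual .clstr layout, in which each cluster starts with a line beginning with ">Cluster" and is followed by one line per member sequence. The function name clstr_summary and its output format are hypothetical and are not part of CD-HIT.

```shell
# Summarize a CD-HIT .clstr file: number of clusters, number of member
# sequences, size of the largest cluster, and number of single-member clusters.
# Assumes each cluster starts with a ">Cluster N" line followed by one
# line per member. (clstr_summary is a hypothetical helper.)
clstr_summary() {
    awk '
        # Record statistics for the cluster that has just ended.
        function close_cluster() {
            if (clusters > 0) {
                if (size == 1) singletons++
                if (size > largest) largest = size
            }
        }
        /^>Cluster/ { close_cluster(); clusters++; size = 0; next }
        NF > 0      { size++; members++ }
        END {
            close_cluster()
            printf "clusters=%d sequences=%d largest=%d singletons=%d\n",
                   clusters, members, largest, singletons
        }
    ' "$1"
}

# Usage, once the sample command has produced the cluster file:
#   clstr_summary reduced_set.fasta.clstr
```

The number of clusters reported should match the number of sequences in reduced_set.fasta, because one representative sequence is kept for each cluster.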