CD-HIT-Grid

This application port targets the bioinformatics community, in particular users of “CD-HIT”, a tool that clusters the sequences of a protein database. Clustering removes redundant sequences at a given sequence similarity level and generates a new database containing only the representatives. This activity was proposed by CNIO (Spanish National Cancer Research Centre) and started in the context of the BioGridNet Program.

Protein databases grow day after day, and clustering datasets of interest on a single machine is not feasible due to memory constraints. A Grid environment allows the database to be distributed adaptively in order to optimize its overall analysis. The complexity of the workflow inherent to “CD-HIT” requires a robust framework able to handle it. In addition, this framework may be reused for other applications that share the same type of workflow.

Download

This release of CD-HIT-Grid (1.0) contains the framework source code (using the OGF DRMAA standard), scripts, and the original 32-bit cd-hit binaries (which can be replaced with newer versions):

Basic Howto

In order to make CD-HIT-Grid work, you will need a machine with a running GridWay instance. The user needs a valid certificate for the resources accessed.

After unpacking the tar.gz file, three directories will be revealed:

You can merge these three directories into one.

In order to learn how to use CD-HIT-Grid, let's consider a database located in a directory called example. We also want to adjust the grain to 10 divisions.

#!/bin/bash

echo "Grid-HIT Preprocess Script - DSA Group"

echo "[1] Dividing the Database (example/protdb)"
./cd-hit-div -i example/protdb -o example/protdb.div -div 10

echo "[2] First Self-comparison (example/protdb.div-0-o)"
./cd-hit -i example/protdb.div-0 -o example/protdb.div-0-o -c 0.9 -n 5

echo "Preprocess Complete!"
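Before launching the grid run, it can be worth checking that the division step produced every segment. The helper below is a minimal sketch, not part of the CD-HIT-Grid release: the function name check_segments is hypothetical, and the naming scheme prefix-0 ... prefix-(N-1) is an assumption inferred from the example/protdb.div-0 file used above.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with CD-HIT-Grid).
# Assumption: segments are named <prefix>-0 ... <prefix>-(n-1), as suggested
# by "example/protdb.div-0" in the preprocess script.
check_segments() {
    local prefix=$1 n=$2 i=0 missing=0
    while [ "$i" -lt "$n" ]; do
        # -s is true only if the file exists and is not empty
        if [ ! -s "${prefix}-${i}" ]; then
            echo "missing or empty: ${prefix}-${i}" >&2
            missing=$((missing + 1))
        fi
        i=$((i + 1))
    done
    # Return 0 when every segment is present, 1 otherwise
    return $(( missing > 0 ))
}
```

For the example database, check_segments example/protdb.div 10 would return 0 only if all ten segment files are present and non-empty.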

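Next, compile cd-hit-grid against the DRMAA library found under the GridWay installation directory:

gcc cd-hit-grid.c -L "$GW_LOCATION/lib" \
    -I "$GW_LOCATION/include" -ldrmaa -o cd-hit-grid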

Then, you'll execute it with the following parameters:

./cd-hit-grid 10 1 al -1

where the first parameter is the number of database divisions. The remaining parameters are discussed in the Advanced Howto.

./grid-hit-postprocess -i example/protdb -o example/protdb -S 10
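To get a feel for how much work a given number of divisions implies, the dry-run sketch below prints one possible task list without executing anything. It rests on assumptions that this page does not confirm: that the workflow follows the triangular scheme of CD-HIT's own parallel script (cluster one segment against itself, then filter every later segment against its representatives with cd-hit-2d), and that intermediate files use the hypothetical ".vsK" suffix. The functions seg_in and plan_workflow are illustrative names only.

```shell
#!/bin/bash
# seg_in PREFIX J LEVEL: name of segment J as it enters LEVEL.
# Hypothetical naming: after being filtered at level K, segment J is stored
# as PREFIX-J.vsK; at level 0 the original segment PREFIX-J is used.
seg_in() {
    if [ "$3" -eq 0 ]; then
        echo "$1-$2"
    else
        echo "$1-$2.vs$(( $3 - 1 ))"
    fi
}

# plan_workflow PREFIX N: print the assumed task list for N segments.
plan_workflow() {
    local prefix=$1 n=$2 i=0 j
    while [ "$i" -lt "$n" ]; do
        # Level i: cluster what is left of segment i against itself
        echo "self $i: cd-hit -i $(seg_in "$prefix" "$i" "$i") -o ${prefix}-${i}-o -c 0.9 -n 5"
        j=$((i + 1))
        while [ "$j" -lt "$n" ]; do
            # Drop from segment j the sequences already represented by segment i
            echo "cross $i->$j: cd-hit-2d -i ${prefix}-${i}-o -i2 $(seg_in "$prefix" "$j" "$i") -o ${prefix}-${j}.vs${i} -c 0.9 -n 5"
            j=$((j + 1))
        done
        i=$((i + 1))
    done
}
```

Under these assumptions, plan_workflow example/protdb.div 10 lists 10 self-comparisons and 45 cross-comparisons, 55 tasks in total. The first one (self 0) corresponds to the self-comparison already performed by the preprocess script.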

Workflow Optimization

CD-HIT's workflow was analyzed and two optimization heuristics were implemented. The first one, called Replication, consists of creating copies of the tasks that lie on the workflow's critical path, which increases the probability that at least one copy finishes quickly. The second one, called Agglomeration, consists of running the last-level tasks on the client machine, so that queuing and file transfer times are avoided.
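The idea behind Replication can be illustrated locally with background processes. This is only a sketch of the principle, not the code used by cd-hit-grid, where the replicas would presumably be grid jobs rather than local processes. The function name run_replicated and the REPLICA variable are hypothetical, and wait -n requires bash 4.3 or later.

```shell
#!/bin/bash
# run_replicated K CMD [ARGS...]
# Start K copies of CMD, return as soon as the first one finishes, and
# cancel the others. Each copy sees its index in the REPLICA variable
# (a hypothetical convention for this sketch).
run_replicated() {
    local k=$1; shift
    local pids=() r status
    for ((r = 0; r < k; r++)); do
        REPLICA=$r "$@" &
        pids+=("$!")
    done
    wait -n                  # bash >= 4.3: returns when any one copy exits
    status=$?
    kill "${pids[@]}" 2>/dev/null   # cancel the copies that are still running
    wait 2>/dev/null                # reap them
    return "$status"
}
```

A call such as run_replicated 3 some_task returns with the exit status of whichever copy finished first. A real scheduler would also have to handle a copy that fails early, which this sketch simply reports as the result.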

Advanced Howto

The optimization heuristics mentioned before can be activated when executing:

./cd-hit-grid <div> <rel> <rep> <agg>

where:

License

CD-HIT-Grid is an open source development effort, released under the Apache License, Version 2.0. The Apache license allows software to be used by anyone and for any purpose, without restriction. Any academic report, publication, or other academic disclosure of results obtained with CD-HIT-Grid should acknowledge the use of CD-HIT-Grid and GridWay with an appropriate citation of the related papers listed below.

CD-HIT:

CD-HIT-Grid: