~~NOTOC~~
====== CD-HIT-Grid ======

This port targets the bioinformatics community, in particular users of "CD-HIT", a tool that clusters the sequences of a protein database. Clustering removes redundant sequences at a given sequence similarity level and generates a new database that contains only the representatives. This activity was proposed by [[http://www.cnio.es/ | CNIO]] (Spanish National Cancer Research Centre) and started in the context of the [[http://www.biogridnet.org/ | BioGridNet Program]].

{{ outreach:porting:cdhit1.jpg?530 |}}

Protein databases grow day after day, and clustering interesting datasets on a single machine is not feasible because of memory constraints. A Grid environment allows an adaptive distribution of the database in order to optimize the overall analysis. The workflow inherent to "CD-HIT" is complex and needs a robust framework able to handle it. In addition, this framework may be reused by other applications that produce the same type of workflow.

===== Download =====

This release of CD-HIT-Grid (1.0) contains the framework source code (using the [[http://drmaa.org/ | OGF DRMAA standard]]), scripts and the original 32-bit cd-hit binaries (which can be replaced with newer versions):

  * {{outreach:porting:cd-hit-grid.tar.gz|CD-HIT-Grid 1.0}}

===== Basic Howto =====

To make CD-HIT-Grid work, you need a machine with a running [[http://www.gridway.org/ | GridWay]] instance and a valid certificate for the resources accessed. After unpacking the //tar.gz// file, you will find three directories:

  * **preprocess:** contains a script named //grid-hit-preprocess// that calls //cd-hit-div// to divide the input database and performs the first clustering (with //cd-hit//).
Out of the box it assumes that the input database is called protdb **(which cannot be changed)**, that it is located in a directory called CNIO, and that the number of divisions is 20. __Edit this script to make it fit your needs.__
  * **grid:** contains the //grid-hit-grid// source code (in C) with the original //cd-hit// and //cd-hit-2d// binaries, which will be sent to the Grid. There is also a script called agglomeration, which is discussed in the Advanced Howto.
  * **postprocess:** contains a script called //grid-hit-postprocess.pl// that invokes the original //clstr_merge.pl// (included) to merge all output data.

You can merge these three directories into one.

To learn how to use CD-HIT-Grid, consider a database located in a directory called //example//, with the grain adjusted to 10 divisions.

  * **Divide the database for Grid processing:** Edit //grid-hit-preprocess// so that it looks like this:

''#!/bin/bash''
''echo "Grid-HIT Preprocess Script - DSA Group"''
''echo "[1] Dividing the Database (example/protdb)"''
''./cd-hit-div -i example/protdb -o example/protdb.div -div 10''
''echo "[2] First Self-comparison (example/protdb.div-0-o)"''
''./cd-hit -i example/protdb.div-0 -o example/protdb.div-0-o -c 0.9 -n 5''
''echo "Preprocess Complete!"''

  * **Process the database on the Grid:** The framework binary must be built first. For this, use [[http://www.gridway.org/ | GridWay]]'s implementation of the [[http://drmaa.org/ | OGF DRMAA Standard]]:

''gcc cd-hit-grid.c -L $GW_LOCATION/lib \''
'' -I $GW_LOCATION/include -ldrmaa -o cd-hit-grid''

Then execute it with the following parameters:

''./cd-hit-grid 10 1 al -1''

where the first parameter is the number of database divisions. The remaining parameters are discussed in the Advanced Howto.
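Before launching the Grid run, it can help to confirm that the preprocess step produced every expected division file. The following is a minimal sketch and not part of the CD-HIT-Grid release: ''check_divisions'' is a hypothetical helper, and the ''.div-<index>'' naming with indices running from 0 to N-1 is an assumption drawn from the preprocess script shown above.

```bash
#!/bin/sh
# Hypothetical helper (not shipped with CD-HIT-Grid).
# Checks that <prefix>.div-0 ... <prefix>.div-(N-1) all exist and are non-empty.
# ASSUMPTION: cd-hit-div numbers its output files from 0 to N-1, as suggested
# by the "example/protdb.div-0" name used in the preprocess script.
# The body runs in a subshell "( ... )" so its variables do not leak.
check_divisions() (
    prefix="$1"   # database path without the ".div-<index>" suffix
    count="$2"    # number of divisions passed to cd-hit-div
    missing=0
    i=0
    while [ "$i" -lt "$count" ]; do
        if [ ! -s "${prefix}.div-${i}" ]; then
            echo "missing or empty: ${prefix}.div-${i}" >&2
            missing=$((missing + 1))
        fi
        i=$((i + 1))
    done
    # Succeed only when no division file is missing or empty.
    [ "$missing" -eq 0 ]
)
```

For the example database, it could be called as ''check_divisions example/protdb 10'' right before ''./cd-hit-grid 10 1 al -1''; a non-zero exit status means at least one division is missing or empty.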
  * **Postprocess the Grid output:** Once the Grid process is done, the last step is to merge the output files:

''./grid-hit-postprocess -i example/protdb -o example/protdb -S 10''

===== Workflow Optimization =====

CD-HIT's workflow was analyzed and two optimization heuristics were implemented. The first one, called Replication, consists in creating copies of the tasks on the workflow's critical path, which increases the probability that one of the copies finishes quickly. The second one, called Agglomeration, consists in running the last-level tasks on the client machine, so that queuing and file transfer times are avoided.

{{ outreach:porting:cdhit2.jpg?530 |}}

===== Advanced Howto =====

The optimization heuristics mentioned above can be activated when executing:

''./cd-hit-grid