~~NOTOC~~
====== CD-HIT-Grid ======

This port targets the bioinformatics community, in particular users of "CD-HIT", a tool that clusters the sequences of a protein database. Clustering removes redundant sequences at a given sequence similarity level and generates a new database containing only the representative sequences. The activity was proposed by [[http://www.cnio.es/ | CNIO]] (Spanish National Cancer Research Centre) and started in the context of the [[http://www.biogridnet.org/ | BioGridNet Program]].

{{ outreach:porting:cdhit1.jpg?530 |}}

Protein databases grow every day, and clustering datasets of interest on a single machine is not feasible because of memory constraints. A Grid environment allows the database to be distributed adaptively in order to optimize the overall analysis. The complexity of the "CD-HIT" workflow calls for a robust framework able to handle it. In addition, this framework may be used in other applications that produce the same type of workflow.

===== Download =====

This release of CD-HIT-Grid (1.0) contains the framework source code (which uses the [[http://drmaa.org/ | OGF DRMAA standard]]), scripts, and the original 32-bit cd-hit binaries (which can be replaced with newer versions):

  * {{outreach:porting:cd-hit-grid.tar.gz|CD-HIT-Grid 1.0}}

===== Basic Howto =====

To run CD-HIT-Grid, you need a machine with a running [[http://www.gridway.org/ | GridWay]] instance and a valid certificate for the resources to be accessed.

After unpacking the //tar.gz// file, you will find three directories:

  * **preprocess:** contains a script named //grid-hit-preprocess// that calls //cd-hit-div// to divide the input database and then performs the first clustering (with //cd-hit//). Out of the box, it assumes that the input database is called protdb **(this name cannot be changed)**, that it is located in a directory called CNIO, and that the database is divided into 20 parts. __Edit this script to fit your needs.__

  * **grid:** contains the //grid-hit-grid// source code (in C) with the original //cd-hit// and //cd-hit-2d// binaries, which will be sent to the Grid. Additionally there's a script called agglomeration that will be discussed in the Advanced Howto.

  * **postprocess:** contains a script called //grid-hit-postprocess.pl// that invokes the original //clstr_merge.pl// (included) to merge all output data.

You can merge these three directories into one.

In order to learn how to use CD-HIT-Grid, let's consider a database located in a directory called //example//. We also want to adjust the grain to 10 divisions. 

  * **Divide the database for its Grid processing:** You should edit //grid-hit-preprocess// and make it look like:
''#!/bin/bash''

''echo "Grid-HIT Preprocess Script - DSA Group"''

''echo "[1] Dividing the Database (example/protdb)"''

''./cd-hit-div -i example/protdb -o example/protdb.div -div 10''

''echo "[2] First Self-comparison (example/protdb.div-0-o)"''

''./cd-hit -i example/protdb.div-0 -o example/protdb.div-0-o -c 0.9 -n 5''

''echo "Preprocess Complete!"''

  * **Process the database on the Grid:** the framework binary must be built first. For this, use [[http://www.gridway.org/ | GridWay]]'s implementation of the [[http://drmaa.org/ | OGF DRMAA Standard]]:
''gcc cd-hit-grid.c -L $GW_LOCATION/lib -I $GW_LOCATION/include -ldrmaa -o cd-hit-grid''

Then, you'll execute it with the following parameters:

''./cd-hit-grid 10 1 al -1''

where the first parameter is the number of database divisions. The remaining parameters are discussed in the Advanced Howto.

  * **Postprocess the Grid output:** Once the Grid process is done, the last thing to do is to merge the output files:
''./grid-hit-postprocess -i example/protdb -o example/protdb -S 10''
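
The three steps above can be summarized in a small dry-run sketch. It only prints the commands taken from this howto for a given directory and number of divisions; it does not run anything. The function name ''cdhit_grid_plan'' and its two arguments are hypothetical and not part of the CD-HIT-Grid distribution. The database file name stays protdb, as required by the preprocess script.

```shell
#!/bin/bash
# Dry-run sketch: print the command sequence of the Basic Howto.
# cdhit_grid_plan is a hypothetical helper, not shipped with CD-HIT-Grid.

cdhit_grid_plan() {
    local db_dir="$1"            # directory holding the database, e.g. "example"
    local divisions="$2"         # number of database divisions, e.g. 10
    local db="$db_dir/protdb"    # the database file name is fixed to protdb

    # Step 1: divide the database and cluster the first division
    echo "./cd-hit-div -i $db -o $db.div -div $divisions"
    echo "./cd-hit -i $db.div-0 -o $db.div-0-o -c 0.9 -n 5"

    # Step 2: run the workflow on the Grid (no replication, no agglomeration)
    echo "./cd-hit-grid $divisions 1 al -1"

    # Step 3: merge the output files
    echo "./grid-hit-postprocess -i $db -o $db -S $divisions"
}

cdhit_grid_plan example 10
```

Printing the commands first makes it easy to check that the same number of divisions is used in all three steps before anything is submitted to the Grid.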

===== Workflow Optimization =====

CD-HIT's workflow was analyzed and two optimization heuristics were implemented. The first one, called Replication, consists of creating copies of the tasks on the workflow's critical path, which increases the probability that each of those tasks completes quickly. The second one, called Agglomeration, consists of running the last-level tasks on the client machine, so that queuing and file transfer times are avoided.

{{ outreach:porting:cdhit2.jpg?530 |}}
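
As a rough illustration of why Replication can help, consider an idealized model that is an assumption made here and is not taken from the CD-HIT-Grid publications: each copy of a task finishes within a given time with probability //p//, independently of the other copies, and //r// copies are submitted. The symbols //p// and //r// are introduced only for this example. The probability that at least one copy finishes in time is then:

```latex
P(\text{at least one of } r \text{ copies finishes in time}) = 1 - (1 - p)^{r}
```

For instance, with //p// = 0.7 and //r// = 2 the probability rises from 0.7 to 1 - 0.3² = 0.91. Real Grid resources are not independent, so this is only an upper-level intuition, not a performance prediction.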

===== Advanced Howto =====

The optimization heuristics described above are activated through the command-line parameters:

''./cd-hit-grid <div> <rel> <rep> <agg>''

where:

  * **<div>** is the number of database divisions, as explained before.
  * **<rel>** is the level of replication (1 is none at all).
  * **<rep>** is the replication type. Values are //al// (all nodes) and //cp// (just critical path).
  * **<agg>** is the workflow level at which the agglomeration starts. If //-1// is specified, no Agglomeration will be used.
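
The parameter list above can be sketched as a small wrapper that checks the four values and prints the resulting command line. The function ''cdhit_grid_cmd'' is hypothetical and not part of CD-HIT-Grid. It assumes that ''<div>'' and ''<rel>'' are positive integers and that ''<agg>'' is either //-1// or a non-negative integer; the agglomeration level 3 in the last call is only an illustrative value.

```shell
#!/bin/bash
# Hypothetical helper: validate the four cd-hit-grid parameters and print
# the command line. It does not execute cd-hit-grid itself.

cdhit_grid_cmd() {
    local div="$1" rel="$2" rep="$3" agg="$4"

    # <div> and <rel>: assumed to be positive integers (rel=1 means no replication)
    case "$div" in ''|*[!0-9]*|0) echo "invalid <div>: $div" >&2; return 1 ;; esac
    case "$rel" in ''|*[!0-9]*|0) echo "invalid <rel>: $rel" >&2; return 1 ;; esac

    # <rep>: al (all nodes) or cp (just the critical path)
    case "$rep" in al|cp) ;; *) echo "invalid <rep>: $rep" >&2; return 1 ;; esac

    # <agg>: -1 disables Agglomeration; otherwise assumed to be a level >= 0
    case "$agg" in -1) ;; ''|*[!0-9]*) echo "invalid <agg>: $agg" >&2; return 1 ;; esac

    echo "./cd-hit-grid $div $rel $rep $agg"
}

cdhit_grid_cmd 10 1 al -1   # no replication, no Agglomeration
cdhit_grid_cmd 10 2 cp -1   # replication level 2 on the critical path only
cdhit_grid_cmd 10 1 al 3    # Agglomeration starting at workflow level 3
```

Checking the values before submission avoids launching a Grid workflow with a mistyped replication type or level.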

===== License =====

**CD-HIT-Grid is an open source development effort.** It is released as “open source” under the Apache License, Version 2.0. The Apache license allows software to be used by anyone and for any purpose, without restriction. **Any academic report, publication, or other academic disclosure of results obtained with CD-HIT-Grid will acknowledge the use of CD-HIT-Grid and GridWay by an appropriate citation to the related papers listed in the following section.**


===== Related Publications =====

**CD-HIT:**

  * Weizhong Li and Adam Godzik. "CD-HIT: A Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences", Bioinformatics (2006) 22:1658-9
 
**CD-HIT-Grid:**

  * J. L. Vázquez-Poletti, E. Huedo, R. S. Montero, I. M. Llorente, J. M. Fernández and A. Valencia: Protein Clustering on the EGEE. EGEE 2nd User Forum, Manchester (UK), May 2007.
  * J. L. Vázquez-Poletti, E. Huedo, R. S. Montero and I. M. Llorente: Workflow Management in a Protein Clustering Application. 5th International Workshop on Biomedical Computations on the Grid (BioGrid'07) on the 7th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2007), Rio de Janeiro (Brazil), May 2007. Proceedings published by IEEE Computer Society Press, pp. 679-684.
  * J. L. Vázquez-Poletti, E. Huedo, R. S. Montero and I. M. Llorente: Advanced Strategies for Efficient Workflow Management in a Protein Clustering Application with GridWay. 1st Iberian Grid Infrastructure Conference, Santiago de Compostela (Spain), May 2007. Proceedings published by CESGA, pp. 199-207.
  * J. L. Vázquez-Poletti, E. Huedo, R. S. Montero and I. M. Llorente: Replication Heuristics for Efficient Workflow Execution on Grids. International Conference on Grid computing, high-performAnce and Distributed Applications (GADA) 2007 at the On the Move Federated Conferences, Vilamoura (Portugal), November 2007. Proceedings published in Lecture Notes in Computer Science (LNCS). Volume 4805, pp. 31-32, 2007. Springer Verlag.
  * J. L. Vázquez-Poletti, E. Huedo, R. S. Montero, I. M. Llorente, J. M. Fernández and A. Valencia: Protein Clustering with CD-HIT on the EGEE. VII Jornadas de Bioinformática on the 8th Spanish Symposium on Bioinformatics and Computational Biology, Valencia (Spain), February 2008.
  * J. L. Vázquez-Poletti, E. Huedo, R. S. Montero and I. M. Llorente: CD-HIT Workflow Execution on Grids using Replication Heuristics. 1st International Workshop on Modern Computer Tools for the Biosciences - A Grid Perspective - (ModernBio'08) on the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), Lyon (France), May 2008. Proceedings published by IEEE Computer Society Press, pp. 735-740.