Welcome to MCscan's documentation!
==================================

The MCscan download page can be accessed `here <http://chibba.agtec.uga.edu/duplication/mcscan>`__. 


Usage
==================
This software provides a clustering module for viewing the relationship of colinear segments in multiple genomes (or heavily redundant genomes). It takes the predicted pairwise segments from dynamic programming (DAGchainer in particular) and then try to build consensus segments from a set of related, overlapping segments.

Certain part of this package (``dagchainer.cc``) is based on the TIGR software `DAGchainer <http://dagchainer.sf.net>`_. The program used this as an initial step to generate pairwise segments. 

Along with the DAGchainer guidelines, all code is copiable, distributable, modifiable, and usable without any restrictions. 


Installation
==================
.. note::
    MCscan currently will only run on linux or cygwin platform, as it is dependent on GNU function.

Simply put ``mcscan.tar.gz`` in any directory:: 

    $ tar zxf mcscan_version.tar.gz
    $ cd mcscan_version.tar.gz/ && make

the compiled codes are within the same directory as the source.

Then put copy of MCL executable within the same folder as MCscan (MCL program downloadable `here <http://micans.org/mcl/>`__). 


Inputs and outputs
==================
MCscan reads in at least two sources of data: ``.blast`` file and ``.bed`` file. This may seem daunting at first, but these are very easy to retrieve. Have a look at the ``at_at.blast``, ``at_at.bed`` in the folder. In the actual execution, MCL is used to generate mcl file (``at_at.mcl``), which is used in multiple synteny construction.

Here is what can be used to genenerate the files.

The blast file is the following tab-delimited format:: 

    gene1    gene2    e-value

easily genenerated from a m8 blast output format::

    $ cut -f 1,2,11 xyz.m8 > xyz.blast.unfiltered

The first thing please ensure that for each gene pair, only one e-value is reported, the blast output normally would contain multiple HSPs, a convenience script is attached to filter all the redundant pairs::

    $ python filter_blast.py xyz.blast.unfiltered xyz.blast

The ``.bed`` file contains the following tab-delimited format (see `bed format <http://genome.ucsc.edu/FAQ/FAQformat.html#format1>`__)::

    chromosome_id    start    stop    gene_name

notice when you compare multiple genomes, formulate your molecule name carefully to avoid duplicated names. The ``.bed`` file can usually be generated by parsing the gene annotation file provided by the sequencing group (usually the sequencing project ftp will provide a ``.gff3`` file).

Once you have everything ready, put them in the same folder. We need to generate ``.mcl`` file if this is the first run (also take a look at the example in ``run.sh``)::

    $ more xyz.blast | mcl - --abc --abc-neg-log -abc-tf 'mul(0.4343), ceil(200)' -o xyz.mcl 

Some might encounter a problem exec the ``mcl`` command, in which case the ``mcl`` binary needs to be rebuilt from `here <http://micans.org/mcl/>`__. After the first time you run it (the ``mcl`` file has been generated). You can simply use::

    $ ./mcscan xyz


Parameters (for advanced user)
==============================
The help screen::

    Usage: mcscan [OPTION...] prefix_fn
    MCSCAN -- multiple collinearity scan (compiled Jul  6 2010 15:52:03)

    Reference:
     Tang,H., Wang,X., Bowers,J.E., Ming,R., Alam,M., Paterson,A.H.
     Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps
     Genome Research (2008) 18, 1944-1954.

      -a                         only builds the pairwise blocks (.aligns file)
      -A                         use base pair dist instead of gene ranks
      -b                         limit within genome synteny (e.g. Vv-Vv) mapping
      -e, --e_value=E_VALUE      alignment significance
      -g, --gap_score=GAP_SCORE  gap penalty
      -k, --match_score=MATCH_SCORE   final score=MATCH_SCORE+NUM_GAPS*GAP_SCORE
      -p, --pivot=PIVOT          PIVOT is the reference genome, make it two letter
                                 prefix inyour .bed file, everything else will be
                                 aligned to the reference
      -s, --match_size=MATCH_SIZE   number of genes required to call synteny
      -u, --unit_dist=UNIT_DIST  average intergenic distance
      -?, --help                 Give this help list
          --usage                Give a short usage message
      -V, --version              Print program version

    Mandatory or optional arguments to long options are also mandatory or optional
    for any corresponding short options.

    Report bugs to <bao@uga.edu>.

The default values are quite generic and should work for many instances. The following are more detailed information for users who wish to tune their results.

The pairwise synteny formula is roughly (Haas et al. 2004), note that ``DIST_X``, ``DIST_Y`` is in base pair::

    FINAL_SCORE = MATCH_SCORE*num_matches+max(DIST_X, DIST_Y)/UNIT_DIST*GAP_SCORE

The multiple synteny formula is roughly, now ``DIST_X``, ``DIST_Y`` is the distance in the partial order graph (not in base unit, but in gene index unit)::

    FINAL_SCORE = MATCH_SCORE*num_matches+max(DIST_X, DIST_Y)*GAP_SCORE

Sometimes you may want to run just the pairwise synteny on ``.blast`` and ``.bed`` files, then you can try::

    $ ./mcscan at_at -a

Note that to run this, ``.mcl`` file is not required, the result is now slightly different, since MCscan uses the mcl file to filter the BLAST hits.


Walkthrough example
===================
There are, by default ``at_vv`` sets of files and ``os_sb`` sets of files, which is basically two different projects.

First example, let us compare Os to Sb (rice to sorghum), just default settings, run::

    $ ./mcscan os_sb

It takes about one minute to run, the result is best viewed in EXCEL. The first part of the file lists all the parameters of the program. The result is separated with a line like this::

    ## View 11: pivot Sb02

This is called a view, each view uses a different chromosome as the reference. Then the blocks following this line is the multiply aligned blocks. The first column is numerical identifier, the second column is the actual pivot. Then following columns are the regions that are aligned to the pivot. The alignments between rice and sorghum are in fact complicated by one or more shared WGDs, creating several columns but mostly are four regions matching each other.

For the second example, we wish to align *Arabidopsis* to grape, and use grape as the reference genome, but we need to do it a little differently. Unlike the first example, we are not interested in WGD in grape in this case, and we only wish to see the grape used as pivot. Therefore, we modify the pivot:: 

    $ ./mcscan at_vv -p Vv -b

This trick ``-b`` will limit any Vv-Vv matches (in fact this is an older duplication called gamma) in the output.

There are two outputs. ``.aligns`` file and ``.blocks`` file, corresponding to pairwise and multiple synteny respectively. You will find the ``.aligns`` file very useful too, sometimes. But this is essentially similar to the output of DAGchainer (adding a few statistics and change the default paramters). 


Changelog
==================
* May 12, 2007 (version <0.5) initial release.
* Aug 05, 2007 (version 0.5) add the option of of a *reference* genome
* Oct 13, 2007 (version 0.6) add convenience python script to streamline the process
* Mar 07, 2008 (version 0.7) implement statistical test for pairwise syntenic blocks
* Nov 13, 2008 (version 0.8) partial-order graph for alignment

Contact
==================
Any questions, problems, bugs are welcome and should be dumped to 

Haibao Tang : bao at uga dot edu

`Plant Genome Mapping Laboratory <http://www.plantgenome.uga.edu>`_, University of Georgia