Falcon
===========

Falcon: a set of tools for fast alignment of long reads for consensus and assembly

The Falcon tool kit is a simple collection of code that I use for studying
efficient assembly algorithms for haploid and diploid genomes. It has some back-end
code implemented in C for speed and a simple front-end written in Python for
convenience.

Please take a look at the `readme.md` file inside the `examples` directory. It shows
how to do an assembly using `HBAR-DTK` + `Falcon` on Amazon EC2 with a `StarCluster`
setup. If anyone knows of anything comparable to `StarCluster` for Google Compute
Engine, please let me know. I can build a VM there too.

FILES
-----

Here is a brief description of the files in the package.

Several C files for implementing sequence matching, alignment and consensus:

    kmer_lookup.c  # k-mer matching code for quickly identifying potential hits
    DW_banded.c    # function for detailed sequence alignment
                   # It is based on Eugene Myers' paper
                   # "An O(ND) difference algorithm and its variations", 1986,
                   # http://dx.doi.org/10.1007/BF01840446
    falcon.c       # functions for generating consensus sequences from a set of multiple sequence alignments
    common.h       # header file for common declarations
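
To illustrate the general idea behind k-mer lookup, here is a minimal Python sketch. It is not the implementation in `kmer_lookup.c`; the function names `build_kmer_index` and `find_kmer_hits`, the toy sequences, and the choice of `k=4` are all assumptions made for illustration.

```python
from collections import defaultdict


def build_kmer_index(seq, k):
    """Map every k-mer in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_kmer_hits(index, query, k):
    """Return (query_pos, target_pos) pairs where query and target share a k-mer."""
    hits = []
    for j in range(len(query) - k + 1):
        # .get() avoids creating empty entries in the defaultdict
        for i in index.get(query[j:j + k], ()):
            hits.append((j, i))
    return hits


# Toy sequences (hypothetical data, not from the package)
target = "ACGTACGTTAGC"
query = "CGTTAG"
index = build_kmer_index(target, k=4)
hits = find_kmer_hits(index, query, k=4)
print(hits)
```

Hits that share the same diagonal (`target_pos - query_pos`) point to a candidate region that a detailed aligner could then examine.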

A Python wrapper library using Python's ctypes to call the C functions: falcon_kit.py

Some Python scripts for (1) overlapping reads, (2) generating consensus, and (3) generating
assembly contigs:

    falcon_overlap.py   # an overlapper
    falcon_wrap.py      # generates consensus from a group of reads
    get_rdata.py        # a utility for preparing data for falcon_wrap.py
    falcon_asm.py       # takes the overlap information and the sequences to generate assembled contigs
    falcon_fixasm.py    # a script that analyzes the assembly graph and breaks contigs at potential mis-assembly points
    remove_dup_ctg.py   # a utility to remove duplicated contigs from the assembly results


INSTALLATION
------------

You need to install `pbcore` and `networkx` first. You might want to install
the `HBAR-DTK` if you want to assemble genomes from raw PacBio data.  

On a Linux box, you should be able to use the standard `python setup.py
install` to compile the C code and install the Python package. There is no standard
way to install the shared objects from the C code inside a Python package, so I
used a hack to make it work. It might have some unexpected behavior. Alternatively, you can
simply install the `.so` files in a path where the operating system can find them
(e.g. by setting the environment variable `LD_LIBRARY_PATH`), and remove all
path prefixes in the Python `ctypes` `CDLL` function calls.
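
As a rough illustration of loading a shared object by bare name with `ctypes.CDLL`, the sketch below uses the C standard library as a stand-in, because the names of Falcon's own `.so` files and exported functions are not listed here. It assumes a Linux or other POSIX system and Python 3.

```python
import ctypes
import ctypes.util

# Look up the C standard library by bare name (a stand-in for Falcon's .so files).
libc_name = ctypes.util.find_library("c")

# CDLL hands the name to the system loader; with no directory prefix, the
# loader searches its usual locations, which on Linux include LD_LIBRARY_PATH.
libc = ctypes.CDLL(libc_name)

# Declare the C signature so ctypes converts arguments and results correctly.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"ACGT")
print(length)
```

The same pattern would apply to any shared object on the loader's search path: pass only the file name to `CDLL` and let the operating system resolve it.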


EXAMPLES
--------

Example for generating pre-assembled reads:

    python get_rdata.py queries.fofn targets.fofn m4.fofn 72 0 16 8 64 50 50 | falcon_wrap.py > p-reads-0.fa
    
    # the seven numeric arguments above, in order:
    bestn : 72
    group_id : 0
    num_chunk : 16
    min_cov : 8
    max_cov : 64
    trim_align : 50
    trim_plr : 50

This step is designed to be used with the m4 alignment information generated by
blasr + HBAR_WF2.py (https://github.com/PacificBiosciences/HBAR-DTK).
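
Because the arguments to `get_rdata.py` are positional, it can help to build the command from named values. The following is a minimal sketch, not part of the package: the helper `build_get_rdata_command` is hypothetical, and naming the output `p-reads-<group_id>.fa` is an assumption based on the single example above.

```python
import shlex


def build_get_rdata_command(queries, targets, m4, bestn, group_id, num_chunk,
                            min_cov, max_cov, trim_align, trim_plr):
    """Assemble the example pipeline from named parameters (hypothetical helper)."""
    args = [queries, targets, m4, bestn, group_id, num_chunk,
            min_cov, max_cov, trim_align, trim_plr]
    quoted = " ".join(shlex.quote(str(a)) for a in args)
    # Assumption: the output file is named after the group id, as in the example.
    output = shlex.quote("p-reads-%d.fa" % group_id)
    return "python get_rdata.py %s | falcon_wrap.py > %s" % (quoted, output)


cmd = build_get_rdata_command("queries.fofn", "targets.fofn", "m4.fofn",
                              bestn=72, group_id=0, num_chunk=16,
                              min_cov=8, max_cov=64, trim_align=50, trim_plr=50)
print(cmd)
```

With the values from the example, this reproduces the command line shown above.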

Example for generating overlap data:

    falcon_overlap.py --min_len 4000 --n_core 24 --d_core 3 preads.fa > preads.ovlp

Example for generating an assembly:

    falcon_asm.py preads.ovlp  preads.fa 

The following files will be generated by `falcon_asm.py` in the same directory:

    full_string_graph.adj  # the adjacent nodes of the edges in the full string graph
    string_graph.gexf      # the gexf file of the string graph for graph visualization
    string_graph.adj       # the adjacent nodes of the edges in the string graph after transitive reduction
    edges_list             # full edge list
    paths                  # paths for the unitigs
    unit_edges.dat         # paths and sequences of the unitigs
    uni_graph.gexf         # unitig graph in gexf format
    unitgs.fa              # fasta file of the unitigs
    all_tigs_paths         # paths for all final contigs (= primary contigs + associated contigs)
    all_tigs.fa            # fasta file for all contigs
    primary_tigs_paths     # paths for all primary contigs
    primary_tigs.fa        # fasta file for the primary contigs
    asm_graph.gexf         # the assembly graph where the edges are the contigs
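
A quick way to inspect the resulting FASTA files is to tabulate contig lengths and compute N50. The sketch below is not part of Falcon; the helpers `read_fasta_lengths` and `n50` are hypothetical, and the in-memory toy records stand in for a real file such as `primary_tigs.fa`.

```python
import io


def read_fasta_lengths(handle):
    """Return {record_name: sequence_length} for a FASTA file handle."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Largest length L such that contigs of length >= L cover half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Toy records; for real data, pass open("primary_tigs.fa") instead.
example = io.StringIO(">ctg1\nACGTACGT\nACGT\n>ctg2\nACG\n>ctg3\nACGTA\n")
contig_lengths = read_fasta_lengths(example)
print(contig_lengths, n50(contig_lengths.values()))
```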

Although I have tested this tool kit on genomes up to 150Mb and gotten reasonably
good assembly results, this tool kit is still highly experimental and is not
meant to be used by novices. If you would like to try it out, you will very
likely need to know more details about it and be able to tweak the code to adapt it
to your computational cluster. I hope that I can provide more details and
clean the code up a little in the future so it can be useful for more people.

The principles of the layout algorithm are also described at
https://speakerdeck.com/jchin/string-graph-assembly-for-diploid-genomes-with-long-reads

ABOUT THE LICENSE
------------------

A major part of the coding work was done on my own time and on my own MacBook(R)
Air. However, as a PacBio(R) employee, I did most of the testing with data
generated by PacBio and with PacBio's computational resources, so it is fair that the
code is released under PacBio's version of an open source license. If you are from
a competitor and try to take advantage of any open source code from PacBio, the
only way you can really justify such practice is to release your real data to the
public and your code as open source too.

Also, releasing this code to the public is fully at my own discretion. If my employer
has any concerns about this, I might have to pull it.

Standard PacBio Open Source License that is associated with this package:

    #################################################################################$$
    # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
    #
    # All rights reserved.
    #
    # Redistribution and use in source and binary forms, with or without
    # modification, are permitted (subject to the limitations in the
    # disclaimer below) provided that the following conditions are met:
    #
    #  * Redistributions of source code must retain the above copyright
    #  notice, this list of conditions and the following disclaimer.
    #
    #  * Redistributions in binary form must reproduce the above
    #  copyright notice, this list of conditions and the following
    #  disclaimer in the documentation and/or other materials provided
    #  with the distribution.
    #
    #  * Neither the name of Pacific Biosciences nor the names of its
    #  contributors may be used to endorse or promote products derived
    #  from this software without specific prior written permission.
    #
    # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
    # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
    # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
    # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
    # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
    # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
    # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
    # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
    # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
    # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
    # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
    # SUCH DAMAGE.
    #################################################################################$$

--Jason Chin, Dec 16, 2013

