[med-svn] [falconkit] 01/05: Imported Upstream version 0.1.3+20140820

Afif Elghraoui afif-guest at moszumanska.debian.org
Tue Dec 22 06:53:45 UTC 2015


This is an automated email from the git hooks/post-receive script.

afif-guest pushed a commit to branch master
in repository falconkit.

commit 0627e0296c72f0eccf6ebf99f84932644dda1cd4
Author: Afif Elghraoui <afif at ghraoui.name>
Date:   Mon Dec 21 16:09:21 2015 -0800

    Imported Upstream version 0.1.3+20140820
---
 MANIFEST.in                        |    1 +
 README.md                          |  164 +++++
 doc/file_format_note.md            |  113 ++++
 examples/Dmel_asm.md               |  250 ++++++++
 examples/HBAR.cfg                  |   72 +++
 examples/StarCluster.cfg           |   24 +
 examples/install_note.sh           |   84 +++
 examples/readme.md                 |   92 +++
 examples/run_asm.sh                |   24 +
 setup.py                           |   36 ++
 src/c/DW_banded.c                  |  319 ++++++++++
 src/c/Makefile                     |   20 +
 src/c/Makefile.osx                 |   16 +
 src/c/common.h                     |  177 ++++++
 src/c/falcon.c                     |  613 +++++++++++++++++++
 src/c/kmer_lookup.c                |  594 +++++++++++++++++++
 src/py/__init__.py                 |   39 ++
 src/py/falcon_kit.py               |  193 ++++++
 src/py_scripts/falcon_asm.py       | 1154 ++++++++++++++++++++++++++++++++++++
 src/py_scripts/falcon_asm_dev.py   | 1015 +++++++++++++++++++++++++++++++
 src/py_scripts/falcon_dedup.py     |  119 ++++
 src/py_scripts/falcon_fixasm.py    |  213 +++++++
 src/py_scripts/falcon_overlap.py   |  328 ++++++++++
 src/py_scripts/falcon_overlap2.py  |  337 +++++++++++
 src/py_scripts/falcon_qrm.py       |  370 ++++++++++++
 src/py_scripts/falcon_sense.py     |  243 ++++++++
 src/py_scripts/falcon_ucns_data.py |  120 ++++
 src/py_scripts/falcon_utgcns.py    |  124 ++++
 src/py_scripts/get_rdata.py        |  207 +++++++
 src/py_scripts/overlapper.py       |  216 +++++++
 src/py_scripts/remove_dup_ctg.py   |   75 +++
 src/utils/fetch_preads.py          |   70 +++
 test_data/t1.fa                    |    2 +
 test_data/t1.fofn                  |    1 +
 test_data/t2.fa                    |    2 +
 test_data/t2.fofn                  |    1 +
 36 files changed, 7428 insertions(+)

diff --git a/MANIFEST.in b/MANIFEST.in
new file mode 100644
index 0000000..5b3144c
--- /dev/null
+++ b/MANIFEST.in
@@ -0,0 +1 @@
+include src/c/*
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..de97d3a
--- /dev/null
+++ b/README.md
@@ -0,0 +1,164 @@
+Falcon
+===========
+
+Falcon: a set of tools for fast aligning long reads for consensus and assembly
+
+The Falcon tool kit is a collection of simple code that I use for studying
+efficient assembly algorithms for haploid and diploid genomes. It has some back-end 
+code implemented in C for speed and some simple front ends written in Python for
+convenience. 
+
+Please take a look at the `readme.md` file inside the `examples` directory. It shows 
+how to do assembly using `HBAR-DTK` + `Falcon` on Amazon EC2 with a `StarCluster` 
+setup. If anyone knows anything comparable to `StarCluster` for Google Compute 
+Engine, please let me know. I can build a VM there too.
+
+FILES
+-----
+
+Here is a brief description of the files in the package:
+
+Several C files for implementing sequence matching, alignment and consensus:
+
+    kmer_lookup.c  # kmer match code for quickly identifying potential hits
+    DW_banded.c    # function for detailed sequence alignment
+                   # It is based on Eugene Myers' paper 
+                   # "An O(ND) Difference Algorithm and Its Variations", 1986, 
+                   # http://dx.doi.org/10.1007/BF01840446
+    falcon.c       # functions for generating consensus sequences from a set of multiple sequence alignments
+    common.h       # header file for common declarations
+
+A python wrapper library using Python's ctypes to call the C functions: falcon_kit.py
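+
+As a rough sketch of how such a ctypes binding works (illustrative only; the
+actual declarations live in `falcon_kit.py`, and the struct fields below mirror
+the `alignment` struct and `align()` signature in `common.h`):
+
+    from ctypes import CDLL, POINTER, Structure, c_char_p, c_int, c_long
+
+    class Alignment(Structure):  # mirrors the C `alignment` struct in common.h
+        _fields_ = [("aln_str_size", c_long), ("dist", c_long),
+                    ("aln_q_s", c_long), ("aln_q_e", c_long),
+                    ("aln_t_s", c_long), ("aln_t_e", c_long),
+                    ("q_aln_str", c_char_p), ("t_aln_str", c_char_p)]
+
+    DW_align = CDLL("DW_align.so")  # the shared object built from DW_banded.c
+    DW_align.align.argtypes = [c_char_p, c_long, c_char_p, c_long, c_long, c_int]
+    DW_align.align.restype = POINTER(Alignment)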
+
+Some python scripts for (1) overlapping reads, (2) generating consensus, and (3) generating 
+assembly contigs:
+
+    falcon_overlap.py   # an overlapper
+    falcon_wrap.py      # generates consensus from a group of reads
+    get_rdata.py        # a utility for preparing data for falcon_wrap.py
+    falcon_asm.py       # takes the overlapping information and the sequences to generate assembled contigs
+    falcon_fixasm.py    # a script that analyzes the assembly graph and breaks contigs at potential mis-assembly points
+    remove_dup_ctg.py   # a utility to remove duplicated contigs from the assembly results
+
+
+INSTALLATION
+------------
+
+You need to install `pbcore` and `networkx` first. You might also want to install
+`HBAR-DTK` if you want to assemble genomes from raw PacBio data.  
+
+On a Linux box, you should be able to use the standard `python setup.py
+install` to compile the C code and install the Python package. There is no standard
+way to install the shared objects from the C code inside a Python package, so I
+did some hacking to make it work.  It might have some unexpected behavior. You can
+simply install the `.so` files in a path where the operating system can find
+them (e.g. by setting the environment variable `LD_LIBRARY_PATH`), and remove the
+path prefixes in the Python `ctypes` `CDLL` function calls.
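+
+For example, a minimal sketch of that change (assuming the wrapper locates the
+`.so` file relative to the module via `__file__`; see `falcon_kit.py` for the
+actual call sites):
+
+    import os
+    from ctypes import CDLL
+
+    # loading the shared object from inside the installed package
+    module_path = os.path.dirname(os.path.abspath(__file__))
+    falcon = CDLL(os.path.join(module_path, "falcon.so"))
+
+    # once falcon.so is installed where the loader can find it (e.g. on
+    # LD_LIBRARY_PATH), the path prefix can be dropped
+    falcon = CDLL("falcon.so")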
+
+
+EXAMPLES
+--------
+
+Example for generating pre-assembled reads:
+
+    python get_rdata.py queries.fofn targets.fofn m4.fofn 72 0 16 8 64 50 50 | falcon_wrap.py > p-reads-0.fa
+    
+    bestn : 72
+    group_id : 0
+    num_chunk : 16
+    min_cov : 8
+    max_cov : 64
+    trim_align : 50
+    trim_plr : 50
+
+    It is designed to be used with the m4 alignment information generated by blasr + HBAR_WF2.py (https://github.com/PacificBiosciences/HBAR-DTK)
+
+Example for generating overlap data:
+
+    falcon_overlap.py --min_len 4000 --n_core 24 --d_core 3 preads.fa > preads.ovlp
+
+Example for generating an assembly:
+
+    falcon_asm.py preads.ovlp  preads.fa 
+
+The following files will be generated by `falcon_asm.py` in the same directory:
+
+    full_string_graph.adj  # the adjacent nodes of the edges in the full string graph
+    string_graph.gexf      # the gexf file of the string graph for graph visualization
+    string_graph.adj       # the adjacent nodes of the edges in the string graph after transitive reduction
+    edges_list             # full edge list 
+    paths                  # paths for the unitigs
+    unit_edges.dat         # path and sequence of the unitigs
+    uni_graph.gexf         # unitig graph in gexf format 
+    unitgs.fa              # fasta file of the unitigs
+    all_tigs_paths         # paths for all final contigs (= primary contigs + associated contigs)
+    all_tigs.fa            # fasta file for all contigs
+    primary_tigs_paths     # paths for all primary contigs 
+    primary_tigs.fa        # fasta file for the primary contigs
+    asm_graph.gexf         # the assembly graph where the edges are the contigs
+
+Although I have tested this tool kit on genomes up to 150Mb and got reasonably
+good assembly results, this tool kit is still highly experimental and is not
+meant to be used by novices. If you would like to try it out, you will very
+likely need to know more details about it and be able to tweak the code to adapt it
+to your computation cluster.  I hope that I can provide more details and
+clean the code up a little in the future so it can be useful for more people. 
+
+The principle of the layout algorithm is also available at 
+https://speakerdeck.com/jchin/string-graph-assembly-for-diploid-genomes-with-long-reads
+
+ABOUT THE LICENSE
+------------------
+
+The major part of the coding work was done in my own time and on my own MacBook(R)
+Air. However, as a PacBio(R) employee, most of the testing was done with the data
+generated by PacBio and PacBio's computational resources, so it is fair that the
+code is released with PacBio's version of an open source license. If you are from
+a competitor and try to take advantage of any open source code from PacBio, the
+only way you can really justify such a practice is to release your real data in
+public and your code as open source too. 
+
+Also, releasing this code to the public is fully at my own discretion. If my employer
+has any concern about this, I might have to pull it.
+
+Standard PacBio Open Source License that is associated with this package:
+
+    #################################################################################$$
+    # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
+    #
+    # All rights reserved.
+    #
+    # Redistribution and use in source and binary forms, with or without
+    # modification, are permitted (subject to the limitations in the
+    # disclaimer below) provided that the following conditions are met:
+    #
+    #  * Redistributions of source code must retain the above copyright
+    #  notice, this list of conditions and the following disclaimer.
+    #
+    #  * Redistributions in binary form must reproduce the above
+    #  copyright notice, this list of conditions and the following
+    #  disclaimer in the documentation and/or other materials provided
+    #  with the distribution.
+    #
+    #  * Neither the name of Pacific Biosciences nor the names of its
+    #  contributors may be used to endorse or promote products derived
+    #  from this software without specific prior written permission.
+    #
+    # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
+    # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
+    # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+    # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+    # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+    # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
+    # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+    # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+    # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+    # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+    # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+    # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+    # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+    # SUCH DAMAGE.
+    #################################################################################$$
+
+--Jason Chin, Dec 16, 2013
+
diff --git a/doc/file_format_note.md b/doc/file_format_note.md
new file mode 100644
index 0000000..6cbcdd4
--- /dev/null
+++ b/doc/file_format_note.md
@@ -0,0 +1,113 @@
+Quick Note on FALCON Assembly Output Format
+============================================
+
+After running `falcon_asm.py`, the following files will be generated:
+
+- `edges_list`: the list of edges in the assembled string graph
+- `unit_edge_paths`: the path of each unitig
+- `unit_edges.dat`: the path and the sequence of each unitig
+- `unitgs.fa`: fasta file of all unitigs
+- `all_tigs_paths`: the path of all contigs
+- `all_tigs.fa`: the sequences of all contigs
+- `primary_tigs_paths`: the path of the primary contigs
+- `primary_tigs.fa`: the sequences of the initial primary contigs
+- `bundle_edges`: the edges and paths of each "string bundle"
+
+After running `falcon_fixasm.py`, the following files are generated:
+
+- `primary_tigs_c.fa`: the final primary contigs
+- `primary_tigs_paths_c`: the path of the final primary contigs
+- `all_tiling_path_c`: the "tiling" path of all contigs
+- `primary_tigs_node_pos_c`: the positions of the nodes in each of the primary contigs
+
+The format of each node is the identifier of the DNA fragment followed by `:B` or `:E`, indicating
+which end of the read corresponds to the node.
+
+The `edges_list` file has a simple 4-column format: `in_node out_node edge_label overlap_length`.
+ 
+Here is an example of how edges are represented in the `edges_list` file:
+	
+	00099576_1:B 00101043_0:B 00101043_0:1991-0 14333
+	00215514_0:E 00025025_0:B 00025025_0:99-0 14948
+	00223367_0:E 00146924_0:B 00146924_0:1188-0 8452
+	00205542_0:E 00076625_0:B 00076625_0:396-0 11067
+
+The `edge_label`, e.g. `00101043_0:1991-0`, encodes the corresponding sequence of the edge from the DNA fragment. The
+edge `00099576_1:B -> 00101043_0:B` has a sequence from read `00101043_0`, base 1991 to 0.
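+
+As a small illustrative sketch (not part of the package), such a line can be
+parsed like this:
+
+	def parse_edge(line):
+	    # columns: in_node out_node edge_label overlap_length
+	    in_node, out_node, edge_label, overlap_len = line.split()
+	    read_id, coords = edge_label.split(":")
+	    begin, end = (int(c) for c in coords.split("-"))
+	    # begin > end means the edge sequence runs in the reverse direction
+	    return in_node, out_node, read_id, begin, end, int(overlap_len)
+
+	parse_edge("00099576_1:B 00101043_0:B 00101043_0:1991-0 14333")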
+
+
+The `unit_edge_paths` file contains the path of each unitig. Each line represents 
+a unitig. For example, the unitig `00001c` is represented as:
+
+	>00001c-00169881_0:B-00121915_0:E-133 00169881_0:B 00201238_0:E 00137179_0:E 00142410_0:B 00223493_0:B 00208425_0:B 00102538_0:E 00160115_0:E  ... 00122905_0:E 00121915_0:E
+
+The full unitig id `00001c-00169881_0:B-00121915_0:E-133` includes the unique serial number `00001c`, the begin node `00169881_0:B`, and the end node `00121915_0:E`, followed by the number of nodes (133) in the path. The rest of the fields list the full path node by node.
+
+The `primary_tigs_paths` and `all_tigs_paths` files have the same format as `unit_edge_paths`, except that the edges in the path are the unitig edges rather than the edges in the original string graph.
+
+The `unit_edges.dat` file contains not only the begin node, the end node, and the path of each unitig but also its full sequence.  It has a simple 4-column format: `begin node`, `end node`, `path`, `sequence`. The nodes in the path are delimited by `-`.  
+
+The sequence identifiers in `all_tigs.fa` also encode the relationship between different contigs. For example:
+
+	$ grep ">" all_tigs.fa | head -15
+	>0000-0000 2e8a7078_130260_0:B-02eca7b8_135520_0:E
+	>0000-0001 6edbcd5c_128868_0:E-3353572d_72448_963:E
+	>0000-0002 2f1c350c_15083_0:E-8c92434f_60400_0:E
+	>0000-0003 02eca7b8_135520_0:B-02030999_5577_0:B
+	>0000-0004-u 53756d78_87035_13099:B-d850f3f2_135807_0:E
+	>0000-0005-u 80ae02b0_43730_1168:B-4901e842_5163_2833:B
+	>0000-0006-u e1709413_155764_0:E-e55b636f_50757_0:E
+	>0000-0007-u e56a70f0_80897_1520:E-06734432_150537_0:E
+	>0000-0008-u 1ab64aad_59082_807:E-6f9ad27e_23458_5638:E
+	>0000-0009-u 1a88ddf4_21715_0:B-9eb4f7d7_79023_11041:E
+	>0000-0010-u ada57c82_24446_0:E-4ce44ebc_41426_0:E
+	>0000-0011-u 49704ee2_54679_0:B-a9ced3cc_90191_1410:E
+	>0000-0012-u b3728b6f_59022_233:E-bd1579e4_160424_0:B
+
+All these sequences have the same first field `0000`, which means all these contigs are initialized from the same "string bundle". If the second field is `0000`, that sequence is the primary contig of this bundle. The rest are the "associated contigs". The second column of the identifier line simply indicates the begin and the end nodes of the contig.
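+
+For illustration only (this is not a function shipped in the package), the
+identifiers can be split like this:
+
+	def classify_tig(header):
+	    # e.g. ">0000-0001 ..." -> bundle "0000", serial "0001"
+	    name = header.lstrip(">").split()[0]
+	    bundle, serial = name.split("-")[:2]
+	    kind = "primary" if serial == "0000" else "associated"
+	    return bundle, serial, kind
+
+	classify_tig(">0000-0004-u 53756d78_87035_13099:B-d850f3f2_135807_0:E")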
+
+After running `falcon_fixasm.py`, some of the primary contigs could be broken apart into smaller pieces. For example:
+	
+	$ grep ">" primary_tigs_c.fa |  head -15
+	>0000_00
+	>0001_00
+	>0001_01
+	>0001_02
+	>0002_00
+	>0002_01
+
+In this case, the initial primary contig `0000` (`0000-0000` in the `all_tigs.fa` file) is intact. However, the `0001-0000` has been broken into 3 primary contigs `0001_00`, `0001_01`, and `0001_02`.
+
+Some of the associated contigs might be caused by sequencing / consensus errors or missing overlapping information. Running `falcon_dedup.py` compares the associated contigs to the corresponding sequences in the primary contigs. If the identity is high, namely no large-scale variants are found, they will be removed. The MUMmer3 (nucmer) package is used and is necessary for this step. `falcon_dedup.py` generates a file called `a_nodup.fa` which contains the non-redundant associated contigs.
+
+
+Input File Format For FalconSense
+---------------------------------
+
+The `falcon_sense.py` script generates consensus sequences from sets of raw sequences.
+
+The input is a stream of sequences. Each row has two columns.  Different sets of reads are delimited by `- -`, and the file should be terminated by `+ +`.  Here is an example:
+
+	seed_id1 ACTACATACATACTTA...
+	read_id2 TCTGGCAACACTACTTA...
+	...
+	- -
+	seed_id2 ACTACATACATACTTA...
+	read_id3 TCTGGCAACACTACTTA...
+	...
+	- -
+	+ +
+
+In this case, if there is enough coverage to correct `seed_id1` and `seed_id2`, `falcon_sense.py` will generate two consensus sequences (labeled with `seed_id1` and `seed_id2`) in fasta format on `stdout`.
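+
+A small sketch of code that emits this format (illustrative only; not part of
+the package):
+
+	import sys
+
+	def write_groups(groups, out=sys.stdout):
+	    # groups: iterable of lists of (read_id, sequence) pairs,
+	    # with the seed read first in each group
+	    for reads in groups:
+	        for read_id, seq in reads:
+	            out.write("%s %s\n" % (read_id, seq))
+	        out.write("- -\n")  # close this group
+	    out.write("+ +\n")      # terminate the stream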
+
+Final Note
+----------
+
+1. Typically, the size of `unitgs.fa` will be roughly twice the genome size, since the file contains both dual edges from each overlap. In the process of the assembly, only one of the dual edges will be used in the final contigs.  
+
+2. The relation between the associated contigs and the primary contigs can be simply identified by the begin and the end nodes of the associated contigs. One can easily construct the corresponding sequences in the primary contigs to identify the variants between them.
+
+3. One can construct a unitig graph from the `unit_edge_paths` file. The graph is typically much smaller than the initial string graph, which makes it more convenient to visualize for understanding the assembly/genome structure.
+
+4. The `-` and `:` characters are used as delimiters for parsing, so the initial read identifiers should not contain these two characters. 
diff --git a/examples/Dmel_asm.md b/examples/Dmel_asm.md
new file mode 100644
index 0000000..f8d20f1
--- /dev/null
+++ b/examples/Dmel_asm.md
@@ -0,0 +1,250 @@
+Dmel Assembly with FALCON on Amazon EC2
+=========================================
+
+Preparation for Running StarCluster
+-----------------------------------
+
+I use a development version of StarCluster since the stable version does
+not support the kind of instance that we need to use in AWS EC2.
+
+You can install the development version by directly cloning
+StarCluster's GitHub repository. The following is a simple example
+of installing the development version. You might have to install
+other python packages that StarCluster depends on.
+
+```
+    git clone https://github.com/jtriley/StarCluster.git
+    cd StarCluster
+    # you can check out the exact revision that I am using for this document
+    git checkout 4149bbed292b0298478756d778d8fbf1dd210daf 
+    python setup.py install
+```
+
+For using StarCluster to create an SGE cluster in AWS EC2, I assume you already know how
+to create an AWS EC2 account and have gone through the tutorial for running VMs on
+EC2.
+
+I have built a public EC2 EBS snapshot. You should create a new EBS volume
+using the `PacBio_Dmel_Asm / snap-19e7a0df` snapshot. It already contains the
+raw sequence fasta files and an assembly as an example.
+
+Here is an example of the configuration for StarCluster:
+
+```
+    [aws info]
+    aws_access_key_id = your_access_key
+    aws_secret_access_key = your_secret_access_key
+    aws_user_id = your_user_id
+
+    [volume DMEL]
+    volume_id=your_dmel_data_EBS_id #e.g volume_id=vol-c9df3b85
+    mount_path=/mnt/dmel_asm
+
+    [cluster falcon-pre-asm]
+    keyname = starcluster
+    cluster_size = 1
+    cluster_user = sgeadmin
+    cluster_shell = bash
+    master_image_id = ami-ef3c0e86
+    master_instance_type = c3.8xlarge
+    node_image_id = ami-ef3c0e86
+    node_instance_type = c3.8xlarge
+    availability_zone = us-east-1a
+    volumes = DMEL
+
+    [cluster falcon-bigmem]
+    keyname = starcluster
+    cluster_size = 1
+    cluster_user = sgeadmin
+    cluster_shell = bash
+    master_image_id = ami-73d2d21a
+    master_instance_type = cr1.8xlarge
+    node_image_id = ami-73d2d21a
+    node_instance_type = cr1.8xlarge
+    availability_zone = us-east-1a
+    volumes = DMEL
+
+    [global]
+    default_template = falcon-bigmem
+    ENABLE_EXPERIMENTAL=True
+```
+
+I set up two cluster configurations for different parts of the assembly process.
+If you want to run end-to-end in one kind of instance, you can just use 
+`falcon-bigmem` for assembly. It costs a little bit more.
+
+The AMI images (ami-ef3c0e86 and ami-73d2d21a) are pre-built with most packages
+necessary for the assembly work. If you would like to build your own, you can
+check this script:
+
+```
+    https://raw.github.com/PacificBiosciences/FALCON/v0.1.1/examples/install_note.sh
+```
+
+Get preassembled reads
+------------------------
+
+"Pre-assembly" is the process of error-correcting PacBio reads to generate
+"preassembled reads" (p-reads) which are accurate enough to be assembled by
+traditional Overlap-Layout-Consensus assembly algorithms directly. In this
+instruction, we use an experimental code `falcon_qrm.py` to match the reads for
+error correction. It is much faster than using `blasr` for the same purpose, but
+it may not be as robust as `blasr` at generating high-quality results yet, as many
+statistical properties of the algorithm have not been fully studied.
+
+
+First, let's start an EC2 cluster of one node to set up a few things by running the 
+following `starcluster` command:
+
+```
+    starcluster start -c falcon-pre-asm falcon
+```
+
+Once the cluster is built, one can log in to the master node by:
+
+```
+    starcluster sshmaster falcon
+```
+
+We will need the following steps to set up the running environment:
+
+1. update SGE environment
+
+    ```
+        cd /mnt/dmel_asm/sge_setup
+        bash sge_setup.sh
+    ```
+
+2. set up the HBAR-DTK environment
+
+    ```
+        . /home/HBAR_ENV/bin/activate
+    ```
+
+3. update HBAR-DTK and falcon_asm
+
+    ```
+        cd /mnt/dmel_asm/packages/pbtools.hbar-dtk-0.1.5
+        python setup.py install
+        cd /mnt/dmel_asm/packages/falcon_kit-0.1.1
+        #edit falcon_asm.py to set identity threshold for overlapping at 98%, it is done in the EBS snapshot
+        python setup.py install
+    ```
+
+If you want to do an assembly in `/mnt/dmel_asm/new_asm/`, just clone the 
+configuration in `/mnt/dmel_asm/asm_template/` to `/mnt/dmel_asm/new_asm/`:
+
+```
+    cd /mnt/dmel_asm/
+    cp -a asm_template/ new_asm/
+    cd new_asm/
+```
+
+An example of the assembly result can be found in `/mnt/dmel_asm/asm_example`.
+
+You can start the pre-assembly stage by running the `HBAR_WF3.py` script as follows:
+
+```
+    python HBAR_WF3.py HBAR_step1.cfg
+```
+
+It will take a while to prepare the fasta files for pre-assembly. Once that is
+done, SGE jobs for matching reads will be submitted. Once the SGE jobs are
+submitted, you can add more nodes to run the jobs concurrently and speed up
+the process by issuing this command on your local host:
+
+    starcluster addnode -n 15 falcon # add 15 nodes 
+
+When all nodes are up, you can run the load balancer so that once the jobs are
+done, the nodes can be terminated automatically to save some money.
+
+    starcluster loadbalance -k 9 -K -m 16 -n 1 falcon
+
+I found I had to comment out one line of code in `starcluster/plugins/sge.py`
+to make it properly remove unused nodes:
+    
+    class SGEPlugin(clustersetup.DefaultClusterSetup):
+        def _remove_from_sge(self, node):
+            #comment out the following line in the code
+            #master.ssh.execute('qconf -de %s' % node.alias)
+
+If you use 16 nodes, it will take about 4 hours to finish all jobs.  When all
+pre-assembly jobs finish, the cluster will be terminated automatically, but the
+results will be kept in the EBS volume.
+
+The generated p-reads will be in `/mnt/dmel_asm/new_asm/2-preads-falcon/pread_*.fa`.
+
+Assembling the p-reads
+------------------------
+
+We use a different instance type which has more memory to assemble the genome. We
+only need one node for the assembly part.  We still use SGE, as the code was written 
+to run end-to-end assembly on a general SGE cluster. First, start a single-node cluster by
+running the commands on the local host:
+
+```
+    starcluster start -c falcon-bigmem falcon
+```
+
+Repeat the setup process:
+
+```
+    cd /mnt/dmel_asm/sge_setup
+    bash sge_setup.sh
+
+    . /home/HBAR_ENV/bin/activate
+
+    cd /mnt/dmel_asm/packages/pbtools.hbar-dtk-0.1.5
+    python setup.py install
+    cd /mnt/dmel_asm/packages/falcon_kit-0.1.1
+    #edit falcon_asm.py to set identity threshold for overlapping at 98%, it is done in the EBS snapshot
+    python setup.py install
+```
+
+You can start the assembly stage by running the `HBAR_WF3.py` script as follows:
+
+```
+    cd /mnt/dmel_asm/new_asm/
+    python HBAR_WF3.py HBAR_step2.cfg
+```
+
+It takes about two hours for the assembly process to finish. The results will 
+be in `/mnt/dmel_asm/new_asm/3-asm-falcon`. 
+
+Here is a list of the output files:
+
+```
+    full_string_graph.adj  # the adjacent nodes of the edges in the full string graph
+    string_graph.gexf      # the gexf file of the string graph for graph visualization
+    string_graph.adj       # the adjacent nodes of the edges in the string graph after transitive reduction
+    edges_list             # full edge list 
+    paths                  # paths for the unitigs
+    unit_edges.dat         # path and sequence of the unitigs
+    uni_graph.gexf         # unitig graph in gexf format 
+    unitgs.fa              # fasta file of the unitigs
+    all_tigs_paths         # paths for all final contigs (= primary contigs + associated contigs)
+    all_tigs.fa            # fasta file for all contigs
+    primary_tigs_paths     # paths for all primary contigs 
+    primary_tigs.fa        # fasta file for the primary contigs
+    primary_tigs_paths_c   # paths for all primary contigs, with detectable mis-assemblies broken 
+    primary_tigs_c.fa      # fasta file for the primary contigs, with detectable mis-assemblies broken
+    asm_graph.gexf         # the assembly graph where the edges are the contigs
+```
+
+There might be redundant contigs. The following script can be used to remove
+them:
+
+```
+    export PATH=$PATH:/home/HBAR_ENV/MUMmer3.23
+    nucmer -mum all_tigs.fa all_tigs.fa -p all_tigs_self >& /dev/null
+    show-coords -o -H -T all_tigs_self.delta | grep CONTAINS | awk '$7>96' | awk '{print $9}' | sort -u > all_tigs_duplicated_ids
+    remove_dup_ctg.py
+    cat p-tigs_nodup.fa a-tigs_nodup.fa > pa-tigs_nodup.fa
+```
+
+The non-redundant set of contigs in `pa-tigs_nodup.fa` will be suitable for further correction
+by the Quiver algorithm. 
+
+-
+Jason Chin, March 9, 2014
+
diff --git a/examples/HBAR.cfg b/examples/HBAR.cfg
new file mode 100755
index 0000000..2257294
--- /dev/null
+++ b/examples/HBAR.cfg
@@ -0,0 +1,72 @@
+[General]
+# list of files of the initial bas.h5 files
+input_fofn = input.fofn
+
+# The length cutoff used for seed reads used for initial mapping
+length_cutoff = 10000 
+
+# The length cutoff used for seed reads used for pre-assembly
+length_cutoff_pr = 10000
+
+# The read quality cutoff used for seed reads
+RQ_threshold = 0.75
+
+# SGE job option for distributed mapping 
+sge_option_dm = -pe smp 32 -q all.q
+
+# SGE job option for m4 filtering
+sge_option_qf = -pe smp 4 -q all.q
+
+# SGE job option for pre-assembly
+sge_option_pa = -pe smp 32 -q all.q
+
+# SGE job option for CA 
+sge_option_ca = -pe smp 8 -q all.q
+
+# SGE job option for Quiver
+sge_option_qv = -pe smp 32 -q all.q
+
+# blasr for initial read-read mapping for each chunk (do not specify the "-out" option). 
+# One might need to tune the bestn parameter to match the number of distributed chunks to get more optimized results 
+blasr_opt = -nCandidates 64 -minMatch 12 -maxLCPLength 15 -bestn 48 -minPctIdentity 75.0 -maxScore -1000 -nproc 32 
+
+#This is used for running quiver
+SEYMOUR_HOME = /mnt/secondary/Smrtpipe/builds/Assembly_Mainline_Nightly_Archive/build470-116466/
+
+#The number of best alignment hits used for pre-assembly
+#It should be about the same as the final PLR coverage; slightly higher might be OK.
+bestn = 64
+
+# target choices are "pre_assembly", "draft_assembly", "all"
+# "mapping": initial mapping
+# "pre_assembly" : generate pre_assembly for any long read assembler to use
+# "draft_assembly": automatic submit CA assembly job when pre-assembly is done
+# "all" : submit job for using Quiver to do final polish, not working yet
+target = pre_assembly
+
+
+# number of chunks for pre-assembly. 
+preassembly_num_chunk = 1 
+
+
+q_chunk_size = 1
+t_chunk_size = 3
+
+# "tmpdir" is for preassembly. A lot of small files are created and deleted during this process. 
+# It would be great to use a ramdisk for this. Setting tmpdir to an NFS mount will probably give very bad performance.
+tmpdir = /tmp
+
+# "big_tmpdir" is for quiver, better in a big disk
+big_tmpdir = /tmp
+
+# various trimming parameters
+min_cov = 8
+max_cov = 64
+trim_align = 50
+trim_plr = 50
+
+# number of processes used by blasr during the preassembly process
+q_nproc = 16 
+
+
+concurrent_jobs = 1
diff --git a/examples/StarCluster.cfg b/examples/StarCluster.cfg
new file mode 100644
index 0000000..db5f1f9
--- /dev/null
+++ b/examples/StarCluster.cfg
@@ -0,0 +1,24 @@
+[aws info]
+aws_access_key_id = your_key
+aws_secret_access_key = your_access_key
+aws_user_id = your_user_id
+
+[key starcluster]
+key_location = ~/.ec2/starcluster.rsa
+
+[cluster falcon]
+#The AMI image is based on ami-765b3e1f us-east-1 starcluster-base-ubuntu-12.04-x86_64 
+keyname = starcluster
+cluster_size = 1
+cluster_user = sgeadmin
+cluster_shell = bash
+master_image_id = ami-ef3c0e86
+master_instance_type = c3.8xlarge
+node_image_id = ami-ef3c0e86
+node_instance_type = c3.8xlarge
+availability_zone = us-east-1c
+
+[global]
+default_template = falcon
+ENABLE_EXPERIMENTAL=True
+
diff --git a/examples/install_note.sh b/examples/install_note.sh
new file mode 100644
index 0000000..785358d
--- /dev/null
+++ b/examples/install_note.sh
@@ -0,0 +1,84 @@
+# This is the script that will build everything needed to generate an assembly 
+# on top of the StarCluster Ubuntu AMI 
+HBAR_ROOT=/home
+mkdir -p $HBAR_ROOT/HBAR_ENV
+export HBAR_HOME=$HBAR_ROOT/HBAR_ENV/
+sudo apt-get install python-virtualenv
+virtualenv -p /usr/bin/python2.7 $HBAR_HOME
+cd $HBAR_HOME
+. bin/activate
+pip install numpy==1.6.2
+sudo apt-get install python-dev
+pip install numpy==1.6.2
+wget http://www.hdfgroup.org/ftp/HDF5/prev-releases/hdf5-1.8.9/src/hdf5-1.8.9.tar.gz
+tar zxvf hdf5-1.8.9.tar.gz
+cd hdf5-1.8.9
+./configure --prefix=$HBAR_HOME --enable-cxx
+make install
+cd ..
+wget http://h5py.googlecode.com/files/h5py-2.0.1.tar.gz
+tar zxvf h5py-2.0.1.tar.gz
+cd h5py-2.0.1
+python setup.py build --hdf5=$HBAR_HOME
+python setup.py install
+cd ..
+pip install git+https://github.com/PacificBiosciences/pbcore.git#pbcore
+sudo apt-get install git
+pip install git+https://github.com/PacificBiosciences/pbcore.git#pbcore
+pip install git+https://github.com/PacificBiosciences/pbdagcon.git#pbdagcon
+pip install git+https://github.com/PacificBiosciences/pbh5tools.git#pbh5tools
+pip install git+https://github.com/cschin/pypeFLOW.git#pypeflow
+pip install rdflib==3.4.0
+pip install git+https://github.com/PacificBiosciences/HBAR-DTK.git#hbar-dtk
+pip install git+https://github.com/PacificBiosciences/FALCON.git#falcon
+
+git clone https://github.com/PacificBiosciences/blasr.git
+cd blasr
+export HDF5INCLUDEDIR=/home/HBAR_ENV/include/
+export HDF5LIBDIR=/home/HBAR_ENV/lib/
+make
+cp alignment/bin/blasr ../bin/
+cp alignment/bin/sawriter ../bin/
+cp pbihdfutils/bin/samFilter  ../bin
+cp pbihdfutils/bin/samtoh5  ../bin
+cd ..
+
+
+wget http://downloads.sourceforge.net/project/boost/boost/1.47.0/boost_1_47_0.tar.gz
+tar zxvf boost_1_47_0.tar.gz
+cd boost_1_47_0/
+bash bootstrap.sh
+./b2 install -j 24 --prefix=$HBAR_ROOT/HBAR_ENV/boost
+cd ..
+
+sudo apt-get install libpcre3 libpcre3-dev
+wget http://downloads.sourceforge.net/project/swig/swig/swig-2.0.11/swig-2.0.11.tar.gz
+tar zxvf swig-2.0.11.tar.gz
+cd swig-2.0.11
+./configure --prefix=$HBAR_ROOT/HBAR_ENV
+make
+make install
+cd ..
+
+git clone https://github.com/PacificBiosciences/ConsensusCore.git
+cd ConsensusCore/
+python setup.py install --swig=$HBAR_ROOT/HBAR_ENV/bin/swig --boost=$HBAR_ROOT/HBAR_ENV/boost/include/
+cd ..
+
+pip install git+https://github.com/PacificBiosciences/GenomicConsensus.git#GenomicConsensus
+pip install git+https://github.com/PacificBiosciences/pbalign#pbalign
+
+wget http://downloads.sourceforge.net/project/mummer/mummer/3.23/MUMmer3.23.tar.gz
+tar zxvf MUMmer3.23.tar.gz
+cd MUMmer3.23/
+make install
+cd ..
+export PATH=$PATH:/home/HBAR_ENV/MUMmer3.23
+
+
+wget http://downloads.sourceforge.net/project/samtools/samtools/0.1.19/samtools-0.1.19.tar.bz2
+tar jxvf samtools-0.1.19.tar.bz2
+cd samtools-0.1.19
+make
+cp samtools ../bin
+cd ..
diff --git a/examples/readme.md b/examples/readme.md
new file mode 100644
index 0000000..0c83259
--- /dev/null
+++ b/examples/readme.md
@@ -0,0 +1,92 @@
+Running an Amazon EC2 instance that has HBAR-DTK + Falcon pre-installed
+=======================================================================
+
+1. Install the latest version of StarCluster
+```
+    git clone https://github.com/jtriley/StarCluster.git
+    cd StarCluster
+    python setup.py install #better in virtualenv
+```
+The stable version of StarCluster does not support the `c3` instance.  For
+assembly, using one node of the `c3.8xlarge` instance type is more convenient. In my
+test, I can finish a single E. coli genome within about one hour. Namely, one can
+assemble a bacterial genome for less than 5 bucks.
+
+2. Use the `StarCluster.cfg` as the configuration file for `StarCluster` to
+set up a `falcon` cluster
+
+3. Start the cluster 
+```
+    starcluster start falcon
+```
+
+4. log in to the cluster
+```
+    starcluster sshmaster falcon
+```
+
+5. set up the SGE
+```
+    cd /home/sge_setup
+    bash sge_setup.sh
+```
+
+6. There are already existing assembly results in `/home/Ecoli_ASM/`. Here I
+show how to reproduce them. First, create a new assembly working directory in
+`/mnt`, set it up, and run HBAR_WF3.py to get preassembled reads
+```
+    cd /mnt
+    mkdir test_asm
+    cd test_asm
+    cp /home/Ecoli_ASM/HBAR.cfg .
+    cp /home/Ecoli_ASM/input.fofn .
+    source /home/HBAR_ENV/bin/activate
+    HBAR_WF3.py HBAR.cfg
+```
+
+7. The next part of the assembly does not start automatically yet. The detailed
+steps are in the `run_asm.sh` script, which one can use to get contigs and
+consensus. 
+```
+    cp /home/Ecoli_ASM/run_asm.sh .
+    bash run_asm.sh
+```
+The consensus result is in `/mnt/consensus.fasta`. Since we did not do any
+consensus after the unitig step, one more run of quiver consensus may further
+improve the final assembly accuracy.
+
+8. A yeast (S. cerevisiae W303) data set is also included in the AMI. One can try
+to assemble it with a larger cluster setting.
+
+
+9. Here is the result of a timing test:
+```
+    (HBAR_ENV)root at master:/mnt/test_asm# time HBAR_WF3.py HBAR.cfg
+    
+    Your job 1 ("mapping_task_q00002_t000011416727c") has been submitted
+    Your job 2 ("qf_task_q00002a3e75f4c") has been submitted
+    Your job 3 ("mapping_task_q00003_t00001b667b504") has been submitted
+    Your job 4 ("qf_task_q000036974ef22") has been submitted
+    Your job 5 ("mapping_task_q00001_t000017bf52d9c") has been submitted
+    Your job 6 ("qf_task_q000010b31d960") has been submitted
+    Your job 7 ("pa_task_000001ee38aee") has been submitted
+    
+    
+    
+    real    26m51.030s
+    user    1m10.152s
+    sys     0m11.993s
+    
+    (HBAR_ENV)root at master:/mnt/test_asm# time bash run_asm.sh
+    [WARNING] This .cmp.h5 file lacks some of the QV data tracks that are required for optimal performance of the Quiver algorithm.  For optimal results use the ResequencingQVs workflow in SMRTPortal with bas.h5 files from an instrument using software version 1.3.1 or later.
+
+    real    13m2.945s
+    user    244m44.322s
+    sys     2m7.032s
+```
+For better results, one might run `quiver` twice. It is possible to get the whole assembly within one hour (~ 26 + 13 * 2 = 52 minutes). With the overhead of setting up, file transfer, etc., one can in principle assemble a bacterial genome on EC2 for less than 5 bucks.
+
+
+--
+Jason Chin, 01/18/2014
+
diff --git a/examples/run_asm.sh b/examples/run_asm.sh
new file mode 100644
index 0000000..35f7323
--- /dev/null
+++ b/examples/run_asm.sh
@@ -0,0 +1,24 @@
+# This script does the assembly and generates the quiver consensus after one gets preassembled reads
+# Modification will be needed for larger genome and different computational cluster setup
+
+# It should be run within the assembly working directory
+
+mkdir 3-asm-falcon/
+cd 3-asm-falcon/
+cat ../2-preads-falcon/pread_*.fa > preads.fa
+falcon_overlap.py  --min_len 8000 --n_core 24 --d_core 1 preads.fa > preads.ovlp
+falcon_asm.py preads.ovlp preads.fa
+falcon_fixasm.py
+
+export PATH=$PATH:/home/HBAR_ENV/MUMmer3.23
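+# Self-align all contigs to find contained duplicates. In the show-coords -H -T
+# output below, field 7 is the percent identity and field 9 is the id of the
+# contained contig, so identity > 96% marks a contig as redundant.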
+nucmer -maxmatch all_tigs.fa all_tigs.fa -p all_tigs_self >& /dev/null
+show-coords -o -H -T all_tigs_self.delta | grep CONTAINS | awk '$7>96' | awk '{print $9}' | sort -u > all_tigs_duplicated_ids
+remove_dup_ctg.py
+cat p-tigs_nodup.fa a-tigs_nodup.fa > pa-tigs_nodup.fa
+cat p-tigs_nodup.fa a-tigs_nodup.fa > /mnt/pa-tigs_nodup.fa
+
+find /home/data/Ecoli/ -name "*.bax.h5" > /mnt/h5_input.fofn
+cd /mnt
+pbalign.py --forQuiver --nproc 32  --tmpDir /mnt --maxHits 1  h5_input.fofn pa-tigs_nodup.fa output.cmp.h5 
+samtools faidx pa-tigs_nodup.fa
+quiver -j 24 output.cmp.h5 -r pa-tigs_nodup.fa -o variants.gff -o consensus.fasta
diff --git a/setup.py b/setup.py
new file mode 100755
index 0000000..9784b2b
--- /dev/null
+++ b/setup.py
@@ -0,0 +1,36 @@
+#!/usr/bin/env python
+
+from setuptools import setup
+
+from distutils.core import Extension
+
+setup(name='falcon_kit',
+      version='0.1.3',
+      description='a small toolkit for DNA sequence alignment, overlapping, and assembly',
+      author='Jason Chin',
+      author_email='jchin at pacificbiosciences.com',
+      packages=['falcon_kit'],
+      package_dir={'falcon_kit':'src/py/'},
+      ext_modules=[Extension('falcon_kit.DW_align', ['src/c/DW_banded.c'], 
+                   extra_link_args=["-fPIC",  "-O3"]),
+                   Extension('falcon_kit.kmer_lookup', ['src/c/kmer_lookup.c'],
+                   extra_link_args=["-fPIC",  "-O3"]),
+                   Extension('falcon_kit.falcon', ['src/c/DW_banded.c', 'src/c/kmer_lookup.c', 'src/c/falcon.c'],
+                   extra_link_args=["-fPIC",  "-O3"]),
+                   ],
+      scripts = ["src/py_scripts/falcon_asm.py", 
+                 "src/py_scripts/falcon_asm_dev.py",
+                 "src/py_scripts/falcon_overlap.py",
+                 "src/py_scripts/falcon_overlap2.py",
+                 "src/py_scripts/falcon_qrm.py",
+                 "src/py_scripts/falcon_fixasm.py",
+                 "src/py_scripts/falcon_dedup.py",
+                 "src/py_scripts/falcon_ucns_data.py",
+                 "src/py_scripts/falcon_utgcns.py",
+                 "src/py_scripts/falcon_sense.py",
+                 "src/py_scripts/get_rdata.py",
+                 "src/py_scripts/remove_dup_ctg.py"],
+      zip_safe = False,
+      install_requires=[ "pbcore >= 0.6.3", "networkx >= 1.7" ]
+     )
+
diff --git a/src/c/DW_banded.c b/src/c/DW_banded.c
new file mode 100755
index 0000000..44a6168
--- /dev/null
+++ b/src/c/DW_banded.c
@@ -0,0 +1,319 @@
+
+/*
+ * =====================================================================================
+ *
+ *       Filename:  DW_banded.c
+ *
+ *    Description:  A banded version for the O(ND) greedy sequence alignment algorithm 
+ *
+ *        Version:  0.1
+ *        Created:  07/20/2013 17:00:00
+ *       Revision:  none
+ *       Compiler:  gcc
+ *
+ *         Author:  Jason Chin, 
+ *        Company:  
+ *
+ * =====================================================================================
+
+ #################################################################################$$
+ # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
+ #
+ # All rights reserved.
+ #
+ # Redistribution and use in source and binary forms, with or without
+ # modification, are permitted (subject to the limitations in the
+ # disclaimer below) provided that the following conditions are met:
+ #
+ #  * Redistributions of source code must retain the above copyright
+ #  notice, this list of conditions and the following disclaimer.
+ #
+ #  * Redistributions in binary form must reproduce the above
+ #  copyright notice, this list of conditions and the following
+ #  disclaimer in the documentation and/or other materials provided
+ #  with the distribution.
+ #
+ #  * Neither the name of Pacific Biosciences nor the names of its
+ #  contributors may be used to endorse or promote products derived
+ #  from this software without specific prior written permission.
+ #
+ # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
+ # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
+ # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+ # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
+ # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ # SUCH DAMAGE.
+ #################################################################################$$
+ 
+
+*/
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <limits.h>
+#include <stdbool.h>
+#include "common.h"
+
+int compare_d_path(const void * a, const void * b)
+{
+    const d_path_data2 * arg1 = a;
+    const d_path_data2 * arg2 = b;
+    if (arg1->d - arg2->d == 0) {
+        return  arg1->k - arg2->k;
+    } else {
+        return arg1->d - arg2->d;
+    }
+}
+
+
+void d_path_sort( d_path_data2 * base, unsigned long max_idx) {
+    qsort(base, max_idx, sizeof(d_path_data2), compare_d_path);
+}
+
+d_path_data2 * get_dpath_idx( seq_coor_t d, seq_coor_t k, unsigned long max_idx, d_path_data2 * base) {
+    d_path_data2 d_tmp;
+    d_path_data2 *rtn;
+    d_tmp.d = d;
+    d_tmp.k = k;
+    rtn = (d_path_data2 *)  bsearch( &d_tmp, base, max_idx, sizeof(d_path_data2), compare_d_path);
+    //printf("dp %ld %ld %ld %ld %ld %ld %ld\n", (rtn)->d, (rtn)->k, (rtn)->x1, (rtn)->y1, (rtn)->x2, (rtn)->y2, (rtn)->pre_k);
+    
+    return rtn;
+
+}
+
+void print_d_path(  d_path_data2 * base, unsigned long max_idx) {
+    unsigned long idx;
+    for (idx = 0; idx < max_idx; idx++){
+        printf("dp %ld %ld %ld %ld %ld %ld %ld %ld\n",idx, (base+idx)->d, (base+idx)->k, (base+idx)->x1, (base+idx)->y1, (base+idx)->x2, (base+idx)->y2, (base+idx)->pre_k);
+    }
+}
+
+
+alignment * align(char * query_seq, seq_coor_t q_len,
+                  char * target_seq, seq_coor_t t_len,
+                  seq_coor_t band_tolerance,
+                  int get_aln_str) {
+    seq_coor_t * V;
+    seq_coor_t * U;  // array of matched bases for each "k"
+    seq_coor_t k_offset;
+    seq_coor_t d;
+    seq_coor_t k, k2;
+    seq_coor_t best_m;  // the best "matches" for each d
+    seq_coor_t min_k, new_min_k;
+    seq_coor_t max_k, new_max_k;
+    seq_coor_t pre_k;
+    seq_coor_t x, y;
+    seq_coor_t cd;
+    seq_coor_t ck;
+    seq_coor_t cx, cy, nx, ny;
+    seq_coor_t max_d;
+    seq_coor_t band_size;
+    unsigned long d_path_idx = 0;
+    unsigned long max_idx = 0;
+
+    d_path_data2 * d_path;
+    d_path_data2 * d_path_aux;
+    path_point * aln_path;
+    seq_coor_t aln_path_idx;
+    alignment * align_rtn;
+    seq_coor_t aln_pos;
+    seq_coor_t i;
+    bool aligned = false;
+
+    //printf("debug: %ld %ld\n", q_len, t_len);
+    //printf("%s\n", query_seq);
+   
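+    // bound the search: tolerate at most ~30% of the combined sequence
+    // length in edit distance before giving up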
+    max_d = (int) (0.3*(q_len + t_len));
+
+    band_size = band_tolerance * 2;
+
+    V = calloc( max_d * 2 + 1, sizeof(seq_coor_t) );
+    U = calloc( max_d * 2 + 1, sizeof(seq_coor_t) );
+    
+    k_offset = max_d;
+    
+    // We should probably use a hashmap to store the backtracking information to save memory allocation time
+    // This O(MN) block allocation scheme is convenient for now, but it is slower for very long sequences
+    d_path = calloc( max_d * (band_size + 1 ) * 2 + 1, sizeof(d_path_data2) );
+    
+    aln_path = calloc( q_len + t_len + 1, sizeof(path_point) );
+
+    align_rtn = calloc( 1, sizeof(alignment));
+    align_rtn->t_aln_str = calloc( q_len + t_len + 1, sizeof(char));
+    align_rtn->q_aln_str = calloc( q_len + t_len + 1, sizeof(char));
+    align_rtn->aln_str_size = 0;
+    align_rtn->aln_q_s = 0;
+    align_rtn->aln_q_e = 0;
+    align_rtn->aln_t_s = 0;
+    align_rtn->aln_t_e = 0;
+
+    //printf("max_d: %lu, band_size: %lu\n", max_d, band_size);
+    best_m = -1;
+    min_k = 0;
+    max_k = 0;
+    d_path_idx = 0; 
+    max_idx = 0;
+    for (d = 0; d < max_d; d ++ ) {
+        if (max_k - min_k > band_size) {
+            break;
+        }
+ 
+        for (k = min_k; k <= max_k;  k += 2) {
+
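+            // Myers' greedy step: reach diagonal k from whichever of the
+            // neighboring diagonals (k-1 or k+1) extends furthest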
+            if ( k == min_k || (k != max_k && V[ k - 1 + k_offset ] < V[ k + 1 + k_offset]) ) {
+                pre_k = k + 1;
+                x = V[ k + 1 + k_offset];
+            } else {
+                pre_k = k - 1;
+                x = V[ k - 1 + k_offset] + 1;
+            }
+            y = x - k;
+            d_path[d_path_idx].d = d;
+            d_path[d_path_idx].k = k;
+            d_path[d_path_idx].x1 = x;
+            d_path[d_path_idx].y1 = y;
+
+            while ( x < q_len && y < t_len && query_seq[x] == target_seq[y] ){
+                x++;
+                y++;
+            }
+
+            d_path[d_path_idx].x2 = x;
+            d_path[d_path_idx].y2 = y;
+            d_path[d_path_idx].pre_k = pre_k;
+            d_path_idx ++;
+
+            V[ k + k_offset ] = x;
+            U[ k + k_offset ] = x + y;
+            
+            if ( x + y > best_m) {
+                best_m = x + y;
+            }
+
+            if ( x >= q_len || y >= t_len) {
+                aligned = true;
+                max_idx = d_path_idx;
+                break;
+            }
+        }
+        
+        // For banding
+        new_min_k = max_k;
+        new_max_k = min_k;
+
+        for (k2 = min_k; k2 <= max_k;  k2 += 2) {
+            if (U[ k2 + k_offset] >= best_m - band_tolerance ) {
+                if ( k2 < new_min_k ) {
+                    new_min_k = k2;
+                }
+                if ( k2 > new_max_k ) {
+                    new_max_k = k2;
+                }
+            }
+        }
+        
+        max_k = new_max_k + 1;
+        min_k = new_min_k - 1;
+        
+        // For no banding
+        // max_k ++;
+        // min_k --;
+
+        // For debugging 
+        // printf("min_max_k,d, %ld %ld %ld\n", min_k, max_k, d);
+        
+        if (aligned == true) {
+            align_rtn->aln_q_e = x;
+            align_rtn->aln_t_e = y;
+            align_rtn->dist = d;
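+            // each aligned column is either a match (contributing 2 to x+y)
+            // or an indel (contributing 1 to x+y and 1 to d), so the column
+            // count is (x + y + d) / 2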
+            align_rtn->aln_str_size = (x + y + d) / 2;
+            align_rtn->aln_q_s = 0;
+            align_rtn->aln_t_s = 0;
+
+            d_path_sort(d_path, max_idx);
+            //print_d_path(d_path, max_idx);
+
+            if (get_aln_str > 0) {
+                cd = d;
+                ck = k;
+                aln_path_idx = 0;
+                while (cd >= 0 && aln_path_idx < q_len + t_len + 1) {    
+                    d_path_aux = (d_path_data2 *) get_dpath_idx( cd, ck, max_idx, d_path);
+                    aln_path[aln_path_idx].x = d_path_aux -> x2;
+                    aln_path[aln_path_idx].y = d_path_aux -> y2;
+                    aln_path_idx ++;
+                    aln_path[aln_path_idx].x = d_path_aux -> x1;
+                    aln_path[aln_path_idx].y = d_path_aux -> y1;
+                    aln_path_idx ++;
+                    ck = d_path_aux -> pre_k;
+                    cd -= 1;
+                }
+                aln_path_idx --;
+                cx = aln_path[aln_path_idx].x;
+                cy = aln_path[aln_path_idx].y;
+                align_rtn->aln_q_s = cx;
+                align_rtn->aln_t_s = cy;
+                aln_pos = 0;
+                while ( aln_path_idx > 0 ) {
+                    aln_path_idx --;
+                    nx = aln_path[aln_path_idx].x;
+                    ny = aln_path[aln_path_idx].y;
+                    if (cx == nx && cy == ny){
+                        continue;
+                    }
+                    if (nx == cx && ny != cy){ //advance in y
+                        for (i = 0; i <  ny - cy; i++) {
+                            align_rtn->q_aln_str[aln_pos + i] = '-';
+                        }
+                        for (i = 0; i <  ny - cy; i++) {
+                            align_rtn->t_aln_str[aln_pos + i] = target_seq[cy + i];
+                        }
+                        aln_pos += ny - cy;
+                    } else if (nx != cx && ny == cy){ //advance in x
+                        for (i = 0; i <  nx - cx; i++) {
+                            align_rtn->q_aln_str[aln_pos + i] = query_seq[cx + i];
+                        }
+                        for (i = 0; i <  nx - cx; i++) {
+                            align_rtn->t_aln_str[aln_pos + i] = '-';
+                        }
+                        aln_pos += nx - cx;
+                    } else {
+                        for (i = 0; i <  nx - cx; i++) {
+                            align_rtn->q_aln_str[aln_pos + i] = query_seq[cx + i];
+                        }
+                        for (i = 0; i <  ny - cy; i++) {
+                            align_rtn->t_aln_str[aln_pos + i] = target_seq[cy + i];
+                        }
+                        aln_pos += ny - cy;
+                    }
+                    cx = nx;
+                    cy = ny;
+                }
+                align_rtn->aln_str_size = aln_pos;
+            }
+            break;
+        }
+    }
+
+    free(V);
+    free(U);
+    free(d_path);
+    free(aln_path);
+    return align_rtn;
+}
+
+
+void free_alignment(alignment * aln) {
+    free(aln->q_aln_str);
+    free(aln->t_aln_str);
+    free(aln);
+}
diff --git a/src/c/Makefile b/src/c/Makefile
new file mode 100755
index 0000000..607dcde
--- /dev/null
+++ b/src/c/Makefile
@@ -0,0 +1,20 @@
+DW_align.so: DW_banded.c common.h
+	gcc DW_banded.c -O3 -shared -fPIC -o DW_align.so
+
+kmer_lookup.so: kmer_lookup.c common.h
+	gcc kmer_lookup.c -O3 -shared -fPIC -o kmer_lookup.so
+
+#falcon: DW_banded.c common.h kmer_lookup.c falcon.c 
+#	gcc DW_banded.c kmer_lookup.c falcon.c -O4 -o falcon -fPIC 
+
+falcon.so: falcon.c common.h DW_banded.c kmer_lookup.c
+	gcc DW_banded.c kmer_lookup.c falcon.c -O3 -shared -fPIC -o falcon.so 
+
+#falcon2.so: falcon.c common.h DW_banded_2.c kmer_lookup.c
+#	gcc DW_banded_2.c kmer_lookup.c falcon.c -O3 -shared -fPIC -o falcon2.so 
+
+clean:
+	rm falcon *.so
+
+all: DW_align.so kmer_lookup.so falcon.so
+
diff --git a/src/c/Makefile.osx b/src/c/Makefile.osx
new file mode 100755
index 0000000..99fcce7
--- /dev/null
+++ b/src/c/Makefile.osx
@@ -0,0 +1,16 @@
+DW_align.so: DW_banded.c common.h
+	gcc DW_banded.c -O3 -shared -fPIC -o DW_align.so -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/include/ -L/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/lib
+
+kmer_lookup.so: kmer_lookup.c common.h
+	gcc kmer_lookup.c -O3 -shared -fPIC -o kmer_lookup.so -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/include/ -L/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/lib
+
+falcon: DW_banded.c common.h kmer_lookup.c falcon.c 
+	gcc DW_banded.c kmer_lookup.c falcon.c -O4 -o falcon -fPIC -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/include/ -L/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/lib
+
+falcon.so: falcon.c common.h DW_banded.c kmer_lookup.c
+	gcc DW_banded.c kmer_lookup.c falcon.c -O3 -shared -fPIC -o falcon.so -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/include/ -L/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/lib
+
+
+
+all: DW_align.so kmer_lookup.so falcon.so falcon
+
diff --git a/src/c/common.h b/src/c/common.h
new file mode 100755
index 0000000..e694c3b
--- /dev/null
+++ b/src/c/common.h
@@ -0,0 +1,177 @@
+
+/*
+ * =====================================================================================
+ *
+ *       Filename:  common.h
+ *
+ *    Description:  Common declarations for the code base 
+ *
+ *        Version:  0.1
+ *        Created:  07/16/2013 07:46:23 AM
+ *       Revision:  none
+ *       Compiler:  gcc
+ *
+ *         Author:  Jason Chin, 
+ *        Company:  
+ *
+ * =====================================================================================
+
+ #################################################################################$$
+ # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
+ #
+ # All rights reserved.
+ #
+ # Redistribution and use in source and binary forms, with or without
+ # modification, are permitted (subject to the limitations in the
+ # disclaimer below) provided that the following conditions are met:
+ #
+ #  * Redistributions of source code must retain the above copyright
+ #  notice, this list of conditions and the following disclaimer.
+ #
+ #  * Redistributions in binary form must reproduce the above
+ #  copyright notice, this list of conditions and the following
+ #  disclaimer in the documentation and/or other materials provided
+ #  with the distribution.
+ #
+ #  * Neither the name of Pacific Biosciences nor the names of its
+ #  contributors may be used to endorse or promote products derived
+ #  from this software without specific prior written permission.
+ #
+ # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
+ # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
+ # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+ # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
+ # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ # SUCH DAMAGE.
+ #################################################################################$$
+ */
+
+typedef long int seq_coor_t; 
+
+typedef struct {    
+    seq_coor_t aln_str_size ;
+    seq_coor_t dist ;
+    seq_coor_t aln_q_s;
+    seq_coor_t aln_q_e;
+    seq_coor_t aln_t_s;
+    seq_coor_t aln_t_e;
+    char * q_aln_str;
+    char * t_aln_str;
+
+} alignment;
+
+
+typedef struct {
+    seq_coor_t pre_k;
+    seq_coor_t x1;
+    seq_coor_t y1;
+    seq_coor_t x2;
+    seq_coor_t y2;
+} d_path_data;
+
+typedef struct {
+    seq_coor_t d;
+    seq_coor_t k;
+    seq_coor_t pre_k;
+    seq_coor_t x1;
+    seq_coor_t y1;
+    seq_coor_t x2;
+    seq_coor_t y2;
+} d_path_data2;
+
+typedef struct {
+    seq_coor_t x;
+    seq_coor_t y;
+} path_point;
+
+typedef struct {    
+    seq_coor_t start;
+    seq_coor_t last;
+    seq_coor_t count;
+} kmer_lookup;
+
+typedef unsigned char base;
+typedef base * seq_array;
+typedef seq_coor_t seq_addr;
+typedef seq_addr * seq_addr_array;
+
+
+typedef struct {
+    seq_coor_t count;
+    seq_coor_t * query_pos;
+    seq_coor_t * target_pos;
+} kmer_match;
+
+
+typedef struct {
+    seq_coor_t s1;
+    seq_coor_t e1;
+    seq_coor_t s2;
+    seq_coor_t e2;
+    long int score;
+} aln_range;
+
+
+typedef struct {
+    char * sequence;
+    unsigned int * eff_cov;
+} consensus_data;
+
+kmer_lookup * allocate_kmer_lookup (seq_coor_t);
+void init_kmer_lookup ( kmer_lookup *,  seq_coor_t );
+void free_kmer_lookup(kmer_lookup *);
+
+seq_array allocate_seq(seq_coor_t);
+void init_seq_array( seq_array, seq_coor_t);
+void free_seq_array(seq_array);
+
+seq_addr_array allocate_seq_addr(seq_coor_t size); 
+
+void free_seq_addr_array(seq_addr_array);
+
+
+aln_range *  find_best_aln_range(kmer_match *, 
+                              seq_coor_t, 
+                              seq_coor_t, 
+                              seq_coor_t); 
+
+void free_aln_range( aln_range *);
+
+kmer_match * find_kmer_pos_for_seq( char *, 
+                                    seq_coor_t, 
+                                    unsigned int K, 
+                                    seq_addr_array, 
+                                    kmer_lookup * );
+
+void free_kmer_match(kmer_match *);
+
+
+
+void add_sequence ( seq_coor_t, 
+                    unsigned int, 
+                    char *, 
+                    seq_coor_t,
+                    seq_addr_array, 
+                    seq_array, 
+                    kmer_lookup *); 
+
+void mask_k_mer(seq_coor_t, kmer_lookup *, seq_coor_t);
+
+alignment * align(char *, seq_coor_t,
+                  char *, seq_coor_t,
+                  seq_coor_t,
+                  int); 
+
+void free_alignment(alignment *);
+
+
+void free_consensus_data(consensus_data *);
+
diff --git a/src/c/falcon.c b/src/c/falcon.c
new file mode 100755
index 0000000..ba7eb9c
--- /dev/null
+++ b/src/c/falcon.c
@@ -0,0 +1,613 @@
+/*
+ * =====================================================================================
+ *
+ *       Filename:  falcon.c
+ *
+ *    Description:  
+ *
+ *        Version:  0.1
+ *        Created:  07/20/2013 17:00:00
+ *       Revision:  none
+ *       Compiler:  gcc
+ *
+ *         Author:  Jason Chin, 
+ *        Company:  
+ *
+ * =====================================================================================
+
+ #################################################################################$$
+ # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
+ #
+ # All rights reserved.
+ #
+ # Redistribution and use in source and binary forms, with or without
+ # modification, are permitted (subject to the limitations in the
+ # disclaimer below) provided that the following conditions are met:
+ #
+ #  * Redistributions of source code must retain the above copyright
+ #  notice, this list of conditions and the following disclaimer.
+ #
+ #  * Redistributions in binary form must reproduce the above
+ #  copyright notice, this list of conditions and the following
+ #  disclaimer in the documentation and/or other materials provided
+ #  with the distribution.
+ #
+ #  * Neither the name of Pacific Biosciences nor the names of its
+ #  contributors may be used to endorse or promote products derived
+ #  from this software without specific prior written permission.
+ #
+ # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
+ # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
+ # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+ # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
+ # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ # SUCH DAMAGE.
+ #################################################################################$$
+ */
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <limits.h>
+#include <string.h>
+#include <assert.h>
+#include "common.h"
+
+typedef struct {
+    seq_coor_t t_pos;
+    unsigned int delta;
+    char q_base;
+    unsigned int q_id;
+} align_tag_t;
+
+typedef struct {
+    seq_coor_t len;
+    align_tag_t * align_tags;
+} align_tags_t;
+
+
+typedef struct {
+    seq_coor_t len;
+    char * name;
+    char * seq;
+
+} consensus_seq_t;
+
+
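+/* Convert one pairwise alignment into "align tags": for each alignment column
+ * we record the target position it maps to (t_pos), the number of query bases
+ * inserted since the last target base (delta), and the query base itself
+ * ('-' for a deletion; '*' when the local match count around the column falls
+ * below local_match_count_threshold, which the consensus caller ignores). */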
+align_tags_t * get_align_tags( char * aln_q_seq, 
+                               char * aln_t_seq, 
+                               seq_coor_t aln_seq_len,
+                               aln_range * range,
+                               unsigned long q_id,
+                               unsigned long local_match_count_window,
+                               unsigned long local_match_count_threshold,
+                               seq_coor_t t_offset) {
+
+#define LONGEST_INDEL_ALLOWED 6 
+
+    char q_base;
+    char t_base;
+    align_tags_t * tags;
+    seq_coor_t i, j, jj, k;
+    seq_coor_t match_count;
+
+    tags = calloc( 1, sizeof(align_tags_t) );
+    tags->len = aln_seq_len; 
+    tags->align_tags = calloc( aln_seq_len + 1, sizeof(align_tag_t) );
+    i = range->s1 - 1;
+    j = range->s2 - 1;
+    match_count = 0;
+    jj = 0;
+    for (k = 0; k< local_match_count_window && k < aln_seq_len; k++) {
+        if (aln_q_seq[k]  == aln_t_seq[k] ) {
+            match_count ++;
+        }
+    }
+    for (k = 0; k < aln_seq_len; k++) {
+        if (aln_q_seq[k] != '-') {
+            i ++;
+            jj ++;
+        } 
+        if (aln_t_seq[k] != '-') {
+            j ++;
+            jj = 0;
+        }
+       
+        if (local_match_count_threshold > 0) {
+            if (k < aln_seq_len - local_match_count_window && aln_q_seq[k + local_match_count_window]  == aln_t_seq[k + local_match_count_window] ) {
+                match_count ++;
+            }
+
+            if (k > local_match_count_window && aln_q_seq[k - local_match_count_window] == aln_t_seq[k - local_match_count_window] ) {
+                match_count --;
+            }
+
+            if (match_count < 0) {
+                match_count = 0;
+            }
+        }
+       
+        if ( j + t_offset >= 0) {
+            (tags->align_tags[k]).t_pos = j + t_offset;
+            (tags->align_tags[k]).delta = jj;
+            if (local_match_count_threshold > 0 && jj == 0 && match_count < local_match_count_threshold) {
+                (tags->align_tags[k]).q_base = '*';
+            } else {
+                (tags->align_tags[k]).q_base = aln_q_seq[k];
+            }
+            (tags->align_tags[k]).q_id = q_id;
+        }
+        //if (jj > LONGEST_INDEL_ALLOWED) {
+        //   break;
+        //}
+    }
+    // sentinel at the end
+    //k = aln_seq_len;
+    tags->len = k; 
+    (tags->align_tags[k]).t_pos = -1;
+    (tags->align_tags[k]).delta = -1;
+    (tags->align_tags[k]).q_base = ' ';
+    (tags->align_tags[k]).q_id = UINT_MAX;
+    return tags;
+}
+
+void free_align_tags( align_tags_t * tags) {
+    free( tags->align_tags );
+    free( tags );
+}
+
+
+int compare_tags(const void * a, const void * b)
+{
+    const align_tag_t * arg1 = a;
+    const align_tag_t * arg2 = b;
+    if (arg1->delta - arg2->delta == 0) {
+        return  arg1->q_base - arg2->q_base;
+    } else {
+        return arg1->delta - arg2->delta;
+    }
+}
+
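+/* Call a consensus from the stacked align tags: tags are bucketed per target
+ * position, sorted by (delta, q_base), and for every (position, delta) column
+ * the most frequent base is emitted when it is supported by more than half of
+ * the effective coverage at that position; calls at coverage <= min_cov are
+ * reported in lower case. */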
+consensus_data * get_cns_from_align_tags( align_tags_t ** tag_seqs, unsigned long n_tag_seqs, unsigned t_len, unsigned min_cov ) {
+
+    seq_coor_t i, j, t_pos, tmp_pos;
+    unsigned int * coverage;
+    unsigned int * local_nbase;
+    unsigned int * aux_index;
+
+    unsigned int cur_delta;
+    unsigned int counter[5] = {0, 0, 0, 0, 0};
+    unsigned int k;
+    unsigned int max_count;
+    unsigned int max_count_index;
+    seq_coor_t consensus_index;
+    seq_coor_t c_start, c_end, max_start;
+    unsigned int cov_score, max_cov_score;
+    consensus_data * consensus;
+    //char * consensus;
+
+
+
+    align_tag_t ** tag_seq_index;
+
+    coverage = calloc( t_len, sizeof(unsigned int) );
+    local_nbase = calloc( t_len, sizeof(unsigned int) );
+    aux_index = calloc( t_len, sizeof(unsigned int) );
+    tag_seq_index = calloc( t_len, sizeof(align_tag_t *) );
+
+    for (i = 0; i < n_tag_seqs; i++) {
+        for (j = 0; j < tag_seqs[i]->len; j++) {
+            if (tag_seqs[i]->align_tags[j].delta == 0 && tag_seqs[i]->align_tags[j].q_base != '*') {
+                t_pos = tag_seqs[i]->align_tags[j].t_pos;
+                coverage[ t_pos ] ++;
+            }
+            local_nbase[ tag_seqs[i]->align_tags[j].t_pos ] ++;
+        }
+    }
+
+
+    for (i = 0; i < t_len; i++) {
+        tag_seq_index[i] = calloc( local_nbase[i] + 1, sizeof(align_tag_t) );
+    }
+
+    for (i = 0; i < n_tag_seqs; i++) {
+        for (j = 0; j < tag_seqs[i]->len; j++) {
+            t_pos = tag_seqs[i]->align_tags[j].t_pos;
+            tag_seq_index[ t_pos ][ aux_index[ t_pos ] ] = tag_seqs[i]->align_tags[j];
+            aux_index[ t_pos ] ++;
+        }
+    }
+
+
+    consensus_index = 0;
+
+    
+    consensus = calloc( 1, sizeof(consensus_data) );
+    consensus->sequence = calloc( t_len * 2 + 1, sizeof(char) );
+    consensus->eff_cov = calloc( t_len * 2 + 1, sizeof(unsigned int) );
+
+    for (i = 0; i < t_len; i++) {
+        qsort(tag_seq_index[i], local_nbase[i], sizeof(align_tag_t), compare_tags);
+        cur_delta = 0;
+        for (j = 0; j <= local_nbase[i]; j++) {
+            max_count = 0;
+            max_count_index = 0;
+            if (j == local_nbase[i] || tag_seq_index[i][j].delta != cur_delta) {
+                for (k = 0; k < 5; k ++) {
+                    if (counter[k] > max_count) {
+                        max_count = counter[k];
+                        max_count_index = k;
+                    }
+                    counter[k] = 0;  // reset the counter for the next column
+                }
+                cur_delta = tag_seq_index[i][j].delta;
+                if (max_count > coverage[i] * 0.5) { 
+                    if (max_count_index < 4) {  // index 4 is '-': a deletion wins, emit nothing
+                        char cns_base = "ACGT"[max_count_index];
+                        if (coverage[i] < min_cov + 1) {
+                            cns_base += 'a' - 'A';  // report low-coverage calls in lower case
+                        }
+                        consensus->sequence[consensus_index] = cns_base;
+                        consensus->eff_cov[consensus_index] = coverage[i];
+                        consensus_index ++;
+                    }
+                    //printf("c:%c\n", consensus->sequence[consensus_index-1]);
+                }
+
+            } 
+
+            if (j == local_nbase[i]) break;
+
+            switch (tag_seq_index[i][j].q_base) {
+                case 'A':
+                    counter[0] ++;
+                    break;
+                case 'C':
+                    counter[1] ++;
+                    break;
+                case 'G':
+                    counter[2] ++;
+                    break;
+                case 'T':
+                    counter[3] ++;
+                    break;
+                case '-':
+                    counter[4] ++;
+                    break;
+                default:
+                    break;
+            }
+            /*
+            printf("%ld %ld %ld %u %c %u\n", i, j, tag_seq_index[i][j].t_pos,
+                                                   tag_seq_index[i][j].delta,
+                                                   tag_seq_index[i][j].q_base,
+                                                   tag_seq_index[i][j].q_id);
+            */
+        }
+    }
+   
+    //printf("%s\n", consensus);
+
+    for (i = 0; i < t_len; i++) {
+        free(tag_seq_index[i]);
+    }
+    free(tag_seq_index);
+    free(aux_index);
+    free(coverage);
+    free(local_nbase);
+    return consensus;
+}
+
+//const unsigned int K = 8;
+
+consensus_data * generate_consensus( char ** input_seq, 
+                           unsigned int n_seq, 
+                           unsigned min_cov, 
+                           unsigned K,
+                           unsigned long local_match_count_window,
+                           unsigned long local_match_count_threshold,
+                           double min_idt) {
+
+    unsigned int i, j, k;
+    unsigned int seq_count;
+    unsigned int aligned_seq_count;
+    kmer_lookup * lk_ptr;
+    seq_array sa_ptr;
+    seq_addr_array sda_ptr;
+    kmer_match * kmer_match_ptr;
+    aln_range * arange_;
+    aln_range * arange;
+    alignment * aln;
+    align_tags_t * tags;
+    align_tags_t ** tags_list;
+    //char * consensus;
+    consensus_data * consensus;
+    double max_diff;
+    max_diff = 1.0 - min_idt;
+
+    seq_count = n_seq;
+    //for (j=0; j < seq_count; j++) {
+    //    printf("seq_len: %u %u\n", j, strlen(input_seq[j]));
+    //};
+    fflush(stdout);
+
+    tags_list = calloc( seq_count, sizeof(align_tags_t *) );
+    lk_ptr = allocate_kmer_lookup( 1 << (K * 2) );
+    sa_ptr = allocate_seq( (seq_coor_t) strlen( input_seq[0]) );
+    sda_ptr = allocate_seq_addr( (seq_coor_t) strlen( input_seq[0]) );
+    add_sequence( 0, K, input_seq[0], strlen(input_seq[0]), sda_ptr, sa_ptr, lk_ptr);
+    //mask_k_mer(1 << (K * 2), lk_ptr, 16);
+
+    aligned_seq_count = 0;
+    for (j=1; j < seq_count; j++) {
+
+        //printf("seq_len: %ld %u\n", j, strlen(input_seq[j]));
+
+        kmer_match_ptr = find_kmer_pos_for_seq(input_seq[j], strlen(input_seq[j]), K, sda_ptr, lk_ptr);
+#define INDEL_ALLOWENCE_0 6
+
+        arange = find_best_aln_range(kmer_match_ptr, K, K * INDEL_ALLOWENCE_0, 5);  // narrow band to avoid aligning through big indels
+
+        //printf("1:%ld %ld %ld %ld\n", arange_->s1, arange_->e1, arange_->s2, arange_->e2);
+
+        //arange = find_best_aln_range2(kmer_match_ptr, K, K * INDEL_ALLOWENCE_0, 5);  // narrow band to avoid aligning through big indels
+
+        //printf("2:%ld %ld %ld %ld\n\n", arange->s1, arange->e1, arange->s2, arange->e2);
+        
+#define INDEL_ALLOWENCE_1 400
+        if (arange->e1 - arange->s1 < 100 || arange->e2 - arange->s2 < 100 ||
+            labs( (arange->e1 - arange->s1 ) - (arange->e2 - arange->s2) ) > INDEL_ALLOWENCE_1) {
+            free_kmer_match( kmer_match_ptr);
+            free_aln_range(arange);
+            continue;
+        }
+        //printf("%ld %s\n", strlen(input_seq[j]), input_seq[j]);
+        //printf("%ld %s\n\n", strlen(input_seq[0]), input_seq[0]);
+        
+        
+#define INDEL_ALLOWENCE_2 150
+
+        aln = align(input_seq[j]+arange->s1, arange->e1 - arange->s1 ,
+                    input_seq[0]+arange->s2, arange->e2 - arange->s2 , 
+                    INDEL_ALLOWENCE_2, 1);
+        if (aln->aln_str_size > 500 && ((double) aln->dist / (double) aln->aln_str_size) < max_diff) {
+            tags_list[aligned_seq_count] = get_align_tags( aln->q_aln_str, 
+                                                           aln->t_aln_str, 
+                                                           aln->aln_str_size, 
+                                                           arange, j, 
+                                                           local_match_count_window,
+                                                           local_match_count_threshold,
+                                                           0); 
+            aligned_seq_count ++;
+        }
+        /***
+        for (k = 0; k < tags_list[j]->len; k++) {
+            printf("%ld %d %c\n", tags_list[j]->align_tags[k].t_pos,
+                                   tags_list[j]->align_tags[k].delta,
+                                   tags_list[j]->align_tags[k].q_base);
+        }
+        ***/
+        free_aln_range(arange);
+        free_alignment(aln);
+        free_kmer_match( kmer_match_ptr);
+    }
+
+    consensus = get_cns_from_align_tags( tags_list, aligned_seq_count, strlen(input_seq[0]), min_cov );
+    //free(consensus);
+    free_seq_addr_array(sda_ptr);
+    free_seq_array(sa_ptr);
+    free_kmer_lookup(lk_ptr);
+    for (j=0; j < aligned_seq_count; j++) {
+        free_align_tags(tags_list[j]);
+    }
+    free(tags_list);
+    return consensus;
+}
+
+consensus_data * generate_utg_consensus( char ** input_seq, 
+                           seq_coor_t *offset,
+                           unsigned int n_seq, 
+                           unsigned min_cov, 
+                           unsigned K,
+                           double min_idt) {
+
+    unsigned int i, j, k;
+    unsigned int seq_count;
+    unsigned int aligned_seq_count;
+    aln_range * arange;
+    alignment * aln;
+    align_tags_t * tags;
+    align_tags_t ** tags_list;
+    //char * consensus;
+    consensus_data * consensus;
+    double max_diff;
+    seq_coor_t utg_len;
+    seq_coor_t r_len;
+    max_diff = 1.0 - min_idt;
+    
+
+    seq_count = n_seq;
+    /***
+    for (j=0; j < seq_count; j++) {
+        printf("seq_len: %u %u\n", j, strlen(input_seq[j]));
+    };
+    fflush(stdout);
+    ***/
+    tags_list = calloc( seq_count+1, sizeof(align_tags_t *) );
+    utg_len =  strlen(input_seq[0]);
+    aligned_seq_count = 0;
+    arange = calloc( 1, sizeof(aln_range) );
+
+    arange->s1 = 0;
+    arange->e1 = strlen(input_seq[0]);
+    arange->s2 = 0;
+    arange->e2 = strlen(input_seq[0]); 
+    tags_list[aligned_seq_count] = get_align_tags( input_seq[0], input_seq[0], 
+                                                   strlen(input_seq[0]), arange, 0, 
+                                                   12, 0, 0); 
+    aligned_seq_count += 1;
+    for (j=1; j < seq_count; j++) {
+        arange->s1 = 0;
+        arange->e1 = strlen(input_seq[j])-1;
+        arange->s2 = 0;
+        arange->e2 = strlen(input_seq[j])-1; 
+
+        r_len = strlen(input_seq[j]);
+        //printf("seq_len: %u %u\n", j, r_len);
+        if ( offset[j] < 0) {
+            if ((r_len + offset[j]) < 128) {
+                continue;
+            }
+            if ( r_len + offset[j] < utg_len ) {
+
+                //printf("1: %ld %u %u\n", offset[j], r_len, utg_len);
+                aln = align(input_seq[j] - offset[j], r_len + offset[j] ,
+                            input_seq[0], r_len + offset[j] , 
+                            500, 1);
+            } else {
+                //printf("2: %ld %u %u\n", offset[j], r_len, utg_len);
+                aln = align(input_seq[j] - offset[j], utg_len ,
+                            input_seq[0], utg_len , 
+                            500, 1);
+            }
+            offset[j] = 0;
+
+        } else {
+            if ( offset[j] > utg_len - 128) {
+                continue;
+            }
+            if ( offset[j] + r_len > utg_len ) {
+                //printf("3: %ld %u %u\n", offset[j], r_len, utg_len);
+                aln = align(input_seq[j], utg_len - offset[j] ,
+                            input_seq[0]+offset[j], utg_len - offset[j], 
+                            500, 1);
+            } else {
+                //printf("4: %ld %u %u\n", offset[j], r_len, utg_len);
+                aln = align(input_seq[j], r_len ,
+                            input_seq[0]+offset[j], r_len , 
+                            500, 1);
+            }
+        }
+        if (aln->aln_str_size > 500 && ((double) aln->dist / (double) aln->aln_str_size) < max_diff) {
+            tags_list[aligned_seq_count] = get_align_tags( aln->q_aln_str, aln->t_aln_str, 
+                                                           aln->aln_str_size, arange, j, 
+                                                           12, 0, offset[j]); 
+            aligned_seq_count ++;
+        }
+        free_alignment(aln);
+    }
+    free_aln_range(arange);
+    consensus = get_cns_from_align_tags( tags_list, aligned_seq_count, utg_len, 0 );
+    //free(consensus);
+    for (j=0; j < aligned_seq_count; j++) {
+        free_align_tags(tags_list[j]);
+    }
+    free(tags_list);
+    return consensus;
+}
+
+
+void free_consensus_data( consensus_data * consensus ){
+    free(consensus->sequence);
+    free(consensus->eff_cov);
+    free(consensus);
+}
+
+/***
+void main() {
+    unsigned int j;
+    char small_buffer[1024];
+    char big_buffer[65536];
+    char ** input_seq;
+    char ** seq_id;
+    int seq_count;
+    char * consensus;
+
+    input_seq = calloc( 501, sizeof(char *));
+    seq_id = calloc( 501, sizeof(char *));
+    
+    while(1) {
+        seq_count = 0;
+        while (1) {
+
+            scanf("%s", small_buffer);
+            seq_id[seq_count] = calloc( strlen(small_buffer) + 1, sizeof(char));
+            strcpy(seq_id[seq_count], small_buffer);
+
+            scanf("%s", big_buffer);
+            input_seq[seq_count] = calloc( strlen(big_buffer) + 1 , sizeof(char));
+            strcpy(input_seq[seq_count], big_buffer);
+
+            if (strcmp(seq_id[seq_count], "+") == 0) {
+                break;
+            }
+            if (strcmp(seq_id[seq_count], "-") == 0) {
+                break;
+            }
+            //printf("%s\n", seq_id[seq_count]);
+            seq_count += 1;
+            if (seq_count > 500) break;
+        }
+        //printf("sc: %d\n", seq_count);
+        if (seq_count < 10 && strcmp(seq_id[seq_count], "-") != 0 ) continue;
+        if (seq_count < 10 && strcmp(seq_id[seq_count], "-") == 0 ) break;
+
+        consensus = generate_consensus(input_seq, seq_count, 8, 8);
+        if (strlen(consensus) > 500) {
+            printf(">%s\n%s\n", seq_id[0], consensus);
+        }
+        fflush(stdout);
+        free(consensus);
+        for (j=0; j < seq_count; j++) {
+            free(seq_id[j]);
+            free(input_seq[j]);
+        };
+
+    }
+    for (j=0; j < seq_count; j++) {
+        free(seq_id[j]);
+        free(input_seq[j]);
+    };
+    free(seq_id);
+    free(input_seq);
+}
+***/
diff --git a/src/c/kmer_lookup.c b/src/c/kmer_lookup.c
new file mode 100755
index 0000000..d901b03
--- /dev/null
+++ b/src/c/kmer_lookup.c
@@ -0,0 +1,594 @@
+/*
+ * =====================================================================================
+ *
+ *       Filename:  kmer_lookup.c
+ *
+ *    Description:  
+ *
+ *        Version:  0.1
+ *        Created:  07/20/2013 17:00:00
+ *       Revision:  none
+ *       Compiler:  gcc
+ *
+ *         Author:  Jason Chin, 
+ *        Company:  
+ *
+ * =====================================================================================
+
+ #################################################################################$$
+ # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
+ #
+ # All rights reserved.
+ #
+ # Redistribution and use in source and binary forms, with or without
+ # modification, are permitted (subject to the limitations in the
+ # disclaimer below) provided that the following conditions are met:
+ #
+ #  * Redistributions of source code must retain the above copyright
+ #  notice, this list of conditions and the following disclaimer.
+ #
+ #  * Redistributions in binary form must reproduce the above
+ #  copyright notice, this list of conditions and the following
+ #  disclaimer in the documentation and/or other materials provided
+ #  with the distribution.
+ #
+ #  * Neither the name of Pacific Biosciences nor the names of its
+ #  contributors may be used to endorse or promote products derived
+ #  from this software without specific prior written permission.
+ #
+ # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
+ # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
+ # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+ # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
+ # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ # SUCH DAMAGE.
+ #################################################################################$$
+ */
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <limits.h>
+#include "common.h"
+
+
+const unsigned int KMERMATCHINC = 10000;
+
+int compare_seq_coor(const void * a, const void * b) {
+    const seq_coor_t * arg1 = a;
+    const seq_coor_t * arg2 = b;
+    return ((* arg1) > (* arg2)) - ((* arg1) < (* arg2));  // avoid truncating a long difference to int
+}
+
+
+kmer_lookup * allocate_kmer_lookup ( seq_coor_t size ) {
+    kmer_lookup * kl;
+    seq_coor_t i;
+
+    //printf("%lu is allocated for kmer lookup\n", size);
+    kl = (kmer_lookup *)  malloc( size * sizeof(kmer_lookup) );
+    init_kmer_lookup( kl, size);
+    return kl;
+}
+
+void init_kmer_lookup ( kmer_lookup * kl,  seq_coor_t size ) {
+    seq_coor_t i;
+    //printf("%lu is allocated for kmer lookup\n", size);
+    for (i=0; i<size; i++) {
+        kl[i].start = LONG_MAX;
+        kl[i].last = LONG_MAX;
+        kl[i].count = 0;
+    }
+}
+
+
+void free_kmer_lookup( kmer_lookup *  ptr) {
+    free(ptr);
+}
+
+seq_array allocate_seq(seq_coor_t size) {
+    seq_array sa;
+    sa  = (seq_array) malloc( size * sizeof(base) ); 
+    init_seq_array( sa, size);
+    return sa;
+}
+
+void init_seq_array( seq_array sa, seq_coor_t size) {
+    seq_coor_t i;
+    for (i=0; i++; i<size) {
+        sa[i] = 0xff;
+    }
+}
+
+void free_seq_array( seq_array sa) {
+    free(sa);
+}
+
+seq_addr_array allocate_seq_addr(seq_coor_t size) {
+    return (seq_addr_array) calloc( size, sizeof(seq_addr));
+}
+
+void free_seq_addr_array(seq_addr_array sda) {
+    free(sda);
+}
+
+seq_coor_t get_kmer_bitvector(seq_array sa, unsigned int K) {
+    unsigned int i;
+    seq_coor_t kmer_bv = 0;
+
+    for (i = 0; i < K; i++) {
+        kmer_bv <<= 2;
+        kmer_bv |= (unsigned int) sa[i];
+    }
+
+    return kmer_bv;
+}
+
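+/* Encode a sequence into the 2-bit array starting at `start` and index every
+ * k-mer: the first occurrence of a k-mer sets lk[kmer].start, and later
+ * occurrences are chained through sda[] via lk[kmer].last, so all positions
+ * of a k-mer can be walked in order. */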
+void add_sequence ( seq_coor_t start, 
+                    unsigned int K, 
+                    char * seq, 
+                    seq_coor_t seq_len,
+                    seq_addr_array sda, 
+                    seq_array sa, 
+                    kmer_lookup * lk ) {
+
+    seq_coor_t i;
+    seq_coor_t kmer_bv;
+    seq_coor_t kmer_mask;
+
+    kmer_mask = 0;
+    for (i = 0; i < K; i++) {
+        kmer_mask <<= 2;
+        kmer_mask |= 0x00000003;
+    }
+
+    for (i = 0; i < seq_len; i++) {
+        switch ( seq[i] ) {
+            case 'A':
+                sa[ start + i ] = 0;
+                break;
+            case 'C':
+                sa[ start + i ] = 1;
+                break;
+            case 'G':
+                sa[ start + i ] = 2;
+                break;
+            case 'T':
+                sa[ start + i ] = 3;
+        }
+    }
+    kmer_bv = get_kmer_bitvector( sa + start, K);
+    for (i = 0; i < seq_len - K;  i++) {
+        //printf("%lu %lu\n", i, kmer_bv);
+        //printf("lk before init: %lu %lu %lu\n", kmer_bv, lk[kmer_bv].start, lk[kmer_bv].last);
+        if (lk[kmer_bv].start == LONG_MAX) {
+            lk[kmer_bv].start = start + i;
+            lk[kmer_bv].last = start + i;
+            lk[kmer_bv].count += 1;
+            //printf("lk init: %lu %lu %lu\n", kmer_bv, lk[kmer_bv].start, lk[kmer_bv].last);
+        } else {
+            sda[ lk[kmer_bv].last ] = start + i;
+            lk[kmer_bv].count += 1;
+            lk[kmer_bv].last = start + i;
+            //printf("lk change: %lu %lu %lu\n", kmer_bv, lk[kmer_bv].start, lk[kmer_bv].last);
+        }
+        kmer_bv <<= 2;
+        kmer_bv |= sa[ start + i + K];
+        kmer_bv &= kmer_mask;
+    }
+}
+
+
+void mask_k_mer(seq_coor_t size, kmer_lookup * kl, seq_coor_t threshold) {
+    seq_coor_t i;
+    for (i=0; i<size; i++) {
+        if (kl[i].count > threshold) {
+            kl[i].start = LONG_MAX;
+            kl[i].last = LONG_MAX;
+            //kl[i].count = 0;
+        }
+    }
+}
+
+
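+/* Collect (query_pos, target_pos) k-mer hits between a query sequence and the
+ * indexed target sequences; the query is sampled every K/2 bases, and k-mers
+ * that are absent or masked out by mask_k_mer() (start == LONG_MAX) are
+ * skipped. */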
+kmer_match * find_kmer_pos_for_seq( char * seq, seq_coor_t seq_len, unsigned int K,
+                    seq_addr_array sda, 
+                    kmer_lookup * lk) {
+    seq_coor_t i;
+    seq_coor_t kmer_bv;
+    seq_coor_t kmer_pos;
+    seq_coor_t next_kmer_pos;
+    unsigned int half_K;
+    seq_coor_t kmer_match_rtn_allocation_size = KMERMATCHINC;
+    kmer_match * kmer_match_rtn;
+    base * sa;
+
+    kmer_match_rtn = (kmer_match *) malloc( sizeof(kmer_match) );
+    kmer_match_rtn->count = 0;
+    kmer_match_rtn->query_pos = (seq_coor_t *) calloc( kmer_match_rtn_allocation_size, sizeof( seq_coor_t ) );
+    kmer_match_rtn->target_pos = (seq_coor_t *) calloc( kmer_match_rtn_allocation_size, sizeof( seq_coor_t ) );
+
+    sa = calloc( seq_len, sizeof(base) );
+
+    for (i = 0; i < seq_len; i++) {
+        switch ( seq[i] ) {
+            case 'A':
+                sa[ i ] = 0;
+                break;
+            case 'C':
+                sa[ i ] = 1;
+                break;
+            case 'G':
+                sa[ i ] = 2;
+                break;
+            case 'T':
+                sa[ i ] = 3;
+        }
+    }
+
+
+    half_K = K >> 1;
+    for (i = 0; i < seq_len - K;  i += half_K) {
+        kmer_bv = get_kmer_bitvector(sa + i, K);
+        if (lk[kmer_bv].start == LONG_MAX) {  //for high count k-mers
+            continue;
+        }
+        kmer_pos = lk[ kmer_bv ].start;
+        next_kmer_pos = sda[ kmer_pos ];
+        kmer_match_rtn->query_pos[ kmer_match_rtn->count ] = i;
+        kmer_match_rtn->target_pos[ kmer_match_rtn->count ] = kmer_pos;
+        kmer_match_rtn->count += 1;
+        if (kmer_match_rtn->count > kmer_match_rtn_allocation_size - 1000) {
+            kmer_match_rtn_allocation_size += KMERMATCHINC;
+            kmer_match_rtn->query_pos = (seq_coor_t *) realloc( kmer_match_rtn->query_pos, 
+                                                                   kmer_match_rtn_allocation_size  * sizeof(seq_coor_t) );
+            kmer_match_rtn->target_pos = (seq_coor_t *) realloc( kmer_match_rtn->target_pos, 
+                                                                    kmer_match_rtn_allocation_size  * sizeof(seq_coor_t) );
+        }
+        while ( next_kmer_pos > kmer_pos ){
+            kmer_pos = next_kmer_pos;
+            next_kmer_pos = sda[ kmer_pos ];
+            kmer_match_rtn->query_pos[ kmer_match_rtn->count ] = i;
+            kmer_match_rtn->target_pos[ kmer_match_rtn->count ] = kmer_pos;
+            kmer_match_rtn->count += 1;
+            if (kmer_match_rtn->count > kmer_match_rtn_allocation_size - 1000) {
+                kmer_match_rtn_allocation_size += KMERMATCHINC;
+                kmer_match_rtn->query_pos = (seq_coor_t *) realloc( kmer_match_rtn->query_pos, 
+                                                                       kmer_match_rtn_allocation_size  * sizeof(seq_coor_t) );
+                kmer_match_rtn->target_pos = (seq_coor_t *) realloc( kmer_match_rtn->target_pos, 
+                                                                        kmer_match_rtn_allocation_size  * sizeof(seq_coor_t) );
+            }
+        }
+    }
+    free(sa);
+    return kmer_match_rtn;
+}
+
+void free_kmer_match( kmer_match * ptr) {
+    free(ptr->query_pos);
+    free(ptr->target_pos);
+    free(ptr);
+}
+
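+/* Estimate the best aligned range from k-mer hits: hits are binned by
+ * diagonal (query_pos - target_pos), only hits near the most populated
+ * diagonal band are kept, and a maximum-scoring run over the kept hits
+ * (a Kadane-style scan that penalizes large query gaps) sets the start/end
+ * coordinates. */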
+aln_range* find_best_aln_range(kmer_match * km_ptr, 
+                              seq_coor_t K, 
+                              seq_coor_t bin_size, 
+                              seq_coor_t count_th) {
+    seq_coor_t i;
+    seq_coor_t j;
+    seq_coor_t q_min, q_max, t_min, t_max;
+    seq_coor_t * d_count;
+    seq_coor_t * q_coor;
+    seq_coor_t * t_coor;
+    aln_range * arange;
+
+    long int d, d_min, d_max;
+    long int cur_score;
+    long int max_score;
+    long int max_k_mer_count;
+    long int max_k_mer_bin;
+    seq_coor_t cur_start;
+    seq_coor_t cur_pos;
+    seq_coor_t max_start;
+    seq_coor_t max_end;
+    seq_coor_t kmer_dist;
+
+    arange = calloc(1 , sizeof(aln_range));
+
+    q_min = LONG_MAX;
+    q_max = 0;
+    t_min = LONG_MAX;
+    t_max = 0;
+
+    d_min = LONG_MAX;
+    d_max = LONG_MIN;
+
+    for (i = 0; i <  km_ptr->count; i++ ) {
+        if ( km_ptr -> query_pos[i] < q_min) {
+            q_min =  km_ptr->query_pos[i];
+        }
+        if ( km_ptr -> query_pos[i] > q_max) {
+            q_max =  km_ptr->query_pos[i];
+        }
+        if ( km_ptr -> target_pos[i] < t_min) {
+            t_min =  km_ptr->target_pos[i];
+        }
+        if ( km_ptr -> target_pos[i] > t_max) {
+            t_max =  km_ptr->target_pos[i];
+        }
+        d = (long int) km_ptr->query_pos[i] - (long int) km_ptr->target_pos[i];
+        if ( d < d_min ) {
+            d_min = d;
+        }
+        if ( d > d_max ) {
+            d_max = d;
+        }
+    }
+
+    //printf("%lu %ld %ld\n" , km_ptr->count, d_min, d_max);
+    d_count = calloc( (d_max - d_min)/bin_size + 1, sizeof(seq_coor_t) );
+    q_coor = calloc( km_ptr->count, sizeof(seq_coor_t) );
+    t_coor = calloc( km_ptr->count, sizeof(seq_coor_t) );
+
+    for (i = 0; i <  km_ptr->count; i++ ) {
+        d = (long int) (km_ptr->query_pos[i]) - (long int) (km_ptr->target_pos[i]);
+        d_count[ (d - d_min)/ (long int) bin_size ] += 1;
+        q_coor[i] = LONG_MAX;
+        t_coor[i] = LONG_MAX;
+    }
+
+    j = 0;
+    max_k_mer_count = 0;
+    max_k_mer_bin = LONG_MAX;
+    for (i = 0; i <  km_ptr->count; i++ ) {
+        d = (long int) (km_ptr->query_pos[i]) - (long int) (km_ptr->target_pos[i]);
+        if ( d_count[ (d - d_min)/ (long int) bin_size ] > max_k_mer_count) {
+            max_k_mer_count =  d_count[ (d - d_min)/ (long int) bin_size ];
+            max_k_mer_bin = (d - d_min)/ (long int) bin_size;
+        }
+    }
+    //printf("k_mer: %lu %lu\n" , max_k_mer_count, max_k_mer_bin);
+    
+    if ( max_k_mer_bin != LONG_MAX && max_k_mer_count > count_th ) {
+        for (i = 0; i <  km_ptr->count; i++ ) {
+            d = (long int) (km_ptr->query_pos[i]) - (long int) (km_ptr->target_pos[i]);
+            if ( labs( ( (d - d_min)/ (long int) bin_size ) - max_k_mer_bin ) > 5 ) {
+                continue;
+            }
+            if (d_count[ (d - d_min)/ (long int) bin_size ] > count_th) {
+                q_coor[j] = km_ptr->query_pos[i];  
+                t_coor[j] = km_ptr->target_pos[i];
+                //printf("d_count: %lu %lu\n" ,i, d_count[(d - d_min)/ (long int) bin_size]);
+                //printf("coor: %lu %lu\n" , q_coor[j], t_coor[j]);
+                j ++;
+            }
+        }
+    }
+
+    if (j > 1) {
+        arange->s1 = q_coor[0];
+        arange->e1 = q_coor[0];
+        arange->s2 = t_coor[0];
+        arange->e2 = t_coor[0];
+        arange->score = 0;
+
+        max_score = 0;
+        cur_score = 0;
+        cur_start = 0;
+
+        for (i = 1; i < j; i++) {
+            cur_score += 32 - (q_coor[i] - q_coor[i-1]);
+            //printf("deltaD, %lu %ld\n", q_coor[i] - q_coor[i-1], cur_score);
+            if (cur_score < 0) {
+                cur_score = 0;
+                cur_start = i;
+            } else if (cur_score > max_score) {
+                arange->s1 = q_coor[cur_start];
+                arange->s2 = t_coor[cur_start];
+                arange->e1 = q_coor[i];
+                arange->e2 = t_coor[i];
+                max_score = cur_score;
+                arange->score = max_score;
+                //printf("%lu %lu %lu %lu\n", arange.s1, arange.e1, arange.s2, arange.e2);
+            }
+        }
+
+    } else {
+        arange->s1 = 0;
+        arange->e1 = 0;
+        arange->s2 = 0;
+        arange->e2 = 0;
+        arange->score = 0;
+    }
+
+    // printf("free\n");
+
+    free(d_count);
+    free(q_coor);
+    free(t_coor);
+    return arange;
+}
+
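+/* Alternative range finder: sort the hit diagonals, take the densest diagonal
+ * window (width ~5% of the combined sequence spans), then chain the hits
+ * inside that window with a simple gap-penalized dynamic program and trace
+ * back the highest-scoring chain. */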
+aln_range* find_best_aln_range2(kmer_match * km_ptr, 
+                                seq_coor_t K, 
+                                seq_coor_t bin_width, 
+                                seq_coor_t count_th) {
+
+    seq_coor_t * d_coor;
+    seq_coor_t * hit_score;
+    seq_coor_t * hit_count;
+    seq_coor_t * last_hit;
+    seq_coor_t max_q, max_t;
+    seq_coor_t s, e, max_s, max_e, max_span, d_s, d_e, delta, d_len;
+    seq_coor_t px, py, cx, cy;
+    seq_coor_t max_hit_idx;
+    seq_coor_t max_hit_score, max_hit_count;
+    seq_coor_t i, j;
+    seq_coor_t candidate_idx, max_d, d;
+
+    aln_range * arange;
+
+    arange = calloc(1 , sizeof(aln_range));
+
+    d_coor = calloc( km_ptr->count, sizeof(seq_coor_t) );
+
+    max_q = -1;
+    max_t = -1;
+
+    for (i = 0; i <  km_ptr->count; i++ ) {
+        d_coor[i] = km_ptr->query_pos[i] - km_ptr->target_pos[i];
+        max_q = max_q > km_ptr->query_pos[i] ? max_q : km_ptr->query_pos[i];
+        max_t = max_t > km_ptr->target_pos[i] ? max_t : km_ptr->target_pos[i];
+
+    }
+
+    qsort(d_coor, km_ptr->count, sizeof(seq_coor_t), compare_seq_coor);
+
+
+    s = 0;
+    e = 0;
+    max_s = -1;
+    max_e = -1;
+    max_span = -1;
+    delta = (long int) ( 0.05 * ( max_q + max_t ) );
+    d_len =  km_ptr->count;
+    d_s = -1;
+    d_e = -1;
+    while (1) {
+        d_s = d_coor[s];
+        d_e = d_coor[e];
+        while (d_e < d_s + delta && e < d_len-1) {
+            e += 1;
+            d_e = d_coor[e];
+        }
+        if ( max_span == -1 || e - s > max_span ) {
+            max_span = e - s;
+            max_s = s;
+            max_e = e;
+        }
+        s += 1;
+        if (s == d_len || e == d_len) {
+            break;
+        }
+    }
+
+    if (max_s == -1 || max_e == -1 || max_e - max_s < 32) {
+        arange->s1 = 0;
+        arange->e1 = 0;
+        arange->s2 = 0;
+        arange->e2 = 0;
+        arange->score = 0;
+        free(d_coor);
+        return arange;
+    }
+
+    last_hit = calloc( km_ptr->count, sizeof(seq_coor_t) );
+    hit_score = calloc( km_ptr->count, sizeof(seq_coor_t) );
+    hit_count = calloc( km_ptr->count, sizeof(seq_coor_t) );
+
+    for (i = 0; i <  km_ptr->count; i++ ) {
+        last_hit[i] = -1;
+        hit_score[i] = 0;
+        hit_count[i] = 0;
+    }
+    max_hit_idx = -1;
+    max_hit_score = 0;
+    for (i = 0; i < km_ptr->count; i ++)  {
+        cx = km_ptr->query_pos[i];
+        cy = km_ptr->target_pos[i];
+        d = cx - cy; 
+        if ( d < d_coor[max_s] || d > d_coor[max_e] ) continue;
+
+        j = i - 1;
+        candidate_idx = -1;
+        max_d = 65535;
+        while (1) {
+            if ( j < 0 ) break;
+            px = km_ptr->query_pos[j];
+            py = km_ptr->target_pos[j];
+            d = px - py;
+            if ( d < d_coor[max_s] || d > d_coor[max_e] ) {
+                j--;
+                continue;
+            }
+            if (cx - px > 320) break; // this constant bounds how large an alignment gap is considered
+            if (cy > py && cx - px + cy - py < max_d && cy - py <= 320 ) {
+                max_d = cx - px + cy - py;
+                candidate_idx = j;
+            }
+            j--;
+        }
+        if (candidate_idx != -1) {
+            last_hit[i] = candidate_idx;
+            hit_score[i] = hit_score[candidate_idx] + (64 - max_d);
+            hit_count[i] = hit_count[candidate_idx] + 1;
+            if (hit_score[i] < 0) {
+                hit_score[i] = 0;
+                hit_count[i] = 0;
+            }
+        } else {
+            hit_score[i] = 0;
+            hit_count[i] = 0;
+        }
+        if (hit_score[i] > max_hit_score) {
+            max_hit_score = hit_score[i];
+            max_hit_count = hit_count[i];
+            max_hit_idx = i;
+        }
+
+    }
+    if (max_hit_idx == -1) {
+        arange->s1 = 0;
+        arange->e1 = 0;
+        arange->s2 = 0;
+        arange->e2 = 0;
+        arange->score = 0;
+        free(d_coor);
+        free(last_hit);
+        free(hit_score);
+        free(hit_count);
+        return arange;
+    }
+
+    arange->score = max_hit_count + 1;
+    arange->e1 = km_ptr->query_pos[max_hit_idx];
+    arange->e2 = km_ptr->target_pos[max_hit_idx];
+    i = max_hit_idx;
+    while (last_hit[i] != -1) {
+        i = last_hit[i];
+    }
+    arange->s1 = km_ptr->query_pos[i];
+    arange->s2 = km_ptr->target_pos[i];
+
+    free(d_coor);
+    free(last_hit);
+    free(hit_score);
+    free(hit_count);
+    return arange;
+}
+
+void free_aln_range( aln_range * arange) {
+    free(arange);
+}
diff --git a/src/py/__init__.py b/src/py/__init__.py
new file mode 100644
index 0000000..2e1685f
--- /dev/null
+++ b/src/py/__init__.py
@@ -0,0 +1,39 @@
+
+#################################################################################$$
+# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted (subject to the limitations in the
+# disclaimer below) provided that the following conditions are met:
+#
+#  * Redistributions of source code must retain the above copyright
+#  notice, this list of conditions and the following disclaimer.
+#
+#  * Redistributions in binary form must reproduce the above
+#  copyright notice, this list of conditions and the following
+#  disclaimer in the documentation and/or other materials provided
+#  with the distribution.
+#
+#  * Neither the name of Pacific Biosciences nor the names of its
+#  contributors may be used to endorse or promote products derived
+#  from this software without specific prior written permission.
+#
+# NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
+# GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
+# BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+#################################################################################$$
+
+from .falcon_kit import *
diff --git a/src/py/falcon_kit.py b/src/py/falcon_kit.py
new file mode 100644
index 0000000..46b776e
--- /dev/null
+++ b/src/py/falcon_kit.py
@@ -0,0 +1,193 @@
+
+#################################################################################$$
+# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted (subject to the limitations in the
+# disclaimer below) provided that the following conditions are met:
+#
+#  * Redistributions of source code must retain the above copyright
+#  notice, this list of conditions and the following disclaimer.
+#
+#  * Redistributions in binary form must reproduce the above
+#  copyright notice, this list of conditions and the following
+#  disclaimer in the documentation and/or other materials provided
+#  with the distribution.
+#
+#  * Neither the name of Pacific Biosciences nor the names of its
+#  contributors may be used to endorse or promote products derived
+#  from this software without specific prior written permission.
+#
+# NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
+# GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
+# BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+#################################################################################$$
+
+
+from ctypes import *
+import os
+module_path = os.path.split(__file__)[0]
+
+
+seq_coor_t = c_long
+base_t = c_uint8
+
+class KmerLookup(Structure):
+    _fields_ = [("start", seq_coor_t),
+                ("last", seq_coor_t),
+                ("count", seq_coor_t)]
+
+class KmerMatch(Structure):
+    _fields_ = [ ("count", seq_coor_t),
+                ("query_pos", POINTER(seq_coor_t)),
+                ("target_pos", POINTER(seq_coor_t)) ]
+
+class AlnRange(Structure):
+    _fields_ = [ ("s1", seq_coor_t),
+                 ("e1", seq_coor_t),
+                 ("s2", seq_coor_t),
+                 ("e2", seq_coor_t),
+                 ("score", c_long) ]
+
+class ConsensusData(Structure):
+    _fields_ = [ ("sequence", c_char_p),
+                 ("eff_cov", POINTER(c_uint)) ]
+
+kup = CDLL(os.path.join(module_path, "kmer_lookup.so"))
+
+kup.allocate_kmer_lookup.argtypes =  [seq_coor_t] 
+kup.allocate_kmer_lookup.restype = POINTER(KmerLookup)
+kup.init_kmer_lookup.argtypes = [POINTER(KmerLookup), seq_coor_t]
+kup.free_kmer_lookup.argtypes = [POINTER(KmerLookup)]
+
+kup.allocate_seq.argtypes = [seq_coor_t]
+kup.allocate_seq.restype = POINTER(base_t)
+kup.init_seq_array.argtypes = [POINTER(base_t), seq_coor_t]
+kup.free_seq_array.argtypes = [POINTER(base_t)]
+
+kup.allocate_seq_addr.argtypes = [seq_coor_t]
+kup.allocate_seq_addr.restype = POINTER(seq_coor_t)
+kup.free_seq_addr_array.argtypes = [POINTER(seq_coor_t)]
+
+kup.add_sequence.argtypes = [ seq_coor_t, c_uint, POINTER(c_char), seq_coor_t, POINTER(seq_coor_t), 
+                              POINTER(c_uint8), POINTER(KmerLookup) ]
+kup.mask_k_mer.argtypes =[ c_long, POINTER(KmerLookup), c_long ]
+kup.find_kmer_pos_for_seq.argtypes = [ POINTER(c_char), seq_coor_t, c_uint, POINTER(seq_coor_t), 
+                                       POINTER(KmerLookup)]
+kup.find_kmer_pos_for_seq.restype = POINTER(KmerMatch)
+kup.free_kmer_match.argtypes = [ POINTER(KmerMatch) ]
+
+
+kup.find_best_aln_range.argtypes = [POINTER(KmerMatch), seq_coor_t, seq_coor_t, seq_coor_t]
+kup.find_best_aln_range.restype = POINTER(AlnRange)
+kup.find_best_aln_range2.argtypes = [POINTER(KmerMatch), seq_coor_t, seq_coor_t, seq_coor_t]
+kup.find_best_aln_range2.restype = POINTER(AlnRange)
+kup.free_aln_range.argtypes = [POINTER(AlnRange)]
+
+
+class Alignment(Structure):
+    """
+    typedef struct {    
+        seq_coor_t aln_str_size ;
+        seq_coor_t dist ;
+        seq_coor_t aln_q_s;
+        seq_coor_t aln_q_e;
+        seq_coor_t aln_t_s;
+        seq_coor_t aln_t_e;
+        char * q_aln_str;
+        char * t_aln_str;
+    } alignment;
+    """
+    _fields_ = [ ("aln_str_size", seq_coor_t),
+                 ("dist", seq_coor_t),
+                 ("aln_q_s", seq_coor_t),
+                 ("aln_q_e", seq_coor_t),
+                 ("aln_t_s", seq_coor_t),
+                 ("aln_t_e", seq_coor_t),
+                 ("q_aln_str", c_char_p),
+                 ("t_aln_str", c_char_p)]
+
+
+DWA = CDLL(os.path.join(module_path, "DW_align.so"))
+DWA.align.argtypes = [ POINTER(c_char), c_long, POINTER(c_char), c_long, c_long, c_int ] 
+DWA.align.restype = POINTER(Alignment)
+DWA.free_alignment.argtypes = [POINTER(Alignment)]
+
+
+
+falcon = CDLL(os.path.join(module_path,"falcon.so"))
+
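+# The argument order mirrors generate_consensus() in falcon.c:
+# (input_seq, n_seq, min_cov, K, local_match_count_window,
+#  local_match_count_threshold, min_idt); the two window/threshold arguments
+# are `unsigned long` on the C side, hence c_ulong below.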
+falcon.generate_consensus.argtypes = [POINTER(c_char_p), c_uint, c_uint, c_uint, c_ulong, c_ulong, c_double]
+falcon.generate_consensus.restype = POINTER(ConsensusData)
+falcon.free_consensus_data.argtypes = [ POINTER(ConsensusData) ]
+
+
+
+
+def get_alignment(seq1, seq0):
+    K = 8
+    lk_ptr = kup.allocate_kmer_lookup( 1 << (K * 2) )
+    sa_ptr = kup.allocate_seq( len(seq0) )
+    sda_ptr = kup.allocate_seq_addr( len(seq0) )
+    kup.add_sequence( 0, K, seq0, len(seq0), sda_ptr, sa_ptr, lk_ptr)
+
+    kmer_match_ptr = kup.find_kmer_pos_for_seq(seq1, len(seq1), K, sda_ptr, lk_ptr)
+    kmer_match = kmer_match_ptr[0]
+    aln_range_ptr = kup.find_best_aln_range(kmer_match_ptr, K, K*10, 50)
+    #x,y = zip( * [ (kmer_match.query_pos[i], kmer_match.target_pos[i]) for i in range(kmer_match.count )] )
+    kup.free_kmer_match(kmer_match_ptr)
+    aln_range = aln_range_ptr[0]
+    s1, e1, s2, e2 = aln_range.s1, aln_range.e1, aln_range.s2, aln_range.e2
+    kup.free_aln_range(aln_range_ptr)
+
+    if e1 - s1 > 500:
+        #s1 = 0 if s1 < 14 else s1 - 14
+        #s2 = 0 if s2 < 14 else s2 - 14
+        e1 = len(seq1) if e1 >= len(seq1)-2*K else e1 + K*2
+        e2 = len(seq0) if e2 >= len(seq0)-2*K else e2 + K*2
+        
+        alignment = DWA.align(seq1[s1:e1], e1-s1,
+                              seq0[s2:e2], e2-s2,
+                              100,
+                              0)
+        #print seq1[s1:e1]
+        #print seq0[s2:e2]
+        #if alignment[0].aln_str_size > 500:
+
+        #aln_str1 = alignment[0].q_aln_str
+        #aln_str0 = alignment[0].t_aln_str
+        aln_size = alignment[0].aln_str_size
+        aln_dist = alignment[0].dist
+        aln_q_s = alignment[0].aln_q_s
+        aln_q_e = alignment[0].aln_q_e
+        aln_t_s = alignment[0].aln_t_s
+        aln_t_e = alignment[0].aln_t_e
+        
+        #print "X,",alignment[0].aln_q_s, alignment[0].aln_q_e
+        #print "Y,",alignment[0].aln_t_s, alignment[0].aln_t_e
+        
+        #print aln_str1
+        #print aln_str0
+    
+        DWA.free_alignment(alignment)
+
+    kup.free_seq_addr_array(sda_ptr)
+    kup.free_seq_array(sa_ptr)
+    kup.free_kmer_lookup(lk_ptr)
+    if e1 - s1 > 500 and aln_size > 500:
+        return s1, s1+aln_q_e-aln_q_s, s2, s2+aln_t_e-aln_t_s, aln_size, aln_dist
+    else:
+        return None
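+
+# A minimal usage sketch (assuming kmer_lookup.so, DW_align.so and falcon.so
+# have been built alongside this module, and that seq0/seq1 are plain ACGT
+# strings as elsewhere in this code base):
+#
+#   from falcon_kit import get_alignment
+#   hit = get_alignment(seq1, seq0)
+#   if hit is not None:
+#       s1, e1, s2, e2, aln_size, aln_dist = hit
+#       print "%d bp aligned, %.2f%% identity" % (
+#           aln_size, 100.0 * (1.0 - float(aln_dist) / aln_size))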
diff --git a/src/py_scripts/falcon_asm.py b/src/py_scripts/falcon_asm.py
new file mode 100755
index 0000000..1534b44
--- /dev/null
+++ b/src/py_scripts/falcon_asm.py
@@ -0,0 +1,1154 @@
+#!/usr/bin/env python
+
+#################################################################################$$
+# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted (subject to the limitations in the
+# disclaimer below) provided that the following conditions are met:
+#
+#  * Redistributions of source code must retain the above copyright
+#  notice, this list of conditions and the following disclaimer.
+#
+#  * Redistributions in binary form must reproduce the above
+#  copyright notice, this list of conditions and the following
+#  disclaimer in the documentation and/or other materials provided
+#  with the distribution.
+#
+#  * Neither the name of Pacific Biosciences nor the names of its
+#  contributors may be used to endorse or promote products derived
+#  from this software without specific prior written permission.
+#
+# NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
+# GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
+# BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+#################################################################################$$
+
+from pbcore.io import FastaReader
+import networkx as nx
+import os
+import shlex
+import sys
+import subprocess
+
+DEBUG_LOG_LEVEL = 0
+
+class SGNode(object):
+    """
+    class representing a node in the string graph
+    """
+    def __init__(self, node_name):
+        self.name = node_name
+        self.out_edges = []
+        self.in_edges = []
+    def add_out_edge(self, out_edge):
+        self.out_edges.append(out_edge)
+    def add_in_edge(self, in_edge):
+        self.in_edges.append(in_edge)
+
+class SGEdge(object):
+    """
+    class representing an edge in the string graph
+    """
+    def __init__(self, in_node, out_node):
+        self.in_node = in_node
+        self.out_node = out_node
+        self.attr = {}
+    def set_attribute(self, attr, value):
+        self.attr[attr] = value
+
+def reverse_end( node_id ):
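+    """
+    flip a node end: each read contributes two nodes, "<read_id>:B" and
+    "<read_id>:E", for its two ends, so the dual of "x:B" is "x:E"
+    """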
+    node_id, end = node_id.split(":")
+    new_end = "B" if end == "E" else "E"
+    return node_id + ":" + new_end
+
+class StringGraph(object):
+    """
+    class representing the string graph
+    """
+    def __init__(self):
+        self.nodes = {}
+        self.edges = {}
+        self.n_mark = {}
+        self.e_reduce = {}
+        self.repeat_overlap = {}
+        
+    def add_node(self, node_name):
+        """ 
+        add a node to the graph, given a node name
+        """
+        if node_name not in self.nodes:
+            self.nodes[node_name] = SGNode(node_name)
+    
+    def add_edge(self, in_node_name, out_node_name, **attributes):
+        """ 
+        add an edge to the graph, given a pair of node names
+        """
+        if (in_node_name, out_node_name) not in self.edges:
+        
+            self.add_node(in_node_name)
+            self.add_node(out_node_name)
+            in_node = self.nodes[in_node_name]
+            out_node = self.nodes[out_node_name]    
+            
+            edge = SGEdge(in_node, out_node)
+            self.edges[ (in_node_name, out_node_name) ] = edge
+            in_node.add_out_edge(edge)
+            out_node.add_in_edge(edge)
+        edge =  self.edges[ (in_node_name, out_node_name) ]
+        for k, v in attributes.items():
+            edge.attr[k] = v
+
+    def init_reduce_dict(self):
+        for e in self.edges:
+            self.e_reduce[e] = False
+
+    def mark_chimer_edge(self):
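+        """
+        mark both strands of an edge (v, w) as reduced when it has no support
+        from its neighborhood: v connects directly to none of w's successors,
+        and none of v's predecessors connects directly to w
+        """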
+
+        for e_n, e in self.edges.items():
+            v = e_n[0]
+            w = e_n[1]
+            overlap_count = 0
+            for w_out_e in self.nodes[w].out_edges:
+                w_out_n = w_out_e.out_node.name
+                if (v, w_out_n) in self.edges:
+                    overlap_count += 1
+            for v_in_e in self.nodes[v].in_edges:
+                v_in_n = v_in_e.in_node.name
+                if (v_in_n, w) in self.edges:
+                    overlap_count += 1
+            if self.e_reduce[ (v, w) ] != True:
+                if overlap_count == 0:
+                    self.e_reduce[(v, w)] = True
+                    #print "XXX: chimer edge %s %s removed" % (v, w)
+                    v, w = reverse_end(w), reverse_end(v)
+                    self.e_reduce[(v, w)] = True
+                    #print "XXX: chimer edge %s %s removed" % (v, w)
+
+
+
+    def mark_spur_edge(self):
+
+        for  v in self.nodes:
+            if len(self.nodes[v].out_edges) > 1:
+                for out_edge in self.nodes[v].out_edges:
+                    w = out_edge.out_node.name
+                    
+                    if len(self.nodes[w].out_edges) == 0 and self.e_reduce[ (v, w) ] != True:
+                        #print "XXX: spur edge %s %s removed" % (v, w)
+                        self.e_reduce[(v, w)] = True
+                        v2, w2 = reverse_end(w), reverse_end(v)
+                        #print "XXX: spur edge %s %s removed" % (v2, w2)
+                        self.e_reduce[(v2, w2)] = True
+
+            if len(self.nodes[v].in_edges) > 1:
+                for in_edge in self.nodes[v].in_edges:
+                    w = in_edge.in_node.name
+                    if len(self.nodes[w].in_edges) == 0 and self.e_reduce[ (w, v) ] != True:
+                        #print "XXX: spur edge %s %s removed" % (w, v)
+                        self.e_reduce[(w, v)] = True
+                        v2, w2 = reverse_end(w), reverse_end(v)
+                        #print "XXX: spur edge %s %s removed" % (w2, v2)
+                        self.e_reduce[(w2, v2)] = True
+
+
+    def mark_tr_edges(self):
+        """
+        transitive reduction
+        """
+        n_mark = self.n_mark
+        e_reduce = self.e_reduce
+        FUZZ = 500
+        for n in self.nodes:
+            n_mark[n] = "vacant"
+    
+        for n_name, node in self.nodes.items():
+
+            out_edges = node.out_edges
+            if len(out_edges) == 0:
+                continue
+            
+            out_edges.sort(key=lambda x: x.attr["length"])
+            
+            for e in out_edges:
+                w = e.out_node
+                n_mark[ w.name ] = "inplay"
+            
+            max_len = out_edges[-1].attr["length"]
+                
+            max_len += FUZZ
+            
+            for e in out_edges:
+                e_len = e.attr["length"]
+                w = e.out_node
+                if n_mark[w.name] == "inplay":
+                    w.out_edges.sort( key=lambda x: x.attr["length"] )
+                    for e2 in w.out_edges:
+                        if e2.attr["length"] + e_len < max_len:
+                            x = e2.out_node
+                            if n_mark[x.name] == "inplay":
+                                n_mark[x.name] = "eliminated"
+            
+            for e in out_edges:
+                e_len = e.attr["length"]
+                w = e.out_node
+                w.out_edges.sort( key=lambda x: x.attr["length"] )
+                if len(w.out_edges) > 0:
+                    x = w.out_edges[0].out_node
+                    if n_mark[x.name] == "inplay":
+                        n_mark[x.name] = "eliminated"
+                for e2 in w.out_edges:
+                    if e2.attr["length"] < FUZZ:
+                        x = e2.out_node
+                        if n_mark[x.name] == "inplay":
+                            n_mark[x.name] = "eliminated"
+                            
+            for out_edge in out_edges:
+                v = out_edge.in_node
+                w = out_edge.out_node
+                if n_mark[w.name] == "eliminated":
+                    e_reduce[ (v.name, w.name) ] = True
+                    #print "XXX: tr edge %s %s removed" % (v.name, w.name)
+                    v_name, w_name = reverse_end(w.name), reverse_end(v.name)
+                    e_reduce[(v_name, w_name)] = True
+                    #print "XXX: tr edge %s %s removed" % (v_name, w_name)
+                n_mark[w.name] = "vacant"
+                
+
+    def mark_best_overlap(self):
+        """
+        find the best overlapped edges
+        """
+
+        best_edges = set()
+
+        for v in self.nodes:
+
+            out_edges = self.nodes[v].out_edges
+            if len(out_edges) > 0:
+                out_edges.sort(key=lambda e: e.attr["score"])
+                e = out_edges[-1]
+                best_edges.add( (e.in_node.name, e.out_node.name) )
+
+            in_edges = self.nodes[v].in_edges
+            if len(in_edges) > 0:
+                in_edges.sort(key=lambda e: e.attr["score"])
+                e = in_edges[-1]
+                best_edges.add( (e.in_node.name, e.out_node.name) )
+
+        if DEBUG_LOG_LEVEL > 1:
+            print "X", len(best_edges)
+
+        for e_n, e in self.edges.items():
+            v = e_n[0]
+            w = e_n[1]
+            if self.e_reduce[ (v, w) ] != True:
+                if (v, w) not in best_edges:
+                    self.e_reduce[(v, w)] = True
+                    #print "XXX: in best edge %s %s removed" % (v, w)
+                    v2, w2 = reverse_end(w), reverse_end(v)
+                    #print "XXX: in best edge %s %s removed" % (v2, w2)
+                    self.e_reduce[(v2, w2)] = True
+                
+    def get_out_edges_for_node(self, name, mask=True):
+        rtn = []
+        for e in self.nodes[name].out_edges:
+            v = e.in_node
+            w = e.out_node
+            if self.e_reduce[ (v.name, w.name) ] == False:
+                rtn.append(e)
+        return rtn
+        
+        
+    def get_in_edges_for_node(self, name, mask=True):
+        rtn = []
+        for e in self.nodes[name].in_edges:
+            v = e.in_node
+            w = e.out_node
+            if self.e_reduce[ (v.name, w.name) ] == False:
+                rtn.append(e)
+        return rtn
+
+    def get_best_out_edge_for_node(self, name, mask=True):
+        rtn = []
+        for e in self.nodes[name].out_edges:
+            v = e.in_node
+            w = e.out_node
+            if self.e_reduce[ (v.name, w.name) ] == False:
+                rtn.append(e)
+        rtn.sort(key=lambda e: e.attr["score"])
+
+        return rtn[-1]
+
+    def get_best_in_edge_for_node(self, name, mask=True):
+        rtn = []
+        for e in self.nodes[name].in_edges:
+            v = e.in_node
+            w = e.out_node
+            if self.e_reduce[ (v.name, w.name) ] == False:
+                rtn.append(e)
+        rtn.sort(key=lambda e: e.attr["score"])
+        return rtn[-1]
+        
+
+RCMAP = dict(zip("ACGTacgtNn-","TGCAtgcaNn-"))
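+# Editor's note: RCMAP maps each base to its complement, so the reverse
+# complement of a sequence s is "".join([RCMAP[c] for c in s[::-1]]).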
+def generate_seq_from_path(sg, seqs, path):
+    subseqs = []
+    r_id, end = path[0].split(":")
+    
+    count = 0
+    for i in range( len( path ) -1 ):
+        w_n, v_n = path[i:i+2]
+        edge = sg.edges[ (w_n, v_n ) ]
+        read_id, coor = edge.attr["label"].split(":")
+        b,e = coor.split("-")
+        b = int(b)
+        e = int(e)
+        if b < e:
+            subseqs.append( seqs[read_id][b:e] )
+        else:
+            subseqs.append( "".join( [RCMAP[c] for c in seqs[read_id][b:e:-1]] ) )
+
+    return "".join(subseqs)
+
+
+def reverse_path( path ):
+    new_path = []
+    for n in list(path[::-1]):
+        rid, end = n.split(":")
+        new_end = "B" if end == "E" else "E"
+        new_path.append( rid+":"+new_end)
+    return new_path
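+# Illustrative note (editor): reversing a path also flips each node's end label,
+# e.g. reverse_path(["r1:B", "r2:E"]) -> ["r2:B", "r1:E"].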
+
+
+def generate_unitig(sg, seqs, out_fn, connected_nodes = None):
+
+    """
+    given a string graph (sg) and the sequences (seqs), write the unitig fasta file to out_fn
+    the function returns uni_edges, a dict mapping (start_node, end_node) to the list of (path, sequence) unitigs connecting them
+
+    some extra files are generated:
+        unit_edges.dat : an easy-to-parse file of the unitig data
+        unit_edge_paths : a file containing the path of every unitig
+        uni_graph.gexf : the unitig graph in GEXF format for visualization (its generation is currently commented out below)
+    """
+
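+    # Editor's note: unitigs are built as maximal simple paths; starting from an
+    # arbitrary remaining edge (v, w), the walk is extended upstream from v and
+    # downstream from w for as long as each visited node has exactly one live
+    # in-edge and one live out-edge, and each path is also emitted in its
+    # reverse-complement orientation.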
+    G = SGToNXG(sg)
+    if connected_nodes is None:
+        connected_nodes = set(sg.nodes)
+    out_fasta = open(out_fn, "w")
+    nodes_for_tig = set()
+    sg_edges = set()
+    for v, w in sg.edges:
+        if sg.e_reduce[(v, w)] != True:
+            sg_edges.add( (v, w) )
+    count = 0
+    edges_in_tigs = set()
+
+    uni_edges = {}
+    path_f = open("unit_edge_paths","w")
+    uni_edge_f = open("unit_edges.dat", "w")
+    while len(sg_edges) > 0:
+        v, w = sg_edges.pop()
+
+        #nodes_for_tig.remove(n)
+        upstream_nodes = []
+        
+        c_node = v
+        p_in_edges = sg.get_in_edges_for_node(c_node)
+        p_out_edges = sg.get_out_edges_for_node(c_node)
+        while len(p_in_edges) == 1 and len(p_out_edges) == 1:
+            p_node = p_in_edges[0].in_node
+            upstream_nodes.append(p_node.name)
+            if (p_node.name, c_node) not in  sg_edges:
+                break
+            p_in_edges = sg.get_in_edges_for_node(p_node.name)
+            p_out_edges = sg.get_out_edges_for_node(p_node.name)
+            c_node = p_node.name
+
+        upstream_nodes.reverse()  
+            
+        downstream_nodes = []
+        c_node = w 
+        n_out_edges = sg.get_out_edges_for_node(c_node)
+        n_in_edges = sg.get_in_edges_for_node(c_node)
+        while len(n_out_edges) == 1 and len(n_in_edges) == 1:
+            n_node = n_out_edges[0].out_node
+            downstream_nodes.append(n_node.name)
+            if (c_node, n_node.name) not in  sg_edges:
+                break
+            n_out_edges = sg.get_out_edges_for_node(n_node.name)
+            n_in_edges = sg.get_in_edges_for_node(n_node.name)
+            c_node = n_node.name 
+        
+        whole_path = upstream_nodes + [v, w] + downstream_nodes
+        count += 1
+        subseq = generate_seq_from_path(sg, seqs, whole_path) 
+        uni_edges.setdefault( (whole_path[0], whole_path[-1]), [] )
+        uni_edges[(whole_path[0], whole_path[-1])].append(  ( whole_path, subseq ) )
+        print >> uni_edge_f, whole_path[0], whole_path[-1], "-".join(whole_path), subseq
+        print >>path_f, ">%05dc-%s-%s-%d %s" % (count, whole_path[0], whole_path[-1], len(whole_path), " ".join(whole_path))
+        print >>out_fasta, ">%05dc-%s-%s-%d" % (count, whole_path[0], whole_path[-1], len(whole_path))
+        print >>out_fasta, subseq
+        for i in range( len( whole_path ) -1 ):
+            w_n, v_n = whole_path[i:i+2]
+            try:
+                sg_edges.remove( (w_n, v_n) )
+            except KeyError: #if an edge is already deleted, ignore it
+                pass
+
+        r_whole_path = reverse_path( whole_path )
+        count += 1
+        subseq = generate_seq_from_path(sg, seqs, r_whole_path) 
+        uni_edges.setdefault( (r_whole_path[0], r_whole_path[-1]), [] )
+        uni_edges[(r_whole_path[0], r_whole_path[-1])].append(  ( r_whole_path, subseq ) )
+        print >> uni_edge_f, r_whole_path[0], r_whole_path[-1], "-".join(r_whole_path), subseq
+        print >>path_f, ">%05dc-%s-%s-%d %s" % (count, r_whole_path[0], r_whole_path[-1], len(r_whole_path), " ".join(r_whole_path))
+        print >>out_fasta, ">%05dc-%s-%s-%d" % (count, r_whole_path[0], r_whole_path[-1], len(r_whole_path))
+        print >>out_fasta, subseq
+        for i in range( len( r_whole_path ) -1 ):
+            w_n, v_n = r_whole_path[i:i+2]
+            try:
+                sg_edges.remove( (w_n, v_n) )
+            except KeyError: #if an edge is already deleted, ignore it
+                pass
+
+
+    path_f.close()
+    uni_edge_f.close()
+    #uni_graph = nx.DiGraph()
+    #for n1, n2 in uni_edges.keys():
+    #    uni_graph.add_edge(n1, n2, count = len( uni_edges[ (n1,n2) ] ))
+    #nx.write_gexf(uni_graph, "uni_graph.gexf")
+
+    out_fasta.close()
+    return uni_edges
+
+def neighbor_bound(G, v, w, radius):
+    """
+    test whether the forward neighborhoods (ego graphs) of nodes v and w, within the given radius, share at least one edge in graph G
+    """
+    g1 = nx.ego_graph(G, v, radius=radius, undirected=False)
+    g2 = nx.ego_graph(G, w, radius=radius, undirected=False)
+    if len(set(g1.edges()) & set(g2.edges())) > 0:
+        return True
+    else:
+        return False
+
+
+def is_branch_node(G, n):
+    """
+    test whether the node n is a "branch node": one for which the paths from
+    some pair of its successors do not intersect within a given radius
+    """
+    out_edges = G.out_edges([n])
+    n2 = [ e[1] for e in out_edges ]
+    is_branch = False
+    for i in range(len(n2)):
+        for j in range(i+1, len(n2)):
+            v = n2[i]
+            w = n2[j]
+            if neighbor_bound(G, v, w, 10) == False:
+                is_branch = True
+                break
+        if is_branch == True:
+            break
+    return is_branch
+
+
+def get_bundle( path, u_graph ):
+
+    """ 
+    find a sub-graph containing the nodes between the start and the end of the path
+    inputs:
+        path : a path through the unitig graph
+        u_graph : a unitig graph
+    returns:
+        bundle_graph : the whole bundle graph
+        bundle_paths : the paths in the bundle graph
+        sub_graph2_edges : all edges of the bundle graph
+    
+    """
+
+    p_start, p_end = path[0], path[-1]
+    p_nodes = set(path)
+    p_edges = set(zip(path[:-1], path[1:]))
+
+    u_graph_r = u_graph.reverse()
+    down_path = nx.ego_graph(u_graph, p_start, radius=len(p_nodes), undirected=False)
+    up_path = nx.ego_graph(u_graph_r, p_end, radius=len(p_nodes), undirected=False)
+    subgraph_nodes = set(down_path) & set(up_path)
+    
+
+    sub_graph = nx.DiGraph()
+    for v, w in u_graph.edges_iter():
+        if v in subgraph_nodes and w in subgraph_nodes:            
+            if (v, w) in p_edges:
+                sub_graph.add_edge(v, w, color = "red")
+            else:
+                sub_graph.add_edge(v, w, color = "black")
+
+    sub_graph2 = nx.DiGraph()
+    tips = set()
+    tips.add(path[0])
+    sub_graph_r = sub_graph.reverse()
+    visited = set()
+    ct = 0
+    is_branch = is_branch_node(sub_graph, path[0]) #if the start node is a branch node
+    if is_branch:
+        n = tips.pop()
+        e = sub_graph.out_edges([n])[0] #pick one path to build the subgraph
+        sub_graph2.add_edge(e[0], e[1], n_weight = u_graph[e[0]][e[1]]["n_weight"])
+        if e[1] not in visited:
+            last_node = e[1]
+            visited.add(e[1])
+            r_id, orientation = e[1].split(":")
+            orientation = "E" if orientation == "B" else "E"
+            visited.add( r_id +":" + orientation)
+            if not is_branch_node(sub_graph_r, e[1]): 
+                tips.add(e[1])
+        
+    while len(tips) != 0:
+        n = tips.pop()
+        out_edges = sub_graph.out_edges([n])
+        if len(out_edges) == 1:
+            e = out_edges[0]
+            sub_graph2.add_edge(e[0], e[1], n_weight = u_graph[e[0]][e[1]]["n_weight"])
+            last_node = e[1]
+            if e[1] not in visited:                       
+                visited.add(e[1])
+                r_id, orientation = e[1].split(":")
+                orientation = "E" if orientation == "B" else "E"
+                visited.add( r_id +":" + orientation)
+                if not is_branch_node(sub_graph_r, e[1]): 
+                    tips.add(e[1])
+        else:
+        
+            is_branch = is_branch_node(sub_graph, n)
+            if not is_branch:
+                for e in out_edges:
+                    sub_graph2.add_edge(e[0], e[1], n_weight = u_graph[e[0]][e[1]]["n_weight"])
+                    last_node = e[1]
+                    if e[1] not in visited:
+                        r_id, orientation = e[1].split(":")
+                        visited.add(e[1])
+                        orientation = "E" if orientation == "B" else "E"
+                        visited.add( r_id +":" + orientation)
+                        if not is_branch_node(sub_graph_r, e[1]):
+                            tips.add(e[1])
+        ct += 1
+    last_node = None
+    longest_len = 0
+        
+    sub_graph2_nodes = sub_graph2.nodes()
+    sub_graph2_edges = sub_graph2.edges()
+
+
+    new_path = [path[0]]
+    for n in sub_graph2_nodes:
+        if len(sub_graph2.out_edges(n)) == 0 :
+            path_t = nx.shortest_path(sub_graph2, source = path[0], target = n, weight = "n_weight")
+            path_len = len(path_t)
+            if path_len > longest_len:
+                last_node = n
+                longest_len = path_len
+                new_path = path_t
+
+    if last_node == None:
+        for n in sub_graph2_nodes:
+            path_t = nx.shortest_path(sub_graph2, source = path[0], target = n, weight = "n_weight")
+            path_len = len(path_t)
+            if path_len > longest_len:
+                last_node = n
+                longest_len = path_len
+                new_path = path_t
+
+
+    path = new_path
+
+    # clean up sub_graph2 according to new begin and end
+    sub_graph2_r = sub_graph2.reverse()
+    down_path = nx.ego_graph(sub_graph2, path[0], radius=len(path), undirected=False)
+    up_path = nx.ego_graph(sub_graph2_r, path[-1], radius=len(path), undirected=False)
+    subgraph_nodes = set(down_path) & set(up_path)
+    for v in sub_graph2_nodes:
+        if v not in subgraph_nodes:
+            sub_graph2.remove_node(v)
+    
+    if DEBUG_LOG_LEVEL > 1:
+        print "new_path", path[0], last_node, len(sub_graph2_nodes), path
+
+
+    bundle_paths = [path]
+    p_nodes = set(path)
+    p_edges = set(zip(path[:-1], path[1:]))
+
+    sub_graph2_nodes = sub_graph2.nodes()
+    sub_graph2_edges = sub_graph2.edges()
+
+    nodes_idx = dict( [ (n[1], n[0]) for n in enumerate(path) ]  )
+    
+         
+    # create a list of subpaths that have no branches
+    non_branch_subpaths = []
+    wi = 0
+    vi = 0
+    v = path[0]
+    while v != path[-1] and wi < len(path)-1:
+        wi += 1
+        w = path[wi]
+        while len( sub_graph2.successors(w) ) == 1 and len( sub_graph2.predecessors(w) ) == 1 and wi < len(path)-1:
+            wi += 1
+            w = path[wi]
+        if  len( sub_graph2.successors(v) )!= 1 or len( sub_graph2.predecessors(w) )!= 1:
+            branched = True
+        else:
+            branched = False
+
+        if not branched:
+            non_branch_subpaths.append( path[vi:wi+1] )
+        v = w
+        vi = wi
+
+    # create the associate_graph that holds the paths of the alternative subpaths
+    
+    associate_graph = nx.DiGraph()
+    for v, w in sub_graph2.edges_iter():
+        if (v, w) not in p_edges:
+            associate_graph.add_edge(v, w, n_weight = sub_graph2[v][w]["n_weight"])
+
+    if DEBUG_LOG_LEVEL > 1:
+        print "associate_graph size:", len(associate_graph)           
+        print "non_branch_subpaths",len(non_branch_subpaths), non_branch_subpaths
+
+    # construct the bundle graph                
+    associate_graph_nodes = set(associate_graph.nodes())
+    bundle_graph = nx.DiGraph()
+    bundle_graph.add_path( path )
+    for i in range(len(non_branch_subpaths)-1):
+        if len(non_branch_subpaths[i]) == 0 or len( non_branch_subpaths[i+1] ) == 0:
+            continue
+        e1, e2 = non_branch_subpaths[i: i+2]
+        v = e1[-1]
+        w = e2[0]
+        if v == w:
+            continue
+        in_between_node_count = nodes_idx[w] - nodes_idx[v] 
+        if v in associate_graph_nodes and w in associate_graph_nodes:
+            try:
+                a_path = nx.shortest_path(associate_graph, v, w, "n_weight")    
+            except nx.NetworkXNoPath:
+                continue
+            bundle_graph.add_path( a_path )      
+            bundle_paths.append( a_path )
+
+    return bundle_graph, bundle_paths, sub_graph2_edges
+            
+def get_bundles(u_edges):
+    
+    """
+    input: all unitig edges
+    output: the assembled primary_tigs.fa and all_tigs.fa
+    """
+
+    ASM_graph = nx.DiGraph()
+    out_f = open("primary_tigs.fa", "w")
+    main_tig_paths = open("primary_tigs_paths","w")
+    sv_tigs = open("all_tigs.fa","w")
+    sv_tig_paths = open("all_tigs_paths","w")
+    max_weight = 0 
+    for v, w in u_edges:
+        x = max( [len(s[1]) for s in u_edges[ (v,w) ] ] )
+        if DEBUG_LOG_LEVEL > 1:
+            print "W", v, w, x
+        if x > max_weight:
+            max_weight = x
+            
+    in_edges = {}
+    out_edges = {}
+    for v, w in u_edges:
+        in_edges.setdefault(w, []) 
+        out_edges.setdefault(w, []) 
+        in_edges[w].append( (v, w) )
+
+        out_edges.setdefault(v, [])
+        in_edges.setdefault(v, [])
+        out_edges[v].append( (v, w) )
+
+    u_graph = nx.DiGraph()
+    for v,w in u_edges:
+
+        u_graph.add_edge(v, w, n_weight = max_weight - max( [len(s[1]) for s in  u_edges[ (v,w) ] ] ) )
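+    # Editor's note: n_weight inverts the unitig length against max_weight, so a
+    # shortest path computed with weight="n_weight" below preferentially runs
+    # through the longest unitig edges.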
+    
+    bundle_edge_out = open("bundle_edges","w")
+    bundle_index = 0
+    G = u_graph.copy()
+    visited_u_edges = set()
+    while len(G) > 0:
+        
+        root_nodes = set() 
+        for n in G: 
+            if G.in_degree(n) == 0: 
+                root_nodes.add(n) 
+
+        if len(root_nodes) == 0:
+            for n in G:
+                if G.in_degree(n) != 1:
+                    root_nodes.add(n)
+        
+        if len(root_nodes) == 0:  
+            root_nodes.add( G.nodes()[0] ) 
+        
+        candidates = [] 
+        
+        for n in list(root_nodes): 
+            sp = nx.single_source_shortest_path_length(G, n)
+            sp = sp.items() 
+            sp.sort(key=lambda x : x[1]) 
+            longest = sp[-1] 
+            if DEBUG_LOG_LEVEL > 2:
+                print "L", n, longest[0]
+            if longest[0].split(":")[0] == n.split(":")[0]: #avoid a big loop 
+                continue
+            candidates.append ( (longest[1], n, longest[0]) ) 
+
+        if len(candidates) == 0:
+            print "no more candiate", len(G.edges()), len(G.nodes())
+            if len(G.edges()) > 0:
+                path = G.edges()[0] 
+                print path
+            else:
+                break
+        else:
+            candidates.sort() 
+            
+            candidate = candidates[-1] 
+            
+            if candidate[1] == candidate[2]: 
+                G.remove_node(candidate[1]) 
+                continue 
+         
+            path = nx.shortest_path(G, candidate[1], candidate[2], "n_weight") 
+
+        if DEBUG_LOG_LEVEL > 1:
+            print "X", path[0], path[-1], len(path)
+        
+        cmp_edges = set()
+        g_edges = set(G.edges())
+        new_path = []  
+        tail = True
+        # avoid confusion due to long palindromic sequences
+        if len(path) > 2:
+            for i in range( 0, len( path ) - 1 ):
+                v_n, w_n = path[i:i+2]
+                new_path.append(v_n)
+                # the commented-out code below might be useful for filtering out some high-connectivity nodes
+                #if (v_n, w_n) in cmp_edges or\
+                #    len(u_graph.out_edges(w_n)) > 5 or\
+                #    len(u_graph.in_edges(w_n)) > 5:
+                if (v_n, w_n) in cmp_edges: 
+                    tail = False
+                    break
+
+                r_id, end = v_n.split(":")
+                end = "E" if end == "B" else "B" 
+                v_n2 = r_id + ":" + end 
+
+                r_id, end = w_n.split(":")
+                end = "E" if end == "B" else "B" 
+                w_n2 = r_id + ":" + end 
+
+                if (w_n2, v_n2) in g_edges:
+                    cmp_edges.add( (w_n2, v_n2) )
+
+            if tail:
+                new_path.append(w_n)
+        else:
+            new_path = path[:]
+                
+        
+        if len(new_path) > 1:
+            path = new_path
+            
+            if DEBUG_LOG_LEVEL > 2:
+                print "Y", path[0], path[-1], len(path)
+
+            bundle_graph, bundle_paths, bundle_graph_edges = get_bundle( path, G )
+            for bg_edge in bundle_graph_edges:
+                print >> bundle_edge_out, bundle_index, "edge", bg_edge[0], bg_edge[1]
+            for path_ in bundle_paths:
+                print >>bundle_edge_out, "path", bundle_index, " ".join(path_) 
+
+            edges_to_be_removed = set()
+            if DEBUG_LOG_LEVEL > 2:
+                print "Z", bundle_paths[0][0], bundle_paths[0][-1]
+                print bundle_index, len(path), len(bundle_paths[0]), len(bundle_paths), len(bundle_graph_edges)
+
+            if len(bundle_graph_edges) > 0:
+
+                ASM_graph.add_path(bundle_paths[0], ctg="%04d" % bundle_index)
+                extra_u_edges = []
+                
+                print >> main_tig_paths, ">%04d %s" % ( bundle_index, " ".join(bundle_paths[0]) )
+                subseqs = []
+            
+                for i in range(len(bundle_paths[0]) - 1): 
+                    v, w = bundle_paths[0][i:i+2]
+                    edges_to_be_removed.add( (v,w) )
+                    uedges = u_edges[ (v,w) ]
+                    uedges.sort( key= lambda x: len(x[0]) )
+                    subseqs.append( uedges[-1][1] )
+                    visited_u_edges.add( "-".join(uedges[-1][0]) ) 
+                    for ue in uedges:
+                        if "-".join(ue[0]) not in visited_u_edges:
+                            visited_u_edges.add("-".join(ue[0]))
+                            extra_u_edges.append(ue)
+                seq = "".join(subseqs)        
+                sv_tig_idx = 0
+                print >> sv_tig_paths, ">%04d-%04d %s" % ( bundle_index, sv_tig_idx, " ".join(bundle_paths[0]) )
+                if len(seq) > 0:
+                    print >> out_f, ">%04d %s-%s" % (bundle_index, bundle_paths[0][0], bundle_paths[0][-1])
+                    print >> out_f, seq
+                    print >> sv_tigs, ">%04d-%04d %s-%s" % (bundle_index, sv_tig_idx, bundle_paths[0][0], bundle_paths[0][-1])
+                    print >> sv_tigs, "".join(subseqs)
+
+                sv_tig_idx += 1
+
+                for sv_path in bundle_paths[1:]:
+                    print >> sv_tig_paths, ">%04d-%04d %s" % ( bundle_index, sv_tig_idx, " ".join(sv_path) )
+                    ASM_graph.add_path(sv_path, ctg="%04d" % bundle_index)
+                    subseqs = []
+                    for i in range(len(sv_path) - 1): 
+                        v, w = sv_path[i:i+2]
+                        edges_to_be_removed.add( (v,w) )
+                        uedges = u_edges[ (v,w) ]
+                        uedges.sort( key= lambda x: len(x[0]) )
+                        subseqs.append( uedges[-1][1] )
+                        visited_u_edges.add( "-".join(uedges[-1][0]) ) 
+                        for ue in uedges:
+                            if "-".join(ue[0]) not in visited_u_edges:
+                                visited_u_edges.add("-".join(ue[0]))
+                                extra_u_edges.append(ue)
+                    seq = "".join(subseqs)        
+                    if len(seq) > 0: 
+                        print >> sv_tigs, ">%04d-%04d %s-%s" % (bundle_index, sv_tig_idx, sv_path[0], sv_path[-1])
+                        print >> sv_tigs, "".join(subseqs)
+                    sv_tig_idx += 1
+                for u_path, seq in extra_u_edges:
+                    #u_path = u_path.split("-")
+                    ASM_graph.add_edge(u_path[0], u_path[-1], ctg="%04d" % bundle_index)
+                    print >> sv_tig_paths, ">%04d-%04d-u %s" % ( bundle_index, sv_tig_idx, " ".join(u_path) )
+                    print >> sv_tigs, ">%04d-%04d-u %s-%s" % (bundle_index, sv_tig_idx, u_path[0], u_path[-1])
+                    print >> sv_tigs, seq
+                    sv_tig_idx += 1
+                    
+                
+                bundle_index += 1
+        else:
+            #TODO, consolidate code here
+            v, w = path
+            sv_tig_idx = 0
+            edges_to_be_removed = set()
+            uedges = u_edges[ (v,w) ]
+            uedges.sort( key= lambda x: len(x[0]) )
+            subseqs = [ uedges[-1][1] ]
+            seq = "".join(subseqs)
+            print >> sv_tig_paths, ">%04d-%04d %s" % ( bundle_index, sv_tig_idx, " ".join(path) )
+            print >> sv_tigs, ">%04d-%04d-u %s-%s" % (bundle_index, sv_tig_idx, path[0], path[-1])
+            print >> sv_tigs, seq
+            sv_tig_idx += 1
+            bundle_index += 1
+            bundle_graph_edges = zip(path[:-1],path[1:])
+        
+        #clean up the graph
+
+        edges = set(G.edges())
+        edges_to_be_removed |= set(bundle_graph_edges)
+
+        if DEBUG_LOG_LEVEL > 2:
+            print "BGE",bundle_graph_edges
+        
+        edge_remove_count = 0
+        for v, w in edges_to_be_removed:
+            if (v, w) in edges:
+                G.remove_edge( v, w )
+                edge_remove_count += 1
+                if DEBUG_LOG_LEVEL > 2:
+                    print "remove edge", bundle_index, w, v
+                
+        edges = set(G.edges())
+        for v, w in edges_to_be_removed:
+
+            r_id, end = v.split(":")
+            end = "E" if end == "B" else "B"
+            v = r_id + ":" + end
+
+            r_id, end = w.split(":")
+            end = "E" if end == "B" else "B"
+            w = r_id + ":" + end
+
+            if (w, v) in edges:
+                G.remove_edge( w, v )
+                edge_remove_count += 1
+                if DEBUG_LOG_LEVEL > 2:
+                    print "remove edge", bundle_index, w, v
+
+        if edge_remove_count == 0:
+            break
+            
+        nodes = G.nodes()
+        for n in nodes:
+            if G.in_degree(n) == 0 and G.out_degree(n) == 0:
+                G.remove_node(n)
+                if DEBUG_LOG_LEVEL > 2:
+                    print "remove node", n 
+
+    sv_tig_paths.close()
+    sv_tigs.close()
+    main_tig_paths.close()
+    out_f.close()
+    bundle_edge_out.close()
+    return ASM_graph
+
+
+
+def SGToNXG(sg):
+    G=nx.DiGraph()
+
+    max_score = max([ sg.edges[ e ].attr["score"] for e in sg.edges if sg.e_reduce[e] != True ])
+    out_f = open("edges_list","w")
+    for v, w in sg.edges:
+        if sg.e_reduce[(v, w)] != True:
+        ##if 1:
+            G.add_node( v, size = len(sg.nodes[v].out_edges) )
+            G.add_node( w, size = len(sg.nodes[w].out_edges) )
+            label = sg.edges[ (v, w) ].attr["label"]
+            score = sg.edges[ (v, w) ].attr["score"]
+            print >>out_f, v, w, label, score 
+            G.add_edge( v, w, label = label, weight = 0.001*score, n_weight = max_score - score )
+            #print in_node_name, out_node_name
+    out_f.close()
+    return G
+
+if __name__ == "__main__":
+
+    import argparse
+    
+    parser = argparse.ArgumentParser(description='an example string graph assembler that is designed for handling diploid genomes')
+    parser.add_argument('overlap_file', help='a file that contains the overlap information.')
+    parser.add_argument('read_fasta', help='the file that contains the sequences to be assembled')
+    parser.add_argument('--min_len', type=int, default=4000, 
+                        help='minimum length of the reads to be considered for assembling')
+    parser.add_argument('--min_idt', type=float, default=96,
+                        help='minimum alignment identity of the reads to be considered for assembling')
+    parser.add_argument('--disable_chimer_prediction', action="store_true", default=False,
+                        help='you may want to disable this, as some reads can be falsely identified as chimeras in low-coverage cases')
+
+    args = parser.parse_args()
+
+
+    overlap_file = args.overlap_file
+    read_fasta = args.read_fasta
+
+    seqs = {}
+    # load all p-reads into memory
+    f = FastaReader(read_fasta)
+    for r in f:
+        seqs[r.name] = r.sequence.upper()
+
+    G=nx.Graph()
+    edges =set()
+    overlap_data = []
+    contained_reads = set()
+    overlap_count = {}
+
+
+    # loop through the overlap data and load the records into a Python list;
+    # contained reads are identified along the way
+
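+    # Editor's note: each record is expected to carry 13 whitespace-separated
+    # fields, consumed below as: f_id g_id score identity, then
+    # f_strand f_start f_end f_len, then g_strand g_start g_end g_len, and a
+    # final overlap classification tag ("contained", "contains", "none", or
+    # anything else for a proper dovetail overlap).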
+    with open(overlap_file) as f:
+        for l in f:
+            l = l.strip().split()
+
+            #workaround for some ill-formed data records
+            if len(l) != 13:
+                continue
+            
+            f_id, g_id, score, identity = l[:4]
+            if f_id == g_id:  # don't need self-self overlapping
+                continue
+
+            if g_id not in seqs: 
+                continue
+
+            if f_id not in seqs:
+                continue
+
+            score = int(score)
+            identity = float(identity)
+            contained = l[12]
+            if contained == "contained":
+                contained_reads.add(f_id)
+                continue
+            if contained == "contains":
+                contained_reads.add(g_id)
+                continue
+            if contained == "none":
+                continue
+
+            if identity < args.min_idt: # only keep records with identity >= min_idt (default 96%) as overlaps
+                continue
+            #if score > -2000:
+            #    continue
+            f_strain, f_start, f_end, f_len = (int(c) for c in l[4:8])
+            g_strain, g_start, g_end, g_len = (int(c) for c in l[8:12])
+
+            # only use reads longer than min_len (default 4 kb) for assembly
+            if f_len < args.min_len: continue
+            if g_len < args.min_len: continue
+            
+            # double check for proper overlap
+            if f_start > 24 and f_len - f_end > 24:  # allow 24 bases of tolerance on both sides of the overlap
+                continue
+            
+            if g_start > 24 and g_len - g_end > 24:
+                continue
+            
+            if g_strain == 0:
+                if f_start < 24 and g_len - g_end > 24:
+                    continue
+                if g_start < 24 and f_len - f_end > 24:
+                    continue
+            else:
+                if f_start < 24 and g_start > 24:
+                    continue
+                if g_start < 24 and f_start > 24:
+                    continue
+
+            overlap_data.append( (f_id, g_id, score, identity,
+                                  f_strain, f_start, f_end, f_len,
+                                  g_strain, g_start, g_end, g_len) )
+
+            overlap_count[f_id] = overlap_count.get(f_id,0)+1
+            overlap_count[g_id] = overlap_count.get(g_id,0)+1
+
+    overlap_set = set()
+    sg = StringGraph()
+    for od in overlap_data:
+        f_id, g_id, score, identity = od[:4]
+        if f_id in contained_reads:
+            continue
+        if g_id in contained_reads:
+            continue
+        f_s, f_b, f_e, f_l = od[4:8]
+        g_s, g_b, g_e, g_l = od[8:12]
+        overlap_pair = [f_id, g_id]
+        overlap_pair.sort()
+        overlap_pair = tuple( overlap_pair )
+        if overlap_pair in overlap_set:  # don't allow duplicated records
+            continue
+        else:
+            overlap_set.add(overlap_pair)
+
+        
+        if g_s == 1: # reversed alignment, swap the begin and end coordinates
+            g_b, g_e = g_e, g_b
+        
+        # build the string graph edges for each overlap
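+        # Editor's note: depending on which read end is unaligned and on the
+        # relative orientation, each proper dovetail overlap falls into one of
+        # the four cases diagrammed below; every case adds two directed edges
+        # (one per strand), with the "label" attribute naming the read segment
+        # (read_id:b-e) spliced in when the edge is traversed.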
+        if f_b > 24:
+            if g_b < g_e:
+                """
+                     f.B         f.E
+                  f  ----------->
+                  g         ------------->
+                            g.B           g.E
+                """
+                if f_b == 0 or g_e - g_l == 0:
+                    continue
+                sg.add_edge( "%s:B" % g_id, "%s:B" % f_id, label = "%s:%d-%d" % (f_id, f_b, 0), 
+                                                           length = abs(f_b-0),
+                                                           score = -score)
+                sg.add_edge( "%s:E" % f_id, "%s:E" % g_id, label = "%s:%d-%d" % (g_id, g_e, g_l), 
+                                                           length = abs(g_e-g_l),
+                                                           score = -score)
+            else:
+                """
+                     f.B         f.E
+                  f  ----------->
+                  g         <-------------
+                            g.E           g.B           
+                """
+                if f_b == 0 or g_e == 0:
+                    continue
+                sg.add_edge( "%s:E" % g_id, "%s:B" % f_id, label = "%s:%d-%d" % (f_id, f_b, 0), 
+                                                           length = abs(f_b -0),
+                                                           score = -score)
+                sg.add_edge( "%s:E" % f_id, "%s:B" % g_id, label = "%s:%d-%d" % (g_id, g_e, 0), 
+                                                           length = abs(g_e- 0),
+                                                           score = -score)
+        else:
+            if g_b < g_e:
+                """
+                                    f.B         f.E
+                  f                 ----------->
+                  g         ------------->
+                            g.B           g.E
+                """
+                if g_b == 0 or f_e - f_l == 0:
+                    continue
+                sg.add_edge( "%s:B" % f_id, "%s:B" % g_id, label = "%s:%d-%d" % (g_id, g_b, 0), 
+                                                           length = abs(g_b - 0),
+                                                           score = -score)
+                sg.add_edge( "%s:E" % g_id, "%s:E" % f_id, label = "%s:%d-%d" % (f_id, f_e, f_l), 
+                                                           length = abs(f_e-f_l),
+                                                           score = -score)
+            else:
+                """
+                                    f.B         f.E
+                  f                 ----------->
+                  g         <-------------
+                            g.E           g.B           
+                """
+                if g_b - g_l == 0 or f_e - f_l == 0:
+                    continue
+                sg.add_edge( "%s:B" % f_id, "%s:E" % g_id, label = "%s:%d-%d" % (g_id, g_b, g_l), 
+                                                           length = abs(g_b - g_l),
+                                                           score = -score)
+                sg.add_edge( "%s:B" % g_id, "%s:E" % f_id, label = "%s:%d-%d" % (f_id, f_e, f_l), 
+                                                           length = abs(f_e - f_l),
+                                                           score = -score)
+
+    
+    sg.init_reduce_dict()
+    if not args.disable_chimer_prediction:
+        sg.mark_chimer_edge()
+    sg.mark_spur_edge()
+    sg.mark_tr_edges() # mark edges that are transitively redundant
+
+    if DEBUG_LOG_LEVEL > 1:
+        print sum( [1 for c in sg.e_reduce.values() if c == True] )
+        print sum( [1 for c in sg.e_reduce.values() if c == False] )
+
+    sg.mark_best_overlap() # reduce the edges that are not best-overlap edges
+
+    if DEBUG_LOG_LEVEL > 1:
+        print sum( [1 for c in sg.e_reduce.values() if c == False] )
+
+
+    G = SGToNXG(sg)
+    #nx.write_gexf(G, "string_graph.gexf") # output the raw string graph for visualization
+    nx.write_adjlist(G, "string_graph.adj") # write out the adjacency list of the whole string graph
+
+    u_edges = generate_unitig(sg, seqs, out_fn = "unitgs.fa") # reduce the string graph into a unitig graph
+    ASM_graph = get_bundles(u_edges)  # get the assembly
+    #nx.write_gexf(ASM_graph, "asm_graph.gexf")
diff --git a/src/py_scripts/falcon_asm_dev.py b/src/py_scripts/falcon_asm_dev.py
new file mode 100755
index 0000000..610a89f
--- /dev/null
+++ b/src/py_scripts/falcon_asm_dev.py
@@ -0,0 +1,1015 @@
+#!/usr/bin/env python
+
+#################################################################################$$
+# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted (subject to the limitations in the
+# disclaimer below) provided that the following conditions are met:
+#
+#  * Redistributions of source code must retain the above copyright
+#  notice, this list of conditions and the following disclaimer.
+#
+#  * Redistributions in binary form must reproduce the above
+#  copyright notice, this list of conditions and the following
+#  disclaimer in the documentation and/or other materials provided
+#  with the distribution.
+#
+#  * Neither the name of Pacific Biosciences nor the names of its
+#  contributors may be used to endorse or promote products derived
+#  from this software without specific prior written permission.
+#
+# NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
+# GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
+# BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+#################################################################################$$
+
+
+from pbcore.io import FastaReader
+import networkx as nx
+import os
+import shlex
+import sys
+import subprocess
+
+class SGNode(object):
+    def __init__(self, node_name):
+        self.name = node_name
+        self.out_edges = []
+        self.in_edges = []
+    def add_out_edge(self, out_edge):
+        self.out_edges.append(out_edge)
+    def add_in_edge(self, in_edge):
+        self.in_edges.append(in_edge)
+
+class SGEdge(object):
+    def __init__(self, in_node, out_node):
+        self.in_node = in_node
+        self.out_node = out_node
+        self.attr = {}
+    def set_attribute(self, attr, value):
+        self.attr[attr] = value
+
+class StringGraph(object):
+    def __init__(self):
+        self.nodes = {}
+        self.edges = {}
+        self.n_mark = {}
+        self.e_reduce = {}
+        self.repeat_overlap = {}
+        
+    def add_node(self, node_name):
+        if node_name not in self.nodes:
+            self.nodes[node_name] = SGNode(node_name)
+    
+    def add_edge(self, in_node_name, out_node_name, **attributes):
+        if (in_node_name, out_node_name) not in self.edges:
+        
+            self.add_node(in_node_name)
+            self.add_node(out_node_name)
+            in_node = self.nodes[in_node_name]
+            out_node = self.nodes[out_node_name]    
+            
+            edge = SGEdge(in_node, out_node)
+            self.edges[ (in_node_name, out_node_name) ] = edge
+            in_node.add_out_edge(edge)
+            out_node.add_in_edge(edge)
+        edge =  self.edges[ (in_node_name, out_node_name) ]
+        for k, v in attributes.items():
+            edge.attr[k] = v
+            
+    def mark_tr_edges(self):
+        n_mark = self.n_mark
+        e_reduce = self.e_reduce
+        FUZZ = 500
+        for n in self.nodes:
+            n_mark[n] = "vacant"
+        for e in self.edges:
+            e_reduce[e] = False
+    
+        for n_name, node in self.nodes.items():
+
+            out_edges = node.out_edges
+            if len(out_edges) == 0:
+                continue
+            
+            out_edges.sort(key=lambda x: x.attr["length"])
+            
+            for e in out_edges:
+                w = e.out_node
+                n_mark[ w.name ] = "inplay"
+            
+            max_len = out_edges[-1].attr["length"]
+            #longest_edge = out_edges[-1]
+                
+            max_len += FUZZ
+            
+            for e in out_edges:
+                e_len = e.attr["length"]
+                w = e.out_node
+                if n_mark[w.name] == "inplay":
+                    w.out_edges.sort( key=lambda x: x.attr["length"] )
+                    for e2 in w.out_edges:
+                        if e2.attr["length"] + e_len < max_len:
+                            x = e2.out_node
+                            if n_mark[x.name] == "inplay":
+                                n_mark[x.name] = "eliminated"
+            
+            for e in out_edges:
+                e_len = e.attr["length"]
+                w = e.out_node
+                w.out_edges.sort( key=lambda x: x.attr["length"] )
+                if len(w.out_edges) > 0:
+                    x = w.out_edges[0].out_node
+                    if n_mark[x.name] == "inplay":
+                        n_mark[x.name] = "eliminated"
+                for e2 in w.out_edges:
+                    if e2.attr["length"] < FUZZ:
+                        x = e2.out_node
+                        if n_mark[x.name] == "inplay":
+                            n_mark[x.name] = "eliminated"
+                            
+            for out_edge in out_edges:
+                v = out_edge.in_node
+                w = out_edge.out_node
+                if n_mark[w.name] == "eliminated":
+                    e_reduce[ (v.name, w.name) ] = True
+                n_mark[w.name] = "vacant"
+                
+    def mark_repeat_overlap(self):
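+        # Editor's note: among a node's live out-edges, targets whose live
+        # in-degree exceeds the minimum in-degree seen across the sibling
+        # targets are flagged as likely repeat-induced overlaps.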
+        repeat_overlap = self.repeat_overlap
+        in_degree = {}
+        for n in self.nodes:
+            c = 0
+            for e in self.nodes[n].in_edges:
+                v = e.in_node
+                w = e.out_node
+                if self.e_reduce[(v.name, w.name)] == False:
+                    c += 1
+            in_degree[n] = c
+            #print n,c
+        #print len([x for x in in_degree.items() if x[1]>1])
+         
+        for e_n, e in self.edges.items():
+            v = e.in_node
+            w = e.out_node
+            if self.e_reduce[(v.name, w.name)] == False:
+                repeat_overlap[ (v.name, w.name) ] = False
+            else:
+                repeat_overlap[ (v.name, w.name) ] = True
+            
+        for n in self.nodes:
+            if len(self.nodes[n].out_edges) < 2:
+                continue
+            min_in_deg = None
+            for e in self.nodes[n].out_edges:
+                v = e.in_node
+                w = e.out_node
+                #print n, v.name, w.name
+                if self.e_reduce[ (v.name, w.name) ] == True:
+                    continue
+                if min_in_deg == None:
+                    min_in_deg = in_degree[w.name]
+                    continue
+                if in_degree[w.name] < min_in_deg:
+                    min_in_deg = in_degree[w.name]
+                #print n, w.name, in_degree[w.name]
+            for e in self.nodes[n].out_edges:
+                v = e.in_node
+                w = e.out_node
+                assert (v.name, w.name) in self.edges
+                if in_degree[w.name] > min_in_deg:
+                    if self.e_reduce[(v.name, w.name)] == False:
+                        repeat_overlap[ (v.name, w.name) ] = True
+                        
+                    
+        for e_n, e in self.edges.items():
+            v = e.in_node
+            w = e.out_node
+            if repeat_overlap[ (v.name, w.name) ] == True:
+                self.e_reduce[(v.name, w.name)] = True
+
+    def mark_best_overlap(self):
+
+        best_edges = set()
+
+        for v in self.nodes:
+
+            out_edges = self.nodes[v].out_edges
+            if len(out_edges) > 0:
+                out_edges.sort(key=lambda e: e.attr["score"])
+                e = out_edges[-1]
+                best_edges.add( (e.in_node.name, e.out_node.name) )
+
+            in_edges = self.nodes[v].in_edges
+            if len(in_edges) > 0:
+                in_edges.sort(key=lambda e: e.attr["score"])
+                e = in_edges[-1]
+                best_edges.add( (e.in_node.name, e.out_node.name) )
+
+        print "X", len(best_edges)
+
+        for e_n, e in self.edges.items():
+            v = e_n[0]
+            w = e_n[1]
+            if self.e_reduce[ (v, w) ] != True:
+                if (v, w) not in best_edges:
+                    self.e_reduce[(v, w)] = True
+
+    def mark_best_overlap_2(self):
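+        # Editor's note: a stricter, mutual-best variant; an edge (v, w) is kept
+        # only when w is v's best-scoring out-neighbor AND v is w's best-scoring
+        # in-neighbor.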
+        best_edges = set()
+        for e in self.edges:
+            v, w = e
+            if w == self.get_best_out_edge_for_node(v).out_node.name and\
+               v == self.get_best_in_edge_for_node(w).in_node.name:
+                   best_edges.add( (v, w) )
+
+        for e_n, e in self.edges.items():
+            v = e_n[0]
+            w = e_n[1]
+            if self.e_reduce[ (v, w) ] != True:
+                if (v, w) not in best_edges:
+                    self.e_reduce[(v, w)] = True
+                    #print sum( [1 for e_n in self.edges if self.e_reduce[ e_n ] == False] )
+                
+    def get_out_edges_for_node(self, name, mask=True):
+        rtn = []
+        for e in self.nodes[name].out_edges:
+            v = e.in_node
+            w = e.out_node
+            if self.e_reduce[ (v.name, w.name) ] == False:
+                rtn.append(e)
+        return rtn
+        
+        
+    def get_in_edges_for_node(self, name, mask=True):
+        rtn = []
+        for e in self.nodes[name].in_edges:
+            v = e.in_node
+            w = e.out_node
+            if self.e_reduce[ (v.name, w.name) ] == False:
+                rtn.append(e)
+        return rtn
+
+    def get_best_out_edge_for_node(self, name, mask=True):
+        rtn = []
+        for e in self.nodes[name].out_edges:
+            v = e.in_node
+            w = e.out_node
+            if self.e_reduce[ (v.name, w.name) ] == False:
+                rtn.append(e)
+        rtn.sort(key=lambda e: e.attr["score"])
+
+        return rtn[-1]
+
+    def get_best_in_edge_for_node(self, name, mask=True):
+        rtn = []
+        for e in self.nodes[name].in_edges:
+            v = e.in_node
+            w = e.out_node
+            if self.e_reduce[ (v.name, w.name) ] == False:
+                rtn.append(e)
+        rtn.sort(key=lambda e: e.attr["score"])
+        return rtn[-1]
+        
+
+RCMAP = dict(zip("ACGTacgtNn-","TGCAtgcaNn-"))
+def generate_contig_from_path(sg, seqs, path):
+    subseqs = []
+    r_id, end = path[0].split(":")
+    if end == "B":
+        subseqs= [ "".join( [RCMAP[c] for c in seqs[r_id][::-1]] ) ]
+    else:
+        subseqs=[ seqs[r_id] ]
+    
+    count = 0
+    for i in range( len( path ) -1 ):
+        w_n, v_n = path[i:i+2]
+        edge = sg.edges[ (w_n, v_n ) ]
+        read_id, coor = edge.attr["label"].split(":")
+        b,e = coor.split("-")
+        b = int(b)
+        e = int(e)
+        if b < e:
+            subseqs.append( seqs[read_id][b:e] )
+        else:
+            subseqs.append( "".join( [RCMAP[c] for c in seqs[read_id][b:e:-1]] ) )
+
+    return "".join(subseqs)
+
+
+def generate_unitig(sg, seqs, out_fn, connected_nodes = None):
+    G = SGToNXG(sg)
+    if connected_nodes is None:
+        connected_nodes = set(sg.nodes)
+    out_fasta = open(out_fn, "w")
+    nodes_for_tig = set()
+    sg_edges = set()
+    for v, w in sg.edges:
+        if sg.e_reduce[(v, w)] != True:
+            sg_edges.add( (v, w) )
+    count = 0
+    edges_in_tigs = set()
+
+    uni_edges = {}
+    path_f = open("paths","w")
+    uni_edge_f = open("unit_edges.dat", "w")
+    while len(sg_edges) > 0:
+        v, w = sg_edges.pop()
+
+        #nodes_for_tig.remove(n)
+        upstream_nodes = []
+        
+        c_node = v
+        p_in_edges = sg.get_in_edges_for_node(c_node)
+        p_out_edges = sg.get_out_edges_for_node(c_node)
+        while len(p_in_edges) == 1 and len(p_out_edges) == 1:
+            p_node = p_in_edges[0].in_node
+            upstream_nodes.append(p_node.name)
+            if (p_node.name, c_node) not in  sg_edges:
+                break
+            sg_edges.remove( (p_node.name, c_node) )
+            p_in_edges = sg.get_in_edges_for_node(p_node.name)
+            p_out_edges = sg.get_out_edges_for_node(p_node.name)
+            c_node = p_node.name
+
+        upstream_nodes.reverse()  
+            
+        downstream_nodes = []
+        c_node = w 
+        n_out_edges = sg.get_out_edges_for_node(c_node)
+        n_in_edges = sg.get_in_edges_for_node(c_node)
+        while len(n_out_edges) == 1 and len(n_in_edges) == 1:
+            n_node = n_out_edges[0].out_node
+            downstream_nodes.append(n_node.name)
+            if (c_node, n_node.name) not in  sg_edges:
+                break
+            sg_edges.remove( (c_node, n_node.name) )
+            n_out_edges = sg.get_out_edges_for_node(n_node.name)
+            n_in_edges = sg.get_in_edges_for_node(n_node.name)
+            c_node = n_node.name 
+        
+        whole_path = upstream_nodes + [v, w] + downstream_nodes
+        #print len(whole_path)
+        count += 1
+        subseqs = []
+        for i in range( len( whole_path ) - 1):
+            v_n, w_n = whole_path[i:i+2]
+            
+            edge = sg.edges[ (v_n, w_n ) ]
+            edges_in_tigs.add( (v_n, w_n ) )
+            #print n, next_node.name, e.attr["label"]
+            
+            read_id, coor = edge.attr["label"].split(":")
+            b,e = coor.split("-")
+            b = int(b)
+            e = int(e)
+            if b < e:
+                subseqs.append( seqs[read_id][b:e] )
+            else:
+                try:
+                    subseqs.append( "".join( [RCMAP[c] for c in seqs[read_id][b:e:-1]] ) )
+                except KeyError: # unexpected base missing from RCMAP; dump the read for debugging
+                    print seqs[read_id]
+            
+        uni_edges.setdefault( (whole_path[0], whole_path[-1]), [] )
+        uni_edges[(whole_path[0], whole_path[-1])].append(  ( whole_path, "".join(subseqs) ) )
+        print >> uni_edge_f, whole_path[0], whole_path[-1], "-".join(whole_path), "".join(subseqs)
+
+        print >>path_f, ">%05dc-%s-%s-%d %s" % (count, whole_path[0], whole_path[-1], len(whole_path), " ".join(whole_path))
+
+        print >>out_fasta, ">%05dc-%s-%s-%d" % (count, whole_path[0], whole_path[-1], len(whole_path))
+        print >>out_fasta,"".join(subseqs)
+    path_f.close()
+    uni_edge_f.close()
+    uni_graph = nx.DiGraph()
+    for n1, n2 in uni_edges.keys():
+        uni_graph.add_edge(n1, n2, weight = len( uni_edges[ (n1,n2) ] ))
+    nx.write_gexf(uni_graph, "uni_graph.gexf")
+
+    out_fasta.close()
+    return uni_edges
+
+def neighbor_bound(G, v, w, radius):
+    g1 = nx.ego_graph(G, v, radius=radius, undirected=False)
+    g2 = nx.ego_graph(G, w, radius=radius, undirected=False)
+    if len(set(g1.edges()) & set(g2.edges())) > 0:
+        return True
+    else:
+        return False
+
+
+def is_branch_node(G, n):
+    out_edges = G.out_edges([n])
+    n2 = [ e[1] for e in out_edges ]
+    is_branch = False
+    for i in range(len(n2)):
+        for j in range(i+1, len(n2)):
+            v = n2[i]
+            w = n2[j]
+            if neighbor_bound(G, v, w, 10) == False:
+                is_branch = True
+                break
+        if is_branch == True:
+            break
+    return is_branch
+
+
+def get_bundle( path, u_graph, u_edges ):
+    
+    # find a sub-graph containing the nodes between the start and the end of the path
+    
+    p_start, p_end = path[0], path[-1]
+    p_nodes = set(path)
+    p_edges = set(zip(path[:-1], path[1:]))
+    u_graph_r = u_graph.reverse()
+    down_path = nx.ego_graph(u_graph, p_start, radius=len(p_nodes), undirected=False)
+    up_path = nx.ego_graph(u_graph_r, p_end, radius=len(p_nodes), undirected=False)
+    subgraph_nodes = set(down_path) & set(up_path)
+    #print len(path), len(down_path), len(up_path), len(bundle_nodes)
+    
+
+    sub_graph = nx.DiGraph()
+    for v, w in u_graph.edges_iter():
+        if v in subgraph_nodes and w in subgraph_nodes:            
+            if (v, w) in p_edges:
+                sub_graph.add_edge(v, w, color = "red")
+            else:
+                sub_graph.add_edge(v, w, color = "black")
+
+    sub_graph2 = nx.DiGraph()
+    tips = set()
+    tips.add(path[0])
+    sub_graph_r = sub_graph.reverse()
+    visited = set()
+    ct = 0
+    is_branch = is_branch_node(sub_graph, path[0]) # check whether the start node is a branch node
+    if is_branch:
+        n = tips.pop()
+        e = sub_graph.out_edges([n])[0] # pick one path to build the subgraph
+        sub_graph2.add_edge(e[0], e[1], n_weight = u_graph[e[0]][e[1]]["n_weight"])
+        if e[1] not in visited:
+            last_node = e[1]
+            visited.add(e[1])
+            r_id, orientation = e[1].split(":")
+            orientation = "E" if orientation == "B" else "B" # also mark the other end of the read as visited
+            visited.add( r_id +":" + orientation)
+            if not is_branch_node(sub_graph_r, e[1]): 
+                tips.add(e[1])
+        
+    while len(tips) != 0:
+        n = tips.pop()
+        #print "n", n
+        out_edges = sub_graph.out_edges([n])
+        #out_edges = u_graph.out_edges([n])
+        #print out_edges 
+        if len(out_edges) == 1:
+            e = out_edges[0]
+            sub_graph2.add_edge(e[0], e[1], n_weight = u_graph[e[0]][e[1]]["n_weight"])
+            last_node = e[1]
+            if e[1] not in visited:                       
+                visited.add(e[1])
+                r_id, orientation = e[1].split(":")
+                orientation = "E" if orientation == "B" else "B" # also mark the other end of the read as visited
+                visited.add( r_id +":" + orientation)
+                if not is_branch_node(sub_graph_r, e[1]): 
+                #if not is_branch_node(u_graph_r, e[1]): 
+                    tips.add(e[1])
+        else:
+        
+            is_branch = is_branch_node(sub_graph, n)
+            #is_branch = is_branch_node(u_graph, n)
+            if not is_branch:
+                for e in out_edges:
+                    sub_graph2.add_edge(e[0], e[1], n_weight = u_graph[e[0]][e[1]]["n_weight"])
+                    last_node = e[1]
+                    if e[1] not in visited:
+                        r_id, orientation = e[1].split(":")
+                        visited.add(e[1])
+                        orientation = "E" if orientation == "B" else "B" # also mark the other end of the read as visited
+                        visited.add( r_id +":" + orientation)
+                        if not is_branch_node(sub_graph_r, e[1]):
+                        #if not is_branch_node(u_graph_r, e[1]):
+                            tips.add(e[1])
+        ct += 1
+        #print ct, len(tips)
+    last_node = None
+    longest_len = 0
+    sub_graph2_nodes = sub_graph2.nodes()
+    sub_graph2_edges = sub_graph2.edges()
+        
+
+
+    new_path = [path[0]]
+    for n in sub_graph2_nodes:
+        if len(sub_graph2.out_edges(n)) == 0 :
+            path_t = nx.shortest_path(sub_graph2, source = path[0], target = n, weight = "n_weight")
+            path_len = len(path_t)
+            if path_len > longest_len:
+                last_node = n
+                longest_len = path_len
+                new_path = path_t
+
+    if last_node == None:
+        for n in sub_graph2_nodes:
+            path_t = nx.shortest_path(sub_graph2, source = path[0], target = n, weight = "n_weight")
+            path_len = len(path_t)
+            if path_len > longest_len:
+                last_node = n
+                longest_len = path_len
+                new_path = path_t
+
+
+    #new_path = nx.shortest_path(sub_graph2, path[0], last_node, "n_weight")
+    path = new_path
+    print "new_path", path[0], last_node, len(sub_graph2_nodes), path
+
+
+    bundle_paths = [path]
+    p_nodes = set(path)
+    p_edges = set(zip(path[:-1], path[1:]))
+    nodes_idx = dict( [ (n[1], n[0]) for n in enumerate(path) ]  )
+    
+         
+    # create a list of subpaths that have no branches
+    non_branch_subpaths = [ [] ]
+    non_branch_edges = set()
+    mtg_edges = set()
+    
+    for i in range(len(path)-1):
+        v, w = path[i:i+2]
+        if len(sub_graph2.successors(v)) == 1 and len(sub_graph2.predecessors(w)) == 1:
+            non_branch_subpaths[-1].append( (v, w) )
+            non_branch_edges.add( (v, w) )
+        else:
+            if len(non_branch_subpaths[-1]) != 0:
+                non_branch_subpaths.append([])
+                
+    # create the associate_graph that holds the alternative subpaths off the main path
+    
+    associate_graph = nx.DiGraph()
+    for v, w in sub_graph2.edges_iter():
+        if (v, w) not in p_edges:
+            associate_graph.add_edge(v, w, n_weight = sub_graph2[v][w]["n_weight"])
+    #print "associate_graph size:", len(associate_graph)           
+    #print "non_branch_subpaths", non_branch_subpaths
+    # construct the bundle graph                
+    associate_graph_nodes = set(associate_graph.nodes())
+    bundle_graph = nx.DiGraph()
+    bundle_graph.add_path( path )
+    for i in range(len(non_branch_subpaths)-1):
+        if len(non_branch_subpaths[i]) == 0 or len( non_branch_subpaths[i+1] ) == 0:
+            continue
+        e1, e2 = non_branch_subpaths[i: i+2]
+        v = e1[-1][-1]
+        w = e2[0][0]
+        if v == w:
+            continue
+        #print v, w
+        in_between_node_count = nodes_idx[w] - nodes_idx[v] 
+        if v in associate_graph_nodes and w in associate_graph_nodes:
+            try:
+                #print "p2",v, w, nx.shortest_path(associate_graph, v, w)
+                #print "p1",v, w, nx.shortest_path(bundle_graph, v, w)
+                a_path = nx.shortest_path(associate_graph, v, w, "n_weight")    
+            except nx.NetworkXNoPath:
+                continue
+            bundle_graph.add_path( a_path )      
+            bundle_paths.append( a_path )
+    #bundle_graph_nodes = bundle_graph.nodes()
+    return bundle_graph, bundle_paths, sub_graph2_edges
+            
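+# get_bundles() (a reading of the code below): repeatedly pick the longest
+# remaining stretch of the unitig graph, expand it into a bundle with
+# get_bundle(), write the bundle's primary path to primary_tigs.fa /
+# primary_tigs_paths and every path (plus leftover unitig edges) to
+# all_tigs.fa / all_tigs_paths, then delete the consumed edges and their
+# reverse complements from the working graph until it is empty.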
+def get_bundles(u_edges):
+    
+    ASM_graph = nx.DiGraph()
+    out_f = open("primary_tigs.fa", "w")
+    main_tig_paths = open("primary_tigs_paths","w")
+    sv_tigs = open("all_tigs.fa","w")
+    sv_tig_paths = open("all_tigs_paths","w")
+    max_weight = 0 
+    for v, w in u_edges:
+        x = max( [len(s[1]) for s in u_edges[ (v,w) ] ] )
+        print "W", v, w, x
+        if x > max_weight:
+            max_weight = x
+            
+    in_edges = {}
+    out_edges = {}
+    for v, w in u_edges:
+        in_edges.setdefault(w, []) 
+        out_edges.setdefault(w, []) 
+        in_edges[w].append( (v, w) )
+
+        out_edges.setdefault(v, [])
+        in_edges.setdefault(v, [])
+        out_edges[v].append( (v, w) )
+
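+    # n_weight note: the edge weight is max_weight minus the length of the
+    # longest sequence supporting the edge, so the weighted shortest-path
+    # searches below prefer routes through the longest sequences.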
+    u_graph = nx.DiGraph()
+    for v,w in u_edges:
+
+        u_graph.add_edge(v, w, n_weight = max_weight - max( [len(s[1]) for s in  u_edges[ (v,w) ] ] ) )
+    
+    bundle_index = 0
+    G = u_graph.copy()
+    visited_u_edges = set()
+    while len(G) > 0:
+        
+        root_nodes = set() 
+        for n in G: 
+            if G.in_degree(n) != 1 or G.out_degree(n) !=1 : 
+                root_nodes.add(n) 
+        
+        if len(root_nodes) == 0:  
+            root_nodes.add( G.nodes()[0] ) 
+        
+        candidates = [] 
+        
+        for n in list(root_nodes): 
+            sp =nx.single_source_shortest_path_length(G, n) 
+            sp = sp.items() 
+            sp.sort(key=lambda x : x[1]) 
+            longest = sp[-1] 
+            print "L", n, longest[0]
+            if longest[0].split(":")[0] == n.split(":")[0]: #avoid a big loop 
+                continue
+            candidates.append ( (longest[1], n, longest[0]) ) 
+
+        if len(candidates) == 0:
+            print "no more candidates", len(G.edges()), len(G.nodes())
+            if len(G.edges()) > 0:
+                path = G.edges()[0] 
+            else:
+                break
+        else:
+            candidates.sort() 
+            
+            candidate = candidates[-1] 
+            
+            if candidate[1] == candidate[2]: 
+                G.remove_node(candidate[1]) 
+                continue 
+         
+            path = nx.shortest_path(G, candidate[1], candidate[2], "n_weight") 
+        print "X", path[0], path[-1], len(path)
+        
+        cmp_edges = set()
+        g_edges = set(G.edges())
+        new_path = []  
+        tail = True
+        # avoid confusion caused by long palindromic sequences
+        for i in range( 0, len( path ) - 1 ):
+            v_n, w_n = path[i:i+2]
+            new_path.append(v_n)
+            #if (v_n, w_n) in cmp_edges or\
+            #    len(u_graph.out_edges(w_n)) > 5 or\
+            #    len(u_graph.in_edges(w_n)) > 5:
+            if (v_n, w_n) in cmp_edges: 
+                tail = False
+                break
+
+            r_id, end = v_n.split(":")
+            end = "E" if end == "B" else "B" 
+            v_n2 = r_id + ":" + end 
+
+            r_id, end = w_n.split(":")
+            end = "E" if end == "B" else "B" 
+            w_n2 = r_id + ":" + end 
+
+            if (w_n2, v_n2) in g_edges:
+                cmp_edges.add( (w_n2, v_n2) )
+        if tail:
+            new_path.append(w_n)
+                
+        
+        if len(new_path) > 1:
+            path = new_path
+            
+            print "Y", path[0], path[-1], len(path)
+            #bundle_graph, bundle_paths, bundle_graph_edges = get_bundle( path, u_graph, u_edges )
+
+            bundle_graph, bundle_paths, bundle_graph_edges = get_bundle( path, G, G.edges() )
+            print "Z", bundle_paths[0][0], bundle_paths[0][-1]
+            print bundle_index, len(path), len(bundle_paths[0]), len(bundle_paths), len(bundle_graph_edges)
+            if len(bundle_graph_edges) > 0:
+
+                #ASM_graph.add_path(bundle_paths[0], ctg="%04d" % bundle_index)
+                extra_u_edges = []
+                
+                print >> main_tig_paths, ">%04d %s" % ( bundle_index, " ".join(bundle_paths[0]) )
+                subseqs = []
+            
+                for i in range(len(bundle_paths[0]) - 1): 
+                    v, w = bundle_paths[0][i:i+2]
+                    uedges = u_edges[ (v,w) ]
+                    uedges.sort( key= lambda x: len(x[0]) )
+                    subseqs.append( uedges[-1][1] )
+                    visited_u_edges.add( "-".join(uedges[-1][0]) ) 
+                    for ue in uedges:
+                        if "-".join(ue[0]) not in visited_u_edges:
+                            visited_u_edges.add("-".join(ue[0]))
+                            extra_u_edges.append(ue)
+                seq = "".join(subseqs)        
+                if len(seq) > 0:
+                    print >> out_f, ">%04d %s-%s" % (bundle_index, bundle_paths[0][0], bundle_paths[0][-1])
+                    print >> out_f, seq
+                
+                sv_tig_idx = 0
+                for sv_path in bundle_paths:
+                    print >> sv_tig_paths, ">%04d-%04d %s" % ( bundle_index, sv_tig_idx, " ".join(sv_path) )
+                    ASM_graph.add_path(sv_path, ctg="%04d" % bundle_index)
+                    subseqs = []
+                    for i in range(len(sv_path) - 1): 
+                        v, w = sv_path[i:i+2]
+                        uedges = u_edges[ (v,w) ]
+                        uedges.sort( key= lambda x: len(x[0]) )
+                        subseqs.append( uedges[-1][1] )
+                        visited_u_edges.add( "-".join(uedges[-1][0]) ) 
+                        for ue in uedges:
+                            if "-".join(ue[0]) not in visited_u_edges:
+                                visited_u_edges.add("-".join(ue[0]))
+                                extra_u_edges.append(ue)
+                    seq = "".join(subseqs)        
+                    if len(seq) > 0: 
+                        print >> sv_tigs, ">%04d-%04d %s-%s" % (bundle_index, sv_tig_idx, sv_path[0], sv_path[-1])
+                        print >> sv_tigs, "".join(subseqs)
+                    sv_tig_idx += 1
+                for u_path, seq in extra_u_edges:
+                    #u_path = u_path.split("-")
+                    ASM_graph.add_edge(u_path[0], u_path[-1], ctg="%04d" % bundle_index)
+                    print >> sv_tig_paths, ">%04d-%04d-u %s" % ( bundle_index, sv_tig_idx, " ".join(u_path) )
+                    print >> sv_tigs, ">%04d-%04d-u %s-%s" % (bundle_index, sv_tig_idx, u_path[0], u_path[-1])
+                    print >> sv_tigs, seq
+                    sv_tig_idx += 1
+                    
+                
+                bundle_index += 1
+            else:
+                bundle_graph_edges = zip(path[:-1],path[1:])
+        else:
+            bundle_graph_edges = zip(path[:-1],path[1:])
+        
+        #clean up the graph
+
+        edges = set(G.edges())
+        edges_to_be_removed = list(set(bundle_graph_edges))
+        print "BGE",bundle_graph_edges
+        
+        edge_remove_count = 0
+        for v, w in edges_to_be_removed:
+            if (v, w) in edges:
+                G.remove_edge( v, w )
+                edge_remove_count += 1
+                print "remove edge", v, w
+                
+        edges = set(G.edges())
+        for v, w in edges_to_be_removed:
+
+            r_id, end = v.split(":")
+            end = "E" if end == "B" else "B"
+            v = r_id + ":" + end
+
+            r_id, end = w.split(":")
+            end = "E" if end == "B" else "B"
+            w = r_id + ":" + end
+
+            if (w, v) in edges:
+                G.remove_edge( w, v )
+                edge_remove_count += 1
+                print "remove edge", w, v
+
+        if edge_remove_count == 0:
+            print "premature termination", len(edges), len(G.nodes())
+            break
+            
+        nodes = G.nodes()
+        for n in nodes:
+            if G.in_degree(n) == 0 and G.out_degree(n) == 0:
+                G.remove_node(n)
+                print "remove node", n 
+
+    sv_tig_paths.close()
+    sv_tigs.close()
+    main_tig_paths.close()
+    out_f.close()
+    return ASM_graph
+
+
+
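+# SGToNXG converts the reduced string graph into a networkx DiGraph and writes
+# each kept edge to "edges_list" as "v w label score" (falcon_fixasm.py later
+# reads this file back).  n_weight follows the same scheme as above:
+# max_score - score, so weighted shortest paths favor the highest-scoring
+# overlaps.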
+def SGToNXG(sg):
+    G=nx.DiGraph()
+
+    max_score = max([ sg.edges[ e ].attr["score"] for e in sg.edges if sg.e_reduce[e] != True ])
+    out_f = open("edges_list","w")
+    for v, w in sg.edges:
+        if sg.e_reduce[(v, w)] != True:
+        ##if 1:
+            out_degree = len(sg.nodes[v].out_edges)
+            G.add_node( v, size = out_degree )
+            G.add_node( w, size = out_degree )
+            label = sg.edges[ (v, w) ].attr["label"]
+            score = sg.edges[ (v, w) ].attr["score"]
+            print >>out_f, v, w, label, score 
+            G.add_edge( v, w, label = label, weight = 0.001*score, n_weight = max_score - score )
+            #print in_node_name, out_node_name
+    out_f.close()
+    return G
+
+if __name__ == "__main__":
+    
+    overlap_file = sys.argv[1]
+    read_fasta = sys.argv[2]
+
+    seqs = {}
+    #f = FastaReader("pre_assembled_reads.fa")
+    f = FastaReader(read_fasta)
+    for r in f:
+        seqs[r.name] = r.sequence.upper()
+
+    overlap_data = []
+    contained_reads = set()
+    overlap_count = {}
+    with open(overlap_file) as f:
+        for l in f:
+            l = l.strip().split()
+            if len(l) != 13:
+                continue
+            f_id, g_id, score, identity = l[:4]
+            if f_id == g_id:
+                continue
+            if g_id not in seqs:
+                continue
+            if f_id not in seqs:
+                continue
+            score = int(score)
+            identity = float(identity)
+            contained = l[12]
+            if contained == "contained":
+                contained_reads.add(f_id)
+                continue
+            if contained == "contains":
+                contained_reads.add(g_id)
+                continue
+            if contained == "none":
+                continue
+            if identity < 96:
+                continue
+            #if score > -2000:
+            #    continue
+            f_strand, f_start, f_end, f_len = (int(c) for c in l[4:8])
+            g_strand, g_start, g_end, g_len = (int(c) for c in l[8:12])
+            if f_len < 4000: continue
+            if g_len < 4000: continue
+            
+            # double check for a proper dovetail overlap: each read may have at most ~24bp of unaligned overhang at the joined end
+            if f_start > 24 and f_len - f_end > 24:
+                continue
+            
+            if g_start > 24 and g_len - g_end > 24:
+                continue
+            
+            if g_strand == 0:
+                if f_start < 24 and g_len - g_end > 24:
+                    continue
+                if g_start < 24 and f_len - f_end > 24:
+                    continue
+            else:
+                if f_start < 24 and g_start > 24:
+                    continue
+                if g_start < 24 and f_start > 24:
+                    continue
+
+            #if g_strand != 0:
+            #    continue
+            overlap_data.append( (f_id, g_id, score, identity,
+                                  f_strand, f_start, f_end, f_len,
+                                  g_strand, g_start, g_end, g_len) )
+
+            overlap_count[f_id] = overlap_count.get(f_id,0)+1
+            overlap_count[g_id] = overlap_count.get(g_id,0)+1
+
+    overlap_set = set()
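+    # Node naming convention used below: each read contributes two nodes,
+    # "<read_id>:B" and "<read_id>:E", one per read end.  An edge label
+    # "<read_id>:b-e" means the walk extends by seqs[read_id][b:e],
+    # reverse-complemented when b > e; generate_unitig() consumes the labels
+    # with exactly that interpretation.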
+    sg = StringGraph()
+    #G=nx.Graph()
+    for od in overlap_data:
+        f_id, g_id, score, identity = od[:4]
+        if f_id in contained_reads:
+            continue
+        if g_id in contained_reads:
+            continue
+        #if overlap_count.get(f_id, 0) < 3 or overlap_count.get(f_id, 0) > 400:
+        #    continue
+        #if overlap_count.get(g_id, 0) < 3 or overlap_count.get(g_id, 0) > 400:
+        #    continue
+        f_s, f_b, f_e, f_l = od[4:8]
+        g_s, g_b, g_e, g_l = od[8:12]
+        overlap_pair = [f_id, g_id]
+        overlap_pair.sort()
+        overlap_pair = tuple( overlap_pair )
+        if overlap_pair in overlap_set:
+            continue
+        else:
+            overlap_set.add(overlap_pair)
+
+        
+        if g_s == 1:
+            g_b, g_e = g_e, g_b
+        if f_b > 24:
+            if g_b < g_e:
+                """
+                     f.B         f.E
+                  f  ----------->
+                  g         ------------->
+                            g.B           g.E
+                """
+                if f_b == 0 or g_e - g_l == 0:
+                    continue
+                sg.add_edge( "%s:B" % g_id, "%s:B" % f_id, label = "%s:%d-%d" % (f_id, f_b, 0), 
+                                                           length = abs(f_b-0),
+                                                           score = -score)
+                sg.add_edge( "%s:E" % f_id, "%s:E" % g_id, label = "%s:%d-%d" % (g_id, g_e, g_l), 
+                                                           length = abs(g_e-g_l),
+                                                           score = -score)
+            else:
+                """
+                     f.B         f.E
+                  f  ----------->
+                  g         <-------------
+                            g.E           g.B           
+                """
+                if f_b == 0 or g_e == 0:
+                    continue
+                sg.add_edge( "%s:E" % g_id, "%s:B" % f_id, label = "%s:%d-%d" % (f_id, f_b, 0), 
+                                                           length = abs(f_b -0),
+                                                           score = -score)
+                sg.add_edge( "%s:E" % f_id, "%s:B" % g_id, label = "%s:%d-%d" % (g_id, g_e, 0), 
+                                                           length = abs(g_e- 0),
+                                                           score = -score)
+        else:
+            if g_b < g_e:
+                """
+                                    f.B         f.E
+                  f                 ----------->
+                  g         ------------->
+                            g.B           g.E
+                """
+                if g_b == 0 or f_e - f_l == 0:
+                    continue
+                sg.add_edge( "%s:B" % f_id, "%s:B" % g_id, label = "%s:%d-%d" % (g_id, g_b, 0), 
+                                                           length = abs(g_b - 0),
+                                                           score = -score)
+                sg.add_edge( "%s:E" % g_id, "%s:E" % f_id, label = "%s:%d-%d" % (f_id, f_e, f_l), 
+                                                           length = abs(f_e-f_l),
+                                                           score = -score)
+            else:
+                """
+                                    f.B         f.E
+                  f                 ----------->
+                  g         <-------------
+                            g.E           g.B           
+                """
+                if g_b - g_l == 0 or f_e - f_l == 0:
+                    continue
+                sg.add_edge( "%s:B" % f_id, "%s:E" % g_id, label = "%s:%d-%d" % (g_id, g_b, g_l), 
+                                                           length = abs(g_b - g_l),
+                                                           score = -score)
+                sg.add_edge( "%s:B" % g_id, "%s:E" % f_id, label = "%s:%d-%d" % (f_id, f_e, f_l), 
+                                                           length = abs(f_e - f_l),
+                                                           score = -score)
+        
+    sg.mark_tr_edges()
+    print sum( [1 for c in sg.e_reduce.values() if c == True] )
+    print sum( [1 for c in sg.e_reduce.values() if c == False] )
+    G = SGToNXG(sg)
+    nx.write_adjlist(G, "full_string_graph.adj")
+    sg.mark_best_overlap()
+    print sum( [1 for c in sg.e_reduce.values() if c == False] )
+    #sg.mark_repeat_overlap()
+    #print sum( [1 for c in sg.repeat_overlap.values() if c == True] )
+    #print sum( [1 for c in sg.repeat_overlap.values() if c == False] )
+    #print len(sg.e_reduce), len(sg.repeat_overlap)
+
+
+
+    G = SGToNXG(sg)
+    nx.write_gexf(G, "string_graph.gexf")
+    nx.write_adjlist(G, "string_graph.adj")
+
+    #generate_max_contig(sg, seqs, out_fn="max_tigs.fa")
+    u_edges = generate_unitig(sg, seqs, out_fn = "unitgs.fa")
+    ASM_graph = get_bundles(u_edges )
+    nx.write_gexf(ASM_graph, "asm_graph.gexf")
diff --git a/src/py_scripts/falcon_dedup.py b/src/py_scripts/falcon_dedup.py
new file mode 100644
index 0000000..cbf04aa
--- /dev/null
+++ b/src/py_scripts/falcon_dedup.py
@@ -0,0 +1,119 @@
+import subprocess
+from pbcore.io import FastaReader
+
+def get_matches(seq0, seq1):
+    with open("tmp_seq0.fa","w") as f:
+        print >>f, ">seq0"
+        print >>f, seq0
+    with open("tmp_seq1.fa","w") as f:
+        print >>f, ">seq1"
+        print >>f, seq1
+    mgaps_out = subprocess.check_output("mummer -maxmatch -c -b -l 24 tmp_seq0.fa tmp_seq1.fa | mgaps ", stderr = open("/dev/null", "w"), shell=True)
+
+    matches = []
+    cluster = []
+    for l in mgaps_out.split("\n"):
+        l = l.strip().split()
+        if len(l) == 0:
+            continue
+        if l[0] == ">":
+            seq_id = l[1]
+            
+            if len(cluster) != 0:
+                matches.append(cluster)
+            
+            cluster = []
+            continue
+        if l[0] == "#":
+            if len(cluster) != 0:
+                matches.append(cluster)            
+            cluster = []
+            continue
+        len_ = int(l[2])
+        r_s = int(l[0])
+        q_s = int(l[1])
+        r_e = r_s + len_
+        q_e = q_s + len_
+        cluster.append( ((r_s, r_e), (q_s, q_e)) )
+    if len(cluster) != 0:
+        matches.append(cluster)
+    return matches
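+
+# get_matches() returns the mummer/mgaps clusters as a list of clusters, each
+# a list of ((r_s, r_e), (q_s, q_e)) coordinate pairs on the reference (seq0)
+# and query (seq1).  A purely hypothetical example:
+#   [ [((0, 120), (10, 130)), ((150, 400), (160, 410))] ]
+# would be a single cluster made of two exact matches.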
+
+
+u_edges = {}
+with open("./unit_edges.dat") as f:
+    for l in f:
+        v, w, path, seq = l.strip().split()
+        u_edges.setdefault( (v, w), [] )
+        u_edges[ (v, w) ].append( (path, seq) )
+        
+
+p_tig_path = {}
+a_tig_path = {}
+with open("primary_tigs_paths_c") as f:
+    for l in f:
+        l = l.strip().split()
+        id_ = l[0][1:]
+        path = l[1:]
+        p_tig_path[id_] = path
+
+with open("all_tigs_paths") as f:
+    for l in f:
+        l = l.strip().split()
+        id_ = l[0][1:]
+        path = l[1:]
+        a_tig_path[id_] = path
+
+p_tig_seqs = {}
+for r in FastaReader("primary_tigs_c.fa"):
+    p_tig_seqs[r.name] = r.sequence
+
+a_tig_seqs = {}
+for r in FastaReader("all_tigs.fa"):
+    a_tig_seqs[r.name.split()[0]] = r.sequence
+
+p_tig_to_node_pos = {}
+node_pos = []
+with open("primary_tigs_node_pos_c") as f:
+    for l in f:
+        l = l.strip().split()
+        p_tig_to_node_pos.setdefault( l[0], [])
+        p_tig_to_node_pos[l[0]].append( (l[1], int(l[2])))
+
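+# Dedup pass (a reading of the code below): for every associated tig whose
+# first and last nodes both lie on its primary tig, align the corresponding
+# primary segment against the a-tig with mummer.  A-tigs whose match is
+# fragmented (more than one cluster) or whose single cluster passes the
+# coverage test are written to a_nodup.fa; the rest are collected in
+# duplicate_a_tigs.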
+duplicate_a_tigs = []
+with open("a_nodup.fa","w") as out_f:
+    for p_tig_id in p_tig_path:
+        main_path = p_tig_path[p_tig_id]
+        main_path_nodes = set(main_path[:])
+        p_tig_seq = p_tig_seqs[p_tig_id]
+        a_node = []
+        a_node_range = []
+        a_node_range_map = {}
+        node_to_pos = dict( p_tig_to_node_pos[p_tig_id] )
+        for id_ in a_tig_path:
+            if id_[:4] != p_tig_id[:4]:
+                continue
+            if id_.split("-")[1] == "0000":
+                continue
+            
+            a_path = a_tig_path[id_]
+            if a_path[0] in main_path_nodes and a_path[-1] in main_path_nodes:
+                #print p_tig_id, id_, a_path[0], a_path[-1]
+                s, e = node_to_pos[a_path[0]], node_to_pos[a_path[-1]]
+                p_seq = p_tig_seq[s:e]
+                a_seq = a_tig_seqs[id_] 
+                seq_match = get_matches(p_seq, a_seq)
+                if len(seq_match) > 1:
+                    print >>out_f, ">"+id_
+                    print >>out_f,  a_seq
+                    continue
+                try:
+                    r_s, r_e = seq_match[0][0][0][0], seq_match[0][-1][0][1]
+                except IndexError: # no match cluster at all
+                    print "XXX", seq_match
+                    continue # r_s/r_e would be undefined (or stale) below
+                if 100.0 * (r_e - r_s) / (e - s) > 98: # coverage of the primary segment, in percent
+                    print >>out_f, ">"+id_
+                    print >>out_f, a_seq
+                    continue
+                duplicate_a_tigs.append(id_)
+
diff --git a/src/py_scripts/falcon_fixasm.py b/src/py_scripts/falcon_fixasm.py
new file mode 100644
index 0000000..33c1b8c
--- /dev/null
+++ b/src/py_scripts/falcon_fixasm.py
@@ -0,0 +1,213 @@
+#!/usr/bin/env python
+
+#################################################################################$$
+# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted (subject to the limitations in the
+# disclaimer below) provided that the following conditions are met:
+#
+#  * Redistributions of source code must retain the above copyright
+#  notice, this list of conditions and the following disclaimer.
+#
+#  * Redistributions in binary form must reproduce the above
+#  copyright notice, this list of conditions and the following
+#  disclaimer in the documentation and/or other materials provided
+#  with the distribution.
+#
+#  * Neither the name of Pacific Biosciences nor the names of its
+#  contributors may be used to endorse or promote products derived
+#  from this software without specific prior written permission.
+#
+# NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
+# GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
+# BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+#################################################################################$$
+
+import networkx as nx
+from pbcore.io import FastaReader
+
+def neighbor_bound(G, v, w, radius):
+    g1 = nx.ego_graph(G, v, radius=radius, undirected=False)
+    g2 = nx.ego_graph(G, w, radius=radius, undirected=False)
+    if len(g1) < radius or len(g2) < radius:
+        return True
+    print v, len(g1), w, len(g2), radius
+    if len(set(g1.edges()) & set(g2.edges())) > 0:
+        return True
+    else:
+        return False
+    
+def is_branch_node(G, n):
+    out_edges = G.out_edges([n])
+    n2 = [ e[1] for e in out_edges ]
+    is_branch = False
+    for i in range(len(n2)):
+        for j in range(i+1, len(n2)):
+            v = n2[i]
+            w = n2[j]
+            if neighbor_bound(G, v, w, 20) == False:
+                is_branch = True
+                break
+        if is_branch == True:
+            break
+    return is_branch
+
+
+def get_r_path(r_edges, u_path):
+    tiling_path = []
+    pos = 0
+     
+    for i in range( len(u_path) - 1): 
+        v, w = u_path[i:i+2]
+        r_edge_label, overlap = r_edges[ (v, w) ]
+        r_edge_seq_id, range_ = r_edge_label.split(":")
+        range_ = range_.split("-")
+        s, e = int(range_[0]), int(range_[1])
+        pos += abs(e-s)
+        tiling_path.append( (pos, w, s, e) )
+    return tiling_path
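+
+# get_r_path() expands a unitig path into read-level tiling tuples of the form
+# (cumulative_pos, node, s, e).  With hypothetical values, an edge labeled
+# "read7:100-1500" would advance the running position by abs(1500-100) and
+# yield the tuple (pos, w, 100, 1500).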
+
+def get_seq(u_edges, r_edges, path):
+    subseqs = []
+    pos = []
+    cur_pos = 0
+    full_tiling_path = []
+
+    for i in range( len(path) - 1):
+        v, w = path[i:i+2]
+        pos.append( (v, cur_pos) )
+        uedges = u_edges[ (v, w) ]
+        uedges.sort( key= lambda x: len(x[0]) )
+        subseqs.append( uedges[-1][1] )
+        r_path = get_r_path( r_edges, uedges[-1][0].split("-") )
+        r_path = [ ( x[0] + cur_pos, x[1], x[2], x[3]) for x in r_path ]
+        full_tiling_path.extend( r_path )
+        cur_pos += len( uedges[-1][1] )
+    pos.append( (w, cur_pos) ) 
+    return "".join(subseqs), pos, full_tiling_path
+
+
+u_edges = {}
+with open("unit_edges.dat") as f:
+    for l in f:
+        v, w, path, seq = l.strip().split()
+        u_edges.setdefault( (v, w), [] )
+        u_edges[ (v, w) ].append( (path, seq) )
+
+
+r_edges = {}
+with open("edges_list") as f:
+    for l in f:
+        v, w, edge_label, overlap = l.strip().split()
+        r_edges[ (v, w) ] = (edge_label, int(overlap) ) 
+
+
+primary_tigs_path = {}
+primary_path_graph = nx.DiGraph()
+begin_nodes = {}
+end_nodes ={}
+with open("primary_tigs_paths") as f:
+    for l in f:
+        l = l.strip().split()
+        name = l[0][1:]
+        path = l[1:]
+        primary_tigs_path[name] = path
+        if len(path) < 3:
+            continue
+        for i in range(len(path)-1):
+            n1 = path[i].split(":")[0]
+            n2 = path[i+1].split(":")[0]
+            primary_path_graph.add_edge( n1, n2)
+        begin_nodes.setdefault(path[0], [])
+        begin_nodes[path[0]].append( name )
+        end_nodes.setdefault(path[-1], [])
+        end_nodes[path[-1]].append( name )
+
+
+
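+# Main pass (a reading of the code below): walk each primary tig path, break
+# it at nodes where the compressed path graph genuinely branches (see
+# is_branch_node), and emit every piece with more than five tiling edges as
+# "<name>_<sub_idx>", together with its sequence, path, node positions, and
+# tiling path.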
+path_names = primary_tigs_path.keys()
+path_names.sort()
+primary_path_graph_r = primary_path_graph.reverse()
+path_f = open("primary_tigs_paths_c","w")
+pos_f = open("primary_tigs_node_pos_c", "w")
+tiling_path_f = open("all_tiling_path_c", "w")
+with open("primary_tigs_c.fa","w") as out_f:
+    for name in path_names:
+        sub_idx = 0
+        c_path = [ primary_tigs_path[name][0] ]
+        for v in primary_tigs_path[name][1:]:
+            break_path = False
+            
+            vn = v.split(":")[0]
+
+            if primary_path_graph.out_degree(vn) > 1:
+                break_path = is_branch_node(primary_path_graph, vn)
+            if primary_path_graph.in_degree(vn) > 1:
+                break_path = is_branch_node(primary_path_graph_r, vn)
+            if break_path:
+                c_path.append(v)
+                seq, pos, full_tiling_path = get_seq(u_edges, r_edges, c_path)
+                for p, w, s, e in full_tiling_path:
+                    print >> tiling_path_f, "%s_%02d" % (name, sub_idx), p, w, s, e
+                if len(full_tiling_path) <= 5:
+                    continue
+                print >>out_f, ">%s_%02d" % (name, sub_idx)
+                print >>out_f, seq
+                print >>path_f, ">%s_%02d" % (name, sub_idx), " ".join(c_path)
+                #print c_path
+                for node, p in pos:
+                    print >> pos_f, "%s_%02d %s %d" % (name, sub_idx, node, p)
+                c_path = [v]
+                sub_idx += 1
+            else:
+                c_path.append(v)
+                
+        if len(c_path) > 1:
+            seq, pos, full_tiling_path = get_seq(u_edges, r_edges, c_path)
+            for p, w, s, e in full_tiling_path:
+                print >> tiling_path_f, "%s_%02d" % (name, sub_idx), p, w, s, e
+            if len(full_tiling_path) <= 5:
+                continue
+            print >>out_f, ">%s_%02d" % (name, sub_idx)
+            print >>out_f, seq
+            print >>path_f, ">%s_%02d" % (name, sub_idx), " ".join(c_path)
+            for node, p in pos:
+                print >> pos_f, "%s_%02d %s %d" % (name, sub_idx, node, p)
+
+with open("all_tigs_paths") as f:
+    for l in f:
+        l = l.strip().split()
+        name = l[0][1:]
+        name = name.split("-")
+        if name[1] == "0000":
+            continue
+        if len(name) == 2:
+            path = l[1:]
+            seq, pos, full_tiling_path = get_seq(u_edges, r_edges, path)
+            for p, w, s, e in full_tiling_path:
+                print >> tiling_path_f, "%s" % ("-".join(name)), p, w, s, e
+        else:
+            path = l[1:]
+            full_tiling_path = get_r_path(r_edges, path)
+            for p, w, s, e in full_tiling_path:
+                print >> tiling_path_f, "%s" % ("-".join(name)), p, w, s, e
+
+            
+path_f.close()
+tiling_path_f.close()
+pos_f.close()
diff --git a/src/py_scripts/falcon_overlap.py b/src/py_scripts/falcon_overlap.py
new file mode 100755
index 0000000..c6ae2a5
--- /dev/null
+++ b/src/py_scripts/falcon_overlap.py
@@ -0,0 +1,328 @@
+#!/usr/bin/env python
+
+#################################################################################$$
+# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted (subject to the limitations in the
+# disclaimer below) provided that the following conditions are met:
+#
+#  * Redistributions of source code must retain the above copyright
+#  notice, this list of conditions and the following disclaimer.
+#
+#  * Redistributions in binary form must reproduce the above
+#  copyright notice, this list of conditions and the following
+#  disclaimer in the documentation and/or other materials provided
+#  with the distribution.
+#
+#  * Neither the name of Pacific Biosciences nor the names of its
+#  contributors may be used to endorse or promote products derived
+#  from this software without specific prior written permission.
+#
+# NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
+# GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
+# BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+#################################################################################$$
+
+from falcon_kit import * 
+from pbcore.io import FastaReader
+import numpy as np
+import collections
+import sys
+import multiprocessing as mp
+from multiprocessing import sharedctypes
+from ctypes import *
+
+global sa_ptr, sda_ptr, lk_ptr
+global q_seqs, seqs
+RC_MAP = dict( zip("ACGTacgtNn-", "TGCAtgcaNn-") )
+
+def get_overlap_alignment(seq1, seq0):
+
+    K = 8
+    lk_ptr = kup.allocate_kmer_lookup( 1 << (K * 2) )
+    sa_ptr = kup.allocate_seq( len(seq0) )
+    sda_ptr = kup.allocate_seq_addr( len(seq0) )
+    kup.add_sequence( 0, K, seq0, len(seq0), sda_ptr, sa_ptr, lk_ptr)
+
+    kmer_match_ptr = kup.find_kmer_pos_for_seq(seq1, len(seq1), K, sda_ptr, lk_ptr)
+    kmer_match = kmer_match_ptr[0]
+    aln_range_ptr = kup.find_best_aln_range(kmer_match_ptr, K, K*5, 50)
+    #x,y = zip( * [ (kmer_match.query_pos[i], kmer_match.target_pos[i]) for i in range(kmer_match.count )] )
+    aln_range = aln_range_ptr[0]
+    kup.free_kmer_match(kmer_match_ptr)
+    s1, e1, s0, e0 = aln_range.s1, aln_range.e1, aln_range.s2, aln_range.e2  
+    e1 += K + K/2
+    e0 += K + K/2
+    kup.free_aln_range(aln_range)
+    len_1 = len(seq1)
+    len_0 = len(seq0)
+    if e1 > len_1: 
+        e1 = len_1
+    if e0 > len_0:
+        e0 = len_0
+    do_aln = False
+    contain_status = "none" 
+    #print s0, e0, s1, e1 
+    if e1 - s1 > 500:
+        if s0 < s1 and s0 > 24:
+            do_aln = False
+        elif s1 <= s0 and s1 > 24:
+            do_aln = False
+        elif s1 < 24 and len_1 - e1 < 24:
+            do_aln = True
+            contain_status = "contains"
+            #print "X1"
+        elif s0 < 24 and len_0 - e0 < 24:
+            do_aln = True
+            contain_status = "contained"
+            #print "X2"
+        else:
+            do_aln = True
+            if s0 < s1:
+                s1 -= s0 #assert s1 > 0
+                s0 = 0
+                e1 = len_1
+                #if len_1 - s1 >= len_0:
+                #    do_aln = False
+                #    contain_status = "contains"
+                #    print "X3", s0, e0, len_0, s1, e1, len_1
+
+                
+            elif s1 <= s0:
+                s0 -= s1 #assert s1 > 0
+                s1 = 0
+                e0 = len_0
+                #print s0, e0, s1, e1
+                #if len_0 - s0 >= len_1:
+                #    do_aln = False
+                #    contain_status = "contained"
+                #    print "X4"
+        #if abs( (e1 - s1) - (e0 - s0 ) ) > 200:  #avoid overlap alignment for big indels
+        #    do_aln = False
+
+        if do_aln:
+            alignment = DWA.align(seq1[s1:e1], e1-s1,
+                                  seq0[s0:e0], e0-s0,
+                                  500, 0)
+            #print seq1[s1:e1]
+            #print seq0[s2:e2]
+            #if alignment[0].aln_str_size > 500:
+    
+            #aln_str1 = alignment[0].q_aln_str
+            #aln_str0 = alignment[0].t_aln_str
+            aln_size = alignment[0].aln_str_size
+            aln_dist = alignment[0].dist
+            aln_q_s = alignment[0].aln_q_s
+            aln_q_e = alignment[0].aln_q_e
+            aln_t_s = alignment[0].aln_t_s
+            aln_t_e = alignment[0].aln_t_e
+            assert aln_q_e- aln_q_s <= alignment[0].aln_str_size or aln_t_e- aln_t_s <= alignment[0].aln_str_size
+            #print aln_str1
+            #print aln_str0
+            if aln_size > 500 and contain_status == "none": 
+                contain_status = "overlap"            
+            DWA.free_alignment(alignment)
+        
+    kup.free_seq_addr_array(sda_ptr)
+    kup.free_seq_array(sa_ptr)
+    kup.free_kmer_lookup(lk_ptr)
+
+    if do_aln:
+        if s1 > 1000 and s0 > 1000:
+            return 0, 0, 0, 0, 0, 0, "none"
+        if len_1 - (s1+aln_q_e-aln_q_s) > 1000 and len_0 - (s0+aln_t_e-aln_t_s) > 1000:
+            return 0, 0, 0, 0, 0, 0, "none"
+
+
+
+
+    if e1 - s1 > 500 and do_aln and aln_size > 500:
+        #return s1, s1+aln_q_e-aln_q_s, s2, s2+aln_t_e-aln_t_s, aln_size, aln_dist, x, y
+        return s1, s1+aln_q_e-aln_q_s, s0, s0+aln_t_e-aln_t_s, aln_size, aln_dist, contain_status
+    else:
+        return 0, 0, 0, 0, 0, 0, contain_status 
+
+def get_candidate_aln(hit_input):
+
+    global q_seqs
+    q_name, hit_index_f, hit_index_r = hit_input
+    q_seq = q_seqs[q_name]
+
+    rtn = []
+
+    hit_index = hit_index_f 
+    c = collections.Counter(hit_index)
+    s = [item[0] for item in c.items() if item[1] > 50] # chunk ids hit by more than 50 k-mers
+    #s.sort()
+    targets = set()
+    for p in s:
+        hit_id = seqs[p][0]
+        if hit_id in targets or hit_id == q_name:
+            continue
+        targets.add(hit_id)
+        seq1, seq0 = q_seq, q_seqs[hit_id]
+        aln_data = get_overlap_alignment(seq1, seq0)
+        #rtn = get_alignment(seq1, seq0)
+        if aln_data is not None:
+            
+            s1, e1, s2, e2, aln_size, aln_dist, c_status = aln_data
+            #print >>f, name, 0, s1, e1, len(seq1), hit_id, 0, s2, e2, len(seq0),  aln_size, aln_dist
+            rtn.append( ( hit_id, q_name, aln_dist - aln_size, "%0.2f" % (100 - 100.0*aln_dist/(aln_size+1)), 
+                          0, s2, e2, len(seq0), 
+                          0, s1, e1, len(seq1), c_status ) )
+
+    r_q_seq = "".join([RC_MAP[c] for c in q_seq[::-1]])
+    
+    hit_index = hit_index_r 
+    c = collections.Counter(hit_index)
+    s = [item[0] for item in c.items() if item[1] > 50] # chunk ids hit by more than 50 k-mers
+    #s.sort()
+    targets = set()
+    for p in s:
+        hit_id = seqs[p][0]
+        if hit_id in targets or hit_id == q_name:
+            continue
+        targets.add(hit_id)
+        seq1, seq0 = r_q_seq, q_seqs[hit_id]
+        aln_data = get_overlap_alignment(seq1, seq0)
+        #rtn = get_alignment(seq1, seq0)
+        if aln_data is not None:
+            s1, e1, s2, e2, aln_size, aln_dist, c_status = aln_data
+            #print >>f, name, 1, s1, e1, len(seq1), hit_id, 0, s2, e2, len(seq0),  aln_size, aln_dist
+            rtn.append( ( hit_id, q_name, aln_dist - aln_size, "%0.2f" % (100 - 100.0*aln_dist/(aln_size+1)), 
+                          0, s2, e2, len(seq0), 
+                          1, len(seq1) - e1, len(seq1)- s1, len(seq1), c_status ) )
+
+    return rtn
+
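+# Each record appended above has the 13 space-separated fields that
+# falcon_asm.py expects on its overlap input:
+#   f_id g_id score identity  f_strand f_start f_end f_len
+#   g_strand g_start g_end g_len  contain_status
+# where the f_* fields describe the hit sequence (seq0) and the g_* fields
+# the query (seq1).
+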
+def build_look_up(seqs, K):
+    global sa_ptr, sda_ptr, lk_ptr
+
+    total_index_base = len(seqs) * 1000
+    sa_ptr = sharedctypes.RawArray(base_t, total_index_base)
+    c_sa_ptr = cast(sa_ptr, POINTER(base_t))
+    kup.init_seq_array(c_sa_ptr, total_index_base)
+
+    sda_ptr = sharedctypes.RawArray(seq_coor_t, total_index_base)
+    c_sda_ptr = cast(sda_ptr, POINTER(seq_coor_t))
+
+    lk_ptr = sharedctypes.RawArray(KmerLookup, 1 << (K*2))
+    c_lk_ptr = cast(lk_ptr, POINTER(KmerLookup))
+    kup.init_kmer_lookup(c_lk_ptr, 1 << (K*2))
+
+    start = 0
+    for r_name, seq in seqs:
+        kup.add_sequence( start, K, seq, 1000, c_sda_ptr, c_sa_ptr, c_lk_ptr)
+        start += 1000
+
+    kup.mask_k_mer(1 << (K * 2), c_lk_ptr, 512)
+    
+    #return sda_ptr, sa_ptr, lk_ptr
+
+
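+# Index layout note: build_look_up() concatenates 1000-base chunks of the
+# reads into one shared-memory sequence array, so a k-mer hit at target
+# position p maps back to chunk p / 1000, and seqs[p / 1000][0] is the name of
+# the read that chunk came from (used by get_candidate_hits and
+# get_candidate_aln).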
+
+def get_candidate_hits(q_name):
+
+    global sa_ptr, sda_ptr, lk_ptr
+    global q_seqs
+
+    K = 14
+    q_seq = q_seqs[q_name]
+
+    rtn = []
+
+    c_sda_ptr = cast(sda_ptr, POINTER(seq_coor_t))
+    c_sa_ptr = cast(sa_ptr, POINTER(base_t))
+    c_lk_ptr = cast(lk_ptr, POINTER(KmerLookup))
+
+    kmer_match_ptr = kup.find_kmer_pos_for_seq(q_seq, len(q_seq), K, c_sda_ptr, c_lk_ptr)
+    kmer_match = kmer_match_ptr[0]
+    count = kmer_match.count
+    hit_index_f = np.array(kmer_match.target_pos[0:count])/1000
+    kup.free_kmer_match(kmer_match_ptr)
+
+    r_q_seq = "".join([RC_MAP[c] for c in q_seq[::-1]])
+    
+    kmer_match_ptr = kup.find_kmer_pos_for_seq(r_q_seq, len(r_q_seq), K, c_sda_ptr, c_lk_ptr)
+    kmer_match = kmer_match_ptr[0]
+    count = kmer_match.count
+    hit_index_r = np.array(kmer_match.target_pos[0:count])/1000
+    kup.free_kmer_match(kmer_match_ptr)
+    return  q_name, hit_index_f, hit_index_r
+
+
+def q_names( q_seqs ):
+    for q_name, q_seq in q_seqs.items():
+        yield q_name
+
+
+def lookup_data_iterator( q_seqs, m_pool ):
+    for mr in m_pool.imap( get_candidate_hits, q_names(q_seqs)):
+        yield mr
+
+
+if __name__ == "__main__":
+    import argparse
+    parser = argparse.ArgumentParser(description='a simple multi-processor overlapper for sequence reads')
+    parser.add_argument('fasta_file', help='a fasta file whose reads are all overlapped against each other')
+    parser.add_argument('--min_len', type=int, default=4000, 
+                        help='minimum length of the reads to be considered for overlapping')
+    parser.add_argument('--n_core', type=int, default=1,
+                        help='number of processes used for detailed overlapping evaluation')
+    parser.add_argument('--d_core', type=int, default=1, 
+                        help='number of processes used for k-mer matching')
+
+
+    args = parser.parse_args()
+
+    seqs = []
+    q_seqs = {}
+    f = FastaReader(args.fasta_file) # read the input fasta file named on the command line
+
+    if args.min_len < 2200:
+        args.min_len = 2200
+
+    idx = 0
+    for r in f:
+        if len(r.sequence) < args.min_len:
+            continue
+        seq = r.sequence.upper()
+        for start in range(0, len(seq), 1000):
+            if start+1000 > len(seq):
+                break
+            seqs.append( (r.name, seq[start: start+1000]) )
+            idx += 1
+        
+        #seqs.append( (r.name, seq[:1000]) )
+        seqs.append( (r.name, seq[-1000:]) )
+        idx += 1
+
+        q_seqs[r.name] = seq
+
+
+    total_index_base = len(seqs) * 1000
+    pool = mp.Pool(args.n_core)
+    K = 14
+    build_look_up(seqs, K)
+    m_pool = mp.Pool(args.d_core)
+
+    
+    #for r in pool.imap(get_candidate_aln, lookup_data_iterator( q_seqs)):
+    for r in pool.imap(get_candidate_aln, lookup_data_iterator( q_seqs, m_pool)):
+        for h in r:
+            print " ".join([str(x) for x in h]) 
+
diff --git a/src/py_scripts/falcon_overlap2.py b/src/py_scripts/falcon_overlap2.py
new file mode 100755
index 0000000..9ffbf56
--- /dev/null
+++ b/src/py_scripts/falcon_overlap2.py
@@ -0,0 +1,337 @@
+#!/usr/bin/env python
+
+#################################################################################$$
+# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted (subject to the limitations in the
+# disclaimer below) provided that the following conditions are met:
+#
+#  * Redistributions of source code must retain the above copyright
+#  notice, this list of conditions and the following disclaimer.
+#
+#  * Redistributions in binary form must reproduce the above
+#  copyright notice, this list of conditions and the following
+#  disclaimer in the documentation and/or other materials provided
+#  with the distribution.
+#
+#  * Neither the name of Pacific Biosciences nor the names of its
+#  contributors may be used to endorse or promote products derived
+#  from this software without specific prior written permission.
+#
+# NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
+# GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
+# BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+#################################################################################$$
+
+from falcon_kit import * 
+from pbcore.io import FastaReader
+import numpy as np
+import collections
+import sys
+import multiprocessing as mp
+from multiprocessing import sharedctypes
+from ctypes import *
+
+global sa_ptr, sda_ptr, lk_ptr
+global q_seqs,t_seqs, seqs
+RC_MAP = dict( zip("ACGTacgtNn-", "TGCAtgcaNn-") )
+
+def get_overlap_alignment(seq1, seq0):
+
+    K = 8
+    lk_ptr = kup.allocate_kmer_lookup( 1 << (K * 2) )
+    sa_ptr = kup.allocate_seq( len(seq0) )
+    sda_ptr = kup.allocate_seq_addr( len(seq0) )
+    kup.add_sequence( 0, K, seq0, len(seq0), sda_ptr, sa_ptr, lk_ptr)
+
+    kmer_match_ptr = kup.find_kmer_pos_for_seq(seq1, len(seq1), K, sda_ptr, lk_ptr)
+    kmer_match = kmer_match_ptr[0]
+    aln_range_ptr = kup.find_best_aln_range(kmer_match_ptr, K, K*5, 50)
+    #x,y = zip( * [ (kmer_match.query_pos[i], kmer_match.target_pos[i]) for i in range(kmer_match.count )] )
+    aln_range = aln_range_ptr[0]
+    kup.free_kmer_match(kmer_match_ptr)
+    s1, e1, s0, e0 = aln_range.s1, aln_range.e1, aln_range.s2, aln_range.e2  
+    e1 += K + K/2
+    e0 += K + K/2
+    kup.free_aln_range(aln_range)
+    len_1 = len(seq1)
+    len_0 = len(seq0)
+    if e1 > len_1: 
+        e1 = len_1
+    if e0 > len_0:
+        e0 = len_0
+    do_aln = False
+    contain_status = "none" 
+    #print s0, e0, s1, e1 
+    if e1 - s1 > 500:
+        if s0 < s1 and s0 > 24:
+            do_aln = False
+        elif s1 <= s0 and s1 > 24:
+            do_aln = False
+        elif s1 < 24 and len_1 - e1 < 24:
+            do_aln = True
+            contain_status = "contains"
+            #print "X1"
+        elif s0 < 24 and len_0 - e0 < 24:
+            do_aln = True
+            contain_status = "contained"
+            #print "X2"
+        else:
+            do_aln = True
+            if s0 < s1:
+                s1 -= s0 #assert s1 > 0
+                s0 = 0
+                e1 = len_1
+                #if len_1 - s1 >= len_0:
+                #    do_aln = False
+                #    contain_status = "contains"
+                #    print "X3", s0, e0, len_0, s1, e1, len_1
+
+                
+            elif s1 <= s0:
+                s0 -= s1 #assert s1 > 0
+                s1 = 0
+                e0 = len_0
+                #print s0, e0, s1, e1
+                #if len_0 - s0 >= len_1:
+                #    do_aln = False
+                #    contain_status = "contained"
+                #    print "X4"
+        #if abs( (e1 - s1) - (e0 - s0 ) ) > 200:  #avoid overlap alignment for big indels
+        #    do_aln = False
+
+        if do_aln:
+            alignment = DWA.align(seq1[s1:e1], e1-s1,
+                                  seq0[s0:e0], e0-s0,
+                                  500, 0)
+            #print seq1[s1:e1]
+            #print seq0[s2:e2]
+            #if alignment[0].aln_str_size > 500:
+    
+            #aln_str1 = alignment[0].q_aln_str
+            #aln_str0 = alignment[0].t_aln_str
+            aln_size = alignment[0].aln_str_size
+            aln_dist = alignment[0].dist
+            aln_q_s = alignment[0].aln_q_s
+            aln_q_e = alignment[0].aln_q_e
+            aln_t_s = alignment[0].aln_t_s
+            aln_t_e = alignment[0].aln_t_e
+            assert aln_q_e- aln_q_s <= alignment[0].aln_str_size or aln_t_e- aln_t_s <= alignment[0].aln_str_size
+            #print aln_str1
+            #print aln_str0
+            if aln_size > 500 and contain_status == "none": 
+                contain_status = "overlap"            
+            DWA.free_alignment(alignment)
+        
+    kup.free_seq_addr_array(sda_ptr)
+    kup.free_seq_array(sa_ptr)
+    kup.free_kmer_lookup(lk_ptr)
+
+    if do_aln:
+        if s1 > 1000 and s0 > 1000:
+            return 0, 0, 0, 0, 0, 0, "none"
+        if len_1 - (s1+aln_q_e-aln_q_s) > 1000 and len_0 - (s0+aln_t_e-aln_t_s) > 1000:
+            return 0, 0, 0, 0, 0, 0, "none"
+
+    if e1 - s1 > 500 and do_aln and aln_size > 500:
+        #return s1, s1+aln_q_e-aln_q_s, s2, s2+aln_t_e-aln_t_s, aln_size, aln_dist, x, y
+        return s1, s1+aln_q_e-aln_q_s, s0, s0+aln_t_e-aln_t_s, aln_size, aln_dist, contain_status
+    else:
+        return 0, 0, 0, 0, 0, 0, contain_status 
+
+def get_candidate_aln(hit_input):
+
+    global q_seqs, seqs, t_seqs
+    q_name, hit_index_f, hit_index_r = hit_input
+    q_seq = q_seqs[q_name]
+
+    rtn = []
+
+    hit_index = hit_index_f 
+    c = collections.Counter(hit_index)
+    s = [item[0] for item in c.items() if item[1] > 50] # chunk ids hit by more than 50 k-mers
+    #s.sort()
+    targets = set()
+    for p in s:
+        hit_id = seqs[p][0]
+        if hit_id in targets or hit_id == q_name:
+            continue
+        targets.add(hit_id)
+        seq1, seq0 = q_seq, t_seqs[hit_id]
+        aln_data = get_overlap_alignment(seq1, seq0)
+        #rtn = get_alignment(seq1, seq0)
+        if aln_data is not None:
+             
+            s1, e1, s2, e2, aln_size, aln_dist, c_status = aln_data
+            if c_status == "none":
+                continue
+            #print >>f, name, 0, s1, e1, len(seq1), hit_id, 0, s2, e2, len(seq0),  aln_size, aln_dist
+            rtn.append( ( hit_id, q_name, aln_dist - aln_size, "%0.2f" % (100 - 100.0*aln_dist/(aln_size+1)), 
+                          0, s2, e2, len(seq0), 
+                          0, s1, e1, len(seq1), c_status ) )
+
+    r_q_seq = "".join([RC_MAP[c] for c in q_seq[::-1]])
+    
+    hit_index = hit_index_r 
+    c = collections.Counter(hit_index)
+    s = [item[0] for item in c.items() if item[1] > 50] # chunk ids hit by more than 50 k-mers
+    #s.sort()
+    targets = set()
+    for p in s:
+        hit_id = seqs[p][0]
+        if hit_id in targets or hit_id == q_name:
+            continue
+        targets.add(hit_id)
+        seq1, seq0 = r_q_seq, t_seqs[hit_id]
+        aln_data = get_overlap_alignment(seq1, seq0)
+        #rtn = get_alignment(seq1, seq0)
+        if aln_data is not None:
+            s1, e1, s2, e2, aln_size, aln_dist, c_status = aln_data
+            if c_status == "none":
+                continue
+            #print >>f, name, 1, s1, e1, len(seq1), hit_id, 0, s2, e2, len(seq0),  aln_size, aln_dist
+            rtn.append( ( hit_id, q_name, aln_dist - aln_size, "%0.2f" % (100 - 100.0*aln_dist/(aln_size+1)), 
+                          0, s2, e2, len(seq0), 
+                          1, len(seq1) - e1, len(seq1)- s1, len(seq1), c_status ) )
+
+    return rtn
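+
+# Note: this variant differs from falcon_overlap.py in three ways: queries and
+# targets come from two separate fasta files (q_seqs vs t_seqs), hits whose
+# contain_status is "none" are dropped instead of reported, and k-mers seen
+# more than 256 times (rather than 512) are masked in the lookup table.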
+
+def build_look_up(seqs, K):
+    global sa_ptr, sda_ptr, lk_ptr
+
+    total_index_base = len(seqs) * 1000
+    sa_ptr = sharedctypes.RawArray(base_t, total_index_base)
+    c_sa_ptr = cast(sa_ptr, POINTER(base_t))
+    kup.init_seq_array(c_sa_ptr, total_index_base)
+
+    sda_ptr = sharedctypes.RawArray(seq_coor_t, total_index_base)
+    c_sda_ptr = cast(sda_ptr, POINTER(seq_coor_t))
+
+    lk_ptr = sharedctypes.RawArray(KmerLookup, 1 << (K*2))
+    c_lk_ptr = cast(lk_ptr, POINTER(KmerLookup))
+    kup.init_kmer_lookup(c_lk_ptr, 1 << (K*2))
+
+    start = 0
+    for r_name, seq in seqs:
+        kup.add_sequence( start, K, seq, 1000, c_sda_ptr, c_sa_ptr, c_lk_ptr)
+        start += 1000
+
+    kup.mask_k_mer(1 << (K * 2), c_lk_ptr, 256)
+    
+    #return sda_ptr, sa_ptr, lk_ptr
+
+
+
+def get_candidate_hits(q_name):
+
+    global sa_ptr, sda_ptr, lk_ptr
+    global q_seqs
+
+    K = 14
+    q_seq = q_seqs[q_name]
+
+    rtn = []
+
+    c_sda_ptr = cast(sda_ptr, POINTER(seq_coor_t))
+    c_sa_ptr = cast(sa_ptr, POINTER(base_t))
+    c_lk_ptr = cast(lk_ptr, POINTER(KmerLookup))
+
+    kmer_match_ptr = kup.find_kmer_pos_for_seq(q_seq, len(q_seq), K, c_sda_ptr, c_lk_ptr)
+    kmer_match = kmer_match_ptr[0]
+    count = kmer_match.count
+    hit_index_f = np.array(kmer_match.target_pos[0:count])/1000
+    kup.free_kmer_match(kmer_match_ptr)
+
+    r_q_seq = "".join([RC_MAP[c] for c in q_seq[::-1]])
+    
+    kmer_match_ptr = kup.find_kmer_pos_for_seq(r_q_seq, len(r_q_seq), K, c_sda_ptr, c_lk_ptr)
+    kmer_match = kmer_match_ptr[0]
+    count = kmer_match.count
+    hit_index_r = np.array(kmer_match.target_pos[0:count])/1000
+    kup.free_kmer_match(kmer_match_ptr)
+    return  q_name, hit_index_f, hit_index_r
+
+
+def q_names( q_seqs ):
+    for q_name, q_seq in q_seqs.items():
+        yield q_name
+
+
+def lookup_data_iterator( q_seqs, m_pool ):
+    for mr in m_pool.imap( get_candidate_hits, q_names(q_seqs)):
+        yield mr
+
+
+if __name__ == "__main__":
+    import argparse
+    parser = argparse.ArgumentParser(description='a simple multi-processor overlapper for sequence reads')
+    parser.add_argument('query_fa', help='a fasta file to be overlapped with the sequences in the target')
+    parser.add_argument('target_fa', help='a fasta file of the target sequences for overlapping')
+    parser.add_argument('--min_len', type=int, default=4000, 
+                        help='minimum length of the reads to be considered for overlapping')
+    parser.add_argument('--n_core', type=int, default=1,
+                        help='number of processes used for the detailed overlapping evaluation')
+    parser.add_argument('--d_core', type=int, default=1, 
+                        help='number of processes used for k-mer matching')
+
+
+    args = parser.parse_args()
+
+    seqs = []
+    q_seqs = {}
+    t_seqs = {}
+    f = FastaReader(args.target_fa) # read the target fasta file named on the command line
+
+    if args.min_len < 2200:
+        args.min_len = 2200
+
+    idx = 0
+    for r in f:
+        if len(r.sequence) < args.min_len:
+            continue
+        seq = r.sequence.upper()
+        for start in range(0, len(seq), 1000):
+            if start+1000 > len(seq):
+                break
+            seqs.append( (r.name, seq[start: start+1000]) )
+            idx += 1
+        
+        seqs.append( (r.name, seq[-1000:]) )
+        idx += 1
+
+        t_seqs[r.name] = seq
+
+    f = FastaReader(args.query_fa) # read the query fasta file named on the command line
+    for r in f:
+        if len(r.sequence) < args.min_len:
+            continue
+        seq = r.sequence.upper()
+        q_seqs[r.name] = seq
+
+
+    total_index_base = len(seqs) * 1000
+    pool = mp.Pool(args.n_core)
+    K = 14
+    build_look_up(seqs, K)
+    m_pool = mp.Pool(args.d_core)
+
+    
+    #for r in pool.imap(get_candidate_aln, lookup_data_iterator( q_seqs)):
+    for r in pool.imap(get_candidate_aln, lookup_data_iterator( q_seqs, m_pool)):
+        for h in r:
+            print " ".join([str(x) for x in h]) 
+
diff --git a/src/py_scripts/falcon_qrm.py b/src/py_scripts/falcon_qrm.py
new file mode 100755
index 0000000..5196b65
--- /dev/null
+++ b/src/py_scripts/falcon_qrm.py
@@ -0,0 +1,370 @@
+#!/usr/bin/env python
+
+#################################################################################$$
+# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted (subject to the limitations in the
+# disclaimer below) provided that the following conditions are met:
+#
+#  * Redistributions of source code must retain the above copyright
+#  notice, this list of conditions and the following disclaimer.
+#
+#  * Redistributions in binary form must reproduce the above
+#  copyright notice, this list of conditions and the following
+#  disclaimer in the documentation and/or other materials provided
+#  with the distribution.
+#
+#  * Neither the name of Pacific Biosciences nor the names of its
+#  contributors may be used to endorse or promote products derived
+#  from this software without specific prior written permission.
+#
+# NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
+# GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
+# BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+#################################################################################$$
+
+from falcon_kit import * 
+from pbcore.io import FastaReader
+import numpy as np
+import collections
+import sys
+import multiprocessing as mp
+from multiprocessing import sharedctypes
+from ctypes import *
+import math
+
+global sa_ptr, sda_ptr, lk_ptr
+global q_seqs, t_seqs, seqs
+global n_candidates, max_candidates
+
+seqs = []
+RC_MAP = dict( zip("ACGTacgtNn-", "TGCAtgcaNn-") )
+
+all_fivemers = []
+cmap = {0:"A", 1:"T", 2:"C", 3:"G"}
+for i in range(1024):
+    mer = []
+    for j in range(5):
+        mer.append( cmap[ i >> (2 *j) & 3 ])
+    all_fivemers.append("".join(mer))
+
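+# fivemer_entropy scores sequence complexity: the Shannon entropy of the
+# 5-mer distribution with add-one smoothing over all 1024 possible 5-mers.
+# The commented-out filters below use it to skip low-complexity reads.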
+def fivemer_entropy(seq):
+    five_mer_count = {}
+
+    for i in range(len(seq) - 4): # visit every 5-mer window
+        five_mer = seq[i:i+5]
+        five_mer_count.setdefault(five_mer, 0)
+        five_mer_count[five_mer] += 1
+    
+    entropy = 0.0
+    for five_mer in all_fivemers:
+        p = five_mer_count.get(five_mer, 0) + 1.0
+        p /= len(seq)
+        entropy += - p * math.log(p)
+
+    return entropy
+
+def get_alignment(seq1, seq0):
+
+    K = 8 
+    lk_ptr = kup.allocate_kmer_lookup( 1 << (K * 2) )
+    sa_ptr = kup.allocate_seq( len(seq0) )
+    sda_ptr = kup.allocate_seq_addr( len(seq0) )
+    kup.add_sequence( 0, K, seq0, len(seq0), sda_ptr, sa_ptr, lk_ptr)
+
+    kup.mask_k_mer(1 << (K * 2), lk_ptr, 16)
+    kmer_match_ptr = kup.find_kmer_pos_for_seq(seq1, len(seq1), K, sda_ptr, lk_ptr)
+    kmer_match = kmer_match_ptr[0]
+    aln_range_ptr = kup.find_best_aln_range2(kmer_match_ptr, K, K*50, 25)
+    #x,y = zip( * [ (kmer_match.query_pos[i], kmer_match.target_pos[i]) for i in range(kmer_match.count )] )
+    aln_range = aln_range_ptr[0]
+    kup.free_kmer_match(kmer_match_ptr)
+    s1, e1, s0, e0, km_score = aln_range.s1, aln_range.e1, aln_range.s2, aln_range.e2, aln_range.score  
+    e1 += K + K/2
+    e0 += K + K/2
+    kup.free_aln_range(aln_range)
+    len_1 = len(seq1)
+    len_0 = len(seq0)
+    if e1 > len_1: 
+        e1 = len_1
+    if e0 > len_0:
+        e0 = len_0
+
+    aln_size = 1
+    if e1 - s1 > 500:
+
+        aln_size = max( e1-s1, e0-s0 )
+        aln_score = int(km_score * 48)
+        aln_q_s = s1
+        aln_q_e = e1
+        aln_t_s = s0
+        aln_t_e = e0
+        
+    kup.free_seq_addr_array(sda_ptr)
+    kup.free_seq_array(sa_ptr)
+    kup.free_kmer_lookup(lk_ptr)
+
+    if s1 > 1000 and s0 > 1000:
+        return 0, 0, 0, 0, 0, 0, "none"
+
+    if len_1 - e1 > 1000 and len_0 - e0 > 1000:
+        return 0, 0, 0, 0, 0, 0, "none"
+
+
+    if e1 - s1 > 500 and aln_size > 500:
+        return s1, s1+aln_q_e-aln_q_s, s0, s0+aln_t_e-aln_t_s, aln_size, aln_score, "aln"
+    else:
+        return 0, 0, 0, 0, 0, 0, "none"
+
+def get_candidate_aln(hit_input):
+    
+    global q_seqs, seqs, t_seqs, q_len
+    global max_candidates
+    global n_candidates
+    q_name, hit_index_f, hit_index_r = hit_input
+    q_seq = q_seqs[q_name]
+
+    rtn = []
+    hit_index = hit_index_f
+    c = collections.Counter(hit_index)
+    s = [(chunk, count) for chunk, count in c.items() if count > 4]
+    
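+    # aggregate the chunk hits per target read:
+    # [total hit count, best single-chunk hit count, number of chunks hit]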
+    hit_data = {}
+    #hit_ids = set()
+
+    for p, hit_count in s:
+        hit_id = seqs[p][0]
+        hit_data.setdefault(hit_id, [0, 0, 0])
+        hit_data[hit_id][0] += hit_count
+        if hit_count > hit_data[hit_id][1]:
+            hit_data[hit_id][1] = hit_count
+        hit_data[hit_id][2] += 1
+
+    hit_data = hit_data.items()
+
+    hit_data.sort( key=lambda x:-x[1][0] )
+
+    target_count = {}
+    total_hit = 0
+
+    for hit in hit_data[:n_candidates]:
+        hit_id = hit[0]
+        hit_count = hit[1][0]
+        target_count.setdefault(hit_id, 0)
+        if target_count[hit_id] > max_candidates:
+            continue
+        if total_hit > max_candidates:
+            continue
+        seq1, seq0 = q_seq, t_seqs[hit_id]
+        aln_data = get_alignment(seq1, seq0)
+        if aln_data is not None:
+             
+            s1, e1, s2, e2, aln_size, aln_score, c_status = aln_data
+            if c_status == "none":
+                continue
+            target_count[hit_id] += 1
+            total_hit += 1
+            rtn.append( ( q_name, hit_id, -aln_score, "%0.2f" % (100.0*aln_score/(aln_size+1)), 
+                          0, s1, e1, len(seq1), 
+                          0, s2, e2, len(seq0), c_status + " %d" % hit_count ) )
+
+    r_q_seq = "".join([RC_MAP[c] for c in q_seq[::-1]])
+    
+    hit_index = hit_index_r 
+    c = collections.Counter(hit_index)
+    s = [(chunk, count) for chunk, count in c.items() if count > 4]
+
+    hit_data = {}
+    #hit_ids = set()
+
+    for p, hit_count in s:
+        hit_id = seqs[p][0]
+        hit_data.setdefault(hit_id, [0, 0, 0])
+        hit_data[hit_id][0] += hit_count
+        if hit_count > hit_data[hit_id][1]:
+            hit_data[hit_id][1] = hit_count
+        hit_data[hit_id][2] += 1
+
+    hit_data = hit_data.items()
+
+    hit_data.sort( key=lambda x:-x[1][0] )
+
+
+    target_count = {}
+    total_hit = 0
+
+    for hit in hit_data[:n_candidates]:
+        hit_id = hit[0] 
+        hit_count = hit[1][0]
+        target_count.setdefault(hit_id, 0)
+        if target_count[hit_id] > max_candidates:
+            continue
+        if total_hit > max_candidates:
+            continue
+        seq1, seq0 = r_q_seq, t_seqs[hit_id]
+        aln_data = get_alignment(seq1, seq0)
+        if aln_data is not None:
+            s1, e1, s2, e2, aln_size, aln_score, c_status = aln_data
+            if c_status == "none":
+                continue
+            target_count[hit_id] += 1
+            total_hit += 1
+            rtn.append( ( q_name, hit_id, -aln_score, "%0.2f" % (100.0*aln_score/(aln_size+1)), 
+                          0, len(seq1) - e1, len(seq1) - s1, len(seq1), 
+                          1, s2, e2, len(seq0), c_status + " %d" % hit_count ) )
+
+    return rtn
+
+def build_look_up(seqs, K):
+    global sa_ptr, sda_ptr, lk_ptr
+
+    total_index_base = len(seqs) * 1000
+    sa_ptr = sharedctypes.RawArray(base_t, total_index_base)
+    c_sa_ptr = cast(sa_ptr, POINTER(base_t))
+    kup.init_seq_array(c_sa_ptr, total_index_base)
+
+    sda_ptr = sharedctypes.RawArray(seq_coor_t, total_index_base)
+    c_sda_ptr = cast(sda_ptr, POINTER(seq_coor_t))
+
+    lk_ptr = sharedctypes.RawArray(KmerLookup, 1 << (K*2))
+    c_lk_ptr = cast(lk_ptr, POINTER(KmerLookup))
+    kup.init_kmer_lookup(c_lk_ptr, 1 << (K*2))
+
+    start = 0
+    for r_name, seq in seqs:
+        kup.add_sequence( start, K, seq, 1000, c_sda_ptr, c_sa_ptr, c_lk_ptr)
+        start += 1000
+
+    kup.mask_k_mer(1 << (K * 2), c_lk_ptr, 1024)
+    
+    #return sda_ptr, sa_ptr, lk_ptr
+
+
+
+def get_candidate_hits(q_name):
+
+    global sa_ptr, sda_ptr, lk_ptr
+    global q_seqs
+
+    K = 14
+    q_seq = q_seqs[q_name]
+
+    rtn = []
+
+    c_sda_ptr = cast(sda_ptr, POINTER(seq_coor_t))
+    c_sa_ptr = cast(sa_ptr, POINTER(base_t))
+    c_lk_ptr = cast(lk_ptr, POINTER(KmerLookup))
+
+    kmer_match_ptr = kup.find_kmer_pos_for_seq(q_seq, len(q_seq), K, c_sda_ptr, c_lk_ptr)
+    kmer_match = kmer_match_ptr[0]
+    count = kmer_match.count
+    hit_index_f = np.array(kmer_match.target_pos[0:count])/1000
+    kup.free_kmer_match(kmer_match_ptr)
+
+    r_q_seq = "".join([RC_MAP[c] for c in q_seq[::-1]])
+    
+    kmer_match_ptr = kup.find_kmer_pos_for_seq(r_q_seq, len(r_q_seq), K, c_sda_ptr, c_lk_ptr)
+    kmer_match = kmer_match_ptr[0]
+    count = kmer_match.count
+    hit_index_r = np.array(kmer_match.target_pos[0:count])/1000
+    kup.free_kmer_match(kmer_match_ptr)
+    return  q_name, hit_index_f, hit_index_r
+
+
+def q_names( q_seqs ):
+    for q_name, q_seq in q_seqs.items():
+        yield q_name
+
+
+def lookup_data_iterator( q_seqs, m_pool ):
+    for mr in m_pool.imap( get_candidate_hits, q_names(q_seqs)):
+        yield mr
+
+
+if __name__ == "__main__":
+    import argparse
+    parser = argparse.ArgumentParser(description='a simple multi-processor overlapper for sequence reads')
+    parser.add_argument('target_fofn', help='a file of fasta file names (fofn) for the target sequences')
+    parser.add_argument('query_fofn', help='a file of fasta file names (fofn) to be overlapped with the target sequences')
+    parser.add_argument('--min_len', type=int, default=4000, 
+                        help='minimum length of the reads to be considered for overlapping')
+    parser.add_argument('--n_core', type=int, default=1,
+                        help='number of processes used for the detailed overlapping evaluation')
+    parser.add_argument('--d_core', type=int, default=1, 
+                        help='number of processes used for k-mer matching')
+    parser.add_argument('--n_candidates', type=int, default=128,
+                        help='number of candidate target reads considered per query')
+    parser.add_argument('--max_candidates', type=int, default=64,
+                        help='maximum number of matches to output per query')
+
+
+
+    args = parser.parse_args()
+
+    max_candidates = args.max_candidates
+    n_candidates = args.n_candidates
+
+    q_seqs = {}
+    t_seqs = {}
+    if args.min_len < 1200:
+        args.min_len = 1200
+
+    with open(args.target_fofn) as fofn:
+        for fn in fofn:
+            fn = fn.strip()
+            f = FastaReader(fn) # read each fasta file listed in the target fofn
+            for r in f:
+                if len(r.sequence) < args.min_len:
+                    continue
+                seq = r.sequence.upper()
+                for start in range(0, len(seq), 1000):
+                    if start+1000 > len(seq):
+                        break
+                    subseq = seq[start: start+1000]
+                    #if fivemer_entropy(subseq) < 4:
+                    #    continue
+                    seqs.append( (r.name, subseq) )
+                subseq = seq[-1000:]
+                #if fivemer_entropy(subseq) < 4:
+                #    continue
+                #seqs.append( (r.name, seq[:1000]) )
+                seqs.append( (r.name, subseq) )
+
+                t_seqs[r.name] = seq
+
+    with open(args.query_fofn) as fofn:
+        for fn in fofn:
+            fn = fn.strip()
+            f = FastaReader(fn) # read each fasta file listed in the query fofn
+            for r in f:
+                seq = r.sequence.upper()
+                #if fivemer_entropy(seq) < 4:
+                #    continue
+                q_seqs[r.name] = seq
+
+
+    pool = mp.Pool(args.n_core)
+    K = 14
+    build_look_up(seqs, K)
+    m_pool = mp.Pool(args.d_core)
+
+    
+    #for r in pool.imap(get_candidate_aln, lookup_data_iterator( q_seqs)):
+    for r in pool.imap(get_candidate_aln, lookup_data_iterator(q_seqs, m_pool)):
+        for h in r:
+            print " ".join([str(x) for x in h]) 
+
diff --git a/src/py_scripts/falcon_sense.py b/src/py_scripts/falcon_sense.py
new file mode 100644
index 0000000..c23b7bf
--- /dev/null
+++ b/src/py_scripts/falcon_sense.py
@@ -0,0 +1,243 @@
+#!/usr/bin/env python
+
+#################################################################################$$
+# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted (subject to the limitations in the
+# disclaimer below) provided that the following conditions are met:
+#
+#  * Redistributions of source code must retain the above copyright
+#  notice, this list of conditions and the following disclaimer.
+#
+#  * Redistributions in binary form must reproduce the above
+#  copyright notice, this list of conditions and the following
+#  disclaimer in the documentation and/or other materials provided
+#  with the distribution.
+#
+#  * Neither the name of Pacific Biosciences nor the names of its
+#  contributors may be used to endorse or promote products derived
+#  from this software without specific prior written permission.
+#
+# NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
+# GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
+# BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+#################################################################################$$
+
+from ctypes import *
+import sys
+from multiprocessing import Pool
+import os
+import falcon_kit
+
+module_path = falcon_kit.__path__[0]
+
+falcon = CDLL(os.path.join(module_path, "falcon.so"))
+
+falcon.generate_consensus.argtypes = [ POINTER(c_char_p), c_uint, c_uint, c_uint, c_uint, c_uint, c_double ]
+falcon.generate_consensus.restype = POINTER(falcon_kit.ConsensusData)
+falcon.free_consensus_data.argtypes = [ POINTER(falcon_kit.ConsensusData) ]
+
+
+def get_alignment(seq1, seq0, edge_tolerance = 1000):
+
+    kup = falcon_kit.kup
+    K = 8 
+    lk_ptr = kup.allocate_kmer_lookup( 1 << (K * 2) )
+    sa_ptr = kup.allocate_seq( len(seq0) )
+    sda_ptr = kup.allocate_seq_addr( len(seq0) )
+    kup.add_sequence( 0, K, seq0, len(seq0), sda_ptr, sa_ptr, lk_ptr)
+
+    kup.mask_k_mer(1 << (K * 2), lk_ptr, 16)
+    kmer_match_ptr = kup.find_kmer_pos_for_seq(seq1, len(seq1), K, sda_ptr, lk_ptr)
+    kmer_match = kmer_match_ptr[0]
+    aln_range_ptr = kup.find_best_aln_range2(kmer_match_ptr, K, K*50, 25)
+    #x,y = zip( * [ (kmer_match.query_pos[i], kmer_match.target_pos[i]) for i in range(kmer_match.count )] )
+    aln_range = aln_range_ptr[0]
+    kup.free_kmer_match(kmer_match_ptr)
+    s1, e1, s0, e0, km_score = aln_range.s1, aln_range.e1, aln_range.s2, aln_range.e2, aln_range.score  
+    e1 += K + K/2
+    e0 += K + K/2
+    kup.free_aln_range(aln_range)
+    len_1 = len(seq1)
+    len_0 = len(seq0)
+    if e1 > len_1: 
+        e1 = len_1
+    if e0 > len_0:
+        e0 = len_0
+
+    aln_size = 1
+    if e1 - s1 > 500:
+
+        aln_size = max( e1-s1, e0-s0 )
+        aln_score = int(km_score * 48)
+        aln_q_s = s1
+        aln_q_e = e1
+        aln_t_s = s0
+        aln_t_e = e0
+        
+    kup.free_seq_addr_array(sda_ptr)
+    kup.free_seq_array(sa_ptr)
+    kup.free_kmer_lookup(lk_ptr)
+
+    if s1 > edge_tolerance and s0 > edge_tolerance:
+        return 0, 0, 0, 0, 0, 0, "none"
+
+    if len_1 - e1 > edge_tolerance and len_0 - e0 > edge_tolerance:
+        return 0, 0, 0, 0, 0, 0, "none"
+
+
+    if e1 - s1 > 500 and aln_size > 500:
+        return s1, s1+aln_q_e-aln_q_s, s0, s0+aln_t_e-aln_t_s, aln_size, aln_score, "aln"
+    else:
+        return 0, 0, 0, 0, 0, 0, "none"
+
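+# The consensus workers copy the Python strings into a ctypes c_char_p
+# array, call the C routine, then copy the consensus string out and free
+# the C-side buffers.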
+def get_consensus_without_trim( c_input ):
+    seqs, seed_id, config = c_input
+    min_cov, K, local_match_count_window, local_match_count_threshold, max_n_read, min_idt, edge_tolerance, trim_size = config
+    if len(seqs) > max_n_read:
+        seqs = seqs[:max_n_read]
+    seqs_ptr = (c_char_p * len(seqs))()
+    seqs_ptr[:] = seqs
+    consensus_data_ptr = falcon.generate_consensus( seqs_ptr, len(seqs), min_cov, K, 
+                                                    local_match_count_window, local_match_count_threshold, min_idt )
+
+    consensus = string_at(consensus_data_ptr[0].sequence)[:]
+    eff_cov = consensus_data_ptr[0].eff_cov[:len(consensus)]
+    falcon.free_consensus_data( consensus_data_ptr )
+    del seqs_ptr
+    return consensus, seed_id
+
+def get_consensus_with_trim( c_input ):
+    seqs, seed_id, config = c_input
+    min_cov, K, local_match_count_window, local_match_count_threshold, max_n_read, min_idt, edge_tolerance, trim_size = config
+    trim_seqs = []
+    seed = seqs[0]
+    for seq in seqs[1:]:
+        aln_data = get_alignment(seq, seed, edge_tolerance)
+        s1, e1, s2, e2, aln_size, aln_score, c_status = aln_data
+        if c_status == "none":
+            continue
+        if aln_score > 1000 and e1 - s1 > 500:
+            e1 -= trim_size
+            s1 += trim_size
+            trim_seqs.append( (e1-s1, seq[s1:e1]) )
+    trim_seqs.sort(key = lambda x:-x[0]) #use longest alignment first
+    trim_seqs = [x[1] for x in trim_seqs]
+        
+    if len(trim_seqs) > max_n_read:
+        trim_seqs = trim_seqs[:max_n_read]
+
+    trim_seqs = [seed] + trim_seqs
+
+
+    seqs_ptr = (c_char_p * len(trim_seqs))()
+    seqs_ptr[:] = trim_seqs
+    consensus_data_ptr = falcon.generate_consensus( seqs_ptr, len(trim_seqs), min_cov, K, 
+                                               local_match_count_window, local_match_count_threshold, min_idt )
+    consensus = string_at(consensus_data_ptr[0].sequence)[:]
+    eff_cov = consensus_data_ptr[0].eff_cov[:len(consensus)]
+    falcon.free_consensus_data( consensus_data_ptr )
+    del seqs_ptr
+    return consensus, seed_id
+
+
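+# get_seq_data parses the grouped-read stream produced by get_rdata.py:
+# each line is "<read_id> <sequence>", the first read of a group is the
+# seed, a "+ +" line closes a group, and a "- -" line ends the stream.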
+def get_seq_data(config):
+    seqs = []
+    seed_id = None
+    seqs_data = []
+    with sys.stdin as f:
+        for l in f:
+            l = l.strip().split()
+            if len(l) != 2:
+                continue
+            if l[0] not in ("+", "-"):
+                if len(l[1]) > 100:
+                    if len(seqs) == 0:
+                        seqs.append(l[1]) #the "seed"
+                        seed_id = l[0]
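+                    # the seed is appended once more below, so it serves
+                    # both as the backbone and as ordinary read evidence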
+                    seqs.append(l[1])
+            elif l[0] == "+":
+                if len(seqs) > 10:
+                    yield (seqs, seed_id, config) 
+                #seqs_data.append( (seqs, seed_id) ) 
+                seqs = []
+                seed_id = None
+            elif l[0] == "-":
+                #yield (seqs, seed_id)
+                #seqs_data.append( (seqs, seed_id) )
+                break
+
+if __name__ == "__main__":
+    import argparse
+    import re
+    parser = argparse.ArgumentParser(description='a simple multi-processor consensus sequence generator')
+    parser.add_argument('--n_core', type=int, default=24,
+                        help='number of processes used for generating consensus')
+    parser.add_argument('--local_match_count_window', type=int, default=12,
+                        help='local match window size')
+    parser.add_argument('--local_match_count_threshold', type=int, default=6,
+                        help='local match count threshold')
+    parser.add_argument('--min_cov', type=int, default=6,
+                        help='minimum coverage to break the consensus')
+    parser.add_argument('--max_n_read', type=int, default=500,
+                        help='maximum number of reads used in generating the consensus')
+    parser.add_argument('--trim', action="store_true", default=False,
+                        help='trim the input sequences with k-mer sparse dynamic programming to find the mapped range')
+    parser.add_argument('--output_full', action="store_true", default=False,
+                        help='output uncorrected regions too')
+    parser.add_argument('--output_multi', action="store_true", default=False,
+                        help='output multiple corrected regions')
+    parser.add_argument('--min_idt', type=float, default=0.70,
+                        help='minimum identity of the alignments used for correction')
+    parser.add_argument('--edge_tolerance', type=int, default=1000,
+                        help='for trimming, ignore a read if its unaligned edge is longer than edge_tolerance')
+    parser.add_argument('--trim_size', type=int, default=50,
+                        help='the size for trimming both ends of the initial sparse aligned region')
+    good_region = re.compile("[ACGT]+")
+    args = parser.parse_args()
+    exe_pool = Pool(args.n_core)
+    if args.trim:
+        get_consensus = get_consensus_with_trim
+    else:
+        get_consensus = get_consensus_without_trim
+
+    K = 8
+    config = args.min_cov, K, args.local_match_count_window, args.local_match_count_threshold,\
+             args.max_n_read, args.min_idt, args.edge_tolerance, args.trim_size
+    for res in exe_pool.imap(get_consensus, get_seq_data(config)):  
+        cns, seed_id = res
+        if args.output_full:
+            if len(cns) > 500:
+                print ">"+seed_id+"_f"
+                print cns
+        else:
+            cns = good_region.findall(cns)
+            if len(cns) == 0:
+                continue
+            if args.output_multi:
+                seq_i = 0
+                for cns_seq in cns:
+                    if len(cns_seq) > 500:
+                        print ">"+seed_id+"_%d" % seq_i
+                        print cns_seq
+                    seq_i += 1
+            else:
+                cns.sort(key = lambda x: len(x))
+                if len(cns[-1]) > 500:
+                    print ">"+seed_id
+                    print cns[-1]
+
diff --git a/src/py_scripts/falcon_ucns_data.py b/src/py_scripts/falcon_ucns_data.py
new file mode 100644
index 0000000..feae510
--- /dev/null
+++ b/src/py_scripts/falcon_ucns_data.py
@@ -0,0 +1,120 @@
+#!/usr/bin/env python
+
+#################################################################################$$
+# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted (subject to the limitations in the
+# disclaimer below) provided that the following conditions are met:
+#
+#  * Redistributions of source code must retain the above copyright
+#  notice, this list of conditions and the following disclaimer.
+#
+#  * Redistributions in binary form must reproduce the above
+#  copyright notice, this list of conditions and the following
+#  disclaimer in the documentation and/or other materials provided
+#  with the distribution.
+#
+#  * Neither the name of Pacific Biosciences nor the names of its
+#  contributors may be used to endorse or promote products derived
+#  from this software without specific prior written permission.
+#
+# NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
+# GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
+# BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+#################################################################################$$
+
+import sys
+import os
+
+
+rcmap = dict(zip("ACGTacgtNn-", "TGCAtgcaNn-"))
+
+if __name__ == "__main__":
+    from pbcore.io import FastaReader
+    
+    tiling_path = {}
+    with open("all_tiling_path_c") as f:
+        for l in f:
+            l = l.strip().split()
+            tiling_path.setdefault( l[0], [])
+
+            offset = int(l[1])
+            node_id = l[2].split(":")
+            s = int(l[3])
+            e = int(l[4])
+
+            tiling_path[ l[0] ].append( (offset, node_id[0], node_id[1], s, e) )
+
+    f = FastaReader("preads.fa")
+    seq_db = {}
+    for r in f:
+        seq_db[r.name] = r.sequence
+
+    f = FastaReader("primary_tigs_c.fa")
+    p_tigs_db = {}
+    for r in f:
+        p_tigs_db[r.name] = r.sequence
+
+    for p_tig_id in p_tigs_db:
+        pread_data = {}
+        offsets = []
+        seqs = []
+        p_tig = p_tigs_db[p_tig_id]
+        if len(tiling_path[p_tig_id]) <= 5:
+            continue
+        print p_tig_id, 0, p_tig
+        for offset, s_id, end, s, e in tiling_path[p_tig_id]:
+            seq = seq_db[s_id]
+            if end == "B":
+                s, e = e, s
+                offset = offset - len(seq) 
+                seq = "".join([rcmap[c] for c in seq[::-1]])
+            else:
+                offset = offset - len(seq)
+            print s_id, offset, seq
+        
+        print "+ + +"
+
+    f = FastaReader("a_nodup.fa")
+    a_tigs_db = {}
+    for r in f:
+        a_tigs_db[r.name] = r.sequence
+
+    for a_tig_id in a_tigs_db:
+        pread_data = {}
+        offsets = []
+        seqs = []
+        a_tig = a_tigs_db[a_tig_id]
+        if len(tiling_path[a_tig_id]) <= 5:
+            continue
+        print a_tig_id, 0, a_tig
+        for offset, s_id, end, s, e in tiling_path[a_tig_id]:
+            seq = seq_db[s_id]
+            if end == "B":
+                s, e = e, s
+                offset = offset - len(seq) 
+                seq = "".join([rcmap[c] for c in seq[::-1]])
+            else:
+                offset = offset - len(seq)
+            print s_id, offset, seq
+        
+        print "+ + +"
+
+    print "- - -"
+
diff --git a/src/py_scripts/falcon_utgcns.py b/src/py_scripts/falcon_utgcns.py
new file mode 100644
index 0000000..57b8db5
--- /dev/null
+++ b/src/py_scripts/falcon_utgcns.py
@@ -0,0 +1,124 @@
+#!/usr/bin/env python
+
+#################################################################################$$
+# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted (subject to the limitations in the
+# disclaimer below) provided that the following conditions are met:
+#
+#  * Redistributions of source code must retain the above copyright
+#  notice, this list of conditions and the following disclaimer.
+#
+#  * Redistributions in binary form must reproduce the above
+#  copyright notice, this list of conditions and the following
+#  disclaimer in the documentation and/or other materials provided
+#  with the distribution.
+#
+#  * Neither the name of Pacific Biosciences nor the names of its
+#  contributors may be used to endorse or promote products derived
+#  from this software without specific prior written permission.
+#
+# NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
+# GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
+# BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+#################################################################################$$
+
+from ctypes import *
+import sys
+from multiprocessing import Pool
+import os
+import falcon_kit
+
+module_path = falcon_kit.__path__[0]
+
+falcon = CDLL(os.path.join(module_path, "falcon.so"))
+"""
+consensus_data * generate_utg_consensus( char ** input_seq, 
+                           seq_coor_t *offset,
+                           unsigned int n_seq, 
+                           unsigned min_cov, 
+                           unsigned K,
+                           double min_idt) {
+"""
+falcon.generate_utg_consensus.argtypes = [ POINTER(c_char_p), POINTER(falcon_kit.seq_coor_t), c_uint, c_uint, c_uint, c_double ]
+falcon.generate_utg_consensus.restype = POINTER(falcon_kit.ConsensusData)
+falcon.free_consensus_data.argtypes = [ POINTER(falcon_kit.ConsensusData) ]
+
+rcmap = dict(zip("ACGTacgtNn-", "TGCAtgcaNn-"))
+
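+# get_consensus marshals a tig's sequences and tiling offsets into ctypes
+# arrays and calls the C unitig-consensus routine.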
+def get_consensus(c_input):
+    t_id, seqs, offsets, config = c_input 
+    K = config[0]
+    seqs_ptr = (c_char_p * len(seqs))()
+    seqs_ptr[:] = seqs
+    offset_ptr = (c_long * len(seqs))( *offsets )
+    consensus_data_ptr = falcon.generate_utg_consensus( seqs_ptr, offset_ptr, len(seqs), 0, K, 0.)
+    consensus = string_at(consensus_data_ptr[0].sequence)[:]
+    del seqs_ptr
+    del offset_ptr
+    falcon.free_consensus_data( consensus_data_ptr )
+    return consensus, t_id
+
+def echo(c_input):
+
+    t_id, seqs, offsets, config = c_input 
+
+    return len(seqs), "test"
+
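+# get_seq_data parses the stream written by falcon_ucns_data.py: each line
+# is "<node_id> <offset> <sequence>", a "+ + +" line closes a tig, and a
+# "- - -" line ends the stream.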
+def get_seq_data(config):
+    seqs = []
+    offsets = []
+    seed_id = None
+    with sys.stdin as f:
+        for l in f:
+            l = l.strip().split()
+            if len(l) != 3:
+                continue
+            if l[0] not in ("+", "-"):
+                if len(seqs) == 0:
+                    seqs.append(l[2]) #the "seed"
+                    offsets.append( int(l[1]) )
+                    seed_id = l[0]
+                else:
+                    seqs.append(l[2])
+                    offsets.append( int(l[1]) )
+            elif l[0] == "+":
+                yield (seed_id, seqs, offsets, config) 
+                seqs = []
+                offsets = []
+                seed_id = None
+            elif l[0] == "-":
+                break
+
+if __name__ == "__main__":
+    import argparse
+    parser = argparse.ArgumentParser(description='a simple multi-processor consensus sequence generator')
+    parser.add_argument('--n_core', type=int, default=4,
+                        help='number of processes used for generating consensus')
+    args = parser.parse_args()
+    exe_pool = Pool(args.n_core)
+    K = 8
+    config = (K, )
+    for res in exe_pool.imap(get_consensus, get_seq_data(config)):  
+    #for res in exe_pool.imap(echo, get_seq_data(config)):  
+    #for res in map(echo, get_seq_data(config)):  
+    #for res in map(get_consensus, get_seq_data(config)):  
+        cns, t_id = res
+        print ">"+t_id+"|tigcns"
+        print cns
+
diff --git a/src/py_scripts/get_rdata.py b/src/py_scripts/get_rdata.py
new file mode 100755
index 0000000..14704a4
--- /dev/null
+++ b/src/py_scripts/get_rdata.py
@@ -0,0 +1,207 @@
+#!/usr/bin/env python
+
+#################################################################################$$
+# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted (subject to the limitations in the
+# disclaimer below) provided that the following conditions are met:
+#
+#  * Redistributions of source code must retain the above copyright
+#  notice, this list of conditions and the following disclaimer.
+#
+#  * Redistributions in binary form must reproduce the above
+#  copyright notice, this list of conditions and the following
+#  disclaimer in the documentation and/or other materials provided
+#  with the distribution.
+#
+#  * Neither the name of Pacific Biosciences nor the names of its
+#  contributors may be used to endorse or promote products derived
+#  from this software without specific prior written permission.
+#
+# NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
+# GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
+# BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+#################################################################################$$
+
+import sys
+import glob
+#import pkg_resources
+import uuid
+from datetime import datetime
+
+from collections import Counter
+from multiprocessing import Pool
+#from pbtools.pbdagcon.q_sense import *
+import os
+
+"""
+try:
+    __p4revision__ = "$Revision: #4 $"
+    __p4change__ = "$Change: 121571 $"
+    revNum = int(__p4revision__.strip("$").split(" ")[1].strip("#"))
+    changeNum = int(__p4change__.strip("$").split(":")[-1])
+    __version__ = "%s-r%d-c%d" % ( pkg_resources.require("pbtools.pbhgap")[0].version, revNum, changeNum )
+except:
+    __version__ = "pbtools.hbar-dtk-github"
+"""
+
+query_fasta_fn = sys.argv[1]
+target_fasta_fn = sys.argv[2]
+m4_fofn = sys.argv[3]
+bestn = int(sys.argv[4])
+group_id = int(sys.argv[5])
+num_chunk = int(sys.argv[6])
+min_cov = int(sys.argv[7])
+max_cov = int(sys.argv[8])
+trim_align = int(sys.argv[9])
+trim_plr = int(sys.argv[10])
+
+
+rmap = dict(zip("ACGTNacgt-","TGCANntgca-"))
+def rc(seq):
+    return "".join([rmap[c] for c in seq[::-1]])
+
+"""0x239fb832/0_590 0x722a1e26 -1843 81.6327 0 62 590 590 0 6417 6974 9822 254 11407 -74.5375 -67.9 1"""
+query_to_target = {}
+with open(m4_fofn) as fofn:
+    for fn in fofn:
+        fn = fn.strip()
+        with open(fn) as m4_f:
+            for l in m4_f:
+                d = l.strip().split()
+                id1, id2 = d[:2]
+                #if -noSplitSubread not used, we will need the following line    
+                #id1 = id1.split("/")[0]
+                if id1 == id2:
+                    continue
+                if hash(id2) % num_chunk != group_id:
+                    continue
+                if int(d[2]) > -1000: continue
+                if int(d[11]) < 4000: continue
+                query_to_target.setdefault(id1, [])
+                query_to_target[id1].append( (int(d[2]), l) )
+
+target_to_query = {}
+for id1 in query_to_target:
+    query_to_target[id1].sort()
+    rank = 0
+    for s, ll in query_to_target[id1][:bestn]:
+        l = ll.strip()
+        d = l.split()
+        id1, id2 = d[:2]
+        target_to_query.setdefault(id2,[])
+        target_to_query[id2].append( ( (int(d[5])-int(d[6]), int(d[2])), l ) )
+        #target_to_query[id2].append( ( int(d[2]), l ) )
+        #rank += 1
+
+from pbcore.io import FastaIO
+query_data = {}
+with open(query_fasta_fn) as fofn:
+    for fa_fn in fofn:
+        fa_fn = fa_fn.strip()
+        f_s = FastaIO.FastaReader(fa_fn)
+        for s in f_s:
+            id1 = s.name
+            if id1 not in query_to_target:
+                continue
+            query_data[id1]=s.sequence
+        f_s.file.close()
+
+target_data = {}
+with open(target_fasta_fn) as fofn:
+    for fa_fn in fofn:
+        fa_fn = fa_fn.strip()
+        f_s = FastaIO.FastaReader(fa_fn)
+        for s in f_s:
+            id2 = s.name
+            if hash(id2) % num_chunk != group_id:
+                continue
+            target_data[id2]=s.sequence
+        f_s.file.close()
+
+
+ec_data = []
+base_count = Counter()
+r_count = 0
+
+for id2 in target_to_query:
+    if len(target_to_query[id2])<10:
+        continue
+    if id2 not in target_data:
+        continue
+
+    ref_data = (id2, target_data[id2]) 
+    ref_len = len(target_data[id2])
+    base_count.clear()
+    base_count.update( target_data[id2] )
+    if 1.0*base_count.most_common(1)[0][1]/ref_len > 0.8:  # skip pre-assembly when >80% of a read is the same base
+        continue
+    read_data = []
+    
+    query_alignment = target_to_query[id2]
+    query_alignment.sort() # longest aligned query span first, then best score
+    total_bases = 0
+    max_cov_bases = max_cov * ref_len * 1.2
+    #min_cov_bases = min_cov * ref_len * 3
+    
+    for rank_score, l in query_alignment:
+        rank, score = rank_score
+        #score = rank_score
+        l = l.split()
+        id1 = l[0]
+        #if -noSplitSubread not used, we will need the following line    
+        #id1 = id1.split("/")[0]
+        q_s = int(l[5]) + trim_align
+        q_e = int(l[6]) - trim_align
+        strand = int(l[8])
+        t_s = int(l[9])
+        t_e = int(l[10])
+        t_l = int(l[11])
+        #if strand == 1:
+        #    t_s, t_e = t_l - t_e, t_l - t_s
+        #    t_s += trim_align
+        #    t_e -= trim_align
+
+        if q_e - q_s < 400:
+            continue
+        total_bases += q_e - q_s
+        if total_bases > max_cov_bases:
+            break
+        q_seq = query_data[id1][q_s:q_e]
+        read_data.append( ( "%s/0/%d_%d" % (id1, q_s, q_e), q_s, q_e, q_seq, strand, t_s, t_e) )
+
+    if len(read_data) > 5:
+        r_count += 1
+        t_id, t_seq = ref_data 
+        t_len = len(t_seq)
+        print t_id, t_seq
+        for r in read_data:
+            q_id, q_s, q_e, q_seq, strand, t_s, t_e = r
+            if strand == 1:
+                q_seq = rc(q_seq)
+            print q_id, q_seq
+        #if r_count > 600:
+        #    break
+        print "+ +"
+print "- -"
+
+#output_dir,dumb = os.path.split( os.path.abspath( output_file ) )
+#output_log = open ( os.path.join( output_dir, "j%02d.log" % group_id ), "w" )
+
+
+
+
diff --git a/src/py_scripts/overlapper.py b/src/py_scripts/overlapper.py
new file mode 100644
index 0000000..30f8fa8
--- /dev/null
+++ b/src/py_scripts/overlapper.py
@@ -0,0 +1,216 @@
+from falcon_kit import kup, falcon, DWA, get_consensus, get_alignment
+from pbcore.io import FastaReader
+import numpy as np
+import collections
+import sys
+
+seqs = []
+q_seqs = {}
+f = FastaReader(sys.argv[1]) # take one command line argument: the input fasta file name
+
+for r in f:
+    if len(r.sequence) < 6000:
+        continue
+    seq = r.sequence.upper()
+    seqs.append( (r.name, seq[:500], seq[-500:] ) )
+    q_seqs[r.name] = seq
+
+
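+# index only the first and last 500 bases of each read: dovetail overlaps
+# are found from the read ends, and this keeps the k-mer index small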
+total_index_base = len(seqs) * 1000
+print total_index_base
+sa_ptr = kup.allocate_seq( total_index_base )
+sda_ptr = kup.allocate_seq_addr( total_index_base )
+K=14
+lk_ptr = kup.allocate_kmer_lookup( 1 << (K * 2) )
+
+start = 0
+for r_name, prefix, suffix in seqs:
+    kup.add_sequence( start, K, prefix, 500, sda_ptr, sa_ptr, lk_ptr)
+    start += 500
+    kup.add_sequence( start, K, suffix, 500, sda_ptr, sa_ptr, lk_ptr)
+    start += 500
+#kup.mask_k_mer(1 << (K * 2), lk_ptr, 256)
+
+kup.mask_k_mer(1 << (K * 2), lk_ptr, 64)
+
+def get_alignment(seq1, seq0):
+
+    K = 8
+    lk_ptr = kup.allocate_kmer_lookup( 1 << (K * 2) )
+    sa_ptr = kup.allocate_seq( len(seq0) )
+    sda_ptr = kup.allocate_seq_addr( len(seq0) )
+    kup.add_sequence( 0, K, seq0, len(seq0), sda_ptr, sa_ptr, lk_ptr)
+
+    kmer_match_ptr = kup.find_kmer_pos_for_seq(seq1, len(seq1), K, sda_ptr, lk_ptr)
+    kmer_match = kmer_match_ptr[0]
+    aln_range = kup.find_best_aln_range(kmer_match_ptr, K, K*5, 50)
+    #x,y = zip( * [ (kmer_match.query_pos[i], kmer_match.target_pos[i]) for i in range(kmer_match.count )] )
+    kup.free_kmer_match(kmer_match_ptr)
+    s1, e1, s2, e2 = aln_range.s1, aln_range.e1, aln_range.s2, aln_range.e2
+    if e1 - s1 > 500:
+        #s1 = 0 if s1 < 14 else s1 - 14
+        #s2 = 0 if s2 < 14 else s2 - 14
+        e1 = len(seq1) if e1 >= len(seq1)-2*K else e1 + K*2
+        e2 = len(seq0) if e2 >= len(seq0)-2*K else e2 + K*2
+        
+        alignment = DWA.align(seq1[s1:e1], e1-s1,
+                              seq0[s2:e2], e2-s2,
+                              100, 0)
+        #print seq1[s1:e1]
+        #print seq0[s2:e2]
+        #if alignment[0].aln_str_size > 500:
+
+        #aln_str1 = alignment[0].q_aln_str
+        #aln_str0 = alignment[0].t_aln_str
+        aln_size = alignment[0].aln_str_size
+        aln_dist = alignment[0].dist
+        aln_q_s = alignment[0].aln_q_s
+        aln_q_e = alignment[0].aln_q_e
+        aln_t_s = alignment[0].aln_t_s
+        aln_t_e = alignment[0].aln_t_e
+        assert aln_q_e - aln_q_s <= alignment[0].aln_str_size or aln_t_e - aln_t_s <= alignment[0].aln_str_size
+        #print aln_str1
+        #print aln_str0
+    
+        DWA.free_alignment(alignment)
+
+    kup.free_seq_addr_array(sda_ptr)
+    kup.free_seq_array(sa_ptr)
+    kup.free_kmer_lookup(lk_ptr)
+    if e1 - s1 > 500 and aln_size > 500:
+        return s1, s1+aln_q_e-aln_q_s, s2, s2+aln_t_e-aln_t_s, aln_size, aln_dist
+    else:
+        return None
+
+
+def get_ovelap_alignment(seq1, seq0):
+
+    K = 8
+    lk_ptr = kup.allocate_kmer_lookup( 1 << (K * 2) )
+    sa_ptr = kup.allocate_seq( len(seq0) )
+    sda_ptr = kup.allocate_seq_addr( len(seq0) )
+    kup.add_sequence( 0, K, seq0, len(seq0), sda_ptr, sa_ptr, lk_ptr)
+
+    kmer_match_ptr = kup.find_kmer_pos_for_seq(seq1, len(seq1), K, sda_ptr, lk_ptr)
+    kmer_match = kmer_match_ptr[0]
+    aln_range = kup.find_best_aln_range(kmer_match_ptr, K, K*5, 50)
+    #x,y = zip( * [ (kmer_match.query_pos[i], kmer_match.target_pos[i]) for i in range(kmer_match.count )] )
+    kup.free_kmer_match(kmer_match_ptr)
+    s1, e1, s0, e0 = aln_range.s1, aln_range.e1, aln_range.s2, aln_range.e2  
+    len_1 = len(seq1)
+    len_0 = len(seq0)
+    do_aln = False
+    contain_status = "none" 
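+    # classify the candidate: if the sparse-match range reaches both ends of
+    # one read, that read is contained (or contains the other); otherwise
+    # extend the range to the read ends and verify the dovetail overlap with
+    # a banded alignment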
+    if e1 - s1 > 500:
+        if s1 < 100 and len_1 - e1 < 100:
+            do_aln = False
+            contain_status = "contains"
+        elif s0 < 100 and len_0 - e0 < 100:
+            do_aln = False
+            contain_status = "contained"
+        else:
+            do_aln = True
+            if s0 < s1:
+                s1 -= s0 #assert s1 > 0
+                s0 = 0
+                e1 = len_1
+                e0 = len_1 - s1 if len_1 - s1 < len_0 else len_0
+                if e0 == len_0:
+                    do_aln = False
+                    contain_status = "contained"
+                
+            if s1 <= s0:
+                s0 -= s1 #assert s1 > 0
+                s1 = 0
+                e0 = len_0
+                e1 = len_0 - s0 if len_0 - s0 < len_1 else len_1
+                if e1 == len_1:
+                    do_aln = False
+                    contain_status = "contains"
+
+
+        if do_aln:
+            alignment = DWA.align(seq1[s1:e1], e1-s1,
+                                  seq0[s0:e0], e0-s0,
+                                  500, 0)
+            #print seq1[s1:e1]
+            #print seq0[s2:e2]
+            #if alignment[0].aln_str_size > 500:
+    
+            #aln_str1 = alignment[0].q_aln_str
+            #aln_str0 = alignment[0].t_aln_str
+            aln_size = alignment[0].aln_str_size
+            aln_dist = alignment[0].dist
+            aln_q_s = alignment[0].aln_q_s
+            aln_q_e = alignment[0].aln_q_e
+            aln_t_s = alignment[0].aln_t_s
+            aln_t_e = alignment[0].aln_t_e
+            assert aln_q_e - aln_q_s <= alignment[0].aln_str_size or aln_t_e - aln_t_s <= alignment[0].aln_str_size
+            #print aln_str1
+            #print aln_str0
+            if aln_size > 500: 
+                contain_status = "overlap"            
+            DWA.free_alignment(alignment)
+        
+    kup.free_seq_addr_array(sda_ptr)
+    kup.free_seq_array(sa_ptr)
+    kup.free_kmer_lookup(lk_ptr)
+
+    if e1 - s1 > 500 and do_aln and aln_size > 500:
+        #return s1, s1+aln_q_e-aln_q_s, s2, s2+aln_t_e-aln_t_s, aln_size, aln_dist, x, y
+        return s1, s1+aln_q_e-aln_q_s, s0, s0+aln_t_e-aln_t_s, aln_size, aln_dist, contain_status
+    else:
+        return 0, 0, 0, 0, 0, 0, contain_status 
+
+rc_map = dict( zip("ACGTacgtNn-", "TGCAtgcaNn-") )
+with open("test_ovlp.dat","w") as f:
+    for name, q_seq in q_seqs.items():
+        kmer_match_ptr = kup.find_kmer_pos_for_seq(q_seq, len(q_seq), K, sda_ptr, lk_ptr)
+        kmer_match = kmer_match_ptr[0]
+        count = kmer_match.count
+        hit_index = np.array(kmer_match.target_pos[0:count])/500
+        kup.free_kmer_match(kmer_match_ptr)
+        
+        c = collections.Counter(hit_index)
+        s = [chunk for chunk, count in c.items() if count > 50]
+        #s.sort()
+        targets = set()
+        for p in s:
+            hit_id = seqs[p/2][0]
+            if hit_id in targets or hit_id == name:
+                continue
+            targets.add(hit_id)
+            seq1, seq0 = q_seq, q_seqs[hit_id ]
+            rtn = get_ovelap_alignment(seq1, seq0)
+            #rtn = get_alignment(seq1, seq0)
+            if rtn != None:
+                
+                s1, e1, s2, e2, aln_size, aln_dist, c_status = rtn
+                #print >>f, name, 0, s1, e1, len(seq1), hit_id, 0, s2, e2, len(seq0),  aln_size, aln_dist
+                print >>f, hit_id, name, aln_dist - aln_size, "%0.2f" % (100 - 100.0*aln_dist/(aln_size+1)), 0, s2, e2, len(seq0), 0, s1, e1, len(seq1), c_status
+                
+        r_q_seq = "".join([rc_map[c] for c in q_seq[::-1]])
+        
+        kmer_match_ptr = kup.find_kmer_pos_for_seq(r_q_seq, len(r_q_seq), K, sda_ptr, lk_ptr)
+        kmer_match = kmer_match_ptr[0]
+        count = kmer_match.count
+        hit_index = np.array(kmer_match.target_pos[0:count])/500
+        kup.free_kmer_match(kmer_match_ptr)
+        
+        c = collections.Counter(hit_index)
+        s = [chunk for chunk, count in c.items() if count > 50]
+        #s.sort()
+        targets = set()
+        for p in s:
+            hit_id = seqs[p/2][0]
+            if hit_id in targets or hit_id == name:
+                continue
+            targets.add(hit_id)
+            seq1, seq0 = r_q_seq, q_seqs[hit_id]
+            rtn = get_ovelap_alignment(seq1, seq0)
+            #rtn = get_alignment(seq1, seq0)
+            if rtn != None:
+                s1, e1, s2, e2, aln_size, aln_dist, c_status = rtn
+                #print >>f, name, 1, s1, e1, len(seq1), hit_id, 0, s2, e2, len(seq0),  aln_size, aln_dist
+                print >>f, hit_id, name, aln_dist - aln_size, "%0.2f" % (100 - 100.0*aln_dist/(aln_size+1)), 0, s2, e2, len(seq0), 1, len(seq1) - e1, len(seq1)- s1, len(seq1), c_status
+
diff --git a/src/py_scripts/remove_dup_ctg.py b/src/py_scripts/remove_dup_ctg.py
new file mode 100755
index 0000000..3164eb6
--- /dev/null
+++ b/src/py_scripts/remove_dup_ctg.py
@@ -0,0 +1,75 @@
+#!/usr/bin/env python
+
+#################################################################################$$
+# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted (subject to the limitations in the
+# disclaimer below) provided that the following conditions are met:
+#
+#  * Redistributions of source code must retain the above copyright
+#  notice, this list of conditions and the following disclaimer.
+#
+#  * Redistributions in binary form must reproduce the above
+#  copyright notice, this list of conditions and the following
+#  disclaimer in the documentation and/or other materials provided
+#  with the distribution.
+#
+#  * Neither the name of Pacific Biosciences nor the names of its
+#  contributors may be used to endorse or promote products derived
+#  from this software without specific prior written permission.
+#
+# NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
+# GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
+# BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+#################################################################################$$
+
+import pbcore.io
+
+import sys
+"""nucmer -maxmatch all_tigs.fa all_tigs.fa -p all_tigs_self >& /dev/null"""
+"""show-coords -o -H -T all_tigs_self.delta | grep CONTAINS | awk '$7>96' | awk '{print $9}' | sort -u > all_tigs_duplicated_ids"""
+
+id_to_remove = set()
+with open("all_tigs_duplicated_ids") as f:
+    for l in f:
+        l = l.strip().split("-")
+        major, minor = l[:2]
+        id_to_remove.add((major, minor))
+
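+# records in all_tigs.fa are named <major>-<minor>; the <minor> == "0000"
+# record is the primary path, so only associated contigs are kept here, and
+# a primary contig is dropped when its <major>-0000 record was marked duplicated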
+f = pbcore.io.FastaReader("all_tigs.fa")
+with open("a-tigs_nodup.fa", "w") as f_out:
+    for r in f:
+        major, minor = r.name.split()[0].split("-")[:2]
+        if minor == "0000":
+            continue
+        if (major, minor) in id_to_remove:
+            continue
+        if len(r.sequence) < 500:
+            continue
+        print >>f_out, ">"+r.name
+        print >>f_out, r.sequence
+
+f = pbcore.io.FastaReader("primary_tigs_c.fa")
+with open("p-tigs_nodup.fa", "w") as f_out:
+    for r in f:
+        major, minor = r.name.split()[0].split("_")[:2]
+        if (major, "0000") in id_to_remove:
+            continue
+        if len(r.sequence) < 500:
+            continue
+        print >>f_out, ">"+r.name
+        print >>f_out, r.sequence
diff --git a/src/utils/fetch_preads.py b/src/utils/fetch_preads.py
new file mode 100644
index 0000000..c5ba7d2
--- /dev/null
+++ b/src/utils/fetch_preads.py
@@ -0,0 +1,70 @@
+from pbcore.io import FastaReader
+import networkx as nx
+import sys
+
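+# unit_edges.dat has one unitig edge per line: source node, sink node, the
+# "-"-separated node path through the string graph, and the edge sequence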
+u_graph = nx.DiGraph()
+u_edges = {}
+with open("./unit_edges.dat") as f:
+    for l in f:
+        v, w, path, seq = l.strip().split()
+        u_edges.setdefault( (v, w), [] )
+        u_edges[ (v, w) ].append( (path, seq) )
+        u_graph.add_edge(v, w)
+        
+u_graph_r = u_graph.reverse()
+
+
+p_tig_path = {}
+a_tig_path = {}
+with open("primary_tigs_paths_c") as f:
+    for l in f:
+        l = l.strip().split()
+        id_ = l[0][1:]
+        path = l[1:]
+        p_tig_path[id_] = path
+
+with open("all_tigs_paths") as f:
+    for l in f:
+        l = l.strip().split()
+        id_ = l[0][1:]
+        path = l[1:]
+        a_tig_path[id_] = path
+
+p_ugraph = nx.DiGraph()
+p_sgraph = nx.DiGraph()
+p_tig_id = sys.argv[1]
+
+main_path = p_tig_path["%s_00" % p_tig_id]
+all_nodes = set(main_path[:])
+main_path_nodes = set(main_path[:])
+p_ugraph.add_path(main_path)
+for id_ in a_tig_path:
+    if id_[:4] == p_tig_id:
+        a_path = a_tig_path[id_]
+        if a_path[0] in main_path_nodes and a_path[-1] in main_path_nodes:
+            p_ugraph.add_path(a_path)
+            for pp in a_path:
+                all_nodes.add(pp)
+        
+for v, w in u_edges:
+    if v in all_nodes and w in all_nodes:
+        for p, s in u_edges[(v,w)]:
+            p = p.split("-")
+            p_sgraph.add_path(p)
+            #print p
+            for pp in p:
+                all_nodes.add(pp)
+
+nx.write_gexf(p_ugraph, "p_ugraph.gexf")
+nx.write_gexf(p_sgraph, "p_sgraph.gexf")
+
+
+preads = FastaReader(sys.argv[2])
+
+all_nodes_ids = set( [s.split(":")[0] for s in list(all_nodes)] )
+with open("p_sgraph_nodes.fa","w") as f:
+    for r in preads:
+        if r.name in all_nodes_ids:
+            print >>f, ">"+r.name
+            print >>f, r.sequence
diff --git a/test_data/t1.fa b/test_data/t1.fa
new file mode 100755
index 0000000..3a20a43
--- /dev/null
+++ b/test_data/t1.fa
@@ -0,0 +1,2 @@
+>30a5633d_129405_0
+AAAAGAGAGAGATCGCCCAATTTGGATTACAGTTAGGCACGCCGCTTGTTTTTTTTTTTATTTGCTTTTCGCAGAAAGGTTCTTTCCTTTAATCAGCGCCTCTTTGATTAATGGCGTCTCCGGCAATTGACAGGATTTGTTGTTTTGCAGTAAAAGGAGAAAAAAAATGAGTATGCCACGAATAACTAGAAATAGGGCTAAAAATGTTGCCAAGATCTTTGTGGCTCGGCCAGAGACAAGCGAGCAATGAGACAAAATTGGTCGCCAGATTTTTCTCTTTCTTTTGGATTTTTTTTTTTCTTATTTTCCAATGCCGTCTGCGGCATTCAAATATGCAACAGCAAAGGGCGCGGAAAAAGCAAGGAAAAATGGTGAAAATGGGGTTGGGTGAGAGATGCCTGGGCATGCCAAAGTAGCTGCCAATTTATTTTGGGCATTTTGCTTGGCTGATAGTTGGCCATCTTTATACTCTTCCCAAAAGTGTGAAAGAAT [...]
diff --git a/test_data/t1.fofn b/test_data/t1.fofn
new file mode 100755
index 0000000..1b88fb0
--- /dev/null
+++ b/test_data/t1.fofn
@@ -0,0 +1 @@
+./t1.fa
diff --git a/test_data/t2.fa b/test_data/t2.fa
new file mode 100755
index 0000000..d8fc441
--- /dev/null
+++ b/test_data/t2.fa
@@ -0,0 +1,2 @@
+>5d64830a_48915_0
+AGTAGAGATCATCTAAACTTTGGTGGTATTTGGCTAACTTGCTTATGTACACATATTAATTTAATTATACGAGTAAACTATTTCCATATTAGCGTATAGCAGCTACGCATAGTTTATAGAACAATAAAAATGAAATATTTTCGGCGACTTTGAACAAATGACGCTTTAGGGGCCTAACGGAGTATTTTTATGTGATAGACGATTTTTTGGCGGGCCAAAAAAAATAAAAGGGAAATTGGTGCTGCGCATAAAATTGAAAGCAGGCTTGCCCTCCAACCCCGCGTCTGCCCTCCCCCCCCCCCCCGCAGATCAAGAGATTATGCTATCCCGCAATAATTCGCGCCTTGCCCGCTTAACTACGTTGGCCATGCGTCGGGGGCGGGCGTCTATGCAATGGTTCAATTGGGCGTTGACTGGCCGCTGGCTAGTGTAAGCCCAGTTTTGCGGCTTATTGCCGCTACTCGGCTCGGGCAATCACATCGAGGTCATTAA [...]
diff --git a/test_data/t2.fofn b/test_data/t2.fofn
new file mode 100755
index 0000000..de317c5
--- /dev/null
+++ b/test_data/t2.fofn
@@ -0,0 +1 @@
+./t2.fa

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/falconkit.git


