[med-svn] [Git][med-team/flye][master] 5 commits: New upstream version 2.9.1+dfsg
Andreas Tille (@tille)
gitlab at salsa.debian.org
Tue Aug 23 15:48:03 BST 2022
Andreas Tille pushed to branch master at Debian Med / flye
Commits:
3b3ff2d2 by Andreas Tille at 2022-08-23T16:40:24+02:00
New upstream version 2.9.1+dfsg
- - - - -
2a8f6be8 by Andreas Tille at 2022-08-23T16:40:24+02:00
routine-update: New upstream version
- - - - -
ec4dc15c by Andreas Tille at 2022-08-23T16:40:39+02:00
Update upstream source from tag 'upstream/2.9.1+dfsg'
Update to upstream version '2.9.1+dfsg'
with Debian dir 26676e4b1d95f2a58d36fda2b7809d851e082f69
- - - - -
05cd4019 by Andreas Tille at 2022-08-23T16:44:53+02:00
Refresh patches
- - - - -
2fbaaf43 by Andreas Tille at 2022-08-23T16:46:56+02:00
Upload to unstable
- - - - -
29 changed files:
- Makefile
- README.md
- bin/flye
- debian/changelog
- − debian/patches/fix_makefile.patch
- debian/patches/series
- debian/patches/use_debian_packaged_libs.patch
- docs/FAQ.md
- docs/NEWS.md
- docs/USAGE.md
- flye/__build__.py
- flye/__version__.py
- flye/config/bin_cfg/asm_defaults.cfg
- flye/main.py
- flye/polishing/alignment.py
- flye/polishing/bubbles.py
- flye/polishing/consensus.py
- flye/polishing/polish.py
- flye/trestle/divergence.py
- flye/trestle/trestle.py
- flye/utils/sam_parser.py
- setup.py
- src/Makefile
- src/repeat_graph/haplotype_resolver.cpp
- src/repeat_graph/main_repeat.cpp
- src/repeat_graph/multiplicity_inferer.cpp
- src/repeat_graph/repeat_graph.cpp
- src/sequence/alignment.cpp
- src/sequence/sequence_container.cpp
Changes:
=====================================
Makefile
=====================================
@@ -39,5 +39,5 @@ clean:
make clean -C src
make clean -C ${MINIMAP2_DIR}
make clean-all -C ${SAMTOOLS_DIR}
- rm ${BIN_DIR}/flye-minimap2
- rm ${BIN_DIR}/flye-samtools
+ rm -f ${BIN_DIR}/flye-minimap2
+ rm -f ${BIN_DIR}/flye-samtools
=====================================
README.md
=====================================
@@ -3,15 +3,19 @@ Flye assembler
[![BioConda Install](https://img.shields.io/conda/dn/bioconda/flye.svg?style=flag&label=BioConda%20install)](https://anaconda.org/bioconda/flye)
-### Version: 2.9
+### Version: 2.9.1
-Flye is a de novo assembler for single molecule sequencing reads,
+Flye is a de novo assembler for single-molecule sequencing reads,
such as those produced by PacBio and Oxford Nanopore Technologies.
It is designed for a wide range of datasets, from small bacterial projects
to large mammalian-scale assemblies. The package represents a complete
pipeline: it takes raw PacBio / ONT reads as input and outputs polished contigs.
Flye also has a special mode for metagenome assembly.
+Currently, Flye will produce collapsed assemblies of diploid genomes,
+represented by a single mosaic haplotype. To recover two phased haplotypes
+consider applying [HapDup](https://github.com/fenderglass/hapdup) after the assembly.
+
Manuals
-------
@@ -22,12 +26,19 @@ Manuals
Latest updates
--------------
-### Flye 2.9 release (20 Aug 2022)
-* Better assembly of very short sequences (e.g. plasmids or viruses). They vere often missed in previous versions.
-* New --nano-hq mode for ONT Guppy5+ and Q20 reads (3-5% error rate)
+### Flye 2.9.1 release (07 Aug 2022)
+* New option --no-alt-contigs to remove all non-primary contigs from the assembly
+* Fixed crash on MacOS with Python 3.8+
+* Fixed rare artificially introduced mismatches while polishing
+* Fixed slow simplification of very tangled graphs
+* Various other small fixes
+
+### Flye 2.9 release (20 Aug 2021)
+* Better assembly of very short sequences (e.g. plasmids or viruses). They were often missed in previous versions.
+* New --nano-hq mode for ONT Guppy5+ (SUP mode) and Q20 reads (3-5% error rate)
* Optimized default parameters for HiFi (HPC error threshold 0.01 -> 0.001; increased min overlap)
* Polishing improvements: reduced number of possible clusters of errors
-* Improvements in repeat delection algorithm to further limit a chance of (otherwise infrequent) misassemblies
+* Improvements in repeat detection algorithm to further limit a chance of (otherwise infrequent) misassemblies
* Scaffolding is no longer performed by default (could be enabled with --scaffold)
* Bam file input for the standalone polisher (same interface as for FASTA/Q)
* Automatically selected minimum overlap up to 10k (was 5k)
@@ -35,13 +46,13 @@ Latest updates
* --trestle and --subassemblies modes are now deprecated, and will be removed in the future versions
* New --extra-params option to modify config-level parameters
* Contig paths output in Gfa + number of reads supporting each link (RC tag)
-* Update to minimap 2.19
+* Update to minimap 2.18
* Several rare bug fixes/other improvements
### Flye 2.8.3 release (10 Feb 2021)
* Reduced RAM consumption for some ultra-long ONT datasets
-* Fixed rare artifical sequence insertions on some ONT datasets
-* Asseemblies should be largely identical to 2.8
+* Fixed rare artificial sequence insertions on some ONT datasets
+* Assemblies should be largely identical to 2.8
### Flye 2.8.2 release (12 Dec 2020)
* Improvements in GFA output, much faster generation of large and tangled graphs
@@ -55,24 +66,24 @@ Latest updates
### Flye 2.8 release (04 Aug 2020)
* Improvements in contiguity and speed for PacBio HiFi mode
* Using the `--meta` k-mer selection strategy in isolate assemblies as well.
-This strategy is more robust to drops in coverage/contamination and reqires less memory
+This strategy is more robust to drops in coverage/contamination and requires less memory
* 1.5-2x RAM footprint reduction for large assemblies (e.g. human ONT assembly now uses 400-500 Gb)
* Genome size parameter is no longer required (it is still needed for downsampling though `--asm-coverage`)
-* Flye now can occasionally use overlaps shorter than "minOverlap" parameter to close disjointig gaps
+* Flye now can occasionally use overlaps shorter than "minOverlap" parameter to close disjointing gaps
* Various improvements and bugfixes
Repeat graph
------------
-Flye is using repeat graph as a core data structure.
-In difference to de Bruijn graphs (which require exact k-mer matches),
+Flye is using a repeat graph as the core data structure.
+Compared to de Bruijn graphs (which require exact k-mer matches),
repeat graphs are built using approximate sequence matches, and
-can tolerate higher noise of SMS reads.
+can tolerate the higher noise of SMS reads.
-The edges of repeat graph represent genomic sequence, and nodes define
-the junctions. Each edges is classified into unique or repetitive.
-The genome traverses the graph (in an unknown way), so as each unique
+The edges of a repeat graph represent the genomic sequence, and nodes define
+the junctions. Each edge is classified as unique or repetitive.
+The genome traverses the graph (in an unknown way), so each unique
edge appears exactly once in this traversal. Repeat graphs reveal the
repeat structure of the genome, which helps to reconstruct an optimal assembly.
@@ -82,7 +93,7 @@ repeat structure of the genome, which helps to reconstruct an optimal assembly.
</p>
Above is an example of the repeat graph of a bacterial assembly.
-Each edge is labeled with its id, length and coverage. Repetitive edges are shown
+Each edge is labeled with its id, length, and coverage. Repetitive edges are shown
in color, and unique edges are black. Note that each edge is represented in
two copies: forward and reverse complement (marked with +/- signs),
therefore the entire genome is represented in two copies. This is necessary
@@ -138,7 +149,7 @@ All datasets were run with default parameters for the corresponding read type
with the following exceptions: CHM13 T2T, CHM1 and HG002 were run with `--asm-coverage 50`.
Note that this version of the table reflects contigs NG50, while the previous versions
-were refering to scaffold NG50.
+were referring to scaffold NG50.
Third-party
@@ -150,7 +161,7 @@ Flye package includes some third-party software:
* [intervaltree](https://github.com/ekg/intervaltree)
* [lemon](http://lemon.cs.elte.hu/trac/lemon)
* [minimap2](https://github.com/lh3/minimap2)
-* [samtools](https://https://github.com/samtools/samtools)
+* [samtools](https://github.com/samtools/samtools)
License
@@ -196,10 +207,10 @@ How to get help
A preferred way report any problems or ask questions about Flye is the
[issue tracker](https://github.com/fenderglass/Flye/issues).
Before posting an issue/question, consider to look through the FAQ
-and existing issues (opened and closed) - it is possble that your question
+and existing issues (opened and closed) - it is possible that your question
has already been answered.
-If you reporting a problem, please include the `flye.log` file and provide
+If you are reporting a problem, please include the `flye.log` file and provide
details about your dataset.
In case you prefer personal communication, please contact Mikhail at fenderglass at gmail.com.
=====================================
bin/flye
=====================================
@@ -14,12 +14,17 @@ import sys
BIN_DIR = "bin"
-#Setting executable paths
-flye_root = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
-bin_absolute = os.path.join(flye_root, BIN_DIR)
-sys.path.insert(0, flye_root)
-os.environ["PATH"] = bin_absolute + os.pathsep + os.environ["PATH"]
-
-#Flye entry point
-from flye.main import main
-sys.exit(main())
+def main():
+ #Setting executable paths
+ flye_root = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
+ bin_absolute = os.path.join(flye_root, BIN_DIR)
+ sys.path.insert(0, flye_root)
+ os.environ["PATH"] = bin_absolute + os.pathsep + os.environ["PATH"]
+
+ #Flye entry point
+ from flye.main import main
+ sys.exit(main())
+
+
+if __name__ == "__main__":
+ main()
=====================================
debian/changelog
=====================================
@@ -1,3 +1,9 @@
+flye (2.9.1+dfsg-1) unstable; urgency=medium
+
+ * New upstream version
+
+ -- Andreas Tille <tille at debian.org> Tue, 23 Aug 2022 16:45:19 +0200
+
flye (2.9+dfsg-2) unstable; urgency=medium
* Source-only upload
=====================================
debian/patches/fix_makefile.patch deleted
=====================================
@@ -1,24 +0,0 @@
-Author: Andreas Tille <tille at debian.org>
-Last-Update: Fri, 05 Jun 2020 15:05:08 +0200
-Description: Fix clean target in Makefile
-
---- a/src/Makefile
-+++ b/src/Makefile
-@@ -63,10 +63,10 @@ main.o: main.cpp
-
-
- clean:
-- -rm ${repeat_obj}
-- -rm ${sequence_obj}
-- -rm ${assemble_obj}
-- -rm ${polish_obj}
-- -rm ${contigger_obj}
-- -rm ${main_obj}
-- -rm ${MODULES_BIN}
-+ -rm -f ${repeat_obj}
-+ -rm -f ${sequence_obj}
-+ -rm -f ${assemble_obj}
-+ -rm -f ${polish_obj}
-+ -rm -f ${contigger_obj}
-+ -rm -f ${main_obj}
-+ -rm -f ${MODULES_BIN}
=====================================
debian/patches/series
=====================================
@@ -1,2 +1 @@
-fix_makefile.patch
use_debian_packaged_libs.patch
=====================================
debian/patches/use_debian_packaged_libs.patch
=====================================
@@ -34,13 +34,11 @@ Description: use Debian packaged libminimap2 and liblemon
make clean -C src
- make clean -C ${MINIMAP2_DIR}
- make clean-all -C ${SAMTOOLS_DIR}
-- rm ${BIN_DIR}/flye-minimap2
-- rm ${BIN_DIR}/flye-samtools
-+ rm -f ${BIN_DIR}/flye-minimap2
-+ rm -f ${BIN_DIR}/flye-samtools
+ rm -f ${BIN_DIR}/flye-minimap2
+ rm -f ${BIN_DIR}/flye-samtools
--- a/setup.py
+++ b/setup.py
-@@ -55,7 +55,7 @@ class MakeInstall(SetuptoolsInstall):
+@@ -58,7 +58,7 @@ class MakeInstall(SetuptoolsInstall):
build_dir = os.path.join(script_dir, "bin")
install_dir = self.install_scripts
@@ -51,7 +49,7 @@ Description: use Debian packaged libminimap2 and liblemon
sys.exit('Error: binary not found: ' + file)
--- a/flye/polishing/alignment.py
+++ b/flye/polishing/alignment.py
-@@ -22,8 +22,8 @@ from flye.six.moves import range
+@@ -23,8 +23,8 @@ from flye.six.moves import range
logger = logging.getLogger()
=====================================
docs/FAQ.md
=====================================
@@ -19,13 +19,16 @@ or local reassembly.
Are diploid genomes supported?
------------------------------
-Currently Flye does not explicitly support diploid assemblies. If heterzygosity
-is low, it should not be a problem for contiguity; however similar alternative
-haplotypes could be collapsed. Likewise, SNPs and structural variations between
-the alternative haplotypes will not be captured. If heterozygosity is high,
-Flye will likely recover alternative haplotypes, but will not phase them..
-Because we do not attempt to reconstruct pseudo-haplotypes,
-this will also reduce the overall contiguity.
+Currently Flye will likely produce a collapsed assembly of diploid genomes,
+levaing only one mosaic allele per haplotype. If heterzygosity is low (e.g. ~0.1%),
+it should not be a problem for contiguity; however SNPs and structural variations between
+the alternative haplotypes will not be captured. It is now possible to recover two phased haplotypes
+by using [HapDup](https://github.com/fenderglass/hapdup) after the Flye assembly.
+
+If heterozygosity is high, Flye will likely recover alternative haplotypes, but will not phase them.
+Because we currently do not attempt to reconstruct pseudo-haplotypes, this will also reduce the overall contiguity.
+We plan to bring more support for highly heterozygous genome in the future.
+
Are metagenomes supported?
--------------------------
@@ -50,7 +53,7 @@ Are PacBio CCS/HiFi reads supported?
Yes, use the `--pacbio-hifi` option.
-Are there any special parameters/modes for the newest ONT data (Guppy 5+, Q20)?
+Are there any special parameters/modes for the newest ONT data (Guppy 5+ SUP, Q20)?
-------------------------------------------------------------------------------
Yes, use the new `--nano-hq` mode.
@@ -97,7 +100,7 @@ What is minimum read coverage required for Flye?
One can typically get satisfying assembly contiguity
with 30x+ PacBio / ONT reads, if the read length is
-sufficient (e.g. with N50 of severl kb).
+sufficient (e.g. with N50 of several kb).
You might need higher coverage to improve the consensus quality.
Depending on the technology and read length distribution,
@@ -107,7 +110,7 @@ of datasets with coverage below 10x is not recommended.
How do I select genome size if I don't know it in advance?
----------------------------------------------------------
-Genome size parameter is no longer required since the version 2.8.
+Genome size parameter is no longer required since the version 2.8.
I have a seemingly sufficient number of reads, but nothing was assembled. Why is that?
@@ -124,7 +127,7 @@ N90 over 1kb (5kb+ recommended). Flye will not work with reads shorter than 1kb.
If you have verified that Flye configuration is adequate for your dataset
and the assembly is still empty, it is very likely that there is simply no
sufficient overlaps between reads to assemble anything! This often happens with
-metagenomic datasets that were sequences with low read depth.
+metagenomic datasets that were sequenced with low read depth.
My assembly size / contiguity is not what I expected. What parameters can I tweak to fix it?
--------------------------------------------------------------------------------------------
@@ -190,7 +193,7 @@ But it should be applied with caution to prevent over-correction of repetitive r
Also see [Watson and Warr paper](https://www.nature.com/articles/s41587-018-0004-z)
for a discussion on the assembly quality.
-Should I use regulat or error-corrected reads?
+Should I use regular or error-corrected reads?
---------------------------------------------
Flye was primarily designed and tested using regular (uncorrected) reads, so it is always the recommended option.
@@ -209,7 +212,7 @@ Note that in PacBio mode, Flye assumes that the input files represent PacBio sub
e.g. adaptors and scraps are removed and multiple passes of the same insertion
sequence are separated. This is typically handled by PacBio instruments/toolchains,
however we saw examples of problemmatic raw -> fastq conversions,
-which resulted into incorrectl subreads. In this case,
+which resulted into incorrect subreads. In this case,
consider using [pbclip](https://github.com/fenderglass/pbclip) to fix your Fasta/q reads.
Are cluster environments (SGE / Slurm etc.) supported?
@@ -217,7 +220,7 @@ Are cluster environments (SGE / Slurm etc.) supported?
Currently, cluster environments are not supported. Flye was designed to run on a single
high-memory node, and it will be difficult to make it run in a distributed environment.
-Note that Flye pipeline has multiple consecutive stages, that are could be resumed and
+Note that Flye pipeline has multiple consecutive stages, that could be resumed and
run on different machines, if desired.
Can I run the Flye polisher on an existing assembly?
=====================================
docs/NEWS.md
=====================================
@@ -1,7 +1,15 @@
+Flye 2.9.1 release (07 Aug 2022)
+===============================
+* New option --no-alt-contigs to remove all non-primary contigs from the assembly
+* Fixed crash on MacOS with Python 3.8+
+* Fixed rare artificially introduced mismatches while polishing
+* Fixed slow simplification of very tangled graphs
+* Various other small fixes
+
Flye 2.9 release (20 Aug 2022)
=============================
* Better assembly of very short sequences (e.g. plasmids or viruses). They vere often missed in previous versions.
-* New --nano-hq mode for ONT Guppy5+ and Q20 reads (3-5% error rate)
+* New --nano-hq mode for ONT Guppy5+ SUP and Q20 reads (3-5% error rate)
* Optimized default parameters for HiFi (HPC error threshold 0.01 -> 0.001; increased min overlap)
* Polishing improvements: reduced number of possible clusters of errors
* Improvements in repeat delection algorithm to further limit a chance of (otherwise infrequent) misassemblies
@@ -12,7 +20,7 @@ Flye 2.9 release (20 Aug 2022)
* --trestle and --subassemblies modes are now deprecated, and will be removed in the future versions
* New --extra-params option to modify config-level parameters
* Contig paths output in Gfa + number of reads supporting each link (RC tag)
-* Update to minimap 2.19
+* Update to minimap 2.18
* Several rare bug fixes/other improvements
Flye 2.8.3 release (10 Feb 2021)
=====================================
docs/USAGE.md
=====================================
@@ -8,6 +8,7 @@ Table of Contents
- [Examples](#examples)
- [Supported Input Data](#inputdata)
- [Parameter Descriptions](#parameters)
+- [Assembling diploid genomes](#diploid)
- [Flye output](#output)
- [Repeat graph](#graph)
- [Flye benchmarks](#performance)
@@ -42,7 +43,7 @@ optional arguments:
--nano-corr path [path ...]
ONT reads that were corrected with other methods (<3% error)
--nano-hq path [path ...]
- ONT high-quality reads: Guppy5+ or Q20 (<5% error)
+ ONT high-quality reads: Guppy5+ SUP or Q20 (<5% error)
--subassemblies path [path ...]
[deprecated] high-quality contigs input
-g size, --genome-size size
@@ -63,6 +64,8 @@ optional arguments:
--plasmids unused (retained for backward compatibility)
--meta metagenome / uneven coverage mode
--keep-haplotypes do not collapse alternative haplotypes
+ --no-alt-contigs do not output contigs representing alternative
+ haplotypes
--scaffold enable scaffolding using graph [disabled by default]
--trestle [deprecated] enable Trestle [disabled by default]
--polish-target path run polisher on the target sequence
@@ -125,7 +128,7 @@ The dataset was originally released by the
range of datasets, from old R7 pores to the most recent R9.x and R10.x. The
expected error rate is 10-15%.
-* For the most recent ONT data basecalled with Guppy5 use the new `--nano-hq` mode.
+* For the most recent ONT data basecalled with Guppy5+ SUP use the new `--nano-hq` mode.
Expected error rate is <5%.
* For Q20 data, use a combination of `--nano-hq` and `--read-error 0.03`..
@@ -207,6 +210,13 @@ longer consensus contigs. The option `--keep-haplotypes` retains
the alternative paths on the graph, producing less contigouos, but
more detailed assembly.
+### Removing alternative contigs
+
+In default mode, Flye is performing collapsed/haploid assmebly,
+but may output contigs representing alternative alleles if they
+differ significatnly from the "primary" assmebled allele.
+To disable output of alternative contigs, use the `--no-alt-contigs` option.
+
### Scaffold
Starting from the version 2.9 Flye does not perform scaffolding by default,
@@ -260,6 +270,13 @@ You might also resume from a particular stage with `--resume-from stage_name`,
where `stage_name` is a choice of `assembly, consensus, repeat, trestle, polishing`.
For example, you might supply different sets of reads for different stages.
+## <a name="diploid"></a> Assembling diploid genomes
+
+Currently Flye will produce collapsed assemblies of diploid genomes,
+represented by a sigle mosaic haplotype. To recover two phased haplotypes
+consider applying [HapDup](https://github.com/fenderglass/hapdup) after the assembly.
+
+
## <a name="output"></a> Flye output
The main output files are:
@@ -299,7 +316,8 @@ Scaffold gaps are marked with `??` symbols, and `*` symbol denotes a
terminal graph node.
Alternative contigs (representing alternative haplotypes) will have the same
-alt. group ID. Primary contigs are marked by `*`
+alt. group ID. Primary contigs are marked by `*`. Note that the ouptut of
+alternative contigs could be disabled via the `--no-alt-contigs` option.
## <a name="graph"></a> Repeat graph
=====================================
flye/__build__.py
=====================================
@@ -1 +1 @@
-__build__ = 1768
+__build__ = 1780
=====================================
flye/__version__.py
=====================================
@@ -1 +1 @@
-__version__ = "2.9"
+__version__ = "2.9.1"
=====================================
flye/config/bin_cfg/asm_defaults.cfg
=====================================
@@ -34,3 +34,4 @@ tip_coverage_rate = 2
tip_length_rate = 2
output_gfa_before_rr = 0
+remove_alt_edges = 0
=====================================
flye/main.py
=====================================
@@ -687,7 +687,7 @@ def main():
help="ONT reads that were corrected with other methods (<3%% error)")
read_group.add_argument("--nano-hq", dest="nano_hq", nargs="+",
default=None, metavar="path",
- help="ONT high-quality reads: Guppy5+ or Q20 (<5%% error)")
+ help="ONT high-quality reads: Guppy5+ SUP or Q20 (<5%% error)")
read_group.add_argument("--subassemblies", dest="subassemblies", nargs="+",
default=None, metavar="path",
help="[deprecated] high-quality contigs input")
@@ -726,6 +726,9 @@ def main():
parser.add_argument("--keep-haplotypes", action="store_true",
dest="keep_haplotypes", default=False,
help="do not collapse alternative haplotypes")
+ parser.add_argument("--no-alt-contigs", action="store_true",
+ dest="no_alt_contigs", default=False,
+ help="do not output contigs representing alternative haplotypes")
parser.add_argument("--scaffold", action="store_true",
dest="scaffold", default=False,
help="enable scaffolding using graph [disabled by default]")
@@ -771,6 +774,13 @@ def main():
else:
args.extra_params = hifi_str
+ if args.no_alt_contigs:
+ alt_params = "remove_alt_edges=1"
+ if args.extra_params:
+ args.extra_params += "," + alt_params
+ else:
+ args.extra_params = "remove_alt_edges=1"
+
if args.pacbio_raw:
args.reads = args.pacbio_raw
args.platform = "pacbio"
=====================================
flye/polishing/alignment.py
=====================================
@@ -13,6 +13,7 @@ from collections import namedtuple
import subprocess
import logging
import datetime
+from copy import copy
import flye.utils.fasta_parser as fp
from flye.utils.utils import which, get_median
@@ -100,10 +101,10 @@ def get_uniform_alignments(alignments):
if not alignments:
return []
- WINDOW = 500
- MIN_COV = 10
+ WINDOW = 100
+ MIN_COV = 20
GOOD_RATE = 0.66
- MIN_QV = 30
+ MIN_QV = 20
def is_reliable(aln):
return not aln.is_secondary and not aln.is_supplementary and aln.map_qv >= MIN_QV
@@ -123,6 +124,7 @@ def get_uniform_alignments(alignments):
wnd_all_cov[i] += 1
cov_threshold = max(int(get_median(wnd_primary_cov)), MIN_COV)
+ orig_primary_cov = copy(wnd_primary_cov)
selected_alignments = []
original_sequence = 0
@@ -156,9 +158,6 @@ def get_uniform_alignments(alignments):
wnd_good, wnd_bad = _aln_score(aln)
sec_aln_scores[aln.qry_id] = (wnd_good, wnd_bad, aln)
- #logger.debug("Seq: {0} pri_cov: {1} all_cov: {2}".format(ctg_id, _get_median(wnd_primary_cov),
- # _get_median(wnd_all_cov)))
-
#now, greedily add secondaty alignments, until they add useful coverage
_score_fun = lambda x: (sec_aln_scores[x][0] - 2 * sec_aln_scores[x][1],
sec_aln_scores[x][2].trg_end - sec_aln_scores[x][2].trg_start)
@@ -177,11 +176,17 @@ def get_uniform_alignments(alignments):
#logger.debug("\tSec score: {} {} {} {}".format(aln_id, wnd_good, wnd_bad, to_take))
- #logger.debug("Original seq: {0}, reads: {1}".format(original_sequence, len(alignments)))
- #logger.debug("Primary seq: {0}, reads: {1}".format(primary_sequence, primary_aln))
- #logger.debug("Secondary seq: {0}, reads: {1}".format(secondary_sequence, secondary_aln))
+ #logger.debug("Seq: {0} pri_cov: {1} all_cov: {2}".format(ctg_id, get_median(orig_primary_cov),
+ # get_median(wnd_all_cov)) + "\n" +
+ # "\tOriginal seq: {0}, reads: {1}".format(original_sequence, len(alignments)) + "\n" +
+ # "\tPrimary seq: {0}, reads: {1}".format(primary_sequence, primary_aln) + "\n" +
+ # "\tSecondary seq: {0}, reads: {1}".format(secondary_sequence, secondary_aln) + "\n" +
+ # "\tSelected size: {0}, median coverage: {1}".format(len(selected_alignments), get_median(wnd_primary_cov)))
+
+ median_cov = get_median(wnd_primary_cov)
+ #mean_cov = sum(wnd_primary_cov) / (len(wnd_primary_cov) + 1)
- return selected_alignments
+ return selected_alignments, median_cov
def split_into_chunks(fasta_in, chunk_size, fasta_out):
@@ -276,7 +281,7 @@ def _run_minimap(reference_file, reads_files, num_proc, mode, out_file,
stderr=open(stderr_file, "w"),
stdout=open(os.devnull, "w"))
if sam_output:
- subprocess.check_call(SAMTOOLS_BIN + " index " + "'" + out_file + "'", shell=True)
+ subprocess.check_call(SAMTOOLS_BIN + " index -@ 4 " + "'" + out_file + "'", shell=True)
#os.remove(stderr_file)
except (subprocess.CalledProcessError, OSError) as e:
=====================================
flye/polishing/bubbles.py
=====================================
@@ -53,7 +53,7 @@ class Bubble(object):
def _thread_worker(aln_reader, chunk_feeder, contigs_info, err_mode,
- results_queue, error_queue, bubbles_file_handle,
+ results_queue, error_queue, bubbles_file,
bubbles_file_lock):
"""
Will run in parallel
@@ -73,14 +73,18 @@ def _thread_worker(aln_reader, chunk_feeder, contigs_info, err_mode,
#since we are working with contig chunks, tranform alignment coorinates
ctg_aln = aln_reader.trim_and_transpose(ctg_aln, ctg_region.start, ctg_region.end)
+ ctg_aln, mean_cov = get_uniform_alignments(ctg_aln)
- ctg_aln = get_uniform_alignments(ctg_aln)
profile, aln_errors = _compute_profile(ctg_aln, ref_seq)
partition, num_long_bubbles = _get_partition(profile, err_mode)
ctg_bubbles = _get_bubble_seqs(ctg_aln, profile, partition, ctg_id)
- mean_cov = aln_reader.get_median_depth(ctg_region.ctg_id, ctg_region.start,
- ctg_region.end)
+ ##
+ coverage_cap = 0.9 * cfg.vals["max_read_coverage"]
+ if mean_cov > coverage_cap:
+ mean_cov = aln_reader.get_median_depth(ctg_region.ctg_id, ctg_region.start,
+ ctg_region.end)
+ ##
ctg_bubbles, num_empty = _postprocess_bubbles(ctg_bubbles)
ctg_bubbles, num_long_branch = _split_long_bubbles(ctg_bubbles)
@@ -90,7 +94,7 @@ def _thread_worker(aln_reader, chunk_feeder, contigs_info, err_mode,
b.position += ctg_region.start
with bubbles_file_lock:
- _output_bubbles(ctg_bubbles, bubbles_file_handle)
+ _output_bubbles(ctg_bubbles, open(bubbles_file, "a"))
results_queue.put((ctg_id, len(ctg_bubbles), num_long_bubbles,
num_empty, num_long_branch, aln_errors,
mean_cov))
@@ -112,18 +116,20 @@ def make_bubbles(alignment_path, contigs_info, contigs_path,
CHUNK_SIZE = 1000000
contigs_fasta = fp.read_sequence_dict(contigs_path)
- aln_reader = SynchronizedSamReader(alignment_path, contigs_fasta,
+ manager = multiprocessing.Manager()
+ aln_reader = SynchronizedSamReader(alignment_path, contigs_fasta, manager,
cfg.vals["max_read_coverage"], use_secondary=True)
- chunk_feeder = SynchonizedChunkManager(contigs_fasta, chunk_size=CHUNK_SIZE)
+ chunk_feeder = SynchonizedChunkManager(contigs_fasta, manager, chunk_size=CHUNK_SIZE)
- manager = multiprocessing.Manager()
results_queue = manager.Queue()
error_queue = manager.Queue()
bubbles_out_lock = multiprocessing.Lock()
- bubbles_out_handle = open(bubbles_out, "w")
+ #bubbles_out_handle = open(bubbles_out, "w")
process_in_parallel(_thread_worker, (aln_reader, chunk_feeder, contigs_info, err_mode,
- results_queue, error_queue, bubbles_out_handle, bubbles_out_lock), num_proc)
+ results_queue, error_queue, bubbles_out, bubbles_out_lock), num_proc)
+ #_thread_worker(aln_reader, chunk_feeder, contigs_info, err_mode,
+ # results_queue, error_queue, bubbles_out, bubbles_out_lock)
if not error_queue.empty():
raise error_queue.get()
@@ -237,7 +243,7 @@ def _postprocess_bubbles(bubbles):
new_branches.append(branch)
#checking again (since we might have tossed some branches)
- if len(bubble.branches) == 0:
+ if len(new_branches) == 0:
empty_bubbles += 1
continue
@@ -439,27 +445,26 @@ def _get_bubble_seqs(alignment, profile, partition, contig_id):
bubbles[-1].consensus = "".join(consensus)
for aln in alignment:
- bubble_id = bisect(partition, aln.trg_start)
+ bubble_id = bisect(ext_partition, aln.trg_start) - 1
next_bubble_start = ext_partition[bubble_id + 1]
- chromosome_start = bubble_id == 0
- chromosome_end = aln.trg_end > partition[-1]
+ chromosome_end = aln.trg_end >= ext_partition[-1]
- branch_start = None
- first_segment = True
+ incomplete_segment = aln.trg_start > ext_partition[bubble_id]
trg_pos = aln.trg_start
+ branch_start = 0
for i, trg_nuc in enumerate(aln.trg_seq):
if trg_nuc == "-":
continue
#if trg_pos >= contig_info.length:
#trg_pos -= contig_info.length
- if trg_pos >= next_bubble_start or trg_pos == 0:
- if not first_segment or chromosome_start:
+ if trg_pos >= next_bubble_start:
+ if not incomplete_segment:
branch_seq = fp.to_acgt(aln.qry_seq[branch_start : i].replace("-", ""))
bubbles[bubble_id].branches.append(branch_seq)
- first_segment = False
- bubble_id = bisect(partition, trg_pos)
+ incomplete_segment = False
+ bubble_id = bisect(ext_partition, trg_pos) - 1
next_bubble_start = ext_partition[bubble_id + 1]
branch_start = i
=====================================
flye/polishing/consensus.py
=====================================
@@ -48,7 +48,7 @@ def _thread_worker(aln_reader, chunk_feeder, platform, results_queue, error_queu
continue
ctg_aln = aln_reader.trim_and_transpose(ctg_aln, ctg_region.start, ctg_region.end)
- ctg_aln = get_uniform_alignments(ctg_aln)
+ ctg_aln, _mean_cov = get_uniform_alignments(ctg_aln)
profile, aln_errors = _contig_profile(ctg_aln, platform)
sequence = _flatten_profile(profile)
@@ -68,14 +68,15 @@ def get_consensus(alignment_path, contigs_path, contigs_info, num_proc,
CHUNK_SIZE = 1000000
contigs_fasta = fp.read_sequence_dict(contigs_path)
- aln_reader = SynchronizedSamReader(alignment_path, contigs_fasta,
+ mp_manager = multiprocessing.Manager()
+ aln_reader = SynchronizedSamReader(alignment_path, contigs_fasta, mp_manager,
max_coverage=cfg.vals["max_read_coverage"],
use_secondary=True)
- chunk_feeder = SynchonizedChunkManager(contigs_fasta, CHUNK_SIZE)
+ chunk_feeder = SynchonizedChunkManager(contigs_fasta, mp_manager, CHUNK_SIZE)
- manager = multiprocessing.Manager()
- results_queue = manager.Queue()
- error_queue = manager.Queue()
+ #manager = multiprocessing.Manager()
+ results_queue = mp_manager.Queue()
+ error_queue = mp_manager.Queue()
process_in_parallel(_thread_worker, (aln_reader, chunk_feeder,
platform, results_queue, error_queue), num_proc)
=====================================
flye/polishing/polish.py
=====================================
@@ -9,7 +9,7 @@ Runs polishing binary in parallel and concatentes output
from __future__ import absolute_import
from __future__ import division
import logging
-import subprocess
+import subprocess, multiprocessing
import os
from collections import defaultdict
@@ -156,7 +156,7 @@ def generate_polished_edges(edges_file, gfa_file, polished_contigs, work_dir,
work_dir, error_mode, alignment_file,
reference_mode=True, sam_output=True)
aln_reader = SynchronizedSamReader(alignment_file,
- polished_dict,
+ polished_dict, multiprocessing.Manager(),
cfg.vals["max_read_coverage"])
aln_by_edge = defaultdict(list)
=====================================
flye/trestle/divergence.py
=====================================
@@ -178,11 +178,11 @@ def find_divergence(alignment_path, contigs_path, contigs_info,
return
contigs_fasta = fp.read_sequence_dict(contigs_path)
- aln_reader = SynchronizedSamReader(alignment_path, contigs_fasta,
+ manager = multiprocessing.Manager()
+ aln_reader = SynchronizedSamReader(alignment_path, contigs_fasta, manager,
config.vals["max_read_coverage"])
- chunk_feeder = SynchonizedChunkManager(contigs_fasta)
+ chunk_feeder = SynchonizedChunkManager(contigs_fasta, manager)
- manager = multiprocessing.Manager()
results_queue = manager.Queue()
error_queue = manager.Queue()
=====================================
flye/trestle/trestle.py
=====================================
@@ -1148,7 +1148,7 @@ def partition_reads(edges, it, side, position_path, cons_align_path,
def _read_alignment(alignment, target_path, min_aln_rate):
alignments = []
fasta_dict = fp.read_sequence_dict(target_path)
- aln_reader = SynchronizedSamReader(alignment, fasta_dict,
+ aln_reader = SynchronizedSamReader(alignment, fasta_dict, multiprocessing.Manager(),
config.vals["max_read_coverage"])
for ctg_id in fasta_dict:
ctg_aln = aln_reader.get_alignments(ctg_id)
=====================================
flye/utils/sam_parser.py
=====================================
@@ -129,15 +129,15 @@ class SynchonizedChunkManager(object):
Stores the list of reference segments / chunks and can
return them in multiple threds
"""
- def __init__(self, reference_fasta, chunk_size=None):
+ def __init__(self, reference_fasta, multiproc_manager, chunk_size=None):
#prepare list of chunks to read
self.fetch_list = []
self.chunk_size = chunk_size
#will be shared between processes
- self.shared_manager = multiprocessing.Manager()
+ #self.shared_manager = multiprocessing.Manager()
self.shared_num_jobs = multiprocessing.Value(ctypes.c_int, 0)
- self.shared_lock = self.shared_manager.Lock()
+ self.shared_lock = multiproc_manager.Lock()
self.shared_eof = multiprocessing.Value(ctypes.c_bool, False)
@@ -182,7 +182,7 @@ class SynchronizedSamReader(object):
"""
Parses SAM file in multiple threads.
"""
- def __init__(self, sam_alignment, reference_fasta,
+ def __init__(self, sam_alignment, reference_fasta, multiproc_manager,
max_coverage=None, use_secondary=False):
#check that alignment exists
if not os.path.exists(sam_alignment):
@@ -196,8 +196,8 @@ class SynchronizedSamReader(object):
self.use_secondary = use_secondary
self.cigar_parser = re.compile(b"[0-9]+[MIDNSHP=X]")
- self.shared_manager = multiprocessing.Manager()
- self.ref_fasta = self.shared_manager.dict()
+ #self.shared_manager = multiprocessing.Manager()
+ self.ref_fasta = multiproc_manager.dict()
for (h, s) in iteritems(reference_fasta):
self.ref_fasta[_BYTES(h)] = _BYTES(s)
=====================================
setup.py
=====================================
@@ -53,6 +53,9 @@ class MakeInstall(SetuptoolsInstall):
sys.exit('Error: no write permission for ' + self.install_scripts + ' ' +
'Perhaps you need to use sudo?')
+ if not os.path.exists(self.install_scripts):
+ os.makedirs(self.install_scripts)
+
build_dir = os.path.join(script_dir, "bin")
install_dir = self.install_scripts
bin_files = ['flye-modules', 'flye-minimap2', 'flye-samtools']
=====================================
src/Makefile
=====================================
@@ -1,6 +1,7 @@
.PHONY: all clean debug profile
CXXFLAGS += -Wall -Wextra -pthread -std=c++11 -g
+CXXFLAGS += -Wno-missing-field-initializers
LDFLAGS += -pthread -std=c++11 -rdynamic
MODULES_BIN := ${BIN_DIR}/flye-modules
@@ -63,10 +64,10 @@ main.o: main.cpp
clean:
- -rm ${repeat_obj}
- -rm ${sequence_obj}
- -rm ${assemble_obj}
- -rm ${polish_obj}
- -rm ${contigger_obj}
- -rm ${main_obj}
- -rm ${MODULES_BIN}
+ rm -f ${repeat_obj}
+ rm -f ${sequence_obj}
+ rm -f ${assemble_obj}
+ rm -f ${polish_obj}
+ rm -f ${contigger_obj}
+ rm -f ${main_obj}
+ rm -f ${MODULES_BIN}
=====================================
src/repeat_graph/haplotype_resolver.cpp
=====================================
@@ -195,10 +195,10 @@ int HaplotypeResolver::findHeterozygousLoops()
_graph.linkEdges(_graph.complementEdge(outEdge),
_graph.complementEdge(inEdge));
- //bridging sequence.
//either remove or unroll loop, depending on the coverage
- if (loop.meanCoverage <
- (entrancePath->meanCoverage + exitPath->meanCoverage) / 4)
+ int32_t mainPathCoverage = (entrancePath->meanCoverage + exitPath->meanCoverage) / 2;
+ int32_t LOOP_COV_RATE = 4; //2 times lower than half coverage (in case of haploid indel)
+ if (loop.meanCoverage < mainPathCoverage / LOOP_COV_RATE)
{
_bridgingSeqs[std::make_pair(inEdge, outEdge)] = DnaSequence("A");
_bridgingSeqs[std::make_pair(_graph.complementEdge(outEdge),
@@ -630,7 +630,7 @@ void HaplotypeResolver::collapseHaplotypes()
}
//for each haplotype path, if it's a loop - convert it to a linear edge
- GraphProcessor proc(_graph, _asmSeqs);
+ /*GraphProcessor proc(_graph, _asmSeqs);
auto unbranchingPaths = proc.getUnbranchingPaths();
for (auto& path : unbranchingPaths)
{
@@ -646,6 +646,16 @@ void HaplotypeResolver::collapseHaplotypes()
break;
}
}
+ }*/
+
+ if ((bool)Config::get("remove_alt_edges"))
+ {
+ std::unordered_set<GraphEdge*> toDelete;
+ for (auto& edge: _graph.iterEdges())
+ {
+ if (edge->altHaplotype) toDelete.insert(edge);
+ }
+ for (auto& edge : toDelete) _graph.removeEdge(edge);
}
_aligner.updateAlignments();
@@ -724,7 +734,7 @@ namespace
//finds any path of length up to maxDepth from the given edge
//(first and last edges do not count towards length).
//if there are no paths of that length, return the longest one
- auto anyPath = [](GraphEdge* startEdge, int maxDepth)
+ auto anyPath = [](GraphEdge* startEdge, int maxDepth, const RepeatGraph& graph)
{
std::vector<PathWithLen> deadEnds;
std::vector<PathWithLen> queue;
@@ -743,7 +753,9 @@ namespace
{
bool localRepeat = std::find(curPath.path.rbegin(), curPath.path.rend(),
nextEdge) != curPath.path.rend();
- if (localRepeat) continue;
+ bool localComplRepeat = std::find(curPath.path.rbegin(), curPath.path.rend(),
+ graph.complementEdge(nextEdge)) != curPath.path.rend();
+ if (localRepeat || localComplRepeat) continue;
if (nextEdge->isLooped() && nextEdge->length() < maxDepth) continue;
deadEnd = false;
@@ -756,6 +768,11 @@ namespace
{
deadEnds.push_back(curPath);
}
+
+ //heuristic to prevent very slow processing of very tangled components
+ const size_t MAX_CANDIDATES = 100;
+ if (deadEnds.size() > MAX_CANDIDATES) break;
+ if (queue.size() > MAX_CANDIDATES) break;
}
//no paths over MAX_DEPTH, return the longest path
@@ -856,7 +873,7 @@ namespace
const std::unordered_set<GraphEdge*> loopedEdges)
{
//Logger::get().debug() << "\t\tSearching for ref. path";
- auto refPath = anyPath(startEdge, maxBubbleLen);
+ auto refPath = anyPath(startEdge, maxBubbleLen, graph);
if (refPath.empty()) return Superbubble();
/*std::string pathStr;
@@ -980,19 +997,19 @@ int HaplotypeResolver::findSuperbubbles()
//require at least two alternative paths (not counting loops)
//to initiate the bubble. No such requirement for the bubble end though
- //int numIn = 0;
+ int numIn = 0;
int numOut = 0;
- /*for (auto& edge : startEdge->nodeRight->inEdges)
+ for (auto& edge : startEdge->nodeRight->inEdges)
{
if (!loopedEdges.count(edge)) ++numIn;
- }*/
+ }
for (auto& edge : startEdge->nodeRight->outEdges)
{
if (!loopedEdges.count(edge)) ++numOut;
}
- if (numOut < 2) continue;
+ //if (numOut < 2) continue;
//Logger::get().debug() << "\tChecking start: " << startEdge->edgeId.signedId();
- //if (numOut < 2 || numIn > 1) continue;
+ if (numOut < 2 || numIn > 1) continue;
//if (startEdge->nodeRight->inEdges.size() > 1 ||
// startEdge->nodeRight->outEdges.size() < 2) continue;
=====================================
src/repeat_graph/main_repeat.cpp
=====================================
@@ -251,7 +251,7 @@ int repeat_main(int argc, char** argv)
outGen.outputGfa(proc.getEdgesPaths(), outFolder + "/graph_before_rr.gfa");
}
- if (isMeta)
+ if (isMeta && !keepHaplotypes)
{
repResolver.resolveSimpleRepeats();
}
=====================================
src/repeat_graph/multiplicity_inferer.cpp
=====================================
@@ -560,8 +560,8 @@ void MultiplicityInferer::trimTipsIteration(int& outShort, int& outLong)
{
if (toRemove.count(path.id))
{
- //Logger::get().debug() << "Tip " << path.edgesStr()
- // << " len:" << path.length << " cov:" << path.meanCoverage;
+ Logger::get().debug() << "Tip " << path.edgesStr()
+ << " len:" << path.length << " cov:" << path.meanCoverage;
GraphEdge* targetEdge = path.path.front();
GraphEdge* complEdge = _graph.complementEdge(targetEdge);
@@ -575,6 +575,16 @@ void MultiplicityInferer::trimTipsIteration(int& outShort, int& outLong)
vecRemove(complEdge->nodeRight->inEdges, complEdge);
complEdge->nodeRight = _graph.addNode();
complEdge->nodeRight->inEdges.push_back(complEdge);
+
+
+ if ((bool)Config::get("remove_alt_edges"))
+ {
+ for (auto& edge : path.path)
+ {
+ _graph.removeEdge(_graph.complementEdge(edge));
+ _graph.removeEdge(edge);
+ }
+ }
}
}
outShort = shortClipped;
=====================================
src/repeat_graph/repeat_graph.cpp
=====================================
@@ -50,11 +50,18 @@ namespace
bool GraphEdge::isRightTerminal() const
{
+ /*int inCount = 0;
+ int outCount = 0;
for (GraphEdge* edge : this->nodeRight->outEdges)
{
- if (!edge->isLooped()) return false;
+ if (!edge->isLooped()) ++outCount;
}
- return true;
+ for (GraphEdge* edge : this->nodeRight->inEdges)
+ {
+ if (!edge->isLooped()) ++inCount;
+ }
+ return inCount == 1 && outCount == 0;*/
+ return this->nodeRight->outEdges.size() == 0 && this->nodeRight->inEdges.size() == 1;
}
std::unordered_set<GraphEdge*> GraphEdge::adjacentEdges()
=====================================
src/sequence/alignment.cpp
=====================================
@@ -153,7 +153,24 @@ float getAlignmentCigarKsw(const DnaSequence& trgSeq, size_t trgBegin, size_t tr
trgByte.size(), &trgByte[0], NUM_NUCL,
subsMat, gapOpen, gapExtend, bandWidth, Z_DROP,
END_BONUS, FLAG, &ez);
- if (!ez.zdropped) break;
+ if (!ez.zdropped)
+ {
+ //check deviation from the diagonal
+ int64_t deviation = 0;
+ for (size_t i = 0; i < (size_t)ez.n_cigar; ++i)
+ {
+ int32_t size = ez.cigar[i] >> 4;
+ char op = "MID"[ez.cigar[i] & 0xf];
+ if (op == 'I') deviation += size;
+ if (op == 'D') deviation -= size;
+ }
+ if (labs(deviation) > bandWidth) //looks like this never happens
+ {
+ Logger::get().warning() << "Deviation: " << deviation << " " << bandWidth;
+ }
+ if (labs(deviation) <= bandWidth) break;
+ }
+
if (bandWidth > (int)std::max(qryByte.size(), trgByte.size())) break; //just in case
bandWidth *= 2;
}
=====================================
src/sequence/sequence_container.cpp
=====================================
@@ -150,6 +150,7 @@ size_t SequenceContainer::readFasta(std::vector<FastaRecord>& record,
auto* fd = gzopen(fileName.c_str(), "rb");
if (!fd)
{
+ delete[] rawBuffer;
throw ParseException("Can't open reads file");
}
@@ -235,6 +236,7 @@ size_t SequenceContainer::readFastq(std::vector<FastaRecord>& record,
auto* fd = gzopen(fileName.c_str(), "rb");
if (!fd)
{
+ delete[] rawBuffer;
throw ParseException("Can't open reads file");
}
View it on GitLab: https://salsa.debian.org/med-team/flye/-/compare/86ee0688c54eb5c44b737c58f6aeedcef0348226...2fbaaf43498c5d30a843f6ed97e5d0a581061b3b
--
View it on GitLab: https://salsa.debian.org/med-team/flye/-/compare/86ee0688c54eb5c44b737c58f6aeedcef0348226...2fbaaf43498c5d30a843f6ed97e5d0a581061b3b
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20220823/252c6b66/attachment-0001.htm>
More information about the debian-med-commit
mailing list