[med-svn] [Git][med-team/unicycler][upstream] New upstream version 0.4.7+dfsg

Thu Sep 6 10:32:12 BST 2018

Andreas Tille pushed to branch upstream at Debian Med / unicycler


Commits:
6b867d2e by Andreas Tille at 2018-09-06T06:32:06Z
New upstream version 0.4.7+dfsg
- - - - -


10 changed files:

- README.md
- docs/unicycler-polish.md
- test/test_misc.py
- unicycler/blast_func.py
- unicycler/misc.py
- unicycler/pilon_func.py
- unicycler/spades_func.py
- unicycler/unicycler.py
- unicycler/unicycler_polish.py
- unicycler/version.py


Changes:

=====================================
README.md
=====================================
@@ -53,6 +53,7 @@ And read about how we use it to complete bacterial genomes here:
     * [Known contamination](#known-contamination)
     * [Manual multiplicity](#manual-multiplicity)
     * [Manual completion](#manual-completion)
+    * [Using an external long-read assembly](#using-an-external-long-read-assembly)
 * [Other included tools](#other-included-tools)
 * [Paper](#paper)
 * [Acknowledgements](#acknowledgements)
@@ -367,40 +368,42 @@ usage: unicycler [-h] [--help_all] [--version] [-1 SHORT1] [-2 SHORT2] [-s UNPAI
 Unicycler: an assembly pipeline for bacterial genomes
 
 Help:
-  -h, --help                           Show this help message and exit
-  --help_all                           Show a help message with all program options
-  --version                            Show Unicycler's version number
+  -h, --help                     Show this help message and exit
+  --help_all                     Show a help message with all program options
+  --version                      Show Unicycler's version number
 
 Input:
-  -1 SHORT1, --short1 SHORT1           FASTQ file of first short reads in each pair (required)
-  -2 SHORT2, --short2 SHORT2           FASTQ file of second short reads in each pair (required)
-  -s UNPAIRED, --unpaired UNPAIRED     FASTQ file of unpaired short reads (optional)
-  -l LONG, --long LONG                 FASTQ or FASTA file of long reads (optional)
+  -1 SHORT1, --short1 SHORT1     FASTQ file of first short reads in each pair (required)
+  -2 SHORT2, --short2 SHORT2     FASTQ file of second short reads in each pair (required)
+  -s UNPAIRED, --unpaired UNPAIRED
+                                 FASTQ file of unpaired short reads (optional)
+  -l LONG, --long LONG           FASTQ or FASTA file of long reads (optional)
 
 Output:
-  -o OUT, --out OUT                    Output directory (required)
-  --verbosity VERBOSITY                Level of stdout and log file information (default: 1)
-                                         0 = no stdout, 1 = basic progress indicators, 2 = extra info,
-                                         3 = debugging info
-  --min_fasta_length MIN_FASTA_LENGTH  Exclude contigs from the FASTA file which are shorter than this length
-                                       (default: 100)
-  --keep KEEP                          Level of file retention (default: 1)
-                                         0 = only keep final files: assembly (FASTA, GFA and log),
-                                         1 = also save graphs at main checkpoints,
-                                         2 = also keep SAM (enables fast rerun in different mode),
-                                         3 = keep all temp files and save all graphs (for debugging)
-  --vcf                                Produce a VCF by mapping the short reads to the final assembly
-                                       (experimental, default: do not produce a vcf file)
+  -o OUT, --out OUT              Output directory (required)
+  --verbosity VERBOSITY          Level of stdout and log file information (default: 1)
+                                   0 = no stdout, 1 = basic progress indicators, 2 = extra info,
+                                   3 = debugging info
+  --min_fasta_length MIN_FASTA_LENGTH
+                                 Exclude contigs from the FASTA file which are shorter than this
+                                 length (default: 100)
+  --keep KEEP                    Level of file retention (default: 1)
+                                   0 = only keep final files: assembly (FASTA, GFA and log),
+                                   1 = also save graphs at main checkpoints,
+                                   2 = also keep SAM (enables fast rerun in different mode),
+                                   3 = keep all temp files and save all graphs (for debugging)
+  --vcf                          Produce a VCF by mapping the short reads to the final assembly
+                                 (experimental, default: do not produce a vcf file)
 
 Other:
-  -t THREADS, --threads THREADS        Number of threads used (default: 8)
-  --mode {conservative,normal,bold}    Bridging mode (default: normal)
-                                         conservative = smaller contigs, lowest misassembly rate
-                                         normal = moderate contig size and misassembly rate
-                                         bold = longest contigs, higher misassembly rate
-  --linear_seqs LINEAR_SEQS            The expected number of linear (i.e. non-circular) sequences in the
-                                       underlying sequence (default: 0)
-
+  -t THREADS, --threads THREADS  Number of threads used (default: 8)
+  --mode {conservative,normal,bold}
+                                 Bridging mode (default: normal)
+                                   conservative = smaller contigs, lowest misassembly rate
+                                   normal = moderate contig size and misassembly rate
+                                   bold = longest contigs, higher misassembly rate
+  --linear_seqs LINEAR_SEQS      The expected number of linear (i.e. non-circular) sequences in
+                                 the underlying sequence (default: 0)
 ```
 
 ### Advanced options
@@ -409,77 +412,96 @@ Run `unicycler --help_all` to see a complete list of the program's options. Thes
 
 ```
 SPAdes assembly:
-  These options control the short-read SPAdes assembly at the beginning of the Unicycler pipeline.
-
-  --spades_path SPADES_PATH           Path to the SPAdes executable (default: spades.py)
-  --no_correct                        Skip SPAdes error correction step (default: conduct SPAdes error
-                                      correction)
-  --min_kmer_frac MIN_KMER_FRAC       Lowest k-mer size for SPAdes assembly, expressed as a fraction of the read
-                                      length (default: 0.2)
-  --max_kmer_frac MAX_KMER_FRAC       Highest k-mer size for SPAdes assembly, expressed as a fraction of the read
-                                      length (default: 0.95)
-  --kmer_count KMER_COUNT             Number of k-mer steps to use in SPAdes assembly (default: 10)
-  --depth_filter DEPTH_FILTER         Filter out contigs lower than this fraction of the chromosomal depth, if
-                                      doing so does not result in graph dead ends (default: 0.25)
+  These options control the short-read SPAdes assembly at the beginning of the Unicycler
+  pipeline.
+
+  --spades_path SPADES_PATH      Path to the SPAdes executable (default: spades.py)
+  --no_correct                   Skip SPAdes error correction step (default: conduct SPAdes error
+                                 correction)
+  --min_kmer_frac MIN_KMER_FRAC  Lowest k-mer size for SPAdes assembly, expressed as a fraction of
+                                 the read length (default: 0.2)
+  --max_kmer_frac MAX_KMER_FRAC  Highest k-mer size for SPAdes assembly, expressed as a fraction
+                                 of the read length (default: 0.95)
+  --kmers KMERS                  Exact k-mers to use for SPAdes assembly, comma-separated
+                                 (example: 22,33,44, default: automatic)
+  --kmer_count KMER_COUNT        Number of k-mer steps to use in SPAdes assembly (default: 10)
+  --depth_filter DEPTH_FILTER    Filter out contigs lower than this fraction of the chromosomal
+                                 depth, if doing so does not result in graph dead ends (default:
+                                 0.25)
+  --spades_tmp_dir SPADES_TMP_DIR
+                                 Specify SPAdes temporary directory using the SPAdes --tmp-dir
+                                 option (default: make a temporary directory in the output
+                                 directory)
 
 miniasm+Racon assembly:
   These options control the use of miniasm and Racon to produce long-read bridges.
 
-  --no_miniasm                        Skip miniasm+Racon bridging (default: use miniasm and Racon to produce
-                                      long-read bridges)
-  --racon_path RACON_PATH             Path to the Racon executable (default: racon)
+  --no_miniasm                   Skip miniasm+Racon bridging (default: use miniasm and Racon to
+                                 produce long-read bridges)
+  --racon_path RACON_PATH        Path to the Racon executable (default: racon)
+  --existing_long_read_assembly EXISTING_LONG_READ_ASSEMBLY
+                                 A pre-prepared long read assembly for the sample in GFA format.
+                                 If this option is used, Unicycler will skip the miniasm/Racon
+                                 steps and instead use the given assembly (default: perform long
+                                 read assembly using miniasm/Racon)
 
 Assembly rotation:
-  These options control the rotation of completed circular sequence near the end of the Unicycler pipeline.
-
-  --no_rotate                         Do not rotate completed replicons to start at a standard gene (default:
-                                      completed replicons are rotated)
-  --start_genes START_GENES           FASTA file of genes for start point of rotated replicons (default:
-                                      start_genes.fasta)
-  --start_gene_id START_GENE_ID       The minimum required BLAST percent identity for a start gene search
-                                      (default: 90.0)
-  --start_gene_cov START_GENE_COV     The minimum required BLAST percent coverage for a start gene search
-                                      (default: 95.0)
+  These options control the rotation of completed circular sequence near the end of the
+  Unicycler pipeline.
+
+  --no_rotate                    Do not rotate completed replicons to start at a standard gene
+                                 (default: completed replicons are rotated)
+  --start_genes START_GENES      FASTA file of genes for start point of rotated replicons
+                                 (default: start_genes.fasta)
+  --start_gene_id START_GENE_ID  The minimum required BLAST percent identity for a start gene
+                                 search (default: 90.0)
+  --start_gene_cov START_GENE_COV
+                                 The minimum required BLAST percent coverage for a start gene
+                                 search (default: 95.0)
   --makeblastdb_path MAKEBLASTDB_PATH
-                                      Path to the makeblastdb executable (default: makeblastdb)
-  --tblastn_path TBLASTN_PATH         Path to the tblastn executable (default: tblastn)
+                                 Path to the makeblastdb executable (default: makeblastdb)
+  --tblastn_path TBLASTN_PATH    Path to the tblastn executable (default: tblastn)
 
 Pilon polishing:
-  These options control the final assembly polish using Pilon at the end of the Unicycler pipeline.
+  These options control the final assembly polish using Pilon at the end of the Unicycler
+  pipeline.
 
-  --no_pilon                          Do not use Pilon to polish the final assembly (default: Pilon is used)
-  --bowtie2_path BOWTIE2_PATH         Path to the bowtie2 executable (default: bowtie2)
+  --no_pilon                     Do not use Pilon to polish the final assembly (default: Pilon is
+                                 used)
+  --bowtie2_path BOWTIE2_PATH    Path to the bowtie2 executable (default: bowtie2)
   --bowtie2_build_path BOWTIE2_BUILD_PATH
-                                      Path to the bowtie2_build executable (default: bowtie2-build)
-  --samtools_path SAMTOOLS_PATH       Path to the samtools executable (default: samtools)
-  --pilon_path PILON_PATH             Path to a Pilon executable or the Pilon Java archive file (default: pilon)
-  --java_path JAVA_PATH               Path to the java executable (default: java)
-  --min_polish_size MIN_POLISH_SIZE   Contigs shorter than this value (bp) will not be polished using Pilon
-                                      (default: 10000)
+                                 Path to the bowtie2_build executable (default: bowtie2-build)
+  --samtools_path SAMTOOLS_PATH  Path to the samtools executable (default: samtools)
+  --pilon_path PILON_PATH        Path to a Pilon executable or the Pilon Java archive file
+                                 (default: pilon)
+  --java_path JAVA_PATH          Path to the java executable (default: java)
+  --min_polish_size MIN_POLISH_SIZE
+                                 Contigs shorter than this value (bp) will not be polished using
+                                 Pilon (default: 10000)
 
 VCF:
   These options control the production of the VCF of the final assembly.
 
-  --bcftools_path BCFTOOLS_PATH       Path to the bcftools executable (default: bcftools)
+  --bcftools_path BCFTOOLS_PATH  Path to the bcftools executable (default: bcftools)
 
 Graph cleaning:
   These options control the removal of small leftover sequences after bridging is complete.
 
   --min_component_size MIN_COMPONENT_SIZE
-                                      Graph components smaller than this size (bp) will be removed from the final
-                                      graph (default: 1000)
+                                 Graph components smaller than this size (bp) will be removed from
+                                 the final graph (default: 1000)
   --min_dead_end_size MIN_DEAD_END_SIZE
-                                      Graph dead ends smaller than this size (bp) will be removed from the final
-                                      graph (default: 1000)
+                                 Graph dead ends smaller than this size (bp) will be removed from
+                                 the final graph (default: 1000)
 
 Long read alignment:
   These options control the alignment of long reads to the assembly graph.
 
-  --contamination CONTAMINATION       FASTA file of known contamination in long reads
-  --scores SCORES                     Comma-delimited string of alignment scores: match, mismatch, gap open, gap
-                                      extend (default: 3,-6,-5,-2)
-  --low_score LOW_SCORE               Score threshold - alignments below this are considered poor (default: set
-                                      threshold automatically)
+  --contamination CONTAMINATION  FASTA file of known contamination in long reads
+  --scores SCORES                Comma-delimited string of alignment scores: match, mismatch, gap
+                                 open, gap extend (default: 3,-6,-5,-2)
+  --low_score LOW_SCORE          Score threshold - alignments below this are considered poor
+                                 (default: set threshold automatically)
 ```
 
 
@@ -620,6 +642,12 @@ If you believe this has happened in your assembly, you can manually assign multi
 If Unicycler doesn't complete your bacterial genome assembly on its own, you may be able to complete it manually with a bit of bioinformatics detective work. There's no single, straight-forward procedure for doing so, but I've put together [a few examples on the Unicycler wiki](https://github.com/rrwick/Unicycler/wiki/Tips-for-finishing-genomes) which may be helpful.
 
 
+### Using an external long-read assembly
+
+If you have a long-read assembly that you've prepared outside Unicycler and trust (e.g. with Canu), you can give it to Unicycler with `--existing_long_read_assembly`. Unicycler will then skip its miniasm/Racon step and use this assembly instead.
+
+
+
 
 # Other included tools
 


=====================================
docs/unicycler-polish.md
=====================================
@@ -85,6 +85,7 @@ Generic long reads:
 Polishing settings:
   Various settings for polishing behaviour (defaults should work well in most cases)
 
+  --no_fix_local                        do not fix local misassemblies (default: False)
   --min_insert MIN_INSERT               minimum valid short read insert size (default: auto)
   --max_insert MAX_INSERT               maximum valid short read insert size (default: auto)
   --min_align_length MIN_ALIGN_LENGTH   Minimum long read alignment length (default: 1000)


=====================================
test/test_misc.py
=====================================
@@ -372,6 +372,13 @@ class TestMiscFunctions(unittest.TestCase):
         version = unicycler.misc.java_version_from_java_output(java_version_output)
         self.assertEqual(version, '')
 
+    def test_java_version_parsing_6(self):
+        java_version_output = 'openjdk version "9-Ubuntu"\n' \
+                              'OpenJDK Runtime Environment (build 9-Ubuntu+0-9b181-4)\n' \
+                              'OpenJDK 64-Bit Server VM (build 9-Ubuntu+0-9b181-4, mixed mode)'
+        version = unicycler.misc.java_version_from_java_output(java_version_output)
+        self.assertEqual(version, '9-Ubuntu')
+
     def test_spades_version_parsing_1(self):
         spades_version_output = 'SPAdes v3.10.1'
         version = unicycler.misc.spades_version_from_spades_output(spades_version_output)


=====================================
unicycler/blast_func.py
=====================================
@@ -36,6 +36,7 @@ def find_start_gene(sequence, start_genes_fasta, identity_threshold, coverage_th
     # gene overlaps from the end to the start, we have to duplicate some of the replicon sequence
     # for the BLAST database.
     seq_len = len(sequence)
+    start_genes_fasta = os.path.abspath(start_genes_fasta)
     queries = load_fasta(start_genes_fasta)
     if not queries:
         raise CannotFindStart


=====================================
unicycler/misc.py
=====================================
@@ -1070,19 +1070,29 @@ def java_path_and_version(java_path):
 
     # Make sure Java is 1.7+
     try:
-        major_version = int(version.split('.')[0])
-        if major_version < 1:
-            status = 'too old'
-        elif major_version > 1:
-            status = 'good'
-        else:  # major_version == 1
-            minor_version = int(version.split('.')[1])
-            if minor_version < 7:
+        if '.' in version:
+            major_version = int(version.split('.')[0])
+            if major_version < 1:
+                status = 'too old'
+            elif major_version > 1:
+                status = 'good'
+            else:  # major_version == 1
+                minor_version = int(version.split('.')[1])
+                if minor_version < 7:
+                    status = 'too old'
+                else:
+                    status = 'good'
+        elif '-' in version:
+            # E.g. '9-Ubuntu'
+            major_version = int(version.split('-')[0])
+            if major_version < 7:
                 status = 'too old'
             else:
                 status = 'good'
+        else:
+            status = 'bad'
     except (ValueError, IndexError):
-        version, status = '?', 'too old'
+        status = 'bad'
     return found_java_path, version, status
 
 


=====================================
unicycler/pilon_func.py
=====================================
@@ -336,7 +336,7 @@ def polish_with_pilon(graph, args, polish_dir, insert_size_1st, insert_size_99th
         for f in list_of_files:
             try:
                 os.remove(f)
-            except FileNotFoundError:
+            except (FileNotFoundError, OSError):
                 pass
 
     return total_count


=====================================
unicycler/spades_func.py
=====================================
@@ -73,7 +73,7 @@ def get_best_spades_graph(short1, short2, short_unpaired, out_dir, read_depth_fi
         kmer_range = kmers
     else:
         kmer_range = get_kmer_range(short1, short2, short_unpaired, spades_dir, kmer_count,
-                                    min_k_frac, max_k_frac)
+                                    min_k_frac, max_k_frac, spades_path)
     assem_dir = os.path.join(spades_dir, 'assembly')
 
     log.log_section_header('SPAdes assemblies')
@@ -427,8 +427,27 @@ def spades_assembly(read_files, out_dir, kmers, threads, spades_path, spades_tmp
         return graph_files, insert_size_mean, insert_size_deviation
 
 
+def get_max_spades_kmer(spades_path):
+    """
+    SPAdes usually has a maximum k-mer size of 127, but this can be changed when compiling SPAdes,
+    so this function checks the help text to see what it is.
+    https://github.com/ablab/spades/issues/40
+    """
+    try:
+        process = subprocess.Popen([spades_path, '--help'], stdout=subprocess.PIPE,
+                                   stderr=subprocess.PIPE)
+        out, err = process.communicate()
+        all_output = out.decode() + err.decode()
+        all_output = all_output.replace('\n', ' ')
+        all_output = ' '.join(all_output.split())
+        max_kmer = all_output.split('must be odd and less than ')[1].split(')')[0]
+        return int(max_kmer) - 1
+    except (IndexError, ValueError):
+        return 127
+
+
 def get_kmer_range(reads_1_filename, reads_2_filename, unpaired_reads_filename, spades_dir,
-                   kmer_count, min_kmer_frac, max_kmer_frac):
+                   kmer_count, min_kmer_frac, max_kmer_frac, spades_path):
     """
     Uses the read lengths to determine the k-mer range to be used in the SPAdes assembly.
     """
@@ -453,6 +472,11 @@ def get_kmer_range(reads_1_filename, reads_2_filename, unpaired_reads_filename,
             except ValueError:
                 pass
 
+    max_spades_kmer = get_max_spades_kmer(spades_path)
+    log.log('SPAdes maximum k-mer: {}'.format(max_spades_kmer))
+    if max_spades_kmer != 127:
+        log.log('    (unusual value, probably indicates custom SPAdes compilation)')
+
     # If the code got here, then the k-mer range doesn't already exist and we'll create one by
     # examining the read lengths.
     read_lengths = get_read_lengths(reads_1_filename) + get_read_lengths(reads_2_filename) + \
@@ -460,18 +484,25 @@ def get_kmer_range(reads_1_filename, reads_2_filename, unpaired_reads_filename,
     read_lengths = sorted(read_lengths)
     median_read_length = read_lengths[len(read_lengths) // 2 - 1]
     max_kmer = round_to_nearest_odd(max_kmer_frac * median_read_length)
-    if max_kmer > 127:
-        max_kmer = 127
+    if max_kmer > max_spades_kmer:
+        max_kmer = max_spades_kmer
     starting_kmer = round_to_nearest_odd(min_kmer_frac * max_kmer / max_kmer_frac)
     if starting_kmer < 11:
         starting_kmer = 11
 
-    # Create the k-mer range from a non-linear function that spaces out the early k-mers more and
-    # makes the later k-mers (which are most likely to be the good, used ones) closer together.
-    kmer_range = []
-    for x in [x / (kmer_count - 1) for x in range(kmer_count)]:
-        kmer_range.append((max_kmer - starting_kmer) * (2 - 2 / (x + 1)) + starting_kmer)
-    kmer_range = sorted(list(set([round_to_nearest_odd(x) for x in kmer_range])))
+    if kmer_count == 1:
+        kmer_range = [max_kmer]
+    elif kmer_count == 2:
+        kmer_range = [starting_kmer, max_kmer]
+    else:
+        # Create the k-mer range from a non-linear function that spaces out the early k-mers more
+        # and makes the later k-mers (which are most likely to be the good, used ones) closer
+        # together.
+        kmer_range = []
+        for x in [x / (kmer_count - 1) for x in range(kmer_count)]:
+            kmer_range.append((max_kmer - starting_kmer) * (2 - 2 / (x + 1)) + starting_kmer)
+        kmer_range = sorted(list(set([round_to_nearest_odd(x) for x in kmer_range])))
+
     kmer_range_str = ', '.join([str(x) for x in kmer_range])
 
     log.log('Median read length: ' + str(median_read_length))


=====================================
unicycler/unicycler.py
=====================================
@@ -493,6 +493,9 @@ def get_arguments():
     if args.threads <= 0:
         quit_with_error('--threads must be at least 1')
 
+    if args.kmer_count < 1:
+        quit_with_error('--kmer_count must be at least 1')
+
     if args.kmers is not None:
         args.kmers = args.kmers.split(',')
         try:


=====================================
unicycler/unicycler_polish.py
=====================================
@@ -106,6 +106,8 @@ def get_arguments():
     settings_group = parser.add_argument_group('Polishing settings',
                                                'Various settings for polishing behaviour '
                                                '(defaults should work well in most cases)')
+    settings_group.add_argument('--no_fix_local', action='store_true',
+                                help='do not fix local misassemblies')
     settings_group.add_argument('--min_insert', type=int,
                                 help='minimum valid short read insert size (default: auto)')
     settings_group.add_argument('--max_insert', type=int,
@@ -410,6 +412,11 @@ def pilon_small_changes(fasta, round_num, args, all_ale_scores):
 
 
 def pilon_large_changes(fasta, round_num, args, all_ale_scores):
+    if args.no_fix_local:
+        if args.verbosity > 0:
+            print('Not fixing local misassemblies', flush=True)
+        return fasta, round_num, []
+
     current, round_num, applied_variant = ale_assessed_changes(fasta, round_num, args, True, False,
                                                                all_ale_scores, 'local',
                                                                'Pilon polish, large variants, '


=====================================
unicycler/version.py
=====================================
@@ -13,4 +13,4 @@ details. You should have received a copy of the GNU General Public License along
 not, see <http://www.gnu.org/licenses/>.
 """
 
-__version__ = '0.4.6'
+__version__ = '0.4.7'



View it on GitLab: https://salsa.debian.org/med-team/unicycler/commit/6b867d2eeaddc1460170007da8c4972083a47323

-- 
View it on GitLab: https://salsa.debian.org/med-team/unicycler/commit/6b867d2eeaddc1460170007da8c4972083a47323
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20180906/ab3c1394/attachment-0001.html>