[med-svn] [Git][med-team/unicycler][upstream] New upstream version 0.4.7+dfsg
Andreas Tille
gitlab at salsa.debian.org
Thu Sep 6 10:32:12 BST 2018
Andreas Tille pushed to branch upstream at Debian Med / unicycler
Commits:
6b867d2e by Andreas Tille at 2018-09-06T06:32:06Z
New upstream version 0.4.7+dfsg
- - - - -
10 changed files:
- README.md
- docs/unicycler-polish.md
- test/test_misc.py
- unicycler/blast_func.py
- unicycler/misc.py
- unicycler/pilon_func.py
- unicycler/spades_func.py
- unicycler/unicycler.py
- unicycler/unicycler_polish.py
- unicycler/version.py
Changes:
=====================================
README.md
=====================================
@@ -53,6 +53,7 @@ And read about how we use it to complete bacterial genomes here:
* [Known contamination](#known-contamination)
* [Manual multiplicity](#manual-multiplicity)
* [Manual completion](#manual-completion)
+ * [Using an external long-read assembly](#using-an-external-long-read-assembly)
* [Other included tools](#other-included-tools)
* [Paper](#paper)
* [Acknowledgements](#acknowledgements)
@@ -367,40 +368,42 @@ usage: unicycler [-h] [--help_all] [--version] [-1 SHORT1] [-2 SHORT2] [-s UNPAI
Unicycler: an assembly pipeline for bacterial genomes
Help:
- -h, --help Show this help message and exit
- --help_all Show a help message with all program options
- --version Show Unicycler's version number
+ -h, --help Show this help message and exit
+ --help_all Show a help message with all program options
+ --version Show Unicycler's version number
Input:
- -1 SHORT1, --short1 SHORT1 FASTQ file of first short reads in each pair (required)
- -2 SHORT2, --short2 SHORT2 FASTQ file of second short reads in each pair (required)
- -s UNPAIRED, --unpaired UNPAIRED FASTQ file of unpaired short reads (optional)
- -l LONG, --long LONG FASTQ or FASTA file of long reads (optional)
+ -1 SHORT1, --short1 SHORT1 FASTQ file of first short reads in each pair (required)
+ -2 SHORT2, --short2 SHORT2 FASTQ file of second short reads in each pair (required)
+ -s UNPAIRED, --unpaired UNPAIRED
+ FASTQ file of unpaired short reads (optional)
+ -l LONG, --long LONG FASTQ or FASTA file of long reads (optional)
Output:
- -o OUT, --out OUT Output directory (required)
- --verbosity VERBOSITY Level of stdout and log file information (default: 1)
- 0 = no stdout, 1 = basic progress indicators, 2 = extra info,
- 3 = debugging info
- --min_fasta_length MIN_FASTA_LENGTH Exclude contigs from the FASTA file which are shorter than this length
- (default: 100)
- --keep KEEP Level of file retention (default: 1)
- 0 = only keep final files: assembly (FASTA, GFA and log),
- 1 = also save graphs at main checkpoints,
- 2 = also keep SAM (enables fast rerun in different mode),
- 3 = keep all temp files and save all graphs (for debugging)
- --vcf Produce a VCF by mapping the short reads to the final assembly
- (experimental, default: do not produce a vcf file)
+ -o OUT, --out OUT Output directory (required)
+ --verbosity VERBOSITY Level of stdout and log file information (default: 1)
+ 0 = no stdout, 1 = basic progress indicators, 2 = extra info,
+ 3 = debugging info
+ --min_fasta_length MIN_FASTA_LENGTH
+ Exclude contigs from the FASTA file which are shorter than this
+ length (default: 100)
+ --keep KEEP Level of file retention (default: 1)
+ 0 = only keep final files: assembly (FASTA, GFA and log),
+ 1 = also save graphs at main checkpoints,
+ 2 = also keep SAM (enables fast rerun in different mode),
+ 3 = keep all temp files and save all graphs (for debugging)
+ --vcf Produce a VCF by mapping the short reads to the final assembly
+ (experimental, default: do not produce a vcf file)
Other:
- -t THREADS, --threads THREADS Number of threads used (default: 8)
- --mode {conservative,normal,bold} Bridging mode (default: normal)
- conservative = smaller contigs, lowest misassembly rate
- normal = moderate contig size and misassembly rate
- bold = longest contigs, higher misassembly rate
- --linear_seqs LINEAR_SEQS The expected number of linear (i.e. non-circular) sequences in the
- underlying sequence (default: 0)
-
+ -t THREADS, --threads THREADS Number of threads used (default: 8)
+ --mode {conservative,normal,bold}
+ Bridging mode (default: normal)
+ conservative = smaller contigs, lowest misassembly rate
+ normal = moderate contig size and misassembly rate
+ bold = longest contigs, higher misassembly rate
+ --linear_seqs LINEAR_SEQS The expected number of linear (i.e. non-circular) sequences in
+ the underlying sequence (default: 0)
```
### Advanced options
@@ -409,77 +412,96 @@ Run `unicycler --help_all` to see a complete list of the program's options. Thes
```
SPAdes assembly:
- These options control the short-read SPAdes assembly at the beginning of the Unicycler pipeline.
-
- --spades_path SPADES_PATH Path to the SPAdes executable (default: spades.py)
- --no_correct Skip SPAdes error correction step (default: conduct SPAdes error
- correction)
- --min_kmer_frac MIN_KMER_FRAC Lowest k-mer size for SPAdes assembly, expressed as a fraction of the read
- length (default: 0.2)
- --max_kmer_frac MAX_KMER_FRAC Highest k-mer size for SPAdes assembly, expressed as a fraction of the read
- length (default: 0.95)
- --kmer_count KMER_COUNT Number of k-mer steps to use in SPAdes assembly (default: 10)
- --depth_filter DEPTH_FILTER Filter out contigs lower than this fraction of the chromosomal depth, if
- doing so does not result in graph dead ends (default: 0.25)
+ These options control the short-read SPAdes assembly at the beginning of the Unicycler
+ pipeline.
+
+ --spades_path SPADES_PATH Path to the SPAdes executable (default: spades.py)
+ --no_correct Skip SPAdes error correction step (default: conduct SPAdes error
+ correction)
+ --min_kmer_frac MIN_KMER_FRAC Lowest k-mer size for SPAdes assembly, expressed as a fraction of
+ the read length (default: 0.2)
+ --max_kmer_frac MAX_KMER_FRAC Highest k-mer size for SPAdes assembly, expressed as a fraction
+ of the read length (default: 0.95)
+ --kmers KMERS Exact k-mers to use for SPAdes assembly, comma-separated
+ (example: 22,33,44, default: automatic)
+ --kmer_count KMER_COUNT Number of k-mer steps to use in SPAdes assembly (default: 10)
+ --depth_filter DEPTH_FILTER Filter out contigs lower than this fraction of the chromosomal
+ depth, if doing so does not result in graph dead ends (default:
+ 0.25)
+ --spades_tmp_dir SPADES_TMP_DIR
+ Specify SPAdes temporary directory using the SPAdes --tmp-dir
+ option (default: make a temporary directory in the output
+ directory)
miniasm+Racon assembly:
These options control the use of miniasm and Racon to produce long-read bridges.
- --no_miniasm Skip miniasm+Racon bridging (default: use miniasm and Racon to produce
- long-read bridges)
- --racon_path RACON_PATH Path to the Racon executable (default: racon)
+ --no_miniasm Skip miniasm+Racon bridging (default: use miniasm and Racon to
+ produce long-read bridges)
+ --racon_path RACON_PATH Path to the Racon executable (default: racon)
+ --existing_long_read_assembly EXISTING_LONG_READ_ASSEMBLY
+ A pre-prepared long read assembly for the sample in GFA format.
+ If this option is used, Unicycler will skip the miniasm/Racon
+ steps and instead use the given assembly (default: perform long
+ read assembly using miniasm/Racon)
Assembly rotation:
- These options control the rotation of completed circular sequence near the end of the Unicycler pipeline.
-
- --no_rotate Do not rotate completed replicons to start at a standard gene (default:
- completed replicons are rotated)
- --start_genes START_GENES FASTA file of genes for start point of rotated replicons (default:
- start_genes.fasta)
- --start_gene_id START_GENE_ID The minimum required BLAST percent identity for a start gene search
- (default: 90.0)
- --start_gene_cov START_GENE_COV The minimum required BLAST percent coverage for a start gene search
- (default: 95.0)
+ These options control the rotation of completed circular sequence near the end of the
+ Unicycler pipeline.
+
+ --no_rotate Do not rotate completed replicons to start at a standard gene
+ (default: completed replicons are rotated)
+ --start_genes START_GENES FASTA file of genes for start point of rotated replicons
+ (default: start_genes.fasta)
+ --start_gene_id START_GENE_ID The minimum required BLAST percent identity for a start gene
+ search (default: 90.0)
+ --start_gene_cov START_GENE_COV
+ The minimum required BLAST percent coverage for a start gene
+ search (default: 95.0)
--makeblastdb_path MAKEBLASTDB_PATH
- Path to the makeblastdb executable (default: makeblastdb)
- --tblastn_path TBLASTN_PATH Path to the tblastn executable (default: tblastn)
+ Path to the makeblastdb executable (default: makeblastdb)
+ --tblastn_path TBLASTN_PATH Path to the tblastn executable (default: tblastn)
Pilon polishing:
- These options control the final assembly polish using Pilon at the end of the Unicycler pipeline.
+ These options control the final assembly polish using Pilon at the end of the Unicycler
+ pipeline.
- --no_pilon Do not use Pilon to polish the final assembly (default: Pilon is used)
- --bowtie2_path BOWTIE2_PATH Path to the bowtie2 executable (default: bowtie2)
+ --no_pilon Do not use Pilon to polish the final assembly (default: Pilon is
+ used)
+ --bowtie2_path BOWTIE2_PATH Path to the bowtie2 executable (default: bowtie2)
--bowtie2_build_path BOWTIE2_BUILD_PATH
- Path to the bowtie2_build executable (default: bowtie2-build)
- --samtools_path SAMTOOLS_PATH Path to the samtools executable (default: samtools)
- --pilon_path PILON_PATH Path to a Pilon executable or the Pilon Java archive file (default: pilon)
- --java_path JAVA_PATH Path to the java executable (default: java)
- --min_polish_size MIN_POLISH_SIZE Contigs shorter than this value (bp) will not be polished using Pilon
- (default: 10000)
+ Path to the bowtie2_build executable (default: bowtie2-build)
+ --samtools_path SAMTOOLS_PATH Path to the samtools executable (default: samtools)
+ --pilon_path PILON_PATH Path to a Pilon executable or the Pilon Java archive file
+ (default: pilon)
+ --java_path JAVA_PATH Path to the java executable (default: java)
+ --min_polish_size MIN_POLISH_SIZE
+ Contigs shorter than this value (bp) will not be polished using
+ Pilon (default: 10000)
VCF:
These options control the production of the VCF of the final assembly.
- --bcftools_path BCFTOOLS_PATH Path to the bcftools executable (default: bcftools)
+ --bcftools_path BCFTOOLS_PATH Path to the bcftools executable (default: bcftools)
Graph cleaning:
These options control the removal of small leftover sequences after bridging is complete.
--min_component_size MIN_COMPONENT_SIZE
- Graph components smaller than this size (bp) will be removed from the final
- graph (default: 1000)
+ Graph components smaller than this size (bp) will be removed from
+ the final graph (default: 1000)
--min_dead_end_size MIN_DEAD_END_SIZE
- Graph dead ends smaller than this size (bp) will be removed from the final
- graph (default: 1000)
+ Graph dead ends smaller than this size (bp) will be removed from
+ the final graph (default: 1000)
Long read alignment:
These options control the alignment of long reads to the assembly graph.
- --contamination CONTAMINATION FASTA file of known contamination in long reads
- --scores SCORES Comma-delimited string of alignment scores: match, mismatch, gap open, gap
- extend (default: 3,-6,-5,-2)
- --low_score LOW_SCORE Score threshold - alignments below this are considered poor (default: set
- threshold automatically)
+ --contamination CONTAMINATION FASTA file of known contamination in long reads
+ --scores SCORES Comma-delimited string of alignment scores: match, mismatch, gap
+ open, gap extend (default: 3,-6,-5,-2)
+ --low_score LOW_SCORE Score threshold - alignments below this are considered poor
+ (default: set threshold automatically)
```
@@ -620,6 +642,12 @@ If you believe this has happened in your assembly, you can manually assign multi
If Unicycler doesn't complete your bacterial genome assembly on its own, you may be able to complete it manually with a bit of bioinformatics detective work. There's no single, straight-forward procedure for doing so, but I've put together [a few examples on the Unicycler wiki](https://github.com/rrwick/Unicycler/wiki/Tips-for-finishing-genomes) which may be helpful.
+### Using an external long-read assembly
+
+If you have a long-read assembly that you've prepared outside Unicycler and trust (e.g. with Canu), you can give it to Unicycler with `--existing_long_read_assembly`. Unicycler will then skip its miniasm/Racon step and use this assembly instead.
+
+
+
# Other included tools
=====================================
docs/unicycler-polish.md
=====================================
@@ -85,6 +85,7 @@ Generic long reads:
Polishing settings:
Various settings for polishing behaviour (defaults should work well in most cases)
+ --no_fix_local do not fix local misassemblies (default: False)
--min_insert MIN_INSERT minimum valid short read insert size (default: auto)
--max_insert MAX_INSERT maximum valid short read insert size (default: auto)
--min_align_length MIN_ALIGN_LENGTH Minimum long read alignment length (default: 1000)
=====================================
test/test_misc.py
=====================================
@@ -372,6 +372,13 @@ class TestMiscFunctions(unittest.TestCase):
version = unicycler.misc.java_version_from_java_output(java_version_output)
self.assertEqual(version, '')
+ def test_java_version_parsing_6(self):
+ java_version_output = 'openjdk version "9-Ubuntu"\n' \
+ 'OpenJDK Runtime Environment (build 9-Ubuntu+0-9b181-4)\n' \
+ 'OpenJDK 64-Bit Server VM (build 9-Ubuntu+0-9b181-4, mixed mode)'
+ version = unicycler.misc.java_version_from_java_output(java_version_output)
+ self.assertEqual(version, '9-Ubuntu')
+
def test_spades_version_parsing_1(self):
spades_version_output = 'SPAdes v3.10.1'
version = unicycler.misc.spades_version_from_spades_output(spades_version_output)
=====================================
unicycler/blast_func.py
=====================================
@@ -36,6 +36,7 @@ def find_start_gene(sequence, start_genes_fasta, identity_threshold, coverage_th
# gene overlaps from the end to the start, we have to duplicate some of the replicon sequence
# for the BLAST database.
seq_len = len(sequence)
+ start_genes_fasta = os.path.abspath(start_genes_fasta)
queries = load_fasta(start_genes_fasta)
if not queries:
raise CannotFindStart
=====================================
unicycler/misc.py
=====================================
@@ -1070,19 +1070,29 @@ def java_path_and_version(java_path):
# Make sure Java is 1.7+
try:
- major_version = int(version.split('.')[0])
- if major_version < 1:
- status = 'too old'
- elif major_version > 1:
- status = 'good'
- else: # major_version == 1
- minor_version = int(version.split('.')[1])
- if minor_version < 7:
+ if '.' in version:
+ major_version = int(version.split('.')[0])
+ if major_version < 1:
+ status = 'too old'
+ elif major_version > 1:
+ status = 'good'
+ else: # major_version == 1
+ minor_version = int(version.split('.')[1])
+ if minor_version < 7:
+ status = 'too old'
+ else:
+ status = 'good'
+ elif '-' in version:
+ # E.g. '9-Ubuntu'
+ major_version = int(version.split('-')[0])
+ if major_version < 7:
status = 'too old'
else:
status = 'good'
+ else:
+ status = 'bad'
except (ValueError, IndexError):
- version, status = '?', 'too old'
+ status = 'bad'
return found_java_path, version, status
=====================================
unicycler/pilon_func.py
=====================================
@@ -336,7 +336,7 @@ def polish_with_pilon(graph, args, polish_dir, insert_size_1st, insert_size_99th
for f in list_of_files:
try:
os.remove(f)
- except FileNotFoundError:
+ except (FileNotFoundError, OSError):
pass
return total_count
=====================================
unicycler/spades_func.py
=====================================
@@ -73,7 +73,7 @@ def get_best_spades_graph(short1, short2, short_unpaired, out_dir, read_depth_fi
kmer_range = kmers
else:
kmer_range = get_kmer_range(short1, short2, short_unpaired, spades_dir, kmer_count,
- min_k_frac, max_k_frac)
+ min_k_frac, max_k_frac, spades_path)
assem_dir = os.path.join(spades_dir, 'assembly')
log.log_section_header('SPAdes assemblies')
@@ -427,8 +427,27 @@ def spades_assembly(read_files, out_dir, kmers, threads, spades_path, spades_tmp
return graph_files, insert_size_mean, insert_size_deviation
+def get_max_spades_kmer(spades_path):
+ """
+ SPAdes usually has a maximum k-mer size of 127, but this can be changed when compiling SPAdes,
+ so this function checks the help text to see what it is.
+ https://github.com/ablab/spades/issues/40
+ """
+ try:
+ process = subprocess.Popen([spades_path, '--help'], stdout=subprocess.PIPE,
+ stderr=subprocess.PIPE)
+ out, err = process.communicate()
+ all_output = out.decode() + err.decode()
+ all_output = all_output.replace('\n', ' ')
+ all_output = ' '.join(all_output.split())
+ max_kmer = all_output.split('must be odd and less than ')[1].split(')')[0]
+ return int(max_kmer) - 1
+ except (IndexError, ValueError):
+ return 127
+
+
def get_kmer_range(reads_1_filename, reads_2_filename, unpaired_reads_filename, spades_dir,
- kmer_count, min_kmer_frac, max_kmer_frac):
+ kmer_count, min_kmer_frac, max_kmer_frac, spades_path):
"""
Uses the read lengths to determine the k-mer range to be used in the SPAdes assembly.
"""
@@ -453,6 +472,11 @@ def get_kmer_range(reads_1_filename, reads_2_filename, unpaired_reads_filename,
except ValueError:
pass
+ max_spades_kmer = get_max_spades_kmer(spades_path)
+ log.log('SPAdes maximum k-mer: {}'.format(max_spades_kmer))
+ if max_spades_kmer != 127:
+ log.log(' (unusual value, probably indicates custom SPAdes compilation)')
+
# If the code got here, then the k-mer range doesn't already exist and we'll create one by
# examining the read lengths.
read_lengths = get_read_lengths(reads_1_filename) + get_read_lengths(reads_2_filename) + \
@@ -460,18 +484,25 @@ def get_kmer_range(reads_1_filename, reads_2_filename, unpaired_reads_filename,
read_lengths = sorted(read_lengths)
median_read_length = read_lengths[len(read_lengths) // 2 - 1]
max_kmer = round_to_nearest_odd(max_kmer_frac * median_read_length)
- if max_kmer > 127:
- max_kmer = 127
+ if max_kmer > max_spades_kmer:
+ max_kmer = max_spades_kmer
starting_kmer = round_to_nearest_odd(min_kmer_frac * max_kmer / max_kmer_frac)
if starting_kmer < 11:
starting_kmer = 11
- # Create the k-mer range from a non-linear function that spaces out the early k-mers more and
- # makes the later k-mers (which are most likely to be the good, used ones) closer together.
- kmer_range = []
- for x in [x / (kmer_count - 1) for x in range(kmer_count)]:
- kmer_range.append((max_kmer - starting_kmer) * (2 - 2 / (x + 1)) + starting_kmer)
- kmer_range = sorted(list(set([round_to_nearest_odd(x) for x in kmer_range])))
+ if kmer_count == 1:
+ kmer_range = [max_kmer]
+ elif kmer_count == 2:
+ kmer_range = [starting_kmer, max_kmer]
+ else:
+ # Create the k-mer range from a non-linear function that spaces out the early k-mers more
+ # and makes the later k-mers (which are most likely to be the good, used ones) closer
+ # together.
+ kmer_range = []
+ for x in [x / (kmer_count - 1) for x in range(kmer_count)]:
+ kmer_range.append((max_kmer - starting_kmer) * (2 - 2 / (x + 1)) + starting_kmer)
+ kmer_range = sorted(list(set([round_to_nearest_odd(x) for x in kmer_range])))
+
kmer_range_str = ', '.join([str(x) for x in kmer_range])
log.log('Median read length: ' + str(median_read_length))
=====================================
unicycler/unicycler.py
=====================================
@@ -493,6 +493,9 @@ def get_arguments():
if args.threads <= 0:
quit_with_error('--threads must be at least 1')
+ if args.kmer_count < 1:
+ quit_with_error('--kmer_count must be at least 1')
+
if args.kmers is not None:
args.kmers = args.kmers.split(',')
try:
=====================================
unicycler/unicycler_polish.py
=====================================
@@ -106,6 +106,8 @@ def get_arguments():
settings_group = parser.add_argument_group('Polishing settings',
'Various settings for polishing behaviour '
'(defaults should work well in most cases)')
+ settings_group.add_argument('--no_fix_local', action='store_true',
+ help='do not fix local misassemblies')
settings_group.add_argument('--min_insert', type=int,
help='minimum valid short read insert size (default: auto)')
settings_group.add_argument('--max_insert', type=int,
@@ -410,6 +412,11 @@ def pilon_small_changes(fasta, round_num, args, all_ale_scores):
def pilon_large_changes(fasta, round_num, args, all_ale_scores):
+ if args.no_fix_local:
+ if args.verbosity > 0:
+ print('Not fixing local misassemblies', flush=True)
+ return fasta, round_num, []
+
current, round_num, applied_variant = ale_assessed_changes(fasta, round_num, args, True, False,
all_ale_scores, 'local',
'Pilon polish, large variants, '
=====================================
unicycler/version.py
=====================================
@@ -13,4 +13,4 @@ details. You should have received a copy of the GNU General Public License along
not, see <http://www.gnu.org/licenses/>.
"""
-__version__ = '0.4.6'
+__version__ = '0.4.7'
View it on GitLab: https://salsa.debian.org/med-team/unicycler/commit/6b867d2eeaddc1460170007da8c4972083a47323
--
View it on GitLab: https://salsa.debian.org/med-team/unicycler/commit/6b867d2eeaddc1460170007da8c4972083a47323
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20180906/ab3c1394/attachment-0001.html>
More information about the debian-med-commit
mailing list