[med-svn] [Git][med-team/busco][upstream] New upstream version 4.1.2
Steffen Möller
gitlab at salsa.debian.org
Mon Jul 27 15:50:41 BST 2020
Steffen Möller pushed to branch upstream at Debian Med / busco
Commits:
796d64b1 by Steffen Moeller at 2020-07-27T16:47:05+02:00
New upstream version 4.1.2
- - - - -
22 changed files:
- CHANGELOG
- README.md
- bin/busco
- config/config.ini
- src/busco/Analysis.py
- src/busco/AutoLineage.py
- src/busco/BuscoAnalysis.py
- src/busco/BuscoConfig.py
- src/busco/BuscoDownloadManager.py
- src/busco/BuscoLogger.py
- src/busco/BuscoPlacer.py
- src/busco/BuscoRunner.py
- src/busco/BuscoTools.py
- src/busco/GeneSetAnalysis.py
- src/busco/GenomeAnalysis.py
- src/busco/Toolset.py
- src/busco/TranscriptomeAnalysis.py
- src/busco/ViralAnalysis.py (deleted)
- src/busco/_version.py
- src/busco/run_BUSCO.py
- test_data/bacteria/expected_log.txt
- test_data/eukaryota/expected_log.txt
Changes:
=====================================
CHANGELOG
=====================================
@@ -1,3 +1,17 @@
+4.1.2
+- Issue #295 fixed
+
+4.1.1
+- Issue #287 fixed
+
+4.1.0
+- Reintroduce restart mode (Issues #203, #229, #251)
+- Fix augustus hanging problem (Issues #224, #232, #266)
+- Allow multiple cores for BLAST 2.10.1+
+- Issue #271 fixed
+- Issue #247 fixed
+- Issue #234 fixed
+
4.0.6
- Fix Augustus GFF parsing bug. Remove constraint on Augustus version.
=====================================
README.md
=====================================
@@ -1,5 +1,7 @@
**BUSCOv4 - Benchmarking sets of Universal Single-Copy Orthologs.**
+For full documentation please consult the user guide: https://busco.ezlab.org/busco_userguide.html
+
Main changes in v4:
- Automated selection of lineages issued from https://www.orthodb.org/ release 10
@@ -14,30 +16,22 @@ To install, clone the repository and enter ``sudo python3 setup.py install`` or
More details in the user guide: https://busco.ezlab.org/busco_userguide.html#manual-installation
-Do not forget to edit the ``config/config.ini`` file to match your environment. The script `scripts/busco_configurator.py` can help you filling it. You also have to set the ``BUSCO_CONFIG_FILE``
-environment variable to define the path (including the filename) to that ``config.ini`` file. It can be located anywhere.
+Do not forget to edit the ``config/config.ini`` file to match your environment. The script `scripts/busco_configurator.py` can help with this.
+You can set the ``BUSCO_CONFIG_FILE`` environment variable to define the path (including the filename) to that ``config.ini`` file.
```
export BUSCO_CONFIG_FILE="/path/to/myconfig.ini"
```
+Alternatively you can pass the config file path as a command line argument using ``--config /path/to/config.ini``.
+
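[Editor's note, not part of the patch: for illustration, a full command using the ``--config`` option described above might look like the sketch below. The input file, output name and lineage are placeholders; ``-i``, ``-o``, ``-l`` and ``-m`` are the standard BUSCO arguments.]
```
busco -i genome.fna -o my_run -l bacteria_odb10 -m genome --config /path/to/myconfig.ini
```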
If you have trouble installing one of the many third-party tools, try the official Docker container: https://hub.docker.com/r/ezlabgva/busco/tags
Report problems on the BUSCO issue board at https://gitlab.com/ezlab/busco/issues
-To get help on BUSCO use: ``busco -h`` and ``python3 scripts/generate_plot.py -h``
-
-**!!!** Don't use "odb9" datasets with BUSCOv4. If you need to reproduce previous analyses, use BUSCOv3 (https://gitlab.com/ezlab/busco/-/tags/3.0.2)
+To get help with BUSCO use: ``busco -h`` and ``python3 scripts/generate_plot.py -h``
-Note: For v4.0.2 and before, when running auto-lineage, the initial results for eukaryotes were incomplete. This was
-deliberate, as these initial results are used merely to determine whether the genome scores highest against the
-bacteria, archaea or eukaryota datasets. If the eukaryota dataset was selected, BUSCO then attempts to place the input
-assembly on the eukaryote phylogenetic tree before running a complete BUSCO assessment using the selected child dataset.
-Unless the top-level eukaryota dataset was selected as the best match for the input file, the eukaryota dataset run
-would not complete. So while the specific dataset run returned accurate results, the generic eukaryota dataset run
-should be considered unreliable.
-This has been changed in v4.0.3. The eukaryota run now always completes so the final generic eukaryota results can be
-considered reliable.
+**!!!** Do not use "odb9" datasets with BUSCOv4. If you need to reproduce previous analyses, use BUSCOv3 (https://gitlab.com/ezlab/busco/-/tags/3.0.2)
**How to cite BUSCO**
=====================================
bin/busco
=====================================
@@ -8,14 +8,16 @@ except ImportError as err:
pattern_search = re.search("cannot import name '(?P<module_name>[\w]+)", err.msg)
missing_module = pattern_search.group("module_name")
if missing_module == "run_BUSCO":
- print("BUSCO must be installed before it is run. Please enter 'python setup.py install (--user)'. See the user guide for more information.")
+ print("BUSCO must be installed before it is run. Please enter 'python setup.py install (--user)'. "
+ "See the user guide for more information.")
elif missing_module == "Bio":
print("Please install BioPython (https://biopython.org/) before running BUSCO.")
elif missing_module == "numpy":
print("Please install NumPy before running BUSCO.")
else:
print("Unable to find module {}. Please make sure it is installed. See the user guide and the GitLab issue "
- "board (https://gitlab.com/ezlab/busco/issues) if you need further assistance.".format(missing_module))
+ "board (https://gitlab.com/ezlab/busco/issues) if you need further assistance."
+ "".format(missing_module))
except:
print(err.msg)
@@ -23,4 +25,4 @@ except ImportError as err:
"GitLab issue board (https://gitlab.com/ezlab/busco/issues) if you need further assistance.")
raise SystemExit(0)
-run_BUSCO.main()
\ No newline at end of file
+run_BUSCO.main()
=====================================
config/config.ini
=====================================
@@ -5,7 +5,7 @@
# Many of the options in the busco_run section can alternatively be set using command line arguments. See the help prompt (busco -h) for details.
# WARNING: passing a parameter through the command line overrides the value specified in this file.
#
-# You need to set the path to this file in the environment variable BUSCO_CONFIG_PATH
+# You need to set the path to this file in the environment variable BUSCO_CONFIG_FILE
# as follows:
# export BUSCO_CONFIG_FILE="/path/to/myconfig.ini"
#
@@ -32,6 +32,8 @@
;cpu = 16
# Force rewrite if files already exist (True/False)
;force = False
+# Restart a previous BUSCO run (True/False)
+;restart = False
# Blast e-value
;evalue = 1e-3
# How many candidate regions (contigs, scaffolds) to consider for each BUSCO
@@ -56,11 +58,11 @@
;update-data = True
[tblastn]
-path = /ncbi-blast-2.2.31+/bin/
+path = /ncbi-blast-2.10.1+/bin/
command = tblastn
[makeblastdb]
-path = /ncbi-blast-2.2.31+/bin/
+path = /ncbi-blast-2.10.1+/bin/
command = makeblastdb
[augustus]
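[Editor's note, not part of the patch: as a usage sketch for the new ``restart`` option added above, restart can be enabled either in the config file (``restart = True``) or with the restart command-line flag (``-r``/``--restart``), re-using the same output name so the existing run folder is picked up. Input file, output name and lineage below are placeholders.]
```
busco -i genome.fna -o my_run -l bacteria_odb10 -m genome -r
```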
=====================================
src/busco/Analysis.py
=====================================
@@ -1,11 +1,8 @@
from Bio import SeqIO
from busco.BuscoTools import TBLASTNRunner, MKBLASTRunner
-from busco.Toolset import Tool
from busco.BuscoLogger import BuscoLogger
-from busco.BuscoLogger import LogDecorator as log
-import subprocess
import os
-from abc import ABCMeta, abstractmethod
+from abc import ABCMeta
logger = BuscoLogger.get_logger(__name__)
@@ -17,26 +14,13 @@ class NucleotideAnalysis(metaclass=ABCMeta):
# explanation of ambiguous codes found here: https://www.dnabaser.com/articles/IUPAC%20ambiguity%20codes.html
AMBIGUOUS_CODES = ["Y", "R", "W", "S", "K", "M", "D", "V", "H", "B"]
- MAX_FLANK = 20000
-
- def __init__(self, config):
- # Variables inherited from BuscoAnalysis
- self._config = None
- self._cpus = None
- self._input_file = None
-
- super().__init__(config) # Initialize BuscoAnalysis
- self._long = self._config.getboolean("busco_run", "long")
- self._flank = self._define_flank()
- self._ev_cutoff = self._config.getfloat("busco_run", "evalue")
- self._region_limit = self._config.getint("busco_run", "limit")
- self.blast_cpus = self._cpus
+ def __init__(self):
+ super().__init__() # Initialize BuscoAnalysis
if not self.check_nucleotide_file(self._input_file):
raise SystemExit("Please provide a nucleotide file as input")
def check_nucleotide_file(self, filename):
-
i = 0
for record in SeqIO.parse(filename, "fasta"):
for letter in record.seq.upper():
@@ -51,66 +35,43 @@ class NucleotideAnalysis(metaclass=ABCMeta):
return True
- def _define_flank(self):
- """
- TODO: Add docstring
- :return:
- """
- try:
- size = os.path.getsize(self._input_file) / 1000 # size in mb
- flank = int(size / 50) # proportional flank size
- # Ensure value is between 5000 and MAX_FLANK
- flank = min(max(flank, 5000), type(self).MAX_FLANK)
- except IOError: # Input data is only validated during run_analysis. This will catch any IO issues before that.
- raise SystemExit("Impossible to read the fasta file {}".format(self._input_file))
-
- return flank
-
- @abstractmethod
- def init_tools(self): # todo: This should be an abstract method
- """
- Initialize all required tools for Genome Eukaryote Analysis:
- MKBlast, TBlastn, Augustus and Augustus scripts: GFF2GBSmallDNA, new_species, etraining
- :return:
- """
+ def init_tools(self):
super().init_tools()
+ self.mkblast_runner = MKBLASTRunner()
+ self.tblastn_runner = TBLASTNRunner()
-
- def check_tool_dependencies(self):
- super().check_tool_dependencies()
-
- def _get_blast_version(self):
- mkblastdb_version_call = subprocess.check_output([self._mkblast_tool.cmd, "-version"], shell=False)
- mkblastdb_version = ".".join(mkblastdb_version_call.decode("utf-8").split("\n")[0].split()[1].rsplit(".")[:-1])
-
- tblastn_version_call = subprocess.check_output([self._tblastn_tool.cmd, "-version"], shell=False)
- tblastn_version = ".".join(tblastn_version_call.decode("utf-8").split("\n")[0].split()[1].rsplit(".")[:-1])
-
- if mkblastdb_version != tblastn_version:
- logger.warning("You are using version {} of mkblastdb and version {} of tblastn.".format(mkblastdb_version, tblastn_version))
-
- return tblastn_version
+ if self.mkblast_runner.version != self.tblastn_runner.version:
+ logger.warning("You are using version {} of makeblastdb and version {} of tblastn.".format(
+ self.mkblast_runner.version, self.tblastn_runner.version))
def _run_mkblast(self):
- self.mkblast_runner = MKBLASTRunner(self._mkblast_tool, self._input_file, self.main_out, self._cpus)
- self.mkblast_runner.run()
+ if self.restart and self.mkblast_runner.check_previous_completed_run():
+ logger.info("Skipping makeblastdb as BLAST DB already exists at {}".format(self.mkblast_runner.output_db))
+ else:
+ self.restart = False # Turn off restart mode if this is the entry point
+ self.config.set("busco_run", "restart", str(self.restart))
+ self.mkblast_runner.run()
+ if len(os.listdir(os.path.split(self.mkblast_runner.output_db)[0])) == 0:
+ raise SystemExit("makeblastdb failed to create a BLAST DB at {}".format(self.mkblast_runner.output_db))
def _run_tblastn(self, missing_and_frag_only=False, ancestral_variants=False):
incomplete_buscos = (self.hmmer_runner.missing_buscos + list(self.hmmer_runner.fragmented_buscos.keys())
if missing_and_frag_only else None) # This parameter is only used on the re-run
- self.tblastn_runner = TBLASTNRunner(self._tblastn_tool, self._input_file, self.run_folder, self._lineage_dataset,
- self.mkblast_runner.output_db, self._ev_cutoff, self.blast_cpus,
- self._region_limit, self._flank, missing_and_frag_only, ancestral_variants,
- incomplete_buscos)
-
- self.tblastn_runner.run()
- coords = self.tblastn_runner._get_coordinates()
- coords = self.tblastn_runner._filter_best_matches(coords) # Todo: remove underscores from non-hidden methods
- self.tblastn_runner._write_coordinates_to_file(coords) # writes to "coordinates.tsv"
- self.tblastn_runner._write_contigs(coords)
- return coords
+ self.tblastn_runner.configure_runner(self.mkblast_runner.output_db, missing_and_frag_only,
+ ancestral_variants, incomplete_buscos)
+ if self.restart and self.tblastn_runner.check_previous_completed_run():
+ logger.info("Skipping tblastn as results already exist at {}".format(self.tblastn_runner.blast_filename))
+ else:
+ self.restart = False
+ self.config.set("busco_run", "restart", str(self.restart))
+ self.tblastn_runner.run()
+ self.tblastn_runner.get_coordinates()
+ self.tblastn_runner.filter_best_matches()
+ self.tblastn_runner.write_coordinates_to_file() # writes to "coordinates.tsv"
+ self.tblastn_runner.write_contigs()
+ return
class ProteinAnalysis:
@@ -118,8 +79,8 @@ class ProteinAnalysis:
LETTERS = ["F", "L", "I", "M", "V", "S", "P", "T", "A", "Y", "X", "H", "Q", "N", "K", "D", "E", "C", "W", "R", "G"]
NUCL_LETTERS = ["A", "C", "T", "G", "N"]
- def __init__(self, config):
- super().__init__(config)
+ def __init__(self):
+ super().__init__()
if not self.check_protein_file(self._input_file):
raise SystemExit('Please provide a protein file as input')
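[Editor's note, not part of the patch: the restart handling introduced above follows a common skip-or-run pattern. Below is a minimal, hypothetical sketch of that pattern; names such as `run_step_with_restart` and `runner` are illustrative, not BUSCO API. Each step first checks for a previously completed run and is skipped if one is found; otherwise restart mode is switched off in the config so every downstream step executes normally.]
```python
def run_step_with_restart(runner, config, restart, logger):
    """Sketch of the skip-or-run pattern used by the runners above."""
    if restart and runner.check_previous_completed_run():
        # Previous results exist: reuse them instead of recomputing.
        logger.info("Skipping step, results already exist")
    else:
        # Once one step has to be redone, later steps cannot rely on old results.
        restart = False
        config.set("busco_run", "restart", str(restart))
        runner.run()
    return restart
```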
=====================================
src/busco/AutoLineage.py
=====================================
@@ -71,7 +71,7 @@ class AutoSelectLineage:
root_runners = self.run_lineages_list(self.all_lineages)
self.get_best_match_lineage(root_runners)
self.config.set("busco_run", "domain_run_name", os.path.basename(self.best_match_lineage_dataset))
- BuscoRunner.final_results.append(self.selected_runner.analysis.hmmer_results_lines)
+ BuscoRunner.final_results.append(self.selected_runner.analysis.hmmer_runner.hmmer_results_lines)
BuscoRunner.results_datasets.append(os.path.basename(self.best_match_lineage_dataset))
return
@@ -80,9 +80,6 @@ class AutoSelectLineage:
for l in lineages_list:
self.current_lineage = "{}_{}".format(l, self.dataset_version)
autoconfig = BuscoConfigAuto(self.config, self.current_lineage)
- # The following line creates a direct reference, so whenever one analysis run adds a tool to this list it
- # is automatically updated here too.
- autoconfig.persistent_tools = self.config.persistent_tools
busco_run = BuscoRunner(autoconfig)
busco_run.run_analysis(callback=self.callback)
root_runners.append(busco_run) # Save all root runs so they can be recalled if chosen
@@ -136,7 +133,7 @@ class AutoSelectLineage:
def cleanup_disused_runs(self, disused_runners):
for runner in disused_runners:
- runner.analysis._cleanup()
+ runner.analysis.cleanup()
def get_lineage_dataset(self): # todo: rethink structure after BuscoPlacer is finalized and protein mode with mollicutes is fixed.
@@ -159,14 +156,16 @@ class AutoSelectLineage:
else:
logger.info("Mollicutes dataset is a better match for your data. Testing subclades...")
self._run_3_datasets(self.selected_runner)
- BuscoRunner.final_results.append(self.selected_runner.analysis.hmmer_results_lines)
+ BuscoRunner.final_results.append(self.selected_runner.analysis.hmmer_runner.hmmer_results_lines)
BuscoRunner.results_datasets.append(os.path.basename(self.best_match_lineage_dataset))
- elif ("geno" in self.selected_runner.mode and self.selected_runner.analysis.code_4_selected and
- os.path.basename(self.selected_runner.config.get("busco_run", "lineage_dataset")).startswith("bacteria")):
+ elif ("geno" in self.selected_runner.mode
+ and self.selected_runner.analysis.prodigal_runner.current_gc == "4"
+ and os.path.basename(
+ self.selected_runner.config.get("busco_run", "lineage_dataset")).startswith("bacteria")):
logger.info("The results from the Prodigal gene predictor indicate that your data belongs to the "
"mollicutes clade. Testing subclades...")
self._run_3_datasets()
- BuscoRunner.final_results.append(self.selected_runner.analysis.hmmer_results_lines)
+ BuscoRunner.final_results.append(self.selected_runner.analysis.hmmer_runner.hmmer_results_lines)
BuscoRunner.results_datasets.append(os.path.basename(self.best_match_lineage_dataset))
else:
self.run_busco_placer()
@@ -181,12 +180,12 @@ class AutoSelectLineage:
self.f_percents = []
runners = self.run_lineages_list(["mollicutes"])
runners.append(self.selected_runner)
- self.s_buscos.append(self.selected_runner.analysis.single_copy)
- self.d_buscos.append(self.selected_runner.analysis.multi_copy)
- self.f_buscos.append(self.selected_runner.analysis.only_fragments)
- self.s_percents.append(self.selected_runner.analysis.s_percent)
- self.d_percents.append(self.selected_runner.analysis.d_percent)
- self.f_percents.append(self.selected_runner.analysis.f_percent)
+ self.s_buscos.append(self.selected_runner.analysis.hmmer_runner.single_copy)
+ self.d_buscos.append(self.selected_runner.analysis.hmmer_runner.multi_copy)
+ self.f_buscos.append(self.selected_runner.analysis.hmmer_runner.only_fragments)
+ self.s_percents.append(self.selected_runner.analysis.hmmer_runner.s_percent)
+ self.d_percents.append(self.selected_runner.analysis.hmmer_runner.d_percent)
+ self.f_percents.append(self.selected_runner.analysis.hmmer_runner.f_percent)
self.get_best_match_lineage(runners)
return
@@ -215,12 +214,12 @@ class AutoSelectLineage:
def _run_3_datasets(self, mollicutes_runner=None):
if mollicutes_runner:
datasets = ["mycoplasmatales", "entomoplasmatales"]
- self.s_buscos = [mollicutes_runner.analysis.single_copy]
- self.d_buscos = [mollicutes_runner.analysis.multi_copy]
- self.f_buscos = [mollicutes_runner.analysis.only_fragments]
- self.s_percents = [mollicutes_runner.analysis.s_percent]
- self.d_percents = [mollicutes_runner.analysis.d_percent]
- self.f_percents = [mollicutes_runner.analysis.f_percent]
+ self.s_buscos = [mollicutes_runner.analysis.hmmer_runner.single_copy]
+ self.d_buscos = [mollicutes_runner.analysis.hmmer_runner.multi_copy]
+ self.f_buscos = [mollicutes_runner.analysis.hmmer_runner.only_fragments]
+ self.s_percents = [mollicutes_runner.analysis.hmmer_runner.s_percent]
+ self.d_percents = [mollicutes_runner.analysis.hmmer_runner.d_percent]
+ self.f_percents = [mollicutes_runner.analysis.hmmer_runner.f_percent]
dataset_runners = [mollicutes_runner]
else:
datasets = ["mollicutes", "mycoplasmatales", "entomoplasmatales"]
=====================================
src/busco/BuscoAnalysis.py
=====================================
@@ -4,24 +4,18 @@
.. module:: BuscoAnalysis
:synopsis: BuscoAnalysis implements general BUSCO analysis specifics
.. versionadded:: 3.0.0
-.. versionchanged:: 3.0.1
+.. versionchanged:: 4.0.7
Copyright (c) 2016-2020, Evgeny Zdobnov (ez at ezlab.org)
Licensed under the MIT license. See LICENSE.md file.
-
"""
from abc import ABCMeta, abstractmethod
-import busco
-from busco.BuscoConfig import BuscoConfig, BuscoConfigMain, BuscoConfigAuto
+from busco.BuscoConfig import BuscoConfig, BuscoConfigAuto
from busco.BuscoTools import HMMERRunner
-import inspect
import os
from busco.BuscoLogger import BuscoLogger
from busco.BuscoLogger import LogDecorator as log
-from busco.Toolset import Tool
-import subprocess
-from Bio import SeqIO
logger = BuscoLogger.get_logger(__name__)
@@ -31,115 +25,93 @@ class BuscoAnalysis(metaclass=ABCMeta):
This abstract base class (ABC) defines methods required for most of BUSCO analyses and has to be extended
by each specific analysis class
"""
+ config = None
-
- def __init__(self, config):
+ def __init__(self):
"""
1) load parameters
2) load and validate tools
3) check data and dataset integrity
4) Ready for analysis
-
- :param config: Values of all parameters to be used during the analysis
- :type config: BuscoConfig
"""
- self._config = config
+ super().__init__()
# Get paths
- self.lineage_results_dir = self._config.get("busco_run", "lineage_results_dir")
- self.main_out = self._config.get("busco_run", "main_out") # todo: decide which are hidden attributes
- self.working_dir = (os.path.join(self.main_out, "auto_lineage")
- if isinstance(self._config, BuscoConfigAuto)
- else self.main_out)
- self.run_folder = os.path.join(self.working_dir, self.lineage_results_dir)
- self.log_folder = os.path.join(self.main_out, "logs")
- self._input_file = self._config.get("busco_run", "in")
- self._lineage_dataset = self._config.get("busco_run", "lineage_dataset")
- self._lineage_name = os.path.basename(self._lineage_dataset)
- self._datasets_version = self._config.get("busco_run", "datasets_version")
- super().__init__()
+ self._lineage_results_dir = self.config.get("busco_run", "lineage_results_dir")
+ self.main_out = self.config.get("busco_run", "main_out")
+ self._working_dir = (os.path.join(self.main_out, "auto_lineage")
+ if isinstance(self.config, BuscoConfigAuto)
+ else self.main_out)
+ self._run_folder = os.path.join(self._working_dir, self._lineage_results_dir)
+ self._log_folder = os.path.join(self.main_out, "logs")
# Get other useful variables
- self._cpus = self._config.getint("busco_run", "cpu")
- self._domain = self._config.get("busco_run", "domain")
+ self._input_file = self.config.get("busco_run", "in")
+ self._lineage_dataset = self.config.get("busco_run", "lineage_dataset")
+ self._lineage_name = os.path.basename(self._lineage_dataset)
+ self._domain = self.config.get("busco_run", "domain")
self._has_variants_file = os.path.exists(os.path.join(self._lineage_dataset, "ancestral_variants"))
- self._dataset_creation_date = self._config.get("busco_run", "creation_date")
- self._dataset_nb_species = self._config.get("busco_run", "number_of_species")
- self._dataset_nb_buscos = self._config.get("busco_run", "number_of_BUSCOs")
+ self._dataset_creation_date = self.config.get("busco_run", "creation_date")
+ self.restart = self.config.getboolean("busco_run", "restart")
+
+ self.gene_details = None # Dictionary containing coordinate information for predicted genes.
- # Get Busco downloader
- self.downloader = self._config.downloader
+ self._lineages_download_path = os.path.join(self.config.get("busco_run", "download_path"), "lineages")
+
+ self.hmmer_runner = None
# Create optimized command line call for the given input
- self.busco_type = "main" if isinstance(self._config, BuscoConfigMain) else "auto"
+ # self.busco_type = "main" if isinstance(self._config, BuscoConfigMain) else "auto"
# if self.busco_type == "main":
# self.set_rerun_busco_command(self._config.clargs) # todo: rework rerun command
- # Variables storing BUSCO results
- self._missing_busco_list = []
- self._fragmented_busco_list = []
- self._gene_details = None # Dictionary containing coordinate information for predicted genes.
- self.s_percent = None
- self.d_percent = None
- self.f_percent = None
- self.all_single_copy_buscos = {}
- self._log_count = 0 # Dummy variable used to skip logging for intermediate eukaryote pipeline results.
-
- # TODO: catch unicode encoding exception and report invalid character line instead of doing content validation
- # todo: check config file exists before parsing
-
@abstractmethod
- def _cleanup(self):
+ def cleanup(self):
# Delete any non-decompressed files in busco_downloads
try:
- for dataset_name in os.listdir(os.path.join(self._config.get("busco_run", "download_path"), "lineages")):
+ for dataset_name in os.listdir(self._lineages_download_path):
if dataset_name.endswith((".gz", ".tar")):
os.remove(dataset_name)
except OSError:
pass
- def _check_data_integrity(self):
- self._check_dataset_integrity()
- if not os.stat(self._input_file).st_size > 0:
- raise SystemExit("Input file is empty.")
- with open(self._input_file) as f:
- for line in f:
- if line.startswith(">"):
- self._check_fasta_header(line)
- return
-
- def get_checkpoint(self): # TODO: rework checkpoint system
- """
- This function return the checkpoint if the checkpoint.tmp file exits or None if absent
- :return: the checkpoint name
- :rtype: int
- """
- checkpt_name = None
- checkpoint_file = os.path.join(self.run_folder, "checkpoint.tmp")
- if os.path.exists(checkpoint_file):
- with open(checkpoint_file, "r") as check_file:
- line = check_file.readline()
- self._random = line.split(".")[-1] # Reset random suffix
- checkpt_name = int(line.split(".")[0])
- return checkpt_name
-
@abstractmethod
- @log("Running BUSCO using lineage dataset {0} ({1}, {2})", logger, attr_name=["_lineage_name", "_domain", "_dataset_creation_date"], on_func_exit=True)
+ @log("Running BUSCO using lineage dataset {0} ({1}, {2})", logger,
+ attr_name=["_lineage_name", "_domain", "_dataset_creation_date"], on_func_exit=True)
def run_analysis(self):
"""
Abstract method, override to call all needed steps for running the child analysis.
"""
- self.create_dirs()
+ self._create_dirs()
self.init_tools()
- self.check_tool_dependencies()
self._check_data_integrity()
+ @log("***** Run HMMER on gene sequences *****", logger)
+ def run_hmmer(self, input_sequences):
+ """
+ This function runs hmmsearch.
+ """
+ files = sorted(os.listdir(os.path.join(self._lineage_dataset, "hmms")))
+ busco_ids = [os.path.splitext(f)[0] for f in files] # Each Busco ID has a HMM file of the form "<busco_id>.hmm"
+ self.hmmer_runner.configure_runner(input_sequences, busco_ids, self._mode, self.gene_details)
+ if self.restart and self.hmmer_runner.check_previous_completed_run():
+ logger.info("Skipping HMMER run as output already processed")
+ else:
+ self.restart = False
+ self.config.set("busco_run", "restart", str(self.restart))
+ self.hmmer_runner.run()
+ self.hmmer_runner.process_output()
+ self.hmmer_runner.write_hmmer_results()
+ self.hmmer_runner.produce_hmmer_summary()
+ return
+
@log("Checking dataset for HMM profiles", logger, debug=True)
def _check_dataset_integrity(self):
"""
Check the input dataset for hmm profiles, both files and folder are available
- Note: score and length cutoffs are checked when read,
- See _load_scores and _load_lengths
+ Note: score and length cutoffs are checked when read by hmmer_runner: see _load_scores and _load_lengths
+ Note: dataset.cfg file is not mandatory for offline mode
+ # todo: implement a check for dataset.cfg file if not using offline mode
:raises SystemExit: if the dataset is missing files or folders
"""
@@ -157,18 +129,24 @@ class BuscoAnalysis(metaclass=ABCMeta):
raise SystemExit("The dataset you provided lacks elements in {}".format(
os.path.join(self._lineage_dataset, "prfl")))
- # note: score and length cutoffs are checked when read,
- # see _load_scores and _load_lengths
- # ancestral would cause blast to fail, and be detected, see _blast() # TODO: clarify comment
- # dataset.cfg is not mandatory
-
if not self._has_variants_file:
logger.warning("The dataset you provided does not contain the file ancestral_variants, likely because it "
"is an old version. All blast steps will use the file \"ancestral\" instead")
return
- def _check_fasta_header(self, header):
+ def _check_data_integrity(self):
+ self._check_dataset_integrity()
+ if not os.stat(self._input_file).st_size > 0:
+ raise SystemExit("Input file is empty.")
+ with open(self._input_file) as f:
+ for line in f:
+ if line.startswith(">"):
+ self._check_fasta_header(line)
+ return
+
+ @staticmethod
+ def _check_fasta_header(header):
"""
This function checks problematic characters in fasta headers,
and warns the user and stops the execution
@@ -183,7 +161,6 @@ class BuscoAnalysis(metaclass=ABCMeta):
"which will crash BUSCO. Please clean the header of your "
"input file." % (char, header.strip()))
-
for char in BuscoConfig.FORBIDDEN_HEADER_CHARS_BEFORE_SPLIT:
if char in header.split()[0]:
raise SystemExit(
@@ -197,19 +174,7 @@ class BuscoAnalysis(metaclass=ABCMeta):
"\">\" which will crash Reader. Please clean the header of "
"your input file." % (header.strip()))
- def check_tool_dependencies(self):
- """
- check dependencies on tools
- :raises SystemExit: if a Tool version is not supported
- """
- # check hmm version
- if not self._get_hmmer_version() >= BuscoConfig.HMMER_VERSION:
- raise SystemExit(
- "HMMer version detected is not supported, please use HMMer v.{} +".format(BuscoConfig.HMMER_VERSION))
- return
-
- @abstractmethod
- def create_dirs(self):
+ def _create_dirs(self):
"""
Create the run (main) directory, log directory and the temporary directories
:return:
@@ -223,8 +188,8 @@ class BuscoAnalysis(metaclass=ABCMeta):
Create a subfolder of the main output folder that contains all log files from BUSCO and the external tools used.
:return:
"""
- if not os.path.exists(self.log_folder):
- os.mkdir(self.log_folder)
+ if not os.path.exists(self._log_folder):
+ os.mkdir(self._log_folder)
return
def _create_main_dir(self):
@@ -233,76 +198,23 @@ class BuscoAnalysis(metaclass=ABCMeta):
:raises SystemExit: if write permissions are not available to the specified location
"""
try:
- os.makedirs(self.run_folder)
+ os.makedirs(self._run_folder)
except FileExistsError:
- raise SystemExit("Something went wrong. BUSCO stopped before overwriting run folder {}".format(self.run_folder))
+ if not self.restart:
+ raise SystemExit("Something went wrong. BUSCO stopped before overwriting run folder "
+ "{}".format(self._run_folder))
except PermissionError:
raise SystemExit(
"Cannot write to the output directory, please make sure "
- "you have write permissions to {}".format(self.run_folder))
+ "you have write permissions to {}".format(self._run_folder))
return
- # @log("Temp directory is {}", logger, attr_name="_tmp", on_func_exit=True)
- # def _create_tmp_dir(self):
- # """
- # This function creates the tmp directory
- # :raises
- # SystemExit: if the user cannot write in the tmp directory
- # """
- # try:
- # if not os.path.exists(self._tmp):
- # os.makedirs(self._tmp)
- #
- # except OSError:
- # raise SystemExit(
- # "Cannot write to the temp directory, please make sure "
- # "you have write permissions to {}".format(self._tmp))
- # return
-
-
-
- def _get_busco_percentages(self):
- self.single_copy = len(self.hmmer_runner.single_copy_buscos) # int
- self.multi_copy = len(self.hmmer_runner.multi_copy_buscos) # int
- self.only_fragments = len(self.hmmer_runner.fragmented_buscos) # int
- self.total_buscos = len(self.hmmer_runner.cutoff_dict)
-
- # Get percentage of each kind of BUSCO match
- self.s_percent = abs(round((self.single_copy / self.total_buscos) * 100, 1))
- self.d_percent = abs(round((self.multi_copy / self.total_buscos) * 100, 1))
- self.f_percent = abs(round((self.only_fragments / self.total_buscos) * 100, 1))
-
- return self.single_copy, self.multi_copy, self.only_fragments, self.total_buscos
-
- def _get_hmmer_version(self):
- """
- check the Tool has the correct version
- :raises SystemExit: if the version is not correct
- """
- hmmer_version = subprocess.check_output([self._hmmer_tool.cmd, "-h"], shell=False)
- hmmer_version = hmmer_version.decode("utf-8")
- try:
- hmmer_version = hmmer_version.split("\n")[1].split()[2]
- hmmer_version = float(hmmer_version[:3])
- except ValueError:
- # to avoid a crash with a super old version
- hmmer_version = hmmer_version.split("\n")[1].split()[1]
- hmmer_version = float(hmmer_version[:3])
- finally:
- return hmmer_version
-
@log("Check all required tools are accessible...", logger, debug=True)
def init_tools(self):
"""
Init the tools needed for the analysis. HMMER is needed for all BUSCO analysis types.
"""
- try:
- assert(isinstance(self._hmmer_tool, Tool))
- except AttributeError:
- self._hmmer_tool = Tool("hmmsearch", self._config)
- except AssertionError:
- raise SystemExit("HMMer should be a tool")
-
+ self.hmmer_runner = HMMERRunner()
return
@property
@@ -310,242 +222,59 @@ class BuscoAnalysis(metaclass=ABCMeta):
def _mode(self):
pass
- # @log("This is not an incomplete run that can be restarted", logger, iswarn=True)
- # # Todo: decide if mini functions are necessary to facilitate decorator logging
- # def _not_incomplete_run(self):
- # self._restart = False
-
- def _produce_hmmer_summary(self):
- single_copy, multi_copy, only_fragments, total_buscos = self._get_busco_percentages()
-
- self.hmmer_results_lines = []
- self.hmmer_results_lines.append("***** Results: *****\n\n")
- self.one_line_summary = "C:{}%[S:{}%,D:{}%],F:{}%,M:{}%,n:{}\t{}\n".format(
- round(self.s_percent + self.d_percent, 1), self.s_percent, self.d_percent,
- self.f_percent, abs(round(100 - self.s_percent - self.d_percent - self.f_percent, 1)), total_buscos, " ")
- self.hmmer_results_lines.append(self.one_line_summary)
- self.hmmer_results_lines.append("{}\tComplete BUSCOs (C)\t\t\t{}\n".format(single_copy + multi_copy, " "))
- self.hmmer_results_lines.append("{}\tComplete and single-copy BUSCOs (S)\t{}\n".format(single_copy, " "))
- self.hmmer_results_lines.append("{}\tComplete and duplicated BUSCOs (D)\t{}\n".format(multi_copy, " "))
- self.hmmer_results_lines.append("{}\tFragmented BUSCOs (F)\t\t\t{}\n".format(only_fragments, " "))
- self.hmmer_results_lines.append("{}\tMissing BUSCOs (M)\t\t\t{}\n".format(
- total_buscos - single_copy - multi_copy - only_fragments, " "))
- self.hmmer_results_lines.append("{}\tTotal BUSCO groups searched\t\t{}\n".format(total_buscos, " "))
-
- with open(os.path.join(self.run_folder, "short_summary.txt"), "w") as summary_file:
-
- self._write_output_header(summary_file, no_table_header=True)
- summary_file.write("# Summarized benchmarking in BUSCO notation for file {}\n"
- "# BUSCO was run in mode: {}\n\n".format(self._input_file, self._mode))
-
- for line in self.hmmer_results_lines:
- summary_file.write("\t{}".format(line))
-
- if self._config.getboolean("busco_run", "auto-lineage") and isinstance(self._config, BuscoConfigMain) \
- and hasattr(self._config, "placement_files"):
- summary_file.write("\nPlacement file versions:\n")
- for placement_file in self._config.placement_files:
- summary_file.write("{}\n".format(placement_file))
-
-
- if isinstance(self._config, BuscoConfigAuto): # todo: rework this if/else block
- self._one_line_hmmer_summary()
- elif self._domain == "eukaryota" and self._log_count == 0:
- self._log_count += 1
- self._produce_full_hmmer_summary_debug()
- else:
- self._one_line_hmmer_summary()
- return
-
- @log("{}", logger, attr_name="hmmer_results_lines", apply="join", on_func_exit=True)
- def _produce_full_hmmer_summary(self):
- return
-
- @log("{}", logger, attr_name="hmmer_results_lines", apply="join", on_func_exit=True, debug=True)
- def _produce_full_hmmer_summary_debug(self):
- return
-
- @log("{}", logger, attr_name="one_line_summary", on_func_exit=True)
- def _one_line_hmmer_summary(self):
- self.one_line_summary = "Results:\t{}".format(self.one_line_summary)
- return
-
- @log("***** Run HMMER on gene sequences *****", logger)
- def run_hmmer(self, input_sequences):
- """
- This function runs hmmsearch.
- """
- self._hmmer_tool.total = 0
- self._hmmer_tool.nb_done = 0
- hmmer_output_dir = os.path.join(self.run_folder, "hmmer_output")
- if not os.path.exists(hmmer_output_dir):
- os.makedirs(hmmer_output_dir)
-
- files = sorted(os.listdir(os.path.join(self._lineage_dataset, "hmms")))
- busco_ids = [os.path.splitext(f)[0] for f in files] # Each Busco ID has a HMM file of the form "<busco_id>.hmm"
-
- self.hmmer_runner = HMMERRunner(self._hmmer_tool, input_sequences, busco_ids, hmmer_output_dir,
- self._lineage_dataset, self._mode, self._cpus, self._gene_details, self._datasets_version)
- self.hmmer_runner.load_buscos()
- self.hmmer_runner.run()
- self.hmmer_runner.process_output()
- # self.all_single_copy_buscos.update(self.hmmer_runner.single_copy_buscos)
- self._write_hmmer_results()
- self._produce_hmmer_summary()
- return
-
- def _write_buscos_to_file(self, sequences_aa, sequences_nt=None):
- """
- Write BUSCO matching sequences to output fasta files. Each sequence is printed in a separate file and both
- nucleotide and amino acid versions are created.
- :param busco_type: one of ["single_copy", "multi_copy", "fragmented"]
- :return:
- """
- for busco_type in ["single_copy", "multi_copy", "fragmented"]:
- if busco_type == "single_copy":
- output_dir = os.path.join(self.run_folder, "busco_sequences", "single_copy_busco_sequences")
- busco_matches = self.hmmer_runner.single_copy_buscos
- elif busco_type == "multi_copy":
- output_dir = os.path.join(self.run_folder, "busco_sequences", "multi_copy_busco_sequences")
- busco_matches = self.hmmer_runner.multi_copy_buscos
- elif busco_type == "fragmented":
- output_dir = os.path.join(self.run_folder, "busco_sequences", "fragmented_busco_sequences")
- busco_matches = self.hmmer_runner.fragmented_buscos
-
- if not os.path.exists(output_dir): # todo: move all create_dir commands to one place
- os.makedirs(output_dir)
-
- for busco, gene_matches in busco_matches.items():
- try:
- aa_seqs, nt_seqs = zip(*[(sequences_aa[gene_id], sequences_nt[gene_id]) for gene_id in gene_matches])
- with open(os.path.join(output_dir, "{}.fna".format(busco)), "w") as f2:
- SeqIO.write(nt_seqs, f2, "fasta")
- except TypeError:
- aa_seqs = [sequences_aa[gene_id] for gene_id in gene_matches]
- with open(os.path.join(output_dir, "{}.faa".format(busco)), "w") as f1:
- SeqIO.write(aa_seqs, f1, "fasta")
-
- return
-
# def _run_tarzip_hmmer_output(self): # todo: rewrite using tarfile
# """
# This function tarzips "hmmer_output" results folder
# """
# self._p_open(["tar", "-C", "%s" % self.run_folder, "-zcf", "%shmmer_output.tar.gz" % self.run_folder,
# "hmmer_output", "--remove-files"], "bash", shell=False)
+ #
+ # @log("To reproduce this run: {}", logger, attr_name="_rerun_cmd", on_func_exit=True)
+ # def set_rerun_busco_command(self, clargs): # todo: reconfigure
+ # """
+ # This function sets the command line to call to reproduce this run
+ # """
+ #
+ # # Find python script path
+ # entry_point = ""
+ # frame_ind = -1
+ # while "run_BUSCO.py" not in entry_point:
+ # entry_point = inspect.stack()[frame_ind].filename
+ # frame_ind -= 1
+ #
+ # # Add required parameters and other options
+ # self._rerun_cmd = "python %s -i %s -o %s -l %s -m %s -c %s" % (entry_point, self._input_file, os.path.basename(self.main_out),
+ # self._lineage_dataset, self._mode, self._cpus)
+ #
+ # try:
+ # if self._long:
+ # self._rerun_cmd += " --long"
+ # if self._region_limit != BuscoConfig.DEFAULT_ARGS_VALUES["limit"]:
+ # self._rerun_cmd += " --limit %s" % self._region_limit
+ # # if self._tmp != BuscoConfig.DEFAULT_ARGS_VALUES["tmp_path"]:
+ # # self._rerun_cmd += " -t %s" % self._tmp
+ # if self._ev_cutoff != BuscoConfig.DEFAULT_ARGS_VALUES["evalue"]:
+ # self._rerun_cmd += " -e %s" % self._ev_cutoff
+ # # if self._tarzip:
+ # # self._rerun_cmd += " -z"
+ # except AttributeError:
+ # pass
+ #
+ # # Include any command line arguments issued by the user
+ # # arg_aliases = {"-i": "--in", "-o": "--out", "-l": "--lineage_dataset", "-m": "--mode", "-c": "--cpu",
+ # # "-e": "--evalue", "-f": "--force", "-sp": "--species", "-z": "--tarzip",
+ # # "-r": "--restart", "-q": "--quiet", "-v": "--version", "-h": "--help"}
+ # arg_aliases.update(dict(zip(arg_aliases.values(), arg_aliases.keys())))
+ # for a, arg in enumerate(clargs):
+ # if arg.startswith("-") and not arg in self._rerun_cmd:
+ # if arg in arg_aliases:
+ # if arg_aliases[arg] in self._rerun_cmd:
+ # continue
+ # if a + 1 < len(clargs) and not clargs[a + 1].startswith("-"):
+ # self._rerun_cmd += " %s %s" % (arg, clargs[a + 1])
+ # else:
+ # self._rerun_cmd += " %s" % arg
+ # return
-
-
- @log("To reproduce this run: {}", logger, attr_name="_rerun_cmd", on_func_exit=True)
- def set_rerun_busco_command(self, clargs): # todo: reconfigure
- """
- This function sets the command line to call to reproduce this run
- """
-
- # Find python script path
- entry_point = ""
- frame_ind = -1
- while "run_BUSCO.py" not in entry_point:
- entry_point = inspect.stack()[frame_ind].filename
- frame_ind -= 1
-
- # Add required parameters and other options
- self._rerun_cmd = "python %s -i %s -o %s -l %s -m %s -c %s" % (entry_point, self._input_file, os.path.basename(self.main_out),
- self._lineage_dataset, self._mode, self._cpus)
-
- try:
- if self._long:
- self._rerun_cmd += " --long"
- if self._region_limit != BuscoConfig.DEFAULT_ARGS_VALUES["limit"]:
- self._rerun_cmd += " --limit %s" % self._region_limit
- # if self._tmp != BuscoConfig.DEFAULT_ARGS_VALUES["tmp_path"]:
- # self._rerun_cmd += " -t %s" % self._tmp
- if self._ev_cutoff != BuscoConfig.DEFAULT_ARGS_VALUES["evalue"]:
- self._rerun_cmd += " -e %s" % self._ev_cutoff
- # if self._tarzip:
- # self._rerun_cmd += " -z"
- except AttributeError:
- pass
-
- # Include any command line arguments issued by the user
- # arg_aliases = {"-i": "--in", "-o": "--out", "-l": "--lineage_dataset", "-m": "--mode", "-c": "--cpu",
- # "-e": "--evalue", "-f": "--force", "-sp": "--species", "-z": "--tarzip",
- # "-r": "--restart", "-q": "--quiet", "-v": "--version", "-h": "--help"}
- arg_aliases.update(dict(zip(arg_aliases.values(), arg_aliases.keys())))
- for a, arg in enumerate(clargs):
- if arg.startswith("-") and not arg in self._rerun_cmd:
- if arg in arg_aliases:
- if arg_aliases[arg] in self._rerun_cmd:
- continue
- if a + 1 < len(clargs) and not clargs[a + 1].startswith("-"):
- self._rerun_cmd += " %s %s" % (arg, clargs[a + 1])
- else:
- self._rerun_cmd += " %s" % arg
- return
-
- def _write_hmmer_results(self):
- """
- Create two output files: one with information on all BUSCOs for the given dataset and the other with a list of
- all BUSCOs that were not found.
- :return:
- """
-
- with open(os.path.join(self.run_folder, "full_table.tsv"), "w") as f_out:
-
- output_lines = self.hmmer_runner._create_output_content()
- self._write_output_header(f_out)
-
- with open(os.path.join(self.run_folder, "missing_busco_list.tsv"), "w") as miss_out:
-
- self._write_output_header(miss_out, missing_list=True)
-
- missing_buscos_lines, missing_buscos = self.hmmer_runner._list_missing_buscos()
- output_lines += missing_buscos_lines
-
- for missing_busco in sorted(missing_buscos):
- miss_out.write("{}\n".format(missing_busco))
-
- sorted_output_lines = self._sort_lines(output_lines)
- for busco in sorted_output_lines:
- f_out.write(busco)
- return
-
- @staticmethod
- def _sort_lines(lines):
- sorted_lines = sorted(lines, key=lambda x: int(x.split("\t")[0].split("at")[0]))
- return sorted_lines
-
-
-
-
- def _write_output_header(self, file_object, missing_list=False, no_table_header=False):
- """
- Write a standardized file header containing information on the BUSCO run.
- :param file_object: Opened file object ready for writing
- :type file_object: file
- :return:
- """
- file_object.write("# BUSCO version is: {} \n"
- "# The lineage dataset is: {} (Creation date: {}, number of species: {}, number of BUSCOs: {}"
- ")\n".format(busco.__version__, self._lineage_name, self._dataset_creation_date,
- self._dataset_nb_species, self._dataset_nb_buscos))
- # if isinstance(self._config, BuscoConfigMain): # todo: wait until rerun command properly implemented again
- # file_object.write("# To reproduce this run: {}\n#\n".format(self._rerun_cmd))
-
- if no_table_header:
- pass
- elif missing_list:
- file_object.write("# Busco id\n")
- elif self._mode == "proteins" or self._mode == "transcriptome":
- if self.hmmer_runner.extra_columns:
- file_object.write("# Busco id\tStatus\tSequence\tScore\tLength\tOrthoDB url\tDescription\n")
- else:
- file_object.write("# Busco id\tStatus\tSequence\tScore\tLength\n")
- elif self._mode == "genome":
- if self.hmmer_runner.extra_columns:
- file_object.write("# Busco id\tStatus\tSequence\tGene Start\tGene End\tScore\tLength\tOrthoDB url\tDescription\n")
- else:
- file_object.write("# Busco id\tStatus\tSequence\tGene Start\tGene End\tScore\tLength\n")
-
- return
-
+ # TODO: catch unicode encoding exception and report invalid character line instead of doing content validation
+ # todo: check config file exists before parsing
=====================================
src/busco/BuscoConfig.py
=====================================
@@ -15,7 +15,7 @@ logger = BuscoLogger.get_logger(__name__)
class BaseConfig(ConfigParser):
- DEFAULT_ARGS_VALUES = {"out_path": os.getcwd(), "cpu": 1, "force": False, "evalue": 1e-3,
+ DEFAULT_ARGS_VALUES = {"out_path": os.getcwd(), "cpu": 1, "force": False, "restart": False, "evalue": 1e-3,
"limit": 3, "long": False, "quiet": False,
"download_path": os.path.join(os.getcwd(), "busco_downloads"), "datasets_version": "odb10",
"offline": False, "download_base_url": "https://busco-data.ezlab.org/v4/data/",
@@ -49,6 +49,10 @@ class BaseConfig(ConfigParser):
self.downloader = BuscoDownloadManager(self)
return
+ # @log("Setting value in config")
+ # def set(self, *args, **kwargs):
+ # super().set(*args, **kwargs)
+
class PseudoConfig(BaseConfig):
@@ -200,9 +204,10 @@ class BuscoConfigMain(BuscoConfig, BaseConfig):
MANDATORY_USER_PROVIDED_PARAMS = ["in", "out", "mode"]
CONFIG_STRUCTURE = {"busco_run": ["in", "out", "out_path", "mode", "auto-lineage", "auto-lineage-prok",
- "auto-lineage-euk", "cpu", "force", "download_path", "datasets_version", "evalue",
- "limit", "long", "quiet", "offline", "download_base_url", "lineage_dataset",
- "update-data", "augustus_parameters", "augustus_species", "main_out"],
+ "auto-lineage-euk", "cpu", "force", "restart", "download_path",
+ "datasets_version", "evalue", "limit", "long", "quiet", "offline",
+ "download_base_url", "lineage_dataset", "update-data", "augustus_parameters",
+ "augustus_species", "main_out"],
"tblastn": ["path", "command"],
"makeblastdb": ["path", "command"],
"prodigal": ["path", "command"],
@@ -241,7 +246,6 @@ class BuscoConfigMain(BuscoConfig, BaseConfig):
self._check_required_input_exists()
self._init_downloader()
- self.persistent_tools = []
self.log_config()
@@ -259,7 +263,7 @@ class BuscoConfigMain(BuscoConfig, BaseConfig):
lineage_dataset = self.get("busco_run", "lineage_dataset")
datasets_version = self.get("busco_run", "datasets_version")
if "_odb" in lineage_dataset:
- dataset_version = lineage_dataset.rsplit("_")[-1]
+ dataset_version = lineage_dataset.rsplit("_")[-1].rstrip("/")
if datasets_version != dataset_version:
logger.warning("There is a conflict in your config. You specified a dataset from {0} while "
"simultaneously requesting the datasets_version parameter be {1}. Proceeding with "
@@ -368,11 +372,15 @@ class BuscoConfigMain(BuscoConfig, BaseConfig):
if os.path.exists(self.main_out):
if self.getboolean("busco_run", "force"):
self._force_remove_existing_output_dir(self.main_out)
+ elif self.getboolean("busco_run", "restart"):
+ logger.info("Attempting to restart the run using the following directory: {}".format(self.main_out))
else:
raise SystemExit("A run with the name {} already exists...\n"
"\tIf you are sure you wish to overwrite existing files, "
"please use the -f (force) option".format(self.main_out))
-
+ elif self.getboolean("busco_run", "restart"):
+ logger.warning("Restart mode not available as directory {} does not exist.".format(self.main_out))
+ self.set("busco_run", "restart", "False")
return
=====================================
src/busco/BuscoDownloadManager.py
=====================================
@@ -50,7 +50,8 @@ class BuscoDownloadManager:
def _create_main_download_dir(self):
if not os.path.exists(self.local_download_path):
- os.makedirs(self.local_download_path)
+ # exist_ok=True to allow for multiple parallel BUSCO runs each trying to create this folder simultaneously
+ os.makedirs(self.local_download_path, exist_ok=True)
def _load_versions(self):
try:
@@ -83,7 +84,8 @@ class BuscoDownloadManager:
# if the category folder does not exist, create it
category_folder = os.path.join(self.local_download_path, category)
if not os.path.exists(category_folder):
- os.mkdir(category_folder)
+ # exist_ok=True to allow for multiple parallel BUSCO runs, each trying to create this folder
+ os.makedirs(category_folder, exist_ok=True)
return
@staticmethod
@@ -100,7 +102,10 @@ class BuscoDownloadManager:
return dataset_date
def _check_existing_version(self, local_filepath, category, data_basename):
- latest_update = type(self).version_files[data_basename][0]
+ try:
+ latest_update = type(self).version_files[data_basename][0]
+ except KeyError:
+ raise SystemExit("{} is not a valid option for '{}'".format(data_basename, category))
path_basename, extension = os.path.splitext(data_basename)
if category == "lineages":
@@ -136,19 +141,22 @@ class BuscoDownloadManager:
if os.path.exists(local_dataset):
return local_dataset
else:
- raise SystemExit("Unable to run BUSCO in offline mode. Dataset {} does not exist.".format(local_dataset))
+ raise SystemExit("Unable to run BUSCO in offline mode. Dataset {} does not "
+ "exist.".format(local_dataset))
else:
basename, extension = os.path.splitext(data_name)
placement_files = sorted(glob.glob(os.path.join(
self.local_download_path, category, "{}.*{}".format(basename, extension))))
if len(placement_files) > 0:
- return placement_files[-1] # todo: for offline mode, log which files are being used (in case of more than one glob match)
+ return placement_files[-1]
+ # todo: for offline mode, log which files are being used (in case of more than one glob match)
else:
- raise SystemExit("Unable to run BUSCO placer in offline mode. Cannot find necessary placement files in {}".format(self.local_download_path))
+ raise SystemExit("Unable to run BUSCO placer in offline mode. Cannot find necessary placement "
+ "files in {}".format(self.local_download_path))
data_basename = os.path.basename(data_name)
local_filepath = os.path.join(self.local_download_path, category, data_basename)
- present, up_to_date, latest_version, local_filepath, hash = self._check_existing_version(local_filepath, category,
- data_basename)
+ present, up_to_date, latest_version, local_filepath, hash = self._check_existing_version(
+ local_filepath, category, data_basename)
if (not up_to_date and self.update_data) or not present:
# download
@@ -169,7 +177,8 @@ class BuscoDownloadManager:
return local_filepath
- def _rename_old_version(self, local_filepath):
+ @staticmethod
+ def _rename_old_version(local_filepath):
if os.path.exists(local_filepath):
try:
os.rename(local_filepath, "{}.old".format(local_filepath))
@@ -179,7 +188,7 @@ class BuscoDownloadManager:
timestamp = time.time()
os.rename(local_filepath, "{}.old.{}".format(local_filepath, timestamp))
logger.info("Renaming {} into {}.old.{}".format(local_filepath, local_filepath, timestamp))
- except OSError as e:
+ except OSError:
pass
return
@@ -189,7 +198,8 @@ class BuscoDownloadManager:
urllib.request.urlretrieve(remote_filepath, local_filepath)
observed_hash = type(self)._md5(local_filepath)
if observed_hash != expected_hash:
- logger.error("md5 hash is incorrect: {} while {} expected".format(str(observed_hash), str(expected_hash)))
+ logger.error("md5 hash is incorrect: {} while {} expected".format(str(observed_hash),
+ str(expected_hash)))
logger.info("deleting corrupted file {}".format(local_filepath))
os.remove(local_filepath)
raise SystemExit("BUSCO was unable to download or update all necessary files")
@@ -208,7 +218,6 @@ class BuscoDownloadManager:
hash_md5.update(chunk)
return hash_md5.hexdigest()
-
@log("Decompressing file {}", logger, func_arg=1)
def _decompress_file(self, local_filepath):
unzipped_filename = local_filepath.replace(".gz", "")
=====================================
src/busco/BuscoLogger.py
=====================================
@@ -83,11 +83,12 @@ class LogDecorator:
try:
string_arg = getattr(obj_inst, self.attr_name)
- # string_arg = attr
if self.apply == 'join' and isinstance(string_arg, list):
+ string_arg = [str(arg) for arg in string_arg] # Ensure all parameters are joinable strings
string_arg = ' '.join(string_arg)
elif self.apply == "basename" and isinstance(string_arg, str):
string_arg = os.path.basename(string_arg)
+
log_msg = self.msg.format(string_arg)
except TypeError: # if there are multiple attributes specified
string_args = (getattr(obj_inst, attr) for attr in self.attr_name)
@@ -247,10 +248,14 @@ class BuscoLogger(logging.getLoggerClass()):
self._err_hdlr.setFormatter(self._normal_formatter)
self.addHandler(self._err_hdlr)
- if not os.access(os.getcwd(), os.W_OK):
- raise SystemExit("No permission to write in the current directory.")
- # Random id used in filename to avoid complications for parallel BUSCO runs.
- self._file_hdlr = logging.FileHandler("busco_{}.log".format(type(self).random_id), mode="a")
+ try:
+ # Random id used in filename to avoid complications for parallel BUSCO runs.
+ self._file_hdlr = logging.FileHandler("busco_{}.log".format(type(self).random_id), mode="a")
+ except IOError as e:
+ errStr = "No permission to write in the current directory: {}".format(os.getcwd()) if e.errno == 13 \
+ else "IO error({0}): {1}".format(e.errno, e.strerror)
+ raise SystemExit(errStr)
+
self._file_hdlr.setLevel(logging.DEBUG)
self._file_hdlr.setFormatter(self._verbose_formatter)
self.addHandler(self._file_hdlr)
=====================================
src/busco/BuscoPlacer.py
=====================================
@@ -35,9 +35,13 @@ class BuscoPlacer:
self._params = config
self.mode = self._config.get("busco_run", "mode")
self.cpus = self._config.get("busco_run", "cpu")
+ self.restart = self._config.getboolean("busco_run", "restart")
self.run_folder = run_folder
self.placement_folder = os.path.join(run_folder, "placement_files")
- os.mkdir(self.placement_folder)
+ if self.restart:
+ os.makedirs(self.placement_folder, exist_ok=True)
+ else:
+ os.mkdir(self.placement_folder)
self.downloader = self._config.downloader
self.datasets_version = self._config.get("busco_run", "datasets_version")
self.protein_seqs = protein_seqs
@@ -92,12 +96,14 @@ class BuscoPlacer:
return dataset, placement_file_versions
def _init_tools(self):
- try:
- assert isinstance(self._sepp, Tool)
- except AttributeError:
- self._sepp = Tool("sepp", self._config)
- except AssertionError:
- raise SystemExit("SEPP should be a tool")
+ setattr(SEPPRunner, "config", self._config)
+ self.sepp_runner = SEPPRunner()
+ # try:
+ # assert isinstance(self._sepp, Tool)
+ # except AttributeError:
+ # self._sepp = Tool("sepp", self._config)
+ # except AssertionError:
+ # raise SystemExit("SEPP should be a tool")
def _pick_dataset(self):
@@ -273,10 +279,16 @@ class BuscoPlacer:
@log("Place the markers on the reference tree...", logger)
def _run_sepp(self):
- self.sepp_runner = SEPPRunner(self._sepp, self.run_folder, self.placement_folder, self.tree_nwk_file,
- self.tree_metadata_file, self.supermatrix_file, self.downloader,
- self.datasets_version, self.cpus)
- self.sepp_runner.run()
+ # self.sepp_runner = SEPPRunner(self._sepp, self.run_folder, self.placement_folder, self.tree_nwk_file,
+ # self.tree_metadata_file, self.supermatrix_file, self.downloader,
+ # self.datasets_version, self.cpus)
+ self.sepp_runner.configure_runner(self.tree_nwk_file, self.tree_metadata_file, self.supermatrix_file, self.downloader)
+ if self.restart and self.sepp_runner.check_previous_completed_run():
+ logger.info("Skipping SEPP run as it has already been completed")
+ else:
+ self.restart = False
+ self._config.set("busco_run", "restart", str(self.restart))
+ self.sepp_runner.run()
def _extract_marker_sequences(self):
"""
=====================================
src/busco/BuscoRunner.py
=====================================
@@ -1,16 +1,18 @@
+from busco.BuscoAnalysis import BuscoAnalysis
from busco.GenomeAnalysis import GenomeAnalysisEukaryotes
from busco.TranscriptomeAnalysis import TranscriptomeAnalysis
from busco.GeneSetAnalysis import GeneSetAnalysis
from busco.GenomeAnalysis import GenomeAnalysisProkaryotes
from busco.BuscoLogger import BuscoLogger
from busco.BuscoConfig import BuscoConfigMain
-from busco.BuscoTools import NoGenesError
+from busco.BuscoTools import NoGenesError, BaseRunner
from configparser import NoOptionError
import os
import shutil
logger = BuscoLogger.get_logger(__name__)
+
class BuscoRunner:
mode_dict = {"euk_genome": GenomeAnalysisEukaryotes, "prok_genome": GenomeAnalysisProkaryotes,
@@ -23,6 +25,8 @@ class BuscoRunner:
def __init__(self, config):
self.config = config
+ setattr(BaseRunner, "config", config)
+ setattr(BuscoAnalysis, "config", config)
self.mode = self.config.get("busco_run", "mode")
self.domain = self.config.get("busco_run", "domain")
@@ -33,25 +37,24 @@ class BuscoRunner:
elif self.domain == "eukaryota":
self.mode = "euk_genome"
analysis_type = type(self).mode_dict[self.mode]
- self.analysis = analysis_type(self.config)
+ self.analysis = analysis_type()
self.prok_fail_count = 0 # Needed to check if both bacteria and archaea return no genes.
def run_analysis(self, callback=(lambda *args: None)):
try:
self.analysis.run_analysis()
- s_buscos = self.analysis.single_copy
- d_buscos = self.analysis.multi_copy
- f_buscos = self.analysis.only_fragments
- s_percent = self.analysis.s_percent
- d_percent = self.analysis.d_percent
- f_percent = self.analysis.f_percent
- if isinstance(self.config, BuscoConfigMain):
- self.analysis._cleanup()
+ s_buscos = self.analysis.hmmer_runner.single_copy
+ d_buscos = self.analysis.hmmer_runner.multi_copy
+ f_buscos = self.analysis.hmmer_runner.only_fragments
+ s_percent = self.analysis.hmmer_runner.s_percent
+ d_percent = self.analysis.hmmer_runner.d_percent
+ f_percent = self.analysis.hmmer_runner.f_percent
+ self.analysis.cleanup()
except NoGenesError as nge:
no_genes_msg = "{0} did not recognize any genes matching the dataset {1} in the input file. " \
- "If this is unexpected, check your input file and your installation of {0}\n".format(
- nge.gene_predictor, self.analysis._lineage_name)
+ "If this is unexpected, check your input file and your " \
+ "installation of {0}\n".format(nge.gene_predictor, self.analysis._lineage_name)
fatal = (isinstance(self.config, BuscoConfigMain)
or (self.config.getboolean("busco_run", "auto-lineage-euk") and self.mode == "euk_genome")
or (self.config.getboolean("busco_run", "auto-lineage-prok") and self.mode == "prok_genome")
@@ -62,11 +65,10 @@ class BuscoRunner:
logger.warning(no_genes_msg)
s_buscos = d_buscos = f_buscos = s_percent = d_percent = f_percent = 0.0
if self.mode == "prok_genome":
- self.config.persistent_tools.append(self.analysis.prodigal_runner)
self.prok_fail_count += 1
except SystemExit as se:
- self.analysis._cleanup()
+ self.analysis.cleanup()
raise se
return callback(s_buscos, d_buscos, f_buscos, s_percent, d_percent, f_percent)
@@ -96,14 +98,16 @@ class BuscoRunner:
missing_in_parasitic_buscos = [entry.strip() for entry in parasitic_file.readlines()]
if len(self.analysis.hmmer_runner.missing_buscos) >= 0.8*len(missing_in_parasitic_buscos) \
and len(missing_in_parasitic_buscos) > 0:
- intersection = [mb for mb in self.analysis.hmmer_runner.missing_buscos if mb in missing_in_parasitic_buscos]
- percent_missing_in_parasites = round(100*len(intersection)/len(self.analysis.hmmer_runner.missing_buscos), 1)
+ intersection = [mb for mb in self.analysis.hmmer_runner.missing_buscos
+ if mb in missing_in_parasitic_buscos]
+ percent_missing_in_parasites = round(
+ 100*len(intersection)/len(self.analysis.hmmer_runner.missing_buscos), 1)
if percent_missing_in_parasites >= 80.0:
corrected_summary = self._recalculate_parasitic_scores(len(missing_in_parasitic_buscos))
positive_parasitic_line = "\n!!! The missing BUSCOs match the pattern of a parasitic-reduced " \
"genome. {}% of your missing BUSCOs are typically missing in these. " \
"A corrected score would be: \n{}\n".format(percent_missing_in_parasites,
- corrected_summary)
+ corrected_summary)
final_output_results.append(positive_parasitic_line)
if not self.config.getboolean("busco_run", "auto-lineage"):
auto_lineage_line = "\nConsider using the auto-lineage mode to select a more specific lineage."
@@ -115,28 +119,29 @@ class BuscoRunner:
return final_output_results
def _recalculate_parasitic_scores(self, num_missing_in_parasitic):
- total_buscos = self.analysis.total_buscos - num_missing_in_parasitic
- single_copy = self.analysis.single_copy
- multi_copy = self.analysis.multi_copy
- fragmented_copy = self.analysis.only_fragments
+ total_buscos = self.analysis.hmmer_runner.total_buscos - num_missing_in_parasitic
+ single_copy = self.analysis.hmmer_runner.single_copy
+ multi_copy = self.analysis.hmmer_runner.multi_copy
+ fragmented_copy = self.analysis.hmmer_runner.only_fragments
s_percent = abs(round(100*single_copy/total_buscos, 1))
d_percent = abs(round(100*multi_copy/total_buscos, 1))
f_percent = abs(round(100*fragmented_copy/total_buscos, 1))
one_line_summary = "C:{}%[S:{}%,D:{}%],F:{}%,M:{}%,n:{}\t\n".format(
- round(s_percent + d_percent, 1), s_percent, d_percent, f_percent, round(100-s_percent-d_percent-f_percent, 1), total_buscos)
+ round(s_percent + d_percent, 1), s_percent, d_percent, f_percent,
+ round(100-s_percent-d_percent-f_percent, 1), total_buscos)
return one_line_summary
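
As a worked example of the arithmetic in `_recalculate_parasitic_scores` (the numbers below are invented, not taken from any dataset): with 255 BUSCOs in the set, 150 typically missing in parasites, and 80 single-copy, 5 duplicated and 10 fragmented hits, the corrected percentages are computed against the reduced total of 105:

```
total_buscos = 255 - 150        # full set minus BUSCOs typically missing in parasites
single_copy, multi_copy, fragmented = 80, 5, 10

s = round(100 * single_copy / total_buscos, 1)   # 76.2
d = round(100 * multi_copy / total_buscos, 1)    # 4.8
f = round(100 * fragmented / total_buscos, 1)    # 9.5

print("C:{}%[S:{}%,D:{}%],F:{}%,M:{}%,n:{}".format(
    round(s + d, 1), s, d, f, round(100 - s - d - f, 1), total_buscos))
# -> C:81.0%[S:76.2%,D:4.8%],F:9.5%,M:9.5%,n:105
```
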
-
-
def organize_final_output(self):
main_out_folder = self.config.get("busco_run", "main_out")
try:
domain_results_folder = self.config.get("busco_run", "domain_run_name")
- root_domain_output_folder = os.path.join(main_out_folder, "auto_lineage", "run_{}".format(domain_results_folder))
+ root_domain_output_folder = os.path.join(main_out_folder, "auto_lineage",
+ "run_{}".format(domain_results_folder))
root_domain_output_folder_final = os.path.join(main_out_folder, "run_{}".format(domain_results_folder))
os.rename(root_domain_output_folder, root_domain_output_folder_final)
+ os.symlink(root_domain_output_folder_final, root_domain_output_folder)
shutil.copyfile(os.path.join(root_domain_output_folder_final, "short_summary.txt"),
os.path.join(main_out_folder, "short_summary.generic.{}.{}.txt".format(
domain_results_folder.replace("run_", ""), os.path.basename(main_out_folder))))
@@ -155,8 +160,10 @@ class BuscoRunner:
lineage_results_folder.replace("run_", ""), os.path.basename(main_out_folder))))
return
- @staticmethod # This is deliberately a staticmethod so it can be called from run_BUSCO() even if BuscoRunner has not yet been initialized.
+ @staticmethod
def move_log_file(config):
+ # This is deliberately a staticmethod so it can be called from run_BUSCO() even if BuscoRunner has not yet
+ # been initialized.
try:
log_folder = os.path.join(config.get("busco_run", "main_out"), "logs")
if not os.path.exists(log_folder):
@@ -166,11 +173,7 @@ class BuscoRunner:
logger.warning("Unable to move 'busco_{}.log' to the 'logs' folder.".format(BuscoLogger.random_id))
return
-
- def finish(self, elapsed_time, root_lineage=False):
- # if root_lineage:
- # logger.info("Generic lineage selected. Results reproduced here.\n"
- # "{}".format(" ".join(self.analysis.hmmer_results_lines)))
+ def finish(self, elapsed_time):
final_output_results = self.format_results()
logger.info("".join(final_output_results))
@@ -265,5 +268,3 @@ class SmartBox:
box_lines.append("\t{}".format(line))
box_lines.append("\t{}".format(self.add_horizontal()))
return "\n".join(box_lines)
-
-
=====================================
src/busco/BuscoTools.py
=====================================
The diff for this file was not included because it is too large.
=====================================
src/busco/GeneSetAnalysis.py
=====================================
@@ -24,17 +24,17 @@ class GeneSetAnalysis(ProteinAnalysis, BuscoAnalysis):
"""
_mode = 'proteins'
- def __init__(self, config):
+ def __init__(self):
"""
Initialize an instance.
:param params: Values of all parameters that have to be defined
:type params: PipeConfig
"""
- super().__init__(config)
+ super().__init__()
self.sequences_aa = {record.id: record for record in list(SeqIO.parse(self._input_file, "fasta"))}
- def _cleanup(self):
- super()._cleanup()
+ def cleanup(self):
+ super().cleanup()
def run_analysis(self):
"""
@@ -42,11 +42,7 @@ class GeneSetAnalysis(ProteinAnalysis, BuscoAnalysis):
"""
super().run_analysis()
self.run_hmmer(self._input_file)
- self._write_buscos_to_file(self.sequences_aa)
- self._cleanup()
+ self.hmmer_runner.write_buscos_to_file(self.sequences_aa)
# if self._tarzip:
# self._run_tarzip_hmmer_output()
return
-
- def create_dirs(self):
- super().create_dirs()
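
The protein-mode analysis above keeps the input sequences in a dict keyed by record id (`self.sequences_aa`). A standalone equivalent, assuming Biopython is installed and using a hypothetical `proteins.faa` input file:

```
from Bio import SeqIO

# Hypothetical input file; any protein FASTA works.
sequences_aa = {record.id: record for record in SeqIO.parse("proteins.faa", "fasta")}
print(len(sequences_aa), "sequences loaded")
```
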
=====================================
src/busco/GenomeAnalysis.py
=====================================
@@ -12,19 +12,16 @@ Licensed under the MIT license. See LICENSE.md file.
"""
from busco.BuscoAnalysis import BuscoAnalysis
from busco.Analysis import NucleotideAnalysis
-from busco.BuscoTools import ProdigalRunner, AugustusRunner, GFF2GBRunner, NewSpeciesRunner, ETrainingRunner, OptimizeAugustusRunner, NoGenesError
-from busco.BuscoConfig import BuscoConfigAuto
+from busco.BuscoTools import ProdigalRunner, AugustusRunner, GFF2GBRunner, NewSpeciesRunner, ETrainingRunner, \
+ OptimizeAugustusRunner, NoGenesError
import os
import shutil
from busco.BuscoLogger import BuscoLogger
from busco.BuscoLogger import LogDecorator as log
-from busco.Toolset import Tool
import time
from abc import ABCMeta, abstractmethod
from configparser import NoOptionError
-
-
logger = BuscoLogger.get_logger(__name__)
@@ -32,25 +29,13 @@ class GenomeAnalysis(NucleotideAnalysis, BuscoAnalysis, metaclass=ABCMeta):
_mode = "genome"
- def __init__(self, config):
- super().__init__(config)
+ def __init__(self):
+ super().__init__()
@abstractmethod
def run_analysis(self):
super().run_analysis()
-
- @abstractmethod
- def create_dirs(self):
- super().create_dirs()
-
- def check_tool_dependencies(self):
- """
- check dependencies on tools
- :raises SystemExit: if a Tool is not available
- """
- super().check_tool_dependencies()
-
def init_tools(self):
"""
Initialize tools needed for Genome Analysis.
@@ -58,7 +43,6 @@ class GenomeAnalysis(NucleotideAnalysis, BuscoAnalysis, metaclass=ABCMeta):
"""
super().init_tools()
-
# def _run_tarzip_augustus_output(self): # Todo: rewrite using tarfile
# """
# This function tarzips results folder
@@ -87,20 +71,12 @@ class GenomeAnalysis(NucleotideAnalysis, BuscoAnalysis, metaclass=ABCMeta):
# "%ssingle_copy_busco_sequences.tar.gz" % self.main_out,
# "single_copy_busco_sequences", "--remove-files"], "bash", shell=False)
- def set_rerun_busco_command(self, clargs):
- """
- This function sets the command line to call to reproduce this run
- """
- clargs.extend(["-sp", self._target_species])
- super().set_rerun_busco_command(clargs)
-
- def _write_full_table_header(self, out):
- """
- This function adds a header line to the full table file
- :param out: a full table file
- :type out: file
- """
- out.write("# Busco id\tStatus\tContig\tStart\tEnd\tScore\tLength\n")
+ # def set_rerun_busco_command(self, clargs):
+ # """
+ # This function sets the command line to call to reproduce this run
+ # """
+ # clargs.extend(["-sp", self._target_species])
+ # super().set_rerun_busco_command(clargs)
class GenomeAnalysisProkaryotes(GenomeAnalysis):
@@ -108,107 +84,32 @@ class GenomeAnalysisProkaryotes(GenomeAnalysis):
This class runs a BUSCO analysis on a genome.
"""
- def __init__(self, config):
+ def __init__(self):
"""
Initialize an instance.
- :param config: Values of all parameters that have to be defined
- :type config: PipeConfig
"""
- super().__init__(config)
- self.load_persistent_tools()
-
- # Get genetic_code from dataset.cfg file
- # bacteria/archaea=11; Entomoplasmatales,Mycoplasmatales=4
- try:
- self._genetic_code = self._config.get("prodigal", "prodigal_genetic_code").split(",")
- except NoOptionError:
- self._genetic_code = ["11"]
-
- if len(self._genetic_code) > 1:
- try:
- self.ambiguous_cd_range = [float(self._config.get("prodigal", "ambiguous_cd_range_lower")),
- float(self._config.get("prodigal", "ambiguous_cd_range_upper"))]
- except NoOptionError:
- raise SystemExit("Dataset config file does not contain required information. Please upgrade datasets.")
-
- else:
- self.ambiguous_cd_range = [None, 0]
-
- self.code_4_selected = False
- self.prodigal_output_dir = os.path.join(self.main_out, "prodigal_output")
+ super().__init__()
+ self.prodigal_runner = None
- def _cleanup(self):
- # tmp_path = os.path.join(self.prodigal_output_dir, "tmp")
- # if os.path.exists(tmp_path):
- # shutil.rmtree(tmp_path)
- super()._cleanup()
+ def cleanup(self):
+ super().cleanup()
def run_analysis(self):
"""
This function calls all needed steps for running the analysis.
"""
- # Initialize tools and check dependencies
super().run_analysis()
-
- if not os.path.exists(self.prodigal_output_dir): # If prodigal has already been run on the input, don't run it again
- os.makedirs(self.prodigal_output_dir)
- self._run_prodigal()
- self._config.persistent_tools.append(self.prodigal_runner)
-
- elif any(g not in self.prodigal_runner.genetic_code for g in self._genetic_code):
- self.prodigal_runner.genetic_code = self._genetic_code
- self.prodigal_runner.cd_lower, self.prodigal_runner.cd_upper = self.ambiguous_cd_range
- self._run_prodigal()
-
- else:
- # Prodigal has already been run on input. Don't run again, just load necessary params.
- # First determine which GC to use
- self.prodigal_runner.select_optimal_results(self._genetic_code, self.ambiguous_cd_range)
- tmp_file = self.prodigal_runner.gc_run_results[self.prodigal_runner.gc]["tmp_name"]
- log_file = self.prodigal_runner.gc_run_results[self.prodigal_runner.gc]["log_file"]
- self.prodigal_runner._organize_prodigal_files(tmp_file, log_file)
-
- self.code_4_selected = self.prodigal_runner.gc == "4"
- self.sequences_nt = self.prodigal_runner.gc_run_results[self.prodigal_runner.gc]["seqs_nt"]
- self.sequences_aa = self.prodigal_runner.gc_run_results[self.prodigal_runner.gc]["seqs_aa"]
- self._gene_details = self.prodigal_runner.gc_run_results[self.prodigal_runner.gc]["gene_details"]
+ self._run_prodigal()
self.run_hmmer(self.prodigal_runner.output_faa)
- self._write_buscos_to_file(self.sequences_aa, self.sequences_nt)
+ self.hmmer_runner.write_buscos_to_file(self.sequences_aa, self.sequences_nt)
return
- def load_persistent_tools(self):
- """
- For multiple runs, load Prodigal Runner in the same state as the previous run, to avoid having to run Prodigal
- on the input again.
- :return:
- """
- for tool in self._config.persistent_tools:
- if isinstance(tool, ProdigalRunner):
- self.prodigal_runner = tool
- else:
- raise SystemExit("Unrecognized persistent tool.")
-
- def create_dirs(self):
- super().create_dirs()
-
- def check_tool_dependencies(self):
- """
- check dependencies on tools
- :raises SystemExit: if a Tool is not available
- """
- super().check_tool_dependencies()
-
def init_tools(self):
"""
Init the tools needed for the analysis
"""
super().init_tools()
- try:
- assert(isinstance(self._prodigal_tool, Tool))
- except AttributeError:
- self._prodigal_tool = Tool("prodigal", self._config)
- except AssertionError:
- raise SystemExit("Prodigal should be a tool")
+ self.prodigal_runner = ProdigalRunner()
@log("***** Run Prodigal on input to predict and extract genes *****", logger)
def _run_prodigal(self):
@@ -216,69 +117,64 @@ class GenomeAnalysisProkaryotes(GenomeAnalysis):
Run Prodigal on input file to detect genes.
:return:
"""
- if not hasattr(self, "prodigal_runner"):
- self.prodigal_runner = ProdigalRunner(self._prodigal_tool, self._input_file, self.prodigal_output_dir,
- self._genetic_code, self.ambiguous_cd_range, self.log_folder)
- self.prodigal_runner.run()
- self.code_4_selected = self.prodigal_runner.code_4_selected
- return
+ if self.restart and self.prodigal_runner.check_previous_completed_run():
+ logger.info("Skipping Prodigal run as it has already completed")
+ self.prodigal_runner.get_gene_details()
+ else:
+ self.restart = False
+ self.config.set("busco_run", "restart", str(self.restart))
+ self.prodigal_runner.run()
+ self.gene_details = self.prodigal_runner.gene_details
+ self.sequences_nt = self.prodigal_runner.sequences_nt
+ self.sequences_aa = self.prodigal_runner.sequences_aa
- def _write_full_table_header(self, out):
- """
- This function adds a header line to the full table file
- :param out: a full table file
- :type out: file
- """
- out.write("# Busco id\tStatus\tContig\tStart\tEnd\tScore\tLength\n")
+ return
class GenomeAnalysisEukaryotes(GenomeAnalysis):
"""
- This class runs a BUSCO analysis on a euk_genome.
- Todo: reintroduce restart mode with checkpoints
+ This class runs a BUSCO analysis on a eukaryote genome.
"""
- def __init__(self, config):
- """
- Retrieve the augustus config path, mandatory for genome
- Cannot be specified through config because some augustus perl scripts use it as well
- BUSCO could export it if absent, but do not want to mess up with the user env,
- let's just tell the user to do it for now.
+ def __init__(self):
+ super().__init__()
- :param config: Values of all parameters that have to be defined
- :type config: PipeConfig
- """
- self._augustus_config_path = os.environ.get("AUGUSTUS_CONFIG_PATH")
+ self._long = self.config.getboolean("busco_run", "long")
try:
- self._target_species = config.get("busco_run", "augustus_species")
+ self._target_species = self.config.get("busco_run", "augustus_species")
except KeyError:
raise SystemExit("Something went wrong. Eukaryota datasets should specify an augustus species.")
try:
- self._augustus_parameters = config.get("busco_run", "augustus_parameters").replace(',', ' ')
+ self._augustus_parameters = self.config.get("busco_run", "augustus_parameters").replace(',', ' ')
except NoOptionError:
self._augustus_parameters = ""
- super().__init__(config)
- self._check_file_dependencies()
self.mkblast_runner = None
self.tblastn_runner = None
self.augustus_runner = None
+ self.gff2gb_runner = None
+ self.new_species_runner = None
+ self.etraining_runner = None
+ self.optimize_augustus_runner = None
+
self.sequences_nt = {}
self.sequences_aa = {}
+ self.gene_details = {}
- def create_dirs(self):
- super().create_dirs()
-
- def check_tool_dependencies(self):
- blast_version = self._get_blast_version()
- if blast_version not in ["2.2", "2.3"]: # Known problems with multithreading on BLAST 2.4-2.9.
- if blast_version == "2.9" and self._tblastn_tool.cmd.endswith(
- "tblastn_June13"): # NCBI sent a binary with this name that avoids the multithreading problems.
- pass
- else:
- logger.warning("You are using BLAST version {}. This is known to yield inconsistent results when "
- "multithreading. BLAST will run on a single core as a result. For performance improvement, "
- "please revert to BLAST 2.2 or 2.3.".format(blast_version))
- self.blast_cpus = 1
- super().check_tool_dependencies()
+ def cleanup(self):
+ """
+ This function cleans temporary files
+ """
+ try:
+ augustus_tmp = self.augustus_runner.tmp_dir # Should be already done if AugustusRunner ran correctly
+ if os.path.exists(augustus_tmp):
+ shutil.rmtree(augustus_tmp)
+ except OSError:
+ pass
+ try:
+ if self._target_species.startswith("BUSCO"):
+ self.augustus_runner.move_retraining_parameters()
+ except OSError:
+ pass
+ super().cleanup()
def init_tools(self):
"""
@@ -287,289 +183,131 @@ class GenomeAnalysisEukaryotes(GenomeAnalysis):
:return:
"""
super().init_tools()
- try:
- assert(isinstance(self._mkblast_tool, Tool))
- except AttributeError:
- self._mkblast_tool = Tool("makeblastdb", self._config)
- except AssertionError:
- raise SystemExit("mkblast should be a tool")
- try:
- assert(isinstance(self._tblastn_tool, Tool))
- except AttributeError:
- self._tblastn_tool = Tool("tblastn", self._config)
- except AssertionError:
- raise SystemExit("tblastn should be a tool")
- try:
- assert(isinstance(self._augustus_tool, Tool))
- except AttributeError:
- self._augustus_tool = Tool("augustus", self._config, augustus_out=True)
- # For some reason Augustus appears to send a return code before it writes to stdout, so we have to
- # sleep briefly to allow the output to be written to the file. Otherwise we have a truncated output which
- # will cause an error.
- # self._augustus_tool.sleep = 0.4
- except AssertionError:
- raise SystemExit("Augustus should be a tool")
-
- try:
- assert(isinstance(self._gff2gbSmallDNA_tool, Tool))
- except AttributeError:
- self._gff2gbSmallDNA_tool = Tool("gff2gbSmallDNA.pl", self._config)
- except AssertionError:
- raise SystemExit("gff2gbSmallDNA.pl should be a tool")
-
- try:
- assert(isinstance(self._new_species_tool, Tool))
- except AttributeError:
- self._new_species_tool = Tool("new_species.pl", self._config)
- except AssertionError:
- raise SystemExit("new_species.pl should be a tool")
-
- try:
- assert(isinstance(self._etraining_tool, Tool))
- except AttributeError:
- self._etraining_tool = Tool("etraining", self._config)
- except AssertionError:
- raise SystemExit("etraining should be a tool")
+ self.augustus_runner = AugustusRunner()
+ self.gff2gb_runner = GFF2GBRunner()
+ self.new_species_runner = NewSpeciesRunner()
+ self.etraining_runner = ETrainingRunner()
if self._long:
- try:
- assert (isinstance(self._optimize_augustus_tool, Tool))
- except AttributeError:
- self._optimize_augustus_tool = Tool("optimize_augustus.pl", self._config)
- except AssertionError:
- raise SystemExit("optimize_augustus should be a tool")
+ self.optimize_augustus_runner = OptimizeAugustusRunner()
return
- @log("Running Augustus gene predictor on BLAST search results.", logger)
- def _run_augustus(self, coords, rerun=False):
- output_dir = os.path.join(self.run_folder, "augustus_output")
- if not os.path.exists(output_dir): # TODO: consider grouping all create_dir calls into one function for all tools
- os.mkdir(output_dir)
- # if self.augustus_runner:
- # self.augustus_runner.coords = coords
- # self.augustus_runner.target_species = self._target_species
- # else:
- self.augustus_runner = AugustusRunner(self._augustus_tool, output_dir, self.tblastn_runner.output_seqs, self._target_species,
- self._lineage_dataset, self._augustus_parameters, coords,
- self._cpus, self.log_folder, self.sequences_aa, self.sequences_nt, rerun)
- self.augustus_runner.run()
- self.sequences_nt = self.augustus_runner.sequences_nt
- self.sequences_aa = self.augustus_runner.sequences_aa
+ def run_analysis(self):
+ """This function calls all needed steps for running the analysis."""
+ super().run_analysis()
+ self._run_mkblast()
+ self._run_tblastn()
+ self._run_augustus(self.tblastn_runner.coords)
+ self.gene_details = self.augustus_runner.gene_details
+ self.run_hmmer(self.augustus_runner.output_sequences)
+ self._rerun_analysis()
def _rerun_augustus(self, coords):
- self._augustus_tool.total = 0 # Reset job count
- self._augustus_tool.nb_done = 0
missing_and_fragmented_buscos = self.hmmer_runner.missing_buscos + list(
self.hmmer_runner.fragmented_buscos.keys())
logger.info("Re-running Augustus with the new metaparameters, number of target BUSCOs: {}".format(
len(missing_and_fragmented_buscos)))
- missing_and_fragmented_coords = {busco: coords[busco] for busco in coords if busco in missing_and_fragmented_buscos}
+ missing_and_fragmented_coords = {busco: coords[busco] for busco in coords if busco in
+ missing_and_fragmented_buscos}
logger.debug('Trained species folder is {}'.format(self._target_species))
self._run_augustus(missing_and_fragmented_coords, rerun=True)
return
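
The dict comprehension in `_rerun_augustus` simply restricts the tBLASTn coordinates to the BUSCOs still missing or fragmented after the first pass. A toy illustration with made-up identifiers and coordinates:

```
coords = {"busco_a": ("contig1", 120, 980),   # made-up coordinates
          "busco_b": ("contig2", 40, 650),
          "busco_c": ("contig3", 10, 400)}
missing_buscos = ["busco_b"]
fragmented_buscos = {"busco_c": ["hit details..."]}

targets = missing_buscos + list(fragmented_buscos.keys())
rerun_coords = {busco: coords[busco] for busco in coords if busco in targets}
print(rerun_coords)   # only busco_b and busco_c are passed to the Augustus rerun
```
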
- def _set_checkpoint(self, id=None):
- """
- This function updates the checkpoint file with the provided id or deletes
- it if none is provided
- :param id: the id of the checkpoint
- :type id: int
- """
- checkpoint_filename = os.path.join(self.run_folder, "checkpoint.tmp")
- if id:
- with open(checkpoint_filename, "w") as checkpt_file:
- checkpt_file.write("{}.{}".format(id, self._mode))
- else:
- if os.path.exists(checkpoint_filename):
- os.remove(checkpoint_filename)
- return
-
- def _run_gff2gb(self):
- self.gff2gb = GFF2GBRunner(self._gff2gbSmallDNA_tool, self.run_folder, self._input_file,
- self.hmmer_runner.single_copy_buscos, self._cpus)
- self.gff2gb.run()
- return
-
- def _run_new_species(self):
- new_species_name = "BUSCO_{}".format(os.path.basename(self.main_out))
- self.new_species_runner = NewSpeciesRunner(self._new_species_tool, self._domain, new_species_name, self._cpus)
- # create new species config file from template
- self.new_species_runner.run()
- return new_species_name
-
- def _run_etraining(self):
- # train on new training set (complete single copy buscos)
- self.etraining_runner = ETrainingRunner(self._etraining_tool, self.main_out, self.run_folder, self._cpus, self._augustus_config_path)
- self.etraining_runner.run()
- return
-
- def run_analysis(self):
- """
- This function calls all needed steps for running the analysis.
- Todo: reintroduce checkpoints and restart option.
- """
-
- super().run_analysis()
- self._run_mkblast()
- coords = self._run_tblastn()
- self._run_augustus(coords)
- self._gene_details = self.augustus_runner.gene_details
- self.run_hmmer(self.augustus_runner.output_sequences)
- self.rerun_analysis()
-
@log("Starting second step of analysis. The gene predictor Augustus is retrained using the results from the "
"initial run to yield more accurate results.", logger)
- def rerun_analysis(self):
-
- # self._fix_restart_augustus_folder() # todo: reintegrate this when checkpoints are restored
- coords = self._run_tblastn(missing_and_frag_only=True, ancestral_variants=self._has_variants_file)
-
- logger.info("Training Augustus using Single-Copy Complete BUSCOs:")
- logger.info("Converting predicted genes to short genbank files")
+ def _rerun_analysis(self):
+ self.augustus_runner.make_gff_files(self.hmmer_runner.single_copy_buscos)
+ self._run_tblastn(missing_and_frag_only=True, ancestral_variants=self._has_variants_file)
self._run_gff2gb()
-
- logger.info("All files converted to short genbank files, now running the training scripts")
- new_species_name = self._run_new_species()
- self._target_species = new_species_name # todo: check new species folder can be read/written - detect any silent Augustus issues
-
- self._merge_gb_files()
-
+ self._run_new_species()
+ self.config.set("busco_run", "augustus_species", self.new_species_runner.new_species_name)
+ self._target_species = self.new_species_runner.new_species_name
self._run_etraining()
if self._long:
- self._run_optimize_augustus(new_species_name)
+ self._run_optimize_augustus(self.new_species_runner.new_species_name)
self._run_etraining()
try:
- self._rerun_augustus(coords)
- self._gene_details.update(self.augustus_runner.gene_details)
+ self._rerun_augustus(self.tblastn_runner.coords)
+ self.gene_details.update(self.augustus_runner.gene_details)
self.run_hmmer(self.augustus_runner.output_sequences)
- self._write_buscos_to_file(self.sequences_aa, self.sequences_nt)
+ self.hmmer_runner.write_buscos_to_file(self.sequences_aa, self.sequences_nt)
except NoGenesError:
logger.warning("No genes found on Augustus rerun.")
- # self._move_retraining_parameters()
- # if self._tarzip:
+ # if self._tarzip: # todo: zip folders with a lot of output
# self._run_tarzip_augustus_output()
# self._run_tarzip_hmmer_output()
# remove the checkpoint, run is done
# self._set_checkpoint()
return
- def _check_file_dependencies(self): # todo: currently only implemented for GenomeAnalysisEukaryotes, checking Augustus dirs. Does it need to be rolled out for all analyses?
- """
- check dependencies on files and folders
- properly configured.
- :raises SystemExit: if Augustus config path is not writable or
- not set at all
- :raises SystemExit: if Augustus config path does not contain
- the needed species
- present
- """
- try:
- augustus_species_dir = os.path.join(self._augustus_config_path, "species")
- if not os.access(augustus_species_dir, os.W_OK):
- raise SystemExit("Cannot write to Augustus species folder, please make sure you have write "
- "permissions to {}".format(augustus_species_dir))
-
- except TypeError:
- raise SystemExit(
- "The environment variable AUGUSTUS_CONFIG_PATH is not set")
-
-
- if not os.path.exists(os.path.join(augustus_species_dir, self._target_species)):
- raise SystemExit(
- "Impossible to locate the species \"{0}\" in Augustus species folder"
- " ({1}), check that AUGUSTUS_CONFIG_PATH is properly set"
- " and contains this species. \n\t\tSee the help if you want "
- "to provide an alternative species".format(self._target_species, augustus_species_dir))
-
- def set_rerun_busco_command(self, clargs):
- """
- This function sets the command line to call to reproduce this run
- """
- clargs.extend(["-sp", self._target_species])
- if self._augustus_parameters:
- clargs.extend(["--augustus_parameters", "\"%s\"" % self._augustus_parameters])
- super().set_rerun_busco_command(clargs)
-
- def _cleanup(self):
- """
- This function cleans temporary files
- """
- try:
- augustus_tmp = self.augustus_runner.tmp_dir # Should be already done if AugustusRunner ran correctly
- if os.path.exists(augustus_tmp):
- shutil.rmtree(augustus_tmp)
- except:
- pass
- try:
- if self._target_species.startswith("BUSCO"):
- self._move_retraining_parameters()
- except:
- pass
- super()._cleanup()
-
-
- def _fix_restart_augustus_folder(self):
- """
- This function resets and checks the augustus folder to make a restart
- possible in phase 2
- :raises SystemExit: if it is not possible to fix the folders
- # Todo: reintegrate this when restart option is added back
- """
- if os.path.exists(os.path.join(self.augustus_runner.output_folder, "predicted_genes_run1")) \
- and os.path.exists(os.path.join(self.main_out, "hmmer_output_run1")):
- os.remove(os.path.join(self.main_out, "augustus_output", "predicted_genes", "*"))
- os.rmdir(os.path.join(self.main_out, "augustus_output", "predicted_genes"))
-
- os.rename(os.path.join(self.main_out, "augustus_output", "predicted_genes_run1"),
- os.path.join(self.main_out, "augustus_output", "predicted_genes"))
-
- os.remove(os.path.join(self.main_out, "hmmer_output", "*"))
- os.rmdir(os.path.join(self.main_out, "hmmer_output"))
-
- os.rename(os.path.join(self.main_out, "hmmer_output_run1"), os.path.join(self.main_out, "hmmer_output"))
+ @log("Running Augustus gene predictor on BLAST search results.", logger)
+ def _run_augustus(self, coords, rerun=False):
+ self.augustus_runner.configure_runner(self.tblastn_runner.output_seqs, coords, self.sequences_aa,
+ self.sequences_nt, rerun)
+ if self.restart and self.augustus_runner.check_previous_completed_run():
+ run = "2nd" if rerun else "1st"
+ logger.info("Skipping {} augustus run as output already processed".format(run))
+ else:
+ self.restart = False
+ self.config.set("busco_run", "restart", str(self.restart))
+ self.augustus_runner.run()
+ self.augustus_runner.process_output()
+ self.sequences_nt = self.augustus_runner.sequences_nt
+ self.sequences_aa = self.augustus_runner.sequences_aa
- elif (os.path.exists(os.path.join(self.main_out, "augustus_output", "predicted_genes"))
- and os.path.exists(os.path.join(self.main_out, "hmmer_output"))):
- pass
+ def _run_etraining(self):
+ """Train on new training set (complete single copy buscos)"""
+ self.etraining_runner.configure_runner(self.new_species_runner.new_species_name)
+ if self.restart and self.etraining_runner.check_previous_completed_run():
+ logger.info("Skipping etraining as it has already been done")
else:
- raise SystemExit("Impossible to restart the run, necessary folders are missing. Use the -f option instead of -r")
+ self.restart = False
+ self.config.set("busco_run", "restart", str(self.restart))
+ self.etraining_runner.run()
return
- def _move_retraining_parameters(self):
- """
- This function moves retraining parameters from augustus species folder
- to the run folder
- """
- augustus_species_path = os.path.join(self._augustus_config_path, "species", self._target_species)
- if os.path.exists(augustus_species_path):
- new_path = os.path.join(self.augustus_runner.output_folder, "retraining_parameters", self._target_species)
- shutil.move(augustus_species_path, new_path)
+ @log("Converting predicted genes to short genbank files", logger)
+ def _run_gff2gb(self):
+ self.gff2gb_runner.configure_runner(self.hmmer_runner.single_copy_buscos)
+ if self.restart and self.gff2gb_runner.check_previous_completed_run():
+ logger.info("Skipping gff2gb conversion as it has already been done")
else:
- logger.warning("Augustus did not produce a retrained species folder.")
+ self.restart = False
+ self.config.set("busco_run", "restart", str(self.restart))
+ self.gff2gb_runner.run()
return
- def _merge_gb_files(self):
- logger.debug("concat all gb files...")
- # Concatenate all GB files into one large file
- with open(os.path.join(self.augustus_runner.output_folder, "training_set.db"), "w") as outfile:
- gb_dir_path = os.path.join(self.augustus_runner.output_folder, "gb")
- for fname in os.listdir(gb_dir_path):
- with open(os.path.join(gb_dir_path, fname), "r") as infile:
- outfile.writelines(infile.readlines())
+ @log("All files converted to short genbank files, now training Augustus using Single-Copy Complete BUSCOs", logger)
+ def _run_new_species(self):
+ """Create new species config file from template"""
+ if self.restart and self.new_species_runner.check_previous_completed_run():
+ logger.info("Skipping new species creation as it has already been done")
+ else:
+ self.restart = False
+ self.config.set("busco_run", "restart", str(self.restart))
+ self.new_species_runner.run()
return
def _run_optimize_augustus(self, new_species_name):
- # long mode (--long) option - runs all the Augustus optimization
- # scripts (adds ~1 day of runtime)
+ """ long mode (--long) option - runs all the Augustus optimization scripts (adds ~1 day of runtime)"""
logger.warning("Optimizing augustus metaparameters, this may take a very long time, started at {}".format(
time.strftime("%m/%d/%Y %H:%M:%S")))
- self.optimize_augustus_runner = OptimizeAugustusRunner(self._optimize_augustus_tool, self.augustus_runner.output_folder, new_species_name, self._cpus)
+ self.optimize_augustus_runner.configure_runner(self.augustus_runner.output_folder, new_species_name)
self.optimize_augustus_runner.run()
return
+
+ # def set_rerun_busco_command(self, clargs):
+ # """
+ # This function sets the command line to call to reproduce this run
+ # """
+ # clargs.extend(["-sp", self._target_species])
+ # if self._augustus_parameters:
+ # clargs.extend(["--augustus_parameters", "\"%s\"" % self._augustus_parameters])
+ # super().set_rerun_busco_command(clargs)
=====================================
src/busco/Toolset.py
=====================================
@@ -13,9 +13,12 @@ Licensed under the MIT license. See LICENSE.md file.
"""
import os
import subprocess
-import threading
+from subprocess import TimeoutExpired
+# import threading
+from multiprocessing import Process, Pool, Value, Lock
import time
from shutil import which
+from abc import ABCMeta, abstractmethod
from busco.BuscoLogger import BuscoLogger, ToolLogger
from busco.BuscoLogger import LogDecorator as log
from busco.BuscoLogger import StreamLogger
@@ -23,12 +26,12 @@ import logging
logger = BuscoLogger.get_logger(__name__)
-class Job(threading.Thread):
+class Job(Process):#threading.Thread):
"""
Build and executes one work item in an external process
"""
- def __init__(self, tool_name, cmd, job_outlogger, job_errlogger, **kwargs):
+ def __init__(self, tool_name, cmd, job_outlogger, job_errlogger, timeout, **kwargs):
"""
:param name: a name of an executable / script ("a tool") to be run
:type cmd: list
@@ -42,6 +45,7 @@ class Job(threading.Thread):
self.cmd_line = [cmd]
self.job_outlogger = job_outlogger
self.job_errlogger = job_errlogger
+ self.timeout = timeout
self.kwargs = kwargs
def add_parameter(self, parameter):
@@ -60,14 +64,22 @@ class Job(threading.Thread):
"""
with StreamLogger(logging.DEBUG, self.job_outlogger, **self.kwargs) as out: # kwargs only provided to out to capture augustus stdout
with StreamLogger(logging.ERROR, self.job_errlogger) as err:
- # Stick with Popen(), communicate() and wait() instead of just run() to ensure compatibility with
- # Python versions < 3.5.
- p = subprocess.Popen(self.cmd_line, shell=False, stdout=out, stderr=err)
- p.wait()
+ try:
+ # Stick with Popen(), communicate() and wait() instead of just run() to ensure compatibility with
+ # Python versions < 3.5.
+ p = subprocess.Popen(self.cmd_line, shell=False, stdout=out, stderr=err)
+ p.wait(self.timeout)
+ except TimeoutExpired:
+ p.kill()
+ logger.warning("The following job was killed as it was taking too long (>1hr) to "
+ "complete.\n{}".format(" ".join(self.cmd_line)))
+
self.job_outlogger._file_hdlr.close()
self.job_outlogger.removeHandler(self.job_outlogger._file_hdlr)
self.job_errlogger._file_hdlr.close()
self.job_errlogger.removeHandler(self.job_errlogger._file_hdlr)
+ with cnt.get_lock():
+ cnt.value += 1
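
The reworked `Job.run` wraps `Popen.wait()` in a timeout so that a hung external job (the warning above mentions the one-hour limit used for Augustus) is killed instead of blocking the whole pipeline. A self-contained sketch of the same pattern; the `sleep` command and the 2-second timeout are placeholders, and BUSCO itself passes 3600 seconds:

```
import subprocess
from subprocess import TimeoutExpired

cmd = ["sleep", "5"]                       # placeholder long-running command (POSIX)
p = subprocess.Popen(cmd, shell=False,
                     stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
try:
    p.wait(2)                              # timeout in seconds
except TimeoutExpired:
    p.kill()
    print("Job killed, took too long to complete: {}".format(" ".join(cmd)))
```
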
class ToolException(Exception):
"""
@@ -80,12 +92,12 @@ class ToolException(Exception):
return self.value
-class Tool:
+class Tool(metaclass=ABCMeta):
"""
Collection of utility methods used by all tools
"""
- def __init__(self, name, config, **kwargs):
+ def __init__(self):
"""
Initialize job list for a tool
:param name: the name of the tool to execute
@@ -94,51 +106,45 @@ class Tool:
:type config: configparser.ConfigParser
"""
- self.name = name
- self.config = config
self.cmd = None
- if not self.check_tool_available():
- raise ToolException("{} tool cannot be found. Please check the 'path' and 'command' parameters "
- "provided in the config file. Do not include the command in the path!".format(self.name))
-
- self.logfile_path_out = self._set_logfile_path()
- self.logfile_path_err = self.logfile_path_out.replace('_out.log', '_err.log')
- self.kwargs = kwargs
+ # self.name = name
+ # if not self.check_tool_available():
+ # raise ToolException("{} tool cannot be found. Please check the 'path' and 'command' parameters "
+ # "provided in the config file. Do not include the command in the path!".format(self.name))
+ if self.name == "augustus":
+ self.kwargs = {"augustus_out": True}
+ self.timeout = 3600
+ else:
+ self.kwargs = {}
+ self.timeout = None
self.jobs_to_run = []
self.jobs_running = []
self.nb_done = 0
self.total = 0
- self.count_jobs_created = True
- self.logged_header = False
+ self.cpus = None
+ self.chunksize = None
+ # self.count_jobs_created = True
+ # self.logged_header = False
- def check_tool_available(self):
- """
- Check tool's availability.
- 1. The section ['name'] is available in the config
- 2. This section contains keys 'path' and 'command'
- 3. The string resulted from contatination of values of these two keys
- represents the full path to the command
- :param name: the name of the tool to execute
- :type name: str
- :param config: initialized instance of ConfigParser
- :type config: configparser.ConfigParser
- :return: True if the tool can be run, False if it is not the case
- :rtype: bool
- """
- if not self.config.has_section(self.name):
- raise ToolException("Section for the tool [{}] is not present in the config file".format(self.name))
+ # self.logfile_path_out = os.path.join(self.config.get("busco_run", "main_out"), "logs", "{}_out.log".format(self.name))
+ # self.logfile_path_err = self.logfile_path_out.replace('_out.log', '_err.log')
- if not self.config.has_option(self.name, 'path'):
- raise ToolException("Key \'path\' in the section [{}] is not present in the config file".format(self.name))
+ @abstractmethod
+ def configure_job(self):
+ pass
- if self.config.has_option(self.name, 'command'):
- executable = self.config.get(self.name, 'command')
- else:
- executable = self.name
+ @abstractmethod
+ def generate_job_args(self):
+ pass
- self.cmd = os.path.join(self.config.get(self.name, 'path'), executable)
+ @property
+ @abstractmethod
+ def name(self):
+ raise NotImplementedError
- return which(self.cmd) is not None # True if tool available
+ @abstractmethod
+ def write_checkpoint_file(self):
+ pass
def create_job(self):
"""
@@ -146,10 +152,10 @@ class Tool:
"""
self.tool_outlogger = ToolLogger(self.logfile_path_out)
self.tool_errlogger = ToolLogger(self.logfile_path_err)
- job = Job(self.name, self.cmd[:], self.tool_outlogger, self.tool_errlogger, **self.kwargs)
+ job = Job(self.name, self.cmd[:], self.tool_outlogger, self.tool_errlogger, self.timeout, **self.kwargs)
self.jobs_to_run.append(job)
- if self.count_jobs_created:
- self.total += 1
+ # if self.count_jobs_created:
+ # self.total += 1
return job
def remove_job(self, job):
@@ -160,45 +166,44 @@ class Tool:
"""
self.jobs_to_run.remove(job)
- def _set_logfile_path(self):
- return os.path.join(self.config.get("busco_run", "main_out"), "logs", "{}_out.log".format(self.name))
-
- @log("Running {} job(s) on {}", logger, attr_name=['total', 'name'])
def log_jobs_to_run(self):
- self.logged_header = True
+ logger.info("Running {} job(s) on {}, starting at {}".format(self.total, self.name,
+ time.strftime('%m/%d/%Y %H:%M:%S')))
+ return
- def run_jobs(self, max_threads):
- """
- This method run all jobs created for the Tool and redirect
- the standard output and error to the current logger
- :param max_threads: the number or threads to run simultaneously
- :type max_threads: int
- :param log_it: whether to log the progress for the tasks. Default True
- :type log_it: boolean
- """
- if not self.logged_header:
+ @log("No jobs to run on {}", logger, attr_name="name", iswarn=True)
+ def log_no_jobs(self):
+ return
+
+ def run_jobs(self):
+ if self.total > 0:
self.log_jobs_to_run()
+ else:
+ self.log_no_jobs()
+ return
- # Wait for all threads to finish and log progress
- already_logged = 0
- while len(self.jobs_to_run) > 0 or len(self.jobs_running) > 0:
- time.sleep(0.001)
- for j in self.jobs_to_run:
- if len(self.jobs_running) < max_threads:
- self.jobs_running.append(j)
- self.jobs_to_run.remove(j)
- j.start()
- for j in self.jobs_running:
- # j.join()
- if not j.is_alive():
- self.jobs_running.remove(j)
- self.nb_done += 1
-
- if (self.nb_done == self.total or int(self.nb_done % float(self.total/10)) == 0) and self.nb_done != already_logged:
- already_logged = self._track_progress()
- # self.total = 0 # Reset for tools that are run twice (tblastn, augustus)
+ if self.cpus is None: # todo: need a different way to ensure self.cpus is a nonzero number.
+ raise SystemExit("Number of CPUs not specified.")
+ with Pool(self.cpus, initializer=type(self).init_globals, initargs=(Value('i', 0),)) as job_pool:
+ job_pool.map(self.run_job, self.generate_job_args(), chunksize=self.chunksize)
+ self.write_checkpoint_file()
+
+ def run_job(self, args):
+ args = (args,) if isinstance(args, str) else tuple(args or (args,)) # Ensure args are tuples that can be unpacked. If no args, args=None, which is falsy, and this evaluates to (None,)
+ job = self.configure_job(*args)
+ job.run()
+ self.nb_done = cnt.value
+ if (self.nb_done == self.total or int(
+ self.nb_done % float(self.total / 10)) == 0):
+ self._track_progress()
@log('[{0}]\t{1} of {2} task(s) completed', logger, attr_name=['name', 'nb_done', 'total'], on_func_exit=True)
def _track_progress(self):
- return self.nb_done
\ No newline at end of file
+ return
+
+ @classmethod
+ def init_globals(cls, counter):
+ """Counter code adapted from the answer here: https://stackoverflow.com/a/53621343/4844311"""
+ global cnt
+ cnt = counter
=====================================
src/busco/TranscriptomeAnalysis.py
=====================================
@@ -11,8 +11,6 @@ Licensed under the MIT license. See LICENSE.md file.
"""
import os
-import time
-
from busco.BuscoAnalysis import BuscoAnalysis
from busco.BuscoLogger import BuscoLogger
from busco.BuscoLogger import LogDecorator as log
@@ -20,13 +18,13 @@ from Bio.Seq import reverse_complement, translate
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from busco.Analysis import NucleotideAnalysis
-from busco.Toolset import Tool
logger = BuscoLogger.get_logger(__name__)
# todo: catch multiple buscos on one transcript
+
class TranscriptomeAnalysis(NucleotideAnalysis, BuscoAnalysis):
"""
Analysis on a transcriptome.
@@ -34,14 +32,11 @@ class TranscriptomeAnalysis(NucleotideAnalysis, BuscoAnalysis):
_mode = "transcriptome"
- def __init__(self, config):
+ def __init__(self):
"""
Initialize an instance.
- :param config: Values of all parameters that have to be defined
- :type config: BuscoConfig
"""
- super().__init__(config)
-
+ super().__init__()
def run_analysis(self):
"""
@@ -58,57 +53,29 @@ class TranscriptomeAnalysis(NucleotideAnalysis, BuscoAnalysis):
# if checkpoint < 1:
self._run_mkblast()
- coords = self._run_tblastn(ancestral_variants=self._has_variants_file)
+ self._run_tblastn(ancestral_variants=self._has_variants_file)
- protein_seq_files = self._translate_seqs(coords)
+ protein_seq_files = self._translate_seqs(self.tblastn_runner.coords)
self.run_hmmer(protein_seq_files)
- # Note BUSCO matches are not written to file, as we have not yet developed a suitable protocol for Transcriptomes
- self._cleanup()
+ # Note BUSCO matches are not written to file, as we have not yet developed a suitable protocol for
+ # Transcriptomes
# if self._tarzip:
# self._run_tarzip_hmmer_output()
# self._run_tarzip_translated_proteins()
return
- def create_dirs(self): # todo: remove this as abstract method, review all abstract methods
- super().create_dirs()
-
- def init_tools(self): # todo: This should be an abstract method
-
+ def init_tools(self):
super().init_tools()
- try:
- assert(isinstance(self._mkblast_tool, Tool))
- except AttributeError:
- self._mkblast_tool = Tool("makeblastdb", self._config)
- except AssertionError:
- raise SystemExit("mkblast should be a tool")
-
- try:
- assert(isinstance(self._tblastn_tool, Tool))
- except AttributeError:
- self._tblastn_tool = Tool("tblastn", self._config)
- except AssertionError:
- raise SystemExit("tblastn should be a tool")
-
- def check_tool_dependencies(self):
- blast_version = self._get_blast_version()
- if blast_version not in ["2.2", "2.3"]: # Known problems with multithreading on BLAST 2.4-2.9.
- if blast_version == "2.9" and self._tblastn_tool.cmd.endswith(
- "tblastn_June13"): # NCBI sent a binary with this name that avoids the multithreading problems.
- pass
- else:
- logger.warning("You are using BLAST version {}. This is known to yield inconsistent results when "
- "multithreading. BLAST will run on a single core as a result. For performance improvement, "
- "please revert to BLAST 2.2 or 2.3.".format(blast_version))
- self.blast_cpus = 1
-
- def _cleanup(self):
+
+ def cleanup(self):
"""
This function cleans temporary files.
"""
- super()._cleanup()
+ super().cleanup()
- def six_frame_translation(self, seq):
+ @staticmethod
+ def six_frame_translation(seq):
"""
Gets the sixframe translation for the provided sequence
:param seq: the sequence to be translated
@@ -132,7 +99,8 @@ class TranscriptomeAnalysis(NucleotideAnalysis, BuscoAnalysis):
translated_seqs[descriptions[-(i+1)]] = (translate(anti[i:i + fragment_length], stop_symbol="X"))
return translated_seqs
- def _reformats_seq_id(self, seq_id):
+ @staticmethod
+ def _reformats_seq_id(seq_id):
"""
This function reformats the sequence id to its original values
:param seq_id: the seq id to reformat
@@ -160,8 +128,10 @@ class TranscriptomeAnalysis(NucleotideAnalysis, BuscoAnalysis):
protein_seq_files.append(output_filename)
translated_records = []
for contig_name in contig_info:
- tmp_filename = os.path.join(self.tblastn_runner.output_seqs, "{}.temp".format(contig_name[:100])) # Avoid very long filenames
- for record in SeqIO.parse(tmp_filename, "fasta"): # These files will only ever have one sequence, but BioPython examples always parse them in an iterator.
+ tmp_filename = os.path.join(self.tblastn_runner.output_seqs, "{}.temp".format(
+ contig_name[:100])) # Avoid very long filenames
+ for record in SeqIO.parse(tmp_filename, "fasta"): # These files will only ever have one sequence,
+ # but BioPython examples always parse them in an iterator.
translated_seqs = self.six_frame_translation(record.seq)
for desc_id in translated_seqs: # There are six possible translated sequences
prot_seq = translated_seqs[desc_id]
@@ -172,7 +142,6 @@ class TranscriptomeAnalysis(NucleotideAnalysis, BuscoAnalysis):
return protein_seq_files
-
# def _run_tarzip_translated_proteins(self):
# """
# This function tarzips results folder
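
For reference, the six-frame translation used by the transcriptome mode can be reproduced with Biopython alone. This is a simplified sketch, not the real `six_frame_translation` (it drops the per-frame description bookkeeping shown above): it translates the three forward and three reverse-complement frames after trimming any partial codon.

```
from Bio.Seq import reverse_complement, translate

def six_frames(seq):
    frames = {}
    rev = reverse_complement(seq)
    for i in range(3):
        for label, s in (("+{}".format(i + 1), seq[i:]),
                         ("-{}".format(i + 1), rev[i:])):
            s = s[:len(s) - len(s) % 3]          # trim partial codon at the end
            frames[label] = translate(s, stop_symbol="X")
    return frames

print(six_frames("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"))
```
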
=====================================
src/busco/ViralAnalysis.py deleted
=====================================
@@ -1,146 +0,0 @@
-#!/usr/bin/env python3
-# coding: utf-8
-"""
-.. module:: ViralAnalysis
- :synopsis: ViralAnalysis implements genome analysis specifics
-.. versionadded:: 3.0.0
-.. versionchanged:: 3.0.0
-
-Copyright (c) 2016-2020, Evgeny Zdobnov (ez at ezlab.org)
-Licensed under the MIT license. See LICENSE.md file.
-
-"""
-from busco.BuscoAnalysis import BuscoAnalysis
-from busco.BuscoLogger import BuscoLogger
-
-logger = BuscoLogger.get_logger(__name__)
-
-class ViralAnalysis(BuscoAnalysis):
- """
- This class runs a BUSCO analysis on a gene set.
- """
-
- _mode = "proteins"
-
-
- def __init__(self, params):
- """
- Initialize an instance.
- :param params: Values of all parameters that have to be defined
- :type params: PipeConfig
- """
- super().__init__(params)
- # data integrity checks not done by the parent class
- if self.check_protein_file():
- ViralAnalysis._logger.error("Please provide a genome file as input or run BUSCO in protein mode (--mode proteins)")
- raise SystemExit
-
- def run_analysis(self):
- """
- This function calls all needed steps for running the analysis.
- """
- super().run_analysis()
- self._translate_virus()
- self._sequences = self.translated_proteins
- self._run_hmmer()
- # if self._tarzip:
- # self._run_tarzip_hmmer_output()
-
- def _init_tools(self):
- """
- Init the tools needed for the analysis
- """
- super()._init_tools()
-
- def _run_hmmer(self):
- """
- This function runs hmmsearch.
- """
- super()._run_hmmer()
-
- def _sixpack(self, seq):
- """
- Gets the sixframe translation for the provided sequence
- :param seq: the sequence to be translated
- :type seq: str
- :return: the six translated sequences
- :rtype: list
- """
- s1 = seq
- s2 = seq[1:]
- s3 = seq[2:]
- rev = ""
- for letter in seq[::-1]:
- try:
- rev += BuscoAnalysis.COMP[letter]
- except KeyError:
- rev += BuscoAnalysis.COMP["N"]
- r1 = rev
- r2 = rev[1:]
- r3 = rev[2:]
- transc = []
- frames = [s1, s2, s3, r1, r3, r2]
- for sequence in frames:
- part = ""
- new = ""
- for letter in sequence:
- if len(part) == 3:
- try:
- new += BuscoAnalysis.CODONS[part]
- except KeyError:
- new += "X"
- part = ""
- part += letter
- else:
- part += letter
- if len(part) == 3:
- try:
- new += BuscoAnalysis.CODONS[part]
- except KeyError:
- new += "X"
- transc.append(new)
- return transc
-
- def _translate_virus(self):
- """
- Prepares viral genomes for a BUSCO
- protein analysis.
- 1) Translate any sequences in 6 frames
- 2) Split the sequences on the stops
- 3) Remove sequences shorter than 50aa
- :return: file name
- :rtype: string
- """
- with open(self._sequences, "r") as f1:
- with open(self.mainout + "translated_proteins.faa", "w") as o1:
- seqs = {}
- # parse file, retrieve all sequences
- for line in f1:
- if line.startswith(">"):
- header = line.strip()
- seqs[header] = ""
- else:
- seqs[header] += line.strip()
-
- # feed sequences to 6 frame translator
- # then split on STOP codons
- ctg_ct = 1
- for seqid in seqs:
- seq_6f = self._sixpack(seqs[seqid])
- nb_frame = 1
- for frame in seq_6f:
- valid_ts_ct = 1
- # chop at the stop codons
- chopped_seqs = frame.split("X")
- for short_seq in chopped_seqs:
- # must have at least 50 A.A. to be considered further
- if len(short_seq) >= 50:
- # ctg nb, frame nb, transcript nb
- o1.write(">seq_n%s_f%s_t%s\n" % (ctg_ct, nb_frame, valid_ts_ct))
- o1.write("%s\n" % short_seq)
- valid_ts_ct += 1
- else:
- pass
- nb_frame += 1
- ctg_ct += 1
- self.translated_proteins = self.main_out + "translated_proteins.faa"
=====================================
src/busco/_version.py
=====================================
@@ -6,4 +6,4 @@ Copyright (c) 2016-2020, Evgeny Zdobnov (ez at ezlab.org)
Licensed under the MIT license. See LICENSE.md file.
"""
-__version__ = "4.0.6"
+__version__ = "4.1.2"
=====================================
src/busco/run_BUSCO.py
=====================================
@@ -66,7 +66,8 @@ def _parse_args():
optional.add_argument(
'--out_path', dest='out_path', required=False, metavar='OUTPUT_PATH',
- help='Optional location for results folder, excluding results folder name. Default is current working directory.')
+ help='Optional location for results folder, excluding results folder name. '
+ 'Default is current working directory.')
optional.add_argument(
'-e', '--evalue', dest='evalue', required=False, metavar='N', type=float,
@@ -89,6 +90,10 @@ def _parse_args():
help='Force rewriting of existing files. '
'Must be used when output files with the provided name already exist.')
+ optional.add_argument(
+ '-r', '--restart', action='store_true', required=False, dest='restart',
+ help='Continue a run that had already partially completed.')
+
optional.add_argument(
'--limit', dest='limit', metavar='REGION_LIMIT', required=False,
type=int, help='How many candidate regions (contig or transcript) to consider per BUSCO (default: %s)'
@@ -117,7 +122,8 @@ def _parse_args():
# action="store_true")
optional.add_argument(
- '--auto-lineage', dest='auto-lineage', action="store_true", required=False, help='Run auto-lineage to find optimum lineage path')
+ '--auto-lineage', dest='auto-lineage', action="store_true", required=False,
+ help='Run auto-lineage to find optimum lineage path')
optional.add_argument(
'--auto-lineage-prok', dest='auto-lineage-prok', action="store_true", required=False,
@@ -144,7 +150,8 @@ def _parse_args():
optional.add_argument('-h', '--help', action=CleanHelpAction, help="Show this help message and exit")
- optional.add_argument('--list-datasets', action=ListLineagesAction, help="Print the list of available BUSCO datasets")
+ optional.add_argument('--list-datasets', action=ListLineagesAction,
+ help="Print the list of available BUSCO datasets")
return vars(parser.parse_args())
@@ -160,7 +167,9 @@ def main():
params = _parse_args()
run_BUSCO(params)
-@log('***** Start a BUSCO v{} analysis, current time: {} *****'.format(busco.__version__, time.strftime('%m/%d/%Y %H:%M:%S')), logger)
+
+@log('***** Start a BUSCO v{} analysis, current time: {} *****'.format(busco.__version__,
+ time.strftime('%m/%d/%Y %H:%M:%S')), logger)
def run_BUSCO(params):
start_time = time.time()
@@ -173,14 +182,16 @@ def run_BUSCO(params):
lineage_basename = os.path.basename(config.get("busco_run", "lineage_dataset"))
main_out_folder = config.get("busco_run", "main_out")
- lineage_results_folder = os.path.join(main_out_folder, "auto_lineage", config.get("busco_run", "lineage_results_dir"))
+ lineage_results_folder = os.path.join(main_out_folder, "auto_lineage",
+ config.get("busco_run", "lineage_results_dir"))
if config.getboolean("busco_run", "auto-lineage"):
if lineage_basename.startswith(("bacteria", "archaea", "eukaryota")):
busco_run = config_manager.runner
- # It is possible that the following lineages were arrived at either by the Prodigal genetic code shortcut or by
- # BuscoPlacer. If the former, the run will have already been completed. If the latter it still needs to be done.
+ # It is possible that the following lineages were arrived at either by the Prodigal genetic code shortcut
+ # or by BuscoPlacer. If the former, the run will have already been completed. If the latter it still needs
+ # to be done.
elif lineage_basename.startswith(("mollicutes", "mycoplasmatales", "entomoplasmatales")) and \
os.path.exists(lineage_results_folder):
busco_run = config_manager.runner
@@ -190,13 +201,13 @@ def run_BUSCO(params):
busco_run = BuscoRunner(config)
if os.path.exists(lineage_results_folder):
- os.rename(lineage_results_folder, os.path.join(main_out_folder, config.get("busco_run", "lineage_results_dir")))
- busco_run.finish(time.time()-start_time, root_lineage=True)
+ os.rename(lineage_results_folder, os.path.join(main_out_folder,
+ config.get("busco_run", "lineage_results_dir")))
else:
busco_run.run_analysis()
- BuscoRunner.final_results.append(busco_run.analysis.hmmer_results_lines)
+ BuscoRunner.final_results.append(busco_run.analysis.hmmer_runner.hmmer_results_lines)
BuscoRunner.results_datasets.append(lineage_basename)
- busco_run.finish(time.time()-start_time)
+ busco_run.finish(time.time()-start_time)
except ToolException as e:
logger.error(e)
@@ -226,7 +237,8 @@ def run_BUSCO(params):
except BaseException:
exc_type, exc_value, exc_traceback = sys.exc_info()
- logger.critical("Unhandled exception occurred:\n{}\n".format("".join(traceback.format_exception(exc_type, exc_value, exc_traceback))))
+ logger.critical("Unhandled exception occurred:\n{}\n".format(
+ "".join(traceback.format_exception(exc_type, exc_value, exc_traceback))))
raise SystemExit
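
The new `-r/--restart` flag is a plain `store_true` option, so resuming an interrupted run only requires re-issuing the original command with the flag added. A minimal argparse sketch mirroring the option added above (the rest of BUSCO's argument set is omitted):

```
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-r', '--restart', action='store_true', required=False, dest='restart',
                    help='Continue a run that had already partially completed.')

print(vars(parser.parse_args(['--restart'])))   # {'restart': True}
print(vars(parser.parse_args([])))              # {'restart': False}
```
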
=====================================
test_data/bacteria/expected_log.txt
=====================================
@@ -1,4 +1,4 @@
-INFO: ***** Start a BUSCO v4.0.6 analysis, current time: 02/12/2020 16:38:00 *****
+INFO: ***** Start a BUSCO v4.1.2 analysis, current time: 07/01/2020 18:43:08 *****
INFO: Configuring BUSCO with /busco/config/config.ini
INFO: Mode is genome
INFO: Input file is genome.fna
@@ -14,12 +14,12 @@ INFO: Downloading file 'https://busco-data.ezlab.org/v4/data/lineages/archaea_od
INFO: Decompressing file '/busco_wd/busco_downloads/lineages/archaea_odb10.tar.gz'
INFO: Running BUSCO using lineage dataset archaea_odb10 (prokaryota, 2019-01-04)
INFO: ***** Run Prodigal on input to predict and extract genes *****
-INFO: Running prodigal with genetic code 11 in single mode
-INFO: Running 1 job(s) on prodigal
+INFO: Running Prodigal with genetic code 11 in single mode
+INFO: Running 1 job(s) on prodigal, starting at 07/01/2020 18:43:09
INFO: [prodigal] 1 of 1 task(s) completed
INFO: Genetic code 11 selected as optimal
INFO: ***** Run HMMER on gene sequences *****
-INFO: Running 194 job(s) on hmmsearch
+INFO: Running 194 job(s) on hmmsearch, starting at 07/01/2020 18:43:10
INFO: [hmmsearch] 20 of 194 task(s) completed
INFO: [hmmsearch] 39 of 194 task(s) completed
INFO: [hmmsearch] 59 of 194 task(s) completed
@@ -38,13 +38,14 @@ INFO: Running BUSCO using lineage dataset bacteria_odb10 (prokaryota, 2019-06-26
INFO: ***** Run Prodigal on input to predict and extract genes *****
INFO: Genetic code 11 selected as optimal
INFO: ***** Run HMMER on gene sequences *****
-INFO: Running 124 job(s) on hmmsearch
+INFO: Running 124 job(s) on hmmsearch, starting at 07/01/2020 18:43:13
INFO: [hmmsearch] 13 of 124 task(s) completed
INFO: [hmmsearch] 25 of 124 task(s) completed
INFO: [hmmsearch] 38 of 124 task(s) completed
INFO: [hmmsearch] 50 of 124 task(s) completed
INFO: [hmmsearch] 63 of 124 task(s) completed
INFO: [hmmsearch] 75 of 124 task(s) completed
+INFO: [hmmsearch] 87 of 124 task(s) completed
INFO: [hmmsearch] 100 of 124 task(s) completed
INFO: [hmmsearch] 112 of 124 task(s) completed
INFO: [hmmsearch] 124 of 124 task(s) completed
@@ -53,15 +54,15 @@ INFO: Results: C:21.0%[S:21.0%,D:0.0%],F:0.8%,M:78.2%,n:124
INFO: Downloading file 'https://busco-data.ezlab.org/v4/data/lineages/eukaryota_odb10.2019-11-20.tar.gz'
INFO: Decompressing file '/busco_wd/busco_downloads/lineages/eukaryota_odb10.tar.gz'
INFO: Running BUSCO using lineage dataset eukaryota_odb10 (eukaryota, 2019-11-20)
+INFO: Running 1 job(s) on makeblastdb, starting at 07/01/2020 18:43:16
INFO: Creating BLAST database with input file
-INFO: Running 1 job(s) on makeblastdb
INFO: [makeblastdb] 1 of 1 task(s) completed
INFO: Running a BLAST search for BUSCOs against created database
-INFO: Running 1 job(s) on tblastn
+INFO: Running 1 job(s) on tblastn, starting at 07/01/2020 18:43:16
INFO: [tblastn] 1 of 1 task(s) completed
INFO: Running Augustus gene predictor on BLAST search results.
INFO: Running Augustus prediction using fly as species:
-INFO: Running 10 job(s) on augustus
+INFO: Running 10 job(s) on augustus, starting at 07/01/2020 18:43:18
INFO: [augustus] 1 of 10 task(s) completed
INFO: [augustus] 2 of 10 task(s) completed
INFO: [augustus] 3 of 10 task(s) completed
@@ -74,7 +75,7 @@ INFO: [augustus] 9 of 10 task(s) completed
INFO: [augustus] 10 of 10 task(s) completed
INFO: Extracting predicted proteins...
INFO: ***** Run HMMER on gene sequences *****
-INFO: Running 4 job(s) on hmmsearch
+INFO: Running 4 job(s) on hmmsearch, starting at 07/01/2020 18:43:51
INFO: [hmmsearch] 1 of 4 task(s) completed
INFO: [hmmsearch] 2 of 4 task(s) completed
INFO: [hmmsearch] 3 of 4 task(s) completed
@@ -85,17 +86,19 @@ INFO: Results: C:0.0%[S:0.0%,D:0.0%],F:0.0%,M:100.0%,n:255
INFO: Starting second step of analysis. The gene predictor Augustus is retrained using the results from the initial run to yield more accurate results.
INFO: Extracting missing and fragmented buscos from the file ancestral_variants...
INFO: Running a BLAST search for BUSCOs against created database
+INFO: Running 1 job(s) on tblastn, starting at 07/01/2020 18:43:52
INFO: [tblastn] 1 of 1 task(s) completed
-INFO: Training Augustus using Single-Copy Complete BUSCOs:
INFO: Converting predicted genes to short genbank files
-INFO: All files converted to short genbank files, now running the training scripts
-INFO: Running 1 job(s) on new_species.pl
+WARNING: No jobs to run on gff2gbSmallDNA.pl
+INFO: All files converted to short genbank files, now training Augustus using Single-Copy Complete BUSCOs
+INFO: Running 1 job(s) on new_species.pl, starting at 07/01/2020 18:44:07
INFO: [new_species.pl] 1 of 1 task(s) completed
-INFO: Running 1 job(s) on etraining
+INFO: Running 1 job(s) on etraining, starting at 07/01/2020 18:44:08
INFO: [etraining] 1 of 1 task(s) completed
INFO: Re-running Augustus with the new metaparameters, number of target BUSCOs: 255
INFO: Running Augustus gene predictor on BLAST search results.
INFO: Running Augustus prediction using BUSCO_test_bacteria as species:
+INFO: Running 14 job(s) on augustus, starting at 07/01/2020 18:44:08
INFO: [augustus] 2 of 14 task(s) completed
INFO: [augustus] 3 of 14 task(s) completed
INFO: [augustus] 5 of 14 task(s) completed
@@ -108,12 +111,11 @@ INFO: [augustus] 13 of 14 task(s) completed
INFO: [augustus] 14 of 14 task(s) completed
INFO: Extracting predicted proteins...
INFO: ***** Run HMMER on gene sequences *****
-INFO: [hmmsearch] 1 of 3 task(s) completed
-INFO: [hmmsearch] 2 of 3 task(s) completed
-INFO: [hmmsearch] 3 of 3 task(s) completed
+WARNING: No jobs to run on hmmsearch
WARNING: BUSCO did not find any match. Make sure to check the log files if this is unexpected.
INFO: Results: C:0.0%[S:0.0%,D:0.0%],F:0.0%,M:100.0%,n:255
+WARNING: Augustus did not produce a retrained species folder.
INFO: bacteria_odb10 selected
INFO: ***** Searching tree for chosen lineage to find best taxonomic match *****
@@ -132,7 +134,7 @@ INFO: Decompressing file '/busco_wd/busco_downloads/placement_files/mapping_taxi
INFO: Downloading file 'https://busco-data.ezlab.org/v4/data/placement_files/mapping_taxid-lineage.bacteria_odb10.2019-12-16.txt.tar.gz'
INFO: Decompressing file '/busco_wd/busco_downloads/placement_files/mapping_taxid-lineage.bacteria_odb10.2019-12-16.txt.tar.gz'
INFO: Place the markers on the reference tree...
-INFO: Running 1 job(s) on sepp
+INFO: Running 1 job(s) on sepp, starting at 07/01/2020 18:44:10
INFO: [sepp] 1 of 1 task(s) completed
INFO: Not enough markers were placed on the tree (11). Root lineage bacteria is kept
INFO:
@@ -148,12 +150,15 @@ INFO:
|97 Missing BUSCOs (M) |
|124 Total BUSCO groups searched |
--------------------------------------------------
-INFO: BUSCO analysis done with WARNING(s). Total running time: 81 seconds
+INFO: BUSCO analysis done with WARNING(s). Total running time: 127 seconds
***** Summary of warnings: *****
WARNING:busco.ConfigManager Running Auto Lineage Selector as no lineage dataset was specified. This will take a little longer than normal. If you know what lineage dataset you want to use, please specify this in the config file or using the -l (--lineage-dataset) flag in the command line.
WARNING:busco.BuscoTools BUSCO did not find any match. Make sure to check the log files if this is unexpected.
+WARNING:busco.Toolset No jobs to run on gff2gbSmallDNA.pl
+WARNING:busco.Toolset No jobs to run on hmmsearch
WARNING:busco.BuscoTools BUSCO did not find any match. Make sure to check the log files if this is unexpected.
+WARNING:busco.BuscoTools Augustus did not produce a retrained species folder.
INFO: Results written in /busco_wd/test_bacteria
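Most of the churn in the expected logs comes from 4.1.x appending a start timestamp to every "Running N job(s) on <tool>" message. A minimal sketch of how such a message can be formatted is below; the function and variable names are illustrative only (they are not taken from BUSCO's Toolset code), and the strftime pattern is inferred from the month/day/year timestamps visible in the log lines above.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger("busco_sketch")


def log_job_start(tool_name: str, job_count: int) -> None:
    # Timestamp format assumed from the expected logs, e.g. "07/01/2020 18:43:10".
    start_time = time.strftime("%m/%d/%Y %H:%M:%S")
    logger.info("Running %d job(s) on %s, starting at %s", job_count, tool_name, start_time)


log_job_start("hmmsearch", 124)
```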
=====================================
test_data/eukaryota/expected_log.txt
=====================================
@@ -1,4 +1,4 @@
-INFO: ***** Start a BUSCO v4.0.6 analysis, current time: 02/12/2020 16:40:52 *****
+INFO: ***** Start a BUSCO v4.1.2 analysis, current time: 07/01/2020 17:20:41 *****
INFO: Configuring BUSCO with /busco/config/config.ini
INFO: Mode is genome
INFO: Input file is genome.fna
@@ -14,12 +14,12 @@ INFO: Downloading file 'https://busco-data.ezlab.org/v4/data/lineages/archaea_od
INFO: Decompressing file '/busco_wd/busco_downloads/lineages/archaea_odb10.tar.gz'
INFO: Running BUSCO using lineage dataset archaea_odb10 (prokaryota, 2019-01-04)
INFO: ***** Run Prodigal on input to predict and extract genes *****
-INFO: Running prodigal with genetic code 11 in single mode
-INFO: Running 1 job(s) on prodigal
+INFO: Running Prodigal with genetic code 11 in single mode
+INFO: Running 1 job(s) on prodigal, starting at 07/01/2020 17:20:42
INFO: [prodigal] 1 of 1 task(s) completed
INFO: Genetic code 11 selected as optimal
INFO: ***** Run HMMER on gene sequences *****
-INFO: Running 194 job(s) on hmmsearch
+INFO: Running 194 job(s) on hmmsearch, starting at 07/01/2020 17:20:42
INFO: [hmmsearch] 20 of 194 task(s) completed
INFO: [hmmsearch] 39 of 194 task(s) completed
INFO: [hmmsearch] 59 of 194 task(s) completed
@@ -27,6 +27,8 @@ INFO: [hmmsearch] 78 of 194 task(s) completed
INFO: [hmmsearch] 97 of 194 task(s) completed
INFO: [hmmsearch] 117 of 194 task(s) completed
INFO: [hmmsearch] 136 of 194 task(s) completed
+INFO: [hmmsearch] 156 of 194 task(s) completed
+INFO: [hmmsearch] 156 of 194 task(s) completed
INFO: [hmmsearch] 175 of 194 task(s) completed
INFO: [hmmsearch] 194 of 194 task(s) completed
INFO: Results: C:1.0%[S:1.0%,D:0.0%],F:0.5%,M:98.5%,n:194
@@ -37,7 +39,7 @@ INFO: Running BUSCO using lineage dataset bacteria_odb10 (prokaryota, 2019-06-26
INFO: ***** Run Prodigal on input to predict and extract genes *****
INFO: Genetic code 11 selected as optimal
INFO: ***** Run HMMER on gene sequences *****
-INFO: Running 124 job(s) on hmmsearch
+INFO: Running 124 job(s) on hmmsearch, starting at 07/01/2020 17:20:45
INFO: [hmmsearch] 13 of 124 task(s) completed
INFO: [hmmsearch] 25 of 124 task(s) completed
INFO: [hmmsearch] 38 of 124 task(s) completed
@@ -54,15 +56,15 @@ INFO: Results: C:0.0%[S:0.0%,D:0.0%],F:0.0%,M:100.0%,n:124
INFO: Downloading file 'https://busco-data.ezlab.org/v4/data/lineages/eukaryota_odb10.2019-11-20.tar.gz'
INFO: Decompressing file '/busco_wd/busco_downloads/lineages/eukaryota_odb10.tar.gz'
INFO: Running BUSCO using lineage dataset eukaryota_odb10 (eukaryota, 2019-11-20)
+INFO: Running 1 job(s) on makeblastdb, starting at 07/01/2020 17:20:48
INFO: Creating BLAST database with input file
-INFO: Running 1 job(s) on makeblastdb
INFO: [makeblastdb] 1 of 1 task(s) completed
INFO: Running a BLAST search for BUSCOs against created database
-INFO: Running 1 job(s) on tblastn
+INFO: Running 1 job(s) on tblastn, starting at 07/01/2020 17:20:48
INFO: [tblastn] 1 of 1 task(s) completed
INFO: Running Augustus gene predictor on BLAST search results.
INFO: Running Augustus prediction using fly as species:
-INFO: Running 52 job(s) on augustus
+INFO: Running 52 job(s) on augustus, starting at 07/01/2020 17:20:48
INFO: [augustus] 6 of 52 task(s) completed
INFO: [augustus] 11 of 52 task(s) completed
INFO: [augustus] 16 of 52 task(s) completed
@@ -75,7 +77,7 @@ INFO: [augustus] 47 of 52 task(s) completed
INFO: [augustus] 52 of 52 task(s) completed
INFO: Extracting predicted proteins...
INFO: ***** Run HMMER on gene sequences *****
-INFO: Running 50 job(s) on hmmsearch
+INFO: Running 50 job(s) on hmmsearch, starting at 07/01/2020 17:21:44
INFO: [hmmsearch] 5 of 50 task(s) completed
INFO: [hmmsearch] 10 of 50 task(s) completed
INFO: [hmmsearch] 15 of 50 task(s) completed
@@ -91,10 +93,10 @@ INFO: Results: C:15.3%[S:15.3%,D:0.0%],F:1.2%,M:83.5%,n:255
INFO: Starting second step of analysis. The gene predictor Augustus is retrained using the results from the initial run to yield more accurate results.
INFO: Extracting missing and fragmented buscos from the file ancestral_variants...
INFO: Running a BLAST search for BUSCOs against created database
+INFO: Running 1 job(s) on tblastn, starting at 07/01/2020 17:21:45
INFO: [tblastn] 1 of 1 task(s) completed
-INFO: Training Augustus using Single-Copy Complete BUSCOs:
INFO: Converting predicted genes to short genbank files
-INFO: Running 39 job(s) on gff2gbSmallDNA.pl
+INFO: Running 39 job(s) on gff2gbSmallDNA.pl, starting at 07/01/2020 17:21:49
INFO: [gff2gbSmallDNA.pl] 4 of 39 task(s) completed
INFO: [gff2gbSmallDNA.pl] 8 of 39 task(s) completed
INFO: [gff2gbSmallDNA.pl] 12 of 39 task(s) completed
@@ -105,14 +107,15 @@ INFO: [gff2gbSmallDNA.pl] 28 of 39 task(s) completed
INFO: [gff2gbSmallDNA.pl] 32 of 39 task(s) completed
INFO: [gff2gbSmallDNA.pl] 36 of 39 task(s) completed
INFO: [gff2gbSmallDNA.pl] 39 of 39 task(s) completed
-INFO: All files converted to short genbank files, now running the training scripts
-INFO: Running 1 job(s) on new_species.pl
+INFO: All files converted to short genbank files, now training Augustus using Single-Copy Complete BUSCOs
+INFO: Running 1 job(s) on new_species.pl, starting at 07/01/2020 17:21:49
INFO: [new_species.pl] 1 of 1 task(s) completed
-INFO: Running 1 job(s) on etraining
+INFO: Running 1 job(s) on etraining, starting at 07/01/2020 17:21:50
INFO: [etraining] 1 of 1 task(s) completed
INFO: Re-running Augustus with the new metaparameters, number of target BUSCOs: 216
INFO: Running Augustus gene predictor on BLAST search results.
INFO: Running Augustus prediction using BUSCO_test_eukaryota as species:
+INFO: Running 39 job(s) on augustus, starting at 07/01/2020 17:21:50
INFO: [augustus] 4 of 39 task(s) completed
INFO: [augustus] 8 of 39 task(s) completed
INFO: [augustus] 12 of 39 task(s) completed
@@ -125,17 +128,18 @@ INFO: [augustus] 36 of 39 task(s) completed
INFO: [augustus] 39 of 39 task(s) completed
INFO: Extracting predicted proteins...
INFO: ***** Run HMMER on gene sequences *****
-INFO: [hmmsearch] 4 of 37 task(s) completed
-INFO: [hmmsearch] 8 of 37 task(s) completed
-INFO: [hmmsearch] 12 of 37 task(s) completed
-INFO: [hmmsearch] 15 of 37 task(s) completed
-INFO: [hmmsearch] 19 of 37 task(s) completed
-INFO: [hmmsearch] 23 of 37 task(s) completed
-INFO: [hmmsearch] 26 of 37 task(s) completed
-INFO: [hmmsearch] 30 of 37 task(s) completed
-INFO: [hmmsearch] 34 of 37 task(s) completed
-INFO: [hmmsearch] 37 of 37 task(s) completed
-INFO: Results: C:18.8%[S:18.8%,D:0.0%],F:0.4%,M:80.8%,n:255
+INFO: Running 34 job(s) on hmmsearch, starting at 07/01/2020 17:22:01
+INFO: [hmmsearch] 4 of 34 task(s) completed
+INFO: [hmmsearch] 7 of 34 task(s) completed
+INFO: [hmmsearch] 11 of 34 task(s) completed
+INFO: [hmmsearch] 14 of 34 task(s) completed
+INFO: [hmmsearch] 17 of 34 task(s) completed
+INFO: [hmmsearch] 21 of 34 task(s) completed
+INFO: [hmmsearch] 24 of 34 task(s) completed
+INFO: [hmmsearch] 28 of 34 task(s) completed
+INFO: [hmmsearch] 31 of 34 task(s) completed
+INFO: [hmmsearch] 34 of 34 task(s) completed
+INFO: Results: C:18.8%[S:18.8%,D:0.0%],F:1.2%,M:80.0%,n:255
INFO: eukaryota_odb10 selected
@@ -155,18 +159,18 @@ INFO: Decompressing file '/busco_wd/busco_downloads/placement_files/mapping_taxi
INFO: Downloading file 'https://busco-data.ezlab.org/v4/data/placement_files/mapping_taxid-lineage.eukaryota_odb10.2019-12-16.txt.tar.gz'
INFO: Decompressing file '/busco_wd/busco_downloads/placement_files/mapping_taxid-lineage.eukaryota_odb10.2019-12-16.txt.tar.gz'
INFO: Place the markers on the reference tree...
-INFO: Running 1 job(s) on sepp
+INFO: Running 1 job(s) on sepp, starting at 07/01/2020 17:22:02
INFO: [sepp] 1 of 1 task(s) completed
INFO: Lineage saccharomycetes is selected, supported by 16 markers out of 17
INFO: Downloading file 'https://busco-data.ezlab.org/v4/data/lineages/saccharomycetes_odb10.2019-11-20.tar.gz'
INFO: Decompressing file '/busco_wd/busco_downloads/lineages/saccharomycetes_odb10.tar.gz'
INFO: Running BUSCO using lineage dataset saccharomycetes_odb10 (eukaryota, 2019-11-20)
INFO: Running a BLAST search for BUSCOs against created database
-INFO: Running 1 job(s) on tblastn
+INFO: Running 1 job(s) on tblastn, starting at 07/01/2020 17:25:10
INFO: [tblastn] 1 of 1 task(s) completed
INFO: Running Augustus gene predictor on BLAST search results.
INFO: Running Augustus prediction using aspergillus_nidulans as species:
-INFO: Running 98 job(s) on augustus
+INFO: Running 98 job(s) on augustus, starting at 07/01/2020 17:25:14
INFO: [augustus] 10 of 98 task(s) completed
INFO: [augustus] 20 of 98 task(s) completed
INFO: [augustus] 30 of 98 task(s) completed
@@ -179,7 +183,7 @@ INFO: [augustus] 89 of 98 task(s) completed
INFO: [augustus] 98 of 98 task(s) completed
INFO: Extracting predicted proteins...
INFO: ***** Run HMMER on gene sequences *****
-INFO: Running 63 job(s) on hmmsearch
+INFO: Running 63 job(s) on hmmsearch, starting at 07/01/2020 17:25:54
INFO: [hmmsearch] 7 of 63 task(s) completed
INFO: [hmmsearch] 13 of 63 task(s) completed
INFO: [hmmsearch] 19 of 63 task(s) completed
@@ -193,10 +197,10 @@ INFO: [hmmsearch] 63 of 63 task(s) completed
INFO: Starting second step of analysis. The gene predictor Augustus is retrained using the results from the initial run to yield more accurate results.
INFO: Extracting missing and fragmented buscos from the file ancestral_variants...
INFO: Running a BLAST search for BUSCOs against created database
+INFO: Running 1 job(s) on tblastn, starting at 07/01/2020 17:26:02
INFO: [tblastn] 1 of 1 task(s) completed
-INFO: Training Augustus using Single-Copy Complete BUSCOs:
INFO: Converting predicted genes to short genbank files
-INFO: Running 29 job(s) on gff2gbSmallDNA.pl
+INFO: Running 29 job(s) on gff2gbSmallDNA.pl, starting at 07/01/2020 17:27:08
INFO: [gff2gbSmallDNA.pl] 3 of 29 task(s) completed
INFO: [gff2gbSmallDNA.pl] 6 of 29 task(s) completed
INFO: [gff2gbSmallDNA.pl] 9 of 29 task(s) completed
@@ -207,14 +211,16 @@ INFO: [gff2gbSmallDNA.pl] 21 of 29 task(s) completed
INFO: [gff2gbSmallDNA.pl] 24 of 29 task(s) completed
INFO: [gff2gbSmallDNA.pl] 27 of 29 task(s) completed
INFO: [gff2gbSmallDNA.pl] 29 of 29 task(s) completed
-INFO: All files converted to short genbank files, now running the training scripts
-INFO: Running 1 job(s) on new_species.pl
+INFO: [gff2gbSmallDNA.pl] 29 of 29 task(s) completed
+INFO: All files converted to short genbank files, now training Augustus using Single-Copy Complete BUSCOs
+INFO: Running 1 job(s) on new_species.pl, starting at 07/01/2020 17:27:09
INFO: [new_species.pl] 1 of 1 task(s) completed
-INFO: Running 1 job(s) on etraining
+INFO: Running 1 job(s) on etraining, starting at 07/01/2020 17:27:09
INFO: [etraining] 1 of 1 task(s) completed
INFO: Re-running Augustus with the new metaparameters, number of target BUSCOs: 2108
INFO: Running Augustus gene predictor on BLAST search results.
INFO: Running Augustus prediction using BUSCO_test_eukaryota as species:
+INFO: Running 147 job(s) on augustus, starting at 07/01/2020 17:27:10
INFO: [augustus] 15 of 147 task(s) completed
INFO: [augustus] 30 of 147 task(s) completed
INFO: [augustus] 45 of 147 task(s) completed
@@ -227,42 +233,45 @@ INFO: [augustus] 133 of 147 task(s) completed
INFO: [augustus] 147 of 147 task(s) completed
INFO: Extracting predicted proteins...
INFO: ***** Run HMMER on gene sequences *****
-INFO: [hmmsearch] 15 of 144 task(s) completed
-INFO: [hmmsearch] 29 of 144 task(s) completed
-INFO: [hmmsearch] 44 of 144 task(s) completed
-INFO: [hmmsearch] 58 of 144 task(s) completed
-INFO: [hmmsearch] 73 of 144 task(s) completed
-INFO: [hmmsearch] 116 of 144 task(s) completed
-INFO: [hmmsearch] 130 of 144 task(s) completed
-INFO: [hmmsearch] 144 of 144 task(s) completed
-INFO: Results: C:2.0%[S:2.0%,D:0.0%],F:0.1%,M:97.9%,n:2137
+INFO: Running 140 job(s) on hmmsearch, starting at 07/01/2020 17:27:58
+INFO: [hmmsearch] 14 of 140 task(s) completed
+INFO: [hmmsearch] 28 of 140 task(s) completed
+INFO: [hmmsearch] 42 of 140 task(s) completed
+INFO: [hmmsearch] 56 of 140 task(s) completed
+INFO: [hmmsearch] 70 of 140 task(s) completed
+INFO: [hmmsearch] 84 of 140 task(s) completed
+INFO: [hmmsearch] 98 of 140 task(s) completed
+INFO: [hmmsearch] 112 of 140 task(s) completed
+INFO: [hmmsearch] 126 of 140 task(s) completed
+INFO: [hmmsearch] 140 of 140 task(s) completed
+INFO: Results: C:2.0%[S:2.0%,D:0.0%],F:0.3%,M:97.7%,n:2137
INFO:
--------------------------------------------------
|Results from generic domain eukaryota_odb10 |
--------------------------------------------------
- |C:18.8%[S:18.8%,D:0.0%],F:0.4%,M:80.8%,n:255 |
+ |C:18.8%[S:18.8%,D:0.0%],F:1.2%,M:80.0%,n:255 |
|48 Complete BUSCOs (C) |
|48 Complete and single-copy BUSCOs (S) |
|0 Complete and duplicated BUSCOs (D) |
- |1 Fragmented BUSCOs (F) |
- |206 Missing BUSCOs (M) |
+ |3 Fragmented BUSCOs (F) |
+ |204 Missing BUSCOs (M) |
|255 Total BUSCO groups searched |
--------------------------------------------------
--------------------------------------------------
|Results from dataset saccharomycetes_odb10 |
--------------------------------------------------
- |C:2.0%[S:2.0%,D:0.0%],F:0.1%,M:97.9%,n:2137 |
+ |C:2.0%[S:2.0%,D:0.0%],F:0.3%,M:97.7%,n:2137 |
|42 Complete BUSCOs (C) |
|42 Complete and single-copy BUSCOs (S) |
|0 Complete and duplicated BUSCOs (D) |
- |3 Fragmented BUSCOs (F) |
- |2092 Missing BUSCOs (M) |
+ |6 Fragmented BUSCOs (F) |
+ |2089 Missing BUSCOs (M) |
|2137 Total BUSCO groups searched |
--------------------------------------------------
-INFO: BUSCO analysis done with WARNING(s). Total running time: 212 seconds
+INFO: BUSCO analysis done with WARNING(s). Total running time: 440 seconds
***** Summary of warnings: *****
WARNING:busco.ConfigManager Running Auto Lineage Selector as no lineage dataset was specified. This will take a little longer than normal. If you know what lineage dataset you want to use, please specify this in the config file or using the -l (--lineage-dataset) flag in the command line.
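Because these expected-log fixtures are compared against freshly generated runs, the one-line "Results:" summary is the most stable anchor for a regression check. Below is a hedged sketch of parsing that line; the regex and the surrounding code are illustrative only and are not part of BUSCO's test harness.

```python
import re

# Example line taken verbatim from the updated eukaryota log above.
line = "INFO: Results: C:2.0%[S:2.0%,D:0.0%],F:0.3%,M:97.7%,n:2137"

# Capture the completeness categories (C/S/D/F/M) and the total group count (n).
pattern = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)

match = pattern.search(line)
if match:
    scores = {key: float(value) for key, value in match.groupdict().items()}
    print(scores)  # {'C': 2.0, 'S': 2.0, 'D': 0.0, 'F': 0.3, 'M': 97.7, 'n': 2137.0}
```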
View it on GitLab: https://salsa.debian.org/med-team/busco/-/commit/796d64b186a8ef1c25ffe9d60614420eab3bd6bc