[med-svn] [Git][med-team/snakemake][master] 4 commits: New upstream version 5.23.0

Rebecca N. Palmer gitlab at salsa.debian.org
Sat Aug 29 11:48:56 BST 2020



Rebecca N. Palmer pushed to branch master at Debian Med / snakemake


Commits:
c8348291 by Rebecca N. Palmer at 2020-08-28T07:55:08+01:00
New upstream version 5.23.0
- - - - -
0ce8679f by Rebecca N. Palmer at 2020-08-28T07:55:30+01:00
Update upstream source from tag 'upstream/5.23.0'

Update to upstream version '5.23.0'
with Debian dir c2da5c7512d9b4bc66f9609b3bab3c168d6038c4
- - - - -
19b32418 by Rebecca N. Palmer at 2020-08-28T08:40:34+01:00
Add new dependency python3-pulp.

- - - - -
a056d475 by Rebecca N. Palmer at 2020-08-29T11:48:20+01:00
Remove unnecessary test skips, and document remaining ones.

- - - - -


24 changed files:

- CHANGELOG.rst
- README.md
- debian/changelog
- debian/control
- debian/rules
- debian/tests/control
- debian/tests/run-unit-test
- docs/project_info/faq.rst
- docs/snakefiles/configuration.rst
- docs/snakefiles/deployment.rst
- docs/snakefiles/rules.rst
- setup.py
- snakemake/__init__.py
- snakemake/_version.py
- snakemake/parser.py
- snakemake/remote/AzBlob.py
- snakemake/remote/GS.py
- snakemake/scheduler.py
- snakemake/workflow.py
- test-environment.yml
- + tests/test_peppy/pep/config.yaml
- + tests/test_peppy/pep/sample_table.csv
- + tests/test_peppy/workflow/Snakefile
- + tests/test_peppy/workflow/schemas/pep.yaml


Changes:

=====================================
CHANGELOG.rst
=====================================
@@ -1,3 +1,16 @@
+[5.23.0] - 2020-08-24
+=====================
+Added
+-----
+- Support for workflow configuration via portable encapsulated projects (PEPs, https://pep.databio.org).
+- A new ILP based default scheduler now ensures that temporary files are deleted as fast as possible (@FelixMoelder, @johanneskoester).
+
+Changed
+-------
+- Fixed bug in modification date comparison for files in google storage (@vsoch).
+- Various small documentation improvements (@dcroote, @erjel, @dlaehnemann, @goi42).
+
+
 [5.22.1] - 2020-08-14
 =====================
 Changed


=====================================
README.md
=====================================
@@ -6,7 +6,7 @@
 [![Stack Overflow](https://img.shields.io/badge/stack-overflow-orange.svg)](https://stackoverflow.com/questions/tagged/snakemake)
 [![Twitter](https://img.shields.io/twitter/follow/johanneskoester.svg?style=social&label=Follow)](https://twitter.com/search?l=&q=%23snakemake%20from%3Ajohanneskoester)
 [![Github stars](https://img.shields.io/github/stars/snakemake/snakemake?style=social)](https://github.com/snakemake/snakemake/stargazers)
-[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-v2.0%20adopted-ff69b4.svg)](code_of_conduct.md) 
+[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-v2.0%20adopted-ff69b4.svg)](CODE_OF_CONDUCT.md) 
 
 # Snakemake
 


=====================================
debian/changelog
=====================================
@@ -1,3 +1,11 @@
+snakemake (5.23.0-1) unstable; urgency=medium
+
+  * New upstream release.
+  * Add new dependency python3-pulp.
+  * Remove unnecessary test skips, and document remaining ones.
+
+ -- Rebecca N. Palmer <rebecca_palmer at zoho.com>  Sat, 29 Aug 2020 11:46:20 +0100
+
 snakemake (5.22.1-1) unstable; urgency=medium
 
   * New upstream release.  Refresh patches.


=====================================
debian/control
=====================================
@@ -7,6 +7,7 @@ Build-Depends: ca-certificates,
                cwltool,
                debhelper-compat (= 13),
                dh-python,
+               environment-modules,
                imagemagick,
                python3,
                python3-appdirs,
@@ -26,6 +27,7 @@ Build-Depends: ca-certificates,
                python3-pandas,
                python3-pkg-resources,
                python3-psutil,
+               python3-pulp,
                python3-pygments,
                python3-pygraphviz,
                python3-pytest,
@@ -68,6 +70,7 @@ Depends: ca-certificates,
          python3-nbformat,
          python3-pkg-resources,
          python3-psutil,
+         python3-pulp,
          python3-ratelimiter,
          python3-requests,
          python3-toposort,


=====================================
debian/rules
=====================================
@@ -6,21 +6,21 @@
 export PYBUILD_NAME=snakemake
 export PYBUILD_DESTDIR_python3=debian/snakemake
 export PYBUILD_BEFORE_TEST_python3=chmod +x {dir}/bin/snakemake; cp -r {dir}/bin {dir}/tests {build_dir}
-export PYBUILD_TEST_ARGS=python{version} -m pytest -v tests/test*.py -k 'not report and not ancient and not test_script and not default_remote and not issue635 and not convert_to_cwl and not issue1083 and not issue1092 and not issue1093 and not test_remote and not test_default_resources and not test_tibanna and not test_github_issue78 and not test_output_file_cache_remote and not test_env_modules and not test_archive and not test_container and not test_jupyter_notebook and not test_conda and not test_upstream_conda and not test_conda_custom_prefix and not test_wrapper'
-
-# test_report
-# test_ancient
-# test_script requires conda; manually disabling conda show the need for the binary 'julia'
-# test_default_remote requires network access
-# test_issue634 requires conda, but passes when conda is turned off manually
-# test_convert_to_cwl tries to build a singularity format software image from docker://quay.io/snakemake/snakemake:v5.5.4
-# test_issue1083 tries to build a singularity format software image from docker://bash
-# test_issue1093 fails due to conda usage; commenting that out and installing bwa produces a different ordering than desired
-# test_default_resources and test_remote needs moto to be packaged https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=777089
-# test_env_modules relies on "module load" which is not packaged for Debian
-# test_archive, test_jupyter_notebook, test_conda, test_upstream_conda, test_conda_custom_prefix use conda
-# test_wrapper uses conda and a downloaded file
-# test_container uses docker://bash
+export PYBUILD_TEST_ARGS=python{version} -m pytest -v tests/test*.py -k 'not test_cwl and not test_cwl_singularity and not test_url_include and not test_wrapper and not test_issue1083 and not test_github_issue78 and not test_container and not test_singularity and not test_singularity_conda and not test_convert_to_cwl and not test_report and not test_report_zip and not test_archive and not test_jupyter_notebook and not test_conda and not test_upstream_conda and not test_conda_custom_prefix and not test_script and not test_issue635 and not test_issue1093 and not test_default_resources and not test_default_remote and not test_remote and not test_output_file_cache_remote and not test_tibanna and not test_ancient'
+
+# Skipped because they download executables (which is forbidden in Debian buildd and debci)
+# Snakefile/snakemake-wrapper/CWL scripts: test_cwl, test_cwl_singularity, test_url_include, test_wrapper
+# Docker images: test_issue1083, test_github_issue78, test_container, test_singularity, test_singularity_conda, test_cwl_singularity, test_convert_to_cwl, test_report, test_report_zip
+# Conda packages: test_archive, test_jupyter_notebook, test_conda, test_upstream_conda, test_conda_custom_prefix, test_wrapper, test_script, test_issue635, test_issue1093, test_singularity_conda, test_report, test_report_zip
+# Note that even if conda were added to Debian (ITP #926416), using it normally would still mean downloading executables, which is forbidden
+# Modifications to run without conda were previously attempted for test_issue635 (succeeded) and test_issue1093 (failed, as current bwa produces a different ordering), but are not in current use
+
+# Skipped due to not-in-Debian dependencies
+# python3-moto (RFP #777089): test_default_resources, test_default_remote, test_remote, test_output_file_cache_remote
+# python3-tibanna: test_tibanna
+
+# Skipped in build due to network use, but run in autopkgtest: test_ancient
+# Tests marked @connected skip themselves in this case
 
 export PYBUILD_AFTER_TEST_python3=rm -fr {build_dir}/bin {build_dir}/tests {dir}/tests/test_filegraph/.snakemake/ {dir}/tests/linting/*/.snakemake/
 
@@ -44,5 +44,6 @@ override_dh_auto_install:
 	find . -name '*.pyc' | xargs rm -Rf
 	find debian -name '.gitignore' | xargs rm -Rf
 
+# sourcing modules.sh here is a workaround for #928826; putting it in PYBUILD_BEFORE_TEST_python3 doesn't work
 override_dh_auto_test:
-	PYBUILD_SYSTEM=custom dh_auto_test
+	. /etc/profile.d/modules.sh && PYBUILD_SYSTEM=custom dh_auto_test


=====================================
debian/tests/control
=====================================
@@ -5,5 +5,10 @@ Restrictions: allow-stderr
 
 # upstream's tests
 Tests: run-unit-test
-Depends: snakemake, python3-pandas, python3-pytest, python3-pygraphviz, stress
+Depends: snakemake,
+         environment-modules,
+         python3-pygraphviz,
+         python3-pandas,
+         python3-pytest,
+         stress
 Restrictions: allow-stderr


=====================================
debian/tests/run-unit-test
=====================================
@@ -17,6 +17,9 @@ cd "${AUTOPKGTEST_TMP}"
 
 export HOME="${AUTOPKGTEST_TMP}"
 
+# workaround for #928826
+. /etc/profile.d/modules.sh
+
 #See debian/rules for why these are excluded
-python3 -m pytest -v ${ROOT}/tests/test*.py -k 'not report and not ancient and not test_script and not default_remote and not issue635 and not convert_to_cwl and not issue1083 and not issue1092 and not issue1093 and not test_remote and not test_default_resources and not test_singularity and not test_singularity_conda and not test_cwl_singularity and not test_cwl and not test_url_include and not test_tibanna and not test_github_issue78 and not test_output_file_cache_remote and not test_env_modules and not test_archive and not test_container and not test_jupyter_notebook and not test_conda and not test_upstream_conda and not test_conda_custom_prefix and not test_wrapper'
+python3 -m pytest -v ${ROOT}/tests/test*.py -k 'not test_cwl and not test_cwl_singularity and not test_url_include and not test_wrapper and not test_issue1083 and not test_github_issue78 and not test_container and not test_singularity and not test_singularity_conda and not test_convert_to_cwl and not test_report and not test_report_zip and not test_archive and not test_jupyter_notebook and not test_conda and not test_upstream_conda and not test_conda_custom_prefix and not test_script and not test_issue635 and not test_issue1093 and not test_default_resources and not test_default_remote and not test_remote and not test_output_file_cache_remote and not test_tibanna'
 


=====================================
docs/project_info/faq.rst
=====================================
@@ -515,7 +515,23 @@ If you are just interested in the final summary, you can use the ``--quiet`` fla
 Git is messing up the modification times of my input files, what can I do?
 --------------------------------------------------------------------------
 
-When you checkout a git repository, the modification times of updated files are set to the time of the checkout. If you rely on these files as input **and** output files in your workflow, this can cause trouble. For example, Snakemake could think that a certain (git-tracked) output has to be re-executed, just because its input has been checked out a bit later. In such cases, it is advisable to set the file modification dates to the last commit date after an update has been pulled. See `here <https://stackoverflow.com/questions/2458042/restore-files-modification-time-in-git/22638823#22638823>`_ for a solution to achieve this.
+When you check out a git repository, the modification times of updated files are set to the time of the checkout. If you rely on these files as input **and** output files in your workflow, this can cause trouble. For example, Snakemake could think that a certain (git-tracked) output has to be re-executed just because its input was checked out a bit later. In such cases, it is advisable to set the file modification dates to the last commit date after an update has been pulled. One solution is to add the following lines to your ``.bashrc`` (or similar):
+
+.. code-block:: bash
+
+    gitmtim(){
+        local f
+        for f; do
+            touch -d @0`git log --pretty=%at -n1 -- "$f"` "$f"
+        done
+    }
+    gitmodtimes(){
+        for f in $(git ls-tree -r $(git rev-parse --abbrev-ref HEAD) --name-only); do
+            gitmtim $f
+        done
+    }
+
+(inspired by the answer `here <https://stackoverflow.com/questions/2458042/restore-files-modification-time-in-git/22638823#22638823>`_). You can then run ``gitmodtimes`` to update the modification times of all tracked files on the current branch to their last commit time in git; BE CAREFUL: this does not account for local changes that have not been committed.
 
 How do I exit a running Snakemake workflow?
 -------------------------------------------


=====================================
docs/snakefiles/configuration.rst
=====================================
@@ -158,6 +158,41 @@ the schema for validating the samples data frame looks like this:
 Here, in case the case column is missing, the validate function will
 populate it with True for all entries.
 
+.. _snakefiles-peps:
+
+-------------------------------------------
+Configuring scientific experiments via PEPs
+-------------------------------------------
+
+Scientific experiments often consist of a set of samples (with optional subsamples) for which raw data and metainformation are known.
+Instead of writing custom sample sheets as shown above, Snakemake allows the use of `portable encapsulated project (PEP) <http://pep.databio.org>`_ definitions to configure a workflow.
+This is done via the special directive ``pepfile``, which can optionally be complemented by a schema for validation (recommended for production workflows):
+
+.. code-block:: python
+
+    pepfile: "pep/config.yaml"
+    pepschema: "schemas/pep.yaml"
+
+    rule all:
+        input:
+            expand("{sample}.txt", sample=pep.sample_table["sample_name"])
+
+    rule a:
+        output:
+            "{sample}.txt"
+        shell:
+            "touch {output}"
+
+Using the ``pepfile`` directive leads to parsing of the provided PEP with `peppy <http://peppy.databio.org>`_.
+The resulting project object is made globally available under the name ``pep``.
+Here, we use it to aggregate over the set of sample names that is defined in the corresponding PEP.
+
+**Importantly**, note that PEPs are meant to contain sample metadata and any global information about a project or experiment. 
+They should **not** be used to encode workflow specific configuration options.
+For those, one should always complement the pepfile with an ordinary :ref:`config file <snakefiles_standard_configuration>`.
+The rationale is that PEPs should be portable between different data analysis workflows (that could be applied to the same data) and even between workflow management systems.
+In other words, a PEP should describe everything needed about the data, while a workflow and its configuration should describe everything needed about the analysis that is applied to it.
+
 .. _snakefiles-cluster_configuration:
 
 ----------------------------------
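
Under the hood, the ``pepfile`` directive simply hands the given path to peppy (see the ``snakemake/workflow.py`` hunk below), and the resulting project object becomes the global ``pep``. A minimal sketch of the equivalent plain-Python steps, assuming peppy is installed and the working directory is ``tests/test_peppy``:

    import peppy

    # What `pepfile: "pep/config.yaml"` boils down to: parse the PEP config
    # and expose the project object (Snakefiles see it as the global `pep`).
    pep = peppy.Project("pep/config.yaml")

    # The sample names the docs example aggregates over; with the test data
    # added in this push, this prints ['a', 'b'].
    print(list(pep.sample_table["sample_name"]))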


=====================================
docs/snakefiles/deployment.rst
=====================================
@@ -112,7 +112,7 @@ The path to the environment definition is interpreted as **relative to the Snake
 Snakemake will store the environment persistently in ``.snakemake/conda/$hash`` with ``$hash`` being the MD5 hash of the environment definition file content. This way, updates to the environment definition are automatically detected.
 Note that you need to clean up environments manually for now. However, in many cases they are lightweight and consist of symlinks to your central conda installation.
 
-Conda deployment also works well for offline or air-gapped environments. Running ``snakemake -n --use-conda --create-envs-only`` will only install the required conda environments without running the full workflow. Subsequent runs with ``--use-conda`` will make use of the local environments without requiring internet access.
+Conda deployment also works well for offline or air-gapped environments. Running ``snakemake --use-conda --conda-create-envs-only`` will only install the required conda environments without running the full workflow. Subsequent runs with ``--use-conda`` will make use of the local environments without requiring internet access.
 
 .. _singularity:
 


=====================================
docs/snakefiles/rules.rst
=====================================
@@ -263,7 +263,11 @@ In particular, it should be noted that the specified threads have to be seen as
 
 Hardcoding a particular maximum number of threads like above is useful when a certain tool has a natural maximum beyond which parallelization won't help to further speed it up.
 This is often the case, and should be evaluated carefully for production workflows.
-If it is certain that no such maximum exists for a tool, one can instead define threads as a function of the number of cores given to Snakemake:
+Also, setting a ``threads:`` maximum is required to achieve parallelism in tools that (often implicitly, and without the user knowing) rely on an environment variable for the maximum number of cores to use.
+For example, this is the case for many linear algebra libraries and for OpenMP.
+Snakemake limits the respective environment variables to one core by default, to avoid unexpected and unlimited core-grabbing, but overrides this with the ``threads:`` you specify in a rule (the variables set to ``threads``, or defaulting to ``1``, are: ``OMP_NUM_THREADS``, ``GOTO_NUM_THREADS``, ``OPENBLAS_NUM_THREADS``, ``MKL_NUM_THREADS``, ``VECLIB_MAXIMUM_THREADS``, ``NUMEXPR_NUM_THREADS``).
+
+If it is certain that no maximum for efficient parallelism exists for a tool, one can instead define threads as a function of the number of cores given to Snakemake:
 
 .. code-block:: python
 


=====================================
setup.py
=====================================
@@ -67,6 +67,7 @@ setup(
         "psutil",
         "nbformat",
         "toposort",
+        "pulp",
     ],
     extras_require={
         "reports": ["jinja2", "networkx", "pygments", "pygraphviz"],
@@ -77,6 +78,10 @@ setup(
             "google-api-python-client",
             "google-cloud-storage",
         ],
+        "pep": [
+            "peppy",
+            "eido",
+        ]
     },
     classifiers=[
         "Development Status :: 5 - Production/Stable",


=====================================
snakemake/__init__.py
=====================================
@@ -137,6 +137,7 @@ def snakemake(
     list_conda_envs=False,
     singularity_prefix=None,
     shadow_prefix=None,
+    scheduler=None,
     conda_create_envs_only=False,
     mode=Mode.default,
     wrapper_prefix=None,
@@ -276,6 +277,7 @@ def snakemake(
         keep_incomplete (bool):     keep incomplete output files of failed jobs
         edit_notebook (object):     "notebook.Listen" object to configuring notebook server for interactive editing of a rule notebook. If None, do not edit.
         log_handler (list):         redirect snakemake output to this list of custom log handler, each a function that takes a log message dictionary (see below) as its only argument (default []). The log message dictionary for the log handler has to following entries:
+        scheduler (str):            Select scheduling algorithm (default ilp)
 
             :level:
                 the log level ("info", "error", "debug", "progress", "job_info")
@@ -517,6 +519,7 @@ def snakemake(
             singularity_prefix=singularity_prefix,
             shadow_prefix=shadow_prefix,
             singularity_args=singularity_args,
+            scheduler_type=scheduler,
             mode=mode,
             wrapper_prefix=wrapper_prefix,
             printshellcmds=printshellcmds,
@@ -605,6 +608,7 @@ def snakemake(
                     singularity_prefix=singularity_prefix,
                     shadow_prefix=shadow_prefix,
                     singularity_args=singularity_args,
+                    scheduler=scheduler,
                     list_conda_envs=list_conda_envs,
                     kubernetes=kubernetes,
                     container_image=container_image,
@@ -628,6 +632,7 @@ def snakemake(
                     targets=targets,
                     dryrun=dryrun,
                     touch=touch,
+                    scheduler_type=scheduler,
                     local_cores=local_cores,
                     forcetargets=forcetargets,
                     forceall=forceall,
@@ -1147,6 +1152,16 @@ def get_argument_parser(profile=None):
             "If not supplied, the value is set to the '.snakemake' directory relative "
             "to the working directory."
         ),
+    ),
+    group_exec.add_argument(
+        "--scheduler",
+        default="ilp",
+        nargs="?",
+        choices=["ilp", "greedy"],
+        help=(
+            "Specifies if jobs are selected by a greedy algorithm or by solving an ilp. "
+            "The ilp scheduler aims to reduce runtime and hdd usage by best possible use of resources."
+        ),
     )
 
     group_report = parser.add_argument_group("REPORTS")
@@ -2371,6 +2386,7 @@ def main(argv=None):
             singularity_prefix=args.singularity_prefix,
             shadow_prefix=args.shadow_prefix,
             singularity_args=args.singularity_args,
+            scheduler=args.scheduler,
             conda_create_envs_only=args.conda_create_envs_only,
             mode=args.mode,
             wrapper_prefix=args.wrapper_prefix,
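
The new scheduler is selectable both on the command line (``--scheduler``, choices ``ilp``/``greedy``, default ``ilp``) and via the matching ``scheduler`` argument of the ``snakemake()`` API function shown above. A minimal sketch of the API route, assuming a Snakefile in the current working directory:

    from snakemake import snakemake

    # Opt back into the pre-5.23 greedy behaviour; equivalent to running
    # `snakemake --cores 4 --scheduler greedy` on the command line.
    ok = snakemake("Snakefile", cores=4, scheduler="greedy")
    assert ok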


=====================================
snakemake/_version.py
=====================================
@@ -22,9 +22,9 @@ def get_keywords():
     # setup.py/versioneer.py will grep for the variable names, so they must
     # each be defined on a line of their own. _version.py will just call
     # get_keywords().
-    git_refnames = " (HEAD -> master, tag: v5.22.1)"
-    git_full = "982f1d1f5bb55eda1caa8f15853041922ef4bea0"
-    git_date = "2020-08-14 08:41:41 +0200"
+    git_refnames = " (HEAD -> master, tag: v5.23.0)"
+    git_full = "5c2f2b133fff051a8386625ead1c4a4bf04dbae0"
+    git_date = "2020-08-24 15:07:04 +0200"
     keywords = {"refnames": git_refnames, "full": git_full, "date": git_date}
     return keywords
 


=====================================
snakemake/parser.py
=====================================
@@ -212,9 +212,7 @@ class RuleKeywordState(KeywordState):
         yield "@workflow.{keyword}(".format(keyword=self.keyword)
 
 
-class SubworkflowKeywordState(KeywordState):
-    prefix = "Subworkflow"
-
+class SectionKeywordState(KeywordState):
     def start(self):
         yield ", {keyword}=".format(keyword=self.keyword)
 
@@ -244,6 +242,17 @@ class Configfile(GlobalKeywordState):
     pass
 
 
+# PEPs
+
+
+class Pepfile(GlobalKeywordState):
+    pass
+
+
+class Pepschema(GlobalKeywordState):
+    pass
+
+
 class Report(GlobalKeywordState):
     pass
 
@@ -283,6 +292,10 @@ class GlobalContainer(GlobalKeywordState):
 # subworkflows
 
 
+class SubworkflowKeywordState(SectionKeywordState):
+    prefix = "Subworkflow"
+
+
 class SubworkflowSnakefile(SubworkflowKeywordState):
     pass
 
@@ -777,6 +790,8 @@ class Python(TokenAutomaton):
         include=Include,
         workdir=Workdir,
         configfile=Configfile,
+        pepfile=Pepfile,
+        pepschema=Pepschema,
         report=Report,
         ruleorder=Ruleorder,
         rule=Rule,


=====================================
snakemake/remote/AzBlob.py
=====================================
@@ -292,9 +292,7 @@ class AzureStorageHelper(object):
             # just create an empty file with the right timestamps
             ts = b.get_blob_properties().last_modified.timestamp()
             with open(destination_path, "wb") as fp:
-                os.utime(
-                    fp.name, (ts, ts),
-                )
+                os.utime(fp.name, (ts, ts))
         return destination_path
 
     def delete_from_container(self, container_name, blob_name):


=====================================
snakemake/remote/GS.py
=====================================
@@ -143,7 +143,7 @@ class RemoteObject(AbstractRemoteObject):
             # By way of being listed, it exists. mtime is a datetime object
             name = "{}/{}".format(blob.bucket.name, blob.name)
             cache.exists_remote[name] = True
-            cache.mtime[name] = blob.updated
+            cache.mtime[name] = blob.updated.timestamp()
             cache.size[name] = blob.size
 
         # Mark bucket and prefix as having an inventory, such that this method is
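
This one-line change is the google storage mtime fix mentioned in the changelog: ``blob.updated`` is a timezone-aware ``datetime``, while mtimes are otherwise handled as seconds-since-epoch floats (as returned by ``os.stat().st_mtime``), so caching the raw object presumably broke modification date comparisons. A standard-library-only illustration of the conversion:

    from datetime import datetime, timezone

    # google-cloud-storage returns blob.updated as an aware datetime ...
    updated = datetime(2020, 8, 24, 15, 7, 4, tzinfo=timezone.utc)

    # ... and .timestamp() turns it into the float form the mtime cache
    # can compare against local file mtimes.
    print(updated.timestamp())  # 1598281624.0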


=====================================
snakemake/scheduler.py
=====================================
@@ -6,11 +6,14 @@ __license__ = "MIT"
 import os, signal, sys
 import threading
 import operator
+import time
+import math
+
 from functools import partial
 from collections import defaultdict
-from itertools import chain, accumulate
+from itertools import chain, accumulate, product
 from contextlib import ContextDecorator
-import time
+
 
 from snakemake.executors import DryrunExecutor, TouchExecutor, CPUExecutor
 from snakemake.executors import (
@@ -83,6 +86,7 @@ class JobScheduler:
         force_use_threads=False,
         assume_shared_fs=True,
         keepincomplete=False,
+        scheduler_type=None,
     ):
         """ Create a new instance of KnapsackJobScheduler. """
         from ratelimiter import RateLimiter
@@ -102,6 +106,7 @@ class JobScheduler:
         self.greediness = 1
         self.max_jobs_per_second = max_jobs_per_second
         self.keepincomplete = keepincomplete
+        self.scheduler_type = scheduler_type
 
         self.global_resources = {
             name: (sys.maxsize if res is None else res)
@@ -391,7 +396,11 @@ class JobScheduler:
                         "Ready jobs ({}):\n\t".format(len(needrun))
                         + "\n\t".join(map(str, needrun))
                     )
-                    run = self.job_selector(needrun)
+                    run = (
+                        self.job_selector_greedy(needrun)
+                        if self.scheduler_type == "greedy"
+                        else self.job_selector_ilp(needrun)
+                    )
                     logger.debug(
                         "Selected jobs ({}):\n\t".format(len(run))
                         + "\n\t".join(map(str, run))
@@ -522,11 +531,98 @@ class JobScheduler:
             self._user_kill = "graceful"
         self._open_jobs.release()
 
-    def job_selector(self, jobs):
+    def job_selector_ilp(self, jobs):
+        """
+        Job scheduling by optimizing resource usage, solved as an ILP with pulp
+        """
+        import pulp
+        from pulp import lpSum
+
+        # assert self.resources["_cores"] > 0
+        scheduled_jobs = {
+            job: pulp.LpVariable(
+                f"job_{job}_{idx}", lowBound=0, upBound=1, cat=pulp.LpInteger
+            )
+            for idx, job in enumerate(jobs)
+        }
+
+        temp_files = {
+            temp_file for job in jobs for temp_file in self.dag.temp_input(job)
+        }
+
+        temp_job_improvement = {
+            temp_file: pulp.LpVariable(
+                temp_file, lowBound=0, upBound=1, cat="Continuous"
+            )
+            for temp_file in temp_files
+        }
+        prob = pulp.LpProblem("Job scheduler", pulp.LpMaximize)
+
+        total_temp_size = max(sum([temp_file.size for temp_file in temp_files]), 1)
+        total_core_requirement = sum(
+            [job.resources.get("_cores", 1) + 1 for job in jobs]
+        )
+        # Objective function
+        # Job priority > Core load
+        # Core load > temp file removal
+        # Instant removal > temp size
+        # temp file size > fast removal?!
+        prob += (
+            total_core_requirement
+            * total_temp_size
+            * lpSum([job.priority * scheduled_jobs[job] for job in jobs])
+            + total_temp_size
+            * lpSum(
+                [
+                    (job.resources.get("_cores", 1) + 1) * scheduled_jobs[job]
+                    for job in jobs
+                ]
+            )
+            + lpSum(
+                [
+                    temp_job_improvement[temp_file] * temp_file.size
+                    for temp_file in temp_files
+                ]
+            )
+        )
+
+        # Constraints:
+        for name in self.workflow.global_resources:
+            prob += (
+                lpSum(
+                    [scheduled_jobs[job] * job.resources.get(name, 0) for job in jobs]
+                )
+                <= self.resources[name],
+                f"Limitation of resource: {name}",
+            )
+
+        # Choose jobs that lead to "fastest" (minimum steps) removal of existing temp file
+        for temp_file in temp_files:
+            prob += temp_job_improvement[temp_file] <= lpSum(
+                [
+                    scheduled_jobs[job] * self.required_by_job(temp_file, job)
+                    for job in jobs
+                ]
+            ) / lpSum([self.required_by_job(temp_file, job) for job in jobs])
+
+        prob.solve()
+        selected_jobs = [
+            job for job, variable in scheduled_jobs.items() if variable.value() == 1.0
+        ]
+        for name in self.workflow.global_resources:
+            self.resources[name] -= sum(
+                [job.resources.get(name, 0) for job in selected_jobs]
+            )
+        return selected_jobs
+
+    def required_by_job(self, temp_file, job):
+        return 1 if temp_file in self.dag.temp_input(job) else 0
+
+    def job_selector_greedy(self, jobs):
         """
         Using the greedy heuristic from
         "A Greedy Algorithm for the General Multidimensional Knapsack
-Problem", Akcay, Li, Xu, Annals of Operations Research, 2012
+        Problem", Akcay, Li, Xu, Annals of Operations Research, 2012
 
         Args:
             jobs (list):    list of jobs
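
The ILP above folds several concerns into one weighted objective (job priority, then core load, then temp-file removal). Reduced to the core pulp pattern it relies on (binary selection variables, a linear objective, and per-resource capacity constraints), a self-contained toy version, with job data invented for illustration, looks like this:

    import pulp
    from pulp import lpSum

    # Toy inputs: per-job core needs and priorities, with 4 cores available.
    cores = {"a": 2, "b": 2, "c": 3}
    priority = {"a": 1, "b": 1, "c": 5}

    # One binary variable per job: 1 = run now, 0 = leave queued.
    x = {
        j: pulp.LpVariable(f"job_{j}", lowBound=0, upBound=1, cat=pulp.LpInteger)
        for j in cores
    }

    prob = pulp.LpProblem("toy_scheduler", pulp.LpMaximize)
    prob += lpSum(priority[j] * x[j] for j in cores)    # objective: total priority
    prob += lpSum(cores[j] * x[j] for j in cores) <= 4  # constraint: core budget

    prob.solve()
    print([j for j in cores if x[j].value() == 1.0])    # -> ['c']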


=====================================
snakemake/workflow.py
=====================================
@@ -91,6 +91,7 @@ class Workflow:
         singularity_prefix=None,
         singularity_args="",
         shadow_prefix=None,
+        scheduler_type=None,
         mode=Mode.default,
         wrapper_prefix=None,
         printshellcmds=False,
@@ -155,6 +156,7 @@ class Workflow:
         self.singularity_prefix = singularity_prefix
         self.singularity_args = singularity_args
         self.shadow_prefix = shadow_prefix
+        self.scheduler_type = scheduler_type
         self.global_container_img = None
         self.mode = mode
         self.wrapper_prefix = wrapper_prefix
@@ -454,6 +456,7 @@ class Workflow:
         targets=None,
         dryrun=False,
         touch=False,
+        scheduler_type=None,
         local_cores=1,
         forcetargets=False,
         forceall=False,
@@ -891,6 +894,7 @@ class Workflow:
             force_use_threads=force_use_threads,
             assume_shared_fs=assume_shared_fs,
             keepincomplete=keepincomplete,
+            scheduler_type=scheduler_type,
         )
 
         if not dryrun:
@@ -1085,6 +1089,32 @@ class Workflow:
         update_config(config, c)
         update_config(config, self.overwrite_config)
 
+    def pepfile(self, path):
+        global pep
+
+        try:
+            import peppy
+        except ImportError:
+            raise WorkflowError("For PEP support, please install peppy.")
+
+        self.pepfile = path
+        pep = peppy.Project(self.pepfile)
+
+    def pepschema(self, schema):
+        global pep
+
+        try:
+            import eido
+        except ImportError:
+            raise WorkflowError("For PEP schema support, please install eido.")
+
+        if urlparse(schema).scheme == "" and not os.path.isabs(schema):
+            # schema is relative to current Snakefile
+            schema = os.path.join(self.current_basedir, schema)
+        if self.pepfile is None:
+            raise WorkflowError("Please specify a PEP with the pepfile directive.")
+        eido.validate_project(project=pep, schema=schema, exclude_case=True)
+
     def report(self, path):
         """ Define a global report description in .rst format."""
         self.report_text = os.path.join(self.current_basedir, path)


=====================================
test-environment.yml
=====================================
@@ -42,9 +42,12 @@ dependencies:
   - pygraphviz
   - python-kubernetes
   - gitpython
+  - pulp
   - tibanna
   - environment-modules
   - nbformat
   - toposort
   - mamba
   - crc32c
+  - peppy
+  - eido


=====================================
tests/test_peppy/pep/config.yaml
=====================================
@@ -0,0 +1,2 @@
+pep_version: "2.0.0"
+sample_table: sample_table.csv


=====================================
tests/test_peppy/pep/sample_table.csv
=====================================
@@ -0,0 +1,3 @@
+sample_name,protocol
+a,test
+b,test2


=====================================
tests/test_peppy/workflow/Snakefile
=====================================
@@ -0,0 +1,12 @@
+pepfile: "pep/config.yaml"
+pepschema: "schemas/pep.yaml"
+
+rule all:
+    input:
+        expand("{sample}.txt", sample=pep.sample_table["sample_name"])
+
+rule a:
+    output:
+        "{sample}.txt"
+    shell:
+        "touch {output}"


=====================================
tests/test_peppy/workflow/schemas/pep.yaml
=====================================
@@ -0,0 +1,68 @@
+description: "Schema for a minimal PEP"
+version: "2.0.0"
+properties:
+  name: 
+    type: string
+    pattern: "^\\S*$"
+    description: "Project name with no whitespace"
+  config:
+    pep_version:
+      description: "Version of the PEP Schema this PEP follows"
+      type: string
+    sample_table:
+      type: string
+      description: "Path to the sample annotation table with one row per sample"
+    subsample_table:
+      type: string
+      description: "Path to the subsample annotation table with one row per subsample and sample_name attribute matching an entry in the sample table"
+    sample_modifiers:
+      type: object
+      properties:
+        append:
+          type: object
+        duplicate:
+          type: object
+        imply:
+          type: array
+          items:
+            type: object
+            properties:
+              if:
+                type: object
+              then:
+                type: object
+        derive:
+          type: object
+          properties:
+            attributes:
+              type: array
+              items:
+                type: string
+            sources:
+              type: object
+    project_modifiers:
+      type: object
+      properties:
+        amend:
+          description: "Object overwriting original project attributes"
+          type: object
+        import:
+          description: "List of external PEP project config files to import"
+          type: array
+          items:
+            type: string
+    required:
+      - pep_version
+  samples:
+    type: array
+    items:
+      type: object
+      properties:
+        sample_name: 
+          type: string
+          pattern: "^\\S*$"
+          description: "Unique name of the sample with no whitespace"
+      required:
+        - sample_name
+required:
+  - samples



View it on GitLab: https://salsa.debian.org/med-team/snakemake/-/compare/4b95c63220e2d1360913804c11116ddd7a1a7fa6...a056d4757cee7ceeb9fa894d04268a56381678a2


