[med-svn] [Git][med-team/augur][master] 4 commits: routine-update: New upstream version

Sun Jan 16 07:24:27 GMT 2022


Andreas Tille pushed to branch master at Debian Med / augur


Commits:
0a814899 by Andreas Tille at 2022-01-16T08:12:38+01:00
routine-update: New upstream version

- - - - -
f4634e10 by Andreas Tille at 2022-01-16T08:12:39+01:00
New upstream version 13.1.0
- - - - -
6ad295d8 by Andreas Tille at 2022-01-16T08:15:08+01:00
Update upstream source from tag 'upstream/13.1.0'

Update to upstream version '13.1.0'
with Debian dir 7caae5506f58643b1b515862f46eb1912536d77d
- - - - -
f8d595fa by Andreas Tille at 2022-01-16T08:21:50+01:00
routine-update: Ready to upload to unstable

- - - - -


22 changed files:

- + .travis.yml
- CHANGES.md
- README.md
- augur/__version__.py
- augur/ancestral.py
- augur/data/lat_longs.tsv
- augur/data/schema-auspice-config-v2.json
- augur/data/schema-export-v1-meta.json
- augur/data/schema-export-v1-tree.json
- augur/data/schema-export-v2.json
- augur/filter.py
- augur/tree.py
- debian/changelog
- debian/control
- docs/contribute/DEV_DOCS.md
- tests/builds/zika.t
- tests/functional/filter.t
- tests/functional/tree.t
- + tests/functional/tree/aligned.fasta.xz
- + tests/functional/tree/excluded_sites.txt
- tests/test_align.py
- + tests/test_filter_groupby.py


Changes:

=====================================
.travis.yml
=====================================
@@ -0,0 +1,46 @@
+version: ~> 1.0
+language: generic
+
+# See <https://docs.travis-ci.com/user/build-stages/> for more information on
+# how build stages work.
+stages:
+  - test
+
+  # See <https://docs.travis-ci.com/user/conditions-v1> for more on the "if" syntax.
+  - name: deploy
+    if: branch = release and type != pull_request
+
+jobs:
+  include:
+    - &test
+      stage: test
+      language: python
+      python: 3.6
+      before_install:
+        - wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;
+        - bash miniconda.sh -b -p $HOME/miniconda
+        - export PATH="$HOME/miniconda/bin:$PATH"
+        - hash -r
+        - conda config --set always_yes yes --set changeps1 no
+        - conda update -q conda
+        - conda info -a
+        - conda create -n augur -c bioconda python=$TRAVIS_PYTHON_VERSION mafft raxml fasttree iqtree vcftools pip numpy
+        - source activate augur
+        - pip install biopython==1.67
+      install:
+        - pip install -e .[dev]
+      script:
+        - (pytest -c pytest.python3.ini  --cov-report= --cov=augur)
+        - (cram --shell=/bin/bash tests/functional/*.t tests/builds/*.t)
+        - (bash tests/builds/runner.sh)
+      after_success:
+        # upload to codecov
+        - bash <(curl -s https://codecov.io/bash) -f "!*.gcov" -X gcov -e TRAVIS_PYTHON_VERSION -y ci/codecov.yml|| echo "Codecov did not collect coverage reports"
+
+    - <<: *test
+      python: 3.7
+    - <<: *test
+      python: 3.8
+
+    - stage: deploy
+      script: ./devel/travis-rebuild-docker-image


=====================================
CHANGES.md
=====================================
@@ -3,6 +3,47 @@
 ## __NEXT__
 
 
+## 13.1.0 (10 December 2021)
+
+### Features
+
+* schemas: Add "$id" key to Auspice config schemas so we have a way of referring to these. [#806][] (@tsibley)
+
+### Bug Fixes
+
+* filter: Fix groupby with incomplete dates. [#808][] (@victorlin)
+
+[#806]: https://github.com/nextstrain/augur/pull/806
+[#808]: https://github.com/nextstrain/augur/pull/808
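
Note on the "$id" feature above: a minimal sketch, not taken from augur, of how a schema's "$id" can act as a lookup key. The two schema strings are trimmed stand-ins holding only keys visible in this diff, and `index_schemas_by_id` is a hypothetical helper name.

```python
import json

# Trimmed stand-ins for two of the schema files touched by this update.
# Only keys that appear in the diff are included; the real files are larger.
SCHEMA_TEXTS = [
    '{"$schema": "http://json-schema.org/draft-06/schema#",'
    ' "$id": "https://nextstrain.org/schemas/auspice/config/v2",'
    ' "type": "object"}',
    '{"$schema": "http://json-schema.org/draft-06/schema#",'
    ' "$id": "https://nextstrain.org/schemas/dataset/v2",'
    ' "type": "object"}',
]


def index_schemas_by_id(schema_texts):
    """Hypothetical helper: map each schema's "$id" to its parsed document."""
    registry = {}
    for text in schema_texts:
        schema = json.loads(text)
        registry[schema["$id"]] = schema
    return registry


registry = index_schemas_by_id(SCHEMA_TEXTS)
for schema_id in sorted(registry):
    print(schema_id)
```

With an index like this, one schema can be looked up by its identifier rather than by its file name on disk.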
+
+## 13.0.4 (8 December 2021)
+
+### Bug Fixes
+
+* dependencies: Replace deprecated mutable sequence interface for BioPython. [#788][] (@Carlosbogo)
+* dependencies: Fix backward compatibility with BioPython. [#801][] (@huddlej)
+* data: Add latitude and longitude details for "Reunion". [#791][] (@corneliusroemer)
+* filter: Use pandas functions to determine subsample groups. [#794][] and [#797][] (@victorlin)
+* filter: Add clarity to help message and output of probabilistic sampling. [#792][] (@victorlin)
+
+[#788]: https://github.com/nextstrain/augur/pull/788
+[#791]: https://github.com/nextstrain/augur/pull/791
+[#792]: https://github.com/nextstrain/augur/pull/792
+[#794]: https://github.com/nextstrain/augur/pull/794
+[#797]: https://github.com/nextstrain/augur/pull/797
+[#801]: https://github.com/nextstrain/augur/pull/801
+
+## 13.0.3 (19 November 2021)
+
+### Bug Fixes
+
+* tree: Handle compressed alignment when excluding sites. [#786][] (@huddlej)
+* docs: Fix typos ([ce0834c][]) and clarify exclude sites inputs ([5ad1574][]). (@corneliusroemer)
+
+[#786]: https://github.com/nextstrain/augur/pull/786
+[ce0834c]: https://github.com/nextstrain/augur/commit/ce0834c476abc9ee99785fa930608218b7d78990
+[5ad1574]: https://github.com/nextstrain/augur/commit/5ad157485015623883c6b637d247459f906b63cb
+
 ## 13.0.2 (12 October 2021)
 
 ### Bug Fixes


=====================================
README.md
=====================================
@@ -1,7 +1,7 @@
 [![Build Status](https://travis-ci.com/nextstrain/augur.svg?branch=master)](https://travis-ci.com/nextstrain/augur)
 [![PyPI version](https://badge.fury.io/py/nextstrain-augur.svg)](https://pypi.org/project/nextstrain-augur/)
 [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/augur/README.html)
-[![Documentation Status](https://readthedocs.org/projects/nextstrain-augur/badge/?version=latest)](https://nextstrain-augur.readthedocs.io/en/stable/?badge=latest)
+[![Documentation Status](https://readthedocs.org/projects/nextstrain-augur/badge/?version=latest)](https://docs.nextstrain.org/projects/augur/en/stable/)
 [![License: AGPL v3](https://img.shields.io/badge/License-AGPL%20v3-blue.svg)](https://www.gnu.org/licenses/agpl-3.0)
 [![DOI](https://joss.theoj.org/papers/10.21105/joss.02906/status.svg)](https://doi.org/10.21105/joss.02906)
 
@@ -24,14 +24,14 @@ The output of augur is a series of JSONs that can be used to visualize your resu
 
 ## Quickstart
 
-[Follow instructions to install Augur](https://nextstrain-augur.readthedocs.io/en/stable/installation/installation.html).
+[Follow instructions to install Augur](https://docs.nextstrain.org/projects/augur/en/stable/installation/installation.html).
 Try out an analysis of real virus data by [completing the Zika tutorial](https://nextstrain.org/docs/tutorials/zika).
 
 ## Documentation
 
 * [Overview of how Augur fits together with other Nextstrain tools](https://nextstrain.org/docs/getting-started/introduction#open-source-tools-for-the-community)
 * [Overview of Augur usage](https://nextstrain.org/docs/bioinformatics/introduction-to-augur)
-* [Technical documentation for Augur](https://nextstrain-augur.readthedocs.io/en/stable/installation/installation.html)
+* [Technical documentation for Augur](https://docs.nextstrain.org/projects/augur/en/stable/installation/installation.html)
 * [Contributor guide](https://github.com/nextstrain/.github/blob/master/CONTRIBUTING.md)
 * [Project board with available issues](https://github.com/orgs/nextstrain/projects/6)
 * [Developer docs for Augur](./docs/contribute/DEV_DOCS.md)


=====================================
augur/__version__.py
=====================================
@@ -1,4 +1,4 @@
-__version__ = '13.0.2'
+__version__ = '13.1.0'
 
 
 def is_augur_version_compatible(version):


=====================================
augur/ancestral.py
=====================================
@@ -168,7 +168,7 @@ def run(args):
     else:
         aln = args.alignment
 
-    # Enfore treetime 0.7 or later
+    # Enforce treetime 0.7 or later
     from distutils.version import StrictVersion
     import treetime
     if StrictVersion(treetime.version) < StrictVersion('0.7.0'):


=====================================
augur/data/lat_longs.tsv
=====================================
@@ -187,6 +187,7 @@ country	portugal	39.6945	-8.13057
 country	puerto_rico	18.2459121	-66.4164147
 country	puerto rico	18.2459121	-66.4164147
 country	qatar	25.27932	51.52245
+country	reunion	-21.1151	55.5364
 country	romania	46.0	25.0
 country	russia	64.6863136	97.7453061
 country	rwanda	-2.0	30.0


=====================================
augur/data/schema-auspice-config-v2.json
=====================================
@@ -1,7 +1,7 @@
 {
-    "type" : "object",
-    "version": "v2",
     "$schema": "http://json-schema.org/draft-06/schema#",
+    "$id": "https://nextstrain.org/schemas/auspice/config/v2",
+    "type": "object",
     "title": "Auspice config file to be supplied to `augur export v2`",
     "$comment": "This schema includes deprecated-but-handled-by-augur-export-v1 properties, but their schema definitions are somewhat incomplete",
     "additionalProperties": false,


=====================================
augur/data/schema-export-v1-meta.json
=====================================
@@ -1,7 +1,7 @@
 {
-    "type" : "object",
     "$schema": "http://json-schema.org/draft-06/schema#",
-    "version": "0.1",
+    "$id": "https://nextstrain.org/schemas/dataset/v1/meta",
+    "type": "object",
     "title": "Nextstrain minimal metadata JSON schema",
     "description": "This is the validation schema for the augur produced metadata JSON, for consumption in Auspice. Note that every field is optional, but excluding fields may disable certain features in Auspice.",
     "additionalProperties": true,


=====================================
augur/data/schema-export-v1-tree.json
=====================================
@@ -1,6 +1,7 @@
 {
-    "type" : "object",
     "$schema": "http://json-schema.org/draft-06/schema#",
+    "$id": "https://nextstrain.org/schemas/dataset/v1/tree",
+    "type": "object",
     "title": "Nextstrain tree JSON schema",
     "additionalProperties": false,
     "required": ["attr", "strain"],


=====================================
augur/data/schema-export-v2.json
=====================================
@@ -1,7 +1,7 @@
 {
-    "type" : "object",
     "$schema": "http://json-schema.org/draft-06/schema#",
-    "version": "2.0",
+    "$id": "https://nextstrain.org/schemas/dataset/v2",
+    "type": "object",
     "title": "Nextstrain metadata JSON schema proposal (meta + tree together)",
     "additionalProperties": false,
     "required": ["version", "meta", "tree"],


=====================================
augur/filter.py
=====================================
@@ -884,91 +884,76 @@ def get_groups_for_subsampling(strains, metadata, group_by=None):
     [{'strain': 'strain1', 'filter': 'skip_group_by_with_ambiguous_month', 'kwargs': ''}]
 
     """
-    if group_by:
-        groups = group_by
-    else:
-        groups = ("_dummy",)
-
+    metadata = metadata.loc[strains]
     group_by_strain = {}
     skipped_strains = []
-    for strain in strains:
-        skip_strain = False
-        group = []
-        m = metadata.loc[strain].to_dict()
-        # collect group specifiers
-        for c in groups:
-            if c == "_dummy":
-                group.append(c)
-            elif c in m:
-                group.append(m[c])
-            elif c in ['month', 'year'] and 'date' in m:
-                try:
-                    year = int(m["date"].split('-')[0])
-                except:
+
+    if metadata.empty:
+        return group_by_strain, skipped_strains
+
+    if not group_by or group_by == ('_dummy',):
+        group_by_strain = {strain: ('_dummy',) for strain in strains}
+        return group_by_strain, skipped_strains
+
+    group_by_set = set(group_by)
+
+    # If we could not find any requested categories, we cannot complete subsampling.
+    if 'date' not in metadata and group_by_set <= {'year', 'month'}:
+        raise FilterException(f"The specified group-by categories ({group_by}) were not found. No sequences-per-group sampling will be done. Note that using 'year' or 'year month' requires a column called 'date'.")
+    if not group_by_set & (set(metadata.columns) | {'year', 'month'}):
+        raise FilterException(f"The specified group-by categories ({group_by}) were not found. No sequences-per-group sampling will be done.")
+
+    # date requested
+    if 'year' in group_by_set or 'month' in group_by_set:
+        if 'date' not in metadata:
+            # set year/month/day = unknown
+            print(f"WARNING: A 'date' column could not be found to group-by year or month.", file=sys.stderr)
+            print(f"Filtering by group may behave differently than expected!", file=sys.stderr)
+            df_dates = pd.DataFrame({'year': 'unknown', 'month': 'unknown'}, index=metadata.index)
+            metadata = pd.concat([metadata, df_dates], axis=1)
+        else:
+            # replace date with year/month/day as nullable ints
+            date_cols = ['year', 'month', 'day']
+            df_dates = metadata['date'].str.split('-', n=2, expand=True)
+            df_dates = df_dates.set_axis(date_cols[:len(df_dates.columns)], axis=1)
+            missing_date_cols = set(date_cols) - set(df_dates.columns)
+            for col in missing_date_cols:
+                df_dates[col] = pd.NA
+            for col in date_cols:
+                df_dates[col] = pd.to_numeric(df_dates[col], errors='coerce').astype(pd.Int64Dtype())
+            metadata = pd.concat([metadata.drop('date', axis=1), df_dates], axis=1)
+            if 'year' in group_by_set:
+                # skip ambiguous years
+                df_skip = metadata[metadata['year'].isnull()]
+                metadata.dropna(subset=['year'], inplace=True)
+                for strain in df_skip.index:
                     skipped_strains.append({
                         "strain": strain,
                         "filter": "skip_group_by_with_ambiguous_year",
                         "kwargs": "",
                     })
-                    skip_strain = True
-                    break
-                if c=='month':
-                    try:
-                        month = int(m["date"].split('-')[1])
-                    except:
-                        skipped_strains.append({
-                            "strain": strain,
-                            "filter": "skip_group_by_with_ambiguous_month",
-                            "kwargs": "",
-                        })
-                        skip_strain = True
-                        break
-
-                    group.append((year, month))
-                else:
-                    group.append(year)
-            else:
-                group.append('unknown')
-
-        if not skip_strain:
-            group_by_strain[strain] = tuple(group)
-
-    # If we could not find any requested categories, we cannot complete subsampling.
-    distinct_groups = set(group_by_strain.values())
-    if len(distinct_groups) == 1 and ('unknown' in distinct_groups or ('unknown',) in distinct_groups):
-        error_message = f"The specified group-by categories ({groups}) were not found. No sequences-per-group sampling will be done."
-
-        if any(x in groups for x in ('year', 'month')):
-            error_message += " Note that using 'year' or 'year month' requires a column called 'date'."
-
-        # Raise an exception, since we cannot find the requested groups.
-        raise FilterException(error_message)
-
-    # Check to see if some categories are missing to warn the user
-    group_by = {
-        'date' if cat in ('year', 'month') else cat
-        for cat in groups
-    }
-    missing_cats = [cat for cat in group_by if cat not in metadata.columns.values and cat != "_dummy"]
-    if missing_cats:
-        error_message = []
-
-        if any(cat != 'date' for cat in missing_cats):
-            error_message.append(
-                "Some of the specified group-by categories couldn't be found: %s" % ", ".join([str(cat) for cat in missing_cats if cat != 'date'])
-            )
-
-        if any(cat == 'date' for cat in missing_cats):
-            error_message.append("A 'date' column could not be found to group-by year or month.")
-
-        error_message.append("Filtering by group may behave differently than expected!")
-
-        # Print a warning message, but allow grouping to continue.
-        print(
-            "WARNING: %s" % "\n".join(error_message),
-            file=sys.stderr,
-        )
-
+            if 'month' in group_by_set:
+                # skip ambiguous months
+                df_skip = metadata[metadata['month'].isnull()]
+                metadata.dropna(subset=['month'], inplace=True)
+                for strain in df_skip.index:
+                    skipped_strains.append({
+                        "strain": strain,
+                        "filter": "skip_group_by_with_ambiguous_month",
+                        "kwargs": "",
+                    })
+                # month = (year, month)
+                metadata['month'] = list(zip(metadata['year'], metadata['month']))
+            # TODO: support group by day
+
+    unknown_groups = group_by_set - set(metadata.columns)
+    if unknown_groups:
+        print(f"WARNING: Some of the specified group-by categories couldn't be found: {', '.join(unknown_groups)}", file=sys.stderr)
+        print("Filtering by group may behave differently than expected!", file=sys.stderr)
+        for group in unknown_groups:
+            metadata[group] = 'unknown'
+
+    group_by_strain = dict(zip(metadata.index, metadata[group_by].apply(tuple, axis=1)))
     return group_by_strain, skipped_strains
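
Note on the hunk above: the new code splits each date into year, month and day, treats non-numeric parts as missing, and skips strains whose year or month is ambiguous. The following is a plain-Python analogue of that idea, assuming grouping by month only; it is not the pandas implementation in the diff, and `split_date` and `group_by_year_month` are hypothetical names.

```python
def split_date(date):
    """Split 'YYYY-MM-DD' into (year, month, day).

    Parts that are missing or not numeric (for example 'XX') become None.
    """
    parts = date.split("-", 2)
    parts += [None] * (3 - len(parts))
    result = []
    for part in parts:
        try:
            result.append(int(part))
        except (TypeError, ValueError):
            result.append(None)
    return tuple(result)


def group_by_year_month(dates_by_strain):
    """Group strains by (year, month); report strains with ambiguous dates."""
    groups, skipped = {}, []
    for strain, date in dates_by_strain.items():
        year, month, _day = split_date(date)
        if year is None:
            reason = "skip_group_by_with_ambiguous_year"
        elif month is None:
            reason = "skip_group_by_with_ambiguous_month"
        else:
            # As in the diff, the month group is the pair (year, month).
            groups[strain] = ((year, month),)
            continue
        skipped.append({"strain": strain, "filter": reason, "kwargs": ""})
    return groups, skipped


groups, skipped = group_by_year_month({
    "SEQ_1": "2020-01-XX",   # ambiguous day only: still grouped
    "SEQ_2": "2020-XX-XX",   # ambiguous month: skipped
    "SEQ_3": "20XX-01-01",   # ambiguous year: skipped
})
print(groups)
print([entry["filter"] for entry in skipped])
```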
 
 
@@ -1143,7 +1128,7 @@ def register_arguments(parser):
     subsample_limits_group.add_argument('--sequences-per-group', type=int, help="subsample to no more than this number of sequences per category")
     subsample_limits_group.add_argument('--subsample-max-sequences', type=int, help="subsample to no more than this number of sequences; can be used without the group_by argument")
     probabilistic_sampling_group = subsample_group.add_mutually_exclusive_group()
-    probabilistic_sampling_group.add_argument('--probabilistic-sampling', action='store_true', help="Enable probabilistic sampling during subsampling. This is useful when there are more groups than requested sequences. This option only applies when `--subsample-max-sequences` is provided.")
+    probabilistic_sampling_group.add_argument('--probabilistic-sampling', action='store_true', help="Allow probabilistic sampling during subsampling. This is useful when there are more groups than requested sequences. This option only applies when `--subsample-max-sequences` is provided.")
     probabilistic_sampling_group.add_argument('--no-probabilistic-sampling', action='store_false', dest='probabilistic_sampling')
     subsample_group.add_argument('--priority', type=str, help="""tab-delimited file with list of priority scores for strains (e.g., "<strain>\\t<priority>") and no header.
     When scores are provided, Augur converts scores to floating point values, sorts strains within each subsampling group from highest to lowest priority, and selects the top N strains per group where N is the calculated or requested number of strains per group.
@@ -1494,16 +1479,20 @@ def run(args):
         # sequences requested, sequences per group will be a floating point
         # value and subsampling will be probabilistic.
         try:
-            sequences_per_group = calculate_sequences_per_group(
+            sequences_per_group, probabilistic_used = calculate_sequences_per_group(
                 args.subsample_max_sequences,
                 records_per_group.values(),
                 args.probabilistic_sampling,
             )
-            print(f"Sampling at {sequences_per_group} per group.")
         except TooManyGroupsError as error:
             print(f"ERROR: {error}", file=sys.stderr)
             sys.exit(1)
 
+        if (probabilistic_used):
+            print(f"Sampling probabilistically at {sequences_per_group:0.4f} sequences per group, meaning it is possible to have more than the requested maximum of {args.subsample_max_sequences} sequences after filtering.")
+        else:
+            print(f"Sampling at {sequences_per_group} per group.")
+
         if queues_by_group is None:
             # We know all of the possible groups now from the first pass through
             # the metadata, so we can create queues for all groups at once.
@@ -1711,7 +1700,7 @@ def numeric_date(date):
         return treetime.utils.numeric_date(datetime.date(*map(int, date.split("-", 2))))
 
 
-def calculate_sequences_per_group(target_max_value, counts_per_group, probabilistic=True):
+def calculate_sequences_per_group(target_max_value, counts_per_group, allow_probabilistic=True):
     """Calculate the number of sequences per group for a given maximum number of
     sequences to be returned and the number of sequences in each requested
     group. Optionally, allow the result to be probabilistic such that the mean
@@ -1725,7 +1714,7 @@ def calculate_sequences_per_group(target_max_value, counts_per_group, probabilis
         number of sequences per group for the given counts per group.
     counts_per_group : list[int]
         A list with the number of sequences in each requested group.
-    probabilistic : bool
+    allow_probabilistic : bool
         Whether to allow probabilistic subsampling when the number of groups
         exceeds the requested maximum.
 
@@ -1733,29 +1722,35 @@ def calculate_sequences_per_group(target_max_value, counts_per_group, probabilis
     ------
     TooManyGroupsError :
         When there are more groups than sequences per group and probabilistic
-        subsampling is not enabled.
+        subsampling is not allowed.
 
     Returns
     -------
     int or float :
         Number of sequences per group.
+    bool :
+        Whether probabilistic subsampling was used.
 
     """
+    probabilistic_used = False
+
     try:
         sequences_per_group = _calculate_sequences_per_group(
             target_max_value,
             counts_per_group,
         )
     except TooManyGroupsError as error:
-        if probabilistic:
+        if allow_probabilistic:
+            print(f"WARNING: {error}")
             sequences_per_group = _calculate_fractional_sequences_per_group(
                 target_max_value,
                 counts_per_group,
             )
+            probabilistic_used = True
         else:
             raise error
 
-    return sequences_per_group
+    return sequences_per_group, probabilistic_used
 
 
 class TooManyGroupsError(ValueError):
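
Note on the hunk above: `calculate_sequences_per_group` now returns a pair, the per-group value and a flag saying whether the probabilistic fallback was taken. A minimal sketch of that control flow follows. The even split used here is an assumption made for illustration: the private helpers are not shown in this diff, and the functional test reports 0.3633 for 3 sequences over 8 groups rather than 3/8, so the real calculation differs. The name `sequences_per_group_sketch` is hypothetical, and at least one group is assumed.

```python
class TooManyGroupsError(ValueError):
    """Raised when there are more groups than requested sequences."""


def sequences_per_group_sketch(target_max_value, counts_per_group,
                               allow_probabilistic=True):
    """Return (sequences_per_group, probabilistic_used).

    Simplified stand-in: the budget is split evenly across groups,
    which is NOT how augur's own helpers compute the value.
    """
    n_groups = len(list(counts_per_group))
    probabilistic_used = False
    try:
        if n_groups > target_max_value:
            raise TooManyGroupsError(
                f"Asked to provide at most {target_max_value} sequences, "
                f"but there are {n_groups} groups."
            )
        per_group = target_max_value // n_groups
    except TooManyGroupsError:
        if not allow_probabilistic:
            raise
        # Fall back to a fractional value and record that we did so.
        per_group = target_max_value / n_groups
        probabilistic_used = True
    return per_group, probabilistic_used


print(sequences_per_group_sketch(10, [1] * 8))
print(sequences_per_group_sketch(3, [1] * 8))
```

Returning the flag lets the caller choose which message to print, as `run` does in the hunk above, instead of inferring it from whether the value is a float.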


=====================================
augur/tree.py
=====================================
@@ -8,11 +8,13 @@ import sys
 import time
 import uuid
 import Bio
+from Bio.Seq import MutableSeq
 from Bio import Phylo
 import numpy as np
 from treetime.vcf_utils import read_vcf
 from pathlib import Path
 
+from .io import read_sequences
 from .utils import run_shell_command, nthreads_value, shquote, load_mask_sites
 
 def find_executable(names, default = None):
@@ -315,7 +317,7 @@ def mask_sites_in_multiple_sequence_alignment(alignment_file, excluded_sites_fil
 
     # Load alignment as FASTA generator to prevent loading the whole alignment
     # into memory.
-    alignment = Bio.SeqIO.parse(alignment_file, "fasta")
+    alignment = read_sequences(alignment_file)
 
     # Write the masked alignment to disk one record at a time.
     alignment_file_path = Path(alignment_file)
@@ -323,7 +325,7 @@ def mask_sites_in_multiple_sequence_alignment(alignment_file, excluded_sites_fil
     with open(masked_alignment_file, "w", encoding='utf-8') as oh:
         for record in alignment:
             # Convert to a mutable sequence to enable masking with Ns.
-            sequence = record.seq.tomutable()
+            sequence = MutableSeq(str(record.seq))
 
             # Replace all excluded sites with Ns.
             for site in excluded_sites:
@@ -345,7 +347,7 @@ def register_arguments(parser):
     parser.add_argument('--nthreads', type=nthreads_value, default=1,
                                 help="number of threads to use; specifying the value 'auto' will cause the number of available CPU cores on your system, if determinable, to be used")
     parser.add_argument('--vcf-reference', type=str, help='fasta file of the sequence the VCF was mapped to')
-    parser.add_argument('--exclude-sites', type=str, help='file name of one-based sites to exclude for raw tree building (BED format in .bed files, DRM format in tab-delimited files, or one position per line)')
+    parser.add_argument('--exclude-sites', type=str, help='file name of one-based sites to exclude for raw tree building (BED format in .bed files, second column in tab-delimited files, or one position per line)')
     parser.add_argument('--tree-builder-args', type=str, default='', help='extra arguments to be passed directly to the executable of the requested tree method (e.g., --tree-builder-args="-czb")')
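
Note on the tree.py hunks above: the change reads the alignment through `read_sequences` so that compressed input works, and builds the mutable sequence with `MutableSeq(str(record.seq))`. The sketch below shows the same two steps using only the standard library as stand-ins (`lzma` for the compressed read, a list of characters for the mutable sequence). It assumes the excluded sites are one-based positions; `read_fasta_xz` and `mask_sites` are hypothetical names, not augur functions.

```python
import lzma
import os
import tempfile


def mask_sites(sequence, one_based_sites):
    """Replace the given one-based positions with N.

    Positions beyond the end of the sequence are ignored.
    """
    bases = list(sequence)  # a list is mutable, unlike str
    for site in one_based_sites:
        index = site - 1
        if 0 <= index < len(bases):
            bases[index] = "N"
    return "".join(bases)


def read_fasta_xz(path):
    """Yield (name, sequence) pairs from an xz-compressed FASTA file."""
    name, chunks = None, []
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Write a tiny compressed alignment, then mask sites 2 and 5 in each record.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "aligned.fasta.xz")
    with lzma.open(path, "wt", encoding="utf-8") as handle:
        handle.write(">seq1\nACGTACGT\n>seq2\nTTTTCCCC\n")
    masked = {name: mask_sites(seq, [2, 5]) for name, seq in read_fasta_xz(path)}

print(masked)
```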
 
 


=====================================
debian/changelog
=====================================
@@ -1,3 +1,9 @@
+augur (13.1.0-1) unstable; urgency=medium
+
+  * New upstream version
+
+ -- Andreas Tille <tille at debian.org>  Sun, 16 Jan 2022 08:15:28 +0100
+
 augur (13.0.2-2) unstable; urgency=medium
 
   * d/control: Depend on python3-boto3 instead of python3-boto


=====================================
debian/control
=====================================
@@ -1,6 +1,7 @@
 Source: augur
 Maintainer: Debian Med Packaging Team <debian-med-packaging at lists.alioth.debian.org>
-Uploaders: Andreas Tille <tille at debian.org>, Nilesh Patra <nilesh at debian.org>
+Uploaders: Andreas Tille <tille at debian.org>,
+           Nilesh Patra <nilesh at debian.org>
 Section: science
 Priority: optional
 Build-Depends: debhelper-compat (= 13),
@@ -33,11 +34,11 @@ Depends: ${python3:Depends},
          python3-cvxopt,
          python3-dendropy,
          python3-jsonschema,
-         python3-matplotlib (>= 1.5.1),
+         python3-matplotlib,
          python3-packaging,
-         python3-pandas (>= 0.16.2),
+         python3-pandas,
          python3-schedule,
-         python3-seaborn (>= 0.6.0),
+         python3-seaborn,
          python3-treetime,
          seqmagick,
          python3-ipdb,


=====================================
docs/contribute/DEV_DOCS.md
=====================================
@@ -230,7 +230,7 @@ It can sometimes be useful to verify the config is parsed as you expect using
 
 ## Contributing documentation
 
-[Documentation](https://nextstrain-augur.readthedocs.io) is built using [Sphinx](http://sphinx-doc.org/) and hosted on [Read The Docs](https://readthedocs.org/).
+[Documentation](https://docs.nextstrain.org/projects/augur) is built using [Sphinx](http://sphinx-doc.org/) and hosted on [Read The Docs](https://readthedocs.org/).
 Versions of the documentation for each augur release and git branch are available and preserved.
 Read The Docs is updated automatically from commits and releases on GitHub.
 


=====================================
tests/builds/zika.t
=====================================
@@ -54,7 +54,7 @@ Align filtered sequences to a specific reference sequence and fill any gaps.
   >  --output "$TMP/out/aligned.fasta" \
   >  --fill-gaps > /dev/null
 
-  $ diff -u "results/aligned.fasta" "$TMP/out/aligned.fasta"
+  $ diff --ignore-matching-lines=".*KX369547.1.*" -u "results/aligned.fasta" "$TMP/out/aligned.fasta"
 
 Build a tree from the multiple sequence alignment.
 


=====================================
tests/functional/filter.t
=====================================
@@ -119,6 +119,39 @@ By setting the subsample seed above, we should get the same results for both run
   $ diff -u <(sort "$TMP/filtered_strains_probabilistic.txt") <(sort "$TMP/filtered_strains_default.txt")
   $ rm -f "$TMP/filtered_strains_probabilistic.txt" "$TMP/filtered_strains_default.txt"
 
+Check output of probabilistic sampling.
+
+  $ ${AUGUR} filter \
+  >  --metadata filter/metadata.tsv \
+  >  --group-by region year month \
+  >  --subsample-max-sequences 3 \
+  >  --probabilistic-sampling \
+  >  --subsample-seed 314159 \
+  >  --output-metadata "$TMP/filtered_metadata.tsv"
+  WARNING: Asked to provide at most 3 sequences, but there are 8 groups.
+  Sampling probabilistically at 0.3633 sequences per group, meaning it is possible to have more than the requested maximum of 3 sequences after filtering.
+  10 strains were dropped during filtering
+  \t1 were dropped during grouping due to ambiguous year information (esc)
+  \t1 were dropped during grouping due to ambiguous month information (esc)
+  \t8 of these were dropped because of subsampling criteria, using seed 314159 (esc)
+  2 strains passed all filters
+
+Ensure probabilistic sampling is not used when unnecessary.
+
+  $ ${AUGUR} filter \
+  >  --metadata filter/metadata.tsv \
+  >  --group-by region year month \
+  >  --subsample-max-sequences 10 \
+  >  --probabilistic-sampling \
+  >  --subsample-seed 314159 \
+  >  --output-metadata "$TMP/filtered_metadata.tsv"
+  Sampling at 10 per group.
+  2 strains were dropped during filtering
+  \t1 were dropped during grouping due to ambiguous year information (esc)
+  \t1 were dropped during grouping due to ambiguous month information (esc)
+  \t0 of these were dropped because of subsampling criteria, using seed 314159 (esc)
+  10 strains passed all filters
+
 Filter using only metadata without sequence input or output and save results as filtered metadata.
 
   $ ${AUGUR} filter \


=====================================
tests/functional/tree.t
=====================================
@@ -29,6 +29,13 @@ Try building a tree with IQ-TREE using its ModelTest functionality, by supplying
   >  --output "$TMP/tree_raw.nwk" \
   >  --nthreads 1 > /dev/null
 
+Build a tree with excluded sites using a compressed input file.
+
+  $ ${AUGUR} tree \
+  >  --alignment tree/aligned.fasta.xz \
+  >  --exclude-sites tree/excluded_sites.txt \
+  >  --output "$TMP/tree_raw.nwk" &> /dev/null
+
 Clean up tree log files.
 
   $ rm -f tree/*.log


=====================================
tests/functional/tree/aligned.fasta.xz
=====================================
Binary files /dev/null and b/tests/functional/tree/aligned.fasta.xz differ


=====================================
tests/functional/tree/excluded_sites.txt
=====================================
@@ -0,0 +1,2 @@
+1000
+2000


=====================================
tests/test_align.py
=====================================
@@ -10,6 +10,7 @@ from shlex import quote
 from Bio import SeqIO
 from Bio.Align import MultipleSeqAlignment
 from Bio.Seq import Seq
+from Bio.Seq import MutableSeq
 from Bio.SeqRecord import SeqRecord
 
 from augur import align
@@ -227,13 +228,13 @@ class TestAlign:
         
     def test_read_sequences(self):
         data_file = pathlib.Path('tests/data/align/test_aligned_sequences.fasta')
-        result = align.read_sequences(data_file)
+        result = align.read_sequences(str(data_file))
         assert len(result) == 4
 
     def test_read_seq_compare(self):
         data_file = pathlib.Path("tests/data/align/aa-seq_h3n2_ha_2y_2HA1_dup.fasta")
         with pytest.raises(align.AlignmentError):
-            assert align.read_sequences(data_file)
+            assert align.read_sequences(str(data_file))
 
     def test_prepare_no_alignment_or_ref(self, test_file, test_seqs, out_file):
         _, output, _ = align.prepare([test_file,], None, out_file, None, None)
@@ -366,7 +367,7 @@ class TestAlign:
     def test_postprocess_strip_non_reference(self, tmpdir, ref_seq, ref_file):
         """Postprocess should strip gaps in the reference sequence from other sequences, but not gaps in those sequences"""
         expected_length = len(ref_seq.seq) - ref_seq.seq.count("-")
-        gapped_seq = ref_seq.seq.tomutable()
+        gapped_seq = MutableSeq(str(ref_seq.seq))
         gapped_seq[1] = "-"
         gapped = SeqRecord(gapped_seq, "GAP")
         gap_file = write_strains(tmpdir, "gaps", [ref_seq, gapped])


=====================================
tests/test_filter_groupby.py
=====================================
@@ -0,0 +1,216 @@
+import pytest
+import pandas as pd
+from augur.filter import get_groups_for_subsampling, FilterException
+
+@pytest.fixture
+def valid_metadata() -> pd.DataFrame:
+    columns = ['strain', 'date', 'country']
+    data = [
+        ("SEQ_1","2020-01-XX","A"),
+        ("SEQ_2","2020-02-01","A"),
+        ("SEQ_3","2020-03-01","B"),
+        ("SEQ_4","2020-04-01","B"),
+        ("SEQ_5","2020-05-01","B")
+    ]
+    return pd.DataFrame.from_records(data, columns=columns).set_index('strain')
+
+class TestFilterGroupBy:
+    def test_filter_groupby_strain_subset(self, valid_metadata: pd.DataFrame):
+        metadata = valid_metadata.copy()
+        strains = ['SEQ_1', 'SEQ_3', 'SEQ_5']
+        group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata)
+        assert group_by_strain == {
+            'SEQ_1': ('_dummy',),
+            'SEQ_3': ('_dummy',),
+            'SEQ_5': ('_dummy',)
+        }
+        assert skipped_strains == []
+
+    def test_filter_groupby_dummy(self, valid_metadata: pd.DataFrame):
+        metadata = valid_metadata.copy()
+        strains = metadata.index.tolist()
+        group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata)
+        assert group_by_strain == {
+            'SEQ_1': ('_dummy',),
+            'SEQ_2': ('_dummy',),
+            'SEQ_3': ('_dummy',),
+            'SEQ_4': ('_dummy',),
+            'SEQ_5': ('_dummy',)
+        }
+        assert skipped_strains == []
+
+    def test_filter_groupby_invalid_error(self, valid_metadata: pd.DataFrame):
+        groups = ['invalid']
+        metadata = valid_metadata.copy()
+        strains = metadata.index.tolist()
+        with pytest.raises(FilterException) as e_info:
+            get_groups_for_subsampling(strains, metadata, group_by=groups)
+        assert str(e_info.value) == "The specified group-by categories (['invalid']) were not found. No sequences-per-group sampling will be done."
+
+    def test_filter_groupby_invalid_warn(self, valid_metadata: pd.DataFrame, capsys):
+        groups = ['country', 'year', 'month', 'invalid']
+        metadata = valid_metadata.copy()
+        strains = metadata.index.tolist()
+        group_by_strain, _ = get_groups_for_subsampling(strains, metadata, group_by=groups)
+        assert group_by_strain == {
+            'SEQ_1': ('A', 2020, (2020, 1), 'unknown'),
+            'SEQ_2': ('A', 2020, (2020, 2), 'unknown'),
+            'SEQ_3': ('B', 2020, (2020, 3), 'unknown'),
+            'SEQ_4': ('B', 2020, (2020, 4), 'unknown'),
+            'SEQ_5': ('B', 2020, (2020, 5), 'unknown')
+        }
+        captured = capsys.readouterr()
+        assert captured.err == "WARNING: Some of the specified group-by categories couldn't be found: invalid\nFiltering by group may behave differently than expected!\n"
+
+    def test_filter_groupby_skip_ambiguous_year(self, valid_metadata: pd.DataFrame):
+        groups = ['country', 'year', 'month']
+        metadata = valid_metadata.copy()
+        metadata.at["SEQ_2", "date"] = "XXXX-02-01"
+        strains = metadata.index.tolist()
+        group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, group_by=groups)
+        assert group_by_strain == {
+            'SEQ_1': ('A', 2020, (2020, 1)),
+            'SEQ_3': ('B', 2020, (2020, 3)),
+            'SEQ_4': ('B', 2020, (2020, 4)),
+            'SEQ_5': ('B', 2020, (2020, 5))
+        }
+        assert skipped_strains == [{'strain': 'SEQ_2', 'filter': 'skip_group_by_with_ambiguous_year', 'kwargs': ''}]
+
+    def test_filter_groupby_skip_missing_date(self, valid_metadata: pd.DataFrame):
+        groups = ['country', 'year', 'month']
+        metadata = valid_metadata.copy()
+        metadata.at["SEQ_2", "date"] = None
+        strains = metadata.index.tolist()
+        group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, group_by=groups)
+        assert group_by_strain == {
+            'SEQ_1': ('A', 2020, (2020, 1)),
+            'SEQ_3': ('B', 2020, (2020, 3)),
+            'SEQ_4': ('B', 2020, (2020, 4)),
+            'SEQ_5': ('B', 2020, (2020, 5))
+        }
+        assert skipped_strains == [{'strain': 'SEQ_2', 'filter': 'skip_group_by_with_ambiguous_year', 'kwargs': ''}]
+
+    def test_filter_groupby_skip_ambiguous_month(self, valid_metadata: pd.DataFrame):
+        groups = ['country', 'year', 'month']
+        metadata = valid_metadata.copy()
+        metadata.at["SEQ_2", "date"] = "2020-XX-01"
+        strains = metadata.index.tolist()
+        group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, group_by=groups)
+        assert group_by_strain == {
+            'SEQ_1': ('A', 2020, (2020, 1)),
+            'SEQ_3': ('B', 2020, (2020, 3)),
+            'SEQ_4': ('B', 2020, (2020, 4)),
+            'SEQ_5': ('B', 2020, (2020, 5))
+        }
+        assert skipped_strains == [{'strain': 'SEQ_2', 'filter': 'skip_group_by_with_ambiguous_month', 'kwargs': ''}]
+
+    def test_filter_groupby_skip_missing_month(self, valid_metadata: pd.DataFrame):
+        groups = ['country', 'year', 'month']
+        metadata = valid_metadata.copy()
+        metadata.at["SEQ_2", "date"] = "2020"
+        strains = metadata.index.tolist()
+        group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, group_by=groups)
+        assert group_by_strain == {
+            'SEQ_1': ('A', 2020, (2020, 1)),
+            'SEQ_3': ('B', 2020, (2020, 3)),
+            'SEQ_4': ('B', 2020, (2020, 4)),
+            'SEQ_5': ('B', 2020, (2020, 5))
+        }
+        assert skipped_strains == [{'strain': 'SEQ_2', 'filter': 'skip_group_by_with_ambiguous_month', 'kwargs': ''}]
+
+    def test_filter_groupby_missing_year_error(self, valid_metadata: pd.DataFrame):
+        groups = ['year']
+        metadata = valid_metadata.copy()
+        metadata = metadata.drop('date', axis='columns')
+        strains = metadata.index.tolist()
+        with pytest.raises(FilterException) as e_info:
+            get_groups_for_subsampling(strains, metadata, group_by=groups)
+        assert str(e_info.value) == "The specified group-by categories (['year']) were not found. No sequences-per-group sampling will be done. Note that using 'year' or 'year month' requires a column called 'date'."
+
+    def test_filter_groupby_missing_month_error(self, valid_metadata: pd.DataFrame):
+        groups = ['month']
+        metadata = valid_metadata.copy()
+        metadata = metadata.drop('date', axis='columns')
+        strains = metadata.index.tolist()
+        with pytest.raises(FilterException) as e_info:
+            get_groups_for_subsampling(strains, metadata, group_by=groups)
+        assert str(e_info.value) == "The specified group-by categories (['month']) were not found. No sequences-per-group sampling will be done. Note that using 'year' or 'year month' requires a column called 'date'."
+
+    def test_filter_groupby_missing_year_and_month_error(self, valid_metadata: pd.DataFrame):
+        groups = ['year', 'month']
+        metadata = valid_metadata.copy()
+        metadata = metadata.drop('date', axis='columns')
+        strains = metadata.index.tolist()
+        with pytest.raises(FilterException) as e_info:
+            get_groups_for_subsampling(strains, metadata, group_by=groups)
+        assert str(e_info.value) == "The specified group-by categories (['year', 'month']) were not found. No sequences-per-group sampling will be done. Note that using 'year' or 'year month' requires a column called 'date'."
+
+    def test_filter_groupby_missing_date_warn(self, valid_metadata: pd.DataFrame, capsys):
+        groups = ['country', 'year', 'month']
+        metadata = valid_metadata.copy()
+        metadata = metadata.drop('date', axis='columns')
+        strains = metadata.index.tolist()
+        group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, group_by=groups)
+        assert group_by_strain == {
+            'SEQ_1': ('A', 'unknown', 'unknown'),
+            'SEQ_2': ('A', 'unknown', 'unknown'),
+            'SEQ_3': ('B', 'unknown', 'unknown'),
+            'SEQ_4': ('B', 'unknown', 'unknown'),
+            'SEQ_5': ('B', 'unknown', 'unknown')
+        }
+        captured = capsys.readouterr()
+        assert captured.err == "WARNING: A 'date' column could not be found to group-by year or month.\nFiltering by group may behave differently than expected!\n"
+        assert skipped_strains == []
+
+    def test_filter_groupby_no_strains(self, valid_metadata: pd.DataFrame):
+        groups = ['country', 'year', 'month']
+        metadata = valid_metadata.copy()
+        strains = []
+        group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, group_by=groups)
+        assert group_by_strain == {}
+        assert skipped_strains == []
+
+    def test_filter_groupby_only_year_provided(self, valid_metadata: pd.DataFrame):
+        groups = ['country', 'year']
+        metadata = valid_metadata.copy()
+        metadata['date'] = '2020'
+        strains = metadata.index.tolist()
+        group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, group_by=groups)
+        assert group_by_strain == {
+            'SEQ_1': ('A', 2020),
+            'SEQ_2': ('A', 2020),
+            'SEQ_3': ('B', 2020),
+            'SEQ_4': ('B', 2020),
+            'SEQ_5': ('B', 2020)
+        }
+        assert skipped_strains == []
+
+    def test_filter_groupby_month_with_only_year_provided(self, valid_metadata: pd.DataFrame):
+        groups = ['country', 'year', 'month']
+        metadata = valid_metadata.copy()
+        metadata['date'] = '2020'
+        strains = metadata.index.tolist()
+        group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, group_by=groups)
+        assert group_by_strain == {}
+        assert skipped_strains == [
+            {'strain': 'SEQ_1', 'filter': 'skip_group_by_with_ambiguous_month', 'kwargs': ''},
+            {'strain': 'SEQ_2', 'filter': 'skip_group_by_with_ambiguous_month', 'kwargs': ''},
+            {'strain': 'SEQ_3', 'filter': 'skip_group_by_with_ambiguous_month', 'kwargs': ''},
+            {'strain': 'SEQ_4', 'filter': 'skip_group_by_with_ambiguous_month', 'kwargs': ''},
+            {'strain': 'SEQ_5', 'filter': 'skip_group_by_with_ambiguous_month', 'kwargs': ''}
+        ]
+
+    def test_filter_groupby_only_year_month_provided(self, valid_metadata: pd.DataFrame):
+        groups = ['country', 'year', 'month']
+        metadata = valid_metadata.copy()
+        metadata['date'] = '2020-01'
+        strains = metadata.index.tolist()
+        group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, group_by=groups)
+        assert group_by_strain == {
+            'SEQ_1': ('A', 2020, (2020, 1)),
+            'SEQ_2': ('A', 2020, (2020, 1)),
+            'SEQ_3': ('B', 2020, (2020, 1)),
+            'SEQ_4': ('B', 2020, (2020, 1)),
+            'SEQ_5': ('B', 2020, (2020, 1))
+        }
+        assert skipped_strains == []
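
Note on the date handling exercised by the tests above: the expected values suggest that a strain is skipped when its year is ambiguous or missing, skipped when only its month is ambiguous or missing, and otherwise grouped by year and by a (year, month) tuple. The following is a minimal standalone sketch of that behaviour, not augur's implementation. The helper names parse_year_month and sketch_date_groups are hypothetical, and the sketch ignores the other group-by columns (such as 'country') that get_groups_for_subsampling also handles.

```python
from typing import Dict, List, Optional, Tuple

YearMonth = Tuple[int, int]


def parse_year_month(date: Optional[str]) -> Tuple[Optional[int], Optional[YearMonth]]:
    """Hypothetical helper: parse 'YYYY[-MM[-DD]]', returning None for ambiguous parts."""
    parts = (date or "").split("-")
    # A non-numeric year such as 'XXXX' (or a missing date) is treated as ambiguous.
    year = int(parts[0]) if parts[0].isdigit() else None
    month = None
    if year is not None and len(parts) > 1 and parts[1].isdigit():
        month = (year, int(parts[1]))
    return year, month


def sketch_date_groups(dates: Dict[str, Optional[str]]):
    """Hypothetical sketch: group strains by (year, (year, month)), recording skipped strains."""
    groups: Dict[str, Tuple[int, YearMonth]] = {}
    skipped: List[dict] = []
    for strain, date in dates.items():
        year, month = parse_year_month(date)
        if year is None:
            reason = "skip_group_by_with_ambiguous_year"
        elif month is None:
            reason = "skip_group_by_with_ambiguous_month"
        else:
            groups[strain] = (year, month)
            continue
        # Same record shape as the skipped_strains entries asserted in the tests.
        skipped.append({"strain": strain, "filter": reason, "kwargs": ""})
    return groups, skipped


example_groups, example_skipped = sketch_date_groups({
    "SEQ_1": "2020-01-XX",  # ambiguous day only: still grouped
    "SEQ_2": "XXXX-02-01",  # ambiguous year: skipped
    "SEQ_3": "2020",        # no month: skipped
    "SEQ_4": None,          # missing date: skipped as ambiguous year
})
```

An ambiguous day does not prevent grouping, since only the year and month components feed the group key.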



View it on GitLab: https://salsa.debian.org/med-team/augur/-/compare/839b08377c59d906c81d9b3d9059a0346e287ce1...f8d595fac24a28248d8f3d5647c970c64f46cd2b
